PCA + K-Means Clustering: Iris Dataset¶
Author: @TodoEconometria | Professor: Juan Marcelo Gutierrez Miranda
Table of Contents¶
- Introduction
- The Iris Dataset: A Machine Learning Classic
- Why Combine PCA + Clustering
- Principal Component Analysis (PCA)
- K-Means Clustering
- Interpretation of Results
- Conclusions and Recommendations
1. Introduction¶
This document presents a complete analysis of the famous Iris dataset combining two fundamental unsupervised Machine Learning techniques:
- PCA (Principal Component Analysis): Dimensionality reduction
- K-Means Clustering: Observation grouping
Analysis Objectives¶
- Reduce the 4 original dimensions to 2 principal dimensions
- Identify natural groups in the data (flower species)
- Visualize patterns and relationships in a 2D space
- Validate whether unsupervised clustering can discover the 3 known species
2. The Iris Dataset: A Machine Learning Classic¶
History and Context¶
The Iris dataset was introduced by Ronald Fisher in 1936 in his seminal paper:
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
It is one of the most widely used datasets in:
- Machine Learning education
- Classification algorithm validation
- Data visualization examples
Dataset Description¶
| Feature | Description |
|---|---|
| Observations | 150 flowers |
| Species | 3 (Setosa, Versicolor, Virginica) |
| Variables | 4 measurements in centimeters |
| Distribution | 50 flowers per species (balanced) |
Measured Variables¶
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
BOTANICAL NOTE: The sepal is the green part that protects the flower before it opens. The petal is the colorful part of the flower.
Exploratory Data Analysis (EDA)¶
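A minimal EDA sketch, assuming pandas, seaborn, and matplotlib are available (the species column is introduced here for readability; this reproduces the kind of summaries and pairwise plots this section refers to):

```python
# Minimal EDA sketch for the Iris dataset.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")          # 150 rows x 4 measurements
df["species"] = iris.target_names[iris.target]  # readable species labels

print(df["species"].value_counts())             # 50 flowers per species (balanced)
print(df.groupby("species").mean().round(2))    # per-species means of the 4 variables

sns.pairplot(df, hue="species")                 # pairwise scatter plots by species
plt.show()
```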

Why Is This Dataset Important?¶
- Manageable Size: 150 observations are sufficient for learning without being overwhelming
- Well Balanced: 50 flowers of each species (no class imbalance)
- Separability: One species (Setosa) is linearly separable, the other two slightly overlap
- Multivariate: 4 variables allow practicing dimensionality reduction techniques
3. Why Combine PCA + Clustering¶
The Dimensionality Problem¶
When we have more than 3 dimensions, it is impossible to visualize the data directly:
- 1D: Line (easy)
- 2D: Plane (easy)
- 3D: 3D space (possible but difficult)
- 4D+: Impossible to visualize
The Solution: PCA + Clustering¶
Original Data (4D)
↓
PCA (Reduction)
↓
Reduced Data (2D) ← Now we can VISUALIZE
↓
K-Means (Grouping)
↓
Identified Clusters
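A compact sketch of this flow in Python (scikit-learn is cited in the references; the variable names are illustrative):

```python
# 4D data -> standardize -> PCA to 2D -> K-Means on the 2D scores.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                              # original data: 150 x 4
X_std = StandardScaler().fit_transform(X)         # standardize (PCA is scale-sensitive)
X_2d = PCA(n_components=2).fit_transform(X_std)   # reduced data: 150 x 2, now plottable
labels = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(X_2d) # identified clusters (0, 1, 2)
```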
Advantages of This Combination¶
| Advantage | Explanation |
|---|---|
| Visualization | PCA reduces to 2D for plotting |
| Noise Reduction | PCA removes non-informative variance |
| Better Clustering | Distances are more informative in low dimensions, so K-Means tends to perform better after reduction |
| Interpretability | We can see and understand clusters in 2D |
4. Principal Component Analysis (PCA)¶
What is PCA?¶
PCA is a technique that:
- Finds the directions of maximum variance in the data
- Projects the data onto those directions (principal components)
- Reduces dimensionality while retaining the most information possible
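A sketch reproducing the kind of explained-variance table shown next (computed on standardized data; exact decimals may differ slightly from FactoMineR-style output):

```python
# Eigenvalues and explained variance of PCA on the standardized Iris data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_std)

cumulative = 0.0
for i, (ev, ratio) in enumerate(zip(pca.explained_variance_,
                                    pca.explained_variance_ratio_), 1):
    cumulative += ratio
    print(f"Dim.{i}: eigenvalue={ev:.2f}  variance={ratio:.1%}  cumulative={cumulative:.1%}")
```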
PCA Results on Iris¶
Explained Variance¶
| Dimension | Eigenvalue | Variance (%) | Cumulative Variance (%) |
|---|---|---|---|
| Dim.1 | ~2.92 | ~73% | ~73% |
| Dim.2 | ~0.91 | ~23% | ~96% |
| Dim.3 | ~0.15 | ~4% | ~99% |
| Dim.4 | ~0.02 | ~1% | ~100% |
INTERPRETATION: The first 2 dimensions capture ~96% of the total variance. This means we can reduce from 4D to 2D losing only ~4% of information.
Kaiser Rule¶
The Kaiser Rule states: Retain components with eigenvalue > 1
- Dim.1: Eigenvalue = 2.92 (Retain)
- Dim.2: Eigenvalue = 0.91 (Close to 1, retain for visualization)
- Dim.3: Eigenvalue = 0.15 (Discard)
- Dim.4: Eigenvalue = 0.02 (Discard)
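A quick check of the rule on these eigenvalues (a sketch reusing the standardized PCA fit from above):

```python
# Kaiser rule: retain components whose eigenvalue exceeds 1.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca = PCA().fit(StandardScaler().fit_transform(load_iris().data))
kept = [f"Dim.{i}" for i, ev in enumerate(pca.explained_variance_, 1) if ev > 1]
print("Retained by the Kaiser rule:", kept)  # only Dim.1 strictly exceeds 1
```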
Interpretation of the Dimensions¶
Dimension 1 (~73% of variance)¶
Most contributing variables:
- Petal Length (~34%)
- Petal Width (~32%)
Interpretation:
Dim.1 represents "petal size". Flowers with high Dim.1 values have large petals; low values have small petals.
Dimension 2 (~23% of variance)¶
Most contributing variables:
- Sepal Width (~85%)
Interpretation:
Dim.2 represents "sepal width". Flowers with high Dim.2 values have wide sepals; low values have narrow sepals.
Correlation Circle¶
The correlation circle shows how original variables relate to the principal dimensions:
Dim.2 (Sepal Width)
↑
|
Sepal Width |
↑ |
| |
─────────┼───────┼─────────→ Dim.1 (Petal Size)
| |
| Petal Length →
| Petal Width →
|
Observations:
- Petal Length and Petal Width are highly correlated (arrows in the same direction)
- Sepal Width is nearly perpendicular to petal measurements (low correlation)
- Sepal Length is between both dimensions
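The arrow coordinates in the circle are the correlations between each original variable and the two dimensions; a sketch of that computation (valid because the inputs are standardized; component signs may flip between libraries or runs):

```python
# Correlations between original variables and principal components
# (these are the coordinates of the arrows in the correlation circle).
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_std)

# correlation(variable j, component k) = loading_jk * sqrt(eigenvalue_k)
corr = pca.components_.T * np.sqrt(pca.explained_variance_)
print(pd.DataFrame(corr, index=iris.feature_names,
                   columns=["Dim.1", "Dim.2"]).round(2))
```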
5. K-Means Clustering¶
What is K-Means?¶
K-Means is a clustering algorithm that:
- Divides the data into K groups (clusters)
- Minimizes the distance of each point to its centroid
- Iterates until convergence
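To make those three steps concrete, here is an illustrative NumPy sketch of Lloyd's algorithm (scikit-learn's KMeans is the practical choice; this only shows the assignment/update loop):

```python
# Illustrative K-Means (Lloyd's algorithm): assign points, move centroids, repeat.
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        centroids_new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        if np.allclose(centroids_new, centroids):  # converged
            break
        centroids = centroids_new
    return labels, centroids
```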
Determining the Optimal Number of Clusters¶
Elbow Method¶
We plot inertia (sum of squared distances) vs K:
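A sketch of that computation, assuming the 2D PCA scores from the pipeline above:

```python
# Elbow method: inertia (within-cluster sum of squared distances) for K = 1..8.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_iris().data))
ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia")
plt.show()  # look for the 'elbow' where the curve flattens
```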
Interpretation: The "elbow" is at K=3, suggesting 3 clusters.
Silhouette Score¶
The Silhouette Score measures how well separated the clusters are:
- Value: Between -1 and 1
- Interpretation:
- Close to 1: Well separated clusters
- Close to 0: Overlapping clusters
- Negative: Misassigned points
Result for Iris: Silhouette Score ~ 0.55 (good separation)
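A sketch of the corresponding computation (the exact value depends on the seed and preprocessing):

```python
# Silhouette score for the K=3 solution (Rousseeuw, 1987).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_iris().data))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(f"Silhouette score: {silhouette_score(X_2d, labels):.2f}")
```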
Clustering Results¶
Confusion Matrix: Clusters vs Real Species¶
| Species | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---|---|
| Setosa | 50 | 0 | 0 |
| Versicolor | 0 | 48 | 2 |
| Virginica | 0 | 14 | 36 |
Observations:
- Setosa: Perfectly separated (100% in Cluster 0)
- Versicolor: Mostly in Cluster 1 (96%)
- Virginica: Mostly in Cluster 2 (72%), but with overlap with Versicolor
Cluster Purity¶
Purity measures the percentage of observations that fall in the cluster dominated by their own species. Here: (50 + 48 + 36) / 150 = 134 / 150 ≈ 89.3%.
INTERPRETATION: The K-Means algorithm correctly identified the species in 89.3% of cases, without knowing the real labels. This is excellent for an unsupervised method.
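A sketch of how the confusion matrix and purity can be computed (K-Means cluster numbering is arbitrary, so the column order may differ between runs):

```python
# Confusion matrix (clusters vs true species) and purity.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris.data))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

table = pd.crosstab(pd.Series(iris.target_names[iris.target], name="species"),
                    pd.Series(labels, name="cluster"))
print(table)

# Purity: each cluster votes for its majority species; count those observations.
purity = table.max(axis=0).sum() / table.values.sum()
print(f"Purity: {purity:.1%}")
```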
Cluster Visualization¶
In the PCA 2D space, the clusters look like this:
Dim.2
↑
│ ● Cluster 2 (Virginica)
│ ●●●
│ ●●●●
│ ●●●●
│ ●●●● ■■■ Cluster 1 (Versicolor)
│●●● ■■■■
───────┼■■■■■■■■■──────→ Dim.1
│
│ ▲▲▲
│ ▲▲▲▲▲
│▲▲▲▲▲▲ Cluster 0 (Setosa)
│
Centroids (marked with X):
- Cluster 0: (-2.7, 0.3) → Setosa
- Cluster 1: (0.3, -0.5) → Versicolor
- Cluster 2: (1.7, 0.2) → Virginica
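A plotting sketch for this view (cluster colors and numbering may differ from the diagram, since K-Means label order is arbitrary):

```python
# Scatter plot of the 2D PCA scores, colored by cluster, centroids marked with X.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_iris().data))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km.labels_, cmap="viridis", s=30)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker="X", s=200, c="red", label="Centroids")
plt.xlabel("Dim.1")
plt.ylabel("Dim.2")
plt.legend()
plt.show()
```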
6. Interpretation of Results¶
Full Panel: PCA + K-Means Clustering¶
(Figure: full panel combining the PCA projection with the K-Means cluster assignments.)
Analysis by Species¶
Setosa (Cluster 0)¶
Characteristics:
- Petal Length: Very small (~1.5 cm)
- Petal Width: Very small (~0.2 cm)
- Sepal Width: Relatively large
PCA Position:
- Dim.1: Very negative values (small petals)
- Dim.2: Positive values (wide sepals)
Separability: Perfect (100% correctly grouped)
Versicolor (Cluster 1)¶
Characteristics:
- Petal Length: Medium (~4.3 cm)
- Petal Width: Medium (~1.3 cm)
- Sepal Width: Medium
PCA Position:
- Dim.1: Values close to 0 (medium petals)
- Dim.2: Slightly negative values
Separability: Good (96% correctly grouped, 4% confused with Virginica)
Virginica (Cluster 2)¶
Characteristics:
- Petal Length: Large (~5.5 cm)
- Petal Width: Large (~2.0 cm)
- Sepal Width: Medium
PCA Position:
- Dim.1: Very positive values (large petals)
- Dim.2: Values close to 0
Separability: Moderate (72% correctly grouped, 28% confused with Versicolor)
Evaluation Metrics¶
| Metric | Value | Interpretation |
|---|---|---|
| Silhouette Score | 0.55 | Good separation between clusters |
| Davies-Bouldin Index | 0.66 | Compact and separated clusters (lower is better) |
| Calinski-Harabasz Index | 561.63 | High separation between clusters (higher is better) |
| Purity | 89.3% | High agreement with real species |
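A sketch computing the internal metrics in this table with scikit-learn (exact values depend on the seed and preprocessing):

```python
# Cluster evaluation metrics for the K=3 solution.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.preprocessing import StandardScaler

X_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_iris().data))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

print(f"Silhouette:        {silhouette_score(X_2d, labels):.2f}")
print(f"Davies-Bouldin:    {davies_bouldin_score(X_2d, labels):.2f}")    # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X_2d, labels):.2f}") # higher is better
```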
Why Do Versicolor and Virginica Overlap?¶
Biological Reason:
- Versicolor and Virginica are evolutionarily closer species
- They share similar morphological characteristics
- Setosa is more distinct (probably from a different lineage)
Statistical Reason:
- The petal measurements of Versicolor and Virginica have overlapping ranges
- There is no clear boundary in the 4-dimensional space
7. Conclusions and Recommendations¶
Main Conclusions¶
- PCA is Effective:
  - Reduces from 4D to 2D while retaining ~96% of the information
  - The first 2 dimensions are sufficient for visualization and clustering
- Petal Measurements are Key:
  - Petal Length and Petal Width are the most discriminating variables
  - Dim.1 (which represents petal size) explains ~73% of the variance
- K-Means Works Well:
  - Correctly identifies the 3 species in 89.3% of cases
  - Setosa is perfectly separable
  - Versicolor and Virginica have some natural overlap
- Validation of the Unsupervised Method:
  - Without knowing the labels, K-Means discovers the 3 natural groups
  - This validates that the species have real morphological differences
Lessons for Students¶
Lesson 1: The Importance of Dimensionality Reduction¶
BEFORE PCA: 4 variables → Hard to visualize → Hard to interpret
AFTER PCA: 2 dimensions → Easy to visualize → Clear patterns
Takeaway: You don't always need all the variables. Sometimes, less is more.
Lesson 2: Unsupervised Clustering Can Discover Real Structure¶
WITHOUT LABELS: K-Means finds 3 groups
WITH LABELS: There are 3 real species
MATCH: 89.3%
Takeaway: Data has natural structure. Algorithms can find it.
Lesson 3: Not All Groups Are Perfectly Separable¶
Setosa: 100% separable
Versicolor/Virginica: Natural overlap
Takeaway: In real data, overlap is normal. Don't expect perfect clusters.
Lesson 4: Validate, Validate, Validate¶
Elbow Method: Suggests K=3
Silhouette Score: Confirms K=3
Purity: Validates that K=3 is correct
Takeaway: Use multiple metrics to validate your decisions.
Practical Recommendations¶
For Iris Species Classification¶
- Focus on petal measurements (they are the most discriminating)
- Use PCA for visualization (reduces complexity without losing information)
- K=3 is optimal (validated by multiple metrics)
For Similar Data Analyses¶
- Always do EDA first (understand distributions and correlations)
- Standardize before PCA (variables on different scales bias results)
- Validate the number of clusters (don't assume K, use Elbow + Silhouette)
- Compare with ground truth (if available, as in this case)
Possible Extensions¶
- Other Clustering Algorithms:
  - DBSCAN (for arbitrarily shaped clusters)
  - Hierarchical Clustering (for dendrograms)
  - Gaussian Mixture Models (for probabilistic clusters)
- Supervised Classification:
  - Use the known species to train a classifier
  - Compare with unsupervised clustering
- Supplementary Variable Analysis:
  - Add geographic location information
  - Add collection season information
References¶
Original Papers¶
- Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
  - The original paper that introduced the Iris dataset
- Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society, 59, 2-5.
  - The botanist who collected the original data
Reference Books¶
- Husson, F., Le, S., & Pages, J. (2017). Exploratory Multivariate Analysis by Example Using R. CRC Press.
  - Main reference for FactoMineR-style PCA
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
  - Chapters on PCA and Clustering
Technical Articles¶
- Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
  - Documentation for the libraries used
- Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.
  - Silhouette Score method
Additional Resources¶
Similar Datasets¶
- Wine Dataset: 178 wines, 13 chemical variables, 3 classes
- Breast Cancer Dataset: 569 tumors, 30 variables, 2 classes (malignant/benign)
- Digits Dataset: 1797 digit images, 64 pixels, 10 classes
Author: @TodoEconometria | Professor: Juan Marcelo Gutierrez Miranda | Date: January 2026 | License: Educational use with attribution
Frequently Asked Questions (FAQ)¶
Why standardize before PCA?¶
Answer: Because PCA is sensitive to the scale of variables. If one variable has much larger values than another (e.g., income in thousands vs age in tens), it will dominate the variance and bias the results.
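A quick synthetic illustration of this point (the two columns and their scales are hypothetical):

```python
# Demonstration: an unscaled, large-magnitude variable dominates PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),      # small scale (e.g. age in tens)
                     rng.normal(0, 1000, 200)])  # large scale (e.g. income in thousands)

print(PCA().fit(X).explained_variance_ratio_)
# ~[1.00, 0.00]: the large-scale variable swallows all the variance
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)
# ~[0.5, 0.5]: after standardization, both variables contribute
```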
How many components should I retain?¶
Answer: It depends on the objective:
- Visualization: 2-3 components
- Kaiser Rule: Components with eigenvalue > 1
- Cumulative Variance: Retain until reaching 80-95% of variance
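For the cumulative-variance criterion, scikit-learn's PCA accepts the target fraction directly; a sketch:

```python
# Selecting components by cumulative variance with a target fraction.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=0.95).fit(X_std)  # keep enough components for >= 95% variance
print(pca.n_components_)                 # 2 on the standardized Iris data
```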
Does K-Means always find the correct clusters?¶
Answer: No. K-Means has limitations:
- Assumes spherical clusters
- Sensitive to initialization (use a high n_init)
- Requires specifying K in advance
What if I have more than 3 species?¶
Answer: The process is the same:
- Use Elbow + Silhouette to determine the optimal K
- Validate with metrics (purity, confusion matrix)
- Visualize in 2D with PCA (even with more than 3 clusters)
---
END OF DOCUMENT