Introduction

The data were collected on the Gaspé Peninsula in Quebec and comprise 50 four-variate measurements on each of 3 Iris species (setosa, versicolor, virginica). The four measurements are the lengths and widths of the petals and sepals of the flowers, for a total of 150 observations.

Iris Petal and Sepal

This dataset is available in the datasets package bundled with R, so it is one of the “built-in” datasets. Here are the first four rows,

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
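The rows above can be reproduced directly, since iris is available without loading any package:

```r
# iris ships with the datasets package, attached by default in R
head(iris, 4)
```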

The lattice boxplot display summarizes the distribution of each variable for each Iris species.

Notice that using PL (petal length) alone we can discriminate perfectly between setosa and the other two species. The differences in PL between versicolor and virginica are statistically strong, but the distributions of all four measurements overlap slightly, so no simple discrimination rule is evident.

The unit of measurement is centimeters (cm), so perhaps rescaling is not needed. But variables with larger values exert a stronger influence if we use the raw data; in this case SL (sepal length) would dominate, although the boxplots in Figure 1 suggest PL (petal length) may discriminate better.

Figure 1. Lattice Boxplots of Fisher’s Iris Data

K-Means Clustering

If we ignore the data labels, in other words do not use the output variable Species in the analysis, we can treat the Iris dataset as a clustering problem.

Let’s assume that we know that there are 3 Species of Iris flowers in our sample of 150 observations of Sepal/Petal length/width. So if we use the K-means algorithm we may try K=3. Running kmeans() with data matrix X,

X <- as.matrix(iris[, 1:4])  # the four measurement columns, dropping Species
(ans <- kmeans(X, centers=3, nstart=100, iter.max=100)$cluster)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3

Notice that each observation has been placed in a group labelled from 1 to 3. The lattice parallelplot() can be used to provide a high-dimensional data visualization. The colors only help distinguish the observations in each group and do not encode any other information such as Species. Visually inspecting this parallel plot, we see that all three clusters differ in Petal Width and Petal Length. There is some overlap at Sepal Width, where Cluster 1 tends to the left while Cluster 2 tends to the right, and Clusters 2 and 3 point in opposite directions. Similar differences appear at Sepal Length.
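The parallel plot described above can be sketched as follows; the exact panel and color settings used in the original figure may differ:

```r
library(lattice)
X <- as.matrix(iris[, 1:4])
ans <- kmeans(X, centers = 3, nstart = 100, iter.max = 100)$cluster
# one polyline per observation, colored by its assigned cluster
parallelplot(~ iris[1:4], groups = ans)
```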

To evaluate how well the clustering has worked, we can do a post-hoc analysis using the labels. The table below shows the number of cases in each cluster.

##          Species
## ClusterID setosa versicolor virginica
##         1     50          0         0
##         2      0          2        36
##         3      0         48        14

From the table we see that ClusterID 1 is comprised entirely of setosa flowers, ClusterID 2 contains mostly virginica, and ClusterID 3 mostly versicolor. The number of misclassifications is 2 + 14 = 16, and this serves as a measure of how well our clustering has worked.
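A crosstab like the one above can be produced with table(); note that kmeans cluster numbering is arbitrary, so the row order may differ between runs:

```r
X <- as.matrix(iris[, 1:4])
ans <- kmeans(X, centers = 3, nstart = 100, iter.max = 100)$cluster
# rows: assigned cluster, columns: true species
table(ClusterID = ans, Species = iris$Species)
```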

Now we repeat the K-means clustering using the rescaled Iris variables. For comparison, the crosstab showing how successful the clustering has been in separating the Species is shown below. This time we see that there are 14 + 11 = 25 errors, so clustering on the raw data worked better.

##          Species
## ClusterID setosa versicolor virginica
##         1      0         39        14
##         2      0         11        36
##         3     50          0         0
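The rescaled analysis only changes the input to kmeans(); a sketch using scale() to standardize each column (again, cluster numbering may vary):

```r
X <- as.matrix(iris[, 1:4])
# center and scale each column before clustering
ansS <- kmeans(scale(X), centers = 3, nstart = 100, iter.max = 100)$cluster
table(ClusterID = ansS, Species = iris$Species)
```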

Clustering by Variable

Often clustering of observations is most natural, but it is sometimes of interest to cluster the variables instead. This can be done simply by applying the clustering algorithm to the transpose of the data matrix.

kmeans(t(X), centers=2, nstart=100, iter.max=100)$cluster
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##            2            1            2            1

The clustering algorithm groups the two length measurements together and the two width measurements together. To see why this makes sense, consider the table showing the Tukey five-number summary for each variable. In terms of median and range, the two length measurements are more similar to each other, as are the two width measurements.

##        Sepal.Length Sepal.Width Petal.Length Petal.Width
## min             4.3         2.0         1.00         0.1
## Q1              5.1         2.8         1.60         0.3
## median          5.8         3.0         4.35         1.3
## Q3              6.4         3.3         5.10         1.8
## max             7.9         4.4         6.90         2.5
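The summary table above can be assembled by applying fivenum() column-wise; note that fivenum() reports the hinges, which are close to but not always identical to the quartiles, so the Q1/Q3 row labels are approximate:

```r
X <- as.matrix(iris[, 1:4])
# Tukey five-number summary for each column
tukey5 <- apply(X, 2, fivenum)
rownames(tukey5) <- c("min", "Q1", "median", "Q3", "max")
tukey5
```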

Identifying the Number of Clusters with K-Means

One way to identify the number of clusters is to examine a plot of the percentage change in the within sum-of-squares. The plot below suggests we may try K = 4 or perhaps K = 3. Since there are 3 Species, K = 3 is the better answer!

The graph was redone using standardized data and it remains essentially the same - the graph is not shown but the code is included in the Rmd file.
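Since the Rmd itself is not shown here, the following is one plausible sketch of how such an elbow plot can be built: compute the total within sum-of-squares for a range of K and plot the percentage change from one K to the next.

```r
X <- as.matrix(iris[, 1:4])
# total within sum-of-squares for K = 1..8
wss <- sapply(1:8, function(k)
  kmeans(X, centers = k, nstart = 100, iter.max = 100)$tot.withinss)
# percentage change in within SS as K increases by 1
pctChange <- 100 * diff(wss) / wss[-length(wss)]
plot(2:8, pctChange, type = "b", xlab = "K",
     ylab = "% change in within SS")
```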

Hierarchical Clustering

The base R function for hierarchical clustering is hclust(), which performs agglomerative clustering with complete linkage by default. Given a data matrix, the first step is to compute a dissimilarity matrix using the function dist(). The default is Euclidean distance, which is usually the best choice, but other options are provided so one can experiment.

Running hclust() and then plot() produces the dendrogram.

ans <- hclust(dist(X))
plot(ans, labels=FALSE)

We can use cutree() to effectively prune the tree and see how well the tree performs with respect to the known 3 Species.

ansC <- cutree(ans, k=3)
plot(ansC, yaxp=c(0,4,4), xlab="Observation Number", ylab="Group")
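As with K-means, a crosstab against the labels shows how the three pruned groups line up with the Species:

```r
X <- as.matrix(iris[, 1:4])
# prune the complete-linkage tree to 3 groups
ansC <- cutree(hclust(dist(X)), k = 3)
table(Group = ansC, Species = iris$Species)
```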

The cophenetic correlation coefficient, like the coefficient of determination in multiple linear regression, measures the goodness of fit of the dendrogram to the data. Reference: https://en.wikipedia.org/wiki/Cophenetic_correlation. It can be used to compare hierarchical dendrograms fit with various settings, such as rescaled versus unrescaled data or different distance or linkage functions. The code below shows how the cophenetic correlations can be computed and compared. From the result we see that using rescaled data produces a slightly better fit.

DX <- dist(X)
DXS <- dist(scale(X))
CX <- cophenetic(hclust(DX))
CXS <- cophenetic(hclust(DXS))
R <- c(cor(DX, CX), cor(DXS, CXS))
names(R) <- c("original scale", "rescaled")
R
## original scale       rescaled 
##      0.7269857      0.7514592

However, when we compare how well the clusters agree with the 3 known Species, the tree built on the original scale is preferable.

# how well does the tree match the Species classes
ans <- cutree(hclust(dist(scale(X))), k=3)
plot(ans, yaxp=c(0,4,4), xlab="Observation Number", ylab="Group")