The oil spill dataset consists of 17 observations, denoted a, b, …, q, corresponding to 17 oil spills. For each oil spill, a spectroscopy analysis determined the concentration of 173 ions, but only 84 of these ions correspond to inputs with a non-zero interquartile range. The spectroscopy ion decomposition acts like a fingerprint that identifies which ship was responsible for dumping the engine oil. Our problem is to determine how many ships were involved in these 17 oil spills.
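The interquartile-range screen described above can be sketched as follows. The real ion concentrations are not reproduced here, so the matrix `raw` and its column names are hypothetical stand-ins: 17 rows labelled a to q and six made-up ion columns, two of which are constant.

```r
# Synthetic stand-in for the spill data (an assumption, not the real values):
# 17 rows labelled a..q, 6 hypothetical ion columns, the last 2 constant.
set.seed(1)
raw <- cbind(matrix(rnorm(17 * 4), nrow = 17), 0, 0)
rownames(raw) <- letters[1:17]
colnames(raw) <- paste0("ion", 1:6)

# Interquartile range of every column.
iqr <- apply(raw, 2, IQR)

# Keep only the inputs whose interquartile range is non-zero.
X <- raw[, iqr > 0, drop = FALSE]
dim(X)
```

Applied to the real data, the same two lines would presumably reduce the 173 ion columns to the 84 used in the rest of the analysis.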
We can plot the percentage change in the Within-Sum-of-Squares as the number of centres K is varied in the kmeans() algorithm. The plot below suggests K=4.
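A minimal sketch of how such a plot might be produced is given here. It uses hypothetical data (17 points scattered around four well-separated centres) rather than the spill data, and the range of K and the value of `nstart` are arbitrary choices made for illustration.

```r
# Hypothetical data: 17 points in 5 dimensions around 4 separated centres.
set.seed(2)
centres <- 10 * diag(5)[1:4, ]
X <- centres[rep(1:4, length.out = 17), ] +
  matrix(rnorm(17 * 5, sd = 0.5), nrow = 17)

# Total within-cluster sum of squares for K = 1, ..., 6.
Ks <- 1:6
wss <- sapply(Ks, function(k) kmeans(X, centers = k, nstart = 50)$tot.withinss)

# Percentage change in WSS when moving from K - 1 to K centres.
pct_change <- 100 * diff(wss) / head(wss, -1)

plot(Ks[-1], pct_change, type = "b",
     xlab = "Number of centres K",
     ylab = "% change in within sum-of-squares")
```

One would then look for the value of K beyond which adding another centre no longer produces a large percentage drop.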
Running kmeans() with 4 centers, we obtain the clustering shown below. Observations {a, b, c, i} are in cluster 1; {d, e, o, q} are in cluster 2; {f, g, h, k, l, m, n, p} are in cluster 4; and j is the only observation in cluster 3.
set.seed(7775555) #keep group labels 1,2,3,4 the same
ans <- kmeans(X,centers=4, nstart=10^4, iter.max=1000)$cluster
ans
## a b c d e f g h i j k l m n o p q
## 1 1 1 2 2 4 4 4 1 3 4 4 4 4 2 4 2
Note that in the above code we used 10,000 random restarts and fixed the initial seed. In theory, k-means only finds a local minimum of the within sum-of-squares function, so setting the random seed is necessary if you want a reproducible result. I ran the algorithm with different random seeds and obtained the same minimum within sum-of-squares each time, but the cluster indices were often permuted even though the effective grouping remained the same.
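One way to check that two runs give the same grouping regardless of how the labels are numbered is to compare co-membership, that is, whether each pair of observations falls in the same cluster in both runs. The sketch below does this on hypothetical data; the helpers `fit_with_seed()` and `same_partition()` are names introduced here for illustration and are not part of the original analysis.

```r
# Hypothetical data: 17 points around 4 well-separated centres.
set.seed(3)
centres <- 10 * diag(5)[1:4, ]
X <- centres[rep(1:4, length.out = 17), ] +
  matrix(rnorm(17 * 5, sd = 0.5), nrow = 17)
rownames(X) <- letters[1:17]

# Fit k-means with a given seed (illustrative helper).
fit_with_seed <- function(seed) {
  set.seed(seed)
  kmeans(X, centers = 4, nstart = 100)
}
fit1 <- fit_with_seed(1)
fit2 <- fit_with_seed(2)

# Two label vectors describe the same partition when every pair of
# observations is either together in both or apart in both.
same_partition <- function(a, b) {
  all(outer(a, a, "==") == outer(b, b, "=="))
}

same_partition(fit1$cluster, fit2$cluster)       # grouping, ignoring labels
all.equal(fit1$tot.withinss, fit2$tot.withinss)  # objective value
table(fit1$cluster, fit2$cluster)                # shows any label permutation
```

When the groupings agree, the cross-tabulation has exactly one non-zero entry in each row and column, and the position of those entries shows how the labels were permuted.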
We can use a parallel plot to visualize this high-dimensional clustering. From this plot we see that cluster 4 includes many spills while cluster 3 includes only one spill (j). Clusters 1 and 2 look similar, but there are some differences, especially near the top and bottom parts of the panels.
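A parallel plot with one panel per cluster could be drawn along the following lines. This is only a sketch: it assumes the lattice package and uses hypothetical data with made-up ion names, since the plotting code behind the original figure is not shown.

```r
library(lattice)  # recommended package shipped with standard R installations

# Hypothetical data: 17 points around 4 well-separated centres.
set.seed(4)
centres <- 10 * diag(5)[1:4, ]
X <- centres[rep(1:4, length.out = 17), ] +
  matrix(rnorm(17 * 5, sd = 0.5), nrow = 17)
dimnames(X) <- list(letters[1:17], paste0("ion", 1:5))

cl <- kmeans(X, centers = 4, nstart = 100)$cluster

# One line per spill, one panel per cluster; each variable is rescaled
# to its own range by parallelplot().
df <- as.data.frame(X)
cluster <- factor(cl, labels = paste("cluster", 1:4))
p <- parallelplot(~ df | cluster, horizontal.axis = TRUE)
print(p)
```

With `horizontal.axis = TRUE` the variables are stacked vertically, so differences between clusters show up as lines that separate near particular rows of the panels.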
We use hclust() with Ward’s minimum variance linkage to obtain the cluster dendrogram below. This dendrogram suggests that there are three main clusters, corresponding to {j}, {n, k, l, g, h, m, f, p} and {b, a, i, c, o, q, d, e}.
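The hierarchical clustering step can be sketched as below, again on hypothetical data. The text does not say which of R's two Ward variants was used, so `method = "ward.D2"` and the Euclidean distance from `dist()` are assumptions.

```r
# Hypothetical data: 17 points around 4 well-separated centres.
set.seed(5)
centres <- 10 * diag(5)[1:4, ]
X <- centres[rep(1:4, length.out = 17), ] +
  matrix(rnorm(17 * 5, sd = 0.5), nrow = 17)
rownames(X) <- letters[1:17]

# Euclidean distances, then Ward's minimum variance linkage (assumed variant).
d <- dist(X)
hc <- hclust(d, method = "ward.D2")
plot(hc, main = "Cluster dendrogram", xlab = "", sub = "")

# Cut the tree into three groups and list the members of each.
groups <- cutree(hc, k = 3)
split(names(groups), groups)
```

Cutting the tree with `cutree()` gives an explicit membership list that can be compared with the k-means labels.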