NCI microarray data. There are 64 columns corresponding to patients with 14 different cancers. Each patient’s microarray has 6830 genes. Original source: http://genome-www.stanford.edu/nci60/. This dataset is discussed in the textbook ISL and is also available in the R package ISLR.
In addition, it is available in CSV format as nci.csv on our webpage. The dataset is in expression-matrix form, so its dimensions are 6830 rows (genes) by 64 columns (patients).
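Because the expression matrix stores genes as rows, clustering patients requires treating each column as one observation. The sketch below shows that transposition in Python on a tiny made-up CSV string. The layout (no header row, no row names) and the name `toy_csv` are assumptions for illustration, not a description of nci.csv itself.

```python
import csv
import io

# Hypothetical stand-in for the real file: 2 genes (rows) x 3 patients (columns).
# The absence of a header row and of row names is an assumption.
toy_csv = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# Read the matrix in expression-matrix form: one row per gene.
genes_by_patients = [[float(x) for x in row]
                     for row in csv.reader(io.StringIO(toy_csv))]

# Transpose so that each row is one patient, ready for clustering patients.
patients_by_genes = [list(col) for col in zip(*genes_by_patients)]

print(len(genes_by_patients), "genes x", len(genes_by_patients[0]), "patients")
print(patients_by_genes)
```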
The dotchart below summarizes the number of patients for each cancer.
To identify the number of clusters K, a frequently used method is to plot the percentage change in the within-cluster sum of squares (Within SS) and select the number of clusters after which the percentage change diminishes to a small value. This is similar to the use of the scree plot to select the number of principal components. From the plot below, k=4 and k=5 are reasonable values to try.
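The percentage-change calculation described above can be sketched in plain Python. This is a minimal illustration on made-up two-dimensional points, not the analysis that produced the plot: the helper names (`sq_dist`, `centroid`, `kmeans_wss`), the number of random starts and the toy data are all assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(group) for coord in zip(*group))

def kmeans_wss(points, k, n_starts=10, max_iter=100, seed=0):
    """Smallest Within SS found by Lloyd's k-means over several random starts."""
    rng = random.Random(seed)
    best = float("inf")
    for _start in range(n_starts):
        centers = rng.sample(points, k)  # initial centers: k distinct data points
        for _step in range(max_iter):
            # Assignment step: attach every point to its nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
                groups[nearest].append(p)
            # Update step: move each center to its group mean
            # (an empty group keeps its previous center).
            new_centers = [centroid(g) if g else centers[j]
                           for j, g in enumerate(groups)]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best = min(best, wss)
    return best

# Toy data (made up for illustration): three loose groups of 2-D points.
rng = random.Random(1)
toy_points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
              for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(10)]

ks = list(range(1, 7))
wss = [kmeans_wss(toy_points, k) for k in ks]

# Percentage drop in Within SS when moving from k-1 to k clusters.
pct_change = [100 * (wss[i - 1] - wss[i]) / wss[i - 1] for i in range(1, len(wss))]
for k, change in zip(ks[1:], pct_change):
    print(f"k={k}: Within SS fell by {change:.1f}%")
```

Several random starts are used because a single k-means run can stop in a poor local optimum, which would make the percentage-change curve noisy.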
Selecting k=4, the clusters can be summarized in a crosstab.
##           ClusterID
## Cancer     1 2 3 4
## BREAST     3 2 2 0
## CNS        5 0 0 0
## COLON      0 0 7 0
## K562A      0 0 0 1
## K562B      0 0 0 1
## LEUKEMIA   0 0 0 6
## MCF7A      0 0 1 0
## MCF7D      0 0 1 0
## MELANOMA   1 7 0 0
## NSCLC      6 0 3 0
## OVARIAN    4 0 2 0
## PROSTATE   1 0 1 0
## RENAL      9 0 0 0
## UNKNOWN    1 0 0 0
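A crosstab like the one above is a count of (cancer label, cluster ID) pairs. The Python sketch below rebuilds the pairs for three rows of the table (COLON, MELANOMA, LEUKEMIA) and tabulates them again. The variable and function names are hypothetical, and the pairs are reconstructed from the printed counts rather than taken from the original cluster assignment.

```python
from collections import Counter

# (cancer label, cluster ID) pairs rebuilt from three rows of the table above.
pairs = ([("COLON", 3)] * 7
         + [("MELANOMA", 1)] + [("MELANOMA", 2)] * 7
         + [("LEUKEMIA", 4)] * 6)

counts = Counter(pairs)  # missing (label, cluster) combinations count as 0
cancers = sorted({cancer for cancer, _cluster in pairs})
clusters = [1, 2, 3, 4]

def crosstab_rows(counts, cancers, clusters):
    """Return {cancer: [count in cluster 1, count in cluster 2, ...]}."""
    return {c: [counts[(c, k)] for k in clusters] for c in cancers}

table = crosstab_rows(counts, cancers, clusters)

print(f"{'Cancer':<10}" + "".join(f"{k:>3}" for k in clusters))
for cancer in cancers:
    print(f"{cancer:<10}" + "".join(f"{n:>3}" for n in table[cancer]))
```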
Crosstabs are best understood using a mosaic plot. This plot shows that all Colon and MCF7 samples are clustered together in group 3, while all Renal and CNS samples are in the same group (ID=1). There are many other groupings that may be of interest. In practice, we would also examine the clustering with k=5 centers.
The hierarchical clustering using complete linkage with the Euclidean distance function shows that many of the cancers group together, such as colon, melanoma and renal.
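To make the linkage rule concrete, here is a naive Python sketch of agglomerative clustering with complete linkage and Euclidean distance. It is meant only to illustrate the rule on four made-up points. The function name `complete_linkage` and the sample labels are hypothetical, and this brute-force version would be slow on the full dataset.

```python
import math
from itertools import combinations

def complete_linkage(points):
    """Agglomerative clustering with complete linkage and Euclidean distance.

    points: dict mapping a sample label to its coordinate tuple.
    Returns the merges in order as (members_a, members_b, merge_height).
    """
    clusters = [(label,) for label in points]  # start with singleton clusters
    merges = []

    def linkage(pair):
        # Complete linkage: distance between two clusters is the largest
        # pairwise distance between their members.
        a, b = pair
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge the two clusters that are closest under complete linkage.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy samples (made up): two tight pairs that are far from each other.
toy_samples = {"s1": (0.0, 0.0), "s2": (0.0, 1.0),
               "s3": (10.0, 0.0), "s4": (10.0, 1.5)}

merges = complete_linkage(toy_samples)
for a, b, height in merges:
    print(f"merge {a} + {b} at height {height:.2f}")
```

The merge heights are what a dendrogram draws on its vertical axis, so samples that join at a low height are the ones that appear grouped together.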
With microarray datasets, it is often of interest to examine how the genes are clustered across patients. Usually there are so many genes that we use a feature selection method to reduce their number, so that a manageable dendrogram can be produced. In the example below, we selected the 40 most variable genes.
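One simple way to pick the most variable genes is to rank them by their variance across patients and keep the top n. The Python sketch below does this on a made-up four-gene matrix. Using the sample variance as the measure of variability is an assumption, and the gene names and the function `most_variable_genes` are hypothetical.

```python
from statistics import variance

def most_variable_genes(expr, n):
    """Return the names of the n genes with the largest sample variance.

    expr: dict mapping a gene name to its expression values across patients.
    """
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n]

# Toy expression matrix (made up): one entry per gene, one value per patient.
toy_expr = {
    "gene_a": [0.1, 0.2, 0.1, 0.2],
    "gene_b": [-2.0, 2.0, -1.5, 1.8],
    "gene_c": [0.5, -0.5, 0.6, -0.4],
    "gene_d": [1.0, 1.0, 1.0, 1.1],
}

top = most_variable_genes(toy_expr, 2)
print(top)
```

With the real data the same ranking would be applied with n=40, and the reduced matrix would then be clustered by gene.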