NCI microarray data. There are 64 columns corresponding to patients with 14 different cancers. Each patient’s microarray has 6830 genes. Original source: http://genome-www.stanford.edu/nci60/. This dataset is discussed in the textbook ISL and is also available in the R package ISLR.
In addition, it is available in CSV format as nci.csv on our webpage. The dataset is in expression-matrix form, with genes as rows and patients as columns, so its dimensions are 6830 rows by 64 columns.
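Because the patients are the objects being clustered, the expression matrix usually has to be transposed so that each patient becomes a row before it is passed to a clustering routine. The following is a minimal Python sketch of that step. The layout of nci.csv is not described here, so the reader function assumes a purely numeric file with no header row and no row names, and a three-gene, two-patient string stands in for the real file.

```python
import csv
import io

def read_matrix(handle):
    """Read a purely numeric CSV (assumed: no header row, no row names)."""
    return [[float(cell) for cell in row] for row in csv.reader(handle) if row]

def transpose(matrix):
    """Swap rows and columns: genes x patients becomes patients x genes."""
    return [list(col) for col in zip(*matrix)]

# Tiny stand-in for nci.csv: 3 genes (rows) by 2 patients (columns).
demo = io.StringIO("0.1,0.2\n-0.3,0.4\n0.5,-0.6\n")
genes_by_patients = read_matrix(demo)
patients_by_genes = transpose(genes_by_patients)

print(len(genes_by_patients), len(genes_by_patients[0]))  # 3 2
print(len(patients_by_genes), len(patients_by_genes[0]))  # 2 3
```

Applied to the real file, the same transpose would turn the 6830 by 64 matrix into 64 rows of 6830 gene values each.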
The dotchart below summarizes the number of patients for each cancer type.
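The counts behind such a chart can be tabulated in a few lines. This Python sketch takes the per-cancer counts from the row totals of the k=4 crosstab later in this section, expands them into one label per patient, and prints a rough text version of a dotchart. The helper `text_dotchart` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

# Patients per cancer label, taken as the row totals of the k=4 crosstab
# in this section (for example BREAST: 3 + 2 + 0 + 2 = 7).
row_totals = {
    "BREAST": 7, "CNS": 5, "COLON": 7, "K562A": 1, "K562B": 1,
    "LEUKEMIA": 6, "MCF7A": 1, "MCF7D": 1, "MELANOMA": 8, "NSCLC": 9,
    "OVARIAN": 6, "PROSTATE": 2, "RENAL": 9, "UNKNOWN": 1,
}

# Expand to one label per patient, the form a label vector would have.
labels = [name for name, n in row_totals.items() for _ in range(n)]
counts = Counter(labels)

def text_dotchart(counts):
    """One line per category, sorted by count: label, dotted rule, count."""
    lines = []
    for name, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        lines.append(f"{name:<9} {'.' * n}o  {n}")
    return lines

print("\n".join(text_dotchart(counts)))
print(len(labels), "patients,", len(counts), "cancer labels")
```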
Next we compute the within-cluster sum of squares (Within SS) for a range of cluster counts and use the BIC to select the number of clusters.
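To make the two quantities concrete, here is a small pure-Python sketch on synthetic data. It runs a plain Lloyd's k-means for several values of k, records the total Within SS, and adds a BIC-style penalty. The exact BIC formula used for this analysis is not stated, so the form Within SS + log(n) * m * k (n observations, m variables, k clusters) is an assumption: it is one heuristic often used with k-means, and it depends on the scale of the data. The data, function names, and parameter values are all illustrative.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (assignments, total Within SS)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        assign = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an emptied cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    within_ss = sum(sq_dist(p, centers[a]) for p, a in zip(points, assign))
    return assign, within_ss

def best_within_ss(points, k, n_start=10):
    """Smallest Within SS over several random starts."""
    return min(kmeans(points, k, seed=s)[1] for s in range(n_start))

def bic_heuristic(within_ss, n, m, k):
    """Assumed penalty form (not confirmed by the text): WSS + log(n)*m*k."""
    return within_ss + math.log(n) * m * k

# Synthetic stand-in data: 3 blobs of 20 points each in 2 dimensions.
gen = random.Random(1)
points = [[cx + gen.gauss(0, 0.5), cy + gen.gauss(0, 0.5)]
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(20)]

n, m = len(points), len(points[0])
within = {k: best_within_ss(points, k) for k in range(1, 7)}
bic = {k: bic_heuristic(within[k], n, m, k) for k in within}
best_k = min(bic, key=bic.get)

for k in within:
    print(k, round(within[k], 1), round(bic[k], 1))
print("k with the smallest BIC heuristic:", best_k)
```

Several random starts are used because a single run of Lloyd's algorithm can stop at a poor local optimum, which would distort the Within SS curve.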
The method suggested in the ISL textbook is to look at a plot of the percentage change in Within SS and to choose the number of clusters just after the largest changes, where adding further clusters yields only small improvements. This is similar to the use of the scree plot to select the number of principal components. From the plot below we can try k=4 and k=5.
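The percentage change itself is a one-line calculation. The Python sketch below uses made-up Within SS values, chosen only so that the arithmetic is easy to follow. They are not the values obtained from the NCI data.

```python
# Hypothetical Within SS values for k = 1..6 (illustrative, not NCI results).
within_ss = {1: 1000.0, 2: 800.0, 3: 600.0, 4: 450.0, 5: 432.0, 6: 420.0}

def pct_drop(within_ss):
    """Percentage decrease in Within SS when moving from k-1 to k clusters."""
    ks = sorted(within_ss)
    return {k: 100.0 * (within_ss[prev] - within_ss[k]) / within_ss[prev]
            for prev, k in zip(ks, ks[1:])}

drops = pct_drop(within_ss)
for k, d in drops.items():
    print(f"k={k}: {d:.1f}% drop")
```

With these illustrative numbers the drops are large up to k=4 (20%, 25%, 25%) and small afterwards (4% and about 2.8%), which is the pattern that would point towards k=4.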
Selecting k=4, the clusters can be summarized in a crosstab of cancer type against cluster ID.
##             ClusterID
## Cancer       1  2  3  4
##   BREAST     3  2  0  2
##   CNS        5  0  0  0
##   COLON      0  0  0  7
##   K562A      0  0  1  0
##   K562B      0  0  1  0
##   LEUKEMIA   0  0  6  0
##   MCF7A      0  0  0  1
##   MCF7D      0  0  0  1
##   MELANOMA   1  7  0  0
##   NSCLC      6  0  0  3
##   OVARIAN    4  0  0  2
##   PROSTATE   1  0  0  1
##   RENAL      9  0  0  0
##   UNKNOWN    1  0  0  0
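A crosstab like the one above is just a count of (cancer, cluster) pairs. The Python sketch below rebuilds per-patient label vectors from the printed table, counts the pairs again, and derives two simple summaries: the size of each cluster and the most frequent cancer in each cluster. The function `crosstab` and the "share in the dominant cancer" summary are additions for illustration, not part of the original analysis.

```python
from collections import Counter

def crosstab(row_labels, col_labels):
    """Count co-occurrences of (row label, column label) pairs."""
    return Counter(zip(row_labels, col_labels))

# The k=4 table above: one row per cancer label, columns are clusters 1..4.
table = {
    "BREAST": [3, 2, 0, 2], "CNS": [5, 0, 0, 0], "COLON": [0, 0, 0, 7],
    "K562A": [0, 0, 1, 0], "K562B": [0, 0, 1, 0], "LEUKEMIA": [0, 0, 6, 0],
    "MCF7A": [0, 0, 0, 1], "MCF7D": [0, 0, 0, 1], "MELANOMA": [1, 7, 0, 0],
    "NSCLC": [6, 0, 0, 3], "OVARIAN": [4, 0, 0, 2], "PROSTATE": [1, 0, 0, 1],
    "RENAL": [9, 0, 0, 0], "UNKNOWN": [1, 0, 0, 0],
}

# Rebuild one (cancer, cluster) pair per patient from the table.
cancers, clusters = [], []
for cancer, row in table.items():
    for cluster_id, n in enumerate(row, start=1):
        cancers += [cancer] * n
        clusters += [cluster_id] * n

counts = crosstab(cancers, clusters)
cluster_sizes = Counter(clusters)

# Most frequent cancer label within each cluster.
dominant = {c: max(table, key=lambda name: counts[(name, c)])
            for c in sorted(cluster_sizes)}
# Share of patients whose cancer is the dominant one in their cluster.
share = sum(counts[(dominant[c], c)] for c in dominant) / len(cancers)

print(dict(sorted(cluster_sizes.items())))  # {1: 30, 2: 9, 3: 8, 4: 17}
print(dominant)  # {1: 'RENAL', 2: 'MELANOMA', 3: 'LEUKEMIA', 4: 'COLON'}
print(round(share, 3))  # 0.453
```

Read this way, clusters 2 and 3 are dominated by melanoma and leukemia respectively, while cluster 1 mixes several cancer types.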
Crosstabs are best understood using a Mosaic Plot.