In the initial setup, we attached the glmnet package.
High-quality expression profiles were derived from 52 prostate tumors and 50 non-tumor prostate samples from patients undergoing surgery, using microarrays containing probes for approximately 12600 genes. The goal here is to classify tumor and non-tumor samples. The training set consists of 102 prostate tissues, of which 50 are normal and 52 are tumor samples; the test set consists of 34 tissues, of which 9 are normal and 25 are tumor samples.
This dataset was published in a paper by Pochet et al. (2004) and is available from my homepage as the R workspace Singh.Rdata at http://www.stats.uwo.ca/faculty/aim/2017/sdm/data/. Because of its size, you should download this file to your computer before trying to use it.
Data source: Nathalie Pochet, Frank De Smet, Johan A.K. Suykens and Bart L.R. De Moor (2004). Systematic benchmarking of microarray data classification: assessing the role of nonlinearity and dimensionality reduction. Bioinformatics. Published July 1, 2004.
Since the data is saved as a list in an R workspace, we simply use attach().
#attach("d:/dropbox/r/2017/sdm/data/singh.rda")
attach("e:/dropbox/r/2017/sdm/data/singh.rda")
search()
## [1] ".GlobalEnv"
## [2] "file:e:/dropbox/r/2017/sdm/data/singh.rda"
## [3] "package:glmnet"
## [4] "package:foreach"
## [5] "package:Matrix"
## [6] "package:stats"
## [7] "package:graphics"
## [8] "package:grDevices"
## [9] "package:utils"
## [10] "package:datasets"
## [11] "package:methods"
## [12] "Autoloads"
## [13] "package:base"
objects(2)
## [1] "Singh"
names(Singh)
## [1] "XTrain" "yTrain" "XTest" "yTest"
By default, attach() puts the object in position 2 of the search path. objects(2) tells us that the only object in position 2 is Singh. Using names(), we deduce that Singh is a list with four named components: XTrain, yTrain, XTest and yTest.
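The same mechanics can be seen on a toy workspace; this is a minimal sketch using a temporary file, and the object names (Demo, x, y) are made up for illustration:

```r
# Toy illustration of attach() on a saved workspace, using a temporary file
tmp <- tempfile(fileext = ".rda")
Demo <- list(x = 1:3, y = letters[1:3])
save(Demo, file = tmp)
attach(tmp)            # the file appears at position 2 of the search path
pos2 <- objects(2)     # "Demo" -- the only object saved in the file
comps <- names(Demo)   # "x" "y" -- the list's named components
detach(2)              # tidy up: remove the file from the search path
```

Remember to detach() when finished, so the search path does not accumulate stale entries.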
It is easier to work with variables, and my computer has enough memory to deal with extra copies, so each component is put in a separate variable. Using dim() and table() we deduce that XTr and XTe are expression matrices with 12600 rows (genes) and 102 and 34 columns (samples) respectively, and that normal and tumor tissues are encoded in the response variables as 1 and 2 respectively.
XTr <- Singh$XTrain #expression matrix, genes by samples
yTr <- Singh$yTrain #1=normal, 2=tumor
XTe <- Singh$XTest #expression matrix, genes by samples
yTe <- Singh$yTest #1=normal, 2=tumor
dim(XTr)
## [1] 12600 102
table(yTr)
## yTr
## 1 2
## 50 52
We use glmnet() and cv.glmnet() to fit a lasso logistic regression model, choosing the penalty by 10-fold CV with Breiman's one-standard-deviation rule. Then we predict on the test data and compute the confusion matrix.
ans <- glmnet(x=t(XTr), y=yTr, family="binomial")
ansCV <- cv.glmnet(x=t(XTr), y=yTr, family="binomial")
s <- ansCV$lambda.1se
yH <- as.vector(predict(ans, newx=t(XTe), s=s, type="class"))
table(yTe, yH)
## yH
## yTe 1 2
## 1 9 0
## 2 1 24
We see there is only one misclassification: a single tumor sample has been predicted as normal. The misclassification rate is 1/34 ≈ 2.94%.
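The rate can be checked directly from the confusion matrix; this base-R sketch simply re-enters the counts from the table above:

```r
# Recompute the test misclassification rate from the confusion matrix above
cm <- matrix(c(9, 1, 0, 24), nrow = 2,
             dimnames = list(truth = c("1", "2"), predicted = c("1", "2")))
err <- 1 - sum(diag(cm)) / sum(cm)  # off-diagonal fraction of the 34 test cases
round(100 * err, 2)                 # 2.94 (per cent)
```

Equivalently, with the vectors in hand, mean(yH != yTe) gives the same number.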
Many genes are just noise, so if we are able to remove this noise, we may obtain a better classifier. This is the general motivation for feature selection. So we select the top 5% of genes, \(0.05 \times 12600 = 630\), with the largest variance.
genesVar <- apply(XTr,1,var) #variances of genes
indBest <- genesVar>quantile(genesVar,0.95)
sum(indBest)
## [1] 630
Next, we transpose the filtered gene expression matrix so that samples are rows, giving a conventional data matrix. The response variable is defined as a factor rather than the numeric {1, 2} codes used with glmnet() above.
ZTr <- t(XTr[indBest,])
ZTe <- t(XTe[indBest,])
statusTr <- factor(c("healthy", "cancer")[yTr])
statusTe <- factor(c("healthy", "cancer")[yTe])
ZTrStatus <- cbind(as.data.frame.matrix(ZTr), status=statusTr)
ZTeStatus <- cbind(as.data.frame.matrix(ZTe), status=statusTe)
Fitting CART we obtain the confusion matrix below.
ans <- rpart::rpart(status~., data=ZTrStatus)
statusH <- predict(ans, newdata=ZTeStatus, type="class")
table(statusTe, statusH)
## statusH
## statusTe cancer healthy
## cancer 23 2
## healthy 0 9
Comparing with lasso logistic regression, we see that CART misclassified two tumor tissues as healthy, as opposed to only one misclassification with lasso logistic regression.
Fitting C5.0 we obtain the confusion matrix below.
ans <- C50::C5.0(status~., data=ZTrStatus)
statusH <- predict(ans, newdata=ZTeStatus)
table(statusTe, statusH)
## statusH
## statusTe cancer healthy
## cancer 24 1
## healthy 0 9
This agrees exactly with lasso logistic regression.
Random forest also has only one misclassification, but this time a healthy tissue has been incorrectly classified as cancer.
set.seed(8713321)
ans <- randomForest::randomForest(status~., data=ZTrStatus, ntree=2000)
statusH <- predict(ans, newdata=ZTeStatus)
table(statusTe, statusH)
## statusH
## statusTe cancer healthy
## cancer 25 0
## healthy 1 8
Notice that we fixed the random seed in order to obtain reproducible results. For reasons discussed in class, randomForest() can produce slightly different results each time it is run because of the resampling algorithms it uses. I tried to mitigate this effect by setting ntree=2000, an increase from the default ntree=500; this did reduce the impact of randomness, but it also increased the computing time required. The chosen seed produces the most frequent result. By choosing a different seed, I could have shown a perfect test prediction!
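The effect of fixing the seed can be demonstrated in isolation; this is a minimal sketch with simulated data (not the Singh data), assuming the randomForest package is installed:

```r
# Demonstrate that fixing the seed makes randomForest() reproducible
library(randomForest)
set.seed(1)                                  # seed for the simulated data
X <- matrix(rnorm(200 * 10), 200, 10)
y <- factor(ifelse(X[, 1] + rnorm(200) > 0, "a", "b"))
set.seed(8713321)
rf1 <- randomForest(X, y, ntree = 500)
set.seed(8713321)                            # same seed => same forest
rf2 <- randomForest(X, y, ntree = 500)
identical(predict(rf1), predict(rf2))        # TRUE
```

Omitting the second set.seed() call will typically change a few of the out-of-bag predictions between the two fits, which is exactly the run-to-run variation discussed above.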