Smoker Microarray Experiment and Dataset

Spira et al. (2004) report a microarray analysis of 9968 genes measured in airway epithelial cells from a group of 75 patients, each belonging to one of three designated groups C, F and N defined by cigarette smoking status: current smoker, former smoker and non-smoker. See the paper for precise definitions. The number of individuals in each group is shown in the table below.

## y
##  C  F  N 
## 34 18 23

The purpose of the experiment was to identify genes whose expression changes with smoking status. Patient covariates, including age, gender, race, year smoking started and year smoking ended, are also available.

The raw CEL data were obtained from the microarray analysis and converted to a gene expression matrix using software available from Bioconductor, https://www.bioconductor.org/. Following common practice, the data were standardized by patient, a practice sometimes called “row-standardization”. Note that this is unlike the “column-standardization” frequently used with penalized regression and principal component analysis.
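Row-standardization can be done directly in base R. A minimal sketch on a small synthetic expression matrix (the real matrix is 75 patients by 9968 genes; the dimensions here are illustrative only):

```r
# Synthetic expression matrix: 5 patients (rows) x 10 genes (columns).
set.seed(1)
X <- matrix(rnorm(50, mean = 7, sd = 2), nrow = 5, ncol = 10)

# Row-standardization: center and scale each patient's expression profile.
# scale() works column-wise, so transpose, scale, and transpose back.
Xr <- t(scale(t(X)))

# Each patient (row) now has mean 0 and standard deviation 1.
rowMeans(Xr)        # all approximately 0
apply(Xr, 1, sd)    # all 1
```

Column-standardization would instead be `scale(X)`, standardizing each gene across patients.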

Using a simple statistical analysis, Spira (2004) found clear differences in gene expression between groups C and N. Group F appeared to lie roughly midway between C and N. The results were summarized by Spira (2004) in the heatmap shown in Figure 1.

Figure 1. Heatmap from Spira (2004)

In our analysis we explore how well various classification algorithms perform. At this point the analysis is confined to the training data, and we would expect algorithms using strong feature selection to overfit. Further analysis using cross-validation is needed to estimate the expected performance on test data.

Reference

Spira, A., et al. (2004). Effects of cigarette smoke on the human airway epithelial cell transcriptome. PNAS 101(27): 10143–10148. http://www.pnas.org/cgi/citemap?id=pnas;101/27/10143

Comparison of Classification Algorithms

Since \(n=75\), the maximum 95% MOE is about 11.32%.
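This figure follows from the usual normal-approximation margin of error for a proportion, which is largest at \(p = 1/2\). A quick check in R:

```r
n <- 75
# Maximum 95% margin of error for a proportion: worst case at p = 1/2.
moe <- qnorm(0.975) * sqrt(0.5 * 0.5 / n)
round(100 * moe, 2)   # 11.32
```

So any two misclassification rates within about 11 percentage points of each other are not clearly distinguishable on a sample of this size.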

To run the Rmd file that produced this report, you need the following CRAN packages installed: glmnet, HiDimDA, sfsmisc and gencve. You will also need to download the compressed data file SmokerSpira.rda and adjust the initialization code to load it.

Lasso Multinomial Logistic Regression

Figure 2. Lasso Tuning Parameter Estimation using Cross-Validation

##    yH
## y    C  N
##   C 33  1
##   F  9  9
##   N  1 22

The confusion table supports the findings of Spira (2004) that the groups C and N have distinct expressions.

The misclassification rate on the training data was 26.67%.

It is interesting that the fitted multinomial logistic model was very simple: there was one intercept parameter for each of the three classes, and three slope parameters each for classes C and N. The slope parameters for F were all zero. The slope parameters for C and the corresponding genes are:

##      g2036      g3436      g9415 
## 1.03775110 0.60068669 0.06638312

and for N,

##      g44    g3566    g9585 
## 0.335119 2.925551 1.384871

So multinomial lasso logistic regression pinpoints which genes contribute most to the differences in gene expression between C and N. Simple feature selection methods do not do so well.
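The exact code used in the report is not shown here, but the fit can be sketched with glmnet's cross-validated multinomial lasso. A self-contained sketch on synthetic stand-in data (in the report, X and y come from SmokerSpira.rda; the dimensions and gene names below are illustrative):

```r
library(glmnet)

# Synthetic stand-in for the 75 x 9968 expression matrix and 3-class labels.
set.seed(2)
X <- matrix(rnorm(75 * 200), nrow = 75)
colnames(X) <- paste0("g", seq_len(ncol(X)))
y <- factor(rep(c("C", "F", "N"), times = c(34, 18, 23)))

# Lasso multinomial logistic regression, lambda chosen by cross-validation.
cvfit <- cv.glmnet(X, y, family = "multinomial", alpha = 1)

# Nonzero slope coefficients per class at the selected lambda;
# with the real data, class F had all slopes equal to zero.
cf <- coef(cvfit, s = "lambda.min")
sapply(cf, function(b) sum(b[-1, ] != 0))

# Training-set confusion table and misclassification rate.
yH <- predict(cvfit, newx = X, s = "lambda.min", type = "class")
table(y = y, yH = yH)
mean(yH != y)
```

On the synthetic noise data the selected model is typically near-empty; on the real data it retained the six genes listed above.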

Diagonal LDA using HiDimDA

The CRAN package HiDimDA provides a recent implementation of diagonal discriminant analysis in the function Dlda().

##    yH
## y    C  F  N
##   C 26  7  1
##   F  4  9  5
##   N  0  2 21

Some feature selection was used in the algorithm: of the 9968 genes in the data matrix, only 285 were used in the linear discriminant.

The misclassification rate on the training data was 25.33%.
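A sketch of the Dlda() call on synthetic stand-in data, assuming Dlda()'s default built-in variable selection and that its predict() method returns a $class component:

```r
library(HiDimDA)

# Synthetic stand-in for the expression matrix and 3-class labels.
set.seed(3)
X <- matrix(rnorm(75 * 200), nrow = 75)
y <- factor(rep(c("C", "F", "N"), times = c(34, 18, 23)))

# Diagonal LDA; by default Dlda() performs its own variable selection,
# which is how only 285 of the 9968 genes entered the discriminant
# in the analysis reported above.
fit <- Dlda(X, y)

# Training-set confusion table and misclassification rate.
yH <- predict(fit, X)$class
table(y = y, yH = yH)
mean(yH != y)
```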

Diagonal LDA using sfsmisc

##    yH
## y    C  F  N
##   C 21  4  9
##   F  4  7  7
##   N  7  3 13

The misclassification rate on the training data was 45.33%. Random guessing among the three classes would correspond to a misclassification rate of about 66.7% (accuracy 33.3%), so we are doing better than that!

No feature selection was used in this algorithm. In practice with microarray data, feature selection is often used, since it is thought better to remove genes that are just noise. This is clearly important with methods that do not use regularization to control the variability of the estimates.
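A sketch of the sfsmisc version on synthetic stand-in data, assuming the diagDA(ls, cll, ts) interface (learning set, integer class labels, test set); predicting on the training set itself gives the resubstitution error:

```r
library(sfsmisc)

# Synthetic stand-in for the expression matrix and 3-class labels.
set.seed(4)
X <- matrix(rnorm(75 * 200), nrow = 75)
y <- factor(rep(c("C", "F", "N"), times = c(34, 18, 23)))

# diagDA() uses all genes: no feature selection, no regularization.
# Returned values are integer class indices into levels(y).
pred <- diagDA(X, as.integer(y), X)
yH <- levels(y)[pred]

table(y = y, yH = yH)
mean(yH != y)
```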

LDA with Strong Feature Selection

My package gencve provides featureSelect(), which ranks genes by their one-way ANOVA F-ratio. This type of feature selection is still often used, but it may be expected to overfit the data so that the classifier does not generalize well; that is, it does not perform as well as expected on new test data.

We use featureSelect() to find the 20 “best” genes and then use LDA.

##    yH
## y    C  F  N
##   C 32  2  0
##   F  4 11  3
##   N  1  2 20

The misclassification rate on the training data was 16%.
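The F-ratio ranking that featureSelect() performs can be sketched directly, here on synthetic stand-in data, followed by ordinary LDA from MASS on the 20 top-ranked genes (gene names and dimensions are illustrative):

```r
library(MASS)

# Synthetic stand-in for the expression matrix and 3-class labels.
set.seed(5)
X <- matrix(rnorm(75 * 200), nrow = 75)
colnames(X) <- paste0("g", seq_len(ncol(X)))
y <- factor(rep(c("C", "F", "N"), times = c(34, 18, 23)))

# Strong selection: one-way ANOVA F-ratio of each gene against the labels.
# Because this uses y, it is prone to selection-driven overfitting.
Fstat <- apply(X, 2, function(g) summary(aov(g ~ y))[[1]]$`F value`[1])
top20 <- order(Fstat, decreasing = TRUE)[1:20]

# Ordinary LDA on the 20 top-ranked genes.
fit <- lda(X[, top20], grouping = y)
yH <- predict(fit, X[, top20])$class

table(y = y, yH = yH)
mean(yH != y)
```

Even on pure noise this procedure yields a deceptively low training error, which illustrates why cross-validation must include the selection step.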

LDA with Weak Feature Selection

Under weak feature selection, the genes with maximum variance are used.

##    yH
## y    C  F  N
##   C 30  2  2
##   F  3 11  4
##   N  1  2 20

The misclassification rate on the training data was 18.67%.
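The weak-selection variant can be sketched the same way, again on synthetic stand-in data. Ranking by variance does not look at the labels y, so it avoids the selection-driven overfitting of the F-ratio approach:

```r
library(MASS)

# Synthetic stand-in for the expression matrix and 3-class labels.
set.seed(6)
X <- matrix(rnorm(75 * 200), nrow = 75)
colnames(X) <- paste0("g", seq_len(ncol(X)))
y <- factor(rep(c("C", "F", "N"), times = c(34, 18, 23)))

# Weak (label-free) selection: rank genes by overall variance and
# keep the 20 most variable.
v <- apply(X, 2, var)
top20 <- order(v, decreasing = TRUE)[1:20]

# Ordinary LDA on the 20 most variable genes.
fit <- lda(X[, top20], grouping = y)
yH <- predict(fit, X[, top20])$class

table(y = y, yH = yH)
mean(yH != y)
```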