prostateHowToPCA

The purpose of this R markdown notebook is to illustrate basic PCA functions. In the initialization we have input the prostate dataset into the dataframe X. This dataframe has 97 rows and 9 columns. In regression, we took the last column as an output variable. In our PCA illustrative analysis all variables are used.

If we were just interested in principal component regression, we might just to the PCA on the first eight columns. This is left as an optional exercise.

There are two built-in functions in R for PCA. The older one is princomp() that uses the eigendecomposition.

The preferred function for principal component analysis is prcomp() that uses the SVD method. This is preferred because the SVD algorithm is numerically more accurate that the eigendecomposition of the sample covariance matrix. Note that the SVD works directly with data matrix and so avoids computation of the covariance matrix. The prcomp() also can be used for wide data as in the Oil Spill dataset to be discussed in another notebook.

PCA analysis using prcomp() for the dataframe X starts with invoking the function and print synopsis.

ans <- prcomp(X, scale=TRUE)
ans

## Standard deviations:
## [1] 1.9780632 1.2867393 1.0537126 0.8020255 0.7133805 0.6728987 0.5768927
## [8] 0.4769276 0.3950234
## 
## Rotation:
##                PC1         PC2         PC3         PC4         PC5
## lcavol  0.41133937 -0.05877888  0.26470176 -0.10224000  0.45767865
## lweight 0.20044626  0.54146807  0.30985976  0.04962151  0.08800128
## age     0.19224705  0.44086804 -0.35211211 -0.76920885  0.01773391
## lbph    0.09016529  0.62083485 -0.11135494  0.48088358 -0.36432487
## svi     0.36661275 -0.21595643  0.26240670 -0.23978976 -0.63821599
## lcp     0.42112120 -0.21115993  0.05562278  0.04070973 -0.27236511
## gleason 0.35063078 -0.10927498 -0.53612912  0.23015044  0.27985307
## pgg45   0.38690585 -0.12319335 -0.44845150  0.17187288 -0.10132522
## lpsa    0.40143051  0.07220476  0.37019279  0.14995826  0.28222682
##                 PC6         PC7         PC8         PC9
## lcavol  -0.37989220  0.27848945 -0.10159402  0.55241232
## lweight  0.71521700  0.16790137 -0.11846166  0.08157605
## age     -0.14663461 -0.02905132  0.07590459 -0.15681060
## lbph    -0.46315681  0.01937517 -0.06865977  0.10222019
## svi      0.05223337 -0.36962901 -0.30600496  0.22866766
## lcp     -0.04546217  0.67334647  0.04926192 -0.49116746
## gleason  0.12060030 -0.16712858 -0.61724800 -0.15140723
## pgg45    0.25854291 -0.06426341  0.62997037  0.35549196
## lpsa    -0.14814394 -0.52052912  0.30206999 -0.45860047

The print method is print.prcomp() and as mentioned it does not show the full result. The full result is a list with named components.

names(ans)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

The principal components are returned as the named component ‘x’ in the list output. The principal components may also be computed by centering/scaling the input matrix and multiplying by the rotation matrix.

PC <- scale(X, center=ans$center, scale=ans$scale) %*% ans$rotation
head(PC)

##         PC1        PC2         PC3        PC4         PC5        PC6
## 1 -3.792877 -2.1137093 -0.45696881  0.3315788 -0.93212118 -0.0290659
## 2 -3.380233 -0.9071801 -0.44444449 -0.3604348 -0.89539623  0.8310058
## 3 -2.332467 -1.0167302 -2.60750694 -1.6876420 -0.48295634 -0.3386575
## 4 -3.470661 -0.9432794 -0.51818796 -0.3465093 -0.98440083  0.8371045
## 5 -2.429223 -0.5814667  0.01125666 -0.8426905 -0.05425802  0.3092477
## 6 -3.325969 -1.4348498  0.15336168  0.5809606 -0.72779460  0.7357973
##           PC7        PC8         PC9
## 1  0.39507964 -0.2121082  0.42978612
## 2  0.36058874 -0.1767518  0.06522886
## 3 -0.11089959 -0.2895734 -0.12242235
## 4  0.29659825 -0.1484886 -0.04007986
## 5  0.56080298 -0.1778574  0.60844570
## 6 -0.07537406  0.0144236 -0.17828584

Verifying this is the same as the named component,

head(ans$x)

##         PC1        PC2         PC3        PC4         PC5        PC6
## 1 -3.792877 -2.1137093 -0.45696881  0.3315788 -0.93212118 -0.0290659
## 2 -3.380233 -0.9071801 -0.44444449 -0.3604348 -0.89539623  0.8310058
## 3 -2.332467 -1.0167302 -2.60750694 -1.6876420 -0.48295634 -0.3386575
## 4 -3.470661 -0.9432794 -0.51818796 -0.3465093 -0.98440083  0.8371045
## 5 -2.429223 -0.5814667  0.01125666 -0.8426905 -0.05425802  0.3092477
## 6 -3.325969 -1.4348498  0.15336168  0.5809606 -0.72779460  0.7357973
##           PC7        PC8         PC9
## 1  0.39507964 -0.2121082  0.42978612
## 2  0.36058874 -0.1767518  0.06522886
## 3 -0.11089959 -0.2895734 -0.12242235
## 4  0.29659825 -0.1484886 -0.04007986
## 5  0.56080298 -0.1778574  0.60844570
## 6 -0.07537406  0.0144236 -0.17828584

It is often useful to look at plots of the principal components including boxplots, density plots and histograms, scatterplots. Next we show a notched boxplot of the principal components.

boxplot(ans$x, notch=TRUE, main="Boxplot of the principal components")

As expected the PC’s are centered about zero and the variance of each principal component is decreasing.

The plot method shows the scree plot.

plot(ans, main="Screeplot, Prostate Data with 9 Variables")

A method is also implemented for biplot.

biplot(ans, cex=c(0.5, 1), col=c("blue", "red"),
       main="Biplot, Prostate Data with 9 Variables")

The left and bottom axes refer to the scaled observations while the right and top axes refer to the scaled variables.

prostateHowToPCA

Ian McLeod

February 22, 2017