Introduction

The function bestglm::pcreg() provides full regression capability with both PCR and PLS, using automatic model selection with the AIC or BIC criterion. Method functions print(), summary(), resid(), fitted(), plot(), and predict() are all provided.

The air pollution dataset was published by McDonald and Schwing (1973) in Technometrics to illustrate the application of ridge regression. Some researchers favour the use of PLS under near-multicollinearity or strong correlation among predictors, while others prefer penalized regression methods. The notebook concludes with a comparison of the expected prediction error (EPE) for various regression models.

Data Overview

There are \(n=60\) observations and \(p=15\) inputs.

A terse summary of the data is shown below. Note that PCR and PLS should not be used when factor or categorical variables are present among the inputs; from the summary we confirm that all variables are indeed quantitative.

str(mcdonald)
## 'data.frame':    60 obs. of  16 variables:
##  $ PREC : num  36 35 44 47 43 53 43 45 36 36 ...
##  $ JANT : num  27 23 29 45 35 45 30 30 24 27 ...
##  $ JULT : num  71 72 74 79 77 80 74 73 70 72 ...
##  $ OVR65: num  8.1 11.1 10.4 6.5 7.6 7.7 10.9 9.3 9 9.5 ...
##  $ POPN : num  3.34 3.14 3.21 3.41 3.44 3.45 3.23 3.29 3.31 3.36 ...
##  $ EDUC : num  11.4 11 9.8 11.1 9.6 10.2 12.1 10.6 10.5 10.7 ...
##  $ HOUS : num  81.5 78.8 81.6 77.5 84.6 66.8 83.9 86 83.2 79.3 ...
##  $ DENS : num  3243 4281 4260 3125 6441 ...
##  $ NONW : num  8.8 3.5 0.8 27.1 24.4 38.5 3.5 5.3 8.1 6.7 ...
##  $ WWDRK: num  42.6 50.7 39.4 50.2 43.7 43.1 49.2 40.4 42.5 41 ...
##  $ POOR : num  11.7 14.4 12.4 20.6 14.3 25.5 11.3 10.5 12.6 13.2 ...
##  $ HC   : num  21 8 6 18 43 30 21 6 18 12 ...
##  $ NOX  : num  15 10 6 8 38 32 32 4 12 7 ...
##  $ SOx  : num  59 39 33 24 206 72 62 4 37 20 ...
##  $ HUMID: num  59 57 54 56 55 54 56 56 61 59 ...
##  $ MORT : num  922 998 962 982 1071 ...
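The quantitative-variable requirement can also be checked programmatically. Below is a small sketch; allNumeric() is an illustrative helper, not part of bestglm.

```r
# PCR/PLS require all input variables to be quantitative, so verify
# that no factor or character columns are present in the data frame.
allNumeric <- function(df) all(vapply(df, is.numeric, logical(1)))

# For the mcdonald data frame, allNumeric(mcdonald) returns TRUE.
```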

Algorithm Overview

pcreg() uses lm() to fit \(p\) regressions, one on the first \(k\) principal components or latent vectors for each \(k=1,\ldots,p\). The principal components are obtained using prcomp(), while the latent vectors come from pls::plsr(). For each regression the AIC or BIC is evaluated and the best model is selected.
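The selection loop can be sketched as follows. This is an illustrative simplification for the method="PC" case, not the actual bestglm implementation; pcregSketch() is a hypothetical name.

```r
# Sketch of the pcreg() selection loop: fit p nested regressions on the
# first k principal components and keep the one minimizing the BIC.
# Assumes a numeric data frame whose last column is the response.
pcregSketch <- function(Xy, ic = "BIC") {
  p <- ncol(Xy) - 1
  X <- scale(Xy[, 1:p])        # default scale=TRUE
  y <- Xy[, p + 1]
  Z <- prcomp(X)$x             # principal component scores
  icFun <- if (ic == "BIC") BIC else AIC
  icVals <- sapply(1:p, function(k) icFun(lm(y ~ Z[, 1:k, drop = FALSE])))
  kBest <- which.min(icVals)   # number of components retained
  list(k = kBest, fit = lm(y ~ Z[, 1:kBest, drop = FALSE]))
}
```

The latent-vector case is analogous, with the plsr() scores in place of the prcomp() scores.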

The first component in the output from pcreg() is the best regression fit obtained with lm(). The method functions print.pcreg(), summary.pcreg(), residuals.pcreg(), fitted.pcreg(), and plot.pcreg() simply use the corresponding lm-object functions. predict.pcreg() uses pls::predict.mvr(), since this automatically handles scaling.

The sequential fitting approach contrasts with best subset regression, where any subset of the inputs may be used. Using only the first \(1 \le k \le p\) principal components or latent vectors is standard practice. One could of course use a subset approach, but this invites criticism of “data dredging” or “p-hacking”, so some caution is needed. Interestingly, in yesterday’s seminar “Neural cryptography” we heard that in the analysis of brain neuron activity using electroencephalography (EEG), the first few principal components contain features not of interest for detecting subtle activity such as speech or the movement of limbs.

PCR

For the purpose of demonstrating how to use pcreg(), the code is displayed and we divide the data into 57 observations for fitting and 3 observations for prediction. We start by fitting a PCR. The default settings for pcreg() include scale=TRUE, method="PC" and ic="BIC".

set.seed(7775511) #fix seed
XyList <- trainTestPartition(mcdonald, trainFrac=0.95)
XyTr <- XyList$XyTr
ansPCR <- pcreg(XyTr)
ansPCR
## 
## Call:
## lm(formula = MORT ~ ., data = Zy[, c(1:mIC, p + 1)])
## 
## Coefficients:
## (Intercept)          PC1          PC2          PC3          PC4  
##     940.281       15.794       -3.822      -22.583        5.036  
##         PC5          PC6          PC7          PC8  
##       7.546      -20.334      -13.446      -14.456

The output from pcreg() is a list with four components: lmfit, PLSFit, Z, and method. These are, respectively, the lm fit, the pls fit, the matrix of principal components or latent vectors, and the method used.

Using summary() we get more details, but note that the p-values and standard errors of the estimates are too small, since they do not account for the data-driven selection of the components. Similarly, the standard errors of predictions would also be underestimated. Notice that the first 8 principal components are used in the regression, so some dimensionality reduction has been achieved.

summary(ansPCR)
## 
## Call:
## lm(formula = MORT ~ ., data = Zy[, c(1:mIC, p + 1)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.381 -10.906   2.256  14.410  91.532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  940.281      4.686 200.639  < 2e-16 ***
## PC1           15.794      2.194   7.200 3.63e-09 ***
## PC2           -3.822      2.847  -1.343   0.1857    
## PC3          -22.583      3.267  -6.911 1.01e-08 ***
## PC4            5.036      4.055   1.242   0.2203    
## PC5            7.546      4.439   1.700   0.0956 .  
## PC6          -20.334      4.706  -4.321 7.78e-05 ***
## PC7          -13.446      6.143  -2.189   0.0335 *  
## PC8          -14.456      7.621  -1.897   0.0639 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.38 on 48 degrees of freedom
## Multiple R-squared:  0.7347, Adjusted R-squared:  0.6904 
## F-statistic: 16.61 on 8 and 48 DF,  p-value: 1.798e-11
plot(ansPCR)

Finally the predictions are:

XyTe <- XyList$XyTe
predict(ansPCR, newdata=XyTe)
## [1]  950.0822  916.6044 1060.6350

PLS

Fitting a PLS model with pcreg() is just as simple. The following code snippet fits PLS and displays the result.

(ansPLS <- pcreg(XyTr, method="LV"))
## 
## Call:
## lm(formula = MORT ~ ., data = Zy[, c(1:mIC, p + 1)])
## 
## Coefficients:
## (Intercept)          LV1          LV2          LV3  
##      940.28        25.17        16.79        13.23

We see that PLS has achieved a further dimensionality reduction, since only the first three latent vectors are used. The coefficients of determination for PCR and PLS are respectively 73.5% and 75.7%, so PLS has achieved a slightly better fit.

In practice, complete model diagnostic checking should be done. For example, perhaps some additional inputs, such as categorical variables, should be used, or a Box-Cox transformation may be applied to stabilize the variance.
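As an illustration of the Box-Cox step, the following sketch uses simulated data (not the mcdonald data) and MASS::boxcox(), which ships with R; the variable names are illustrative.

```r
library(MASS)  # for boxcox()

# Simulated positive response whose variance grows with its mean; the
# profile likelihood identifies a variance-stabilizing power transform.
set.seed(1)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.2))
bc <- boxcox(lm(y ~ x), plotit = FALSE)
lambdaHat <- bc$x[which.max(bc$y)]
lambdaHat  # a value near 0 suggests a log transformation of y
```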

The predictions for the new observations are computed using predict().

predict(ansPLS, newdata=XyTe)
## [1]  947.7966  917.8723 1036.2773

EPE Comparisons with Other Regression Methods

PLS is somewhat controversial among some statisticians because it does not correspond to an explicit statistical model. It is of interest to compare the PLS model with other regression methods using cross-validation to estimate the EPE. The most widely used EPE criterion is the MSE, so we use this cost function. To obtain accurate results, \(10^5\) cross-validation replications were performed, each using a random sample of \(2/3\) of the data for training and the remainder for validation. The computations required about 2.4 hours on an i7-class PC. As a check, the computations were performed a second time using a different random seed. The results were in close agreement, which confirms that the margins of error (MOE) given are reasonable and that convergence has been obtained. The R script for this cross-validation computation using the gencve package is included in the Rmd file accompanying this document but, to save time, is not evaluated inside the markdown file.
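A single replicate of this scheme can be sketched in a few lines. This is a toy version on simulated data with ordinary least squares; cvMSE() is an illustrative helper, not the gencve implementation, which automates and parallelizes the \(10^5\) replications.

```r
# One replicate of the cross-validation scheme: fit on a random 2/3 of
# the data, score the MSE on the held-out 1/3.
cvMSE <- function(Xy, trainFrac = 2/3) {
  n <- nrow(Xy)
  iTr <- sample(n, round(trainFrac * n))
  fit <- lm(y ~ ., data = Xy[iTr, ])
  mean((Xy$y[-iTr] - predict(fit, newdata = Xy[-iTr, ]))^2)
}

set.seed(1)
Xy <- data.frame(matrix(rnorm(300), 100, 3))
Xy$y <- Xy$X1 + rnorm(100)
mean(replicate(200, cvMSE(Xy)))  # Monte Carlo estimate of the EPE
```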

Tables 1 and 2 below show the estimated EPE and its margin of error (MOE) for each method in the two cross-validation runs.

Table 1. First Cross-Validation
##                      epe   MOE
## lm               3206.69 20.09
## step/BIC         2943.75 18.64
## ridge/CV-Breiman 2451.94  6.34
## lasso/CV-Breiman 2287.53  6.12
## MCP/BIC          2261.94 10.86
## PCR/BIC          2746.10 16.45
## PLS/BIC          2146.28  9.95


Table 2. Second Cross-Validation

##                      epe   MOE
## lm               3200.03 19.25
## step/BIC         2932.03 16.74
## ridge/CV-Breiman 2452.87  6.33
## lasso/CV-Breiman 2288.63  5.94
## MCP/BIC          2271.29 11.24
## PCR/BIC          2746.57 16.39
## PLS/BIC          2142.98  9.85

The dotchart below shows clearly that PLS with BIC works best, followed by MCP with BIC and the lasso with Breiman’s one-standard-deviation rule. Since the MOE is very small, the effect of sampling error can be neglected.

e <- as.vector(t(tb[, 1]))  # EPE estimates from Table 1
names(e) <- row.names(tb)
dotchart(e, pch=18, cex=1.5)


The complete R script for the cross-validation experiments reported in Tables 1 and 2 is given below.

#Code snippet to evaluate EPE for various regressions
#McDonald et al. Air Pollution Dataset
#
#Source: mcdonaldGcvPCRetc.R
#
setwd("d:/dropbox/r/2017/sdm/work/gencve/mcdonald")
library(gencve)
data(mcdonald, package="bestglm")
X <- mcdonald[,-ncol(mcdonald)]
y <- mcdonald[, ncol(mcdonald)]
tb <- matrix(numeric(0), nrow=7, ncol=3)
row.names(tb) <- c("lm", "step", "ridge", "lasso", "MCP", "PCR", "PLS")
colnames(tb) <- c("epe", "MOE", "pcorr")
MX <- 10^5 #about 2.4 hrs
date()
startTime <- proc.time()[3]
tb[1, ] <- gcv(X, y, MaxIter=MX, NCores=8, yhat=yhat_lm)
tb[2, ] <- gcv(X, y, MaxIter=MX,  NCores=8, yhat=yhat_step) #with BIC
tb[3, ] <- gcv(X, y, MaxIter=MX,  NCores=8, yhat=yhat_gel, alpha=0) #ridge
tb[4, ] <- gcv(X, y, MaxIter=MX,  NCores=8, yhat=yhat_gel, alpha=1) #lasso
tb[5, ] <- gcv(X, y, MaxIter=MX,  NCores=8, yhat=yhat_grpreg,penalty="MCP")#MCP/BIC
tb[6, ] <- gcv(X, y, MaxIter=MX,  NCores=8, yhat=yhat_pcreg, method="PC")
tb[7, ] <- gcv(X, y, MaxIter=MX,  NCores=8, yhat=yhat_pcreg, method="LV")
totTime <- proc.time()[3] - startTime
write.csv(tb, file="tb-ThurMidnight.csv")
save.image(file="mcdonaldGcvThurMidnight.Rdata")
dump("tb", file="mcdonaldGcvThurMidnightDump.R")
round(tb, 2)
totTime