Prostate Data

Recall that this dataset, which we refer to simply as prostate, has been partitioned into 67 observations for training and 30 observations for testing. The partitioning was done at random by Hastie, Tibshirani, and Friedman (HTF) for their book The Elements of Statistical Learning (ESL).

Regularization Path

Using glmnet() and its associated plot method, we obtain the regularization path, which is the modern version of the ridge trace plot.
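A minimal sketch of how such a path can be produced, assuming the training predictors are in a matrix x and the training response lpsa is in a vector y (both hypothetical names; alpha = 0 selects ridge regression in glmnet):

```r
library(glmnet)

# Assumes x holds the training predictors and y the training
# response (lpsa); alpha = 0 gives the ridge penalty.
fit <- glmnet(x, y, alpha = 0)

# Coefficient profiles plotted against log(lambda):
# the regularization path.
plot(fit, xvar = "lambda", label = TRUE)
```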


10-fold Cross-Validation for Ridge Regression

Using cv.glmnet(), the cross-validation curve for ridge regression (RR) is computed. This is done four times, and each time the plot is slightly different because the one-standard-error rule depends on the initial random seed used to generate the folds. This is illustrated in Figure 1 below.
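The seed dependence can be reproduced directly; a sketch, again assuming x and y hold the training data:

```r
library(glmnet)

# Repeating cv.glmnet() under different seeds changes the fold
# assignments, and hence the lambda selected by the one-SE rule.
for (seed in 1:4) {
  set.seed(seed)
  cv <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
  plot(cv)
  cat("seed", seed, "-> lambda.1se =", cv$lambda.1se, "\n")
}
```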

Figure 1. Four 10-fold CV Plots


In many cases this dependence may not be important, and essentially the same result is obtained each time. But Table 1 below shows this is not the case here. The last row of Table 1 compares the test RMSE corresponding to each choice of lambda in the plots in Figure 1. There is considerable variation in the RMSE among the selected RR models; two of the four are slightly worse than OLS.

The second-to-last row shows the norm of the coefficient estimates, and we verify empirically that the stepwise and RR estimators perform shrinkage relative to OLS.

Table 1. Estimates of the parameters, their norm, and the test RMSE
##             OLS StepAIC StepBIC   RR-1    RR-2   RR-3   RR-4
## lcavol   0.7164  0.7132  0.7799 0.1985  0.2756 0.2360 0.1985
## lweight  0.2926  0.2951  0.3519 0.3535  0.4605 0.4096 0.3535
## age     -0.1425 -0.1461  0.0000 0.0022 -0.0016 0.0006 0.0022
## lbph     0.2120  0.2114  0.0000 0.0713  0.0968 0.0843 0.0713
## svi      0.3096  0.3115  0.0000 0.3799  0.4827 0.4331 0.3799
## lcp     -0.2890 -0.2877  0.0000 0.0595  0.0415 0.0538 0.0595
## gleason -0.0209  0.0000  0.0000 0.0773  0.0717 0.0758 0.0773
## pgg45    0.2773  0.2621  0.0000 0.0033  0.0039 0.0036 0.0033
## NORM     0.9596  0.9541  0.8556 0.5686  0.7330 0.6533 0.5686
## RMSE     0.7411  0.7389  0.7405 0.7460  0.7156 0.7284 0.7460
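The last two rows of Table 1 can be obtained as follows; a sketch, assuming fit is a fitted model object, lam is the chosen lambda, and xtest/ytest hold the 30 test observations (all hypothetical names):

```r
# Norm of the coefficient vector, excluding the intercept,
# as reported in the NORM row of Table 1.
beta <- coef(fit, s = lam)[-1]
norm_beta <- sqrt(sum(beta^2))

# Test RMSE for the corresponding predictions,
# as reported in the RMSE row.
pred <- predict(fit, newx = xtest, s = lam)
rmse <- sqrt(mean((ytest - pred)^2))
```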


CV Averaging

A fundamental principle in forecasting and prediction is that it may be useful to combine predictions. With bootstrapping, averaging the bootstrap predictions has been found useful; this technique is known as bagging (bootstrap aggregating). Cross-validation (CV) is quite similar to bootstrapping in many respects, so this is the approach I recommend with k-fold CV and the one-standard-error rule; the technique may be called CV averaging. Averaging the predictions for the test data over NREP iterations, we find the RMSE = 0.7281 is improved, and it does not depend on the initial RNG seed.
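A sketch of CV averaging, assuming x/y hold the training data and xtest/ytest the test data (hypothetical names), with an illustrative value standing in for NREP:

```r
library(glmnet)

# CV averaging: repeat 10-fold CV several times, predict the test
# set at each run's lambda.1se, and average the predictions.
NREP <- 25  # illustrative value for the NREP placeholder
preds <- replicate(NREP, {
  cv <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
  as.numeric(predict(cv, newx = xtest, s = "lambda.1se"))
})
pred_avg <- rowMeans(preds)  # averaged prediction per test case
rmse <- sqrt(mean((ytest - pred_avg)^2))
```

Because the fold-to-fold randomness is averaged out, the result is far less sensitive to the initial seed than any single cv.glmnet() run.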

Comparison of Predicted and Observed

Since we know the test values, it is of interest to examine the plot of the forecasts against the actual values. The \(45^\circ\) line is shown on the plot: if the model holds, the predictions vs. observed values should be scattered around this line. In fact, it can be shown that the higher the coefficient of determination \(R^2\) is, the tighter the concentration of the data about this line will be; see "Visualizing R-Squared in Statistics", http://demonstrations.wolfram.com/VisualizingRSquaredInStatistics/. In this case \(R^2\) = 52.8%.
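A minimal sketch of such a plot, assuming pred holds the test-set predictions and ytest the observed lpsa values (both hypothetical names):

```r
# Predicted vs. observed for the test set, with the 45-degree line
# that the points should scatter around if the model holds.
plot(ytest, pred, xlab = "Observed lpsa", ylab = "Predicted lpsa")
abline(0, 1, lty = 2)  # 45-degree reference line

# Coefficient of determination between predictions and observations.
r2 <- cor(ytest, pred)^2
```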

Figure 2. Predicted and Observed Comparison



Bland-Altman/Tukey Mean-Difference Plot

The Bland-Altman plot, https://en.wikipedia.org/wiki/Bland%E2%80%93Altman_plot, is widely used in biostatistics for visualizing the differences between pairs of measurements. If the pairs agree, the usual x-y plot should follow the \(45^\circ\) line, but since it is easier to judge departures from a horizontal line, we plot \(((y+x)/2, y-x)\) rather than \((x,y)\). As an exercise, you might like to show that the map \((x,y) \rightarrow ((y+x)/2, y-x)\) amounts to a \(45^\circ\) rotation. Hint: https://en.wikipedia.org/wiki/Rotation_(mathematics)#Two_dimensions. Figure 3 shows the Bland-Altman/Tukey mean-difference plot comparing the predictions and observations. To enhance trend visualization, a loess smooth has been added to the plot. The two outliers are still very noticeable, but now we see that there is a systematic error in our forecasts due to increasing variance with the output variable lpsa.
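A sketch of the key step in the exercise: a clockwise \(45^\circ\) rotation gives

\[
\begin{pmatrix} \cos 45^\circ & \sin 45^\circ \\ -\sin 45^\circ & \cos 45^\circ \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \frac{1}{\sqrt{2}} \begin{pmatrix} x + y \\ y - x \end{pmatrix},
\]

so rescaling the first coordinate by \(1/\sqrt{2}\) and the second by \(\sqrt{2}\) yields \(((y+x)/2,\, y-x)\). The Bland-Altman coordinates are therefore a \(45^\circ\) rotation of \((x,y)\) followed by a rescaling of the axes.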

Figure 3. Bland-Altman/Tukey Mean-Difference Plot, Predicted vs. Observed