The full dataset comprises 97 observations on 8 input variables and the output variable ‘lpsa’. The data were obtained from 97 men with prostate cancer, and the dataset is used in the ESL textbook to illustrate many linear regression methods. The plot in Figure 1 shows the VIF for each of the 8 inputs. Values over 5 or 10 are indicative of strong near multicollinearity, so we see the inputs are fine in this respect.
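As a sketch of the VIF computation in base R (using simulated stand-in data rather than the actual prostate frame; the variable names below follow the ESL dataset), each VIF is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing input \(j\) on the remaining inputs:

```r
# Simulated stand-in for the 97 x 8 input matrix (real analysis would use
# the prostate data); independent inputs, so VIFs should all be near 1.
set.seed(1)
n <- 97
inputs <- c("lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")
X <- as.data.frame(matrix(rnorm(n * 8), n, 8, dimnames = list(NULL, inputs)))

# VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing input j on the other inputs
vifs <- sapply(inputs, function(v) {
  r2 <- summary(lm(reformulate(setdiff(inputs, v), v), data = X))$r.squared
  1 / (1 - r2)
})
round(vifs, 2)
```

The same recipe (or `car::vif()` on an `lm` fit) applied to the real inputs produces the values plotted in Figure 1.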

First we re-scale the data so each variable has mean 0 and standard deviation 1. This step is not necessary for simple OLS and stepwise regression methods, but it is required for many other methods, including penalized regression and principal component regression.
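In base R this standardization is one call to scale(); a small sketch on simulated stand-in data:

```r
# Simulated stand-in for the data frame; scale() subtracts each column's
# mean and divides by its standard deviation.
set.seed(1)
X  <- as.data.frame(matrix(rnorm(97 * 8), 97, 8))
Xs <- as.data.frame(scale(X))

colMeans(Xs)         # each column mean is (numerically) 0
apply(Xs, 2, sd)     # each column sd is 1
```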

In our analysis of these data we split the dataset into a training sample and a test sample by randomly allocating about \(\frac{2}{3}\) of the data to the training sample and the rest to the test sample.
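One way to make such a split in base R (the seed below is arbitrary, chosen only for reproducibility):

```r
# Randomly allocate about 2/3 of the 97 rows to training, the rest to test.
set.seed(2024)                                   # arbitrary seed
n <- 97
train_idx <- sort(sample(n, size = round(2 * n / 3)))  # 65 training rows
test_idx  <- setdiff(seq_len(n), train_idx)            # 32 test rows
c(train = length(train_idx), test = length(test_idx))
```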

In practical applications of regression it is important to conduct an exploratory data analysis (EDA), also sometimes called an initial data analysis. One traditional plot suggested for this purpose is the scatterplot matrix, implemented in traditional R graphics by the function pairs(). Figure 2 shows the scatterplot matrix obtained using splom() from the lattice package.
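A minimal sketch of the splom() call, again on simulated stand-in data (the real plot in Figure 2 uses the prostate variables):

```r
library(lattice)                 # recommended package shipped with R

# Stand-in data frame; splom() takes a formula ~ data.frame
set.seed(1)
X <- as.data.frame(matrix(rnorm(97 * 4), 97, 4))
p <- splom(~X, pch = ".", main = "Scatterplot matrix")
print(p)                         # lattice objects must be print()ed to draw
```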

For a large number of inputs the resolution of the scatterplot matrix can be problematic. To ameliorate this concern, I have used the R function densCols() to produce a color-smoothed variant of the scatterplot, in which the color of each point indicates the density of points in that region. The density is estimated using a two-dimensional kernel density estimator.
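For a single panel, the idea looks like this: densCols() (in base R's grDevices) assigns one color per point from a 2-D kernel density estimate, so denser regions plot in darker blue by default.

```r
# Density-colored scatterplot: color per point from a 2-D kernel density
# estimate (densCols() uses KernSmooth internally).
set.seed(1)
x <- rnorm(2000)
y <- x + rnorm(2000)
cols <- densCols(x, y)                    # darker where points are denser
plot(x, y, col = cols, pch = 16, cex = 0.5)
```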

Another useful plot that I recommend is a simple dependency plot of the output variable vs. each input with a loess curve included in the plot.
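Such a dependency plot takes only a few lines in base R; here is a sketch with simulated data standing in for ‘lpsa’ against one input:

```r
# Response vs. one input, with a loess smooth overlaid.
set.seed(1)
x <- runif(97, 0, 3)
y <- sin(x) + rnorm(97, sd = 0.2)        # stand-in for lpsa vs. an input

fit <- loess(y ~ x)                      # local polynomial smoother
ord <- order(x)                          # sort so the curve draws left-to-right
plot(x, y, pch = 16, col = "grey40", xlab = "input", ylab = "lpsa")
lines(x[ord], fitted(fit)[ord], col = "red", lwd = 2)   # the loess curve
```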

Next we use 1-NN to predict the test values. Note that 1-NN does a perfect job on the training data! But, as we will see, it does not perform nearly as well on the test data in this case.
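A minimal 1-NN regression can be written in base R without any packages (a sketch; `nn1_predict` is a hypothetical helper, not a function from the analysis):

```r
# 1-NN regression: predict each test point by the response of its
# nearest training point (squared Euclidean distance).
nn1_predict <- function(Xtr, ytr, Xte) {
  apply(Xte, 1, function(z) {
    d <- colSums((t(Xtr) - z)^2)   # distance from z to every training row
    ytr[which.min(d)]
  })
}

# On the training set each point is its own nearest neighbor, so the
# training error is exactly 0 -- the "perfect job" noted above.
Xtr <- matrix(c(0, 1, 2, 3), ncol = 1)
ytr <- c(10, 20, 30, 40)
nn1_predict(Xtr, ytr, Xtr)         # reproduces ytr exactly
```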

Using the training data we fit OLS and backward stepwise regression, the latter with both the AIC and BIC criteria. The RMSEs of the test-set predictions are summarized in Table 1.
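In base R both stepwise fits come from step(); BIC is obtained by setting the penalty to \(k = \log n\). A sketch on simulated stand-in data (the real fits use the training sample):

```r
# Simulated stand-in for the training sample; only x1 is truly active.
set.seed(1)
n <- 65
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
train$lpsa <- 1 + 2 * train$x1 + rnorm(n)

full     <- lm(lpsa ~ ., data = train)
step_aic <- step(full, direction = "backward", trace = 0)             # AIC: k = 2
step_bic <- step(full, direction = "backward", trace = 0, k = log(n)) # BIC: k = log(n)

# RMSE helper used for the comparisons in Table 1
rmse <- function(fit, newdata) sqrt(mean((newdata$lpsa - predict(fit, newdata))^2))
```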

Table 1. RMSE on Test Data

Method      RMSE      SD(RMSE)
1-NN        1.0190    0.0450
OLS         0.2061    0.0091
Step/AIC    0.2705    0.0120
Step/BIC    0.2705    0.0120