We consider the linear regression model \[y_i = \beta_0 + \sum_{j=1}^p\beta_j x_{i,j} + e_i,\] where \(i=1,\ldots,n\) and the errors \(e_i\) are assumed independent and normally distributed with mean zero and constant variance \(\sigma_e^2\). Let \(X = (x_{i,j})\) be the \(n \times p\) design matrix and let \(x_i \in \Re^p\) be its \(i\)-th row, \(i=1,\ldots, n\). The rows are generated as \(x_i \sim N(0, \Omega)\), where \(\Omega = (0.9^{|i-j|})\) is a \(p \times p\) covariance matrix. The columns of the resulting design matrix \(X\) are nearly multicollinear, as can be seen from the barplot of variance inflation factors (Wikipedia, https://en.wikipedia.org/wiki/Variance_inflation_factor) shown in Figure 1. Using a nearly-multicollinear design matrix makes this regression more challenging.
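The setup above can be sketched in a few lines of base R. This is a minimal illustration, not the code from the associated Rmd file; the seed is an arbitrary choice, and a Cholesky factor of \(\Omega\) stands in for a multivariate-normal package.

```r
## Sketch: generate a design matrix whose rows are N(0, Omega) with
## Omega = (0.9^|i-j|), then compute variance inflation factors (VIFs).
set.seed(123)                                     # arbitrary seed
n <- 100; p <- 9
Omega <- 0.9^abs(outer(1:p, 1:p, "-"))            # AR(1)-type covariance
X <- matrix(rnorm(n * p), n, p) %*% chol(Omega)   # rows ~ N(0, Omega)
## VIF for column j is 1/(1 - R^2) from regressing x_j on the others
vif <- sapply(1:p, function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
round(vif, 2)   # large values signal near-multicollinearity
```

A barplot of `vif`, as in Figure 1, makes the near-multicollinearity visually obvious.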
Next we simulated this model with the following parameter settings: \(n=100\), \(p=9\), \(\beta_0 = 1\), \(\beta_2 = \beta_5 = -0.3\), \(\beta_8 = \beta_9 = 0.3\), \(\beta_1 = \beta_3 = \beta_4 = \beta_6 = \beta_7 = 0\), and \(\sigma_e^2=1\). The model was then fit with several popular algorithms:
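A self-contained sketch of the data simulation, using the stated settings (again not the Rmd file's own code; the seed is arbitrary):

```r
## Sketch: simulate training data with n = 100, p = 9, beta_0 = 1,
## beta_2 = beta_5 = -0.3, beta_8 = beta_9 = 0.3, others 0, sigma_e = 1.
set.seed(123)
n <- 100; p <- 9
Omega <- 0.9^abs(outer(1:p, 1:p, "-"))
X <- matrix(rnorm(n * p), n, p) %*% chol(Omega)
beta <- c(0, -0.3, 0, 0, -0.3, 0, 0, 0.3, 0.3)    # beta_1 .. beta_9
y <- 1 + drop(X %*% beta) + rnorm(n)              # intercept 1, sigma_e = 1
fit <- lm(y ~ X)                                  # ordinary least squares
round(coef(fit), 3)
```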
In variable selection, the term oracle is used for an algorithm that asymptotically selects the correct variables. Of the methods considered here, best subset selection using the BIC is an oracle algorithm but the others are not. Two well-known penalized regression methods, MCP and SCAD, are also oracle algorithms.
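As one concrete selector, stepwise selection under the BIC can be run with base R's `step()` by setting the penalty to \(k = \log(n)\); best subset selection would instead use a package such as `leaps`. The block below is a sketch under the same simulated data assumptions as above, not the document's own code.

```r
## Sketch: backward stepwise selection with a BIC penalty (k = log n).
set.seed(123)
n <- 100; p <- 9
Omega <- 0.9^abs(outer(1:p, 1:p, "-"))
X <- matrix(rnorm(n * p), n, p) %*% chol(Omega)
colnames(X) <- paste0("x", 1:p)
beta <- c(0, -0.3, 0, 0, -0.3, 0, 0, 0.3, 0.3)
dat <- data.frame(y = 1 + drop(X %*% beta) + rnorm(n), X)
full <- lm(y ~ ., data = dat)                 # start from the full model
bic_fit <- step(full, k = log(n), trace = 0)  # BIC penalty instead of AIC's k = 2
names(coef(bic_fit))                          # variables retained
```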
The lattice barplots in Figure 2 compare the parameter estimates for the simulated training data. In examining Figure 2, keep in mind that the coefficients for variables \(x_2\) and \(x_5\) are \(-0.3\), while for \(x_8\) and \(x_9\) the coefficients are \(0.3\). Note the following:
In practice our interest is often in how well the model performs on new data. The new data is referred to as test data, while the data used to fit the model is the training data.
From multiple linear regression theory, the prediction variance given a single new test input vector \(x^{(0)} = (1, x^{(0)}_1, \ldots, x^{(0)}_9)^\prime\) is \[{\rm var}(\hat y) = \sigma_e^2 \left(1 + {x^{(0)}}^\prime ({\cal X}^\prime {\cal X})^{-1} x^{(0)} \right), \qquad (1)\] where \(\cal X\) is the design matrix specified earlier with the column corresponding to the intercept term included. In eqn. (1) we hold the design matrix \(X\) fixed and choose random test inputs from the normal distribution \({\rm N}(0, \Omega)\), \(\Omega=(0.9^{|i-j|})_{9 \times 9}\). The expected prediction error conditional on the original observed input matrix \(X\) may be written \[{\rm EPE}_X = {\cal E}_X \{{\rm var}(\hat y) \},\] where \({\cal E}_X\) denotes mathematical expectation for fixed \(X\). In other words, \({\rm EPE}_X\) is the expected prediction error conditional on the training sample we have been given. This conditional EPE tells us how well our regression will work on future data generated in the same way as the original training sample. From Table 1 we see that stepwise selection with the BIC works best, and that backward stepwise and best subset under the BIC produced the same result. It seems strange but true that best subset with the AIC is slightly worse than backward stepwise with the AIC; given the small standard deviations reported in Table 1, the difference is statistically significant at the 5% level. OLS is the worst. The Oracle suggests that there is room for improvement!
| Method | Root-Mean-Square Error | St. Dev. of RMSE |
|---|---|---|
| Oracle | 1.03601 | 0.00015 |
| OLS | 1.07890 | 0.00015 |
| Step/AIC | 1.06845 | 0.00015 |
| Step/BIC | 1.05751 | 0.00015 |
| Best/AIC | 1.07167 | 0.00015 |
| Best/BIC | 1.05751 | 0.00015 |
As expected, the standard deviations of the estimated RMSEs are very small since the test sample is very large.
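The conditional \({\rm EPE}_X\) defined above can be estimated by Monte Carlo: hold \(X\) fixed, draw many test inputs \(x^{(0)} \sim N(0, \Omega)\), and average the prediction variance of eqn. (1). The following is a sketch for the full OLS model under the same simulated-data assumptions as earlier, not the code behind Table 1.

```r
## Sketch: Monte Carlo estimate of the conditional EPE of eqn. (1),
## averaging var(yhat) over random test inputs with X held fixed.
set.seed(123)
n <- 100; p <- 9; sigma2 <- 1
Omega <- 0.9^abs(outer(1:p, 1:p, "-"))
R <- chol(Omega)
X <- matrix(rnorm(n * p), n, p) %*% R
Xc <- cbind(1, X)                        # design matrix with intercept column
XtXinv <- solve(crossprod(Xc))           # (X'X)^{-1}
nrep <- 1e4
pv <- replicate(nrep, {
  x0 <- c(1, drop(rnorm(p) %*% R))       # random test input, intercept first
  sigma2 * (1 + drop(t(x0) %*% XtXinv %*% x0))   # eqn. (1)
})
sqrt(mean(pv))                           # root conditional EPE, full OLS model
```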
In general, it can be shown that for a linear regression with \(p < n\) inputs plus an intercept term, the unconditional prediction variance for a new data value is \[{\rm EPE} = \sigma_e^2 \left(1+\frac{p+1}{n}\right). \qquad (2)\] For the oracle model in the current example, \(p = 4\), \(\sigma_e^2 = 1\), and \(n = 100\). Plugging these values in and taking the square root, we obtain a root-mean-square EPE of 1.0247, which agrees well with the result for the Oracle estimator shown below in Table 2. The unconditional EPE given in eqn. (2) is estimated by simulation using the following steps:
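The numeric check of eqn. (2) quoted above is just one line of arithmetic:

```r
## Quick check of eqn. (2): root EPE = sqrt(sigma_e^2 * (1 + (p+1)/n))
## for the oracle model with p = 4, sigma_e^2 = 1, n = 100.
p <- 4; n <- 100; sigma2 <- 1
rmse_epe <- sqrt(sigma2 * (1 + (p + 1) / n))
round(rmse_epe, 4)   # 1.0247
```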
Using the R parallel package on an 8-core Intel i7 PC, the computations take about 5.6 minutes, whereas without parallelization they take 23.4 minutes, a speedup factor of about 4.2.
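The parallelization pattern with the parallel package looks like the sketch below. The worker function `mc_one()` is a hypothetical stand-in for one simulation replication, not the document's actual worker; the cluster size of 2 is an arbitrary choice for portability.

```r
## Sketch: distribute simulation replications over a PSOCK cluster.
library(parallel)
mc_one <- function(i) {                 # hypothetical placeholder for one
  set.seed(i)                           # simulation replication
  mean(rnorm(100))
}
cl <- makeCluster(2)                    # 2 workers; use more cores if available
res <- parLapply(cl, 1:1000, mc_one)    # 1000 replications split across workers
stopCluster(cl)                         # always release the cluster
length(res)
```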
| Method | Root-Mean-Square Error | Standard Deviation |
|---|---|---|
| Oracle | 1.02587 | 0.00018 |
| OLS | 1.05404 | 0.00026 |
| Step.AIC | 1.05042 | 0.00025 |
| Step.BIC | 1.05051 | 0.00026 |
| Best.AIC | 1.04942 | 0.00025 |
| Best.BIC | 1.04935 | 0.00025 |
Eqn. (2) is important because it illustrates both the bias-variance tradeoff and the principle of parsimony. When the model is underfit, eqn. (2) does not apply, but we would expect the residual variance to be large, so the prediction error will also be large. When the choice is just right, as in the case of oracle selection, eqn. (2) gives the EPE for the best model. In the current example the oracle model has \(p=4\), while an overfit model has \(p>4\) and hence, by eqn. (2), an inflated EPE.
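The parsimony message of eqn. (2) can be made concrete by comparing the root EPE of the oracle model (\(p=4\)) with that of the full OLS model (\(p=9\)):

```r
## Root EPE from eqn. (2) for two model sizes: every extra retained
## input inflates the prediction error by (p+1)/n.
epe <- function(p, n = 100, sigma2 = 1) sqrt(sigma2 * (1 + (p + 1) / n))
round(c(oracle = epe(4), full_ols = epe(9)), 4)
# oracle 1.0247, full_ols 1.0488 -- the overfit model pays a variance penalty
```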
The R code chunk for the EPE simulations is included in the associated Rmd file but is not evaluated due to its length. Instead, this document uses results that were generated previously.