Simulated Regression, Variable Selection and EPE

Purpose

We consider the linear regression model \[y_i = \beta_0 + \sum_{j=1}^p\beta_j x_{i,j} + e_i,\] where \(i=1,\ldots,n\) and \(e_i\) are assumed normally distributed with mean zero, constant variance equal to \(\sigma_e^2\) and are statistically independent. Let \(X = (x_{i,j})\) be the \(n \times p\) design matrix and let \(x_i \in \Re^p\) be the \(i\)-th row, \(i=1,\ldots, n\). Then \(x_i \sim N(0, \Omega)\), where \(\Omega = (0.9^{|i-j|})\) is a \(p \times p\) covariance matrix. The columns of the resulting design matrix \(X\) are nearly multicollinear as can be seen from the barplot variance inflation factor (Wikipedia, https://en.wikipedia.org/wiki/Variance_inflation_factor) shown in Figure 1. By using a nearly-multicollinear design matrix, I have made this regression more challenging.

Next we simulated this model with the following parameter settings \(n=100\), \(p=9\), \(\beta_0 = 1\), \(\beta_2 = \beta_5 = -0.3\), \(\beta_8 = \beta_9 = 0.3\), \(\beta_1 = \beta_3 = \beta_4 = \beta_6 = \beta_7 = 0\) and \(\sigma_e^2=1\). Then the model was fit with several popular algorithms:

Oracle. This is not really an algorithm since we just fit the linear regression as if we knew which are the correct variables to use. So in this case we include variables 2, 5, 8 and 9 only as well as the intercept.
OLS (Ordinary least squares). Multiple linear with all 9 inputs included.
Step/AIC which is backward stagewise regression using R step() function.
Step/BIC which is backward stagewise regression using R step() function with the argument k = log(100) so BIC selection is used.
Best/AIC best subset regression using the bestglm() function in the bestglm package with argument setting IC=“AIC”.
Best/BIC best subset regression using the bestglm() function in the bestglm.

In variable selection algorithm the term oracle is used for an algorithm that asymptotically will select the correct variables. Best subset selection using the BIC is an oracle algorithm but the others considered here are not. Two well known penalized regression methods MCP and SCAD are both oracle algorithms.

The lattice barplots in Figure 2 compare the parameter estimates for the simulated training data. In examing Figure 2, keep in mind the coefficients for variables \(x_2\) and \(x_5\) are \(-0.3\) while for \(x_8\) and \(x_9\) the coefficients are \(0.3\). Note that following:

Oracle estimator shows the correct estimates but the estimate for \(\beta_8\) is too large while for \(\beta_9\) it is too small. Why?
OLS estimates seem inflated. This is due to the multicollinearity.
As expected StepBIC is more parsimonious than StepAIC.
StepBIC and BestBIC select the same model.
StepAIC and BestAIC select different models.

In practice our interest is often on how well the performs on new data. The new data is referred to as test data while the data used to fit the model is the training data.

From multiple linear regression theory, the prediction variance given a single new test example input vector \(x^{(0)} = (1, x^{(0)}_1, \ldots, x^{(0)}_9)^\prime\), is given by, \[{\rm var}(\hat y) = \sigma_e^2 (1 + {x^{(0)}}^\prime ({\cal X}^\prime {\cal X})^{-1} x^{(0)} ), \qquad ... (1)\] where \(\cal X\) is the design matrix specified earlier with the column corresonding to the intercept term included. In eqn. (1), we hold the design matrix \(X\) fixed and choose random test example inputs from the normal distribution \({\rm N}(0, \Omega), \Omega=(0.9^{|i-j|})_{9 \times 9}\). The expected prediction error conditional on the original observed input matrix \(X\) may be written, \[{\rm EPE}_X = {\cal E}_X \{{\rm var}(\hat y) \}, \] where \({\cal E}_X\) denotes mathematical expectation for fixed \(X\). In other words, \({\cal E}_X\) is the expected prediction error conditional on the training sample we have been given. This conditional EPE tells how well our regression will work with future data that is generated in the same way as the original training sample. From Table 1 we see that BIC works best and backward stagewise and best subset produced the same result. It seems strange but true that best subset AIC slightly worse than backward stagewise AIC. Unfortunately I didn’t have time to compute the standard devations but I believe the difference is statistically signficant at 5% or less. OLS is the worst. The Oracle suggests that there is room for improvement!

Table 1. Conditional EPE.
	Root.Mean.Square.Error	……..St..Dev.RMSE
Oracle	1.03601	0.00015
OLS	1.07890	0.00015
Step/AIC	1.06845	0.00015
Step/BIC	1.05751	0.00015
Best/AIC	1.07167	0.00015
Best/BIC	1.05751	0.00015

As expected the standard deviations for the estimated RMSE’s are very small since the test sample is very large.

Expected Prediction Error (EPE)

In general, it can be shown that for a linear regression with \(p < n\) inputs plus an intercept term, the unconditional prediction variance for a new data value is \[{\rm EPE} = \sigma_e^2 (1+\frac{p+1}{n}). \qquad .... (2)\] For the current regression example \(p = 4\), \(\sigma_e^2 = 1\) and \(n = 100\). Plugging these values in and taking the square root we obtain root-mean-square EPE equal to 1.0247. This agrees well with the result in Table 2 for the Oracle estimator shown below in Table 2. The unconditional EPE given in eqn. (2) is estimated by simulation using the following steps:

Step 1 Generate training data. The design matrix \(X\) is generated from a normal distribution with mean zero and covariance matrix \(\Omega\). Using the parameters \(\beta = (0,-0.3,0,0,-0.3,0,0,0.3,0.3)^\prime\), \(n=100\) observations are generated using the equation \(y = 1_n + X \beta + e\), where \(e = (e_1, \ldots, e_n)^\prime\), the errors \(e_i\) are independent normal random values with mean 0 and variance 1 and \(1_n\) is an n-dimensional vector comprised of 1’s.
Step 2 Fit the training data with using each of the 6 methods.
Step 3 Generate test data with a sample size of \(10^4\) using a new design matrix and then new random errors and new output values. Compute the predictions and the average RMSE.
Step 4 Repeat Steps 1-4, 10^4 times and average the RMSE’s for each method.

Using the R parallel package on an I7 8-core PC the computations take about 5.6 minutes whereas without parallelization it takes 23.4 minutes - speedup by a factor of about 4.2.

Table 2. Unconditional EPE.
	Root-Mean-Square Error	…….Standard Deviation
Oracle	1.02587	0.00018
OLS	1.05404	0.00026
Step.AIC	1.05042	0.00025
Step.BIC	1.05051	0.00026
Best.AIC	1.04942	0.00025
Best.BIC	1.04935	0.00025

Eqn. (2) is important because it illustrates the both the bias-variance tradeoff and the principle of parsimony. When the model is underfit then eqn. (2) does not apply but we would expect the residual variance to be large so the prediction error will be too large. When the choice is just right, as in the case of the oracle selection, eqn. (2) gives the EPE for the best model. For example, in the current example, the oracle model has \(p=3\) but an overfit model will have \(p>3\) and hence by eqn. (2) the EPE is inflated.

The R code chunk for doing the EPE simulations is included in the associated Rmd file but it is not evaluated due to its length. Instead the this document uses results that were previously generated.

Simulated Regression, Variable Selection and EPE

Ian McLeod

January 31, 2017

Purpose

Expected Prediction Error (EPE)