We compare the functions boot::cv.glm() and gencve::gcv() for estimating the cross-validated MSE of the OLS model fitted to the mcdonald pollution data.

In the initialization step we attached the gencve package. Next we load the mcdonald data and check a brief summary.

data(mcdonald, package="bestglm")
str(mcdonald)
## 'data.frame':    60 obs. of  16 variables:
##  $ PREC : num  36 35 44 47 43 53 43 45 36 36 ...
##  $ JANT : num  27 23 29 45 35 45 30 30 24 27 ...
##  $ JULT : num  71 72 74 79 77 80 74 73 70 72 ...
##  $ OVR65: num  8.1 11.1 10.4 6.5 7.6 7.7 10.9 9.3 9 9.5 ...
##  $ POPN : num  3.34 3.14 3.21 3.41 3.44 3.45 3.23 3.29 3.31 3.36 ...
##  $ EDUC : num  11.4 11 9.8 11.1 9.6 10.2 12.1 10.6 10.5 10.7 ...
##  $ HOUS : num  81.5 78.8 81.6 77.5 84.6 66.8 83.9 86 83.2 79.3 ...
##  $ DENS : num  3243 4281 4260 3125 6441 ...
##  $ NONW : num  8.8 3.5 0.8 27.1 24.4 38.5 3.5 5.3 8.1 6.7 ...
##  $ WWDRK: num  42.6 50.7 39.4 50.2 43.7 43.1 49.2 40.4 42.5 41 ...
##  $ POOR : num  11.7 14.4 12.4 20.6 14.3 25.5 11.3 10.5 12.6 13.2 ...
##  $ HC   : num  21 8 6 18 43 30 21 6 18 12 ...
##  $ NOX  : num  15 10 6 8 38 32 32 4 12 7 ...
##  $ SOx  : num  59 39 33 24 206 72 62 4 37 20 ...
##  $ HUMID: num  59 57 54 56 55 54 56 56 61 59 ...
##  $ MORT : num  922 998 962 982 1071 ...

Using boot::cv.glm()

Then we fit using glm() and use boot::cv.glm() to estimate the cross-validated MSE.

ans <- glm(MORT ~ ., data=mcdonald )
set.seed(777555112)
boot::cv.glm(mcdonald, ans, K=3 )$delta[1]
## [1] 3645.234

The cv.glm() function uses only one iteration of plain K-fold cross-validation. Taking K=3 means that in each fold about 1/3 of the data are used for validation and 2/3 for training.
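To make the mechanics concrete, here is a minimal sketch of one pass of plain 3-fold cross-validation. This is an illustrative re-implementation, not the actual internals of boot::cv.glm(), and the fold assignment shown here is one of many possible random partitions.

```r
## Illustrative sketch of one pass of 3-fold cross-validation
## (not the actual code inside boot::cv.glm)
set.seed(1)
K <- 3
## randomly assign each of the 60 observations to one of 3 folds
folds <- sample(rep(1:K, length.out = nrow(mcdonald)))
sse <- 0
for (k in 1:K) {
  ## train on the other folds, validate on fold k
  fit <- glm(MORT ~ ., data = mcdonald[folds != k, ])
  pred <- predict(fit, newdata = mcdonald[folds == k, ])
  sse <- sse + sum((mcdonald$MORT[folds == k] - pred)^2)
}
sse / nrow(mcdonald)  # CV estimate of MSE, comparable to cv.glm()$delta[1]
```

Because the partition is random, re-running this sketch with a different seed changes the estimate, which is exactly the variability discussed next.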

Repeating the cross-validation, we get quite a different answer. Note that I have set the seeds to ensure this is the case! But you can experiment and you will find this is generally true.

set.seed(725534441)
boot::cv.glm(mcdonald, ans, K=3 )$delta[1]
## [1] 2721.638

Some cross-validation software sets the seeds internally so the user always gets consistent results.
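One way such software could do this is to wrap the cross-validation call in a function that fixes the seed internally. The wrapper below is a hypothetical sketch (cv_glm_fixed is not part of boot); it simply restores reproducibility at the cost of hiding the sampling variability.

```r
## Hypothetical wrapper that sets the seed internally, so repeated
## calls always return the same estimate (assumed helper, not in boot)
cv_glm_fixed <- function(data, fit, K = 3, seed = 2023) {
  set.seed(seed)
  boot::cv.glm(data, fit, K = K)$delta[1]
}
```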

*****Bonus Problem***** Discuss in the next class whether you think this is a good idea.

Another way to get consistent results is to do many iterations. Shao recommends at least 1000 iterations. For this dataset it takes about 5 seconds to do 1000 iterations using a simple for loop. Since each call to cv.glm(…, K=3) fits 3 models, we make only 334 calls, so that the total of about 1000 model fits is comparable with 1000 iterations of our function gencve::gcv().

NSIM <- 334  # each cv.glm() call with K=3 fits 3 models, so 334*3 is about 1000 fits
set.seed(74455333)
epes <- numeric(NSIM)
for (i in 1:NSIM) {
  epes[i] <- boot::cv.glm(mcdonald, ans, K=3)$delta[1]
}
sdEpes <- sqrt(var(epes)/length(epes))  # standard error of the mean
mEpes <- mean(epes)
res <- c(mEpes, 1.96*sdEpes)  # use a new name so the fitted model in ans is not overwritten
names(res) <- c("EPE", "MOE")
res
##       EPE       MOE 
## 3332.4580  202.9682

Recall the MOE (margin of error) is defined here as 1.96 times the standard error of the mean. From the large MOE we see that even after about 1000 iterations, the estimate is not very accurate. This is confirmed by the statistical summary,

summary(epes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1540    2503    2889    3332    3578   25790
*****Bonus Problem 2*****

Approximately how many simulations do we need to achieve a given accuracy in the estimate of the EPE? Give a numerical example.
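One way to approach this (a sketch of an answer, not the only one): since MOE = 1.96 * sd(epes) / sqrt(N), solving for N gives N = (1.96 * sd(epes) / targetMOE)^2. Equivalently, the MOE shrinks like 1/sqrt(N), so halving it requires about four times as many iterations.

```r
## How many cv.glm() calls are needed for a target MOE?
## Assumes epes from the simulation loop above is in the workspace.
targetMOE <- 100  # hypothetical accuracy target
ceiling((1.96 * sd(epes) / targetMOE)^2)
```

With the MOE of about 203 observed from 334 calls above, reaching a MOE of 100 would need roughly (203/100)^2 * 334, or about 1400 calls.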

Using gencve::gcv()

The function yhat_lm() is provided in the gencve package, along with many other such functions for models we discuss in our course. By default, gcv() randomly partitions the data into training and validation sets in the proportions 2/3 and 1/3, but an argument is provided for other choices. If you experiment with other choices, you will find the estimate of the EPE changes. Another important adjustable parameter is the number of iterations (simulations), which has a default setting of 1000.

Xy <- mcdonald
n <- nrow(Xy)
p <- ncol(Xy)-1
y <- Xy[,p+1]
X <- data.matrix(Xy[,1:p])
gcv(X, y,  NCores=8, yhat=yhat_lm)
##           epe     moe     pcorr
## [1,] 2293.966 113.645 0.6381111