We compare the functions boot::cv.glm() and gencve::gcv() for estimating the MSE of the OLS model for the McDonald and Schwing pollution data.
In the initialization step we attached the package gencve. Next we load the mcdonald data and check a brief summary.
data(mcdonald, package="bestglm")
str(mcdonald)
## 'data.frame': 60 obs. of 16 variables:
## $ PREC : num 36 35 44 47 43 53 43 45 36 36 ...
## $ JANT : num 27 23 29 45 35 45 30 30 24 27 ...
## $ JULT : num 71 72 74 79 77 80 74 73 70 72 ...
## $ OVR65: num 8.1 11.1 10.4 6.5 7.6 7.7 10.9 9.3 9 9.5 ...
## $ POPN : num 3.34 3.14 3.21 3.41 3.44 3.45 3.23 3.29 3.31 3.36 ...
## $ EDUC : num 11.4 11 9.8 11.1 9.6 10.2 12.1 10.6 10.5 10.7 ...
## $ HOUS : num 81.5 78.8 81.6 77.5 84.6 66.8 83.9 86 83.2 79.3 ...
## $ DENS : num 3243 4281 4260 3125 6441 ...
## $ NONW : num 8.8 3.5 0.8 27.1 24.4 38.5 3.5 5.3 8.1 6.7 ...
## $ WWDRK: num 42.6 50.7 39.4 50.2 43.7 43.1 49.2 40.4 42.5 41 ...
## $ POOR : num 11.7 14.4 12.4 20.6 14.3 25.5 11.3 10.5 12.6 13.2 ...
## $ HC : num 21 8 6 18 43 30 21 6 18 12 ...
## $ NOX : num 15 10 6 8 38 32 32 4 12 7 ...
## $ SOx : num 59 39 33 24 206 72 62 4 37 20 ...
## $ HUMID: num 59 57 54 56 55 54 56 56 61 59 ...
## $ MORT : num 922 998 962 982 1071 ...
Then we fit using glm() and use boot::cv.glm() to estimate the cross-validated MSE.
ans <- glm(MORT ~ ., data=mcdonald )
set.seed(777555112)
boot::cv.glm(mcdonald, ans, K=3 )$delta[1]
## [1] 3645.234
The cv.glm() function uses only one iteration of plain K-fold cross-validation. Taking K=3 ensures that about 1/3 of the data are used for validation and 2/3 for training.
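To fix ideas, here is a minimal sketch of what one iteration of K-fold cross-validation does. The helper kfold_mse() is our own illustrative function, not part of boot, and cv.glm()'s internals differ in detail:

```r
# Illustrative sketch of one iteration of K-fold cross-validation for lm().
# (Our own helper, not boot::cv.glm(); shown only to fix ideas.)
kfold_mse <- function(formula, data, K = 3) {
  n <- nrow(data)
  fold <- sample(rep(1:K, length.out = n))   # random fold assignment
  yname <- all.vars(formula)[1]              # response variable name
  errs <- numeric(n)
  for (k in 1:K) {
    fit <- lm(formula, data = data[fold != k, ])   # train on K-1 folds
    held <- fold == k
    errs[held] <- data[[yname]][held] -
      predict(fit, newdata = data[held, ])         # predict held-out fold
  }
  mean(errs^2)                               # cross-validated MSE
}
# e.g. kfold_mse(MORT ~ ., mcdonald, K = 3)
```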
Repeating the cross-validation, we get quite a different answer. Note that I have set the seeds to ensure this is the case! But you can experiment and you will find this is generally true.
set.seed(725534441)
boot::cv.glm(mcdonald, ans, K=3 )$delta[1]
## [1] 2721.638
Some cross-validation software sets the seeds internally so the user always gets consistent results.
*****Bonus Problem***** Discuss next class if you think this is a good idea.
Another way to get consistent results is to do many iterations. Shao recommends at least 1000 iterations. For this dataset it takes about 5 seconds to do 1000 iterations using a simple for loop. Actually, since each call to cv.glm(…, K=3) corresponds to 3 iterations, we only do 334 calls, so that we can better compare with 1000 iterations of our function gencve::gcv().
NSIM <- 334 #since K=3
set.seed(74455333)
epes <- numeric(NSIM)
for (i in 1:NSIM) {
epes[i] <- boot::cv.glm(mcdonald, ans, K=3 )$delta[1]
}
sdEpes <- sqrt(var(epes)/length(epes))
mEpes <- mean(epes)
ans <- c(mEpes, 1.96*sdEpes)
names(ans) <- c("EPE", "MOE")
ans
## EPE MOE
## 3332.4580 202.9682
Recall the MOE (Margin-of-Error) is defined as 1.96 times the standard error of the mean. From the large MOE we see that even after 1000 iterations, the estimate is not very accurate. This is confirmed by the statistical summary,
summary(epes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1540 2503 2889 3332 3578 25790
Approximately how many simulations do we need to achieve a given accuracy in the estimate of the EPE? Give a numerical example.
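One way to approach this: since the MOE shrinks like 1/sqrt(N), a target MOE translates into a required number of simulations via N ≈ (1.96 s / MOE_target)^2, where s is the standard deviation of a single CV estimate. A sketch using the numbers from the run above (the target MOE of 50 is just a hypothetical example):

```r
# The MOE is 1.96 * s / sqrt(N), so solving for N gives
# N ≈ (1.96 * s / moe_target)^2.
# Back out s from the run above: MOE = 202.9682 with N = 334.
s <- (202.9682 / 1.96) * sqrt(334)   # sd of a single CV estimate, about 1893
moe_target <- 50                     # hypothetical target margin of error
N_needed <- ceiling((1.96 * s / moe_target)^2)
N_needed                             # roughly 5500 simulations
```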
The function yhat_lm() is provided in the gencve package along with many other such functions for models we discuss in our course. By default, it randomly partitions the training/validation sets into the proportions 2/3 and 1/3, but an argument is provided for other possible choices. If you experiment with other choices, you will find the estimate of the EPE will change. Another important adjustable parameter is the number of iterations or simulations, which has a default setting of 1000.
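The default split can be pictured with a simple holdout sketch. The helper holdout_mse() is our own illustrative function; it does not reproduce yhat_lm()'s internals or gcv()'s averaging over many iterations:

```r
# Illustrative single holdout split with 2/3 training, 1/3 validation,
# mirroring gcv()'s default proportions (our own sketch, not gencve code).
holdout_mse <- function(formula, data, p_train = 2/3) {
  n <- nrow(data)
  train <- sample(n, size = round(p_train * n))   # random training rows
  yname <- all.vars(formula)[1]
  fit <- lm(formula, data = data[train, ])
  pred <- predict(fit, newdata = data[-train, ])
  mean((data[[yname]][-train] - pred)^2)          # validation-set MSE
}
# gcv() averages many such splits; e.g. holdout_mse(MORT ~ ., mcdonald)
```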
Xy <- mcdonald
n <- nrow(Xy)
p <- ncol(Xy)-1
y <- Xy[,p+1]
X <- data.matrix(Xy[,1:p])
gcv(X, y, NCores=8, yhat=yhat_lm)
## epe moe pcorr
## [1,] 2293.966 113.645 0.6381111