In the initialization, we loaded the grpreg package and the Iowa dataset, which is available in the bestglm package.
There are n=133 average test scores for schools in the K=6 largest cities. The test score offers a standardized measure of academic achievement. The purpose of the study is to investigate whether there is a relationship between academic achievement, as measured by the test, and poverty. It is expected that students from economically disadvantaged backgrounds will perform less well. Data on the average income in each school district was not available, so a proxy variable for poverty was used: the percentage of students who received subsidized meals was available, and this serves as the "Poverty" variable.
library(xtable)  # for HTML table output
df <- as.data.frame.table(table(Iowa$City))
names(df) <- c("Factor Level", "Sample Size")
row.names(df) <- NULL
x <- xtable(df)
caption(x) <- "Table 1. Sample Size in Each Category"
print(x, type="html", caption.placement="top", caption.width=90,
include.rownames=FALSE)
| Factor Level | Sample Size |
|---|---|
| CedarRapids | 21 |
| Davenport | 22 |
| DesMoines | 38 |
| IowaCity | 17 |
| SiouxCity | 21 |
| Waterloo | 14 |
From Table 1 we see the sample sizes are adequate in each category so we can use stratified random sampling to assign about 2/3 of the data in each factor level to the training dataset and the remainder to the test data.
In Figure 1 we present a dotchart showing the training and test sample sizes for each factor level. The visualization makes it clear that Des Moines contributes the largest number of schools while Waterloo contributes the smallest.
#split the data randomly into training/test using stratified random sampling
# to get the same proportion in each category
#re-order in ascending order
ind <- order(as.character(Iowa$City))
Iowa2 <- Iowa[ind,]
n <- table(Iowa2$City)
J <- length(n)
trainFrac <- 2/3  # fraction of each stratum assigned to the training set
nTr <- round(trainFrac*n)
nTe <- n-nTr
m <- matrix(c(nTr,nTe), ncol=2)
row.names(m) <- names(n)
dimnames(m)[[2]] <- c("Train", "Test")
dotchart(m, col=c("red", "blue"), pch=18, pt.cex=2, xlim=c(0,26),
xlab="Sample size in each category",
main="Figure 1. Train/Test Sample Sizes")
#within each city, draw a random permutation of ranks; observations with
#rank <= nTr[i] go to the training set
r <- Map(function(k) sample(seq_len(k)), n)
indTr <- vector("list", J)
for (i in 1:J) {
  indTr[[i]] <- r[[i]] <= nTr[i]
}
indTr <- unlist(indTr)
Iowa2 <- cbind(Iowa2, train=indTr)
XyTr <- Iowa2[Iowa2$train,][,-4]
XyTe <- Iowa2[!Iowa2$train,][,-4]
XTr <- as.matrix.data.frame(XyTr[,-3])
XTe <- as.matrix.data.frame(XyTe[,-3])
yTr <- XyTr[,3]
yTe <- XyTe[,3]
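The per-stratum selection logic above can be illustrated on a small self-contained example (the stratum labels and sizes here are hypothetical, chosen only to mirror the rank-based assignment used in the code):

```r
# Toy illustration of the stratified split: within each stratum, draw a
# random permutation of ranks and assign observations with rank <= nTr[i]
# to the training set, so each stratum keeps about 2/3 in training.
set.seed(1)
city <- factor(rep(c("A", "B", "C"), times = c(21, 22, 38)))
n <- table(city)                 # per-stratum sample sizes
nTr <- round((2/3) * n)          # per-stratum training sizes
train <- unlist(Map(function(k, m) sample(seq_len(k)) <= m, n, nTr))
table(city, train)               # exactly nTr[i] TRUEs in each stratum
```

By construction each stratum contributes exactly `round(2/3 * n[i])` training observations, which is what makes the split stratified rather than a simple random split.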
The dotchart in Figure 2 compares the average test score and the average value of the poverty variable in each city. The six cities are ordered by average test score: reading from the bottom, Iowa City has the highest average test score while Waterloo has the lowest. It is strikingly clear that there is indeed a relationship between poverty and test score.
# For the dotchart, let's order in terms of Test score
tb <- aggregate(XyTr[,-1], by=list(City = XyTr$City), FUN=mean)
ind <- order(tb$Test)
tb <- tb[ind,]
m <- as.matrix.data.frame(tb[,-1])
row.names(m) <- tb[,1]
dotchart(t(m), pch=18, col="blue", lcolor="red",
main="Figure 2. Dot Chart. Poverty/Test by City, Training Data.",
xlab="Average Percentage Poverty or Test Score")
Figure 3 shows a lattice plot of test score versus poverty for each city.
The dashed line in each panel is the fitted simple linear regression. The slopes and intercepts are essentially identical to those from the full multiple linear regression we fit in the next section, in which not only Poverty and City but all of their interactions are included. In multiple linear regression theory, we have shown how to test whether the slopes in each panel are the same and/or whether the intercepts are the same.
The solid red curve shows the linear loess smooth with \(\alpha=0.8\) in each panel. A straight-line regression seems adequate except perhaps for Cedar Rapids. Generalized additive models (GAMs) make it easy to fit a separate loess or smoothing-spline curve in each panel. There are two packages in R that I recommend for GAMs: gam and mgcv.
#lattice style plot (xyplot from the lattice package)
library(lattice)
xyplot(Test ~ Poverty | City, data=XyTr, panel=function(x, y){
panel.xyplot(x, y, pch=16, col="blue", cex=1.5)
panel.loess(x, y, span=0.8, degree=1, col=rgb(1,0,0,0.7), lwd=2)
panel.lmline(x, y, lwd=2, lty=2)
}, main="Figure 3. Test Scores vs Poverty, Schools, Six Iowa Cities")
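As a hedged sketch of the GAM alternative mentioned above, the mgcv package (one of the two recommended) can fit a smooth term in Poverty with city-level shifts. The data frame here is simulated stand-in data, not the Iowa schools, and the variable names simply mimic the document's:

```r
library(mgcv)  # one of the two recommended GAM packages
set.seed(2)
# simulated stand-in data: test scores declining with poverty, three cities
d <- data.frame(Poverty = runif(120, 10, 90),
                City = factor(sample(c("A", "B", "C"), 120, replace = TRUE)))
d$Test <- 80 - 0.4 * d$Poverty + rnorm(120, sd = 5)
# s(Poverty) fits a penalized smoothing spline; City enters as shifts
fit <- gam(Test ~ s(Poverty) + City, data = d)
summary(fit)$s.table  # significance of the smooth term
```

Replacing `s(Poverty)` with `s(Poverty, by = City)` would fit a separate smooth per city, the panel-wise analogue of the loess curves in Figure 3.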
RMSE <- NULL
#Full model - separate slope and intercept for each city
ans <- lm(Test ~ Poverty*City, data=XyTr)
bOLS <- coef(ans)
yhF <- predict(ans, newdata=XyTe)
RMSE[1] <- rmse(yTe, yhF)
#backward stepwise regression with the BIC penalty (k = log n)
ansStep <- step(ans, k=log(nrow(XyTr)), trace=0)
bStep <- coef(ansStep)
bStepBIC <- numeric(length(bOLS))
names(bStepBIC) <- names(bOLS)
bStepBIC[names(bStep)] <- bStep
yhS <- predict(ansStep, newdata=XyTe)
RMSE[2] <- rmse(yTe, yhS)
#get model matrix for use with grpreg
#separate slope and intercept - remove 1st column
XFTr <- model.matrix(ans)[,-1]
XFTe <- model.matrix(Test ~ Poverty*City, data=XyTe)[,-1]
#MCP
ansMCP <- grpreg(XFTr, yTr, penalty="grMCP")
lambdaBIC <- select(ansMCP, "BIC")$lambda
bMCP <- select(ansMCP, "BIC")$beta
yhM <- predict(ansMCP, X=XFTe, lambda=lambdaBIC, type="response")
RMSE[3] <- rmse(yTe, yhM)
#Group MCP
varNames <- names(bMCP)
ansMCPGroup <- grpreg(XFTr, yTr, penalty="grMCP",
group=c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3))
lambdaBIC <- select(ansMCPGroup, "BIC")$lambda
bMCPGroup <- select(ansMCPGroup, "BIC")$beta
yhMPen <- predict(ansMCPGroup, X=XFTe, lambda=lambdaBIC, type="response")
RMSE[4] <- rmse(yTe, yhMPen)
names(RMSE) <- c("OLS", "Step", "MCP", "GroupMCP")
#table
tb <- matrix(c(bOLS, bStepBIC, bMCP, bMCPGroup), ncol=4)
tb <- rbind(tb, RMSE)
rownames(tb) <- c(names(bOLS), "RMSE")
colnames(tb) <- c("OLS (incl. interactions)", "Step/BIC", "MCP", "GroupMCP")
#round(tb,2)
In fitting the final group MCP model, the argument group is specified as
group=c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
This specification is obtained by grouping the terms in the regression: group 1 is Poverty; group 2 comprises the five city indicators CityDavenport, CityDesMoines, CityIowaCity, CitySiouxCity, CityWaterloo; and group 3 comprises the five interactions Poverty:CityDavenport, Poverty:CityDesMoines, Poverty:CityIowaCity, Poverty:CitySiouxCity, Poverty:CityWaterloo.
Note that the intercept is not assigned to a group because it does not enter the penalty function.
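For reference, a sketch of the objective being minimized, following the standard MCP definition (this notation is mine, not the original document's): with groups \(g = 1, \dots, G\), the group MCP estimator minimizes

\[
\frac{1}{2n}\lVert y - X\beta \rVert^2 \;+\; \sum_{g=1}^{G} p\!\left(\lVert \beta_g \rVert;\, \lambda, \gamma\right),
\qquad
p(t; \lambda, \gamma) =
\begin{cases}
\lambda t - \dfrac{t^2}{2\gamma}, & t \le \gamma\lambda,\\[4pt]
\dfrac{\gamma\lambda^2}{2}, & t > \gamma\lambda,
\end{cases}
\]

so the penalty acts on each group's norm as a whole: an entire group (for example, all five Poverty:City interactions) is either shrunk together or set to zero together, and the unpenalized intercept is excluded.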
Table 2 compares the regression coefficients of the four fitted models and their corresponding RMSEs on the test data.
x <- xtable(tb)
caption(x) <- "Table 2. Comparison of Models."
print(x, type="html", caption.placement="top")
| | OLS (incl. interactions) | Step/BIC | MCP | GroupMCP |
|---|---|---|---|---|
| (Intercept) | 71.55 | 76.88 | 76.00 | 75.97 |
| Poverty | -0.33 | -0.51 | -0.46 | -0.53 |
| CityDavenport | -9.90 | -10.53 | -11.91 | -5.86 |
| CityDesMoines | 2.90 | -5.83 | 0.00 | -3.14 |
| CityIowaCity | 9.73 | -1.34 | 0.00 | -0.86 |
| CitySiouxCity | 9.90 | 2.64 | 0.00 | 1.73 |
| CityWaterloo | -2.14 | -7.05 | -8.76 | -3.70 |
| Poverty:CityDavenport | -0.08 | 0.00 | 0.00 | 0.00 |
| Poverty:CityDesMoines | -0.25 | 0.00 | -0.14 | 0.00 |
| Poverty:CityIowaCity | -0.44 | 0.00 | -0.06 | 0.00 |
| Poverty:CitySiouxCity | -0.22 | 0.00 | 0.00 | 0.00 |
| Poverty:CityWaterloo | -0.17 | 0.00 | 0.00 | 0.00 |
| RMSE | 10.26 | 10.15 | 10.00 | 9.33 |