This assignment is optional. If you complete it, Assignments 1 and 2 will each count for 10% of your final grade; otherwise Assignment 1 will count for 20% of your final grade.
This optional assignment is also open to graduate students.
Your task is to use regression methods discussed in our course, including linear regression and variable selection, to develop a model for prediction on the test data that is provided. It is important that you use only the training data in developing your model.
You must provide both a PDF and Rmd-file.
These files must be uploaded to OWL no later than December 21, 2017. Please send me an email when you upload your files.
The PDF document must be carefully prepared: all figures and tables must be numbered and titled, and the report must be well organized and concise (less than 10 pages). No computer code should appear in your PDF report, but the Rmd-file is essential; no report will be accepted without both files.
A brief description of the Boston Housing dataset was discussed in my lecture note on Prediction and more details are given in the documentation in the R package MASS:
help(Boston, package=MASS)
The training and test data used in my lecture note are available in csv format from my webpage and may be read into R with the read.csv() function as shown below:
tr <- read.csv("http://www.stats.uwo.ca/faculty/aim/2017/3859/data/BostonTrain.csv")
te <- read.csv("http://www.stats.uwo.ca/faculty/aim/2017/3859/data/BostonTest.csv")
str(tr)
## 'data.frame': 338 obs. of 14 variables:
## $ crim : num 0.0273 0.0273 0.0324 0.1446 0.2112 ...
## $ zn : num 0 0 0 12.5 12.5 12.5 12.5 12.5 0 0 ...
## $ indus : num 7.07 7.07 2.18 7.87 7.87 7.87 7.87 7.87 8.14 8.14 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.469 0.469 0.458 0.524 0.524 0.524 0.524 0.524 0.538 0.538 ...
## $ rm : num 6.42 7.18 7 6.17 5.63 ...
## $ age : num 78.9 61.1 45.8 96.1 100 94.3 82.9 39 84.5 56.5 ...
## $ dis : num 4.97 4.97 6.06 5.95 6.08 ...
## $ rad : int 2 2 3 5 5 5 5 5 4 4 ...
## $ tax : int 242 242 222 311 311 311 311 311 307 307 ...
## $ ptratio: num 17.8 17.8 18.7 15.2 15.2 15.2 15.2 15.2 21 21 ...
## $ black : num 397 393 395 397 387 ...
## $ lstat : num 9.14 4.03 2.94 19.15 29.93 ...
## $ medv : num 21.6 34.7 33.4 27.1 16.5 15 18.9 21.7 18.2 19.9 ...
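As a quick consistency check, the full Boston data in MASS has 506 rows, and the training CSV above has 338, so the test CSV should hold the remaining rows. A minimal sketch (the exact split is an assumption worth verifying after loading tr and te):

```r
library(MASS)                         # full Boston data documented in help(Boston)
nrow(Boston)                          # 506 rows in the complete dataset
# the training CSV has 338 rows, so the test CSV should hold the rest:
nrow(Boston) - 338                    # 168
```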
I notice that the variable rad is numeric, but perhaps it should be considered a categorical variable. The boxplots below support this idea.
#table(tr$rad) # rad takes only a few distinct values
library(lattice) # needed for bwplot()
BostonF <- tr
BostonF$rad <- factor(BostonF$rad)
bwplot(rad ~ medv, data=BostonF, panel=function(x, y){
  panel.grid(v=-1, h=0, col=rgb(0.5, 0.5, 0.5, 0.5))
  panel.bwplot(x, y, col="blue", fill="yellow", pch="|", notch=TRUE,
               notch.frac=0.8)
})
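If rad is treated as categorical, it can enter a linear regression as a factor. A minimal sketch of comparing the two codings, using the full Boston data from MASS as a stand-in for the training set tr (an assumption; substitute tr once loaded):

```r
library(MASS)                       # provides the Boston data
dat <- Boston                       # stand-in for the training data tr
dat$rad <- factor(dat$rad)          # treat rad as categorical
fitF <- lm(medv ~ ., data=dat)      # rad contributes one dummy per level
fitN <- lm(medv ~ ., data=Boston)   # numeric rad for comparison
# compare in-sample RMSEs of the two codings
c(numeric=sqrt(mean(resid(fitN)^2)), factor=sqrt(mean(resid(fitF)^2)))
```

Because the factor coding spans every function of rad's levels, including the linear one, the training RMSE cannot increase; whether the extra flexibility helps out of sample must be checked on the test data.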
As a check, I use R to compute the RMSEs for the training and test datasets using OLS and a random forest (RF). The code below requires that you download and install the R package randomForest.
rmse <- function(yhat, y) sqrt(mean((yhat - y)^2)) # not in base R, so define it
ansRF <- randomForest::randomForest(medv ~ ., data=tr, ntree=1000)
yhRFte <- predict(ansRF, newdata=te)
rmseRFte <- rmse(yhRFte, te$medv)
yhRFtr <- predict(ansRF, newdata=tr)
rmseRFtr <- rmse(yhRFtr, tr$medv)
#
ansOLS <- lm(medv ~ ., data=tr)
yhOLSte <- predict(ansOLS, newdata=te)
rmseOLSte <- rmse(yhOLSte, te$medv)
rmseOLStr <- sqrt(mean(resid(ansOLS)^2))
#
tb <- matrix(c(rmseOLStr, rmseOLSte, rmseRFtr, rmseRFte), ncol=2)
dimnames(tb) <- list(c("train", "test"), c("OLS", "RF"))
tb
## OLS RF
## train 4.224414 1.54143
## test 5.645434 3.62326
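For the variable-selection part of the assignment, one standard approach is stepwise selection by AIC with step(). A minimal sketch, using a random split of the MASS Boston data as a stand-in for tr and te (the split and seed are illustrative assumptions, not the course split):

```r
library(MASS)                         # Boston data
set.seed(1)                           # reproducible split (illustrative only)
idx <- sample(nrow(Boston), 338)      # same training size as tr
trn <- Boston[idx, ]; tst <- Boston[-idx, ]
full <- lm(medv ~ ., data=trn)        # OLS with all predictors
sel  <- step(full, trace=0)           # stepwise selection by AIC
rmse <- function(yhat, y) sqrt(mean((yhat - y)^2))
# test-set RMSEs of the full and selected models
c(full=rmse(predict(full, tst), tst$medv),
  sel =rmse(predict(sel,  tst), tst$medv))
```

Remember that for the actual submission the model must be developed on tr only, with te used solely for the final prediction error.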