Optional Assignment 2

This assignment is optional. If you do this assignment, then Assignment 1 and Assignment 2 will each count for 10% of your final grade; otherwise, Assignment 1 alone will count for 20% of your final grade.

This optional assignment is also open to graduate students.

Boston Housing Data Problem

Your task is to use regression methods discussed in our course, including linear regression and variable selection, to develop a model for prediction on the test data that is provided. It is important that you use only the training data in developing your model.
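As one possible starting point (a sketch only, not the required method), backward stepwise selection by AIC can be run on a full OLS fit. The example below uses the full Boston data from the MASS package for illustration; in the assignment you would apply the same idea to the training data only.

```r
library(MASS)  # provides the Boston data frame

# Fit the full OLS model, then drop terms by AIC (backward stepwise).
full <- lm(medv ~ ., data = Boston)
sel  <- step(full, direction = "backward", trace = 0)
formula(sel)  # the selected model
```

The selected formula is only a candidate; you should still judge it by its predictive performance on held-out training data, not by AIC alone.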

You must provide both a PDF file and an Rmd file.

These files must be uploaded to OWL no later than December 21, 2017. Please send me an email when you upload your files.

The PDF document must be carefully prepared. All figures and tables must be numbered and have a title. The report must be well organized and not too long (less than 10 pages). No computer code should be included in your PDF report, but the Rmd file is essential: no report will be accepted without both files.

Brief Discussion of the Data

A brief description of the Boston Housing dataset was given in my lecture note on Prediction, and more details are given in the documentation in the R package MASS:

help(Boston, package=MASS)

The training and test data used in my lecture note are available in csv format from my webpage and may be input to R with the read.csv() function as shown below:

tr <- read.csv("http://www.stats.uwo.ca/faculty/aim/2017/3859/data/BostonTrain.csv")
te <- read.csv("http://www.stats.uwo.ca/faculty/aim/2017/3859/data/BostonTest.csv")
str(tr)
## 'data.frame':    338 obs. of  14 variables:
##  $ crim   : num  0.0273 0.0273 0.0324 0.1446 0.2112 ...
##  $ zn     : num  0 0 0 12.5 12.5 12.5 12.5 12.5 0 0 ...
##  $ indus  : num  7.07 7.07 2.18 7.87 7.87 7.87 7.87 7.87 8.14 8.14 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.469 0.469 0.458 0.524 0.524 0.524 0.524 0.524 0.538 0.538 ...
##  $ rm     : num  6.42 7.18 7 6.17 5.63 ...
##  $ age    : num  78.9 61.1 45.8 96.1 100 94.3 82.9 39 84.5 56.5 ...
##  $ dis    : num  4.97 4.97 6.06 5.95 6.08 ...
##  $ rad    : int  2 2 3 5 5 5 5 5 4 4 ...
##  $ tax    : int  242 242 222 311 311 311 311 311 307 307 ...
##  $ ptratio: num  17.8 17.8 18.7 15.2 15.2 15.2 15.2 15.2 21 21 ...
##  $ black  : num  397 393 395 397 387 ...
##  $ lstat  : num  9.14 4.03 2.94 19.15 29.93 ...
##  $ medv   : num  21.6 34.7 33.4 27.1 16.5 15 18.9 21.7 18.2 19.9 ...

I notice that the variable rad is numeric, but perhaps it should be treated as a categorical variable. The boxplots below support this idea.

library(lattice)  # for bwplot()
#table(tr$rad)    # rad takes only a few distinct values
BostonF <- tr
BostonF$rad <- factor(BostonF$rad)
bwplot(rad ~ medv, data=BostonF, panel=function(x,y){
  panel.grid(v=-1, h=0, col=rgb(0.5,0.5,0.5,0.5))
  panel.bwplot(x, y, col="blue", fill="yellow", pch="|", notch=TRUE, 
               notch.frac=0.8)
})
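One way to check whether treating rad as a factor actually improves the fit is to compare nested OLS models with an F-test. The sketch below uses the full Boston data from MASS for illustration; the same comparison applies to the training set tr and BostonF above.

```r
library(MASS)  # provides the Boston data frame

B  <- Boston
BF <- transform(B, rad = factor(rad))   # recode rad as categorical
fitNum <- lm(medv ~ ., data = B)        # rad treated as numeric
fitFac <- lm(medv ~ ., data = BF)       # rad treated as a factor
anova(fitNum, fitFac)                   # F-test for the extra factor levels
```

A small p-value from the F-test suggests the extra factor levels carry information beyond the single numeric slope, at the cost of more parameters.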

As a check, I use R to compute the RMSEs for the training and test datasets using OLS and a random forest (RF). The code below requires that you download and install the R package randomForest.

rmse <- function(yhat, y) sqrt(mean((yhat - y)^2))  # root mean squared error
ansRF <- randomForest::randomForest(medv ~ ., data=tr, ntree=1000)
yhRFte <- predict(ansRF, newdata=te)
rmseRFte <- rmse(yhRFte, te$medv)
yhRFtr <- predict(ansRF, newdata=tr)
rmseRFtr <- rmse(yhRFtr, tr$medv)
#
ansOLS <- lm(medv ~ ., data=tr)
yhOLSte <- predict(ansOLS, newdata=te)
rmseOLSte <- rmse(yhOLSte, te$medv)
rmseOLStr <- sqrt(mean(resid(ansOLS)^2))
#
tb <- matrix(c(rmseOLStr, rmseOLSte, rmseRFtr, rmseRFte), ncol=2)
dimnames(tb) <- list(c("train", "test"), c("OLS", "RF"))
tb
##            OLS      RF
## train 4.224414 1.54143
## test  5.645434 3.62326
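Since the test set must not be used for model development, candidate models should be compared by cross-validation on the training data alone. Below is a minimal 5-fold CV sketch, illustrated with the full Boston data from MASS; in the assignment you would replace Boston with the training frame tr.

```r
library(MASS)  # Boston stands in for the training data here
set.seed(1)

k <- 5
# Randomly assign each row to one of k folds.
folds <- sample(rep(1:k, length.out = nrow(Boston)))

cvRMSE <- sapply(1:k, function(i) {
  fit  <- lm(medv ~ ., data = Boston[folds != i, ])      # fit on k-1 folds
  pred <- predict(fit, newdata = Boston[folds == i, ])   # predict held-out fold
  sqrt(mean((pred - Boston$medv[folds == i])^2))
})
mean(cvRMSE)  # cross-validated RMSE estimate
```

The model with the smallest cross-validated RMSE is a reasonable choice to carry forward to the final evaluation on the test set.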