Prologue

When I chose the dataset, I thought it looked interesting because it gave unexpected results with my simple cross-validation functions trainTestPartition() and glmnetGridTable(): it was apparent that OLS beats ridge regression (RR) and the Lasso. I think this dataset is a bit of a “curve-ball”, but it is something you could well have to deal with if you use regression! And regression is by far the most widely used technique in data science. Table 1 displays a typical result. There are some interesting features in Table 1. We see that all Lasso iterations agree that ‘crim’, ‘zn’, ‘age’, ‘rad’, ‘tax’ and ‘black’ are not very important. But we will show that, because the model has major mis-specification problems, these conclusions are incorrect.

We will find that this is indeed an interesting and amazing dataset!

##            OLS StepAIC StepBIC    RR1    RR2    RR3 LASSO1 LASSO2 LASSO3
## crim    -1.181  -1.185  -1.236 -0.079 -0.080 -0.076  0.000  0.000 -0.012
## zn       1.293   1.264   1.273  0.026  0.026  0.024  0.000  0.000  0.010
## indus    0.146   0.000   0.000 -0.078 -0.077 -0.079  0.000  0.000  0.000
## chas     0.588   0.599   0.000  2.363  2.399  2.234  0.047  1.327  1.634
## nox     -2.402  -2.322  -2.255 -5.127 -5.368 -4.544  0.000 -1.388 -5.237
## rm       2.360   2.365   2.373  3.046  3.107  2.839  3.381  3.671  3.650
## age      0.131   0.000   0.000 -0.009 -0.009 -0.010  0.000  0.000  0.000
## dis     -3.367  -3.442  -3.517 -0.454 -0.488 -0.358  0.000 -0.170 -0.484
## rad      3.019   2.940   3.051  0.031  0.036  0.016  0.000  0.000  0.000
## tax     -1.958  -1.842  -1.964 -0.002 -0.002 -0.003  0.000  0.000  0.000
## ptratio -2.216  -2.194  -2.247 -0.669 -0.684 -0.621 -0.695 -0.806 -0.822
## black    0.937   0.941   0.994  0.007  0.008  0.007  0.002  0.006  0.006
## lstat   -4.165  -4.121  -4.148 -0.344 -0.355 -0.311 -0.563 -0.584 -0.586
## NORM     7.887   7.804   7.915  6.475  6.713  5.858  3.498  4.264  6.684
## RMSE     4.284   4.281   4.359  4.596  4.555  4.739  4.698  4.505  4.403
Table 1. Estimated coefficients, coefficient norm (NORM) and test RMSE for OLS, stepwise (AIC/BIC), ridge (RR) and Lasso fits.

Purpose

The data are collected from 506 districts in the area around Boston. There could possibly be spatial correlation, but this is ignored in the analysis and the observations are assumed to be statistically independent.

The dataset comprises 13 input variables and one output variable, the median house price.

Our goal is to look for the important variables related to house price and to develop a statistical model for predicting house price.

This notebook is prepared for my students, so I go into more detail than usual. For the purposes of a report for submission, students would ideally do this analysis and then write up a brief report, since it is not necessary to show all the intermediate steps in the final report.

Analysis

Initial Data Analysis

The variables are summarized below.

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Table 2.
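The summary above matches the classic Boston housing data. Assuming the version shipped with the MASS package (an assumption, since the notebook does not name its data source), Table 2 can be reproduced with:

```r
# Load the Boston housing data (assumed here to be the MASS version)
library(MASS)
data(Boston)
str(Boston)  # 506 obs. of 14 variables, as in Table 2
```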

From this summary it appears that ‘chas’ and ‘rad’ are categorical variables. We check this by using table() to create cross-tabulations. First, for ‘chas’, we find:

##   River Proximity Number of Houses
## 1            near               35
## 2             not              471
Table 3.

Since ‘chas’ has only two levels, we do not need a factor parameterization; it is already correctly parameterized using 0/1 encoding. From Figure 1 we see that being near the river has some effect on median house price, but it is not clear how statistically significant this effect will be once other variables, like the number of rooms, are taken into account.

Next in Table 4 we show the crosstab for ‘rad’. We see there are nine levels for this variable.

## 
##   1   2   3   4   5   6   7   8  24 
##  20  24  38 110 115  26  17  24 132
Table 4.
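Both cross-tabulations can be produced with table(). A sketch, again assuming the MASS Boston data (the “near”/“not” labels for ‘chas’ are my own relabelling of the 1/0 codes):

```r
library(MASS)
data(Boston)
# 'chas' (Charles River dummy): 1 = near the river, 0 = not
table(factor(Boston$chas, levels = c(1, 0), labels = c("near", "not")))
# 'rad' (index of accessibility to radial highways): nine distinct levels
table(Boston$rad)
```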

The definition of ‘rad’ is that it is an index of accessibility to highways. Some indices may be treated as continuous variables, for example the consumer price index (CPI). To determine whether this might be the case here, we look at the notched boxplots by level of ‘rad’ in Figure 2. These boxplots suggest there are some important variations in house price associated with ‘rad’ and, moreover, that the relationship is not linear or any simple non-linear function.
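A plot along the lines of Figure 2 can be sketched as follows (MASS Boston data assumed; non-overlapping notches suggest the medians differ):

```r
library(MASS)
data(Boston)
# Notched boxplots of median house price by level of 'rad'
boxplot(medv ~ factor(rad), data = Boston, notch = TRUE,
        xlab = "rad", ylab = "medv")
```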

Multiple linear regression

We fit the multiple linear regression. Since there are factors, we use the R anova() function to check the p-values. From Table 5 we see that all variables are highly significant except for ‘nox’ and ‘age’, which are both near the 5% level. This is a bit surprising, because we would expect ‘nox’ pollution to lower the house price. Since the inputs are strongly correlated, this can happen.

## Analysis of Variance Table
## 
## Response: medv
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## crim        1  6440.8  6440.8 292.3589 < 2.2e-16 ***
## zn          1  3554.3  3554.3 161.3378 < 2.2e-16 ***
## indus       1  2551.2  2551.2 115.8053 < 2.2e-16 ***
## chas        1  1529.8  1529.8  69.4426 8.149e-16 ***
## nox         1    76.2    76.2   3.4610  0.063437 .  
## rm          1 10938.1 10938.1 496.5010 < 2.2e-16 ***
## age         1    90.3    90.3   4.0974  0.043496 *  
## dis         1  1779.5  1779.5  80.7748 < 2.2e-16 ***
## rad         8   948.7   118.6   5.3831 1.672e-06 ***
## tax         1   154.3   154.3   7.0021  0.008406 ** 
## ptratio     1   953.6   953.6  43.2844 1.229e-10 ***
## black       1   608.5   608.5  27.6205 2.216e-07 ***
## lstat       1  2406.1  2406.1 109.2194 < 2.2e-16 ***
## Residuals 485 10684.7    22.0                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 5. ANOVA for OLS Regression
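A fit along the lines of Table 5, with ‘rad’ treated as a factor (MASS Boston data assumed), might look like:

```r
library(MASS)
data(Boston)
Boston$rad <- factor(Boston$rad)   # treat 'rad' as categorical (9 levels, hence Df = 8)
fit <- lm(medv ~ ., data = Boston) # OLS on all 13 inputs
anova(fit)                         # sequential ANOVA table, as in Table 5
```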

Next we look at diagnostic checks. The basic regression diagnostics are shown in Figure 3. The residuals are strongly skewed to the left, which suggests that it might be better to use log house prices.

Figure 3. OLS Regression Diagnostics

After taking logs, we find that there are still outliers but the distribution of the residuals is more symmetric.

Figure 4. Diagnostics After Logs

The statistical analysis using least squares is easier to interpret with a symmetric error distribution, so we replace the variable ‘medv’ with its log value ‘lmedv’. The ANOVA using the output ‘lmedv’ is shown in Table 6. Using logs has improved the sensitivity: now all the variables are highly significant.

## Analysis of Variance Table
## 
## Response: log(medv)
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## crim        1 23.5180 23.5180 658.1388 < 2.2e-16 ***
## zn          1  5.8293  5.8293 163.1291 < 2.2e-16 ***
## indus       1  5.7358  5.7358 160.5118 < 2.2e-16 ***
## chas        1  2.2772  2.2772  63.7267 1.043e-14 ***
## nox         1  0.6093  0.6093  17.0501 4.287e-05 ***
## rm          1 14.1742 14.1742 396.6560 < 2.2e-16 ***
## age         1  0.4317  0.4317  12.0821 0.0005548 ***
## dis         1  2.2343  2.2343  62.5249 1.790e-14 ***
## rad         8  1.7234  0.2154   6.0287 2.120e-07 ***
## tax         1  0.4605  0.4605  12.8869 0.0003646 ***
## ptratio     1  1.3625  1.3625  38.1275 1.402e-09 ***
## black       1  1.3821  1.3821  38.6771 1.080e-09 ***
## lstat       1  7.3072  7.3072 204.4871 < 2.2e-16 ***
## Residuals 485 17.3311  0.0357                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 6.

Next we examine residual dependency plots for the continuous variables in the model. The relationship with the residual appears flat in most cases, the exceptions being ‘rm’ and ‘crim’. In the case of ‘crim’ there appear to be a few values on the right that are dragging the curve up, so this relationship is probably reasonably flat. For ‘rm’, the average number of rooms, a quadratic term could be used.

Figure 5. Residual Dependency Plot

The model was refit using \({\rm rm}^2\); all variables are highly significant and the diagnostic checks were acceptable. For brevity the results are not shown here, but they are included in the Rmd file.
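The refit described above can be sketched as follows (MASS Boston data assumed; ‘lmedv’ is the logged response introduced earlier):

```r
library(MASS)
data(Boston)
Boston$rad   <- factor(Boston$rad)
Boston$lmedv <- log(Boston$medv)   # replace 'medv' with its log
Boston$medv  <- NULL
# Add a quadratic term in 'rm' via I()
fit2 <- lm(lmedv ~ . + I(rm^2), data = Boston)
anova(fit2)                        # all terms highly significant, per the text
par(mfrow = c(2, 2)); plot(fit2)   # standard diagnostic plots
```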

Cross-validation experiment using glmnet

In the prologue, it was noted that RR and the Lasso did not improve on OLS when the functions trainTestPartition() and glmnetGridTable() were used to perform a simple cross-validation experiment. Now we repeat this experiment using our improved multiple linear regression model. The result is essentially the same! Also note that the Lasso suggests that ‘indus’, ‘rm’ and ‘tax’ are not important, although the squared value of ‘rm’ is included.

##           OLS LASSO1 LASSO2 LASSO3 LASSO4   RR1   RR2   RR3   RR4
## crim    -0.94  -0.08  -0.09  -0.08  -0.09 -0.06 -0.06 -0.07 -0.06
## zn       0.31   0.01   0.01   0.00   0.01  0.01  0.01  0.01  0.01
## indus    0.14   0.00   0.00   0.00   0.00 -0.02 -0.01 -0.01 -0.01
## chas     0.26   0.94   0.95   0.79   1.01  1.02  1.05  1.07  1.06
## nox     -0.93  -6.19  -6.36  -4.78  -7.01 -2.76 -2.98 -3.23 -3.10
## rm      -5.78   0.00   0.00   0.00  -1.61  0.60  0.57  0.54  0.56
## age     -0.24   0.00  -0.01   0.00  -0.01 -0.01 -0.01 -0.01 -0.01
## dis     -1.06  -0.44  -0.45  -0.26  -0.50 -0.16 -0.19 -0.22 -0.20
## rad2     0.18   0.02   0.03   0.00   0.15  0.33  0.34  0.35  0.34
## rad3     0.37   0.56   0.58   0.39   0.69  0.53  0.56  0.60  0.58
## rad4     0.32  -0.03  -0.02   0.00   0.00 -0.13 -0.12 -0.10 -0.11
## rad5     0.47   0.00   0.00   0.00   0.12  0.26  0.26  0.26  0.26
## rad6     0.14  -0.32  -0.34  -0.07  -0.30 -0.22 -0.23 -0.24 -0.24
## rad7     0.33   1.04   1.10   0.48   1.33  0.62  0.70  0.77  0.73
## rad8     0.32   0.57   0.60   0.34   0.69  0.73  0.75  0.77  0.76
## rad24    0.94   0.15   0.20   0.00   0.51 -0.14 -0.08 -0.01 -0.04
## tax     -0.44   0.00   0.00   0.00   0.00  0.00  0.00  0.00  0.00
## ptratio -0.68  -0.35  -0.36  -0.34  -0.36 -0.22 -0.24 -0.25 -0.24
## black    0.24   0.00   0.00   0.00   0.00  0.00  0.00  0.00  0.00
## lstat   -1.71  -0.23  -0.23  -0.24  -0.23 -0.14 -0.15 -0.15 -0.15
## rmSq     6.49   0.09   0.09   0.09   0.21  0.07  0.07  0.07  0.07
## NORM     9.16   6.44   6.62   4.92   7.51  3.25  3.47  3.71  3.59
## RMSE     2.07   2.16   2.16   2.18   2.13  2.41  2.37  2.34  2.35
Table 7.
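A single train/test iteration of this kind of experiment might look like the following sketch. trainTestPartition() and glmnetGridTable() are the author's own helpers, so here the split and fit are written out by hand (2/3 training proportion assumed from the later description):

```r
library(MASS)
library(glmnet)
data(Boston)
Boston$rad   <- factor(Boston$rad)
Boston$lmedv <- log(Boston$medv)
# Dummy-code 'rad' and add the rm^2 term; drop the intercept column
X <- model.matrix(lmedv ~ . - medv + I(rm^2), data = Boston)[, -1]
y <- Boston$lmedv
set.seed(1)
idx <- sample(nrow(Boston), size = floor(2/3 * nrow(Boston)))
# alpha = 1 gives the Lasso; alpha = 0 gives ridge regression
cvfit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
pred  <- predict(cvfit, X[-idx, ], s = "lambda.min")
sqrt(mean((y[-idx] - pred)^2))     # test RMSE for this iteration
```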

Cross-validation experiment using grpreg

Since ‘rad’ is a categorical variable, perhaps the Lasso or some other penalized regression method will do better with a group penalty function. We compare OLS and backward stepwise regression using AIC/BIC against group Lasso and group MCP penalties, also selected using AIC/BIC, over 25 cross-validation iterations. The notched boxplots in Figure 6 below suggest OLS is still the best, although the difference between OLS and the best penalized method, MCP, is not significant.

Figure 6. Notched Boxplots OLS vs Penalized Regression

These cross-validation computations took about 16 seconds.
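The group-penalized fits can be sketched with the grpreg package. The group labels below are my own construction from the dummy coding (all eight ‘rad’ dummies share one group so they enter or leave the model together); the lambda selection here uses cross-validation rather than the AIC/BIC selection described above:

```r
library(MASS)
library(grpreg)
data(Boston)
Boston$rad   <- factor(Boston$rad)
Boston$lmedv <- log(Boston$medv)
X <- model.matrix(lmedv ~ . - medv + I(rm^2), data = Boston)[, -1]
y <- Boston$lmedv
# Collapse the 'rad' dummy columns into a single group label
grp <- factor(sub("^rad.*", "rad", colnames(X)))
fit_gl  <- cv.grpreg(X, y, group = grp, penalty = "grLasso")  # group Lasso
fit_mcp <- cv.grpreg(X, y, group = grp, penalty = "grMCP")    # group MCP
coef(fit_gl)   # coefficients at the CV-selected lambda
```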

The notches in the boxplots are quite large, so to essentially remove the effect of randomness the number of cross-validation iterations was increased to 10,000; the results are shown in Figure 6B below. Using the R parallel package, this takes about 25 minutes. The results are similar. Interestingly, group Lasso with AIC selection is now best, followed by OLS and group MCP with AIC and BIC. From the plot we see that the differences are very small, even if they are statistically significant.

Figure 6B. Notched Boxplots OLS vs Penalized Regression

Random Forest - Insight and Prediction

Random Forest Variable Importance Plot

Random Forest does well in many prediction problems, both for classification and for regression. It builds on the CART methodology, so it handles non-linearity, including interactions among the variables. Special methods are used for categorical variables. For direct comparison with OLS we will use ‘lmedv’, the log of the median house price, and treat ‘rad’ as a factor.

Another advantage is that the R function randomForest() is very easy to use. I recommend setting the argument ntree=1000 instead of the default setting, which uses only 500 trees.
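A minimal sketch of such a fit, assuming the MASS Boston data (importance = TRUE records the permutation importance plotted in Figure 7):

```r
library(MASS)
library(randomForest)
data(Boston)
Boston$rad   <- factor(Boston$rad)
Boston$lmedv <- log(Boston$medv)
Boston$medv  <- NULL
set.seed(1)
# ntree = 1000 rather than the default 500
rf <- randomForest(lmedv ~ ., data = Boston,
                   ntree = 1000, importance = TRUE)
varImpPlot(rf)   # dotchart of variable importance
```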

Figure 7. Dotchart for Random Forest Importance

From Figure 7, the most important variables for random forest prediction are, in order, ‘lstat’ (lower status of population), ‘rm’ (average number of rooms), ‘nox’ (air pollution) and ‘crim’ (per capita crime rate).

Random Forest Cross-Validation Experiment

The prediction performance of RF is assessed using cross-validation. As in the last experiment, trainTestPartition() is used to randomly partition the dataset into 2/3 for training and 1/3 for testing. The variable ‘rad’ is used as a factor with both OLS and RF, and the logged house price is used.
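Since trainTestPartition() is the author's own helper, the experiment can be sketched with a hand-written 2/3–1/3 split (MASS Boston data assumed):

```r
library(MASS)
library(randomForest)
data(Boston)
Boston$rad   <- factor(Boston$rad)
Boston$lmedv <- log(Boston$medv)
Boston$medv  <- NULL
rmse <- function(e) sqrt(mean(e^2))
set.seed(1)
out <- replicate(25, {   # 25 CV iterations, as in the text
  idx <- sample(nrow(Boston), floor(2/3 * nrow(Boston)))
  tr  <- Boston[idx, ]; te <- Boston[-idx, ]
  rf  <- randomForest(lmedv ~ ., data = tr, ntree = 1000)
  ols <- lm(lmedv ~ ., data = tr)
  c(RF  = rmse(te$lmedv - predict(rf, te)),
    OLS = rmse(te$lmedv - predict(ols, te)))
})
boxplot(t(out), notch = TRUE)  # notched boxplots, as in Figure 8
```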

Based on only 25 CV iterations it is clear that RF performs much better than OLS.

Figure 8. Notched Boxplot RF VS OLS

The computations took about 23 seconds.

Just as a ‘trial by fire’ for the R function randomForest(), we ran the above simulation 10,000 times. Everything worked fine, and the plot shown below in Figure 8B was produced.

Figure 8B. Notched Boxplot RF VS OLS

Conclusion

For most of the boxplots and dotcharts presented in the notebook, a Pareto-type ordering was used. This simple technique often enhances visualization.

A cross-validation experiment was also performed using Gradient Boosting Machines with the gbm package, but this performed worse than OLS. There are many adjustable parameters, so there is possible scope for improvement. The script is included in the Rmd file.
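The gbm experiment might be sketched as follows; the tuning values here are illustrative assumptions, not the settings used in the Rmd file:

```r
library(MASS)
library(gbm)
data(Boston)
Boston$rad   <- factor(Boston$rad)
Boston$lmedv <- log(Boston$medv)
Boston$medv  <- NULL
set.seed(1)
# Gaussian (squared-error) boosting; depth/shrinkage chosen for illustration
fit <- gbm(lmedv ~ ., data = Boston, distribution = "gaussian",
           n.trees = 5000, interaction.depth = 3, shrinkage = 0.01,
           cv.folds = 5)
best <- gbm.perf(fit, method = "cv")        # CV-selected number of trees
head(predict(fit, Boston, n.trees = best))  # fitted values on the log scale
```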

It is very interesting that RF performed so much better than OLS and all of the penalized and subset regression methods. In principle, one might believe that it should be possible to construct a type of regression model that would compete better with RF. My first approach would be to look at cross-products or interactions among the variables. But I have little enthusiasm for what might be a futile quest.