Fitting a best subset regression to the prostate data, with BIC as the selection criterion, picks ‘lcavol’, ‘lweight’ and ‘svi’ as the best predictors.
Remark: The t-ratio statistic is not as useful a measure of variable importance as the RF-importance statistic. The t-ratio indicates how significantly the regression coefficient differs from zero, whereas the RF-importance indicates how much the input variable contributes to prediction.
## BIC
## BICq equivalent for q in (0.056139873352981, 0.759703853311213)
## Best Model:
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -0.7771566 0.62299945 -1.247444 2.153670e-01
## lcavol       0.5258519 0.07486323  7.024168 3.493565e-10
## lweight      0.6617699 0.17563516  3.767867 2.887126e-04
## svi          0.6656666 0.20708985  3.214385 1.797619e-03
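For reference, the two quantities behind this output can be written out as follows. This is a sketch under the usual Gaussian linear model assumptions, not something taken from the original analysis. The symbols are notation introduced here: $n$ is the number of observations, $M$ a candidate subset of predictors, $k_M$ its number of estimated coefficients and $\mathrm{RSS}_M$ its residual sum of squares.

```latex
\begin{align*}
\mathrm{BIC}(M) &= n \log\!\left(\frac{\mathrm{RSS}_M}{n}\right) + k_M \log n + \text{const}, \\
t_j &= \frac{\hat\beta_j}{\mathrm{se}(\hat\beta_j)},
\qquad
t_{\text{lcavol}} = \frac{0.5258519}{0.07486323} \approx 7.024 .
\end{align*}
```

Best subset selection reports the subset $M$ with the smallest BIC. The constant does not depend on $M$, so it does not affect the choice. The t value column is simply the estimate divided by its standard error, so it measures the evidence that a coefficient is nonzero given the other variables in the model, not the variable's contribution to prediction.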
The lasso also performs variable selection. A lasso regression fitted with glmnet agrees with best subset: only ‘lcavol’, ‘lweight’ and ‘svi’ have nonzero coefficients.
##    lcavol   lweight       age      lbph       svi       lcp   gleason
## 0.4409790 0.2432206 0.0000000 0.0000000 0.3064360 0.0000000 0.0000000
##     pgg45
## 0.0000000
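The exact zeros in this output follow from the form of the lasso criterion. As a sketch, the Gaussian lasso objective that glmnet minimizes can be written as below. The notation is introduced here: $y_i$ is the response (‘lpsa’), $x_i$ the vector of $p$ predictors for observation $i$, and $\lambda \ge 0$ the penalty weight. The value of $\lambda$ used for the fit above, and how it was chosen, is not stated in the source.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The absolute-value penalty sets some coefficients exactly to zero once $\lambda$ is large enough, which is why the lasso acts as a variable selector. It also shrinks the coefficients it keeps: the three nonzero estimates here (0.441, 0.243, 0.306) are all smaller than the corresponding least squares estimates in the best subset fit (0.526, 0.662, 0.666).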
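The Random Forest importance plot shows which variables are most important in predicting ‘lpsa’. ‘lcavol’ is the most important, followed by ‘svi’ and then ‘lweight’. Note that the t-ratios in the best subset fit rank ‘lweight’ ahead of ‘svi’, so the two measures do not order the variables identically.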