The goal is to fit ridge regression (RR) from first principles.
The prostate data are available in the R CRAN package ElemStatLearn. In our analysis we ignore column 10, the training-set indicator (train) that was used in the examples in the ESL textbook.
For the record, the input variables are listed in Table 1. Recall that the output variable is lpsa, the logarithm of prostate-specific antigen (PSA).
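A minimal sketch of loading the data and dropping column 10 is given here. It assumes the ElemStatLearn package is already installed locally, and it does nothing if the package is missing.

```r
# Assumption: ElemStatLearn is installed; otherwise this block is skipped.
if (requireNamespace("ElemStatLearn", quietly = TRUE)) {
  data(prostate, package = "ElemStatLearn")
  prostate <- prostate[, -10]   # drop column 10, which is not used in this analysis
  str(prostate)                 # eight inputs followed by the output lpsa
}
```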
| Variable | Description |
|---|---|
| lcavol | log cancer volume |
| lweight | log prostate weight |
| age | age in years |
| lbph | log benign prostatic hyperplasia |
| svi | seminal vesicle invasion |
| lcp | log of capsular penetration |
| gleason | Gleason score |
| pgg45 | percent of Gleason scores 4 or 5 |
I have updated the CRAN package bestglm (Version 0.35) with two new functions, vifx() and dgrid(), for computing the variance inflation factor (VIF) and for lattice-style variance dependency plots.
Recall that VIFs exceeding 5 or 10 indicate a degree of near multicollinearity among the input variables. Interestingly, the table below shows that this is not an important consideration for this dataset: the largest VIF, for lcp, is about 3.1.
| lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 |
|---|---|---|---|---|---|---|---|
| 2.102650 | 1.453325 | 1.336099 | 1.385040 | 1.955928 | 3.097954 | 2.468891 | 2.974075 |
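For reference, the VIF of input $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing input $j$ on the remaining inputs; equivalently, it is the $j$-th diagonal element of the inverse correlation matrix of the inputs. The sketch below illustrates that identity on simulated inputs, not on the prostate data. The helper name `vif_from_cor` is hypothetical and is not the vifx() function from bestglm.

```r
# VIF of each column of a numeric input matrix X, read off the diagonal
# of the inverse correlation matrix. (Hypothetical helper, not bestglm::vifx.)
vif_from_cor <- function(X) {
  v <- diag(solve(cor(X)))
  names(v) <- colnames(X)
  v
}

# Simulated inputs: x3 is constructed to be nearly collinear with x1.
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + 0.3 * rnorm(100)
X <- cbind(x1 = x1, x2 = x2, x3 = x3)
vif_from_cor(X)

# Cross-check for x3 against the definition 1 / (1 - R^2).
1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
```

The two computations should agree for x3 up to rounding error, and the VIFs for x1 and x3 should be much larger than the one for x2.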
The dependency plots reveal interesting features. The variables svi and gleason appear to be categorical, so treating gleason as a quantitative variable may be inappropriate. The variables lcp, pgg45 and lbph exhibit left-censoring, and pgg45 possibly shows a nonlinear relationship. The variables lcavol and lweight both have a strong relationship with the output lpsa.
From first principles, we compute the ridge regression trace plot.
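One way to do this, sketched here under stated assumptions, is to standardize the inputs, center the output, and evaluate the ridge estimator over a grid of penalties $\lambda$:

$$
\hat\beta(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y = V \,\mathrm{diag}\!\left(\frac{d_j}{d_j^2 + \lambda}\right) U^\top y,
$$

where $X = U D V^\top$ is the singular value decomposition of the standardized input matrix and $d_j$ are its singular values. The function name `ridge_trace`, the scaling convention (`scale()` with its default standard deviation), the penalty grid and the simulated data are all assumptions made for illustration; they are not taken from the analysis above.

```r
# Ridge coefficients over a grid of penalties, computed from the SVD of the
# standardized inputs. X: numeric input matrix; y: numeric output vector.
ridge_trace <- function(X, y, lambdas) {
  Xs  <- scale(X)               # center and scale each input column
  yc  <- y - mean(y)            # center the output so no intercept is needed
  s   <- svd(Xs)                # Xs = U D V'
  uty <- crossprod(s$u, yc)     # U'y as a p x 1 matrix
  # For each lambda: beta = V diag(d / (d^2 + lambda)) U'y
  B <- sapply(lambdas, function(lam) s$v %*% (s$d / (s$d^2 + lam) * uty))
  rownames(B) <- colnames(X)
  B                             # p x length(lambdas) matrix of coefficients
}

# Simulated example (not the prostate data).
set.seed(1)
X <- matrix(rnorm(50 * 3), 50, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- drop(X %*% c(1, 0.5, 0)) + rnorm(50)
lambdas <- 10^seq(-2, 4, length.out = 50)
B <- ridge_trace(X, y, lambdas)

# Trace plot: one curve per input, coefficients shrinking toward zero as lambda grows.
matplot(log10(lambdas), t(B), type = "l", lty = 1,
        xlab = "log10(lambda)", ylab = "standardized coefficient")
```

At $\lambda = 0$ the function reproduces the least-squares fit on the standardized inputs, which gives a simple check of the implementation.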