Purpose

To fit Ridge Regression (RR) from first principles.

Prostate dataset and initial data analysis

The prostate data are available in the R CRAN package ElemStatLearn. In our analysis we ignore column 10, the training-set indicator that was used in the examples in the ESL textbook.
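A minimal sketch of loading the data, assuming the ElemStatLearn package is installed locally (it may have to be installed from the CRAN archive). The object name prostate_xy is introduced here for illustration only.

```r
# Assumption: ElemStatLearn is installed; nothing is loaded if it is missing.
prostate_xy <- NULL
if (requireNamespace("ElemStatLearn", quietly = TRUE)) {
  data("prostate", package = "ElemStatLearn")
  # Drop column 10 (assumed to be the train/test indicator used in ESL),
  # keeping the eight inputs and the output lpsa.
  prostate_xy <- prostate[, -10]
  str(prostate_xy)
}
```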

For the record, the input variables are listed in Table 1. Recall that the output variable is lpsa (logarithm of PSA).

Table 1. Prostate data input variables.
Variable   Description
lcavol     log cancer volume
lweight    log prostate weight
age        age in years
lbph       log benign prostatic hyperplasia
svi        seminal vesicle invasion
lcp        log of capsular penetration
gleason    Gleason score
pgg45      percent of Gleason scores 4 or 5


I have updated the CRAN package bestglm (Version 0.35) with two new functions, vifx() and dgrid(), for computing the VIF and for drawing lattice-style variance dependency plots.

Recall that VIFs exceeding 5 or 10 indicate a degree of near multicollinearity in the input variables. Interestingly, Table 2 shows that this is not an important consideration for this dataset: the largest VIF, for lcp, is about 3.1.

Table 2. VIF for prostate inputs.
##   lcavol  lweight      age     lbph      svi      lcp  gleason    pgg45 
## 2.102650 1.453325 1.336099 1.385040 1.955928 3.097954 2.468891 2.974075
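As a reminder of what the VIF measures, here is a minimal base-R sketch of the textbook definition, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing input j on the other inputs. It runs on simulated stand-in data, not the prostate data, and is not necessarily how vifx() is implemented; all object names are hypothetical.

```r
# Simulated stand-in inputs (hypothetical; not the prostate measurements).
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# Definition: regress each input on the remaining inputs, then 1 / (1 - R^2).
vif_by_regression <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif_by_regression) <- colnames(X)

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vif_by_inverse <- diag(solve(cor(X)))

print(round(vif_by_regression, 3))
```

The two computations should agree up to rounding error, and every VIF is at least 1.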


The dependency plots reveal interesting features. The variables svi and gleason appear to be categorical, so treating gleason as a quantitative variable may be inappropriate. The variables lcp, pgg45 and lbph exhibit left-censoring. Nonlinearity is possibly indicated with pgg45. The variables lcavol and lweight both have a strong relationship with the output lpsa.

Figure 1. Dependency Plots for Prostate Data.


Ridge trace plot

From first principles, we compute the ridge trace plot, which shows the RR coefficient estimates as a function of the penalty parameter.
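One way such a first-principles computation can be sketched in base R is with the closed-form estimate (Z'Z + lambda I)^{-1} Z'y on standardized inputs Z and a centered output. The data below are simulated stand-ins, and the lambda grid, the function ridge_coef() and the other names are assumptions made for illustration.

```r
# Simulated stand-in data (hypothetical; not the prostate measurements).
set.seed(2)
n <- 97
p <- 4
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
X[, 2] <- X[, 1] + 0.5 * X[, 2]          # make two inputs correlated
y <- drop(X %*% c(1, 0.5, 0, -0.5) + rnorm(n))

Z  <- scale(X)        # centered inputs with unit variance
yc <- y - mean(y)     # centered output, so no intercept is needed

# Closed-form ridge estimate for a single penalty value lambda.
ridge_coef <- function(Z, yc, lambda) {
  drop(solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, yc)))
}

# Coefficient paths over a log-spaced grid of penalties (one row per lambda).
lambdas     <- 10^seq(-2, 4, length.out = 50)
ridge_trace <- t(sapply(lambdas, function(l) ridge_coef(Z, yc, l)))

matplot(log10(lambdas), ridge_trace, type = "l", lty = 1, col = 1:p,
        xlab = "log10(lambda)", ylab = "standardized coefficient")
abline(h = 0, col = "grey")
legend("topright", legend = colnames(Z), col = 1:p, lty = 1, bty = "n")
```

At lambda = 0 the estimate reduces to ordinary least squares, and the coefficients shrink toward zero as lambda grows.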

Figure 2. Ridge trace for prostate data.



Modern Style for Ridge Trace plot

Figure 3. Ridge trace with degrees of freedom (DF).
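Assuming DF here means the effective degrees of freedom of the ridge fit (the trace of the hat matrix, as in ESL), it can be computed from the singular values d_j of the standardized input matrix as df(lambda) = sum_j d_j^2 / (d_j^2 + lambda). The sketch below uses simulated stand-in data, and the names ridge_df, lambda0 and df_trace are hypothetical.

```r
# Simulated stand-in inputs (hypothetical; not the prostate measurements).
set.seed(3)
n <- 97
p <- 4
Z <- scale(matrix(rnorm(n * p), n, p))

# Effective degrees of freedom from the singular values of Z.
d <- svd(Z)$d
ridge_df <- function(lambda, d) sum(d^2 / (d^2 + lambda))

lambdas <- 10^seq(-2, 4, length.out = 50)
df_path <- sapply(lambdas, ridge_df, d = d)

# Cross-check for one penalty value: trace of Z (Z'Z + lambda I)^{-1} Z'.
lambda0  <- 10
H        <- Z %*% solve(crossprod(Z) + lambda0 * diag(p), t(Z))
df_trace <- sum(diag(H))

print(c(svd_formula = ridge_df(lambda0, d), hat_trace = df_trace))
```

Because df(lambda) decreases monotonically from p at lambda = 0 toward 0, it can replace lambda on the horizontal axis of the trace plot, which puts the coefficient paths on a more interpretable scale.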