---
title: "RR and OLS with prostate dataset"
subtitle: R Notebook
author: "A. I. McLeod"
date: "March 6, 2018"
output:
  pdf_document:
    fig_caption: yes
header-includes:
- \usepackage{color}
- \usepackage{titling}
- \usepackage{float}
---

```{r setup, include=FALSE}
library(knitr)
library(caret)
library(lattice) #xyplot.resamples()
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.pos = "H")
```

## Prostate dataset and initial data analysis

We may obtain the prostate data from the CRAN package *ElemStatLearn*. In our analysis we ignore column 10, which was used in the examples in the ESL textbook. For the record, the input variables are shown in Table 1. Recall that the output variable is lpsa (logarithm of PSA).

###### Table 1. Prostate Data Variables

```{r Table1-ProstateVars, results="asis", echo=FALSE}
vars <- c("lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")
varsD <- c("log cancer volume", "log prostate weight", "age in years",
           "log benign prostatic hyperplasia", "seminal vesicle invasion",
           "log of capsular penetration", "Gleason score",
           "percentage of Gleason scores 4 or 5")
dvf <- as.data.frame.matrix(matrix(c(vars, varsD), ncol=2))
names(dvf) <- c("Variable", "Description")
kable(dvf, format="latex", row.names=FALSE, col.names=c("Variable", "Description"))
```

Recall that VIFs exceeding 5 or 10 indicate a degree of near multicollinearity among the input variables. From the table below, this is not an important consideration for this dataset.

###### Table 2. VIF Prostate Dataset

```{r ProstateVIF, echo=FALSE, results="asis", fig.pos="H", out.extra = ''}
prostate <- ElemStatLearn::prostate[,-10] #remove 10th column
prostate <- scale(prostate) #to compare with RR
out <- matrix(round(bestglm::vifx(prostate[,-9]), 1), nrow=1)
kable(out, format="latex", col.names=vars)
```

## Using caret to compare RR and OLS

```{r FitLmETC, echo=FALSE, cache=TRUE}
data <- prostate
X <- data[,1:8]
y <- data[,9]
set.seed(333477)
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
ansOLS <- train(x=X, y=y, method="lm", trControl=ctrl)
```

The code snippet below shows how to use caret::train() for RR.

```{r FitRR, cache=TRUE}
set.seed(333477)
tuneGrid <- data.frame(lambda=seq(0, 0.1, length=25))
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
ansRR <- train(x=X, y=y, method="ridge", tuneGrid=tuneGrid,
               preProc = c("center", "scale"), trControl=ctrl)
```

The optimal tuning parameter using 10-fold CV with 7 repeats is lambda = `r ansRR$bestTune$lambda`.

Often a suitable visualization is better than reporting tables or p-values. Here we use the Bland-Altman plot.

```{r CompareMethods, echo=FALSE}
ansCompare <- resamples(list(OLS = ansOLS, RR = ansRR))
pvalue <- summary(diff(ansCompare))$table$RMSE[2,1]
trellis.par.set(plot.symbol = list(pch = 18, cex=1.5)) #hack!
xyplot(ansCompare, what="BlandAltman", metric="RMSE")
```

We can also use summary.resamples() to produce a more detailed table, but this is not as important as the plot.

###### Table 3. RMSE Comparisons

```{r CompareTable, echo=FALSE, results="asis"}
tb <- round(summary(ansCompare)$statistics$RMSE, 3)
kable(tb, format="latex")
```

The p-value for the difference in RMSE between RR and OLS is `r round(100*as.numeric(pvalue), 1)`%. The caret package also supports such p-values through diff.resamples(). The Bland-Altman plot is much more informative than merely quoting a p-value.
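For completeness, the full set of pairwise comparisons behind this p-value can be printed directly. A minimal sketch (not evaluated here), assuming only the ansCompare object created above:

```{r DiffSummary, eval=FALSE}
# Summarize the pairwise differences between the resampled metrics for
# OLS and RR; the printed tables report the estimated differences in the
# upper triangle and the p-values in the lower triangle.
diffs <- diff(ansCompare)
summary(diffs)
```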
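As a final usage note, either tuned model can be applied to new data with predict(), which for a caret train object automatically uses the bestTune parameters. The sketch below is illustrative only; reusing the training matrix X as stand-in new data is an assumption for demonstration, so the resulting in-sample RMSE is optimistic relative to the repeated-CV estimates compared above.

```{r PredictSketch, eval=FALSE}
# Predictions from each tuned model; X stands in for genuinely new data.
predOLS <- predict(ansOLS, newdata = X)
predRR  <- predict(ansRR,  newdata = X)
# In-sample RMSE for reference only (optimistic versus the CV estimates).
sqrt(mean((y - predRR)^2))
```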
George Box has stated that perhaps the only good use of p-values is for model diagnostic checking, since the goal of statistical analysis should be to develop a model and make inferences, such as confidence intervals on the parameters -- no p-values! I also recommend reading the [ASA statement on p-values](https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf).