---
title: "RR and OLS with prostate dataset"
subtitle: R Notebook
author: "A. I. McLeod"
date: "March 6, 2018"
output:
  pdf_document:
    fig_caption: yes
header-includes:
- \usepackage{color}
- \usepackage{titling}
- \usepackage{float}
---

```{r setup, include=FALSE}
library(knitr)
library(caret)
library(lattice) #xyplot.resamples()
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.pos = "H")
```

## Prostate dataset and initial data analysis

We may obtain the prostate data from the CRAN package *ElemStatLearn*. In our analysis we ignore column 10, which was used in the examples in the ESL textbook. For the record, the input variables are shown in Table 1. Recall that the output variable is lpsa (logarithm of PSA).

###### Table 1. Prostate Data Variables

```{r Table1-ProstateVars, results="asis", echo=FALSE}
vars <- c("lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")
varsD <- c("log cancer volume", "log prostate weight", "age in years",
           "log benign prostatic hyperplasia", "seminal vesicle invasion",
           "log of capsular penetration", "Gleason score",
           "percentage of Gleason scores 4 or 5")
dvf <- as.data.frame.matrix(matrix(c(vars, varsD), ncol=2))
names(dvf) <- c("Variable", "Description")
kable(dvf, format="latex", row.names=FALSE, col.names=c("Variable", "Description"))
```

Recall that VIFs exceeding 5 or 10 indicate a degree of near multicollinearity among the input variables. From the table below, this is not an important consideration for this dataset.

###### Table 2. VIF Prostate Dataset

```{r ProstateVIF, echo=FALSE, results="asis", fig.pos="H", out.extra = ''}
prostate <- ElemStatLearn::prostate[,-10] #remove 10th column
prostate <- scale(prostate) #to compare with RR
out <- matrix(round(bestglm::vifx(prostate[,-9]), 1), nrow=1)
kable(out, format="latex", col.names=vars)
```

## Using caret to compare RR and OLS

```{r FitLmETC, echo=FALSE, cache=TRUE}
data <- prostate
X <- data[,1:8]
y <- data[,9]
set.seed(333477)
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
ansOLS <- train(x=X, y=y, method="lm", trControl=ctrl)
```

The code snippet below shows how to use caret::train() for RR.

```{r FitRR, cache=TRUE}
set.seed(333477)
tuneGrid <- data.frame(lambda=seq(0, 0.1, length=25))
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
ansRR <- train(x=X, y=y, method="ridge", tuneGrid=tuneGrid,
               preProc = c("center", "scale"), trControl=ctrl)
```

The optimal tuning parameter using 10-fold CV with 7 repeats is lambda = `r ansRR$bestTune$lambda`.

Often a suitable visualization is better than reporting tables or p-values. Here we use the Bland-Altman plot.

```{r CompareMethods, echo=FALSE}
ansCompare <- resamples(list(OLS = ansOLS, RR = ansRR))
pvalue <- summary(diff(ansCompare))$table$RMSE[2,1]
trellis.par.set(plot.symbol = list(pch = 18, cex=1.5)) #hack!
xyplot(ansCompare, what="BlandAltman", metric="RMSE")
```

We can also use summary.resamples() to produce a more detailed table, but this is not as important as the plot.

###### Table 3. RMSE Comparisons

```{r CompareTable, echo=FALSE, results="asis"}
tb <- round(summary(ansCompare)$statistics$RMSE, 3)
kable(tb, format="latex")
```

The p-value for the difference in RMSE between RR and OLS is `r round(100*as.numeric(pvalue), 1)`%. The caret package also supports such p-values through diff.resamples(). The Bland-Altman plot is much more informative than merely quoting a p-value.
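For completeness, the full set of pairwise comparisons behind this p-value can be printed directly. A minimal sketch (not evaluated here), assuming only the ansCompare object created above:

```{r DiffSummary, eval=FALSE}
# Summarize the pairwise differences between the resampled metrics for
# OLS and RR; the printed tables report the estimated differences in the
# upper triangle and the p-values in the lower triangle.
diffs <- diff(ansCompare)
summary(diffs)
```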
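As a final usage note, either tuned model can be applied to new data with predict(), which for a caret train object automatically uses the bestTune parameters. The sketch below is illustrative only; reusing the training matrix X as stand-in new data is an assumption for demonstration, so the resulting in-sample RMSE is optimistic relative to the repeated-CV estimates compared above.

```{r PredictSketch, eval=FALSE}
# Predictions from each tuned model; X stands in for genuinely new data.
predOLS <- predict(ansOLS, newdata = X)
predRR  <- predict(ansRR,  newdata = X)
# In-sample RMSE for reference only (optimistic versus the CV estimates).
sqrt(mean((y - predRR)^2))
```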
George Box has stated that perhaps the only good use of p-values is for model diagnostic checking, since the goal of statistical analysis should be to develop a model and make inferences, such as confidence intervals on the parameters -- no p-values! I also recommend reading the [ASA statement on p-values](https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf).