---
title: "Fitting MLP and MARS to rmix Dataset"
author: "A. I. McLeod"
date: "March 25, 2018"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Remark: assume the xtable package has been installed
library("caret")
library("NeuralNetTools")
#library("nnet")
XyTe <- read.csv(
  "http://www.stats.uwo.ca/faculty/aim/2018/4850G/data/rmixTe.csv",
  header=TRUE)
XyTr <- read.csv(
  "http://www.stats.uwo.ca/faculty/aim/2018/4850G/data/rmixTr.csv",
  header=TRUE)
#
XyTr[,3] <- factor(XyTr[,3])
XyTe[,3] <- factor(XyTe[,3])
```

In our previous lecture note, *15_2_kNNApplied_Mixture*, we compared a logistic classifier and a kNN classifier on a training sample of size $n=200$ generated by **gencve::rmix()**. We evaluated the predictors on a test sample of size 20,000 and found prediction errors of 27.10\% and 26.15\%, where $k=7$ was selected using the pseudo-MLE method. Since the test sample is so large, the 95\% MOE is *less than* 0.0045, so the observed difference of about 0.95\% exceeds the MOE and is not due to randomness. By comparison, the theoretical optimum misclassification error rate (the Bayes error rate) was shown to be 20.76\%.

In this lecture we fit a multilayer perceptron (MLP) to these data using the functions **nnet::nnet()** and **RSNNS::mlp()**. Both of these functions are available through the **caret** package via the methods **nnet** and **mlpWeightDecay**. When using caret, you should also examine the arguments of the underlying functions that caret calls, since these may be helpful in fine-tuning the model. On perusal of the documentation provided by **caret::train()** for the "nnet" method, we set the tuning parameters **size** and **decay**. I experimented with several settings to find ones that worked well, since we do not want a selected tuning parameter that lies on the boundary of the grid!
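As an aside, the 95\% margin of error for a single misclassification rate estimated from $n$ independent test cases is $1.96\sqrt{\hat{\eta}(1-\hat{\eta})/n}$; this is the same formula used for `MOE1` and `MOE2` in the chunks that follow. A minimal sketch (not evaluated here; the rates are the ones quoted above):

```{r MOEsketch, eval=FALSE}
# 95% MOE for an error rate estimated from n independent test cases
n <- 20000
eta <- c(logistic=0.2710, kNN=0.2615)
MOE <- 1.96*sqrt(eta*(1-eta)/n)
round(MOE, 4)
```

With $n$ = 20,000 the MOE is well under one percentage point, which is why differences of this size between classifiers can be taken seriously.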
\newpage

## Fitting using caret and nnet::nnet()

Setting **metric="Accuracy"** and making the output variable a factor ensures that caret/nnet treats this as a classification problem. Additional settings passed through to nnet are **trace=FALSE**, **skip=TRUE**, and **maxit=2000**. The skip-layer setting adds direct connections from the inputs to the output. The fitted model is summarized below.

```{r NNET, echo=FALSE, cache=TRUE}
#ctrl <- trainControl(method="repeatedcv", number=10, repeats=7,
#                     summaryFunction=twoClassSummary) #produces error
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
set.seed(3334)
tuneGrid <- expand.grid(.size=1:4, .decay=c(0, 0.1, 0.25, 0.5, 0.75))
ans1 <- train(x=XyTr[,1:2], y=XyTr[,3],
              #preProc=c("center", "scale"), #doesn't hurt, but not used here!
              method="nnet", metric="Accuracy",
              tuneGrid=tuneGrid, trControl=ctrl,
              #these args are passed through to nnet
              trace=FALSE,
              skip=TRUE, #usually set to TRUE
              maxit=2000)
summary(ans1)
```

Figure 1 below shows a schematic diagram of the fitted MLP drawn with the **NeuralNetTools** package.

```{r PLOT-nnet, echo=FALSE, fig.height=4, fig.pos="H", fig.cap="Fitted MLP using nnet", cache=TRUE}
yH1 <- predict(ans1, newdata=XyTe[,1:2])
accuracy1 <- mean(yH1==factor(XyTe[,3]))
eta1 <- 1-accuracy1
cftb1 <- table(XyTe[,3], yH1, dnn=c("Truth", "Predicted"))
MOE1 <- 1.96*sqrt(eta1*(1-eta1)/nrow(XyTe))
ci1 <- eta1+c(-1,1)*MOE1
plotnet(ans1$finalModel, x_names=c("x1", "x2")) #black=+ve weight, gray=-ve
```

\newpage

The fitted model was used to predict on the 20,000 test instances and the observed misclassification rate was $\hat{\eta}$ = `r round(100*eta1,2)`\%. The confusion matrix is shown in Table 1 below.

```{r ConfusionMatrix1, echo=FALSE, results="asis"}
out <- xtable::xtable(cftb1, caption="Model fitted using nnet")
print(out, comment=FALSE, type="latex")
```

\newpage

## Fitting using caret and RSNNS::mlp()

From Figure 2 we see that a different model was selected.
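To verify that the selected tuning parameters are interior points of the grid, as discussed above, the resampling results stored in the train object can be inspected. A minimal sketch (not run here):

```{r InspectTuning, eval=FALSE}
ans1$bestTune       # size and decay selected by repeated cross-validation
head(ans1$results)  # accuracy and kappa for each point on the tuning grid
plot(ans1)          # accuracy profile across the tuning grid
```

If the selected **size** or **decay** equals the largest or smallest value in `tuneGrid`, the grid should be extended and the model refit.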
This is not surprising, since a completely different optimization algorithm was used; indeed, different local minima may be found even by the same algorithm, because the initial weights are chosen randomly. Most local minima yield fits that are nearly as good.

```{r MLP, echo=FALSE, cache=TRUE}
ctrl <- trainControl(method="repeatedcv", number=10, repeats=7)
set.seed(7357351)
tuneGrid <- expand.grid(.size=1:4, .decay=c(0, 0.1, 0.25, 0.5, 0.75))
ans2 <- train(x=XyTr[,1:2], y=XyTr[,3],
              #preProc=c("center", "scale"), #doesn't hurt
              method="mlpWeightDecay", metric="Accuracy",
              tuneGrid=tuneGrid, trControl=ctrl)
#summary(ans2)
```

```{r PLOT-mlp, echo=FALSE, fig.height=4, fig.width=4, fig.pos="H", fig.cap="Fitted MLP using RSNNS::mlp(). O1 and O2 are class probabilities.", cache=TRUE}
#predict(ans2) #show both outputs
yH2 <- predict(ans2, newdata=XyTe[,1:2])
cftb2 <- table(XyTe[,3], yH2, dnn=c("Truth", "Predicted"))
accuracy2 <- mean(yH2==factor(XyTe[,3]))
eta2 <- 1-accuracy2
MOE2 <- 1.96*sqrt(eta2*(1-eta2)/nrow(XyTe))
ci2 <- eta2+c(-1,1)*MOE2
plotnet(ans2$finalModel, y_names=c("", ""), x_names=c("x1", "x2")) #black=+ve weight, gray=-ve
```

The fitted model was used to predict on the 20,000 test instances and the observed misclassification rate was $\hat{\eta}$ = `r round(100*eta2,2)`\%. The confusion matrix is shown in Table 2 below.

```{r ConfusionMatrix2, echo=FALSE, results="asis"}
out <- xtable::xtable(cftb2, caption="Model fitted using RSNNS::mlp()")
print(out, comment=FALSE, type="latex")
```

## MARS classifier

```{r MARS, echo=FALSE}
XyTrain <- XyTr
XyTrain[,3] <- factor(XyTrain[,3])
ans <- earth::earth(y ~ ., data=XyTrain,
                    glm=list(family=binomial(link="logit")))
XyTest <- XyTe
XyTest[,3] <- factor(XyTest[,3])
pHTe <- predict(ans, newdata=XyTest, type="response")
yHTest <- predict(ans, newdata=XyTest, type="class")
rTest <- mean(yHTest!=XyTest$y)
cftbMARS <- table(XyTest[,3], yHTest, dnn=c("Truth", "Predicted"))
```

The observed misclassification rate was $\hat{\eta}$ = `r round(100*rTest,2)`\%.
The confusion matrix is shown in Table 3 below.

```{r ConfusionMatrixMARS, echo=FALSE, results="asis"}
out <- xtable::xtable(cftbMARS, caption="MARS Model")
print(out, comment=FALSE, type="latex")
```
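The three test-set error rates can be summarized side by side with their margins of error. A sketch, assuming the objects `eta1`, `MOE1`, `eta2`, `MOE2`, and `rTest` computed in the chunks above are available in the workspace:

```{r SummaryTable, eval=FALSE}
# collect the three estimated error rates and their 95% MOEs (in percent)
MOEmars <- 1.96*sqrt(rTest*(1-rTest)/nrow(XyTe))
cbind(eta = round(100*c(nnet=eta1, mlp=eta2, MARS=rTest), 2),
      MOE = round(100*c(MOE1, MOE2, MOEmars), 2))
```

Since all three rates are estimated on the same 20,000 test cases, differences larger than the MOEs cannot be attributed to sampling variation.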