---
title: "Churn Prediction with SVM"
author: "Ian McLeod"
date: "April 6, 2018"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(e1071)
library(C50)
library(randomForest)
data("churn", package="C50")
```

This notebook shows how an SVM may be trained and used to predict on the **churn** data that is provided in the **C50** package. We will use the well-tested functions **svm()** and **tune.svm()** that are included in the **e1071** package. The churn data has been divided into training and test portions with respective sample sizes `r nrow(churnTrain)` and `r nrow(churnTest)`.

A terse summary of the training data shows that the variables are a heterogeneous mix of numeric, integer, and categorical (factor) types. Most categorical variables have only two levels, but **state** has 51 levels.

```{r STR, echo=FALSE}
str(churnTrain)
```

## Training

With SVM the continuous input variables should always be centered and scaled, but this should be done internally using the option **scale=TRUE** with tune.svm(). It is important that the same scaling factors, location and scale, used on the training data also be used when predicting on the test data. The most important tuning parameters to estimate initially are **gamma** and **cost**. The default kernel is the radial basis function, but other kernels may be tested later to see if an improvement can be found. The function **tune.svm()** uses ten-fold CV and determines the minimum CV error for each combination of tuning parameters. A common recommendation is a grid search over the 11 settings for gamma and cost shown below.

```{r SETTINGS}
(gamma = 2^(seq(-15,15,3)))
(cost = 2^(seq(-5,15,2)))
```

This grid search requires $11 \times 11 = 121$ separate cross-validations. To save time I used smaller grids combined with an interactive search for a bounding interval. At the final stage, I used the search shown below, which itself takes several minutes. The results are summarized below, and we see the best values correspond to **gamma=0.05** and **cost=4**.

```{r TRAINING2, echo=TRUE}
set.seed(777555333)
#takes several minutes for 10-fold CV
obj <- tune.svm(churn~., data = churnTrain, scale=TRUE,
                gamma = c(0.005, 0.01, 0.05, 0.09),
                cost = c(1.0, 2.0, 4.0, 6.0))
summary(obj)
```

The figure below shows a level plot, which is a type of 3D visualization. It shows how the CV error depends on **gamma** and **cost**. This type of plot is similar to a contour plot but takes its inspiration from geographical maps, which indicate land elevation or ocean depth using shades of brown or blue. The level plot helps to visualize the surface and may be useful in refining the search. But it is important to remember that the level plot shown is stochastic, so it exhibits substantial variation due to the randomness in 10-fold CV.

```{r PLOT, echo=TRUE, eval=TRUE}
plot(obj)
```

Selecting **gamma=0.05** and **cost=4**, the SVM is fit and the confusion matrix on the *test* sample is shown.

```{r FITSVM, echo=TRUE}
#fit SVM with the tuned parameters and predict on test data
ans <- svm(churn~., data=churnTrain, gamma=0.05, cost=4, scale=TRUE)
yH <- predict(ans, newdata=churnTest)
yTe <- churnTest$churn
eta <- mean(yTe!=yH) #misclassification rate on the test data
(tb <- table(yTe, yH))
```

From the table we see that there are `r tb[1,2]+tb[2,1]` misclassifications, producing a misclassification rate of `r round(eta, 4)` with a 95% level MOE of `r round(MOE <- 1.96*sqrt(eta*(1-eta)/length(yH)),4)`.
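The same normal-approximation interval, $\hat{\eta} \pm 1.96\sqrt{\hat{\eta}(1-\hat{\eta})/n}$, is reused for each classifier in the comparison below, so a small helper can make the computation explicit. This is just a sketch; **misclassCI()** is a hypothetical name, not a function from any package.

```{r MOEHELPER}
#Hypothetical helper: 95% normal-approximation CI (in percent) for a
#misclassification rate eta estimated from n test cases
misclassCI <- function(eta, n) {
  MOE <- 1.96*sqrt(eta*(1 - eta)/n)
  round(100*(eta + c(-1, 1)*MOE), 2)
}
misclassCI(eta, length(yH)) #interval for the SVM fit above
```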
Next we compare with the classifiers C5.0 and Random Forest. We find that SVM has slightly poorer performance than either C5.0 or Random Forest and, as previously found, Random Forest slightly outperforms C5.0. It is possible that SVM may be improved by using a different kernel. In particular, **polynomial** and **sigmoid** are also popular kernels.

```{r COMPARE, echo=FALSE}
#Compare with C5.0 and RF
ETA <- numeric(3)
CI <- matrix(NA_real_, nrow=3, ncol=2)
#SVM: eta and yH carry over from the previous chunk
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI[1,] <- round(100*(eta+c(-1,1)*MOE), 2)
ETA[1] <- eta
#fit C5.0
ans <- C5.0(churn~., data=churnTrain)
yH <- predict(ans, newdata=churnTest)
eta <- mean(churnTest$churn!=yH)
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI[2,] <- round(100*(eta+c(-1,1)*MOE), 2)
ETA[2] <- eta
#fit RF
set.seed(7788833) #make reproducible since RF uses the bootstrap
ans <- randomForest(churn~., data=churnTrain, ntree=1000)
yH <- predict(ans, newdata=churnTest)
eta <- mean(churnTest$churn!=yH)
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI[3,] <- round(100*(eta+c(-1,1)*MOE), 2)
ETA[3] <- eta
tb <- cbind(matrix(round(100*ETA,2), nrow=3), CI)
colnames(tb) <- c("Misclassification%", "95% CI lower", "95% CI upper")
row.names(tb) <- c("SVM", "C5.0", "RF")
tb
```

## Conclusion

We were able to obtain reasonable results with SVM for the churn dataset, and it is possible that with further tuning we may improve the predictions for this test dataset. The most frequent difficulty with SVM classifiers is choosing adequate tuning parameters while avoiding the problem of overtraining. On the other hand, methods such as C5.0 and Random Forest take almost no effort to train and often generalize very well. The **caret** package provides better software for tuning using cross-validation. See [LINK](http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines).
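As a sketch of how that tuning might look, assuming the **caret** and **kernlab** packages are installed: caret's **"svmRadial"** method parameterizes the radial kernel with **sigma** rather than e1071's **gamma**, so the illustrative grid below is not directly comparable to the one tuned above.

```{r CARET, eval=FALSE}
library(caret)
#10-fold CV via caret; the sigma/C grid is illustrative, not tuned
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(churn ~ ., data = churnTrain, method = "svmRadial",
             preProcess = c("center", "scale"),
             tuneGrid = expand.grid(sigma = c(0.01, 0.05, 0.1),
                                    C = c(1, 2, 4, 6)),
             trControl = ctrl)
fit$bestTune #best sigma and C found by 10-fold CV
```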