---
title: "Churn Prediction with SVM"
author: "Ian McLeod"
date: "April 6, 2018"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(e1071)
library(C50)
library(randomForest)
data("churn", package="C50")
```

This notebook shows how an SVM may be trained and used to predict on the **churn** data that is provided in the **C50** package. We will use the well-tested functions **svm()** and **tune.svm()** that are included in the **e1071** package. The churn data has been divided into training and test portions with respective sample sizes `r nrow(churnTrain)` and `r nrow(churnTest)`.

A terse summary of the training data shows that the variables are a heterogeneous mix of numeric, integer, and categorical (factor) types. Most categorical variables have only two levels, but **state** has 51 levels.

```{r STR, echo=FALSE}
str(churnTrain)
```

## Training

With SVM the continuous input variables should always be centered and scaled, but this should be done internally using the option **scale=TRUE** with tune.svm(). It is important that the same scaling factors, location and scale, used on the training data also be used when predicting on the test data. The most important tuning parameters to estimate initially are **gamma** and **cost**. The default kernel is the radial basis function, but other kernels may be tested later to see if an improvement can be found. The function **tune.svm()** uses ten-fold CV and determines the minimum CV error for each combination of tuning parameters. A common recommendation is a grid search over the 11 settings for gamma and cost shown below.

```{r SETTINGS}
(gamma = 2^(seq(-15,15,3)))
(cost = 2^(seq(-5,15,2)))
```

This grid search requires $11 \times 11 = 121$ separate cross-validations. To save time I used smaller grids combined with an interactive search for a bounding interval. At the final stage, I used the search shown below, which itself takes several minutes. The results are summarized below, and we see the best values correspond to **gamma=0.05** and **cost=4**.

```{r TRAINING2, echo=TRUE}
set.seed(777555333)
#takes several minutes for 10-fold CV
obj <- tune.svm(churn~., data = churnTrain, scale=TRUE,
                gamma = c(0.005, 0.01, 0.05, 0.09),
                cost = c(1.0, 2.0, 4.0, 6.0))
summary(obj)
```

The figure below shows a level plot, which is a type of 3D visualization. It shows how the CV error depends on **gamma** and **cost**. This type of plot is similar to a contour plot but takes its inspiration from geographical maps, which indicate land elevation or ocean depth using shades of brown or blue. The level plot helps to visualize the surface and may be useful in refining the search. But it is important to remember that the level plot shown is stochastic, so it exhibits substantial variation due to the randomness in 10-fold CV.

```{r PLOT, echo=TRUE, eval=TRUE}
plot(obj)
```

Selecting **gamma=0.05** and **cost=4**, the SVM is fit and the confusion matrix on the *test* sample is shown.

```{r FITSVM, echo=TRUE}
#fit SVM with the tuned parameters and predict on test data
ans <- svm(churn~., data=churnTrain, gamma=0.05, cost=4, scale=TRUE)
yH <- predict(ans, newdata=churnTest)
yTe <- churnTest$churn
eta <- mean(yTe!=yH) #misclassification rate on the test data
(tb <- table(yTe, yH))
```

From the table we see that there are `r tb[1,2]+tb[2,1]` misclassifications, producing a misclassification rate of `r round(eta, 4)` with a 95% level MOE of `r round(MOE <- 1.96*sqrt(eta*(1-eta)/length(yH)),4)`.
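The same normal-approximation interval, $\hat{\eta} \pm 1.96\sqrt{\hat{\eta}(1-\hat{\eta})/n}$, is reused for each classifier in the comparison below, so a small helper can make the computation explicit. This is just a sketch; **misclassCI()** is a hypothetical name, not a function from any package.

```{r MOEHELPER}
#Hypothetical helper: 95% normal-approximation CI (in percent) for a
#misclassification rate eta estimated from n test cases
misclassCI <- function(eta, n) {
  MOE <- 1.96*sqrt(eta*(1 - eta)/n)
  round(100*(eta + c(-1, 1)*MOE), 2)
}
misclassCI(eta, length(yH)) #interval for the SVM fit above
```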
Next we compare with the classifiers C5.0 and Random Forest. We find that SVM has slightly poorer performance than either C5.0 or Random Forest and, as previously found, Random Forest slightly outperforms C5.0. It is possible that SVM may be improved by using a different kernel. In particular, **polynomial** and **sigmoid** are also popular kernels.

```{r COMPARE, echo=FALSE}
#Compare with C5.0 and RF
ETA <- numeric(3)
CI <- matrix(NA_real_, nrow=3, ncol=2)
#SVM: eta and yH carry over from the previous chunk
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI[1,] <- round(100*(eta+c(-1,1)*MOE), 2)
ETA[1] <- eta
#fit C5.0
ans <- C5.0(churn~., data=churnTrain)
yH <- predict(ans, newdata=churnTest)
eta <- mean(churnTest$churn!=yH)
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI[2,] <- round(100*(eta+c(-1,1)*MOE), 2)
ETA[2] <- eta
#fit RF
set.seed(7788833) #make reproducible since RF uses the bootstrap
ans <- randomForest(churn~., data=churnTrain, ntree=1000)
yH <- predict(ans, newdata=churnTest)
eta <- mean(churnTest$churn!=yH)
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI[3,] <- round(100*(eta+c(-1,1)*MOE), 2)
ETA[3] <- eta
tb <- cbind(matrix(round(100*ETA,2), nrow=3), CI)
colnames(tb) <- c("Misclassification%", "95% CI lower", "95% CI upper")
row.names(tb) <- c("SVM", "C5.0", "RF")
tb
```

## Conclusion

We were able to obtain reasonable results with SVM for the churn dataset, and it is possible that with further tuning we may improve the predictions for this test dataset. The most frequent difficulty with SVM classifiers is choosing adequate tuning parameters while avoiding the problem of overtraining. On the other hand, methods such as C5.0 and Random Forest take almost no effort to train and often generalize very well. The **caret** package provides better software for tuning using cross-validation. See [LINK](http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines).
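As a sketch of how that tuning might look, assuming the **caret** and **kernlab** packages are installed: caret's **"svmRadial"** method parameterizes the radial kernel with **sigma** rather than e1071's **gamma**, so the illustrative grid below is not directly comparable to the one tuned above.

```{r CARET, eval=FALSE}
library(caret)
#10-fold CV via caret; the sigma/C grid is illustrative, not tuned
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(churn ~ ., data = churnTrain, method = "svmRadial",
             preProcess = c("center", "scale"),
             tuneGrid = expand.grid(sigma = c(0.01, 0.05, 0.1),
                                    C = c(1, 2, 4, 6)),
             trControl = ctrl)
fit$bestTune #best sigma and C found by 10-fold CV
```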