This notebook shows how an SVM may be trained and used to predict churn on the churn data provided in the C50 package. We will use the well-tested functions svm() and tune.svm() from the e1071 package.
The **churn** data has been divided into training and test portions with respective sample sizes 3333 and 1667.
A terse summary of the training data shows the variables are a heterogeneous mix of numeric, integer and categorical (factor). Most categorical variables have only two levels, but state has 51 levels.
## 'data.frame': 3333 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
## $ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
## $ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_minutes : num 265 162 243 299 167 ...
## $ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
## $ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_minutes : num 245 254 163 197 187 ...
## $ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
With SVM the continuous input variables should always be centered and scaled; this is done internally by using the option scale=TRUE with tune.svm() and svm(). It is important that the same scaling factors, location and scale, estimated from the training data be used when predicting on the test data.
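With scale=TRUE the fitted svm object stores the training location and scale, and predict() reapplies them automatically. As a minimal sketch of the manual equivalent (assuming churnTrain and churnTest are loaded as in this notebook):

```r
# Manual centering/scaling sketch: compute location and scale on the
# training data only, then apply those same factors to the test data.
num <- sapply(churnTrain, is.numeric)            # numeric/integer columns only
ctr <- colMeans(churnTrain[, num])               # training means (location)
scl <- apply(churnTrain[, num], 2, sd)           # training standard deviations (scale)
trainScaled <- scale(churnTrain[, num], center = ctr, scale = scl)
testScaled  <- scale(churnTest[, num],  center = ctr, scale = scl)
```

Note that the test data is never used to estimate the centering or scaling factors; doing so would leak information from the test set into the model.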
The most important tuning parameters to estimate initially are the gamma and cost parameters. The default kernel is the radial basis function but other kernels may be tested later to see if an improvement can be found.
The function tune.svm() uses ten-fold CV and determines the minimum CV error for each combination of tuning parameters. A good recommendation is a grid search over the 11 settings each for gamma and cost shown below.
(gamma = 2^(seq(-15,15,3)))
## [1] 3.051758e-05 2.441406e-04 1.953125e-03 1.562500e-02 1.250000e-01
## [6] 1.000000e+00 8.000000e+00 6.400000e+01 5.120000e+02 4.096000e+03
## [11] 3.276800e+04
(cost = 2^(seq(-5,15,2)))
## [1] 3.1250e-02 1.2500e-01 5.0000e-01 2.0000e+00 8.0000e+00 3.2000e+01
## [7] 1.2800e+02 5.1200e+02 2.0480e+03 8.1920e+03 3.2768e+04
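For reference, the full grid search over both sequences could be written as below; this runs the 121 ten-fold cross-validations mentioned in the text and is therefore slow (the object name objFull is illustrative):

```r
# Full 11 x 11 grid search over the gamma and cost sequences shown above.
# Warning: 121 separate 10-fold cross-validations -- expect a long runtime.
library(e1071)
objFull <- tune.svm(churn ~ ., data = churnTrain, scale = TRUE,
                    gamma = 2^seq(-15, 15, 3),
                    cost  = 2^seq(-5, 15, 2))
summary(objFull)
```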
This grid search requires \(11 \times 11 = 121\) separate cross-validations. To save time I used smaller grids combined with an interactive search for a bounding interval. At the final stage I used the search shown below, which itself takes several minutes. The results are summarized below, and we see the best values correspond to gamma=0.05 and cost=4.
set.seed(777555333) #takes several minutes for 10-fold CV.
obj <- tune.svm(churn~., data = churnTrain, scale=TRUE,
gamma = c(0.005, 0.01, 0.05, 0.09), cost = c(1.0,2.0,4.0,6.0))
summary(obj)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.05 4
##
## - best performance: 0.08100466
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.005 1 0.14461348 0.02024087
## 2 0.010 1 0.12391254 0.01861701
## 3 0.050 1 0.08970527 0.01581927
## 4 0.090 1 0.08850857 0.01733489
## 5 0.005 2 0.13621136 0.01897863
## 6 0.010 2 0.10741220 0.01326667
## 7 0.050 2 0.08160526 0.01808895
## 8 0.090 2 0.08100735 0.01865069
## 9 0.005 4 0.11581342 0.01712226
## 10 0.010 4 0.09960709 0.01237262
## 11 0.050 4 0.08100466 0.01759456
## 12 0.090 4 0.08130676 0.01268190
## 13 0.005 6 0.10861071 0.01165150
## 14 0.010 6 0.09570499 0.01076463
## 15 0.050 6 0.08160616 0.01309471
## 16 0.090 6 0.08400946 0.01282314
The figure below shows a level plot, which is a type of 3D visualization. It shows how the CV error depends on gamma and cost. This type of plot is similar to a contour plot but takes its inspiration from geographical maps, which indicate land elevation or ocean depth using shades of brown or blue. The level plot helps to visualize the error surface and may be useful in refining the search. But it is important to remember that the level plot shown is stochastic, so it exhibits substantial variation due to the randomness in 10-fold CV.
plot(obj)
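The winning settings can also be read off the tuned object programmatically; best.parameters and best.performance are standard components of the object returned by tune.svm():

```r
# Extract the best tuning parameters and the corresponding CV error
# directly from the tune.svm() result rather than reading the summary.
obj$best.parameters     # one-row data frame with gamma and cost
obj$best.performance    # minimum 10-fold CV error
gammaBest <- obj$best.parameters$gamma
costBest  <- obj$best.parameters$cost
```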
Selecting gamma=0.05 and cost=4, the SVM is fit and the confusion matrix for the test sample is shown.
#fit SVM and predict on test data
ans <- svm(churn~., data=churnTrain, gamma=0.05, cost=4, scale=TRUE)
yH <- predict(ans, newdata=churnTest)
yTe <- churnTest$churn
eta <- mean(yTe!=yH)
(tb <- table(yTe, yH))
## yH
## yTe yes no
## yes 136 88
## no 24 1419
From the table we see that there are 112 misclassifications, producing a misclassification rate of 0.0672 with a 95% margin of error of 0.012.
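These figures can be recomputed directly from the printed confusion matrix using the usual normal-approximation margin of error, \(1.96\sqrt{\hat\eta(1-\hat\eta)/n}\):

```r
# Reconstruct the confusion matrix printed above and recompute the
# misclassification rate and its 95% margin of error.
tb <- matrix(c(136, 24, 88, 1419), nrow = 2,
             dimnames = list(yTe = c("yes", "no"), yH = c("yes", "no")))
n   <- sum(tb)                             # 1667 test cases
err <- n - sum(diag(tb))                   # 112 misclassifications
eta <- err / n                             # misclassification rate, about 0.0672
moe <- 1.96 * sqrt(eta * (1 - eta) / n)    # about 0.012
c(rate = eta, moe = moe)
```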
Next we compare with the classifiers C5.0 and Random Forest. We find that SVM has slightly poorer performance than either C5.0 or Random Forest, and, as previously found, Random Forest slightly outperforms C5.0. It is possible that SVM may be improved by using a different kernel; in particular, the polynomial and sigmoid kernels are also popular.
## Misclassification% 95% CI lower 95% CI upper
## SVM 6.72 5.52 7.92
## C5.0 5.28 4.08 6.48
## RF 4.80 3.60 6.00
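As noted above, alternative kernels could be tried. A hedged sketch is shown below; the kernel argument is a documented option of tune.svm(), but the degree, gamma, and cost grids here are illustrative assumptions, not values validated in this notebook:

```r
# Illustrative only: tune an SVM with a polynomial kernel.
# The degree/gamma/cost grids below are assumptions for demonstration.
objPoly <- tune.svm(churn ~ ., data = churnTrain, scale = TRUE,
                    kernel = "polynomial", degree = 2:4,
                    gamma = c(0.01, 0.05), cost = c(1, 4))
summary(objPoly)
```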
We were able to obtain reasonable results with SVM on the churn dataset, and it is possible that with further tuning we may improve the predictions on this test dataset. However, this tuning step can easily turn into data dredging (https://en.wikipedia.org/wiki/Data_dredging), resulting in an overfit model that does not generalize well. The most frequent difficulty with SVM classifiers is choosing adequate tuning parameters while avoiding the problem of overtraining.
On the other hand, methods such as C5.0 and Random Forest take almost no effort to train and often generalize very well.
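To illustrate the low-effort point, both comparison models can be fit with their defaults in a few lines (a sketch assuming the C50 and randomForest packages are installed and churnTrain/churnTest are loaded as above):

```r
# Near-default fits of the two comparison classifiers; no tuning required.
library(C50)
library(randomForest)
fitC5 <- C5.0(churn ~ ., data = churnTrain)
fitRF <- randomForest(churn ~ ., data = churnTrain)
# Test-set misclassification rates for comparison with the SVM above.
mean(predict(fitC5, newdata = churnTest) != churnTest$churn)
mean(predict(fitRF, newdata = churnTest) != churnTest$churn)
```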