This notebook shows how an SVM may be trained and used to predict churn on the churn data provided in the C50 package. We will use the well-tested functions svm() and tune.svm() from the e1071 package.
The **churn** data has been divided into training and test portions with respective sample sizes 3333 and 1667.
A terse summary of the training data shows the variables are a heterogeneous mix of numeric, integer and categorical (factor). Most categorical variables have only two levels, but state has 51 levels.
## 'data.frame': 3333 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
## $ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
## $ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_minutes : num 265 162 243 299 167 ...
## $ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
## $ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_minutes : num 245 254 163 197 187 ...
## $ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
With SVM the continuous input variables should always be centered and scaled; this is done internally by using the option scale=TRUE with tune.svm() and svm(). It is important that the same scaling factors, location and scale, estimated from the training data be used when predicting on the test data.
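With scale=TRUE the fitted svm object stores the training location and scale, and predict() reapplies them automatically. As a minimal sketch of the manual equivalent (assuming churnTrain and churnTest are loaded as in this notebook):

```r
# Manual centering/scaling sketch: compute location and scale on the
# training data only, then apply those same factors to the test data.
num <- sapply(churnTrain, is.numeric)            # numeric/integer columns only
ctr <- colMeans(churnTrain[, num])               # training means (location)
scl <- apply(churnTrain[, num], 2, sd)           # training standard deviations (scale)
trainScaled <- scale(churnTrain[, num], center = ctr, scale = scl)
testScaled  <- scale(churnTest[, num],  center = ctr, scale = scl)
```

Note that the test data is never used to estimate the centering or scaling factors; doing so would leak information from the test set into the model.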
The most important tuning parameters to estimate initially are the gamma and cost parameters. The default kernel is the radial basis function but other kernels may be tested later to see if an improvement can be found.
The function tune.svm() uses ten-fold CV and determines the minimum CV error for each combination of tuning parameters. A good recommendation is a grid search over the 11 settings each for gamma and cost shown below.
(gamma = 2^(seq(-15,15,3)))
## [1] 3.051758e-05 2.441406e-04 1.953125e-03 1.562500e-02 1.250000e-01
## [6] 1.000000e+00 8.000000e+00 6.400000e+01 5.120000e+02 4.096000e+03
## [11] 3.276800e+04
(cost = 2^(seq(-5,15,2)))
## [1] 3.1250e-02 1.2500e-01 5.0000e-01 2.0000e+00 8.0000e+00 3.2000e+01
## [7] 1.2800e+02 5.1200e+02 2.0480e+03 8.1920e+03 3.2768e+04
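For reference, the full grid search over both sequences could be written as below; this runs the 121 ten-fold cross-validations mentioned in the text and is therefore slow (the object name objFull is illustrative):

```r
# Full 11 x 11 grid search over the gamma and cost sequences shown above.
# Warning: 121 separate 10-fold cross-validations -- expect a long runtime.
library(e1071)
objFull <- tune.svm(churn ~ ., data = churnTrain, scale = TRUE,
                    gamma = 2^seq(-15, 15, 3),
                    cost  = 2^seq(-5, 15, 2))
summary(objFull)
```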
This grid search requires \(11 \times 11 = 121\) separate cross-validations. To save time I used smaller grids combined with an interactive search for a bounding interval. At the final stage I used the search shown below, which itself takes several minutes. The results are summarized below, and we see the best values correspond to gamma=0.05 and cost=4.
set.seed(777555333) #takes several minutes for 10-fold CV.
obj <- tune.svm(churn~., data = churnTrain, scale=TRUE,
gamma = c(0.005, 0.01, 0.05, 0.09), cost = c(1.0,2.0,4.0,6.0))
summary(obj)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.05 4
##
## - best performance: 0.08100466
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.005 1 0.14461348 0.02024087
## 2 0.010 1 0.12391254 0.01861701
## 3 0.050 1 0.08970527 0.01581927
## 4 0.090 1 0.08850857 0.01733489
## 5 0.005 2 0.13621136 0.01897863
## 6 0.010 2 0.10741220 0.01326667
## 7 0.050 2 0.08160526 0.01808895
## 8 0.090 2 0.08100735 0.01865069
## 9 0.005 4 0.11581342 0.01712226
## 10 0.010 4 0.09960709 0.01237262
## 11 0.050 4 0.08100466 0.01759456
## 12 0.090 4 0.08130676 0.01268190
## 13 0.005 6 0.10861071 0.01165150
## 14 0.010 6 0.09570499 0.01076463
## 15 0.050 6 0.08160616 0.01309471
## 16 0.090 6 0.08400946 0.01282314
The figure below shows a level plot, which is a type of 3D visualization. It shows how the CV error depends on gamma and cost. This type of plot is similar to a contour plot but takes its inspiration from geographical maps, which indicate land elevation or ocean depth using shades of brown or blue. The level plot helps to visualize the error surface and may be useful in refining the search. But it is important to remember that the level plot shown is stochastic, so it exhibits substantial variation due to the randomness in 10-fold CV.
plot(obj)
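The winning settings can also be read off the tuned object programmatically; best.parameters and best.performance are standard components of the object returned by tune.svm():

```r
# Extract the best tuning parameters and the corresponding CV error
# directly from the tune.svm() result rather than reading the summary.
obj$best.parameters     # one-row data frame with gamma and cost
obj$best.performance    # minimum 10-fold CV error
gammaBest <- obj$best.parameters$gamma
costBest  <- obj$best.parameters$cost
```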
Selecting gamma=0.05 and cost=4, the SVM is fit and the confusion matrix for the test sample is shown.
#fit SVM and predict on test data
ans <- svm(churn~., data=churnTrain, gamma=0.05, cost=4, scale=TRUE)
yH <- predict(ans, newdata=churnTest)
yTe <- churnTest$churn
eta <- mean(yTe!=yH)
(tb <- table(yTe, yH))
## yH
## yTe yes no
## yes 136 88
## no 24 1419
From the table we see that there are 112 misclassifications, producing a misclassification rate of 0.0672 with a 95% margin of error of 0.012.
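These figures can be recomputed directly from the printed confusion matrix using the usual normal-approximation margin of error, \(1.96\sqrt{\hat\eta(1-\hat\eta)/n}\):

```r
# Reconstruct the confusion matrix printed above and recompute the
# misclassification rate and its 95% margin of error.
tb <- matrix(c(136, 24, 88, 1419), nrow = 2,
             dimnames = list(yTe = c("yes", "no"), yH = c("yes", "no")))
n   <- sum(tb)                             # 1667 test cases
err <- n - sum(diag(tb))                   # 112 misclassifications
eta <- err / n                             # misclassification rate, about 0.0672
moe <- 1.96 * sqrt(eta * (1 - eta) / n)    # about 0.012
c(rate = eta, moe = moe)
```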
Next we compare with the classifiers C5.0 and Random Forest. We find that SVM has slightly poorer performance than either C5.0 or Random Forest, and, as previously found, Random Forest slightly outperforms C5.0. It is possible that SVM may be improved by using a different kernel; in particular, the polynomial and sigmoid kernels are also popular.
## Misclassification% 95% CI lower 95% CI upper
## SVM 6.72 5.52 7.92
## C5.0 5.28 4.08 6.48
## RF 4.80 3.60 6.00
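As noted above, alternative kernels could be tried. A hedged sketch is shown below; the kernel argument is a documented option of tune.svm(), but the degree, gamma, and cost grids here are illustrative assumptions, not values validated in this notebook:

```r
# Illustrative only: tune an SVM with a polynomial kernel.
# The degree/gamma/cost grids below are assumptions for demonstration.
objPoly <- tune.svm(churn ~ ., data = churnTrain, scale = TRUE,
                    kernel = "polynomial", degree = 2:4,
                    gamma = c(0.01, 0.05), cost = c(1, 4))
summary(objPoly)
```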
We were able to obtain reasonable results with SVM on the churn dataset, and it is possible that with further tuning we may improve the predictions on this test dataset. However, this tuning step can easily turn into data dredging (https://en.wikipedia.org/wiki/Data_dredging), resulting in an overfit model that does not generalize well. The most frequent difficulty with SVM classifiers is choosing adequate tuning parameters while avoiding the problem of overtraining.
On the other hand, methods such as C5.0 and Random Forest take almost no effort to train and often generalize very well.
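To illustrate the low-effort point, both comparison models can be fit with their defaults in a few lines (a sketch assuming the C50 and randomForest packages are installed and churnTrain/churnTest are loaded as above):

```r
# Near-default fits of the two comparison classifiers; no tuning required.
library(C50)
library(randomForest)
fitC5 <- C5.0(churn ~ ., data = churnTrain)
fitRF <- randomForest(churn ~ ., data = churnTrain)
# Test-set misclassification rates for comparison with the SVM above.
mean(predict(fitC5, newdata = churnTest) != churnTest$churn)
mean(predict(fitRF, newdata = churnTest) != churnTest$churn)
```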