Churn rate is an important concept for businesses and, more generally, for any organization with members. It refers to the rate at which people leave; see the Wikipedia article https://en.wikipedia.org/wiki/Churn_rate.

This dataset is included in the R package and was obtained from the MLC++ machine learning software for modeling customer churn (http://www.sgi.com/tech/mlc/). The outcome is a binary yes/no according to whether or not the customer switched. There are 19 inputs, mostly numeric but some categorical. The R package documentation notes that the original source included a remark that the data were artificial but similar to actual data, presumably for confidentiality reasons. The data have been divided into training and test samples of 3333 and 1667 observations respectively.
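As a sketch, the data can be loaded as follows, assuming the version of the C50 package that bundles the churn data as churnTrain and churnTest (newer releases have moved these data to the modeldata package):

```r
# Load the churn data; older releases of the C50 package
# lazy-load the training and test sets as churnTrain and churnTest.
library(C50)
data(churn)
str(churnTrain)    # 3333 obs. of 20 variables
nrow(churnTest)    # 1667 observations
```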

The variables have been given self-descriptive names and are summarized below.

Note that the factor variable state has 51 levels. Unless special care is taken in the algorithm design this could lead to excessive computation, since the number of possible splits is combinatorially very large. It is interesting that all the methods discussed handle this with ease!

## 'data.frame':    3333 obs. of  20 variables:
##  $ state                        : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
##  $ account_length               : int  128 107 137 84 75 118 121 147 117 141 ...
##  $ area_code                    : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
##  $ international_plan           : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
##  $ voice_mail_plan              : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
##  $ number_vmail_messages        : int  25 26 0 0 0 0 24 0 0 37 ...
##  $ total_day_minutes            : num  265 162 243 299 167 ...
##  $ total_day_calls              : int  110 123 114 71 113 98 88 79 97 84 ...
##  $ total_day_charge             : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ total_eve_minutes            : num  197.4 195.5 121.2 61.9 148.3 ...
##  $ total_eve_calls              : int  99 103 110 88 122 101 108 94 80 111 ...
##  $ total_eve_charge             : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ total_night_minutes          : num  245 254 163 197 187 ...
##  $ total_night_calls            : int  91 103 104 89 121 118 118 96 90 97 ...
##  $ total_night_charge           : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ total_intl_minutes           : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ total_intl_calls             : int  3 3 5 7 3 6 7 6 4 5 ...
##  $ total_intl_charge            : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ number_customer_service_calls: int  1 1 0 2 3 0 3 0 1 0 ...
##  $ churn                        : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

Both CART and C5.0 automatically handle categorical variables.

C5.0 Fit

Due to the large number of inputs, the tree is too complicated to plot, but the text summary produced by C5.0 is displayed below.
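The fit itself is a one-liner; a minimal sketch, assuming the C50 package and the churnTrain data frame named in the Call below:

```r
library(C50)
fit_c50 <- C5.0(churn ~ ., data = churnTrain)  # factors are handled natively
summary(fit_c50)  # prints the decision tree, training errors, and attribute usage
```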

## 
## Call:
## C5.0.formula(formula = churn ~ ., data = churnTrain)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Mar 22 10:30:02 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3333 cases (20 attributes) from undefined.data
## 
## Decision tree:
## 
## total_day_minutes > 264.4:
## :...voice_mail_plan = yes:
## :   :...international_plan = no: no (45/1)
## :   :   international_plan = yes: yes (8/3)
## :   voice_mail_plan = no:
## :   :...total_eve_minutes > 187.7:
## :       :...total_night_minutes > 126.9: yes (94/1)
## :       :   total_night_minutes <= 126.9:
## :       :   :...total_day_minutes <= 277: no (4)
## :       :       total_day_minutes > 277: yes (3)
## :       total_eve_minutes <= 187.7:
## :       :...total_eve_charge <= 12.26: no (15/1)
## :           total_eve_charge > 12.26:
## :           :...total_day_minutes <= 277:
## :               :...total_night_minutes <= 224.8: no (13)
## :               :   total_night_minutes > 224.8: yes (5/1)
## :               total_day_minutes > 277:
## :               :...total_night_minutes > 151.9: yes (18)
## :                   total_night_minutes <= 151.9:
## :                   :...account_length <= 123: no (4)
## :                       account_length > 123: yes (2)
## total_day_minutes <= 264.4:
## :...number_customer_service_calls > 3:
##     :...total_day_minutes <= 160.2:
##     :   :...total_eve_charge <= 19.83: yes (79/3)
##     :   :   total_eve_charge > 19.83:
##     :   :   :...total_day_minutes <= 120.5: yes (10)
##     :   :       total_day_minutes > 120.5: no (13/3)
##     :   total_day_minutes > 160.2:
##     :   :...total_eve_charge > 12.05: no (130/24)
##     :       total_eve_charge <= 12.05:
##     :       :...total_eve_calls <= 125: yes (16/2)
##     :           total_eve_calls > 125: no (3)
##     number_customer_service_calls <= 3:
##     :...international_plan = yes:
##         :...total_intl_calls <= 2: yes (51)
##         :   total_intl_calls > 2:
##         :   :...total_intl_minutes <= 13.1: no (173/7)
##         :       total_intl_minutes > 13.1: yes (43)
##         international_plan = no:
##         :...total_day_minutes <= 223.2: no (2221/60)
##             total_day_minutes > 223.2:
##             :...total_eve_charge <= 20.5: no (295/22)
##                 total_eve_charge > 20.5:
##                 :...voice_mail_plan = yes: no (20)
##                     voice_mail_plan = no:
##                     :...total_night_minutes > 174.2: yes (50/8)
##                         total_night_minutes <= 174.2:
##                         :...total_day_minutes <= 246.6: no (12)
##                             total_day_minutes > 246.6:
##                             :...total_day_charge <= 43.33: yes (4)
##                                 total_day_charge > 43.33: no (2)
## 
## 
## Evaluation on training data (3333 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      27  136( 4.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     365   118    (a): class yes
##      18  2832    (b): class no
## 
## 
##  Attribute usage:
## 
##  100.00% total_day_minutes
##   93.67% number_customer_service_calls
##   87.73% international_plan
##   20.73% total_eve_charge
##    8.97% voice_mail_plan
##    8.01% total_intl_calls
##    6.48% total_intl_minutes
##    6.33% total_night_minutes
##    4.74% total_eve_minutes
##    0.57% total_eve_calls
##    0.18% account_length
##    0.18% total_day_charge
## 
## 
## Time: 0.1 secs

The misclassification rate on the test data is 5.3% with a 95% confidence interval of (4.2, 6.4)%.

For comparison, the misclassification rate on the training data is 4.1%.
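The test-set rate and interval can be computed along these lines (a sketch; the notebook's exact call is not shown, and the normal approximation to the binomial is one of several ways to get the interval):

```r
pred <- predict(fit_c50, newdata = churnTest)  # fit_c50 from C5.0(churn ~ ., churnTrain)
err  <- mean(pred != churnTest$churn)          # test misclassification rate
# Approximate 95% CI from the normal approximation to the binomial:
n  <- nrow(churnTest)
ci <- err + c(-1, 1) * 1.96 * sqrt(err * (1 - err) / n)
round(100 * c(rate = err, lower = ci[1], upper = ci[2]), 1)
```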

CART Fit

CART is fit using the rpart package and a skeleton version of the tree is produced. The rules are different from C5.0 but just as complicated so they are not displayed.
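A minimal rpart sketch (default parameters are an assumption, not necessarily the notebook's settings):

```r
library(rpart)
fit_cart  <- rpart(churn ~ ., data = churnTrain, method = "class")
pred_cart <- predict(fit_cart, churnTest, type = "class")
mean(pred_cart != churnTest$churn)   # test misclassification rate
```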

The misclassification rate on the test data is 7% with a 95% confidence interval of (5.8, 8.2)%, slightly higher than with C5.0.

For comparison, the misclassification rate on the training data is 4.9%.

Naive Bayes

Naive Bayes generalizes the approach we saw in Diagonal Linear Discriminant Analysis (DLDA). All input variables are assumed to be statistically independent. Usually, as in DLDA, continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomially distributed. Naive Bayes is sometimes listed among the Top 10 Algorithms in Data Mining.

As mentioned in the Wikipedia article https://en.wikipedia.org/wiki/Naive_Bayes_classifier, Naive Bayes often outperforms support vector machines and other more elaborate methods.
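A sketch using the naiveBayes() implementation in e1071 (an assumption; the notebook does not say which of R's several Naive Bayes implementations was used):

```r
library(e1071)
# Gaussian densities for numeric inputs, frequency tables for factors
fit_nb  <- naiveBayes(churn ~ ., data = churnTrain)
pred_nb <- predict(fit_nb, churnTest)
mean(pred_nb != churnTest$churn)   # test misclassification rate
```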

The misclassification rate on the test data is 11% with a 95% confidence interval of (9.5, 12.5)%.

For comparison, the misclassification rate on the training data is 11%.

Random Forests

One of the most useful features of Random Forests is its algorithm for evaluating the importance of each input variable to the prediction. This takes extra computation, so the importance argument of randomForest() defaults to FALSE. When plotting the importance scores, I recommend dotchart(), since it handles long variable names better than barchart(); you may also wish to use abbreviate() to shorten some names. It is also a good idea to borrow from the Pareto chart and display the scores in descending order of importance, which greatly enhances the visualization.
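Put together, those recommendations might look like this (a sketch; the seed and object names are assumptions):

```r
library(randomForest)
set.seed(123)   # the forest is stochastic; fix the seed for reproducibility
fit_rf <- randomForest(churn ~ ., data = churnTrain, importance = TRUE)
imp <- importance(fit_rf, type = 1)   # mean decrease in accuracy
imp <- sort(imp[, 1])                 # ascending, so dotchart shows largest at the top
dotchart(imp, labels = abbreviate(names(imp), 20),
         xlab = "Mean decrease in accuracy")
```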

The importance plot below shows that the number of customer service calls is the most important variable in predicting churn.

The misclassification rate on the test data is 5% with a 95% confidence interval (3.9, 6)%.

For comparison, the misclassification rate on the training data is 11%.

Gradient Boosting Machines

The misclassification rate on the test data is 6.4%.
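A sketch using the gbm package (the tuning values here are assumptions, not the notebook's settings). Note that gbm() with Bernoulli loss wants a 0/1 numeric response rather than a factor:

```r
library(gbm)
set.seed(123)
train_gbm <- transform(churnTrain, churn = as.numeric(churn == "yes"))
fit_gbm <- gbm(churn ~ ., data = train_gbm, distribution = "bernoulli",
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.01,
               cv.folds = 5)
best <- gbm.perf(fit_gbm, method = "cv")   # CV-selected number of trees
p <- predict(fit_gbm, churnTest, n.trees = best, type = "response")
mean((p > 0.5) != (churnTest$churn == "yes"))   # test misclassification rate
```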

Other

We could also consider group Lasso logistic regression, available in the grpreg package. In this case we need to reparametrize by introducing dummy variables for the factors. Very strong multicollinearity exists among 8 of the variables, which are listed below.

## [1] "total_day_minutes"   "total_day_charge"    "total_eve_minutes"  
## [4] "total_eve_charge"    "total_night_minutes" "total_night_charge" 
## [7] "total_intl_minutes"  "total_intl_charge"
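The flagged variables are the four minute/charge pairs, which appear to be essentially proportional (charge looks to be computed directly from minutes), hence the enormous VIFs. A quick correlation check makes this visible (a sketch in base R):

```r
# Find near-perfectly correlated pairs among the numeric inputs
num <- churnTrain[, sapply(churnTrain, is.numeric)]
cc  <- cor(num)
idx <- which(abs(cc) > 0.99 & upper.tri(cc), arr.ind = TRUE)
data.frame(var1 = rownames(cc)[idx[, 1]], var2 = colnames(cc)[idx[, 2]])
```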

Summary

The misclassification rates, in percent, are summarized in the table below.

##  C50 CART   RF   NB  GBM 
##  5.3  7.0  5.0 11.0  6.4

Note that because cross-validation is used, the misclassification rates may change slightly when this notebook is recompiled, but the rankings usually remain about the same: RF first, followed closely by C5.0, then GBM and CART, with NB always last.

This notebook takes about 13 seconds to compile, so all the algorithms run very quickly on this moderately sized dataset.