The churn rate is an important quantity for businesses and, more generally, for any organization whose members can leave: it is the rate at which customers or members depart. See the Wikipedia article https://en.wikipedia.org/wiki/Churn_rate.
This dataset is included in an R package and was obtained from the MLC++ machine learning software for modeling customer churn (http://www.sgi.com/tech/mlc/). The outcome is binary (yes/no) according to whether or not the customer switched. There are 19 inputs, mostly numeric but some categorical. The R package documentation notes that the original source described the data as artificial but similar to actual data, presumably for confidentiality reasons. The data have been divided into training and test samples of 3333 and 1667 observations respectively.
The variables have been given self-descriptive names and are summarized below.
Note that the factor variable state has 51 levels. Unless special care is taken in the algorithm design, this could lead to excessive computation, since the number of possible splits is combinatorially large. It is interesting that all the methods discussed handle this with ease!
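A minimal sketch of loading the data and producing the structure summary below; I am assuming here the churn data shipped with the C50 package, which provides the churnTrain and churnTest data frames:

```r
# Assumed source: the C50 package's bundled churn data,
# which creates churnTrain (3333 rows) and churnTest (1667 rows).
library(C50)
data(churn)
str(churnTrain)
```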
## 'data.frame': 3333 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
## $ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
## $ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_minutes : num 265 162 243 299 167 ...
## $ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
## $ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_minutes : num 245 254 163 197 187 ...
## $ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
CART and C5.0 handle categorical variables automatically.
Due to the large number of inputs the tree is too complicated to display, but the rules produced by C5.0 are shown below.
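In sketch form, the fit that produces the summary below would look something like this (assuming the C50 package and the churnTrain data frame named in the printed call; cmod is just an illustrative object name):

```r
# Fit a C5.0 classification tree on the training data and print
# its summary, which lists the decision rules, the training-error
# confusion matrix, and the attribute usage table.
library(C50)
cmod <- C5.0(churn ~ ., data = churnTrain)
summary(cmod)
```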
##
## Call:
## C5.0.formula(formula = churn ~ ., data = churnTrain)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Mar 22 10:30:02 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 3333 cases (20 attributes) from undefined.data
##
## Decision tree:
##
## total_day_minutes > 264.4:
## :...voice_mail_plan = yes:
## : :...international_plan = no: no (45/1)
## : : international_plan = yes: yes (8/3)
## : voice_mail_plan = no:
## : :...total_eve_minutes > 187.7:
## : :...total_night_minutes > 126.9: yes (94/1)
## : : total_night_minutes <= 126.9:
## : : :...total_day_minutes <= 277: no (4)
## : : total_day_minutes > 277: yes (3)
## : total_eve_minutes <= 187.7:
## : :...total_eve_charge <= 12.26: no (15/1)
## : total_eve_charge > 12.26:
## : :...total_day_minutes <= 277:
## : :...total_night_minutes <= 224.8: no (13)
## : : total_night_minutes > 224.8: yes (5/1)
## : total_day_minutes > 277:
## : :...total_night_minutes > 151.9: yes (18)
## : total_night_minutes <= 151.9:
## : :...account_length <= 123: no (4)
## : account_length > 123: yes (2)
## total_day_minutes <= 264.4:
## :...number_customer_service_calls > 3:
## :...total_day_minutes <= 160.2:
## : :...total_eve_charge <= 19.83: yes (79/3)
## : : total_eve_charge > 19.83:
## : : :...total_day_minutes <= 120.5: yes (10)
## : : total_day_minutes > 120.5: no (13/3)
## : total_day_minutes > 160.2:
## : :...total_eve_charge > 12.05: no (130/24)
## : total_eve_charge <= 12.05:
## : :...total_eve_calls <= 125: yes (16/2)
## : total_eve_calls > 125: no (3)
## number_customer_service_calls <= 3:
## :...international_plan = yes:
## :...total_intl_calls <= 2: yes (51)
## : total_intl_calls > 2:
## : :...total_intl_minutes <= 13.1: no (173/7)
## : total_intl_minutes > 13.1: yes (43)
## international_plan = no:
## :...total_day_minutes <= 223.2: no (2221/60)
## total_day_minutes > 223.2:
## :...total_eve_charge <= 20.5: no (295/22)
## total_eve_charge > 20.5:
## :...voice_mail_plan = yes: no (20)
## voice_mail_plan = no:
## :...total_night_minutes > 174.2: yes (50/8)
## total_night_minutes <= 174.2:
## :...total_day_minutes <= 246.6: no (12)
## total_day_minutes > 246.6:
## :...total_day_charge <= 43.33: yes (4)
## total_day_charge > 43.33: no (2)
##
##
## Evaluation on training data (3333 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 27 136( 4.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 365 118 (a): class yes
## 18 2832 (b): class no
##
##
## Attribute usage:
##
## 100.00% total_day_minutes
## 93.67% number_customer_service_calls
## 87.73% international_plan
## 20.73% total_eve_charge
## 8.97% voice_mail_plan
## 8.01% total_intl_calls
## 6.48% total_intl_minutes
## 6.33% total_night_minutes
## 4.74% total_eve_minutes
## 0.57% total_eve_calls
## 0.18% account_length
## 0.18% total_day_charge
##
##
## Time: 0.1 secs
The misclassification rate on the test data is 5.3% with a 95% confidence interval of (4.2, 6.4)%.
For comparison, the misclassification rate on the training data is 4.1%.
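The test-set error and its interval can be computed along these lines (cmod is the hypothetical name for the fitted C5.0 model; the interval uses the usual normal approximation to the binomial):

```r
# Misclassification rate on the held-out test set, with an
# approximate 95% confidence interval: err +/- 1.96 * SE.
pred <- predict(cmod, newdata = churnTest)
err  <- mean(pred != churnTest$churn)
n    <- nrow(churnTest)
err + c(-1, 1) * 1.96 * sqrt(err * (1 - err) / n)
```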
CART is fit using the rpart package, and a skeleton version of the tree is plotted. The rules differ from those of C5.0 but are just as complicated, so they are not displayed.
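A sketch of the CART fit, with tuning parameters left at their rpart defaults (rmod is an illustrative object name):

```r
# Fit a CART model and draw a skeleton of the tree
# (branches only, no split labels).
library(rpart)
rmod <- rpart(churn ~ ., data = churnTrain)
plot(rmod, uniform = TRUE)
```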
The misclassification rate on the test data is 7% with a 95% confidence interval of (5.8, 8.2)%, so slightly higher than with C5.0.
For comparison, the misclassification rate on the training data is 4.9%.
Naive Bayes generalizes the approach we saw in Diagonal Linear Discriminant Analysis (DLDA). All input variables are assumed to be statistically independent given the class. As in DLDA, continuous variables are usually assumed to be normally distributed; categorical variables are assumed to be multinomially distributed. Naive Bayes is sometimes counted among the Top 10 Algorithms in Data Mining.
As the Wikipedia article https://en.wikipedia.org/wiki/Naive_Bayes_classifier mentions, Naive Bayes often outperforms Support Vector Machines and other more elaborate methods.
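One common implementation is naiveBayes() in the e1071 package, which follows exactly these assumptions (Gaussian densities for numeric inputs, empirical multinomial probabilities for factors); a sketch:

```r
# Naive Bayes fit and test-set error; nbmod is an illustrative name.
library(e1071)
nbmod <- naiveBayes(churn ~ ., data = churnTrain)
mean(predict(nbmod, churnTest) != churnTest$churn)
```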
The misclassification rate is 11% with a 95% confidence interval (9.5, 12.5)%.
For comparison the misclassification rate on the training data is 11%.
One of the most useful features of Random Forests is its algorithm for evaluating the importance of each input variable to the prediction. This requires extra computation, so the default in randomForest() is importance = FALSE. When plotting the importance scores, I recommend dotchart(), which handles long variable names better than a bar chart; you may also wish to use abbreviate() to shorten some names. It also helps to borrow an idea from the Pareto chart and display the variables in descending order of importance, which greatly enhances the visualization.
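Putting those recommendations together, a sketch (rfmod is an illustrative object name; type = 1 selects the permutation-based mean decrease in accuracy):

```r
# Random forest with permutation importance turned on,
# plotted with dotchart() in order of importance.
library(randomForest)
rfmod <- randomForest(churn ~ ., data = churnTrain, importance = TRUE)
imp <- importance(rfmod, type = 1)          # mean decrease in accuracy
imp <- sort(imp[, 1])                       # dotchart draws bottom-up,
                                            # so largest ends up on top
dotchart(imp, labels = abbreviate(names(imp), 20))
```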
The importance plot below shows that the number of customer service calls is the most important variable in predicting churn.
The misclassification rate on the test data is 5% with a 95% confidence interval (3.9, 6)%.
For comparison the misclassification rate on the training data is 11%.
For gradient boosting (GBM), the misclassification rate on the test data is 6.4%.
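A sketch of such a fit with the gbm package; the hyperparameters (1000 trees, depth 3, 5-fold CV) are illustrative choices, not necessarily those used here, and bernoulli loss requires the response recoded as 0/1:

```r
# Gradient boosting with cross-validated choice of the number of trees.
library(gbm)
train <- transform(churnTrain, churn = as.integer(churn == "yes"))
gmod  <- gbm(churn ~ ., data = train, distribution = "bernoulli",
             n.trees = 1000, interaction.depth = 3, cv.folds = 5)
best  <- gbm.perf(gmod, method = "cv")      # best iteration by CV
phat  <- predict(gmod, churnTest, n.trees = best, type = "response")
mean((phat > 0.5) != (churnTest$churn == "yes"))
```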
We could also consider group-lasso logistic regression, available in the grpreg package. In this case we need to reparametrize the factors by introducing dummy variables. Very strong multicollinearity exists among 8 of the variables; those with problematic VIFs are listed below.
## [1] "total_day_minutes" "total_day_charge" "total_eve_minutes"
## [4] "total_eve_charge" "total_night_minutes" "total_night_charge"
## [7] "total_intl_minutes" "total_intl_charge"
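The collinearity is no surprise: each charge variable is (essentially) a fixed rate times the corresponding minutes variable, which a correlation check confirms. A group-lasso sketch with grpreg follows; model.matrix() builds the dummy variables and its "assign" attribute supplies the group index for each column, so all dummies from one factor are penalized together (this setup is my assumption, not code from the notebook):

```r
# Charge is a near-exact linear function of minutes:
with(churnTrain, cor(total_day_minutes, total_day_charge))

# Group lasso logistic regression with cross-validation.
library(grpreg)
mm  <- model.matrix(churn ~ ., data = churnTrain)
X   <- mm[, -1]                          # drop the intercept column
grp <- attr(mm, "assign")[-1]            # group index per dummy column
y   <- as.integer(churnTrain$churn == "yes")
cvfit <- cv.grpreg(X, y, group = grp, family = "binomial",
                   penalty = "grLasso")
```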
The misclassification rates, in percent, are summarized in the table below.
## C50 CART RF NB GBM
## 5.3 7.0 5.0 11.0 6.4
Note that because cross-validation is used, the misclassification rates may change slightly when this notebook is recompiled, but the rankings usually remain about the same: RF first, followed closely by C5.0, then GBM and CART, with NB always last.
This notebook takes about 13 seconds to compile, so all of the algorithms run quickly on this moderately sized dataset.