The purpose of this notebook is to demonstrate how to use R for exploratory analysis and discrimant analysis using LDA and logistic regression.
There is a lot of computer code and output. Students should not use this notebook as a example or paradigm for writing reports!
Default is a simulated data set containing information on ten thousand customers which is available in the R package ISLR The purpose is to predict which customers default on their credit cards.
We load the data and view the terse summary.
data(Default, package="ISLR") #from ISRL package from CRAN
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
The explanatory variables are student, balance and income. The factor student indicates if the customer is a student or not. The variables balance and income refer respectively to the balance on the credit card and the customer’s income.
Lattice provies a coordinated multi-panel display of income vs. balance for the two default status cases. From the plot we see: default is has low probability overall students have lower incomes - I would not expect such a sharp division. *there is a tendency for student balances to be larger both in those that default and also those that don’t.
studentQ <- as.integer(Default$student)-1
xyplot(income~balance | default,
data=Default, panel=function(x, y, subscripts) {
ind <- as.logical(studentQ[subscripts])
panel.xyplot(x[!ind], y[!ind], col=rgb(0,0,1, 0.5))
panel.xyplot(x[ind], y[ind], col=rgb(1,0,0,0.8))
panel.grid(h=-1, v=-1, col=rgb(0,0,0,0.6), lty=1)
},
main="Credit card default (Yes/No)", sub="red:students")
Mosaic plots are useful with categorical variables and for visualization crosstabs.
The mosaic display is comprised of rectangles that form a partition of a larger rectangle that represents the entire dataset. The subrectangles are created recursively so that the area of each subrectangle is proportion to the observed count or frequency in the original contingency table. The left edge starts with the first variable and we divide the data into blocks. The second variable subdivides these blocks starting at the top edge. We proceed with recursively dividing the remaining variables into blocks. The choice of color and spacing between rectangles can aid in the perception of patterns and insight.
References: Wikipedia https://en.wikipedia.org/wiki/Mosaic_plot. The Mondrian package http://www.theusrus.de/Mondrian/Mondrian.html and http://www.interactivegraphics.org/Home.html provides a dynamic visualization approach using mosaic plots. There are three interesting video examples using Mondrian here http://www.theusrus.de/blog/making-movies/ I recommend the first example with the famous Titanic dataset. The paper by Friendly (2002) traces the history of mosaics from Islamic art, Halley’s graphical proof for the multiplication rule for independent events in probabilty to present. Friendly, M. (2002). A Brief History of the Mosaic Display. Journal of Computational and Graphical Statistics, vol. 11, no. 1, pp. 89-107. http://www.jstor.org.proxy1.lib.uwo.ca/stable/1391129.
The first step is to obtain the crosstab for the factor variables student and default. From this table we see that most of the cases are non-students (6850+206) but proportion to their numbers students have a higher probability of default. We also see left blue rectangle is slightly smaller in area than the shorter but slightly wider blue rectangle on the right confirming that more non-students defaulted.
student <- ordered(Default$student, labels=c("notStudent","Student"))
default <- Default$default
tb <- xtabs(~student+default)
tb
## default
## student No Yes
## notStudent 6850 206
## Student 2817 127
The mosaic plot provides a visualization of the crosstabs. In simple cases like this the mosaic plot is not really needed but it is extremely useful in complex cases involving more factors and when there are interactions between the factors.
mosaicplot(tb, color=c( "red", "blue"),
main="Default and Student/Non-student Status")
Income is a quantitative variable but we can use the R functions cut() and quantile() to make it into an ordered categorical variable. Specifically, we divide the income into four quartiles.
Remark: Making a quantitative variable into a categorical variable is often done with large datasets since it provides a better and simpler way of fitting models to the data. It often provides a flexible approach to understanding and dealing effectively with non-linearity.
#more complex mosaic plot
income <- with(Default,
cut(income,
breaks=quantile(income, probs=c(0, 0.25, 0.5, 0.75, 1 ), names=FALSE),
labels=c("Q1", "Q2", "Q3", "Q4"),
include.lowest=TRUE, ordered_result=TRUE))
tb <- xtabs(~default+income)
tb
## income
## default Q1 Q2 Q3 Q4
## No 2394 2428 2424 2421
## Yes 106 72 76 79
From the plot we see that income plays no significant role in defaults by non-students but a slight role with students. Partly the reason for this is that there are no students in the two upper income quartiles - as shown in the crosstabs following the plot.
mosaicplot(tb, color=c( "red", "blue"),
main="Default Status and Income Class")
xtabs(~student+income)
## income
## student Q1 Q2 Q3 Q4
## notStudent 222 1834 2500 2500
## Student 2278 666 0 0
The variable balance is made into a categorical variables corresponding to levels low, medium and high
balance = with(Default,
cut(balance,
breaks=quantile(balance, probs=c(0, 1/3, 2/3, 1), names=FALSE),
labels=c("low", "medium", "high"),
include.lowest=TRUE, ordered_result=TRUE))
tb <- xtabs(~default+balance)
tb
## balance
## default low medium high
## No 3334 3324 3009
## Yes 0 9 324
From the simple crosstabs of balance and default we see that balance has a huge effect: almost all defaults occur when the balance is in the high category and there are no defaults for balances in the low category.
The mosaic plot for this simple contingency table reflects these facts. In particular, viewing just the right-hand part of the plot we see that lower red rectangle reflects the large number of defaults that occur with high credit card balances. The tiny blue rectange in the upper right corresponds to the small number of defaults with medium balances. Since no defaults occurred with low balances, this is represented by a dashed line.
mosaicplot(tb, color=c( "red", "blue"),
main="Default Status and Card Balance")
We conclude with the crosstabs for all four factors and several possible mosaicplots. Since there are no students in income classes Q3 and Q4, we use the new R function droplevels() to remove them. The printout that follows shows the data is very difficult to comprehend when displayed like this.
DF <- data.frame(student=student, income=income, balance=balance,
default=default)
DFsubset <- subset(DF, DF$income!="Q3"& DF$income!="Q4")
tb <- xtabs(~student+default+droplevels(income)+balance, data=DFsubset)
tb
## , , droplevels(income) = Q1, balance = low
##
## default
## student No Yes
## notStudent 86 0
## Student 496 0
##
## , , droplevels(income) = Q2, balance = low
##
## default
## student No Yes
## notStudent 714 0
## Student 158 0
##
## , , droplevels(income) = Q1, balance = medium
##
## default
## student No Yes
## notStudent 80 0
## Student 758 1
##
## , , droplevels(income) = Q2, balance = medium
##
## default
## student No Yes
## notStudent 609 3
## Student 216 0
##
## , , droplevels(income) = Q1, balance = high
##
## default
## student No Yes
## notStudent 51 5
## Student 923 100
##
## , , droplevels(income) = Q2, balance = high
##
## default
## student No Yes
## notStudent 465 43
## Student 266 26
The crosstabs is represented by a higher order array of dimensions
dim(tb)
## [1] 2 2 2 3
The variables associated with each dimension are
dimnames(tb)
## $student
## [1] "notStudent" "Student"
##
## $default
## [1] "No" "Yes"
##
## $`droplevels(income)`
## [1] "Q1" "Q2"
##
## $balance
## [1] "low" "medium" "high"
Focusing just on the customers who default, these is represented by the bottom row. The huge bulk is red indicating that virtually all defaults occur with high balance accounts. Not many students are in income class Q2 whereas this is reversed in the non-students.
mosaicplot(tb, color=c( "red", "blue"), main="")
ansLDA <- MASS::lda(default~. , data=Default)
yH <- predict(ansLDA, newdata=Default)$class
y <- Default$default
eta <- 1-mean(yH==y)
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI <- round(eta+c(-MOE, MOE),3)
CI <- paste0("(", CI[1], ", ", CI[2],")")
table(y, yH)
## yH
## y No Yes
## No 9645 22
## Yes 254 79
The misclassification rate is 0.0276 with 95% confidence interval rate (0.024, 0.031).
ansLogit <- glm(default~., data=Default, family="binomial")
yHprob <- predict(ansLogit, newdata=Default, type="response")
yH <- factor(ifelse(yHprob<0.5, "No", "Yes"))
eta <- 1-mean(yH==Default$default) #mis-classification rate
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI <- round(eta+c(-MOE, MOE),3)
CI <- paste0("(", CI[1], ", ", CI[2],")")
table(Default$default, yH) #confusion table
## yH
## No Yes
## No 9627 40
## Yes 228 105
The observed misclassification rate is 0.0268 and a 95% confidence interval for the hypothetical true misclassifcation rate (0.024, 0.03).
Both LDA and logistic DA give essentially equivalent predictions but the logistic method also provides a model that describes the importance of the variables.
summary(ansLogit)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
The fitted equation may be written,
\[{\rm logit(Pr\{default\})} = -10.87- 0.6468\times {\rm Student} + 0.005737\times {\rm Balance} + 0.00000303 \times {\rm Income}.\]
From R help on glm, “For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures”. In the present case this means that “success” is the 2nd level of the factor default which is coded as “Yes”.