The purpose of this notebook is to demonstrate how to use R for exploratory analysis and discrimant analysis using LDA and logistic regression.

There is a lot of computer code and output. Students should not use this notebook as a example or paradigm for writing reports!

Exploratory Analysis and Visualization

Brief Data Description

Default is a simulated data set containing information on ten thousand customers which is available in the R package ISLR The purpose is to predict which customers default on their credit cards.

We load the data and view the terse summary.

data(Default, package="ISLR") #from ISRL package from CRAN
str(Default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...

The explanatory variables are student, balance and income. The factor student indicates if the customer is a student or not. The variables balance and income refer respectively to the balance on the credit card and the customer’s income.

Conditional Lattice Plot

Lattice provies a coordinated multi-panel display of income vs. balance for the two default status cases. From the plot we see: default is has low probability overall students have lower incomes - I would not expect such a sharp division. *there is a tendency for student balances to be larger both in those that default and also those that don’t.

studentQ <- as.integer(Default$student)-1
xyplot(income~balance | default, 
       data=Default, panel=function(x, y, subscripts) {
        ind <- as.logical(studentQ[subscripts])
        panel.xyplot(x[!ind], y[!ind], col=rgb(0,0,1, 0.5))
        panel.xyplot(x[ind], y[ind], col=rgb(1,0,0,0.8))
        panel.grid(h=-1, v=-1, col=rgb(0,0,0,0.6), lty=1)
       },
       main="Credit card default (Yes/No)", sub="red:students")

Mosaic plot and crosstabs

Introduction

Mosaic plots are useful with categorical variables and for visualization crosstabs.

The mosaic display is comprised of rectangles that form a partition of a larger rectangle that represents the entire dataset. The subrectangles are created recursively so that the area of each subrectangle is proportion to the observed count or frequency in the original contingency table. The left edge starts with the first variable and we divide the data into blocks. The second variable subdivides these blocks starting at the top edge. We proceed with recursively dividing the remaining variables into blocks. The choice of color and spacing between rectangles can aid in the perception of patterns and insight.

References: Wikipedia https://en.wikipedia.org/wiki/Mosaic_plot. The Mondrian package http://www.theusrus.de/Mondrian/Mondrian.html and http://www.interactivegraphics.org/Home.html provides a dynamic visualization approach using mosaic plots. There are three interesting video examples using Mondrian here http://www.theusrus.de/blog/making-movies/ I recommend the first example with the famous Titanic dataset. The paper by Friendly (2002) traces the history of mosaics from Islamic art, Halley’s graphical proof for the multiplication rule for independent events in probabilty to present. Friendly, M. (2002). A Brief History of the Mosaic Display. Journal of Computational and Graphical Statistics, vol. 11, no. 1, pp. 89-107. http://www.jstor.org.proxy1.lib.uwo.ca/stable/1391129.

Simple Example

The first step is to obtain the crosstab for the factor variables student and default. From this table we see that most of the cases are non-students (6850+206) but proportion to their numbers students have a higher probability of default. We also see left blue rectangle is slightly smaller in area than the shorter but slightly wider blue rectangle on the right confirming that more non-students defaulted.

student <- ordered(Default$student, labels=c("notStudent","Student"))
default <- Default$default
tb <- xtabs(~student+default)
tb
##             default
## student        No  Yes
##   notStudent 6850  206
##   Student    2817  127

The mosaic plot provides a visualization of the crosstabs. In simple cases like this the mosaic plot is not really needed but it is extremely useful in complex cases involving more factors and when there are interactions between the factors.

mosaicplot(tb, color=c( "red", "blue"), 
           main="Default and Student/Non-student Status")

Mosaic plot with Income

Income is a quantitative variable but we can use the R functions cut() and quantile() to make it into an ordered categorical variable. Specifically, we divide the income into four quartiles.

Remark: Making a quantitative variable into a categorical variable is often done with large datasets since it provides a better and simpler way of fitting models to the data. It often provides a flexible approach to understanding and dealing effectively with non-linearity.

#more complex mosaic plot
income <- with(Default, 
  cut(income, 
    breaks=quantile(income, probs=c(0, 0.25, 0.5, 0.75, 1 ), names=FALSE),
    labels=c("Q1", "Q2", "Q3", "Q4"),
    include.lowest=TRUE, ordered_result=TRUE))
tb <- xtabs(~default+income)
tb
##        income
## default   Q1   Q2   Q3   Q4
##     No  2394 2428 2424 2421
##     Yes  106   72   76   79

From the plot we see that income plays no significant role in defaults by non-students but a slight role with students. Partly the reason for this is that there are no students in the two upper income quartiles - as shown in the crosstabs following the plot.

mosaicplot(tb, color=c( "red", "blue"), 
      main="Default Status and Income Class")

xtabs(~student+income)
##             income
## student        Q1   Q2   Q3   Q4
##   notStudent  222 1834 2500 2500
##   Student    2278  666    0    0

Mosaic plot with Balance

The variable balance is made into a categorical variables corresponding to levels low, medium and high

balance = with(Default, 
    cut(balance, 
    breaks=quantile(balance, probs=c(0, 1/3, 2/3, 1), names=FALSE),
    labels=c("low", "medium", "high"),
    include.lowest=TRUE, ordered_result=TRUE))
tb <- xtabs(~default+balance)
tb
##        balance
## default  low medium high
##     No  3334   3324 3009
##     Yes    0      9  324

From the simple crosstabs of balance and default we see that balance has a huge effect: almost all defaults occur when the balance is in the high category and there are no defaults for balances in the low category.

The mosaic plot for this simple contingency table reflects these facts. In particular, viewing just the right-hand part of the plot we see that lower red rectangle reflects the large number of defaults that occur with high credit card balances. The tiny blue rectange in the upper right corresponds to the small number of defaults with medium balances. Since no defaults occurred with low balances, this is represented by a dashed line.

mosaicplot(tb, color=c( "red", "blue"), 
      main="Default Status and Card Balance")

Higher Dimensional Tables

We conclude with the crosstabs for all four factors and several possible mosaicplots. Since there are no students in income classes Q3 and Q4, we use the new R function droplevels() to remove them. The printout that follows shows the data is very difficult to comprehend when displayed like this.

DF <- data.frame(student=student, income=income, balance=balance, 
                 default=default)
DFsubset <- subset(DF, DF$income!="Q3"& DF$income!="Q4")
tb <- xtabs(~student+default+droplevels(income)+balance, data=DFsubset)
tb
## , , droplevels(income) = Q1, balance = low
## 
##             default
## student       No Yes
##   notStudent  86   0
##   Student    496   0
## 
## , , droplevels(income) = Q2, balance = low
## 
##             default
## student       No Yes
##   notStudent 714   0
##   Student    158   0
## 
## , , droplevels(income) = Q1, balance = medium
## 
##             default
## student       No Yes
##   notStudent  80   0
##   Student    758   1
## 
## , , droplevels(income) = Q2, balance = medium
## 
##             default
## student       No Yes
##   notStudent 609   3
##   Student    216   0
## 
## , , droplevels(income) = Q1, balance = high
## 
##             default
## student       No Yes
##   notStudent  51   5
##   Student    923 100
## 
## , , droplevels(income) = Q2, balance = high
## 
##             default
## student       No Yes
##   notStudent 465  43
##   Student    266  26

The crosstabs is represented by a higher order array of dimensions

dim(tb)
## [1] 2 2 2 3

The variables associated with each dimension are

dimnames(tb)
## $student
## [1] "notStudent" "Student"   
## 
## $default
## [1] "No"  "Yes"
## 
## $`droplevels(income)`
## [1] "Q1" "Q2"
## 
## $balance
## [1] "low"    "medium" "high"

Focusing just on the customers who default, these is represented by the bottom row. The huge bulk is red indicating that virtually all defaults occur with high balance accounts. Not many students are in income class Q2 whereas this is reversed in the non-students.

mosaicplot(tb, color=c( "red", "blue"), main="")

Discriminant Analysis

LDA

ansLDA <- MASS::lda(default~. , data=Default) 
yH <- predict(ansLDA, newdata=Default)$class
y <- Default$default
eta <- 1-mean(yH==y)
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI <- round(eta+c(-MOE, MOE),3)
CI <- paste0("(", CI[1], ", ", CI[2],")")
table(y, yH)
##      yH
## y       No  Yes
##   No  9645   22
##   Yes  254   79

The misclassification rate is 0.0276 with 95% confidence interval rate (0.024, 0.031).

Logistic Regression

ansLogit <- glm(default~., data=Default, family="binomial")
yHprob <- predict(ansLogit, newdata=Default, type="response")
yH <- factor(ifelse(yHprob<0.5, "No", "Yes"))
eta <- 1-mean(yH==Default$default) #mis-classification rate
MOE <- 1.96*sqrt(eta*(1-eta)/length(yH))
CI <- round(eta+c(-MOE, MOE),3)
CI <- paste0("(", CI[1], ", ", CI[2],")")
table(Default$default, yH) #confusion table
##      yH
##         No  Yes
##   No  9627   40
##   Yes  228  105

The observed misclassification rate is 0.0268 and a 95% confidence interval for the hypothetical true misclassifcation rate (0.024, 0.03).

Both LDA and logistic DA give essentially equivalent predictions but the logistic method also provides a model that describes the importance of the variables.

summary(ansLogit)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8
Interpretation of the Output

The fitted equation may be written,

\[{\rm logit(Pr\{default\})} = -10.87- 0.6468\times {\rm Student} + 0.005737\times {\rm Balance} + 0.00000303 \times {\rm Income}.\]

From R help on glm, “For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures”. In the present case this means that “success” is the 2nd level of the factor default which is coded as “Yes”.