Training and test samples of sizes 200 and 20,000, respectively, were generated from a complicated mixture distribution using the function gencve::rmix(). These datasets have two inputs \(x_1\) and \(x_2\), and the output has two classes. Previously we showed that the theoretical optimal decision boundary is highly nonlinear, with a corresponding optimal misclassification rate of about 20.7%. We therefore do not expect the logistic classifier to do very well with this data.

The example data is available on my homepage, and we can read it in using read.csv(). We use str() to check that it looks OK.

XyTr <- read.csv(
 "http://www.stats.uwo.ca/faculty/aim/2017/sdm/data/rmixTr.csv", 
 header=TRUE)
str(XyTr)
## 'data.frame':    200 obs. of  3 variables:
##  $ x1: num  2.147 3.368 -0.247 1.006 1.027 ...
##  $ x2: num  1.032 -0.181 2.887 0.09 -0.505 ...
##  $ y : int  0 0 0 0 0 0 0 0 0 0 ...

Next we fit a logistic regression model using the R function glm(). First we need to convert the output variable into a factor. We use the table() function to check that the data look correct and to confirm there are 100 observations in each class.

#glm with logistic regression needs factor variable
XyTr$y <- as.factor(XyTr$y)
table(XyTr$y)
## 
##   0   1 
## 100 100

After fitting, we generate the predictions. Notice that predict() for a logistic regression returns only the estimated probabilities, so we use ifelse() to convert these into 0/1 predictions. Note that we are also implicitly assuming a 0/1 loss function, so these are the optimal predictions. More about 0/1 loss functions later.

ans <- glm(y ~ ., data=XyTr, family=binomial(link="logit"))
pH <- predict(ans, type="response")
yH <- as.factor(ifelse(pH < 0.5, 0, 1))
yTr <- XyTr$y
r <- mean(yH!=XyTr$y)
MOE <- 1.96*sqrt(r*(1-r)/length(yTr))
table(yTr, yH)
##    yH
## yTr  0  1
##   0 70 30
##   1 28 72
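The threshold of 0.5 follows from the 0/1 loss: if \(p\) is the estimated probability of class 1, predicting class 0 incurs expected loss \(p\) while predicting class 1 incurs expected loss \(1-p\). A minimal sketch (the probabilities below are made up for illustration):

```r
# Expected 0/1 loss for each possible prediction, given P(y=1|x) = p
p <- c(0.2, 0.7, 0.9)       # illustrative estimated probabilities
loss_if_predict_0 <- p      # wrong exactly when y = 1
loss_if_predict_1 <- 1 - p  # wrong exactly when y = 0
# Predicting 1 is better exactly when 1 - p < p, i.e. p > 0.5,
# which is the rule ifelse(pH < 0.5, 0, 1) implements
ifelse(loss_if_predict_1 < loss_if_predict_0, 1, 0)
## [1] 0 1 1
```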

The estimate of the misclassification rate based on the training data is 29%, and its 95% confidence interval is (22.7, 35.3)%.
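This interval can be reproduced directly from the off-diagonal counts of the training confusion matrix, using the usual normal approximation to the binomial:

```r
# Normal-approximation 95% CI for the training misclassification rate
n <- 200                              # training sample size
errors <- 30 + 28                     # off-diagonal counts from table(yTr, yH)
r <- errors / n                       # estimated misclassification rate: 0.29
MOE <- 1.96 * sqrt(r * (1 - r) / n)   # margin of error
round(100 * c(r - MOE, r, r + MOE), 1)
## [1] 22.7 29.0 35.3
```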

A plot of the data with the logistic classifier's decision boundary is shown below.

plot(XyTr$x1, XyTr$x2, xlab=expression(x[1]), ylab=expression(x[2]),
     type="n")
redQ <- XyTr$y=="1"
points(XyTr$x1[redQ], XyTr$x2[redQ], col="red", pch=18, cex=2)
points(XyTr$x1[!redQ], XyTr$x2[!redQ], col="blue", pch=18, cex=2)
ab <- -coef(ans)[1:2]/coef(ans)[3]
abline(a=ab[1], b=ab[2], lwd=2, col="gray")
title(main="Mixture training data with logistic decision boundary")
title(sub="y=0/blue and y=1/red")
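The line drawn by abline() comes from setting the estimated log-odds to zero: \(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 = 0\) gives \(x_2 = -\hat\beta_0/\hat\beta_2 - (\hat\beta_1/\hat\beta_2)\,x_1\), which is exactly what ab computes. A minimal sketch with made-up coefficients (illustrative only, not the actual fitted values):

```r
# Hypothetical coefficients in the order (Intercept), x1, x2
b <- c(-0.3, 1.2, -0.8)
# On the boundary the log-odds are zero:
#   b0 + b1*x1 + b2*x2 = 0   =>   x2 = -b0/b2 - (b1/b2)*x1
ab <- -b[1:2] / b[3]
ab   # intercept -0.375, slope 1.5
```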

Finally, we see how well the logistic classifier generalizes to test data. The confusion matrix shows that we make more errors when the true class is blue, that is, when \(y=0\); this agrees with the confusion table for the training data.

XyTe <- read.csv(
 "http://www.stats.uwo.ca/faculty/aim/2017/sdm/data/rmixTe.csv", 
 header=TRUE)
pH <- predict(ans, newdata=XyTe, type="response")
yH <- as.factor(ifelse(pH < 0.5, 0, 1)) 
yTe <- as.factor(XyTe$y)
r <- mean(yH!=yTe)
MOE <- 1.96*sqrt(r*(1-r)/length(yTe))
table(yTe, yH)
##    yH
## yTe    0    1
##   0 6633 3367
##   1 2052 7948

The estimate of the misclassification rate based on the test data is 27.1%, and its 95% confidence interval is (26.5, 27.7)%.

Since the misclassification rates on the training and test data are in agreement, we conclude that the logistic classifier generalizes well, at least in this case. In high-dimensional problems, when the number of inputs \(p\) is large, an overfit model may not generalize so well.