This artificial dataset is generated using gencve::rmix() and presents a classification problem with nonlinear decision boundaries. The DGM (data generating mechanism) simulates the two inputs \(x_1\) and \(x_2\) from two different mixture distributions, denoted green and red. Each mixture distribution comprises a random mixture of ten normal random variables; see the rmix() source code and documentation for more details. A plot of a random sample of size \(n=200\) is shown in Figure 1.
Because the precise details of the DGM are known, it is possible to derive the optimal Bayes decision boundaries and the corresponding Bayes error rate, \(\eta = 20.76\)%. As shown in Figure 2, these decision boundaries are nonlinear.
We compare Random Forest (RF) with other well-known state-of-the-art classifiers: k-Nearest-Neighbour (kNN), Support Vector Machine (SVM), and feedforward neural net classifiers. First, a brief introduction to each classifier is given.
The kNN classifier is important because it is very simple and often performs well. Like the Simple Bayes classifier, kNN often provides a benchmark for other, more complex classifiers. The specific R functions we will use are class::knn.cv() to select \(k\), the number of nearest neighbours to use, and class::knn() to make predictions for test data.
The kNN classifier is specified by a single parameter \(k\), the number of neighbours used to make the prediction. Given this parameter, the classifier requires no further training. We assume there are \(p\) inputs, \(x_1, \ldots, x_p\), all continuous variables, and that the output variable \(y\) is categorical. Given \(n\) cases in the training data, \(x_{i,1},\ldots,x_{i,p}, y_i,\ i=1,\ldots,n\), and a test case with inputs \({\cal X} = (X_1, \ldots, X_p)\), we find the \(k\) nearest neighbours of \({\cal X}\) in the training data using the usual Euclidean distance. The prediction is determined by a vote among these nearest neighbours; ties are split at random. kNN is a universal approximator, so as the training data grows and \(k\) increases suitably, the approximation converges to the truth. But kNN's performance usually suffers drastically as the dimension of the inputs, \(p\), increases. A slight exception to this curse of dimensionality is that the nearest-neighbour predictor with \(k=1\) often sets a reasonable benchmark, much better than random guessing, against which to compare more sophisticated predictors.
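As a minimal sketch of this workflow, the class package's knn.cv() can select \(k\) by leave-one-out cross-validation and knn() then predicts the test cases. The small simulated dataset here is an illustrative assumption, standing in for the mixture data:

```r
library(class)  # recommended package shipped with R

set.seed(42)
# Toy two-class training set: p = 2 inputs, 100 cases per class
n <- 100
train <- rbind(matrix(rnorm(2 * n, mean = 0),   ncol = 2),
               matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
cl <- factor(rep(c("green", "red"), each = n))

# Leave-one-out cross-validation error for each candidate k
ks <- seq(1, 25, by = 2)
cv.err <- sapply(ks, function(k) mean(knn.cv(train, cl, k = k) != cl))
k.best <- ks[which.min(cv.err)]

# Predict 10 new test cases with the selected k; ties are split at random
test <- matrix(rnorm(20, mean = 0.75), ncol = 2)
pred <- knn(train, test, cl, k = k.best)
```

Since kNN relies on Euclidean distance, inputs measured on very different scales should be standardized before this step.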
Weka extends the kNN algorithm to deal with categorical inputs. This package is widely used in the data mining community for teaching and research and supports a flexible and extensible interface for developing data mining algorithms. It is supported by the book Data Mining, http://www.cs.waikato.ac.nz/ml/weka/book.html, now in its fourth edition. An R interface to Weka is available in the R package https://cran.r-project.org/web/packages/RWeka/index.html. Professor Ian Witten, lead author, provides a 6-minute video on how to use Weka for kNN classification: https://www.youtube.com/watch?v=zjYUYJ2b4r8
Other R packages on CRAN also provide kNN algorithms.
Support Vector Machine (SVM) classifiers, built on relatively recent and mathematically sophisticated algorithms, enjoyed great initial success and remain very popular. We use the function e1071::svm() for SVM fitting and prediction, which is based on the efficient and well-regarded LIBSVM software. The original C and Java code, as well as interesting datasets, are available from https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
The base method solves an optimization problem that finds classifier boundaries maximally separating the data into classes. The separable case is illustrated in the diagram below.
In the non-separable case, two approaches are used: a soft margin that permits some misclassification via slack variables, and kernel functions that map the inputs to a higher-dimensional feature space where the classes are more nearly separable.
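As a sketch of the standard formulation (to be derived properly in the later lecture), with class labels coded \(y_i \in \{-1, +1\}\), the separable case maximizes the margin by solving

\[
\min_{w,\, b}\ \tfrac{1}{2}\|w\|^2
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1, \quad i = 1, \ldots, n,
\]

while the soft-margin version introduces slack variables \(\xi_i \ge 0\) and a cost parameter \(C\):

\[
\min_{w,\, b,\, \xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.
\]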
We will outline the theory of SVM in a future lecture.
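A minimal sketch of fitting an SVM with e1071::svm(), assuming the e1071 package is installed. The simulated circular-boundary data here is an illustrative assumption, not the mixture data:

```r
library(e1071)  # assumed installed; wraps the LIBSVM software

set.seed(1)
# Toy nonlinear problem: class depends on distance from the origin
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, "red", "green"))

# The radial (Gaussian) kernel handles the nonlinear boundary;
# cost is the soft-margin penalty parameter C
fit  <- svm(x, y, kernel = "radial", cost = 1)
pred <- predict(fit, x)
err  <- mean(pred != y)  # training misclassification rate
```

In practice the cost and kernel parameters would be tuned by cross-validation rather than left at illustrative values.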
The feedforward neural net (FFNN) is often used for class prediction problems. It is widely used to solve image recognition problems; today deep neural nets can be trained to reliably recognize images of dogs, cats, and even human faces. Deep neural nets provide the driving expertise required for self-driving cars and countless other modern engineering applications. It takes considerable skill and subject matter expertise to develop neural nets for these image recognition problems.
At the simplest level there is the FFNN with one hidden layer, \(p\) inputs, and \(h\) hidden nodes. It is known that the FFNN is a universal approximator as \(h\) increases along with the amount of training data. The schematic below shows an FFNN with \(p=2\) inputs plus a constant term and \(h=2\) hidden nodes on a single layer. The coefficients, denoted by \(w\)'s, are weights that feed into an activation function, a sigmoid-shaped function; often the logistic curve is used.
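In symbols, a sketch of such a one-hidden-layer network for two-class prediction: with logistic activation \(\sigma(z) = 1/(1+e^{-z})\), the predicted class probability is

\[
\hat{p}(x) = \sigma\!\Big( w^{(2)}_0 + \sum_{j=1}^{h} w^{(2)}_j\,
\sigma\big( w^{(1)}_{0j} + \sum_{i=1}^{p} w^{(1)}_{ij}\, x_i \big) \Big),
\]

where the \(w^{(1)}\)'s are the input-to-hidden weights (including the constant terms) and the \(w^{(2)}\)'s are the hidden-to-output weights. The superscript notation is an illustrative choice, not taken from the schematic.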
Several state-of-the-art neural net packages are available on CRAN. We will use the R function nnet::nnet() that was developed by Brian Ripley and discussed in his book http://www.cambridge.org/ca/academic/subjects/statistics-probability/computational-statistics-machine-learning-and-information-sc/pattern-recognition-and-neural-networks?format=PB&isbn=9780521717700.
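A minimal sketch with nnet::nnet(), which is a recommended package shipped with R; the toy data and tuning values below are illustrative assumptions:

```r
library(nnet)  # recommended package shipped with R

set.seed(7)
# Toy two-class problem with p = 2 inputs
d <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
d$y <- factor(ifelse(d$x1 * d$x2 > 0, "red", "green"))

# One hidden layer with h = 2 nodes; logistic activation is the default.
# decay adds a little weight-decay regularization to stabilize the fit.
fit  <- nnet(y ~ x1 + x2, data = d, size = 2, decay = 0.01,
             maxit = 200, trace = FALSE)
pred <- predict(fit, d, type = "class")
```

Because the weights are randomly initialized, refitting from several starting values and keeping the best fit is common practice.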
If time permits we will discuss further aspects of neural nets in a future lecture.
Gradient boosting machines (GBM) have become very popular and are frequently used in winning Kaggle competitions. Trevor Hastie has an excellent talk on GBM: https://www.youtube.com/watch?v=wPqtzj5VZus.
We compare the effectiveness of the algorithms discussed above on the class prediction problem for the mixture dataset. Each classifier is run on the same 25 simulated datasets. The challenge is to predict the class from a training sample of only 200 examples, 100 in each class. After training, each predictor is run on a test sample of size \(10^4\) so that an accurate estimate of the misclassification rate is obtained; the maximum margin of error (MOE) for a 95% confidence interval is about 0.01. The results are displayed in the boxplots below. The horizontal red line shows the theoretical minimum misclassification rate, about 20.76%.
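The MOE figure can be checked directly in R: for a proportion estimated from \(n = 10^4\) test cases, the 95% margin of error is largest at \(p = 0.5\):

```r
# Maximum margin of error of a 95% CI for an estimated proportion,
# attained at p = 0.5, with a test sample of size 10^4
n   <- 1e4
moe <- qnorm(0.975) * sqrt(0.5 * 0.5 / n)
round(moe, 4)  # about 0.0098, i.e. roughly 0.01
```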
The computing times in seconds are shown in the table below and are all very reasonable.
## RF GBM SVM NNET KNN
## 12.03 3.96 5.45 4.56 4.25
From the boxplots we see there is considerable variability in the predictors' misclassification rates. Although kNN has the best mean performance, its variance is quite large.
In many applications, the skill and subject matter expertise of the practitioner are important, especially in feature selection to determine the best inputs.