This data set is from the U.S. Small Business Administration (SBA) and provides historical data from 1987 through 2014. This large data set contains 28 variables and 699,153 observations (19,576,284 data values). The file size is about 1.3MB - easily handled using R. Each observation represents a loan that was guaranteed to some degree by the SBA. Included is a variable [MIS_Status] which indicates if the loan was paid-in-full or defaulted/charged-off. More details are available in the DataDescription.pdf included in the data directory.

One important area in finance to develop algorithms for deciding which loans should be granted. Two popular approaches are logistic regression and classification trees. It is also of interest to know what are the important variables for this prediction - support vector machines, neural nets and many other popular Data Mining algorithms don’t address this question very well.

R code is boring. We hide the code since we are usually more interested in the analysis. But the code is there to make your report reproducible and an outside expert may review for it for correctness.

The data is in CSV format in the file: sba.csv. The first step is to examine the file - since it is in CSV format it could be loaded into Excel. Instead we start by printing out the first 30 lines:

#Note the unix style head program is installed when you install the R
#add-on RTools package
#https://cran.r-project.org/bin/windows/Rtools/
system("head  E:/Dropbox/R/2017/sdm/data/sba/sba.csv -n 30")

Unfortunately this command does not work as expected with Knit since it simply outputs to the console in RStudio - but this is ok, since normally this output is only of interest in the beginning stage of your analysis. It is not suitable for inclusion in the final report!

The following code chunk inputs the data from a local drive. Due to the size of this dataset it is important to use a local drive since inputting from a URL would take a very long time.

The following table shows the number of “default” and “ok” loans over the entire period 1987-2014.

## default      ok 
##  138440  560713

Next we examine the crosstabs for the percentage of loans that default for each year from 2004 to 2010.

## 2002 2003 2004 2005 2006 2007 2008 2009 2010 
## 11.2 13.9 17.4 24.7 34.4 42.3 40.8 20.5 13.7

A barchart provides an excellent data visualization for this table. We see that defaults were increasing up to 2008. The great recession started in earnest with the collapse of Lehman Brothers in September 2008.