Since this notebook is for tutorial purposes I show the R code explicitly. Normally reports that you prepare would NOT show the R code since the my interest is on how well you explain what you have done. I can also check your Rmd file for the R code if I need. If you are employed as an analyst, it would be distracting to your readers to insert computer code!
The dataset bone was discussed in Elements of Statistical Learning (ESL) where it was used to illustrate the application of smoothing splines. This dataset as well as others discussed in the ESL book are available from the R package ElemStatLearn on CRAN.
#install.packages("ElemStatLearn")
library(ElemStatLearn, quietly=TRUE)
## Warning: package 'ElemStatLearn' was built under R version 3.2.5
The dataframe bone has dimensions 485, 4. The variable of interest spnbmd is the relative spinal bone mineral density which was determined in 261 North American adolescent children as a function of age and gender. A brief summary of the dataframe is given below, Measurements in the bone mineral density of 261 north american adolescents, as function of age. Each value is the difference in spnbmd taken on two consecutive visits, divided by the average. The age is the average age over the two visits.
str(bone)
## 'data.frame': 485 obs. of 4 variables:
## $ idnum : int 1 1 1 2 2 2 3 3 3 4 ...
## $ age : num 11.7 12.7 13.8 13.2 14.3 ...
## $ gender: Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 1 ...
## $ spnbmd: num 0.01808 0.06011 0.00586 0.01026 0.21053 ...
A more informative summary provides information on the distribution of each of the variables in dataframe,
sapply(bone, FUN=summary)
## $idnum
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 60.0 124.0 151.5 240.0 384.0
##
## $age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.40 12.70 15.40 16.10 19.15 25.55
##
## $gender
## female male
## 259 226
##
## $spnbmd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.064100 0.005858 0.026590 0.039250 0.064130 0.219900
The base R function ksmooth() can be used for kernel regression but as shown in the plot the default smoothing is much too little so the regression is overfit.
ym <- bone$spnbmd[bone$gender=="male"]
xm <- bone$age[bone$gender=="male"]
plot(xm, ym, xlab="age", ylab="bone density", main="Males")
fit <- ksmooth(xm, ym)
lines(fit, lwd=2, col=rgb(0,0,0,0.5))
By tweaking the argument settings in ksmooth() an improved smoother can be found.
plot(xm, ym, xlab="age", ylab="bone density", main="Males")
fit <- ksmooth(xm, ym, kernel="normal", bandwidth=2)
lines(fit, lwd=2, col=rgb(0,0,0,0.5))
The R package KernSmooth provides an automatic methods for choosing the bandwidth. The resulting plot, shown below, provides a much better visualization of the data. The bump revealed in the plot is an important part of bone development in young adults.
#install.packages("KernSmooth")
library(KernSmooth)
## Warning: package 'KernSmooth' was built under R version 3.2.5
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
plot(xm, ym, xlab="age", ylab="bone density", main="Males")
h <- dpill(xm, ym)
fit <- locpoly(xm, ym, bandwidth = h)
lines(fit, lwd=2, col=rgb(0,0,0,0.5))
text(24, 0.08, labels="Very Bad!", col="red")
arrows(24, 0.07, fit$x[400], fit$y[400]+0.01, col=rgb(1,0,0,0.6), lwd=3)
But a glaring defect in the kernel smoothing method is also illustrated. Kernel smoothing methods tend to often behave poorly near the end-points. There are many other practical problems with the use of these methods for regression.
In particular, nonparametric smoothers such as loess which we will discuss next are better.