This dataset gives the rate of change in bone density at various ages for North American adolescents. It is discussed in the ESL book, where it is used to illustrate spline regression, a type of parametric regression that we will discuss later.
In this notebook we discuss how loess can be used for nonparametric regression analysis with a single input. The method is also useful with two inputs. In principle it could be extended to a higher number of inputs, but we will see that the curse of dimensionality limits the utility of this approach.
The data frame has 485 observations of four variables:
## 'data.frame': 485 obs. of 4 variables:
## $ idnum : int 1 1 1 2 2 2 3 3 3 4 ...
## $ age : num 11.7 12.7 13.8 13.2 14.3 ...
## $ gender: Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 1 ...
## $ spnbmd: num 0.01808 0.06011 0.00586 0.01026 0.21053 ...
The lattice boxplot summarizes the overall distribution for males and females. The distribution of bone density changes is skewed to the right but there is little other apparent difference.
With lattice graphics we can create conditional plots, that is, plots conditioned on a factor variable. Numeric conditioning variables are handled by binning the variable into an ordered factor whose levels each contain approximately the same number of cases. The conditional boxplot for the bone density shows the dependency on age.
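As a rough illustration of conditioning on a numeric variable, the sketch below bins age into four groups of roughly equal size with `equal.count()` and draws one boxplot panel per bin. The data frame `sim` is a simulated stand-in with the same column names as the bone data, not the real measurements, and the choice of four non-overlapping bins is an assumption made for the example.

```r
library(lattice)

# Simulated stand-in for the bone data (NOT the real measurements)
set.seed(1)
n <- 300
sim <- data.frame(
  age    = runif(n, 9.5, 25.5),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
peak <- ifelse(sim$gender == "female", 12, 14)
sim$spnbmd <- 0.1 * exp(-((sim$age - peak) / 3)^2) +
  rnorm(n, sd = 0.05 * exp(-(sim$age - 9.5) / 8))

# Bin age into 4 non-overlapping intervals with about equal counts
age_bins <- equal.count(sim$age, number = 4, overlap = 0)

# One panel per age bin, one box per gender within each panel
p <- bwplot(gender ~ spnbmd | age_bins, data = sim,
            layout = c(1, 4), xlab = "Change in bone density")
print(p)
```

Setting `overlap` to a positive fraction would let neighbouring bins share cases, which is the lattice default for shingles.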
The scatterplot matrix is also a widely used exploratory graphics method. Traditional R graphics uses the function pairs() for scatterplot matrices. Because we have the two classes ‘male’ and ‘female’, it is preferable to use the lattice graphics function splom(). In addition to the scatterplot in each panel, a loess curve is shown to help visualize the data. Only the default settings for the loess curve were used, so the curve provides only rough guidance. From the plot it is apparent that females have larger increases in bone density earlier than males. Female bone density changes also level off sooner than those of males. By age 25 there does not appear to be much difference.
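A grouped scatterplot matrix with default loess curves might be produced along the following lines. This is only a sketch: `sim` is simulated stand-in data using the bone data's column names, and `type = c("p", "smooth")` is one way of asking lattice to add its default loess smooth, not necessarily the call used for the figure in this notebook.

```r
library(lattice)

# Simulated stand-in for the bone data (NOT the real measurements)
set.seed(1)
n <- 300
sim <- data.frame(
  age    = runif(n, 9.5, 25.5),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
peak <- ifelse(sim$gender == "female", 12, 14)
sim$spnbmd <- 0.1 * exp(-((sim$age - peak) / 3)^2) +
  rnorm(n, sd = 0.05 * exp(-(sim$age - 9.5) / 8))

# Scatterplot matrix of the numeric variables, grouped by gender;
# "p" draws the points and "smooth" adds a default loess curve per group
p <- splom(~ sim[c("age", "spnbmd")], data = sim, groups = gender,
           type = c("p", "smooth"), auto.key = list(columns = 2))
print(p)
```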
## Regression Fitting
The R function loess() provides full nonparametric regression modelling capability for one to four quantitative explanatory variables. R method functions are provided for summary(), resid() and predict(). The model complexity is controlled by the smoothing parameter, specified through either span or enp.target. It is often more convenient to use enp.target, which specifies the approximate effective degrees of freedom. We may use trial and error to select the fit that best balances the trade-off between bias and variance. Later in the course we will show how the AIC/BIC criteria may also be used to select a suitable smoothing parameter.
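The sketch below shows the two ways of setting the smoothing parameter and the three method functions mentioned above. The data are simulated for illustration only (the curve shape and noise level are assumptions), and `enp.target = 6` is an arbitrary value chosen for the example.

```r
# Simulated single-input data (NOT the real bone measurements)
set.seed(1)
n <- 200
sim <- data.frame(age = runif(n, 9.5, 25.5))
sim$spnbmd <- 0.12 * exp(-((sim$age - 12.5) / 3.5)^2) + rnorm(n, sd = 0.03)

# Smoothness specified directly through span ...
fit_span <- loess(spnbmd ~ age, data = sim, span = 0.36, degree = 2)
# ... or indirectly through a target equivalent number of parameters
fit_enp <- loess(spnbmd ~ age, data = sim, enp.target = 6, degree = 2)

print(summary(fit_enp))   # reports the span and equivalent number of parameters
res <- resid(fit_enp)     # residuals at the observed ages

# Evaluate the fitted curve on a grid of 100 equally spaced ages
grid <- data.frame(age = seq(min(sim$age), max(sim$age), length.out = 100))
curve_enp <- predict(fit_enp, newdata = grid)

# Span implied by enp.target, and the achieved equivalent number of parameters
c(span = fit_enp$pars$span, enp = fit_enp$enp)
```

Refitting with a few different `enp.target` values and overlaying the curves is one simple way to carry out the trial-and-error selection.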
Figure 4 shows a superimposed plot of the data together with the fitted curves for both males and females. By using color for the two classes ‘male’ and ‘female’ a reasonably good graphic visualization is obtained. When there are more than two classes it is better to use juxtaposition.
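A superimposed plot of this kind could be drawn with base graphics roughly as follows. The data frame `sim` is simulated stand-in data, the colors are arbitrary, and the smoothing settings `span = 0.36` and `degree = 2` are used here purely as example values.

```r
# Simulated stand-in for the bone data (NOT the real measurements)
set.seed(1)
n <- 300
sim <- data.frame(
  age    = runif(n, 9.5, 25.5),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
peak <- ifelse(sim$gender == "female", 12, 14)
sim$spnbmd <- 0.1 * exp(-((sim$age - peak) / 3)^2) +
  rnorm(n, sd = 0.05 * exp(-(sim$age - 9.5) / 8))

cols <- c(female = "red", male = "blue")
plot(spnbmd ~ age, data = sim, pch = 20,
     col = cols[as.character(sim$gender)],
     xlab = "Age", ylab = "Change in bone density")

# Fit a separate loess curve per class and draw it on a 100-point grid
curves <- list()
for (g in levels(sim$gender)) {
  d <- sim[sim$gender == g, ]
  fit <- loess(spnbmd ~ age, data = d, span = 0.36, degree = 2)
  grid <- data.frame(age = seq(min(d$age), max(d$age), length.out = 100))
  curves[[g]] <- predict(fit, newdata = grid)
  lines(grid$age, curves[[g]], col = cols[g], lwd = 2)
}
legend("topright", legend = names(cols), col = cols, lwd = 2)
```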
Juxtaposed plots may easily be created using the R lattice package. Figure 5 shows the resulting two-panel display. The loess curve was created using the lattice function panel.loess() with the same smoothing parameters span=0.36 and degree=2 that were used in Figure 4, but close inspection shows that the curves are a little different! The likely reason is that panel.loess() is built on loess.smooth(), which by default evaluates the smoother at only 50 equally spaced points and uses the robust family="symmetric" fit, whereas in Figure 4 we evaluated the smoother using the predict() function at 100 equi-spaced points in the domain of the input variable age.
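As a rough check on how much such evaluation choices matter, the sketch below compares a curve obtained from loess() and predict() on a 100-point grid with one from loess.smooth(), the base R helper that takes the same span, degree, family and evaluation arguments as panel.loess(). The data are simulated, so the size of the discrepancy is illustrative only.

```r
# Simulated single-input data (NOT the real bone measurements)
set.seed(1)
n <- 200
sim <- data.frame(age = runif(n, 9.5, 25.5))
sim$spnbmd <- 0.12 * exp(-((sim$age - 12.5) / 3.5)^2) + rnorm(n, sd = 0.03)

# Curve from loess() + predict() on 100 equally spaced points
fit <- loess(spnbmd ~ age, data = sim, span = 0.36, degree = 2)
grid <- seq(min(sim$age), max(sim$age), length.out = 100)
direct <- predict(fit, newdata = data.frame(age = grid))

# Curve from loess.smooth() with its defaults:
# family = "symmetric" and evaluation = 50 points
sm <- loess.smooth(sim$age, sim$spnbmd, span = 0.36, degree = 2)

# Put both curves on the same grid by linear interpolation and compare
sm_on_grid <- approx(sm$x, sm$y, xout = grid)$y
max_diff_default <- max(abs(direct - sm_on_grid))

# Same comparison after matching the family and the number of points
sm2 <- loess.smooth(sim$age, sim$spnbmd, span = 0.36, degree = 2,
                    family = "gaussian", evaluation = 100)
max_diff_matched <- max(abs(direct - sm2$y))

c(default = max_diff_default, matched = max_diff_matched)
```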
Figure 5 is probably good enough for basic exploratory data visualization, but since we would like to see an accurate regression curve, the more elaborate approach used to create the superimposed plot in Figure 4 may be used to create a similar lattice version. Figure 6 shows the more accurate version of the loess regression curve.
By inspecting the plot of the data with the loess curve we can also judge whether the loess curve may be overfitting or underfitting. In addition, other diagnostic plots are useful.
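One way to obtain such a lattice version is a custom panel function that fits loess() within each panel and draws the curve from predict() on a fine grid. The sketch below assumes simulated stand-in data in `sim`; the panel function is a hypothetical illustration and not necessarily the code behind the figure.

```r
library(lattice)

# Simulated stand-in for the bone data (NOT the real measurements)
set.seed(1)
n <- 300
sim <- data.frame(
  age    = runif(n, 9.5, 25.5),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
peak <- ifelse(sim$gender == "female", 12, 14)
sim$spnbmd <- 0.1 * exp(-((sim$age - peak) / 3)^2) +
  rnorm(n, sd = 0.05 * exp(-(sim$age - 9.5) / 8))

p <- xyplot(spnbmd ~ age | gender, data = sim,
  panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)                      # raw data for this panel
    fit <- loess(y ~ x, span = 0.36, degree = 2) # per-panel loess fit
    gx <- seq(min(x), max(x), length.out = 100)  # 100-point evaluation grid
    panel.lines(gx, predict(fit, newdata = data.frame(x = gx)),
                col = "black", lwd = 2)
  })
print(p)
```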
The most important diagnostic plot for loess regression is the residual dependency plot. Sometimes the monotone spread plot (see VIS and McLeod, 1996) is useful. The normal probability plot may be useful to check for outliers.
The purpose of the residual dependency plot is to look for systematic bias or lack of fit. This is revealed by a non-zero slope in the loess curve fitted to the residuals. Usually we use the parameters degree=1 and span=0.8 or higher with this plot since we are looking for trend.
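A residual dependency plot can be sketched as follows: the residuals from the main fit are plotted against the input and smoothed with a stiff loess curve (degree=1, span=0.8). The data are simulated for illustration, so the plot only shows the mechanics, not the result for the bone data.

```r
# Simulated single-input data (NOT the real bone measurements)
set.seed(1)
n <- 200
sim <- data.frame(age = runif(n, 9.5, 25.5))
sim$spnbmd <- 0.12 * exp(-((sim$age - 12.5) / 3.5)^2) + rnorm(n, sd = 0.03)

# Main fit and its residuals
fit <- loess(spnbmd ~ age, data = sim, span = 0.36, degree = 2)
sim$res <- resid(fit)

# Stiff loess smooth of the residuals: we are only looking for trend
res_fit <- loess(res ~ age, data = sim, span = 0.8, degree = 1)
grid <- data.frame(age = seq(min(sim$age), max(sim$age), length.out = 100))
trend <- predict(res_fit, newdata = grid)

plot(res ~ age, data = sim, pch = 20, col = "grey40",
     xlab = "Age", ylab = "Residual")
abline(h = 0, lty = 2)                  # reference line for "no bias"
lines(grid$age, trend, lwd = 2)         # should stay close to zero if the fit is adequate

# Normal probability plot of the residuals to look for outliers
qqnorm(sim$res)
qqline(sim$res)
```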
Since the slope is approximately zero we see that there is no systematic departure or bias. But the striking feature revealed is the non-constant variance. The variance depends on age and/or the output variable itself. For valid p-values or confidence intervals for predictions it would be necessary to take this non-constant variance into account. This could be done by using weighted loess, just as weighted least squares is sometimes used to adjust for an error variance that depends on an explanatory variable. Another possibility would be a data transformation. But for the purpose of visualizing and summarizing the relationship between changes in bone density and age, the current model is adequate. Using more elaborate techniques will not change the resulting curve in any material way.
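A minimal sketch of the weighted-loess idea, under stated assumptions: the spread is estimated by smoothing the absolute residuals against age, and the reciprocal of its square is passed to the `weights` argument of loess(). This particular way of estimating the weights is an illustrative choice rather than a method prescribed in these notes, and the data are simulated with noise that decreases with age.

```r
# Simulated data with variance decreasing in age (NOT the real bone measurements)
set.seed(1)
n <- 300
sim <- data.frame(age = runif(n, 9.5, 25.5))
sim$spnbmd <- 0.12 * exp(-((sim$age - 12.5) / 3.5)^2) +
  rnorm(n, sd = 0.05 * exp(-(sim$age - 9.5) / 8))

# Step 1: unweighted fit
fit <- loess(spnbmd ~ age, data = sim, span = 0.36, degree = 2)

# Step 2: smooth the absolute residuals to estimate how spread varies with age.
# This is proportional to the error SD, which is all the weights need.
sim$absres <- abs(resid(fit))
spread <- loess(absres ~ age, data = sim, span = 0.8, degree = 1)
sd_hat <- pmax(fitted(spread), 0.002)   # small floor keeps the weights finite

# Step 3: refit with weights inversely proportional to the estimated variance
sim$w <- 1 / sd_hat^2
fit_w <- loess(spnbmd ~ age, data = sim, weights = w, span = 0.36, degree = 2)

# Largest change in the fitted curve caused by the weighting
max_shift <- max(abs(fitted(fit_w) - fitted(fit)))
max_shift
```

Comparing `max_shift` with the overall range of the fitted values gives a quick sense of whether the weighting changes the curve in any material way.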
The VIS book by Cleveland discusses numerous applications of loess regression, and R scripts are provided for every figure given in the book; see the online documentation of xyplot(). The VIS book demonstrates the power of data visualization by pointing out major blunders that famous researchers have made in their publications. VIS also shows how loess may be applied with more than one input variable and to trend-seasonal analysis of time series.
In this course we will also discuss local fitting of other parametric models, including logistic regression as discussed by Loader (1999), and generalized additive models (Wood, 2006; ESL).
ESL. Elements of Statistical Learning, Hastie, Tibshirani and Friedman (2009).
Local Regression and Likelihood, Loader (1999). See also the CRAN locfit package.
VIS. Visualizing Data, W.S. Cleveland (1993).
Generalized Additive Models, S. Wood (2006).
Diagnostic checking for monotone spread. Computational Statistics and Data Analysis 26, 437-443, McLeod (1996). http://www.stats.uwo.ca/faculty/aim/vita/pdf/diagms.pdf