We show how to use the pls package for model fitting and prediction. This package, published in the Journal of Statistical Software (JSS), provides algorithms for fitting both partial least squares (PLS) and principal component regression (PCR) models.
Since the pls package is designed for use with multivariate output variables, many of the results are represented as matrices or three-dimensional arrays.
Cross-validation is provided as a guide in an iterative model-building approach rather than as an automated method. Only very naive cross-validation methods are provided: essentially a single iteration of K-fold cross-validation, leave-one-out cross-validation, or simple modifications of these. The objection to these methods is that, because no replication is done, the results are too random. Also note that the theoretical analysis discussed in ESL (Ch. 7) shows that leave-one-out cross-validation has high variance in terms of its unconditional performance. In conclusion, this package is not suitable as it stands for automatic modelling. I have implemented bestglm::pcreg() to provide improved automatic univariate models for both PLS and PCR.
This package also provides four interesting datasets that are discussed in the JSS paper. Although the datasets are provided as dataframes, often a single column corresponds to a matrix, and most of the datasets have multivariate output variables.
Data sets in package ‘pls’:
gasoline Octane numbers and NIR spectra of gasoline
mayonnaise NIR measurements and oil types of mayonnaise
oliveoil Sensory and physico-chemical data of olive oils
yarn NIR spectra and density measurements of PET yarns
The exception is the yarn dataset, which is a dataframe with dimensions 28-by-3. The (1,1)-element of this dataframe is a vector of length 268 representing the NIR spectra, which are the inputs.
plot(density(yarn[1,1]), main="Kernel density")
The second column is the univariate output variable, yarn density, and the third column is a logical variable indicating whether the observation belongs to the training or the test data. This is an example of wide data, since there are 268 inputs but only 28 observations.
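The matrix-within-a-column structure described above can be illustrated with a small self-contained sketch; the toy data below simply mimics the shape of the yarn dataset (28 observations, a 268-column NIR matrix stored as one column) and is not the real data:

```r
# toy illustration of a matrix stored as a single dataframe column,
# mirroring the structure of the yarn data (28 x 268 NIR matrix)
X <- matrix(rnorm(28 * 268), nrow = 28)
d <- data.frame(density = rnorm(28),
                train = rep(c(TRUE, FALSE), c(21, 7)))
d$NIR <- X    # a matrix column: d has 3 columns, but d$NIR is 28 x 268
dim(d)        # 28  3
dim(d$NIR)    # 28 268
```

This is why `dim()` reports only three columns even though one of those columns carries all 268 inputs.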
The mcdonald dataset was used in a 1973 Technometrics paper to illustrate the application of ridge regression when there is a high degree of dependence, or multicollinearity, among the inputs. Figure 2 shows the plot of the variance inflation factors (VIF), with the warning limit at 10. PLS is often advocated for such multicollinear data.
barplot(vifx(mcdonald[,-ncol(mcdonald)]))
abline(h=10, col="red")
title(main="Figure 2. Variance inflation factors")
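The vifx() call above is assumed to come from the author's own toolkit rather than the pls package. The underlying computation is standard, however: the VIF for input j is 1/(1 - R²_j), where R²_j comes from regressing input j on all the other inputs. A minimal from-scratch sketch (the function name vifSketch is ours):

```r
# VIF from first principles: regress each column on the others
# and compute 1 / (1 - R^2)
vifSketch <- function(X) {
  X <- as.data.frame(X)
  sapply(seq_along(X), function(j) {
    r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
    1 / (1 - r2)
  })
}
```

Columns involved in near-linear dependencies get large VIFs; an independent column gets a VIF near 1.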
Since there are only 60 observations in this dataset, and our goal here is just to demonstrate how fitting and prediction can be done using this package, we will keep 57 observations in the training set and, for illustration, use only 3 observations as test data. We can use our trainTestPartition() function.
XyList <- trainTestPartition(mcdonald, trainFrac=0.95)
XyTr <- XyList$XyTr
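trainTestPartition() is our own helper, not part of pls. For readers without it, a minimal stand-in with the same behaviour might look like the hypothetical sketch below (the name trainTestSplit and the seed handling are our assumptions, not the actual implementation):

```r
# hypothetical sketch of a train/test partition helper:
# returns the training frame, the test frame, and the test response
# (assumed to be the last column of Xy)
trainTestSplit <- function(Xy, trainFrac = 0.95, seed = 1) {
  set.seed(seed)
  n   <- nrow(Xy)
  iTr <- sample(n, size = round(trainFrac * n))
  list(XyTr = Xy[iTr, , drop = FALSE],
       XyTe = Xy[-iTr, , drop = FALSE],
       yTe  = Xy[-iTr, ncol(Xy)])
}
```

With 60 observations and trainFrac = 0.95, this yields the 57/3 split used below.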
We would like to use cross-validation to determine the number of components to use, but as previously explained, insufficient replications are used. The table below summarizes the distribution of the selected number of components based on a for-loop with 50 runs. The results are not replicable either, because the cross-validation segments are chosen randomly inside plsr().
k <- NULL
for (i in 1:50) {
ans <- plsr(MORT ~ ., scale=TRUE, data=XyTr, validation="CV")
k[i] <- which.min(ans$validation$PRESS)
}
table(k)
## k
## 3 4 5 6 7
## 9 9 19 11 2
For good measure, we repeat the experiment.
k <- NULL
for (i in 1:50) {
ans <- plsr(MORT ~ ., scale=TRUE, data=XyTr, validation="CV")
k[i] <- which.min(ans$validation$PRESS)
}
table(k)
## k
## 3 4 5 6 7 8 9
## 9 5 21 8 5 1 1
As mentioned, the authors of the pls package do not envision cross-validation being used in this automated manner. Instead it is considered more of a diagnostic check, or rough rule, to suggest how many components to use.
The output from plsr() has S3 class “mvr”, with the associated method function predict.mvr(). The prediction function takes the argument ncomp, which specifies how many latent vectors to use; that is, the latent vectors used correspond to 1:ncomp. Let’s try prediction using the first 3 components.
XyTe <- XyList$XyTe
predict(ans, newdata=XyTe, ncomp=3)
## , , 3 comps
##
## MORT
## 19 985.9580
## 37 986.5152
## 40 1016.3996
Notice that the output is an R array with three dimensions (observations, responses, component settings), although here the second and third dimensions each have length 1.
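Base R's drop() can simplify such an array to an ordinary named vector. The sketch below reconstructs an array with the same shape and values as the output above to show the effect:

```r
# a 3 x 1 x 1 array like the one returned by predict.mvr()
p <- array(c(985.9580, 986.5152, 1016.3996), dim = c(3, 1, 1),
           dimnames = list(c("19", "37", "40"), "MORT", "3 comps"))
drop(p)   # a plain named numeric vector of length 3
```

After drop(), the extents of length 1 are gone and the result is easier to tabulate or plot.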
Now we obtain predictions using the first 3, 4, 5 and 6 components and compare them with the observed values.
tb <- matrix(numeric(0), ncol=3, nrow=5)
tb[1,] <- as.vector(predict(ans, newdata=XyTe, ncomp=3))
tb[2,] <- as.vector(predict(ans, newdata=XyTe, ncomp=4))
tb[3,] <- as.vector(predict(ans, newdata=XyTe, ncomp=5))
tb[4,] <- as.vector(predict(ans, newdata=XyTe, ncomp=6))
tb[5,] <- XyList$yTe
row.names(tb) <- c(paste0("ncomp=",3:6), "observed")
tb
## [,1] [,2] [,3]
## ncomp=3 985.9580 986.5152 1016.3996
## ncomp=4 980.4488 994.5307 1013.2233
## ncomp=5 973.9632 987.7730 1009.9963
## ncomp=6 978.8664 982.6820 998.6171
## observed 959.2210 1113.1560 991.2900
If we wanted to use this package for prediction in an automatic fashion, I would recommend using a weighted average of the predictions, with weights determined by running plsr() many times and recording the frequency with which each number of components is selected.
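This ensemble idea can be sketched directly from the results above: take the selection counts for ncomp = 3..6 from the first table(k) run (ignoring the two runs that chose 7 components, since we did not compute those predictions) and weight the rows of the prediction table accordingly:

```r
# weight predictions by how often CV selected each number of components
freq  <- c(9, 9, 19, 11)       # counts for ncomp = 3, 4, 5, 6 (first run)
w     <- freq / sum(freq)      # normalized weights
preds <- rbind(c(985.9580, 986.5152, 1016.3996),   # ncomp = 3
               c(980.4488, 994.5307, 1013.2233),   # ncomp = 4
               c(973.9632, 987.7730, 1009.9963),   # ncomp = 5
               c(978.8664, 982.6820,  998.6171))   # ncomp = 6
yhat <- as.vector(w %*% preds) # weighted-average prediction per test case
round(yhat, 2)                 # 978.55  987.64 1009.19
```

The resulting predictions fall between the individual ncomp choices, smoothing out the run-to-run randomness in the selected number of components.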