Preamble

This writeup also serves as an example case study for students in my Statistics 9864: Statistical Computing Course. Because I use a well-known dataset and since the rubric indicates that 25% of the grade is for choice of dataset, this report would not achieve a perfect score. However an interesting and insightful presentation is given with some novel comments including the effect of autocorrelation, using ggplot for exploratory analysis with data transformation and suggesting a ggplot replace for coplots and conditional lattice plots. If this was a student presentation, it would receive about 90%. This lecture note also contains more instructional material than I would like to see in a project. Ideally the project writeup should aim to provide an interesting report for someone who is knowlegable about statistical methods in general and should not contain explanations of basic methods.

Later in this course, I will provide you with another case study which will be formatted using PDF and use enhanced features available for tables and figures.

Style Requirements

Introduction

The ozone dataset, lattice::environmental, is comprised of 111 observations over consecutive days of ozone pollution and some covariates in New York City area. More details about the data and an insightful analysis using the scatterplot matrix and conditional plots are given in Cleveland’s book Visualizing Data. The complete dataset is also available in R, datasets::airquality. The original contained missing values, so technically this is an irregular time series. In this report we are going to compare Cleveland’s analysis with what can be achieved using gplot2. In a later presentation, we will explore this dataset using R software for dynamic data analysis. From Table 1, We see that this dataset has four variables, so this means that it can not be visualized perceptually since we can only perceive directly 3D.

Table 1. Summary of Environmental Dataset.
Statistic N Mean St. Dev. Min Max
ozone 111 42.099 33.276 1 168
radiation 111 184.802 91.152 7 334
temperature 111 77.793 9.530 57 97
wind 111 9.939 3.559 2.300 20.700

Ozone pollution is produced solar radiation acting on pollutants from burning coal and from automobile exhaust (gasoline/diesel combustion engines). Our objective is to understand from the data how wind speed and daily maximum temperature are related to observed ozone pollution. We are especially interested in high levels of ozone pollution since these are especially harardous to human health.

Autocorrelation analysis

As noted the data are an irregularly spaced time series. We can use R’s acf() function to check for the strength of dependence in the Ozone series.

Autocorrelation analysis indicates that there is moderate positive autocorrelation in the Ozone series - note that the missing values has a slight whitening effect. For exploratory analysis we can ignore this moderate autocorrelation but for modelling and especially statistical inferences it would need to be taken into account.

Figure 1. Autocorrelation Analysis

Figure 1. Autocorrelation Analysis

Data transformations and boxplot analysis

As discussed in the course, power transformations may be used to improve both statistical modelling and data visualization by making skewed data more symmetric. In practice, simple boxplot analysis enables a reasonable choice of transformation. When a specific model is appropriate, Box-Cox analysis may also be used.

We use a tidyverse script. After making the transformation and using the dplyr::mutate() function we have a dataframe in wide format. To use ggplot we convert this to long format which is accomplished using dplyr::gather. To fix the tick marks on the x-axis, I googled “ggplot2 no tick marks” which gave me the following link on stackoverflow, LINK. Similarly for use plotmath in the panel labels. See also help(plotmath) in R.

Figure 2. Choosing a Power Transformation using Boxplots

Figure 2. Choosing a Power Transformation using Boxplots

The boxplot analysis confirms that Cleveland’s choice of a cube root transformation for ozone does a good job making the data distribution more symmetric.

Scatterplot matrix

Provided the dimension is not too large, as a rough rule, less than 6-8, the most important multi-dimensional visualization is the scatterplot matrix. Dynamic data analysis provides tools for extending this limitation to higher dimensions.

There are several implementations of the available including graphics::pairs() and lattice:splom(). The CRAN package GGally contains extensions of ggplot2 which would require complex programming (ie. more than 10 minutes) to achieve using the tidyverse scripting methods. GGally includes functions for scatterplot matricies, parallel coordinate plots, correlation plots and time series plots.

There are various other renditions of the scatterplot matrix possible and one frequently sees only the lower or upper panels since mathematically speaking the extra panels could be considered to contain redundant information. But this is a mistake! For proper visualization all panels should be used since this enables visual linking by reading across the panels as we will demonstrate in the following discussion. In this discussion, for brievity, we will refer to ozone but strictly it is the cube-root of ozone that we are referring to!

  • The bottom row corresponds to ozone pollution and reading across panel(1,1) shows ozone pollution against radiation. While chemistry theory predicts that ozone pollution should increase with radiation this is only partially true. After radiation reaches about 200, ozone pollution actually declines. This must be due to some confounder. Panel(2,1), wind vs. radiation, shows that wind increases with radiation after radiation reaches 200 also. When the ventillation is low, the concentration of smoke/exhaust pollutants is higher and so the chemical reaction caused by radiation can proceed at a higher rate. This point was previously pointed by Cleveland (1994), however the next two paragraphs discuss some new insights that could be important. Some new further insights are now presented.

  • Perhaps there is a meteorological explanation for this association of wind and radiation.

  • Panel(4,2), radiation vs. temperature, clearly reveals that there are two clusters in the data. Interesting the transpose panel, (3,1) for temperature vs. radiation does not so clearly show this separation into groups. There must a scientific reason for this separation into groups!

  • Panel(4,1) a nonparametric estimate of the probability density functions shown. It is somewhat bimodal. It is well-known that the degree of cloudiness also has a bimodal distribution and it is obvious that solar radiation at ground level strongly depends on cloud cover. So degree of cloudiness is likely another confounding variable.

  • Panel(1,3) ozone suggests a broken-stick or segmented regression relationship of ozone regressed on wind. Wikipedia - segmented regression LINK.

Conclusion: The scatterplot matrix suggests that an accurate statistical model for this data will need to take into account the interaction between wind and radiation as well as non-linearity in the relationship between ozone and wind.

Figure 3. Scatterplot Matrix

Figure 3. Scatterplot Matrix

Conditional plots or coplots

Introduction to coplots

Purpose: visualization of interaction and other non-linear effects

One of major innoviations in Cleveland’s book was the development of co-plots or conditional plots. These plots are implemented in base R graphics::coplot() and in lattice::xyplot().

Interactions

As simple example was provided by Ryan, Joiner & Ryan in their Minitab book. The data is from a randomized experiment with two factors and is summarized in Table 4. This is an extreme example since the effect of each factor by itself is zero but when acting together they are effective.

Table 4. Fraction of Larvae Killed.
poison/YES poison/NO
attractant/YES 0.846 0.204
attractant/NO 0.295 0.223

Simulated example

As a simple simulation example, a model was simulated to generate \(n=200\) data values from a statistical model relating an outcome variable to two input variables pred1 and pred2. The coplot for outcome vs. pred2 conditional on pred1 is obtained dividing pred1 into overlapping intervals with approximately an equal number of observations in each interval with a specified fraction of overlap. These overlapping intervals are called shingles by analogy with the overlapping shingles used for roof construction and the algorithm lattice::equal.count() is used to construct these shingles. For pred1, the plot below shows the shingles formed using the equal count algorithm do divide this variable into 4 shingles with a 50% overlap. The purpose of the overlap is to make the transition from one shingle to the next smoother and so improve the visualization.

Figure 4. Four Shingles with 50% Overlap

Figure 4. Four Shingles with 50% Overlap

Using the shingles shown in Figure 4, the coplot constructed using lattice::xyplot() is shown below. From Figure 4 we see that when pred1 is at the lowest level the relationship of pred2 with outcome is trending up while at the highest level of pred1, the relationship trends down with increasing pred2. Figure 4 provides strong evidence an interaction effect.

Figure 5. Lattice Coplot for Simulated Interaction Example.

Figure 5. Lattice Coplot for Simulated Interaction Example.

Comparison of Coplot and Scatterplot Matrix

The coplot and scatterplot matrix provide complementary insights into high dimensional data. The scatterplot provides information how each pair of variables depend on each other and may reveal useful information about many aspects such as outliers, lack-of-symmetry, departures from linearity and many other irregularities such as illustrated in the famous Ansombe Quartet discussed in my Mathematica Demonstration. There are many causes for non-linearity revealed in scatterplot matricies and one of these causes may be simple interaction between the predictor variables. The coplot is designed to detect interactions as well as many other possible unexpected irregularities.

Figure 6. Scatterplot Matrix With Simulated Interaction Data

Figure 6. Scatterplot Matrix With Simulated Interaction Data

Viewing across the bottom row, the scatterplots of outcome vs. pred1 and pred2, we see there appears to be a linear relationship between outcome and pred1 and a quadratic relationship for outcome vs. pred2.

  • Model1: outcome on pred1, pred2 and pred2-squared
  • Model2: outcome on pred1, pred2 and pred1 * pred2

Both models are satisfactory with respect to all the usual linear regression diagnostic checks. But Model2 provides a much better fit as shown in Table 2.

Table 2. Comparing the Two Regression Models
Model1 Model2
RSq 0.4 0.4
AIC 1536.8 1523.4

Table 3 provides a stargazer summary of the final model selected for the simulated data.

Table 3. Final regression model summary.
Dependent variable:
outcome
pred1 0.362
(0.643)
I(pred1 * pred2) -7.366***
(1.456)
pred2 23.729**
(9.722)
Constant 97.142***
(5.033)
Observations 200
R2 0.422
Adjusted R2 0.413
Residual Std. Error 10.747 (df = 196)
F Statistic 47.658*** (df = 3; 196)
Note: p<0.1; p<0.05; p<0.01

Suggested Alternative Approach Using ggplot

With some additional programming effort we could make lattice::equal.count() work with this gplot(). I looked briefly into this and decided that although it is straightforward in principle, the detailed implementation would be tedious due to the non-standard output provided by this function. So I decided to experiment with a different approach.

In fact, in many areas of application, continuous variables, such as age, are often grouped into intervals to form an ordered factor using the base R function cut(). Tidyverse provides an ehanced version of this function that is more suitable for graphics, ggplot::cut_number(). This is essentially equivalent creating zero-overlappling shingles using the equal count algorithm.

Figure 7 shows a facetted ggplot with pred2 transformed to a categorical variable. This plot shows in a similar way to the lattice coplot in Figure 5 the interaction between the two inputs.

Figure 7. Using ggplot with categorical variables

Figure 7. Using ggplot with categorical variables

Reference

Jan Vanhove provided the data for this example and discussed various other visualization techniques for interactions.

Coplot Visualization of Ozone Dataset

Introduction

For further dicussion, let

  • O: cube-root ozone
  • R: radiation
  • W: wind
  • T: temperature

Instead of one coplot there are now three possible coplots to consider:

  • {O vs R | T, W}
  • {O vs T | R, W}
  • {O vs W | R, T}

The main purpose of coplots is to detect interaction which is one of the main sources of non-linearity in linear models. An interaction occurs when two input variables act in a synergestic way to produce an effect on the output variable which is greater than the effect if each variable just acted alone.

Ozone vs Radiation Conditional on Temperature and Wind

Ozone pollution \(O_3^{1/3}\) is plotted against radiation conditional on wind and temperature. The question of interest is whether there is an interaction between radiation and either of the two other variables. We know that ozone pollution tends to increase with radiation holding the effects of wind and temperature constant.

An interaction radiation and temperature occurs if for a fixed wind speed, there is synergestic effect between radiation and temperature. It is sufficient to look across each row to see when wind is held fixed, temperature and radiation produce a non-additive effecive. This is most easily seen in the bottom two rows, since the slope and shape of the curve changes as temperature inceases.

Note that in Figure 8, there are not an equal number of points in each panel unlike the case where there is only one conditioning variable. In general, this is not unexpected.

Figure 8. Lattice Style Coplot for the Environmental Dataset

Figure 8. Lattice Style Coplot for the Environmental Dataset

Figure 9 shows the ggplot version of the coplot. Note ggplot faceting has reversed the y-axis. But this shows equally well that radiation and temperature are interacting.

Figure 9. Coplot using ggplot: O~R|T*W.

Figure 9. Coplot using ggplot: O~R|T*W.

Similarly further plots: * ozone vs temperature conditional on windradiation ozone vs wind conditional on temperature*radiation could be investigated.

Parallel coordinates plot

This is another plot useful for high-dimensional data. Some of the functions available in R for these plots include MASS::parcoord(), lattice::parallel() and GGally::ggparcoord().

This plot is based on the simple idea of plotting each observation with a separate vertical scale for each variable and joining the points with line to enhance their visualization. When the number of observations is larger than about 20 it usually becomes less useful due to overlap. But in some situations it is helpful. The dynamic versions of this plot available in GGPlot and Mondrian are often more useful than the static version.

Figure 10. Parallel Coordinates Plot Comparing Observations in the Top 10% and Bottom 10% Ozone Quantiles

Figure 10. Parallel Coordinates Plot Comparing Observations in the Top 10% and Bottom 10% Ozone Quantiles

Conclusion

Scatterplot matricies provide two-dimensional views with linking between plots when we read across the rows. Coplots provide an extension to allow improved visualization of interaction effects.

For large sample sizes, we could estimate the conditional probability density function and various packages in R are available for this. But since this is not really a data visualization method, it is not evident how well the resulting estimating actually describes the data. The coplot method provides a type of conditional scatterplot visualization that let’s the data speak.

Dynamic data analysis using software such as GGobi and Mondrian provide further insightful methods for enhancing scatterplots. By brushing with the mouse we can visualize complex relationships in linked scatterplots and 3D data projections.