Lead Pollution Dataset

The leadpol dataset is stored in a text file formatted as follows:

#Source: leadpol.txt
#            lead = lead content in tree bark
#            traffic = traffic volume/day
              lead    traffic
              227         8.3
              312         8.3
              362        12.1
              521        12.1
              640        17  
              539        17  
              728        17  
              945        24.3
              738        24.3
              759        24.3
             1263        33.6

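Read the data into R using read.table(). The argument skip=3 skips the three comment lines at the top of the file, and header=TRUE takes the variable names from the next line.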

leadpol <- read.table("http://www.stats.uwo.ca/faculty/aim/2017/3859/data/leadpol.txt", 
                      skip=3, header=TRUE)
leadpol

##    lead traffic
## 1   227     8.3
## 2   312     8.3
## 3   362    12.1
## 4   521    12.1
## 5   640    17.0
## 6   539    17.0
## 7   728    17.0
## 8   945    24.3
## 9   738    24.3
## 10  759    24.3
## 11 1263    33.6
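If the URL is not reachable, the same read.table() call can be tried on a local copy of the text. The sketch below is an illustration only: the string `txt` and the name `leadpol_local` are hypothetical, and the contents were retyped from the listing above on the assumption that it reproduces the file exactly.

```r
# Hypothetical local copy of the file contents, retyped from the listing above
txt <- "#Source: leadpol.txt
#            lead = lead content in tree bark
#            traffic = traffic volume/day
              lead    traffic
              227         8.3
              312         8.3
              362        12.1
              521        12.1
              640        17
              539        17
              728        17
              945        24.3
              738        24.3
              759        24.3
             1263        33.6"

# Same arguments as the call above, but reading from the string instead of a URL
leadpol_local <- read.table(text = txt, skip = 3, header = TRUE)

# Quick structural check: 11 rows, two numeric columns named lead and traffic
str(leadpol_local)
```

Using str() after any import is a cheap way to confirm that the header line was picked up and that both columns were read as numbers.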

A simple scatterplot suggests a linear relationship.

with(leadpol, plot(traffic, lead))

Fit a simple linear regression and print a brief summary.

ans <- lm(lead ~ traffic, data=leadpol)
ans

## 
## Call:
## lm(formula = lead ~ traffic, data = leadpol)
## 
## Coefficients:
## (Intercept)      traffic  
##      -12.84        36.18
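To see where these two coefficients come from, the least squares estimates can be computed directly from the usual formulas, slope = Sxy/Sxx and intercept = ybar - slope * xbar. This is a minimal sketch rather than what lm() does internally (lm() uses a QR decomposition); the data frame is re-entered by hand from the listing above, and the names `Sxx`, `Sxy`, `b0`, `b1` and `by_hand` are introduced here for illustration.

```r
# Data re-entered by hand from the listing above (assumed to match leadpol.txt)
leadpol <- data.frame(
  lead    = c(227, 312, 362, 521, 640, 539, 728, 945, 738, 759, 1263),
  traffic = c(8.3, 8.3, 12.1, 12.1, 17, 17, 17, 24.3, 24.3, 24.3, 33.6)
)

x <- leadpol$traffic
y <- leadpol$lead

# Corrected sum of squares of x and corrected cross-product of x and y
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

# Least squares slope and intercept
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)

by_hand <- c(intercept = b0, slope = b1)
print(round(by_hand, 2))

# Compare with lm(); all.equal() allows for floating point rounding
ans <- lm(lead ~ traffic, data = leadpol)
print(all.equal(unname(by_hand), unname(coef(ans))))
```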

The output of summary() gives more detail. The slope for traffic is highly significant (p = 4.24e-06), while the intercept is not significantly different from zero. We see that \(R^2 = 0.914\), so the regression explains 91.4% of the variation in lead content. The model may be useful, but only if its assumptions hold. Diagnostic checking is therefore an essential part of statistical model construction: if the assumptions are empirically false, conclusions drawn from the fitted model may be wrong.

summary(ans)

## 
## Call:
## lm(formula = lead ~ traffic, data = leadpol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128.43  -63.13   24.52   69.32  125.72 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -12.842     72.143  -0.178    0.863    
## traffic       36.184      3.693   9.798 4.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 92.19 on 9 degrees of freedom
## Multiple R-squared:  0.9143, Adjusted R-squared:  0.9048 
## F-statistic: 96.01 on 1 and 9 DF,  p-value: 4.239e-06
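As a check on how the summary figures are defined, \(R^2\) and the residual standard error can be recomputed from the residuals. The sketch below assumes the data frame re-entered by hand from the listing above; the names `rss`, `tss`, `r2`, `sigma_hat` and `r_xy` are introduced here for illustration.

```r
# Data re-entered by hand from the listing above (assumed to match leadpol.txt)
leadpol <- data.frame(
  lead    = c(227, 312, 362, 521, 640, 539, 728, 945, 738, 759, 1263),
  traffic = c(8.3, 8.3, 12.1, 12.1, 17, 17, 17, 24.3, 24.3, 24.3, 33.6)
)
ans <- lm(lead ~ traffic, data = leadpol)

# Residual and total sums of squares
rss <- sum(resid(ans)^2)
tss <- sum((leadpol$lead - mean(leadpol$lead))^2)

# R-squared: proportion of the total variation explained by the regression
r2 <- 1 - rss / tss

# Residual standard error: sqrt(RSS / (n - 2)) for simple linear regression
sigma_hat <- sqrt(rss / df.residual(ans))

# In simple linear regression, R-squared equals the squared sample correlation
r_xy <- cor(leadpol$traffic, leadpol$lead)

print(c(R2 = r2, cor_squared = r_xy^2, sigma = sigma_hat))
```

The identity between \(R^2\) and the squared correlation holds only with a single predictor and an intercept; with several predictors \(R^2\) is the squared correlation between the response and the fitted values.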

For simple linear regression, a basic diagnostic plot that overlays the fitted line on the data is useful. We look for systematic departures from the fit, including outliers, bias, and heteroscedasticity (non-constant variance). The plot below looks reasonable, so in this simple situation we may conclude that the regression model appears to be valid.

with(leadpol, plot(traffic, lead, pch=19, cex=1.5, col="blue"))
abline(reg=ans, col="magenta")
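The same departures can also be looked for in the residuals themselves. The following is a hedged sketch of two common supplementary plots, residuals against fitted values and a normal quantile plot; it is an addition for illustration and not part of the original analysis. The data frame is re-entered by hand from the listing above, and the names `res` and `fit` are introduced here.

```r
# Data re-entered by hand from the listing above (assumed to match leadpol.txt)
leadpol <- data.frame(
  lead    = c(227, 312, 362, 521, 640, 539, 728, 945, 738, 759, 1263),
  traffic = c(8.3, 8.3, 12.1, 12.1, 17, 17, 17, 24.3, 24.3, 24.3, 33.6)
)
ans <- lm(lead ~ traffic, data = leadpol)

res <- resid(ans)   # observed minus fitted lead content
fit <- fitted(ans)  # fitted lead content

# Residuals vs fitted values: curvature suggests bias,
# a funnel shape suggests non-constant variance
plot(fit, res, pch = 19, col = "blue",
     xlab = "Fitted lead content", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal quantile plot: strong departures from the line suggest non-normal errors
qqnorm(res, pch = 19)
qqline(res, col = "magenta")
```

With only 11 observations these plots are rough guides at best, and apparent patterns should be interpreted cautiously.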

A. I. McLeod

September 11, 2017