Predict U.K. Stock Market

The dataset CGK contains data extracted from Coen et al. (1969, Table 1) for three variables, fti, ftci and ukcar, for each quarter from the third quarter of 1952 through the last quarter of 1967. These variables are summarized in Table 1 below.

Table 1. CGK variables, 1952.3-1967.4
Variable	Meaning
fti	Financial Times Ordinary Share Index
ftci	Financial Times Commodity Index
ukcar	U.K. Car Production (seasonally adjusted, unit=10^5)

A multi-panel time series plot in the style of The Economist magazine is shown in Figure 1.

There are 62 observations corresponding to consecutive quarters, which may be denoted by \(t=1,\ldots,62\).

Coen et al. (1969) claimed that U.K. Car Production was a leading indicator for the stock market. Box and Newbold (1971) found that the regression model used by Coen et al. (1969) was incorrect because the model did not take into account the serial dependence in the time series.

The random walk hypothesis suggests that the optimal forecast is simply the last observed value (Bachelier, 1900; see the Wikipedia entry for Louis Bachelier). This model may be written,

\[z_t = z_{t-1} + e_t, \quad (*)\]

where \(z_t\) is the FTI at time period \(t=1,\ldots,62\) and \(e_t\) is assumed to be independently distributed with mean 0 and constant variance.

According to the random walk model the first differences should be white noise. This is supported by the autocorrelation plot in Figure 2.
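
The check behind an autocorrelation plot such as Figure 2 can be sketched in a few lines. This is only an illustration in plain Python, not the code used to produce the figure: the series below is a simulated random walk standing in for the FTI (the seed, starting level and innovation standard deviation are arbitrary assumptions), and the helper names `first_differences` and `sample_acf` are hypothetical.

```python
import math
import random

def first_differences(z):
    """Return z[t] - z[t-1] for t = 1, ..., len(z) - 1."""
    return [z[t] - z[t - 1] for t in range(1, len(z))]

def sample_acf(series, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

# Simulated stand-in for the FTI series (NOT the CGK data): a random walk
# of length 62 with arbitrary starting level and innovation scale.
rng = random.Random(1)
z = [300.0]
for _ in range(61):
    z.append(z[-1] + rng.gauss(0.0, 15.0))

d = first_differences(z)
acf = sample_acf(d, max_lag=10)
limit = 1.96 / math.sqrt(len(d))  # approximate 95% benchmark under white noise

for k, r in enumerate(acf, start=1):
    flag = "outside" if abs(r) > limit else "inside"
    print(f"lag {k:2d}: r = {r:+.3f} ({flag} +/-{limit:.3f})")
```

If the first differences behave like white noise, most of the sample autocorrelations should fall inside the benchmark limits, although roughly one in twenty may fall outside by chance.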

As a prediction exercise, let’s take the training data to be from 1952.3 to 1966.4 and forecast the data for each quarter in 1967. For simplicity, the RMSE (root-mean-square error) loss function may be used. Two types of forecast may be considered. The simplest is to forecast at the fixed origin 1966.4 for lead times of 1, 2, 3 and 4 quarters. In this case we can only use the data from 1952.3 to 1966.4 for each forecast. The one-step-ahead forecast uses a moving forecast origin at times 1966.4, 1967.1, 1967.2 and 1967.3.

The optimal one-step-ahead forecasts for the random walk model are shown in Table 2 below.

Table 2. Random Walk Forecasts
	observed	forecast	RMSE
1967/1	318.5	299.9	18.6
1967/2	343.1	318.5	24.6
1967/3	360.8	343.1	17.7
1967/4	397.8	360.8	37.0
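
The two forecasting schemes can be reproduced directly from the numbers in Table 2. In the sketch below, 299.9 is taken to be the 1966.4 value, since under the random walk it is the forecast for 1967/1. Pooling the four quarters into a single RMSE is my own summary rather than something reported in the table, where the RMSE column for a single forecast is just the absolute error.

```python
import math

# Values read off Table 2: the last training observation (1966.4) and the
# four observed 1967 quarters.
last_train = 299.9
observed = [318.5, 343.1, 360.8, 397.8]

# Fixed origin 1966.4: every lead time gets the last training value.
fixed_origin = [last_train] * len(observed)

# Moving origin (one-step ahead): each forecast is the previous observation.
one_step = [last_train] + observed[:-1]

def rmse(actual, predicted):
    """Root-mean-square error over a set of forecasts."""
    sq = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

abs_errors = [round(abs(a - p), 1) for a, p in zip(observed, one_step)]
print("one-step forecasts:", one_step)
print("absolute errors:   ", abs_errors)
print("one-step RMSE:     ", round(rmse(observed, one_step), 2))
print("fixed-origin RMSE: ", round(rmse(observed, fixed_origin), 2))
```

The absolute errors agree with the last column of Table 2. The fixed-origin forecasts do worse here because the index rose steadily through 1967.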

Let’s compare with lagged linear regression using the previous FT index value and car production as inputs. As a multiple linear regression we may consider the model,

\[z_t = \beta_0 + \beta_1 z_{t-1} + \beta_2 x_{t-1} + e_t,\]

where \(z_{t-1}\) is a lagged dependent variable corresponding to the FTI in the previous quarter and \(x_{t-1}\) is a lagged independent variable corresponding to the U.K. car production in the previous quarter. Setting \(\beta_0 = 0\), \(\beta_1 = 1\) and \(\beta_2 = 0\), we see that this regression is an enlarged version of the random walk model. That is, the random walk model is nested in the regression model.

Note a slightly tricky point: we cannot use current car production to predict the current FT index, since this predictor variable is not yet available when the forecast is made. In data science this phenomenon can arise in various more subtle ways and is sometimes called data leakage (see Doing Data Science, O’Neil & Schutt, §13).
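
One way to guard against this kind of leakage is to build the design matrix so that the row for time \(t\) only ever contains values dated \(t-1\). The sketch below is a minimal plain-Python illustration, not the code behind Table 3: the helper names (`lagged_design`, `solve`, `ols`) are hypothetical, and the toy series is made up so that the true coefficients are known.

```python
def lagged_design(z, x):
    """Pair each z[t] with predictors known at time t-1 only.

    The row for time t is (1, z[t-1], x[t-1]); the contemporaneous x[t] is
    never used, which keeps future information out of the predictors.
    """
    rows = [[1.0, z[t - 1], x[t - 1]] for t in range(1, len(z))]
    targets = [z[t] for t in range(1, len(z))]
    return rows, targets

def solve(a, b):
    """Solve a linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, targets):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve(xtx, xty)

# Toy series (illustrative only, NOT the CGK data), generated exactly from
# z_t = 2 + 0.5 z_{t-1} + 3 x_{t-1}, so the fit should recover (2, 0.5, 3).
x = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0, 2.5, 4.5]
z = [10.0]
for t in range(1, len(x)):
    z.append(2.0 + 0.5 * z[t - 1] + 3.0 * x[t - 1])

rows, targets = lagged_design(z, x)
beta = ols(rows, targets)
print([round(b, 6) for b in beta])
```

Note that one observation is lost to the lag, which is consistent with Table 3 reporting 57 observations for the 58 training quarters.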

The linear regression fit is shown in Table 3.


	Dependent variable: fti
ftiL1	0.895^*** (0.059)
ukcarL1	5.252 (4.499)
Constant	13.982^* (7.241)
Observations	57
R²	0.956
Adjusted R²	0.954
Residual Std. Error	16.898 (df = 54)
F Statistic	584.769^*** (df = 2; 54)
Note:	Standard errors in parentheses. ^*p<0.1; ^**p<0.05; ^***p<0.01

Table 3. Linear Regression with Lagged Variables

Table 4 gives the linear regression forecasts. There are only slight differences compared with the random walk forecasts in Table 2.

Table 4. Regression with Leading Indicator Forecasts
	observed	forecast	RMSE
1967/1	318.5	300.1	18.4
1967/2	343.1	318.5	24.6
1967/3	360.8	341.9	18.9
1967/4	397.8	356.8	41.0
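
To put a single number on the comparison, the forecasts in Tables 2 and 4 can be pooled over the four quarters of 1967. This pooled RMSE is my own summary of the tabulated values, not a figure reported in either table, and with only four forecasts it should be read as a rough indication.

```python
import math

observed = [318.5, 343.1, 360.8, 397.8]     # 1967/1 to 1967/4
random_walk = [299.9, 318.5, 343.1, 360.8]  # forecasts from Table 2
regression = [300.1, 318.5, 341.9, 356.8]   # forecasts from Table 4

def rmse(actual, predicted):
    """Root-mean-square error pooled over all forecasts."""
    sq = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

rw_rmse = rmse(observed, random_walk)
reg_rmse = rmse(observed, regression)
print(f"random walk RMSE: {rw_rmse:.2f}")
print(f"regression RMSE:  {reg_rmse:.2f}")
```

On these four quarters the random walk has the slightly smaller pooled error, so the extra predictor has not bought any forecasting accuracy.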

Regressions with both a lagged dependent variable and an independent variable are widely used in econometrics, but I don’t recommend these models in general. The main difficulty is the use of a lagged dependent variable. This means that the usual least squares assumptions, as given, for example, in the Gauss-Markov Theorem, are not satisfied. In particular, the assumption that the error term is statistically independent of the input variables is not satisfied, since one of the input variables is a lagged version of the output. As a result, the standard statistical inferences for regression are not valid, so we cannot rely on the significance levels reported in Table 3, for example to judge whether lagged U.K. car production is statistically significant at the 5% level. The approximate benchmark limits given for the residual autocorrelation plot shown in Figure 3 are also not correct. By not correct we mean not even asymptotically correct!

The lagged dependent regression can be reformulated in a slightly different way so that the statistical inferences are at least either exactly correct or, in the more general case, asymptotically correct. This model is often called regression with autocorrelated error, and it is fully supported in R by a function that accepts the matrix of input variables as an argument. For the FTI example we may consider the model

\[z_t = \beta_0 + \beta_1 x_{t-1} + n_t,\]

where \(n_t\) is autocorrelated noise that satisfies the random walk equation

\[n_t = n_{t-1} + e_t,\]

where \(e_t\) is strong white noise (i.e., IID with mean zero and constant variance). It is more convenient to rewrite this equation as \(\nabla n_t = e_t\), where \(\nabla\) is the first difference operator. Now the regression model may be written,

\[\nabla z_t = \beta_1 \nabla x_{t-1} + e_t. \quad\quad (**)\]
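
To see where eqn. (**) comes from, subtract the model equation at time \(t-1\) from the model equation at time \(t\). The intercept \(\beta_0\) cancels and the noise term becomes \(\nabla n_t = e_t\):

\[\nabla z_t = z_t - z_{t-1} = \beta_1 (x_{t-1} - x_{t-2}) + (n_t - n_{t-1}) = \beta_1 \nabla x_{t-1} + e_t.\]

This is why the differenced regression has no constant term.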

This model looks very similar in some ways to the previous regression model with the lagged dependent variable, but it satisfies the Gauss-Markov assumptions and, if we assume the error term is normally distributed white noise, the statistical inferences are exactly correct. We see that \(\beta_1\) is not significantly different from zero at the 10% level. Note that when \(\beta_1 = 0\), eqn. (**) reduces to the random walk model.
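
The fit of eqn. (**) is a regression through the origin on differenced series, which can be sketched as follows. The helper names (`diff`, `fit_through_origin`, `differenced_regression`) are hypothetical and this is not the code that produced the reported fit. The last few lines recompute the t statistic from the estimate and standard error reported for dukcarLag1 (-3.187 and 7.351), using a normal approximation to the t distribution on 56 degrees of freedom for the p-value, which is an assumption made to keep the example within the standard library.

```python
import math
from statistics import NormalDist

def diff(series):
    """First differences series[t] - series[t-1]."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def fit_through_origin(u, y):
    """Least squares for y = b*u + e with no intercept.

    Returns (b, se, t), where se is based on the residual variance
    with n - 1 degrees of freedom.
    """
    n = len(y)
    suu = sum(v * v for v in u)
    b = sum(v * w for v, w in zip(u, y)) / suu
    rss = sum((w - b * v) ** 2 for v, w in zip(u, y))
    se = math.sqrt(rss / (n - 1) / suu)
    t = b / se if se > 0 else math.inf
    return b, se, t

def differenced_regression(z, x):
    """Regress diff(z) at time t on diff(x) at time t-1, as in eqn. (**)."""
    dz, dx = diff(z), diff(x)
    # dz[i] is the difference ending at time i+1, so it pairs with dx[i-1].
    return fit_through_origin(dx[:-1], dz[1:])

# Check of the reported fit: estimate and standard error for dukcarLag1.
estimate, std_error = -3.187, 7.351
t_stat = estimate / std_error
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))  # normal approximation
print(f"t = {t_stat:.3f}, t^2 = {t_stat ** 2:.3f}, approx. p = {p_approx:.2f}")
```

The squared t statistic agrees with the reported F statistic of 0.188, and the approximate p-value is far above 0.10, in line with the conclusion that \(\beta_1\) is not significantly different from zero.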


	Dependent variable: dfti
dukcarLag1	-3.187 (7.351)
Observations	57
R²	0.003
Adjusted R²	-0.014
Residual Std. Error	17.631 (df = 56)
F Statistic	0.188 (df = 1; 56)
Note:	Standard errors in parentheses. ^*p<0.1; ^**p<0.05; ^***p<0.01

Table 5. Regression with Autocorrelated Error

In order to check that the inferences in Table 5 are correct, we need to check that the model assumptions are satisfied. The standard regression model diagnostic checks are shown in Figure 4 below.

With time series regression, the most important diagnostic check is for residual autocorrelation. In practice, unless the time series is very short, perhaps fewer than 20 observations, it suffices to examine an autocorrelation plot of the residuals.

References

Box, G. E. P. and Newbold, P. (1971). Some Comments on a Paper of Coen, Gomme and Kendall. Journal of the Royal Statistical Society A, 134/2, 229-240. http://www.jstor.org/stable/2343873. doi:10.2307/2343873

Coen, P. J., Gomme, E. D. and Kendall, M. G. (1969). Lagged Relationships in Economic Forecasting. Journal of the Royal Statistical Society A, 132/2, 133-163. http://www.jstor.org/stable/2343782. doi:10.2307/2343782

O’Neil, C. and Schutt, R. (2014). Doing Data Science. O’Reilly. https://books.google.ca/books?id=puj_mAEACAAJ

Assignment 1B (due January 30)

Provide another notebook giving a similar analysis to that given in this notebook for the FT commodities index (fourth column in the dataframe uk).

Ian McLeod

January 16, 2017