Additional work

Timeseries in R from RStudio Conference 2020

Autoregessive Models

Autoregressive models are stochastic models that assume a linear relationship of data based on its history. The idea is that there is information in the historic data, hidden patterns that can be understood within the patterns. Autoregressive models are generally explained as AR(p) models, where p is the number of historic periods used. Autoregessive models are just specialized models of standard linear models: \(y = \beta x + c + \epsilon\). We take \(y\) as the period we are estimating, and \(x\) as the prior period. The base concept of the model is \(x_{t} = \beta x_{t-1} + c + \epsilon\).

\(x_{t}\): Estimation for period

\(x_{t-1}\): Value of the prior period

\(\beta\): Weight for the prior period

\(c\): Regression constant / intercept

\(\epsilon\): Error

Increasing the number of periods would simple add \(x_{t-n}\) sequentially. This can be expanded to \(p\) periods: \(x_{t} = \beta_{1} x_{t-1} + \beta_{2} x_{t-2}+ ... + \beta_{n} x_{t-n} +c + \epsilon\). We’ll explore this below in R.

We’ll us the following packages:

library(quantmod)
library(lubridate)
library(tidyverse)
library(plotly)

We’ll walk through this example with US GDP, quarterly reporting, starting with a simple linear model. We will follow that up with a more complicated linear model and then some AR() models. First we will pull the GDP data using the quantmod package to pull the GDP from FRED, then clean the time series data and create a tibble with date and GDP. We will not add handling for the impact from the Covid-19 pandemic at this time.

quantmod::getSymbols(Symbols = "GDP",src="FRED")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
# Create time series in data format and not xts object
# This can also be used for forecasting by creating extra forecast periods 
Periods = length(GDP)
DateStart = date(GDP[1])
DateEnd = date(GDP[nrow(GDP)])
DateSeries = seq(DateStart, by = "quarter", length.out = Periods)

GDPn = as.numeric(GDP$GDP)

GDPData = tibble(DateSeries, GDPn)

Now let’s look at a basic regression model. We can see that the basic estimation of quality: r\(^2\) is 0.8771. So it would appear to be decent. But once we plot the estimation, we can see its not very representative of the data. If we look at the root mean square error (RMSE), we can baseline our models against 38029.37

lr = lm(GDPData$GDPn ~ seq(1:nrow(GDPData)))

# Model Summary
summary(lr)
## 
## Call:
## lm(formula = GDPData$GDPn ~ seq(1:nrow(GDPData)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2636.2 -2176.7  -580.5  1657.9  5306.0 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4064.88     261.14  -15.57   <2e-16 ***
## seq(1:nrow(GDPData))    70.16       1.54   45.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2229 on 291 degrees of freedom
## Multiple R-squared:  0.8771, Adjusted R-squared:  0.8767 
## F-statistic:  2076 on 1 and 291 DF,  p-value: < 2.2e-16
# Plot
plot(x = GDPData$DateSeries, y = GDPData$GDPn)
lines(x = GDPData$DateSeries, y = predict(lr), col = "blue")

# RMSE
sqrt(sum((GDPData$GDPn - predict(lr))^2))
## [1] 38029.37

For the next model, we’ll look at a linear regression using a natural log transformation of the data. This simple transformation captures the shape of GDP much better, but misses events like the great recession and is still too optimistic about the growth rate. The RMSE jumps significantly to 152230.6 (400%!), largely driven by the delta in the most recent periods. This model would could fit better with a more standard exponent.

lnlr = lm(log(GDPData$GDPn) ~ seq(1:nrow(GDPData)))

# Model Summary
summary(lnlr)
## 
## Call:
## lm(formula = log(GDPData$GDPn) ~ seq(1:nrow(GDPData)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41016 -0.13198 -0.02026  0.16707  0.28356 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.5362665  0.0207686   266.6   <2e-16 ***
## seq(1:nrow(GDPData)) 0.0165582  0.0001225   135.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1773 on 291 degrees of freedom
## Multiple R-squared:  0.9843, Adjusted R-squared:  0.9843 
## F-statistic: 1.828e+04 on 1 and 291 DF,  p-value: < 2.2e-16
# Plot
plot(x = GDPData$DateSeries, y = GDPData$GDPn)
lines(x = GDPData$DateSeries, y = exp(predict(lnlr)), col = "blue")

# RMSE
sqrt(sum((GDPData$GDPn - predict(lnlr))^2))
## [1] 152230.6

Now, the benefit of an AR model is that you are always starting from the prior period, so generally the model can’t get too far afield, but it also loses predictive power quickly. After the number of prior periods you are using it tends towards a trend driven by the weights of the prior periods. One additional thing to note: do not use a single column to make models in R. Because of the way that the predict() function works, R will not use the the same column for dependent and independent variables. This model - AR(1) (We will use AR(p) for the number of periods being used in the model) has a vastly better fit. The estimation line is hidden by the dots and the RMSE has been reduced to 1024.54 (2.69% the linear regression’s RMSE).

# DO NOT MAKE THE MODEL IN THIS FASHION
AR_model = lm(GDPData$GDPn[2:nrow(GDPData)] ~ GDPData$GDPn[1:(nrow(GDPData)-1)])

# Proper method
# Add lag, so that we have row 1 with an NA, since we do not know the prior period, and row 2 will have the first periods data in the lag column
GDPDataAR = GDPData
GDPDataAR$GDPLag = c(NA, GDPData$GDPn[1:(nrow(GDPData)-1)])

# Remove NA Row - alt: use na.omit, but we always know it will be the first row, and if there are other NAs we want the warning
GDPDataAR = GDPDataAR[2:nrow(GDPDataAR),]  

AR_model = lm(GDPDataAR$GDPn ~ GDPDataAR$GDPLag)

summary(AR_model)
## 
## Call:
## lm(formula = GDPDataAR$GDPn ~ GDPDataAR$GDPLag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -422.19  -17.46   -4.41   23.09  162.48 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      2.013e+01  4.945e+00    4.071 6.05e-05 ***
## GDPDataAR$GDPLag 1.009e+00  5.603e-04 1800.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.16 on 290 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 3.24e+06 on 1 and 290 DF,  p-value: < 2.2e-16
# Plot
plot(x = GDPDataAR$DateSeries, y = GDPDataAR$GDPn)
lines(x = GDPDataAR$DateSeries, y = predict(AR_model), col = "blue")

# RMSE
sqrt(sum((GDPDataAR$GDPn - predict(AR_model))^2))
## [1] 1024.544

For comparison, let’s try an AR(3) model to see how it compares. We can first notice that the significance has decreased on the 2nd and 3rd lags; they still have a p-value < .05, but are not the *** p-values of the prior models. The fit on the estimation is still very tight, and the RMSE has been reduced to 840.23.

# Add lags
GDPDataAR3 = GDPData
GDPDataAR3$GDPLag1 = c(rep(NA,1), GDPData$GDPn[1:(nrow(GDPData)-1)])
GDPDataAR3$GDPLag2 = c(rep(NA,2), GDPData$GDPn[1:(nrow(GDPData)-2)])
GDPDataAR3$GDPLag3 = c(rep(NA,3), GDPData$GDPn[1:(nrow(GDPData)-3)])

# Remove NA Row - alt: use na.omit, but we always know it will be the first row, and if there are other NAs we want the warning
GDPDataAR3 = GDPDataAR3[4:nrow(GDPDataAR),]  

AR3_model = lm(GDPDataAR3$GDPn ~ GDPDataAR3$GDPLag1 + GDPDataAR3$GDPLag2 +GDPDataAR3$GDPLag3)

summary(AR3_model)
## 
## Call:
## lm(formula = GDPDataAR3$GDPn ~ GDPDataAR3$GDPLag1 + GDPDataAR3$GDPLag2 + 
##     GDPDataAR3$GDPLag3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -382.49  -12.78   -4.47   18.00  203.12 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         9.30775    4.26586   2.182   0.0299 *  
## GDPDataAR3$GDPLag1  1.38072    0.05862  23.553   <2e-16 ***
## GDPDataAR3$GDPLag2 -0.22967    0.09975  -2.302   0.0220 *  
## GDPDataAR3$GDPLag3 -0.14675    0.05915  -2.481   0.0137 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.77 on 285 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.537e+06 on 3 and 285 DF,  p-value: < 2.2e-16
# Plot
plot(x = GDPDataAR3$DateSeries, y = GDPDataAR3$GDPn)
lines(x = GDPDataAR3$DateSeries, y = predict(AR3_model), col = "blue")

# RMSE
sqrt(sum((GDPDataAR3$GDPn - predict(AR3_model))^2))
## [1] 840.2311

First Difference Models

First Difference models are a another why to add value to time series analysis. FD models are the same design as the standard model: \(y = \beta x + c + \epsilon\), but now the \(y\) is the difference between \(x_{t}\) and \(x_{t-1}\) or \(\Delta y\). This changes the idea from predicting the next period and predicting the change from this period to the next. The value add comes from showing truer correlation in dependent and independent variables by focusing on the correlation of change. For our 1st FD model we’ll look at GDP with a sequential series. The first plot will look at the estimated delta vs actual delta, and the second will look at the estimated GDP. The r\(^2\) for FD models is usually awful, but we can see that the RMSE is 997.63, a vast improvement over the standard linear model.

# We are going to keep the NA to use the first period for the estimation
GDPDataFD = GDPData
GDPDataFD$FD = GDPDataFD$GDPn - dplyr::lag(GDPDataFD$GDPn, n = 1)

# FD Model
FD_model = lm(GDPDataFD$FD[2:nrow(GDPDataFD)] ~ seq(1:(nrow(GDPDataFD)-1)))

# Model Summary
summary(FD_model)
## 
## Call:
## lm(formula = GDPDataFD$FD[2:nrow(GDPDataFD)] ~ seq(1:(nrow(GDPDataFD) - 
##     1)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -414.46  -16.74   -0.88   21.04  183.29 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -23.10887    6.87426  -3.362 0.000879 ***
## seq(1:(nrow(GDPDataFD) - 1))   0.65558    0.04067  16.119  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.58 on 290 degrees of freedom
## Multiple R-squared:  0.4726, Adjusted R-squared:  0.4707 
## F-statistic: 259.8 on 1 and 290 DF,  p-value: < 2.2e-16
# Plot
plot(x = GDPDataFD$DateSeries[2:nrow(GDPDataFD)], y = GDPDataFD$FD[2:nrow(GDPDataFD)])
lines(x = GDPDataFD$DateSeries[2:nrow(GDPDataFD)], y = predict(FD_model), col = "blue")

# Convert FD to Y
GDPDataFD$FDEst = c(NA,predict(FD_model))
GDPDataFD$GDPEst = dplyr::lag(GDPDataFD$GDPn, n = 1) + GDPDataFD$FDEst

# Plot
plot(x = GDPDataFD$DateSeries, y = GDPDataFD$GDPn)
lines(x = GDPDataFD$DateSeries, y = GDPDataFD$GDPEst, col = "blue")

# RMSE
sqrt(sum((GDPDataFD$GDPn[2:nrow(GDPDataFD)] - GDPDataFD$GDPEst[2:nrow(GDPDataFD)])^2))
## [1] 997.6304

AR & FD (ARI) Model

Lets combine both of these to see how they perform. This model over estimates early data to the high side, and creates are bigger gap against early data, with an RMSE of 2093.64

# We are going to keep the NA to use the first period for the estimation
GDPDataARI = GDPData
GDPDataARI$FD = GDPDataARI$GDPn - dplyr::lag(GDPDataARI$GDPn, n = 1)
GDPDataARI$FDLag = dplyr::lag(GDPDataARI$FD, n = 1)

# FD Model
ARI_model = lm(GDPDataARI$FD[3:nrow(GDPDataFD)] ~ GDPDataARI$FDLag[3:nrow(GDPDataFD)])

# Model Summary
summary(ARI_model)
## 
## Call:
## lm(formula = GDPDataARI$FD[3:nrow(GDPDataFD)] ~ GDPDataARI$FDLag[3:nrow(GDPDataFD)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -342.07  -22.19  -11.54   17.81  292.12 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         21.13622    4.66909   4.527 8.76e-06 ***
## GDPDataARI$FDLag[3:nrow(GDPDataFD)]  0.70479    0.04317  16.327  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.2 on 289 degrees of freedom
## Multiple R-squared:  0.4798, Adjusted R-squared:  0.478 
## F-statistic: 266.6 on 1 and 289 DF,  p-value: < 2.2e-16
# Plot
plot(x = GDPDataARI$DateSeries[3:nrow(GDPDataFD)], y = GDPDataARI$FD[3:nrow(GDPDataFD)])
lines(x = GDPDataARI$DateSeries[3:nrow(GDPDataFD)], y = predict(ARI_model), col = "blue")

# Convert ARI to FD then to Y
GDPDataARI$ARIEst = c(rep(NA,2),predict(ARI_model))
GDPDataARI$FDEst = dplyr::lag(GDPDataARI$FD, n = 1) + GDPDataARI$ARIEst
GDPDataARI$GDPEst = dplyr::lag(GDPDataARI$GDPn, n = 1) + GDPDataARI$FDEst

# Plot
plot(x = GDPDataARI$DateSeries, y = GDPDataARI$GDPn)
lines(x = GDPDataARI$DateSeries, y = GDPDataARI$GDPEst, col = "blue")

# RMSE
sqrt(sum((GDPDataARI$GDPn[3:nrow(GDPDataFD)] - GDPDataARI$GDPEst[3:nrow(GDPDataFD)])^2))
## [1] 2093.637

Lets take a look at all the models’ RMSE in the last 50 periods, and we see that that the AR(3) model is still the best performance, but the AR(1) and FD also do very well, but they are not as great together. Or at least not without a little more help.

## [1] "Linear Regression: 20974.050645019"
## [1] "Natural Log Linear Regression: 125023.477969126"
## [1] "Autoregressive AR(1): 860.544072070127"
## [1] "Autoregressive AR(3): 698.348798302764"
## [1] "First Difference: 852.829978038067"
## [1] "FD AR(1): 1587.18117220336"