A time series is a sequence of values of a variable recorded at regular intervals of time. Time series appear in many applications, such as stock market prediction (with daily or hourly data), annual sales/revenue, seasonal temperature changes, monthly cloud infrastructure cost, and even second-level (per-second) predictions.
Python has two libraries, StatsModels and Pyramid (pmdarima), that help build forecasting models and predict values at future time steps. Both libraries provide forecasts for a similar family of models, namely:
AR (Auto-Regressive)
MA (Moving Average)
ARIMA (Auto-Regressive Integrated Moving Average)
SARIMA (Seasonal Auto-Regressive Integrated Moving Average)
so it is interesting to see the features provided by each library, along with the pros and cons of using each of them.
In this blog, I summarise the functionality of both libraries by forecasting the number of active Covid-19 cases for an Indian state. We keep our scope limited to univariate time series analysis.
The data spans 8th March to 1st May, where 8th March to 14th April is used for training and 15th April to 1st May is used for validation and prediction. We demonstrate with the state of Maharashtra; however, data and models for any state can be built from the processed data at https://github.com/sharmi1206/covid-19-analysis.
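As a minimal sketch of how that split might look (the file name, the State column and the date filtering are illustrative assumptions; the repository above contains the actual processed data):
import pandas as pd

# illustrative: load the processed per-state data and keep Maharashtra
df_state = pd.read_csv("complete.csv", parse_dates=["Date"])
df_state = df_state[df_state["State"] == "Maharashtra"].sort_values("Date")
# train: 8th March - 14th April, validation: 15th April - 1st May
train_ml = df_state[(df_state["Date"] >= "2020-03-08") & (df_state["Date"] <= "2020-04-14")]
valid_ml = df_state[(df_state["Date"] >= "2020-04-15") & (df_state["Date"] <= "2020-05-01")]
train_dates, valid_dates = train_ml["Date"].values, valid_ml["Date"].values
trainActiveCases = train_ml["Active Cases"].values
validActiveCases = valid_ml["Active Cases"].values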
The StatsModels library primarily comprises three main modules:
tsa – functions for univariate time series analysis: autoregressive models (AR), vector autoregressive models (VAR), and univariate autoregressive moving average models (ARMA), along with non-linear models such as dynamic regression and autoregression. It also provides the autocorrelation and partial autocorrelation functions, the periodogram, other statistical properties related to ARMA processes, and methods that work with moving-average lag polynomials.
statespace – equipped with functionality to model observations at any time instant with unobserved state vectors and irregular components. It adds flexibility to handle estimation with missing observations, forecasting, impulse response functions, etc. This flexibility differentiates it from tsa: it can estimate additive and multiplicative seasonal effects (on both univariate and multivariate models) as well as arbitrary trend polynomials.
vector_ar – allows simultaneous modeling and analysis of multiple time series.
An ARIMA model is composed of three constituent parts:
AR: Autoregression. This part models the dependent relationship between an observation and some number of lagged observations.
I: Integrated. This part aims to make the time series stationary by differencing, i.e. subtracting from each observation the observation at the previous time step of the same series.
MA: Moving Average. This part models the relationship between an observation and the residual errors of a moving average model applied to lagged observations, over a given time window.
To design a model (say ARIMA) that applies all of the above procedures, we enable all of the parameters. We can also enable only some of them and choose a pure AR or MA model instead. The parameters can be summarized as follows:
p: The number of lag observations included in the model, also called the lag order.
d: The number of times that the raw observations are differenced, also called the degree of differencing (a stationarity check for choosing d is sketched after this list).
q: The size or the order of the moving average window.
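As a quick way to sanity-check d, a unit-root test such as the Augmented Dickey-Fuller test can be applied to the raw and differenced series; a minimal sketch, using the trainActiveCases series prepared above:
import pandas as pd
from statsmodels.tsa.stattools import adfuller

series = pd.Series(trainActiveCases)
for d, s in [(0, series), (1, series.diff().dropna()), (2, series.diff().diff().dropna())]:
    # a small p-value suggests the series at this level of differencing is stationary
    pvalue = adfuller(s)[1]
    print("d =", d, "ADF p-value =", round(pvalue, 4))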
To start with, the imports for the different flavours of the ARIMA model, along with the supporting packages used in the snippets below, are:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
The following diagram illustrates the mechanism of using AR/MA/ARIMA/SARIMA by choosing the model orders (p, d, q) ourselves and tuning them.
AR Model
This Auto-Regressive model estimates the dependent variable Yt at any time instant from some number of its own lagged observations. With a single lag, it can be formulated as

Y(t) = α + β·Y(t-1) + ε(t)

where the variables can be summarized as:
Y(t-1): Lag 1 of the time series
β: Coefficient of lag 1
α: Intercept term
ε(t): Error term at time t
Dependence of Lag in AR Estimation, where β and α are model estimates
The number of active Covid-19 cases can be modeled with an AR model as follows. Here we set p, the number of lag observations, to 2, while d and q are set to 0.
model_ar = ARIMA(trainActiveCases, order=(2, 0, 0))
model_ar_fit = model_ar.fit()
prediction_ar=model_ar_fit.forecast(len(validActiveCases))[0]
print(model_ar_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
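To compare models on a common footing later, the RMSE of the AR forecast can be recorded the same way the MA snippet below does; a small sketch (model_scores is simply a list collecting each model's RMSE):
model_scores = []  # collects the RMSE of each model fitted in this post
rmse_ar = np.sqrt(mean_squared_error(validActiveCases, prediction_ar))
model_scores.append(rmse_ar)
print("Root Mean Square Error for AR Model: ", rmse_ar)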
The summary and the predicted output obtained from the model are illustrated below:
AR Model Summary
Predicted vs Actual AR
MA Model
The MA model also estimates the dependent variable Yt, but from lagged forecast errors rather than lagged observations. With a window of size q it can be written as

Y(t) = α + ε(t) + θ1·ε(t-1) + … + θq·ε(t-q)

MA Estimation with Lagged Errors
where the error terms ε(t), ε(t-1), … are the errors introduced at the respective lags, i.e. the residuals of the corresponding one-step forecasts at times t, t-1, and so on.
The number of active Covid-19 cases can be modeled with an MA model as follows. Here we set q, the moving average window, to 2, while d and p are set to 0.
model_ma = ARIMA(trainActiveCases, order=(0, 0, 2))
model_ma_fit = model_ma.fit()
prediction_ma=model_ma_fit.forecast(len(validActiveCases))[0]
print(model_ma_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
model_scores.append(np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
print("Root Mean Square Error for MA Model: ",np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
The summary and the predicted output obtained from the model are illustrated below:
MA Model Summary
Predicted vs Actual MA
ARIMA Model
An ARIMA model combines the flavours of both the AR and MA models, with the time series differenced at least once to make it stationary. Writing Y'(t) for the differenced series, it can be formulated as

Y'(t) = α + β1·Y'(t-1) + … + βp·Y'(t-p) + ε(t) + θ1·ε(t-1) + … + θq·ε(t-q)

where the β terms come from the AR part and the ε error terms come from the MA part.
ARIMA – Differencing + Combination of Parameters from AR and MA model
The number of active Covid-19 cases can be modeled with an ARIMA model as follows. Here q (the moving average window), d (the order of differencing) and p (the number of lag observations) are all set to 1.
The code snippet for this is:
model_arima = ARIMA(trainActiveCases, order=(1, 1, 1))
model_arima_fit = model_arima.fit()
prediction_arima=model_arima_fit.forecast(len(validActiveCases))[0]
valid_ml["ARIMA Model Prediction"]=list(np.exp(prediction_arima))
print(model_arima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
Predicted vs Actual ARIMA
We can see that ARIMA performs better than the AR or MA model, as its forecast most closely follows the actual outcome.
SARIMAX Model
The SARIMAX model is built by extending the ARIMA model discussed in the previous section. In addition to the AR, I and MA terms, there are four seasonal elements that are not part of ARIMA and must be configured:
P: Seasonal autoregressive order.
D: Seasonal difference order.
Q: Seasonal moving average order.
m: The number of time steps for a single seasonal period.
The number of active Covid-19 cases can be modeled with the SARIMAX model as follows. Here q (the moving average window), d (the order of differencing) and p (the number of lag observations) are all set to 1. We leave the seasonal order at (0, 0, 0, 0), i.e. no seasonal terms, since Covid-19 case counts do not show any seasonality, as is evident from research and daily reports.
The code snippet for SARIMAX (without an exogenous variable) is given as:
model_sarima = SARIMAX(trainActiveCases, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0),
                       enforce_stationarity=False, enforce_invertibility=False)
model_sarima_fit = model_sarima.fit()
prediction_sarima=model_sarima_fit.forecast(np.shape(validActiveCases)[0])
valid_ml["SARIMA Model Prediction"]=list(np.exp(prediction_sarima))
print(model_sarima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
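For completeness, the same state-space model can also take an exogenous regressor; a hedged sketch, where trainTestingCounts and validTestingCounts are hypothetical series (e.g. daily tests conducted) aligned with the training and validation windows:
# hypothetical exogenous regressor passed through the exog argument
model_sarimax = SARIMAX(trainActiveCases, exog=trainTestingCounts,
                        order=(1, 1, 1), seasonal_order=(0, 0, 0, 0),
                        enforce_stationarity=False, enforce_invertibility=False)
model_sarimax_fit = model_sarimax.fit()
# future values of the exogenous variable are required when forecasting
prediction_sarimax = model_sarimax_fit.forecast(len(validActiveCases), exog=validTestingCounts)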
The model summary and the predicted outcome are illustrated in the plots below.
SARIMAX Model Summary
Predicted vs Actual SARIMA
To plot the training, validation and predicted data against the proper dates, we use the following code snippet.
index = pd.date_range(start=train_dates[0], periods=len(train_dates), freq='D')
valid_index = pd.date_range(start=valid_dates[0], periods=len(valid_dates), freq='D')
train_active = pd.Series(train_ml['Active Cases'].values, index)
valid_active = pd.Series(valid_ml['Active Cases'].values, valid_index)
# use the forecast of the model being plotted, e.g. prediction_arima or prediction_sarima
pred_active = pd.Series(prediction_arima, valid_index)
f, ax = plt.subplots(1, 1, figsize=(12, 10))
plt.plot(train_active, marker='o', color='blue', label="Train Data Set")
plt.plot(valid_active, marker='o', color='green', label="Valid Data Set")
plt.plot(pred_active, marker='o', color='red', label="Predicted ARIMA/SARIMAX")
plt.legend()
plt.xlabel("Date Time")
plt.ylabel('Active Cases')
plt.title("Active Cases ARIMA/SARIMAX Model Forecasting for state " + stateName)
plt.show()
Auto-Correlation Plots
Differencing helps make a time series stationary. To find the right order of differencing, we use the ACF plot and look for the order at which the autocorrelation drops to zero quickly and the series has a fairly constant mean. For the Covid-19 data for the state of Maharashtra, the ACF plots and the code snippet are given below.
# Original Series
fig, axes = plt.subplots(3, 2, sharex=True, figsize=(10,5))
axes[0, 0].plot(df_per_State_features['Active Cases']); axes[0, 0].set_title('Original Series')
plot_acf(df_per_State_features['Active Cases'], ax=axes[0, 1])
# 1st Differencing
axes[1, 0].plot(df_per_State_features['Active Cases'].diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().dropna(), ax=axes[1, 1])
# 2nd Differencing
axes[2, 0].plot(df_per_State_features['Active Cases'].diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().diff().dropna(), ax=axes[2, 1])
plt.show()
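Similarly, the PACF plot (plot_pacf was imported earlier) helps in choosing the AR order p; a short sketch on the first-differenced active-case series:
# PACF of the differenced series: significant spikes suggest candidate AR orders
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(df_per_State_features['Active Cases'].diff()); axes[0].set_title('1st Order Differencing')
plot_pacf(df_per_State_features['Active Cases'].diff().dropna(), ax=axes[1])
plt.show()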
The Pyramid (pmdarima) library automatically discovers the optimal order for an ARIMA model through stepwise search over the hyperparameters and parallel fitting of candidate models.
The objective of this library's auto_arima function is to identify the most optimal parameters for an ARIMA/SARIMA model and return a fitted model. It does not depend on reading the ACF/PACF plots (manual selection of the differencing order); instead, it runs differencing tests (Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey–Fuller or Phillips–Perron) on its own to determine the order of differencing.
Another difference from StatsModels' vanilla ARIMA is that if the model does not converge, it raises a ValueError, signalling that the stationarity-inducing measures should be re-evaluated.
For seasonality, it identifies the optimal P and Q hyperparameters after running the Canova–Hansen test to determine the optimal order of seasonal differencing D; the autoregressive/moving-average portions of the seasonal model are bounded by start_P, start_Q, max_P and max_Q.
The non-seasonal part of the model is fitted within the ranges defined by start_p and max_p (the number of AR time lags) and start_q and max_q (the order of the moving average MA).
Hence the search space can be formulated as: (p, d, q)x(P, D, Q).
The best model is obtained by minimizing the chosen information_criterion: AIC (Akaike Information Criterion), AICC (corrected Akaike Information Criterion), BIC (Bayesian Information Criterion), HQIC (Hannan–Quinn Information Criterion), or "oob" (out-of-bag, for validation scoring).
It also accepts exogenous variables (just like the state-space models), so that additional regressor features can be included in the fit.
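A hedged sketch of how these options are exposed by auto_arima (parameter names are pmdarima's; trainActiveCases is the training series used throughout):
from pmdarima import auto_arima  # pmdarima is the renamed Pyramid package

# let auto_arima pick d via the ADF test and D via the Canova-Hansen test,
# scoring candidate models with the chosen information criterion
model_auto = auto_arima(trainActiveCases,
                        test='adf',                 # unit-root test used to choose d
                        seasonal=True, m=12,
                        seasonal_test='ch',         # Canova-Hansen test used to choose D
                        information_criterion='aic',
                        stepwise=True, trace=True,
                        error_action='ignore', suppress_warnings=True)
print(model_auto.summary())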
Summary of AR with Auto-ARIMA
The following code and figure depict an AR model fitted with Auto-ARIMA, using start_p=0, start_q=0, max_p=5 and max_q=0 (so no MA terms are searched).
model_ar= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=5,max_q=0,
suppress_warnings=True,stepwise=False,seasonal=False)
model_ar_fit = model_ar.fit(trainActiveCases)
prediction_ar=model_ar_fit.predict(len(validActiveCases))
print(model_ar_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid())
model_ar_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
Predicted vs Actual AR Auto-ARIMA
Now let's look at the moving average model: start_p and max_p are set to 0 (no autoregressive part), start_q stays at 0 and max_q is raised to 5. Here's the code snippet for it.
model_ma= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=0,max_q=5,
suppress_warnings=True,stepwise=False,seasonal=False)
model_ma_fit = model_ma.fit(trainActiveCases)
prediction_ma=model_ma_fit.predict(len(validActiveCases))
print(model_ma_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid())
model_ma_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
The hyperparameter search for the lowest (best) AIC, along with the residual diagnostics, is shown below.
Summary of MA with Auto-ARIMA
Predicted vs Actual MA Auto-ARIMA
Auto-ARIMA is then used to fit a full ARIMA model with the parameters start_p=1, start_q=1, max_p=5, max_q=5, as shown in the code snippet below.
model_arima= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=1,start_q=1,max_p=5,max_q=5,
suppress_warnings=True,stepwise=False,seasonal=False)
model_arima_fit = model_arima.fit(trainActiveCases)
prediction_arima=model_arima_fit.predict(len(validActiveCases))
print(model_arima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid())
model_arima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
Summary of ARIMA with Auto-ARIMA
Predicted vs Actual Auto-ARIMA
The SARIMA model is tuned over its hyperparameters to find the best model with the lowest AIC. The code snippet below performs parameter tuning for (p, d, q)x(P, D, Q, m).
Here the seasonal orders (P, D, Q) explored by the stepwise search include (1, 0, 1), (1, 0, 0) and (0, 0, 1), with m set to 12.
The parameters are initialized as start_p=0, start_q=2, max_p=5, max_q=5, m=12 (2 is also auto_arima's default for start_q).
model_sarima= auto_arima(trainActiveCases,trace=True, error_action='ignore',
start_p=0,start_q=2,max_p=5,max_q=5,m=12,
suppress_warnings=True,stepwise=True,seasonal=True)
model_sarima_fit = model_sarima.fit(trainActiveCases)
prediction_sarima=model_sarima_fit.predict(len(validActiveCases))
print(model_sarima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid())
model_sarima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
Predicted vs Actual Auto-SARIMA
We also try to explore seasonality in the Covid-19 dataset by applying seasonal differencing, where each value is differenced from the value one season earlier.
In the notation SARIMA(p, d, q)x(P, D, Q)m:
P denotes the seasonal AR order, D the order of seasonal differencing and Q the seasonal MA order, while m is the frequency (number of time steps per season) of the time series.
If the series has a well-defined seasonal pattern, we set D = 1 for the given frequency m. For a SARIMA model, we keep the total amount of differencing d + D <= 2.
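A minimal sketch of estimating d and D programmatically with pmdarima's ndiffs and nsdiffs helpers (m = 12 mirrors the value used with auto_arima above and is purely illustrative, since the case counts show no clear seasonality):
from pmdarima.arima.utils import ndiffs, nsdiffs

series = df_per_State_features['Active Cases']
d = ndiffs(series, test='adf')          # suggested order of regular differencing
D = nsdiffs(series, m=12, test='ch')    # suggested order of seasonal differencing at frequency m
print("Suggested d:", d, "Suggested D:", D, "total differencing d + D =", d + D)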
The following figure illustrates how to select the best Auto-ARIMA model and compute its error metrics for any input time series.
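A small sketch of that comparison, reusing the prediction_* arrays produced by the auto_arima snippets above and the same RMSE metric used earlier:
auto_preds = {"AR": prediction_ar, "MA": prediction_ma,
              "ARIMA": prediction_arima, "SARIMA": prediction_sarima}
auto_rmse = {name: np.sqrt(mean_squared_error(validActiveCases, pred))
             for name, pred in auto_preds.items()}
print("RMSE per Auto-ARIMA model:", auto_rmse)
print("Best model by RMSE:", min(auto_rmse, key=auto_rmse.get))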
Conclusion and Future Work
The complete source code is available at https://github.com/sharmi1206/covid-19-analysis