Basic Understanding of ARIMA/SARIMA vs Auto ARIMA/SARIMA using Covid-19 Data Predictions by@sharmi1206

July 15th 2020 447 reads

A time-series represents the sequence of a variable recorded with time over regular intervals of time. There are various applications of time series like Stock Market Prediction (with daily and hourly data), Annual Sales/Revenue, Seasonal Temperatures changes, monthly Cloud Infrastructure cost and second level prediction of

Python has 2 libraries StatsModels and Pyramid that helps to build forecasting models and predict values at a future time. As both these libraries provide forecasts for similar models namely,

AR (Auto-Regressive)MA (Moving Average)ARIMA (Auto-Regressive Integrated Moving Average)SARIMA (Seasonal Auto-Regressive Integrated Moving Average)

it will be interesting to see the features provided by each of the libraries along with the pros and cons of using each of them.

In this blog, I try to summarise the functionalities of both of these libraries by demonstrating the Number of Active Cases for Covid-19 for any Indian state. We keep our scope limited to univariate time series analysis.

The scope of the data is fro 8th March-1st May where data from 8th March-14th April has been used for training and 15th March-1st May has been used for validation and prediction. Here, we have demonstrated using state Maharashtra, however all states data and modeling can be done using the processed data from https://github.com/sharmi1206/covid-19-analysis.

It primarily comprises of three main modules:

An ARIMA model is composed of 3 constituent units which are :

In order to design a model (say ARIMA) by applying all of the above stated procedures we need to enable all parameters. We can also enable partial parameters and select to choose either AR or MA model. The parameters can be summarized as follows:

To start with and in order to have different flavours of ARIMA model, the library imports are:

```
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
```

The following diagram illustrates the mechanism of using AR/MA/ARIMA/SARIMA by self choosing the model orders (p, q, d) and self tuning it.

This Auto-Regressive model estimates a dependent variable **Yt** at any timing instant based on some number of lagged observations. The variables can be summarized as follows and formulated by the below-given equation:

Number of Active cases for Covid-19 can be formulated using AR model as follows. Here we put p, i.e. number of lag observations as 2, while d and q are set to 0.

```
model_ar = ARIMA(trainActiveCases, order=(2, 0, 0))
model_ar_fit = model_ar.fit()
prediction_ar=model_ar_fit.forecast(len(validActiveCases))[0]
print(model_ar_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
```

The summary and the predicted output obtained from the model are illustrated below:

The MA model also estimates a dependent variable Yt based on the lagged forecast errors.

where the error terms are the errors introduced from respective lags. The errors Yt and Y(t-1) are the errors from the following equations :

Number of Active cases for Covid-19 can be formulated using MA model as follows. Here we put ma (moving average window), q as 2, while d and p are set to 0.

```
model_ma = ARIMA(trainActiveCases, order=(0, 0, 2))
model_ma_fit = model_ma.fit()
prediction_ma=model_ma_fit.forecast(len(validActiveCases))[0]
print(model_ma_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
model_scores.append(np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
print("Root Mean Square Error for MA Model: ",np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
```

The summary and the predicted output obtained from the model are illustrated below:

An ARIMA model comes along with both the flavours of AR and MA model where the time series is differenced at least once to make it stationary. The equation can be formulated as follows, where we see the term β coming from AR and ε–the error terms coming from MA model .

Number of Active cases for Covid-19 can be formulated using MA model as follows. Here we put q i.e the ma (moving average window), d (order of differencing ) and p (number of lag observations) are set to 1.

The code snippet for this is :

```
model_arima = ARIMA(trainActiveCases, order=(1, 1, 1))
model_arima_fit = model_arima.fit()
prediction_arima=model_arima_fit.forecast(len(validActiveCases))[0]
valid_ml["ARIMA Model Prediction"]=list(np.exp(prediction_arima))
print(model_arima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe()
```

SARIMAX model is built by extending the ARIMA model, discussed in our previous section. In addition to terms

Number of Active cases for Covid-19 can be formulated using the **SARIMAX** model as follows. Here we put ma (moving average window), d (order of differencing ) and p (number of lag observations) are set to 1. We also enforce seasonal order as (0,0,0,12) to study its behavior, even though Covid-19 does not show any seasonality, evident from research and daily reports.

The code snippet for SARIMAX (without exogenous variable) can be given us:

```
model_sarima = sm.tsa.statespace.SARIMAX(trainActiveCases, order=(1, 1, 1), seasonal_order=(0,0,0,0),
enforce_stationarity=False, enforce_invertibility=False)
model_sarima_fit = model_sarima.fit()
prediction_sarima=model_sarima_fit.forecast(np.shape(validActiveCases)[0])
valid_ml["SARIMA Model Prediction"]=list(np.exp(prediction_sarima))
print(model_sarima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
```

The model summary and the predicted outcome are illustrated from the below plots.

In order to plot the trained, validation and predicted data under proper dates, we use the following code snippet.

```
plt.figure(figsize=(10,5))
index= pd.date_range(start=train_dates[0], periods=len(train_dates), freq='D')
valid_index = pd.date_range(start=valid_dates[0], periods=len(valid_dates), freq='D')
train_active = pd.Series(train_ml['Active Cases'].values, index)
valid_active = pd.Series(valid_ml['Active Cases'].values, valid_index)
pred_active = pd.Series(prediction_ar, valid_index)
f, ax = plt.subplots(1,1 , figsize=(12,10))
plt.plot(train_active, marker='o',color='blue',label ="Train Data Set")
plt.plot(valid_active, marker='o',color='green',label ="Valid Data Set")
plt.plot(pred_active, marker='o',color='red',label ="Predicted ARIMA/SARIMAX")
plt.legend()
plt.xlabel("Date Time")
plt.ylabel('Active Cases')
plt.title("Active Cases ARIMA/SARIMAX Model Forecasting for state " + stateName)
```

Differencing helps to make time-series stationary. In order to find a right order of differencing, we use the ACF plot and see that it reaches zero, fast and with a fairly static mean. With Covid-19 data for state Maharashtra, the ACF plot and the code snippet are given below.

```
# Original Series
fig, axes = plt.subplots(3, 2, sharex=True, figsize=(10,5))
axes[0, 0].plot(df_per_State_features['Active Cases']); axes[0, 0].set_title('Original Series')
plot_acf(df_per_State_features['Active Cases'], ax=axes[0, 1])
# 1st Differencing
axes[1, 0].plot(df_per_State_features['Active Cases'].diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().dropna(), ax=axes[1, 1])
# 2nd Differencing
axes[2, 0].plot(df_per_State_features['Active Cases'].diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().diff().dropna(), ax=axes[2, 1])
plt.show()
```

This library automatically discovers the optimal order for an ARIMA model with stepwise execution of hyperparameters and parallel fitting of models.

This objective of this library (auto_arima) is to identify the most optimal parameters for an ARIMA/SARIMA and return a fitted ARIMA model. It does not depend on the **PACF/Auto-Correlation (manual computation of differencing)**, but instead, it conducts differencing tests (i.e., Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or Phillips–Perron) on its own to determine the order of differencing.

One other difference from Stats Model’s Vanilla ARIMA is that in case the model does not converge, it raises ValueError to signify the re-evaluation of stationarity-inducing measures.

For seasonality, SARIMA identifies the optimal P and Q hyperparameters after conducting the Canova-Hansen to determine the optimal order of seasonal differencing **D**, so that the auto-regressive/moving average portions of the seasonal model are defined by start_P, start_Q, max_P, max_Q

The model is fitted within ranges of defined start_p order (or number of time lags for AR), max_p, start_q (order of the moving average MA), max_q ranges.

Hence it can be formulated as **: (p, q, d)x(P, Q, D)**

The best model is obtained by optimizing **information_criterion** (**AIC** – Akaike Information Criterion, **AICC** – Corrected Akaike Information Criterion, **BIC** – Bayesian Information Criterion, **HQIC** – Hannan-Quinn Information Criterion, or “out of bag”–for validation scoring–respectively to a minimal value.

It also operates with **exogenous** variables (just like state space methods/models) for predicting added features in the regression operation.

The following code and figure depicts AR model with Auto ARIMA with **start_p=0, start_q=2 (by default), max_p=5, max_q=0**

```
model_ar= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=5,max_q=0,
suppress_warnings=True,stepwise=False,seasonal=False)
model_ar_fit = model_ar.fit(trainActiveCases)
prediction_ar=model_ar_fit.predict(len(validActiveCases))
print(model_ar_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid())
model_ar_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

Now let's look at the Moving Average model by changing** start_p (AutoRegressive Moving part - set to 0), start_q, max_p, max_q** (set to 5). Here's the code snippet for it.

```
model_ma= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=0,max_q=5,
suppress_warnings=True,stepwise=False,seasonal=False)
model_ma_fit = model_ma.fit(trainActiveCases)
prediction_ma=model_ma_fit.predict(len(validActiveCases))
print(model_ma_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid())
model_ma_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

The hyper-parameter tuning for evaluating the lowest and best AIC with the residual components are given below.

The Auto ARIMA is then used to model ARIMA by using the following parameters ------------------**(start_p=1,start_q=1,max_p=5,max_q=5)** as shown in the below code snippet.

```
model_arima= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=1,start_q=1,max_p=5,max_q=5,
suppress_warnings=True,stepwise=False,seasonal=False)
model_arima_fit = model_arima.fit(trainActiveCases)
prediction_arima=model_arima_fit.predict(len(validActiveCases))
print(model_arima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid())
model_arima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

The SARIMA model is tuned with **hyper-parameters** to find the best model with the lowest AIC. The code snippet is given below, which does parameter tuning for **(p, q, d)x(P, Q, D)**

Here**(P, Q, D)** and **(m) **are tuned with values (1,0,1), (1,0,0), (0,0,1) and 12 respectively.

Here

The parameters are initialized as , **start_p=0,start_q=2,max_p=5,max_q=5,m=12**. **start_q **even **set to 0**, takes **default value of 2.**

```
model_sarima= auto_arima(trainActiveCases,trace=True, error_action='ignore',
start_p=0,start_q=2,max_p=5,max_q=5,m=12,
suppress_warnings=True,stepwise=True,seasonal=True)
model_sarima_fit = model_sarima.fit(trainActiveCases)
prediction_sarima=model_sarima_fit.predict(len(validActiveCases))
print(model_sarima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid())
model_sarima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

We also try to explore seasonality with Covid-19 dataset by applying regular differencing, where the values are differenced from the previous season.

In the equation of SARIMA(p,d,q)x(P,D,Q):

P denotes Seasonal AR, D denote Seasonal Order of seasonal differencing and Q denotes Seasonal Moving Average, x is the frequency of the time series.

If a model has defined seasonal patterns, we set D=1 for a given frequency ‘x’. For a SARIMA model, we keep the total differencing + D’ <=2

The following figure illustrates how to select the best Auto ARIMA Model, compute its error metrics for any input time-series data.

- More work needs to be done by studying the impact of the number of people quarantined, hospital beds, icu units, ventilators, rate of tests done by using Vector Autoregressive Moving-Average with eXogenous regressors (VARMAX).
- Evaluate the model performance by
*sm.tsa.statespace.ExponentialSmoothing*(multiplicative models) and*sm.tsa.ExponentialSmoothing (linear model), i.e.*the two different modules of StatsModel. - A detailed study of using different smoothing techniques with latency and accuracy to analyze the tradeoff between the two.
- Evaluate confidence intervals for forecasts, based on an assumption of Gaussian errors as provided by
*sm.tsa.statespace.ExponentialSmoothing*Analyzing the performance impact by concentrating initial values out of the objective function, particularly in situations when the seasonal periodicity is large. - With more number of COVID-19 cases evolving, the right order of differencing (if needed multiple orders of differencing) needs to be computed to arrive at a near-stationary series (ACF plot nearing zero).
- Here we have only computed RMSE, but an ideal situation all error metrics Mean Absolute Percentage Error (MAPE), Mean Error (ME), Mean Absolute Error (MAE), Mean Percentage Error (MPE), Root Mean Squared Error (RMSE), Lag 1 Autocorrelation of Error (ACF1), Correlation between the Actual and the Forecast (core), Min-Max Error (min-max) needs to be computed.

The complete source code is available at **https://github.com/sharmi1206/covid-19-analysis**

- https://www.statsmodels.org/dev/user-guide.html#time-series-analysis
- http://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html
- https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
- https://machinelearningmastery.com/how-to-grid-search-sarima-model-hyperparameters-for-time-series-forecasting-in-python/https://otexts.com/fpp2/arima-r.html