- Read
- Top Stories
Top Stories - All Top Stories
- Are We in the Third or the Fourth Industrial Revolution?
- Ten Future Technologies That Aren't in the Public Eye (Yet)
- 3 New Startups That Are Innovating DeFi Data Analysis Technology
- 5 Solidity Code Smells Every Dev Should Know
- What Emerging American Legislation Means for Crypto Assets and CBDCs

- Write
Writer Brand Writing Prompts - Full List of Prompts
- Decentralized Internet Contest
- Startups Interview
- Gaming Metaverse Contest
- All about you
- Do you own a good company?
- Get feral when you answer to the greatest interview in history
- Share your philosophy
- Which tech trends excite you?
- Do you travel?
- Describe your life hacks
- Describe your DIY project
- What do you wish existed?

- Learn
- STARTUPS
- Tech Giants
- About
Company People Software by HackerNoon - Help
- Sponsor

Basic Understanding of ARIMA/SARIMA vs Auto ARIMA/SARIMA using Covid-19 Data Predictions by@sharmi1206

July 15th 2020 4,918 reads

Python has 2 libraries StatsModels and Pyramid that helps to build forecasting models and predict values at a future time. In this blog, I try to summarise the functionalities of both of these libraries by demonstrating the Number of Active Cases for Covid-19 for any Indian state. We keep our scope limited to univariate time series analysis. Here, we have demonstrated using state Maharashtra data and modeling can be done using the processed data from https://github.com/sharmi1206/covid19-analysis.

A time-series represents the sequence of a variable recorded with time over regular intervals of time. There are various applications of time series like Stock Market Prediction (with daily and hourly data), Annual Sales/Revenue, Seasonal Temperatures changes, monthly Cloud Infrastructure cost and second level prediction of

Python has 2 libraries StatsModels and Pyramid that helps to build forecasting models and predict values at a future time. As both these libraries provide forecasts for similar models namely,

AR (Auto-Regressive)MA (Moving Average)ARIMA (Auto-Regressive Integrated Moving Average)SARIMA (Seasonal Auto-Regressive Integrated Moving Average)

it will be interesting to see the features provided by each of the libraries along with the pros and cons of using each of them.

In this blog, I try to summarise the functionalities of both of these libraries by demonstrating the Number of Active Cases for Covid-19 for any Indian state. We keep our scope limited to univariate time series analysis.

The scope of the data is fro 8th March-1st May where data from 8th March-14th April has been used for training and 15th March-1st May has been used for validation and prediction. Here, we have demonstrated using state Maharashtra, however all states data and modeling can be done using the processed data from https://github.com/sharmi1206/covid-19-analysis.

It primarily comprises of three main modules:

**tsa**– Comprises of functions useful for univariate time series analysis – autoregressive models (AR), vector autoregressive models (VAR), and univariate autoregressive moving average models (ARMA), along with Non-linear models like dynamic regression and autoregression. In addition, it also provides autocorrelation, partial autocorrelation function, periodogram, and other statistical properties related to ARMA, and methods that process moving average lag-polynomials.

**statespace** – Equipped with functionality to model observations at any time instant with unobserved state vectors and irregular components. It has added flexibility to allow estimation with missing observations, forecasting, impulse response functions, etc. This flexibility helps it to differentiate from tsa by estimating additive and multiplicative seasonal effects on models (both on univariate and multi-variate), as well as arbitrary trend polynomials.

**vector_ar**-Allows simultaneous modeling and analyzing multiple time series.

An ARIMA model is composed of 3 constituent units which are :

**AR**: * Autoregression*. This part explores any dependent relationship between an observation and some number of lagged variables.

**I**: * Integrated*. This part aims to make time-series stationery by subtracting or differencing an observation from observation at the previous time step of the same time series.

**MA**: *Moving Average***.** This part explores the relationship between an observation and a residual error by application of moving average to lagged observations, with any given time window.

In order to design a model (say ARIMA) by applying all of the above stated procedures we need to enable all parameters. We can also enable partial parameters and select to choose either AR or MA model. The parameters can be summarized as follows:

**p**: The number of lag observations included in the model, also called the lag order.

**d**: The number of times that the raw observations are differenced also called the degree of difference.

**q**: The size or the order of the moving average window.

To start with and in order to have different flavours of ARIMA model, the library imports are:

```
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
```

The following diagram illustrates the mechanism of using AR/MA/ARIMA/SARIMA by self choosing the model orders (p, q, d) and self tuning it.

**AR Model**

This Auto-Regressive model estimates a dependent variable **Yt** at any timing instant based on some number of lagged observations. The variables can be summarized as follows and formulated by the below-given equation:

**Y(t-1): Lag 1 of the time series**

**β: Coefficient of lag1**

**α: Intercept term**

*Dependence of Lag in AR Estimation, where β and α are model estimates*

Number of Active cases for Covid-19 can be formulated using AR model as follows. Here we put p, i.e. number of lag observations as 2, while d and q are set to 0.

```
model_ar = ARIMA(trainActiveCases, order=(2, 0, 0))
model_ar_fit = model_ar.fit()
prediction_ar=model_ar_fit.forecast(len(validActiveCases))[0]
print(model_ar_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
```

The summary and the predicted output obtained from the model are illustrated below:

** AR Moel Summary**

** Predicted vs Actual AR**

**MA Model**

The MA model also estimates a dependent variable Yt based on the lagged forecast errors.

** MA Estimation with Lagged Errors**

where the error terms are the errors introduced from respective lags. The errors Yt and Y(t-1) are the errors from the following equations :

Number of Active cases for Covid-19 can be formulated using MA model as follows. Here we put ma (moving average window), q as 2, while d and p are set to 0.

```
model_ma = ARIMA(trainActiveCases, order=(0, 0, 2))
model_ma_fit = model_ma.fit()
prediction_ma=model_ma_fit.forecast(len(validActiveCases))[0]
print(model_ma_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
model_scores.append(np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
print("Root Mean Square Error for MA Model: ",np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
```

The summary and the predicted output obtained from the model are illustrated below:

** MA Model Summary**

** Predicted vs Actual MA**

**ARIMA Model**

An ARIMA model comes along with both the flavours of AR and MA model where the time series is differenced at least once to make it stationary. The equation can be formulated as follows, where we see the term β coming from AR and ε–the error terms coming from MA model .

**ARIMA – Differencing + Combination of Parameters from AR and MA model**

Number of Active cases for Covid-19 can be formulated using MA model as follows. Here we put q i.e the ma (moving average window), d (order of differencing ) and p (number of lag observations) are set to 1.

The code snippet for this is :

```
model_arima = ARIMA(trainActiveCases, order=(1, 1, 1))
model_arima_fit = model_arima.fit()
prediction_arima=model_arima_fit.forecast(len(validActiveCases))[0]
valid_ml["ARIMA Model Prediction"]=list(np.exp(prediction_arima))
print(model_arima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe()
```

** Predicted vs Actual ARIMA**

*We can see it’s better to use ARIMA than AR or MA model, as this is the model that closely resembles the actual outcome.*

SARIMAX model is built by extending the ARIMA model, discussed in our previous section. In addition to terms

**AR, I **and** MA** terms, there are four seasonal elements that are not part of ARIMA that must be configured; they are:

**P**: Seasonal autoregressive order.

**D**: Seasonal difference order.

**Q**: Seasonal moving average order.

**m**: The number of time steps for a single seasonal period.

Number of Active cases for Covid-19 can be formulated using the **SARIMAX** model as follows. Here we put ma (moving average window), d (order of differencing ) and p (number of lag observations) are set to 1. We also enforce seasonal order as (0,0,0,12) to study its behavior, even though Covid-19 does not show any seasonality, evident from research and daily reports.

The code snippet for SARIMAX (without exogenous variable) can be given us:

```
model_sarima = sm.tsa.statespace.SARIMAX(trainActiveCases, order=(1, 1, 1), seasonal_order=(0,0,0,0),
enforce_stationarity=False, enforce_invertibility=False)
model_sarima_fit = model_sarima.fit()
prediction_sarima=model_sarima_fit.forecast(np.shape(validActiveCases)[0])
valid_ml["SARIMA Model Prediction"]=list(np.exp(prediction_sarima))
print(model_sarima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
```

The model summary and the predicted outcome are illustrated from the below plots.

** SARIMAX Model Summary**

** Predicted vs Actual SARIMA**

In order to plot the trained, validation and predicted data under proper dates, we use the following code snippet.

```
plt.figure(figsize=(10,5))
index= pd.date_range(start=train_dates[0], periods=len(train_dates), freq='D')
valid_index = pd.date_range(start=valid_dates[0], periods=len(valid_dates), freq='D')
train_active = pd.Series(train_ml['Active Cases'].values, index)
valid_active = pd.Series(valid_ml['Active Cases'].values, valid_index)
pred_active = pd.Series(prediction_ar, valid_index)
f, ax = plt.subplots(1,1 , figsize=(12,10))
plt.plot(train_active, marker='o',color='blue',label ="Train Data Set")
plt.plot(valid_active, marker='o',color='green',label ="Valid Data Set")
plt.plot(pred_active, marker='o',color='red',label ="Predicted ARIMA/SARIMAX")
plt.legend()
plt.xlabel("Date Time")
plt.ylabel('Active Cases')
plt.title("Active Cases ARIMA/SARIMAX Model Forecasting for state " + stateName)
```

**Auto-Correlation Plots**

Differencing helps to make time-series stationary. In order to find a right order of differencing, we use the ACF plot and see that it reaches zero, fast and with a fairly static mean. With Covid-19 data for state Maharashtra, the ACF plot and the code snippet are given below.

```
# Original Series
fig, axes = plt.subplots(3, 2, sharex=True, figsize=(10,5))
axes[0, 0].plot(df_per_State_features['Active Cases']); axes[0, 0].set_title('Original Series')
plot_acf(df_per_State_features['Active Cases'], ax=axes[0, 1])
# 1st Differencing
axes[1, 0].plot(df_per_State_features['Active Cases'].diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().dropna(), ax=axes[1, 1])
# 2nd Differencing
axes[2, 0].plot(df_per_State_features['Active Cases'].diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().diff().dropna(), ax=axes[2, 1])
plt.show()
```

This library automatically discovers the optimal order for an ARIMA model with stepwise execution of hyperparameters and parallel fitting of models.

This objective of this library (auto_arima) is to identify the most optimal parameters for an ARIMA/SARIMA and return a fitted ARIMA model. It does not depend on the **PACF/Auto-Correlation (manual computation of differencing)**, but instead, it conducts differencing tests (i.e., Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or Phillips–Perron) on its own to determine the order of differencing.

One other difference from Stats Model’s Vanilla ARIMA is that in case the model does not converge, it raises ValueError to signify the re-evaluation of stationarity-inducing measures.

For seasonality, SARIMA identifies the optimal P and Q hyperparameters after conducting the Canova-Hansen to determine the optimal order of seasonal differencing **D**, so that the auto-regressive/moving average portions of the seasonal model are defined by start_P, start_Q, max_P, max_Q

The model is fitted within ranges of defined start_p order (or number of time lags for AR), max_p, start_q (order of the moving average MA), max_q ranges.

Hence it can be formulated as **: (p, q, d)x(P, Q, D)**

The best model is obtained by optimizing **information_criterion** (**AIC** – Akaike Information Criterion, **AICC** – Corrected Akaike Information Criterion, **BIC** – Bayesian Information Criterion, **HQIC** – Hannan-Quinn Information Criterion, or “out of bag”–for validation scoring–respectively to a minimal value.

It also operates with **exogenous** variables (just like state space methods/models) for predicting added features in the regression operation.

** Summary of AR with Auto-ARIMA**

The following code and figure depicts AR model with Auto ARIMA with **start_p=0, start_q=2 (by default), max_p=5, max_q=0**

```
model_ar= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=5,max_q=0,
suppress_warnings=True,stepwise=False,seasonal=False)
model_ar_fit = model_ar.fit(trainActiveCases)
prediction_ar=model_ar_fit.predict(len(validActiveCases))
print(model_ar_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid())
model_ar_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

** Predicted vs Actual AR Auto-ARIMA**

Now let's look at the Moving Average model by changing** start_p (AutoRegressive Moving part - set to 0), start_q, max_p, max_q** (set to 5). Here's the code snippet for it.

```
model_ma= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=0,max_q=5,
suppress_warnings=True,stepwise=False,seasonal=False)
model_ma_fit = model_ma.fit(trainActiveCases)
prediction_ma=model_ma_fit.predict(len(validActiveCases))
print(model_ma_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid())
model_ma_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

The hyper-parameter tuning for evaluating the lowest and best AIC with the residual components are given below.

** Summary of MA with Auto-ARIMA**

**Predicted vs Actual MA Auto-ARIMA**

The Auto ARIMA is then used to model ARIMA by using the following parameters ------------------**(start_p=1,start_q=1,max_p=5,max_q=5)** as shown in the below code snippet.

```
model_arima= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=1,start_q=1,max_p=5,max_q=5,
suppress_warnings=True,stepwise=False,seasonal=False)
model_arima_fit = model_arima.fit(trainActiveCases)
prediction_arima=model_arima_fit.predict(len(validActiveCases))
print(model_arima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid())
model_arima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

** Summary of ARIMA with Auto-ARIMA**

** Predicted vs Actual Auto-ARIMA**

The SARIMA model is tuned with **hyper-parameters** to find the best model with the lowest AIC. The code snippet is given below, which does parameter tuning for **(p, q, d)x(P, Q, D)**

Here **(P, Q, D)** and **(m) **are tuned with values (1,0,1), (1,0,0), (0,0,1) and 12 respectively.

The parameters are initialized as , **start_p=0,start_q=2,max_p=5,max_q=5,m=12**. **start_q **even **set to 0**, takes **default value of 2.**

```
model_sarima= auto_arima(trainActiveCases,trace=True, error_action='ignore',
start_p=0,start_q=2,max_p=5,max_q=5,m=12,
suppress_warnings=True,stepwise=True,seasonal=True)
model_sarima_fit = model_sarima.fit(trainActiveCases)
prediction_sarima=model_sarima_fit.predict(len(validActiveCases))
print(model_sarima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid())
model_sarima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
```

** Predicted vs Actual Auto-SARIMA**

We also try to explore seasonality with Covid-19 dataset by applying regular differencing, where the values are differenced from the previous season.

In the equation of SARIMA(p,d,q)x(P,D,Q):

P denotes Seasonal AR, D denote Seasonal Order of seasonal differencing and Q denotes Seasonal Moving Average, x is the frequency of the time series.

If a model has defined seasonal patterns, we set D=1 for a given frequency ‘x’. For a SARIMA model, we keep the total differencing + D’ <=2

The following figure illustrates how to select the best Auto ARIMA Model, compute its error metrics for any input time-series data.

**Conclusion and Future Work**

- More work needs to be done by studying the impact of the number of people quarantined, hospital beds, icu units, ventilators, rate of tests done by using Vector Autoregressive Moving-Average with eXogenous regressors (VARMAX).
- Evaluate the model performance by
*sm.tsa.statespace.ExponentialSmoothing*(multiplicative models) and*sm.tsa.ExponentialSmoothing (linear model), i.e.*the two different modules of StatsModel. - A detailed study of using different smoothing techniques with latency and accuracy to analyze the tradeoff between the two.
- Evaluate confidence intervals for forecasts, based on an assumption of Gaussian errors as provided by
*sm.tsa.statespace.ExponentialSmoothing*Analyzing the performance impact by concentrating initial values out of the objective function, particularly in situations when the seasonal periodicity is large. - With more number of COVID-19 cases evolving, the right order of differencing (if needed multiple orders of differencing) needs to be computed to arrive at a near-stationary series (ACF plot nearing zero).
- Here we have only computed RMSE, but an ideal situation all error metrics Mean Absolute Percentage Error (MAPE), Mean Error (ME), Mean Absolute Error (MAE), Mean Percentage Error (MPE), Root Mean Squared Error (RMSE), Lag 1 Autocorrelation of Error (ACF1), Correlation between the Actual and the Forecast (core), Min-Max Error (min-max) needs to be computed.

The complete source code is available at **https://github.com/sharmi1206/covid-19-analysis**

- https://www.statsmodels.org/dev/user-guide.html#time-series-analysis
- http://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html
- https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
- https://machinelearningmastery.com/how-to-grid-search-sarima-model-hyperparameters-for-time-series-forecasting-in-python/https://otexts.com/fpp2/arima-r.html

Join Hacker Noon

Create your free account to unlock your custom reading experience.