Must-Know Base Tips for Feature Engineering With Time Series Data

Introduction

Time series data, which consists of data points arranged in chronological order, is vital in diverse sectors such as finance, healthcare, and meteorology.

The art of feature engineering, where new features are derived from existing data, is a crucial aspect of developing precise and dependable predictive models.

In this article, we will discuss some essential base tips for feature engineering with time series data to help you harness the full potential of your data and improve the performance of your machine learning models.

We will delve into date and time features, domain-specific features, lag features, rolling and expanding window functions, exponential smoothing, and seasonal decomposition.

By mastering these techniques, you will be better equipped to uncover hidden patterns, trends, and relationships within your time series data and enhance your model's ability to make accurate predictions.

Tips

1. Date and Time Features and Domain-Specific Features

Time series data often comes with timestamps that provide valuable information about the data.

Extracting date and time features such as day, month, year, day of the week, and hour can help improve the model's performance by capturing patterns and trends related to these temporal components.

Seasonality, holidays, and weekends are examples of recurring calendar effects that date and time features can expose to a model.
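One way to expose such recurring patterns to a model is to encode a calendar field as a point on a circle, so that December sits next to January and Sunday sits next to Monday. The sketch below is a minimal illustration, assuming a DataFrame with a DatetimeIndex; the function name add_cyclical_features and the output column names are hypothetical and not part of any library.

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df):
    """Add sine/cosine encodings of month and day of week (illustrative helper)."""
    out = df.copy()
    # Map month 1..12 to an angle on the unit circle (January at angle 0).
    month_angle = 2 * np.pi * (out.index.month.to_numpy() - 1) / 12
    out['month_sin'] = np.sin(month_angle)
    out['month_cos'] = np.cos(month_angle)
    # Map day of week 0..6 (Monday=0) to an angle on the unit circle.
    dow_angle = 2 * np.pi * out.index.dayofweek.to_numpy() / 7
    out['dow_sin'] = np.sin(dow_angle)
    out['dow_cos'] = np.cos(dow_angle)
    return out
```

With this encoding, the distance between December and January is the same as the distance between any other pair of neighboring months, which a plain integer month column cannot express.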

Incorporating domain-specific features can significantly enhance the performance of a machine-learning model. Domain-specific features are derived from expert knowledge in the relevant field and can provide valuable information that is not present in the raw time series data.

For example, in finance, the volatility index or other economic indicators can be used as additional features to improve the prediction of stock prices.
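As a rough sketch of how an external indicator can be attached to a price series, the example below aligns a less frequently updated indicator to the dates of the main DataFrame by carrying the last known value forward. The function name add_external_indicator, the column name, and the numbers are all made up for illustration; both inputs are assumed to have sorted DatetimeIndexes.

```python
import pandas as pd

def add_external_indicator(prices, indicator, name):
    """Attach an external indicator series as a new feature column (illustrative helper)."""
    out = prices.copy()
    # Forward-fill so each date carries the most recent indicator value
    # published on or before that date.
    out[name] = indicator.reindex(out.index, method='ffill')
    return out

# Made-up sample data, for illustration only.
prices = pd.DataFrame(
    {'close': [100.0, 101.5, 99.8, 102.3, 103.1]},
    index=pd.date_range('2023-01-02', periods=5, freq='D'),
)
volatility_index = pd.Series(
    [20.0, 25.0],
    index=pd.to_datetime(['2023-01-02', '2023-01-04']),
)
features = add_external_indicator(prices, volatility_index, 'volatility_index')
```

Whether the indicator should also be shifted by one period depends on when it becomes available relative to the prediction time, so treat the alignment here as one possible choice.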

Here is a function of my own, which has been widely used in quant hedge fund practice, that extracts calendar features from a DataFrame with a DatetimeIndex:

import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def extract_date_features(data):
    # Assumes `data` is a DataFrame with a timezone-naive DatetimeIndex.
    # Columns are added in place and the same DataFrame is returned.
    data['Year'] = data.index.year
    data['Month'] = data.index.month
    data['Day'] = data.index.day
    data['Weekday'] = data.index.weekday
    data['Day_of_year'] = data.index.dayofyear
    data['Week_of_year'] = data.index.isocalendar().week
    data['Quarter'] = data.index.quarter
    data['Is_month_start'] = data.index.is_month_start
    data['Is_month_end'] = data.index.is_month_end
    data['Is_quarter_start'] = data.index.is_quarter_start
    data['Is_quarter_end'] = data.index.is_quarter_end
    data['Is_year_start'] = data.index.is_year_start
    data['Is_year_end'] = data.index.is_year_end
    data['Days_in_month'] = data.index.days_in_month
    data['Is_leap_year'] = data.index.is_leap_year
    data['Elapsed_days'] = (data.index - data.index.min()).days
    data['Weekday_name'] = data.index.day_name()
    data['Month_name'] = data.index.month_name()
    data['Is_weekend'] = data['Weekday'] >= 5
    data['Is_weekday'] = ~data['Is_weekend']
    data['Days_till_month_end'] = data['Days_in_month'] - data['Day']
    data['Days_since_month_start'] = data['Day'] - 1
    data['Week_of_month'] = (data['Day'] - 1) // 7 + 1
    data['Weekday_of_month'] = (data['Day'] - 1) % 7 + 1

    # Holiday distances. The US federal calendar is an assumption made for
    # illustration; swap in the calendar that matches your market or region.
    days = data.index.normalize()
    holidays = USFederalHolidayCalendar().holidays(
        start=days.min() - pd.DateOffset(years=1),
        end=days.max() + pd.DateOffset(years=1),
    )
    next_pos = holidays.searchsorted(days, side='left')       # first holiday on or after each date
    prev_pos = holidays.searchsorted(days, side='right') - 1  # last holiday on or before each date
    data['Days_to_next_holiday'] = (holidays[next_pos] - days).days
    data['Days_since_last_holiday'] = (days - holidays[prev_pos]).days

    # Business-day counts (Monday to Friday, holidays not excluded).
    day = data.index.values.astype('datetime64[D]')
    month = data.index.values.astype('datetime64[M]')
    month_start = month.astype('datetime64[D]')
    next_month_start = (month + np.timedelta64(1, 'M')).astype('datetime64[D]')
    data['Business_days_in_month'] = np.busday_count(month_start, next_month_start)
    # Number of business days in the month strictly before the current date.
    data['Business_day_of_month'] = np.busday_count(month_start, day)

    data['Days_since_first_day_of_year'] = data['Day_of_year'] - 1
    data['Days_remaining_in_year'] = 365 + data['Is_leap_year'].astype(int) - data['Day_of_year']

    return data

# `data` is assumed to be an existing DataFrame indexed by a DatetimeIndex.
data = extract_date_features(data)
print(data.head())

2. Lag Features

Lag features refer to the values of previous time steps in the time series. They can help capture the autocorrelation present in the data, which is the relationship between the current value and its past values.

Adding lag features can improve a model's performance by allowing it to learn patterns from the past to predict future values. You can create lag features by shifting the original time series data by a specific number of periods, often called the "lag order."

import pandas as pd

def create_lag_features(data, n_lags):
    data_frame = pd.DataFrame(data, columns=['value'])
    for i in range(1, n_lags+1):
        data_frame[f'lag_{i}'] = data_frame['value'].shift(i)
    return data_frame

time_series = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
n_lags = 3
lag_features = create_lag_features(time_series, n_lags)
print(lag_features)

# Result:

   value  lag_1  lag_2  lag_3
0    100    NaN    NaN    NaN
1    110  100.0    NaN    NaN
2    120  110.0  100.0    NaN
3    130  120.0  110.0  100.0
4    140  130.0  120.0  110.0
5    150  140.0  130.0  120.0
6    160  150.0  140.0  130.0
7    170  160.0  150.0  140.0
8    180  170.0  160.0  150.0
9    190  180.0  170.0  160.0
10   200  190.0  180.0  170.0
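The first n_lags rows contain NaN because no earlier values exist for them, so they usually have to be dropped (or imputed) before training. The following sketch shows one way to turn the lagged frame into a feature matrix and a target vector; the function name make_supervised is hypothetical.

```python
import pandas as pd

def make_supervised(values, n_lags):
    """Build (X, y) where each row of X holds the n_lags previous values of y."""
    frame = pd.DataFrame({'value': values})
    for i in range(1, n_lags + 1):
        frame[f'lag_{i}'] = frame['value'].shift(i)
    # Drop the leading rows that have no complete lag history.
    frame = frame.dropna()
    X = frame.drop(columns='value')
    y = frame['value']
    return X, y

X, y = make_supervised([100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200], n_lags=3)
```

Any regressor that accepts a tabular feature matrix can then be fitted on X and y, as long as the train/test split respects chronological order.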

3. Rolling and Expanding Window Functions

Rolling and expanding window statistics are summary statistics such as the mean, median, standard deviation, minimum, or maximum. A rolling window computes them over a moving window of fixed size, while an expanding window computes them over a window that grows to include every observation up to the current one.

These features help capture the local trends, fluctuations, and overall behavior in the time series data, allowing the model to learn from the temporal dynamics. Rolling window statistics can be particularly useful when dealing with noisy data or when the time series exhibits non-stationarity.

Expanding windows, on the other hand, can provide insights into the cumulative effects or long-term trends in the data. Both rolling and expanding window statistics can be easily implemented using the rolling and expanding functions available in the pandas library.

Code example:

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative sample data (an assumption); replace it with your own DataFrame
# that has a 'Date' column and a numeric 'Data' column.
time_series = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'Data': [100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
})

# Apply a rolling window
window_size = 3
time_series['Rolling_Mean'] = time_series['Data'].rolling(window=window_size).mean()
time_series['Rolling_Std'] = time_series['Data'].rolling(window=window_size).std()

# Apply an expanding window
time_series['Expanding_Mean'] = time_series['Data'].expanding().mean()
time_series['Expanding_Std'] = time_series['Data'].expanding().std()

# Visualize the results
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(time_series['Date'], time_series['Data'], label='Original Data', marker='o')
ax.plot(time_series['Date'], time_series['Rolling_Mean'], label='Rolling Mean', marker='o')
ax.plot(time_series['Date'], time_series['Expanding_Mean'], label='Expanding Mean', marker='o')
ax.legend(loc='best')
ax.set_title('Rolling and Expanding Window Example')
plt.show()
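When window statistics are used as features for predicting the same column, the current value should usually be excluded from its own window, otherwise the feature leaks the target. A minimal sketch of that idea, assuming a DataFrame with a numeric column, is shown below; the function name add_window_features and the generated column names are hypothetical.

```python
import pandas as pd

def add_window_features(df, column, window):
    """Add rolling and expanding statistics computed from past values only."""
    out = df.copy()
    # Shift by one step so the window at time t ends at t - 1.
    past = out[column].shift(1)
    out[f'{column}_roll_mean_{window}'] = past.rolling(window=window).mean()
    out[f'{column}_roll_std_{window}'] = past.rolling(window=window).std()
    out[f'{column}_exp_mean'] = past.expanding().mean()
    return out

example = add_window_features(pd.DataFrame({'Data': [1, 2, 3, 4, 5]}), 'Data', window=2)
```

The first rows are NaN because there is not yet enough history to fill the window.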

4. Exponential Smoothing

Exponential smoothing is a technique that assigns exponentially decreasing weights to past observations, placing more emphasis on recent observations. This approach can be useful for reducing noise and capturing trends in the time series data.

There are various forms of exponential smoothing such as simple exponential smoothing, double exponential smoothing (Holt's method), and triple exponential smoothing (Holt-Winters method), which can be employed based on the characteristics of the time series data.

Simple exponential smoothing is the most basic form. Each smoothed value is a weighted average of the current observation and the previous smoothed value:

ES(t) = α * X_t + (1 - α) * ES(t-1), with ES(0) = X_0

where ES(t) is the exponentially smoothed value at time t, X_t is the data point at time t, and α is the smoothing factor (0 < α < 1). A larger α gives more weight to recent observations.

Simple Python example:

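def exponential_smoothing(data, alpha):
    # Start from the first observation, then apply the recursion
    # ES(t) = alpha * X_t + (1 - alpha) * ES(t-1).
    es = [data[0]]
    for t in range(1, len(data)):
        es.append(alpha * data[t] + (1 - alpha) * es[t - 1])
    # Return only after the whole series has been processed.
    return es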
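The same recursion is available in pandas through Series.ewm with adjust=False, which is convenient when the smoothed series is meant to be a feature column. The sketch below assumes a pandas Series as input; the function name exponential_smoothing_feature is hypothetical.

```python
import pandas as pd

def exponential_smoothing_feature(series, alpha):
    """Simple exponential smoothing via pandas.

    With adjust=False, pandas uses y_0 = x_0 and
    y_t = alpha * x_t + (1 - alpha) * y_{t-1}.
    """
    return series.ewm(alpha=alpha, adjust=False).mean()

smoothed = exponential_smoothing_feature(pd.Series([10.0, 12.0, 11.0, 15.0]), alpha=0.5)
```

As with rolling statistics, the smoothed value at time t includes the observation at time t, so shifting it by one step may be needed when it is used to predict that same observation.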

Holt-Winters is a forecasting method that extends simple exponential smoothing to account for trend and seasonality. It maintains three components: level (the smoothed value of the series), trend (the smoothed slope), and seasonality (the smoothed seasonal effect).

The method comes in two flavors: additive, where the seasonal effect is added to the level, and multiplicative, where the seasonal effect scales the level.

Holt-Winters code example:

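from statsforecast.models import HoltWinters
from statsforecast.utils import AirPassengers as ap

# Additive Holt-Winters with a 12-period season (monthly data, yearly cycle)
model = HoltWinters(season_length=12, error_type='A')
model = model.fit(y=ap)
# Forecast the next 4 periods
y_hat_dict = model.predict(h=4)
print(y_hat_dict)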
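To make the level and trend components more concrete, here is a minimal pure-Python sketch of double exponential smoothing (Holt's linear method), the non-seasonal step between simple exponential smoothing and Holt-Winters. The function name holt_linear is hypothetical, and the initialization (level from the first observation, trend from the first difference) is one common choice among several, assumed here for simplicity.

```python
def holt_linear(data, alpha, beta, horizon=1):
    """Double exponential smoothing (Holt's linear method), illustrative sketch.

    alpha smooths the level, beta smooths the trend.
    Returns forecasts for 1..horizon steps ahead.
    """
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # Assumed initialization: first value as level, first difference as trend.
    level = data[0]
    trend = data[1] - data[0]
    for t in range(1, len(data)):
        previous_level = level
        # Level: blend the new observation with the previous one-step forecast.
        level = alpha * data[t] + (1 - alpha) * (previous_level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # The h-step forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]

forecast = holt_linear([10, 12, 14, 16, 18], alpha=0.5, beta=0.3, horizon=2)
```

On a perfectly linear series such as this one, the forecasts simply continue the line. Holt-Winters adds a third recursion of the same kind for the seasonal component.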

5. Seasonal Decomposition

Seasonal decomposition separates a time series into trend, seasonal, and residual components. Common approaches include classical decomposition based on moving averages and STL (Seasonal and Trend decomposition using Loess).

Code example for classical seasonal decomposition using the statsmodels library (the seasonal_decompose function; statsmodels also provides an STL class):

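import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Input time series data (placeholder: supply your own observations,
# covering at least two full seasonal cycles)
data = np.array([...])
seasonal_period = 12  # number of observations per seasonal cycle
result = sm.tsa.seasonal_decompose(data, period=seasonal_period)
# Each component has the same length as the input and can be used as a feature
trend = result.trend
seasonal = result.seasonal
residual = result.resid
result.plot()
plt.show()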
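To illustrate what a classical additive decomposition does and how its components can become feature columns, here is a simplified pandas sketch. It is not the statsmodels implementation: the function name additive_decompose is hypothetical, it only handles odd seasonal periods (so that a plain centered moving average can serve as the trend), and it assumes an additive model with a synthetic series built for the demonstration.

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period):
    """Simplified classical additive decomposition (odd periods only)."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    # Trend: centered moving average over one full seasonal cycle.
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position in the cycle,
    # re-centered so the seasonal effects sum to zero.
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)
    # Residual: whatever trend and seasonality do not explain.
    resid = series - trend - seasonal
    return pd.DataFrame({'trend': trend, 'seasonal': seasonal, 'resid': resid})

# Synthetic series for illustration: linear trend plus a weekly pattern.
pattern = np.array([3.0, -1.0, -2.0, 0.0, 1.0, -3.0, 2.0])
t = np.arange(28)
series = pd.Series(10 + 0.5 * t + pattern[t % 7],
                   index=pd.date_range('2023-01-01', periods=28, freq='D'))
parts = additive_decompose(series, period=7)
```

The trend and residual are NaN at both ends of the series because the centered window is incomplete there. Note also that a centered window uses future observations, so for forecasting features the decomposition should be fitted on the training period only.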

An additional code example for seasonal decomposition using the statsforecast library:

from statsforecast import StatsForecast
from statsforecast.models import MSTL, AutoARIMA
import matplotlib.pyplot as plt

models = [
    MSTL(
        season_length=[12 * 7],  # seasonalities of the time series
        trend_forecaster=AutoARIMA(),  # model used to forecast trend
    )
]

sf = StatsForecast(
    models=models,  # model used to fit each time series
    freq="D",  # frequency of the data
)
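# `data` is expected to be a long-format DataFrame with the columns
# unique_id, ds (timestamps), and y (values), not a plain NumPy array.
sf = sf.fit(data)
# Fitted decomposition of the first series
test = sf.fitted_[0, 0].model_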
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
test.plot(ax=ax, subplots=True, grid=True)
plt.tight_layout()
plt.show()

Conclusion

Time series data is indispensable for various industries, and feature engineering plays a pivotal role in developing accurate and reliable predictive models.

In this article, we have discussed essential techniques for feature engineering with time series data, such as extracting date and time features, incorporating domain-specific features, generating lag features, using rolling and expanding window functions, implementing exponential smoothing, and performing seasonal decomposition.

By mastering these approaches, you will be better equipped to harness the full potential of your time series data and enhance your model's ability to make accurate predictions.

Remember that every time series dataset is unique, and it's crucial to tailor your feature engineering strategies to your specific problem domain.

Experiment with different techniques, combinations, and parameters to uncover hidden patterns and relationships in your data and create models that are better suited to address the challenges of your particular domain.

Done consistently, this kind of experimentation leads to more robust and accurate predictive models.