Time series data, which consists of data points arranged in chronological order, is vital in diverse sectors such as finance, healthcare, and meteorology.
The art of feature engineering, where new features are derived from existing data, is a crucial aspect of developing precise and dependable predictive models.
In this article, we will discuss some essential tips for feature engineering with time series data to help you harness the full potential of your data and improve the performance of your machine learning models.
We will delve into date and time features, domain-specific features, lag features, rolling and expanding window functions, exponential smoothing, and seasonal decomposition.
By mastering these techniques, you will be better equipped to uncover hidden patterns, trends, and relationships within your time series data and enhance your model's ability to make accurate predictions.
Time series data often comes with timestamps that provide valuable information about the data.
Extracting date and time features such as day, month, year, day of the week, and hour can help improve the model's performance by capturing patterns and trends related to these temporal components.
Seasonality, holidays, and weekends are examples of recurring calendar patterns that can be identified using date and time features.
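As a minimal sketch of the idea, assume an hourly series stored in a pandas DataFrame with a DatetimeIndex. The column name `value` and the timestamps are made up for illustration; the calendar components are read directly off the index:

```python
import pandas as pd

# Hypothetical hourly series that crosses from Friday night into Saturday.
# The values are placeholders, not real data.
index = pd.date_range("2023-01-06 22:00", periods=6, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"value": range(6)}, index=index)

df["hour"] = df.index.hour
df["day_of_week"] = df.index.dayofweek      # Monday=0 ... Sunday=6
df["month"] = df.index.month
df["is_weekend"] = df.index.dayofweek >= 5  # Saturday or Sunday

print(df)
```

The weekend flag flips at midnight, which is the kind of calendar boundary a model cannot infer from the raw values alone.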
Incorporating domain-specific features can significantly enhance the performance of a machine-learning model. Domain-specific features are derived from expert knowledge in the relevant field and can provide valuable information that is not present in the raw time series data.
For example, in finance, the volatility index or other economic indicators can be used as additional features to improve the prediction of stock prices.
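One way this could look in practice is sketched here, assuming daily closing prices and a separately sourced volatility indicator. All numbers and the column names `close` and `vol_index` are placeholders invented for the example. The indicator is joined by date, forward-filled on days it was not reported, and lagged by one day so the feature only uses information available before the prediction:

```python
import pandas as pd

dates = pd.date_range("2023-01-02", periods=5, freq="D")
prices = pd.DataFrame({"close": [100.0, 101.5, 99.8, 102.3, 103.1]}, index=dates)

# Hypothetical external indicator, reported on only some of the days.
vol_index = pd.Series([20.1, 22.4, 19.7], index=dates[[0, 2, 3]], name="vol_index")

# Align the indicator with the price dates and carry the last known value forward.
features = prices.join(vol_index)
features["vol_index"] = features["vol_index"].ffill()

# Use the previous day's indicator so the feature is known at prediction time.
features["vol_index_lag1"] = features["vol_index"].shift(1)

print(features)
```

Whether a one-day lag is appropriate depends on when the indicator is actually published relative to the prediction time.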
Returning to date and time features, here is an example of my own extraction function that was widely used across quant hedge fund practices:
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def extract_date_features(data):
    """Add calendar features to a DataFrame indexed by a DatetimeIndex."""
    data['Year'] = data.index.year
    data['Month'] = data.index.month
    data['Day'] = data.index.day
    data['Weekday'] = data.index.weekday  # Monday=0 ... Sunday=6
    data['Day_of_year'] = data.index.dayofyear
    data['Week_of_year'] = data.index.isocalendar().week
    data['Quarter'] = data.index.quarter
    data['Is_month_start'] = data.index.is_month_start
    data['Is_month_end'] = data.index.is_month_end
    data['Is_quarter_start'] = data.index.is_quarter_start
    data['Is_quarter_end'] = data.index.is_quarter_end
    data['Is_year_start'] = data.index.is_year_start
    data['Is_year_end'] = data.index.is_year_end
    data['Days_in_month'] = data.index.days_in_month
    data['Is_leap_year'] = data.index.is_leap_year
    data['Elapsed_days'] = (data.index - data.index.min()).days
    data['Weekday_name'] = data.index.day_name()
    data['Month_name'] = data.index.month_name()
    data['Is_weekend'] = data['Weekday'] >= 5
    data['Is_weekday'] = ~data['Is_weekend']
    data['Days_till_month_end'] = data['Days_in_month'] - data['Day']
    data['Days_since_month_start'] = data['Day'] - 1
    data['Week_of_month'] = (data['Day'] - 1) // 7 + 1
    data['Weekday_of_month'] = (data['Day'] - 1) % 7 + 1

    # Holiday distances in days (0 on the holiday itself).
    # Assumption: US federal holidays; swap in the calendar that fits your market.
    days = data.index.normalize()
    holidays = USFederalHolidayCalendar().holidays(
        start=days.min() - pd.DateOffset(years=1),
        end=days.max() + pd.DateOffset(years=1),
    )
    next_pos = holidays.searchsorted(days)                    # first holiday on or after each date
    prev_pos = holidays.searchsorted(days, side='right') - 1  # last holiday on or before each date
    data['Days_to_next_holiday'] = (holidays[next_pos] - days).days
    data['Days_since_last_holiday'] = (days - holidays[prev_pos]).days

    # Business-day counts (Monday to Friday; holidays are not excluded).
    day64 = data.index.values.astype('datetime64[D]')
    month_start = day64.astype('datetime64[M]').astype('datetime64[D]')
    next_month_start = (day64.astype('datetime64[M]') + 1).astype('datetime64[D]')
    data['Business_days_in_month'] = np.busday_count(month_start, next_month_start)
    data['Business_day_of_month'] = np.busday_count(month_start, day64)  # business days before this date

    data['Days_since_first_day_of_year'] = data['Day_of_year'] - 1
    data['Days_remaining_in_year'] = (data.index + pd.offsets.YearEnd(0)).dayofyear - data['Day_of_year']
    return data

# `data` is assumed to be a DataFrame with a DatetimeIndex
data = extract_date_features(data)
print(data.head())
Lag features refer to the values of previous time steps in the time series. They can help capture the autocorrelation present in the data, which is the relationship between the current value and its past values.
Adding lag features can improve a model's performance by allowing it to learn patterns from the past to predict future values. You can create lag features by shifting the original time series data by a specific number of periods, often called the "lag order."
import pandas as pd

def create_lag_features(data, n_lags):
    data_frame = pd.DataFrame(data, columns=['value'])
    for i in range(1, n_lags + 1):
        # shift(i) moves the series down i rows, so each row sees the value from i steps earlier
        data_frame[f'lag_{i}'] = data_frame['value'].shift(i)
    return data_frame

time_series = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
n_lags = 3
lag_features = create_lag_features(time_series, n_lags)
print(lag_features)

# Result:
#     value  lag_1  lag_2  lag_3
# 0     100    NaN    NaN    NaN
# 1     110  100.0    NaN    NaN
# 2     120  110.0  100.0    NaN
# 3     130  120.0  110.0  100.0
# 4     140  130.0  120.0  110.0
# 5     150  140.0  130.0  120.0
# 6     160  150.0  140.0  130.0
# 7     170  160.0  150.0  140.0
# 8     180  170.0  160.0  150.0
# 9     190  180.0  170.0  160.0
# 10    200  190.0  180.0  170.0
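The first rows of a lagged frame have no complete history, so they usually need to be dropped before training. The following is a small sketch of that step, together with a quick autocorrelation check, using the same toy values. The variable names are illustrative, and the split into `X` and `y` is one common convention rather than a requirement:

```python
import pandas as pd

series = pd.Series([100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200], name="value")

# Autocorrelation at lag 1: correlation between the series and itself shifted by one step.
lag1_autocorr = series.autocorr(lag=1)

# Build the lag columns, then drop rows whose history is incomplete.
frame = series.to_frame()
for i in range(1, 4):
    frame[f"lag_{i}"] = frame["value"].shift(i)
frame = frame.dropna()

# Past values become the inputs, the current value becomes the target.
X = frame[["lag_1", "lag_2", "lag_3"]]
y = frame["value"]

print(f"lag-1 autocorrelation: {lag1_autocorr:.3f}")
print(X.head())
```

Because this toy series is perfectly linear, its lag-1 autocorrelation is 1; real data will typically show weaker values that decay with the lag.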
Rolling window and expanding window statistics involve computing summary statistics such as mean, median, standard deviation, and maximum or minimum values over either a moving window of fixed size (rolling) or a window that grows to include every observation so far (expanding).
These features help capture the local trends, fluctuations, and overall behavior in the time series data, allowing the model to learn from the temporal dynamics. Rolling window statistics can be particularly useful when dealing with noisy data or when the time series exhibits non-stationarity.
Expanding windows, on the other hand, can provide insights into the cumulative effects or long-term trends in the data. Both rolling and expanding window statistics can be easily implemented using the rolling and expanding functions available in the pandas library.
Code example:
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative DataFrame: placeholder dates, values reused from the lag example
time_series = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=11, freq='D'),
    'Data': [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200],
})

# Apply a rolling window
window_size = 3
time_series['Rolling_Mean'] = time_series['Data'].rolling(window=window_size).mean()
time_series['Rolling_Std'] = time_series['Data'].rolling(window=window_size).std()

# Apply an expanding window
time_series['Expanding_Mean'] = time_series['Data'].expanding().mean()
time_series['Expanding_Std'] = time_series['Data'].expanding().std()

# Visualize the results
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(time_series['Date'], time_series['Data'], label='Original Data', marker='o')
ax.plot(time_series['Date'], time_series['Rolling_Mean'], label='Rolling Mean', marker='o')
ax.plot(time_series['Date'], time_series['Expanding_Mean'], label='Expanding Mean', marker='o')
ax.legend(loc='best')
ax.set_title('Rolling and Expanding Window Example')
plt.show()
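One detail worth illustrating: a rolling or expanding statistic computed at row t includes the value at row t itself. When the feature is meant to predict that same value, shifting the series by one step first keeps the window strictly in the past. The sketch below uses toy numbers, and the variable names are illustrative:

```python
import pandas as pd

data = pd.Series([100, 110, 120, 130, 140, 150], name="Data")

# rolling(3) at row t averages rows t-2..t, so it includes the current value.
rolling_mean = data.rolling(window=3).mean()

# Shifting first makes the window cover rows t-3..t-1 only.
past_only_mean = data.shift(1).rolling(window=3).mean()

# min_periods=1 lets the window produce values before it is full.
partial_mean = data.rolling(window=3, min_periods=1).mean()

# Expanding mean over strictly past values.
past_expanding_mean = data.shift(1).expanding().mean()

print(pd.DataFrame({
    "Data": data,
    "rolling_mean": rolling_mean,
    "past_only_mean": past_only_mean,
    "partial_mean": partial_mean,
    "past_expanding_mean": past_expanding_mean,
}))
```

Whether the current value may be included depends on what is known at prediction time, so the shift is a modelling choice rather than a rule.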
Exponential smoothing is a technique that assigns exponentially decreasing weights to past observations, placing more emphasis on recent observations. This approach can be useful for reducing noise and capturing trends in the time series data.
There are various forms of exponential smoothing such as simple exponential smoothing, double exponential smoothing (Holt's method), and triple exponential smoothing (Holt-Winters method), which can be employed based on the characteristics of the time series data.
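Simple exponential smoothing follows the recurrence ES(t) = α · X_t + (1 − α) · ES(t − 1), starting from ES(0) = X_0, where ES(t) is the exponentially smoothed value at time t, X_t is the data point at time t, and α is the smoothing factor (0 < α < 1).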
Simple Python example:
def exponential_smoothing(data, alpha):
    # Start from the first observation, then blend each new point with the previous smoothed value
    es = [data[0]]
    for t in range(1, len(data)):
        es.append(alpha * data[t] + (1 - alpha) * es[t - 1])
    return es
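For comparison, pandas offers the same recursion through `Series.ewm`. The sketch below repeats the manual function so it runs on its own and checks it against pandas on toy values; `adjust=False` is the setting that selects the recursive form used here:

```python
import pandas as pd

def exponential_smoothing(data, alpha):
    es = [data[0]]
    for t in range(1, len(data)):
        es.append(alpha * data[t] + (1 - alpha) * es[t - 1])
    return es

values = [100, 110, 120, 130, 140]  # toy data
alpha = 0.5

manual = exponential_smoothing(values, alpha)

# adjust=False gives ES(t) = alpha * X_t + (1 - alpha) * ES(t - 1), seeded with the first value.
with_pandas = pd.Series(values).ewm(alpha=alpha, adjust=False).mean()

print(manual)                # [100, 105.0, 112.5, 121.25, 130.625]
print(with_pandas.tolist())
```

A larger α tracks recent observations more closely, while a smaller α produces a smoother and slower-moving feature.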
Holt-Winters is a forecasting method that extends the simple Exponential Smoothing method, accounting for seasonality and trend. The method consists of three components: level (smoothing of the series), trend (smoothing of the trend), and seasonality (seasonal component).
The method comes in two flavors: additive and multiplicative.
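For reference, the additive form is commonly written as the set of update equations below. The notation is the usual textbook one and is an assumption here, not taken from this article: ℓ_t is the level, b_t the trend, s_t the seasonal component, m the season length, h the forecast horizon, and α, β, γ are smoothing factors between 0 and 1.

```latex
\begin{aligned}
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m} \\
\hat{y}_{t+h} &= \ell_t + h\,b_t + s_{t+h-m(k+1)}, \quad k = \lfloor (h-1)/m \rfloor
\end{aligned}
```

In the multiplicative form, the seasonal component scales the series instead of being added to it: the subtractions of the seasonal term become divisions, and the forecast multiplies (ℓ_t + h b_t) by the seasonal factor.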
Holt-Winters code example:
from statsforecast.models import HoltWinters
from statsforecast.utils import AirPassengers as ap
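# error_type='A' selects the additive form; 'M' selects the multiplicative one
model = HoltWinters(season_length=12, error_type='A')
model = model.fit(y=ap)  # ap: the monthly AirPassengers series bundled with statsforecast
y_hat_dict = model.predict(h=4)  # forecast the next 4 periods
y_hat_dict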
Seasonal decomposition is a technique to separate a time series into its trend, seasonal, and residual components. A widely used method is STL (Seasonal and Trend decomposition using Loess); the simpler classical approach, based on moving averages, is what the statsmodels seasonal_decompose function implements.
Code example for seasonal decomposition using the traditional statsmodels library:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Input time series data (elided here; it must cover at least two full seasonal cycles)
data = np.array([...])
seasonal_period = 12

# Classical decomposition (additive by default)
result = sm.tsa.seasonal_decompose(data, period=seasonal_period)
trend = result.trend
seasonal = result.seasonal
residual = result.resid

result.plot()
plt.show()
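To make the mechanics less opaque, here is a rough sketch of the steps a classical additive decomposition follows, written with pandas and NumPy only. It is a simplified illustration under the assumption of an additive model and a regularly spaced series, not the statsmodels implementation. The function name and the synthetic data are made up for the example:

```python
import numpy as np
import pandas as pd

def classical_additive_decompose(series, period):
    """Split a series into trend + seasonal + residual (additive model)."""
    # Trend: centred moving average spanning one full seasonal cycle.
    if period % 2 == 0:
        # An even window has no middle point, so average two adjacent windows.
        trend = series.rolling(period).mean().rolling(2).mean().shift(-(period // 2))
    else:
        trend = series.rolling(period, center=True).mean()

    # Seasonal: mean detrended value at each position in the cycle, centred at zero.
    detrended = series - trend
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)

    # Residual: whatever trend and seasonality do not explain.
    residual = series - trend - seasonal
    return trend, seasonal, residual

# Synthetic series: linear trend plus a repeating pattern that sums to zero.
period = 4
pattern = np.array([3.0, -1.0, -2.0, 0.0])
t = np.arange(24)
series = pd.Series(10 + 0.5 * t + np.tile(pattern, 6))

trend, seasonal, residual = classical_additive_decompose(series, period)
print(seasonal.iloc[:period].round(6).tolist())
print(residual.dropna().abs().max())
```

On this noise-free series the recovered seasonal pattern matches the one that was injected and the residual is numerically zero; the trend (and therefore the residual) is undefined for the first and last half-cycle, where the centred window is incomplete.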
An additional code example for seasonal decomposition using the statsforecast library:
from statsforecast import StatsForecast
from statsforecast.models import MSTL, AutoARIMA
import matplotlib.pyplot as plt

models = [
    MSTL(
        season_length=[12 * 7],  # seasonalities of the time series
        trend_forecaster=AutoARIMA(),  # model used to forecast trend
    )
]
sf = StatsForecast(
    models=models,  # model used to fit each time series
    freq="D",  # frequency of the data
)
# Here `data` is a DataFrame in statsforecast's long format (columns unique_id, ds, y)
sf = sf.fit(data)

# Decomposition (trend, seasonal, remainder) of the first fitted series
test = sf.fitted_[0, 0].model_
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
test.plot(ax=ax, subplots=True, grid=True)
plt.tight_layout()
plt.show()
Time series data is indispensable for various industries, and feature engineering plays a pivotal role in developing accurate and reliable predictive models.
In this article, we have discussed essential techniques for feature engineering with time series data, such as extracting date and time features, incorporating domain-specific features, generating lag features, using rolling and expanding window functions, implementing exponential smoothing, and performing seasonal decomposition.
By mastering these approaches, you will be better equipped to harness the full potential of your time series data and enhance your model's ability to make accurate predictions.
Remember that every time series dataset is unique, and it's crucial to tailor your feature engineering strategies to your specific problem domain.
Experiment with different techniques, combinations, and parameters to uncover hidden patterns and relationships in your data and to build more robust models that are better suited to the challenges of your particular domain.