During the final semester of my grad school analytics program, I was tasked with estimating price elasticity for wine. At first glance, this seemed like a simple objective. In economics, I was taught that price elasticity is the percent change in demand given the percent change in price (i.e. how price-sensitive consumers are). All you have to do is look at the quantity purchased at each price to determine price elasticity.
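In formula terms, price elasticity of demand = (% change in quantity demanded) / (% change in price). For example, if a 10% price increase leads to a 5% drop in quantity purchased, the elasticity is -0.5, and demand is considered inelastic (less than 1 in absolute value).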
Why would you even need machine learning to solve this type of problem?
To understand how flawed this simplistic methodology is, let’s consider the following question:
Is price the only variable that influences whether you purchase a product or service?
If you answered “no”, then you can imagine the potential gap between economic theory and reality. To estimate price elasticity, you likely need to consider many variables in addition to the price of the product or service. Otherwise, your elasticity model will likely suffer from poor predictive performance.
Because I had never attempted to estimate elasticity outside of simple financial analysis, this problem required some research. It led to one of the most stimulating conversations I have had in a while as I described the problem to a colleague with a strong physics background. After defining price elasticity for him, we attempted to approach the problem from first principles using derivatives (i.e. measuring change) from calculus.
This discussion, in combination with a post from a “Quantitative Analysis” course at UW, pushed me towards using multi-linear regression with log transformations for my problem. The interpretation of log transformed variables (and their coefficients) reflects the percent change between variables, which was the desired outcome.
The algorithm necessary to estimate price elasticity for wine, as well as each desired dimension, became the following:
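In rough terms, the model takes a log-log form, where the coefficient on log price can be read directly as the elasticity estimate:

log(quantity sold) = β0 + β_price · log(price) + β1·x1 + β2·x2 + …

Here x1, x2, … stand in for the non-price features; a 1% change in price is then associated with an approximate β_price% change in quantity demanded.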
After establishing the high-level plan, I proceeded to prep the data. This is typically the most time-consuming part of any model development process. The goal was to identify and perform the necessary steps to preprocess and transform the data prior to any analysis. While the data was readily available, the database documentation left much to be desired. The relationships between tables were not obvious, nor were the primary keys. Furthermore, the data definitions left more questions than answers. To illustrate some of the ambiguity, let’s imagine the following mock product table. What do you think “Sold Price” means in the table below?
Without clear documentation, I was left pondering the following questions:
Is sold price the price per unit, product, or case?
What is a standard unit across all beverages?
Does “sold price” include taxes?
Estimating elasticity with so many unknowns would not only be inaccurate but also unethical. To fill in this gap of knowledge, I needed to ask dozens of questions to our stakeholders. Some examples include:
Why are some prices zero?
When there are columns that appear equivalent, which one should be the most reliable?
Why does so much information appear to be missing in column “X”?
Why isn’t this “ID” column all unique values?
Which records can be filtered out prior to the elasticity estimation?
The process of acquiring this domain- and database-specific knowledge is an often overlooked skill among data professionals. One misstep or miscalculated assumption during data prep can lead to the wrong conclusions downstream.
PS: This is why many data science teams are becoming more embedded within business functions rather than separate departments like IT.
During this prep stage, I was exploring the data as well. In order to build a model that is explainable, you need to understand the distribution of the data, identify patterns, and find features that could explain elasticity. Visualizations are a data professional’s best friend during this step.
Using histograms, I was able to understand consumer purchase behavior across different U.S. states, years, accounts (i.e. wineries), varietals (e.g. merlot), price bands (e.g. $10-15 per bottle), sales channels (e.g. online), etc. The slice of data I was provided was imbalanced across several dimensions. A large percentage of the consumer purchases were in California, and most of the purchases came from a single account. Furthermore, the number of accounts was inconsistent across years since a few were added as the years advanced.
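This kind of exploration only takes a few lines of pandas and matplotlib. Below is a rough sketch; the column names are assumptions rather than the actual schema.

import matplotlib.pyplot as plt

# Distribution of purchases by state (assumed column name)
sales_dataframe_old['State'].value_counts().plot(kind='bar', title='Purchases by state')
plt.show()

# Distribution of price per bottle (assumed column name)
sales_dataframe_old['Price_per_bottle'].hist(bins=30)
plt.title('Distribution of price per bottle')
plt.show()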
Seeing a time series of purchases over the years suggested that there might be some seasonality in consumer purchase behavior. This observation heavily influenced my feature engineering decisions in the next step of the machine learning development cycle.
Perhaps the most interesting observations were the increased percentage of purchases via online channels and seeing the dramatic fall and rise of sales during the COVID-affected years. These significant changes in purchasing behavior shaped some of our modeling decisions later on, such as excluding data prior to 2016.
The line between data prep and feature engineering can be a bit blurry. After pre-processing the data from the data tables, I worked on transforming the relevant records into “model readable” formats. For scikit-learn, a popular machine learning library in Python, this requires transforming all features into numeric values. Features such as the location of purchase (e.g. California) had to be transformed into “dummy columns” with binary values (i.e. 1 for yes, 0 for no). This dummy variable creation process was applied to all categorical columns of interest and was easily handled by the “get_dummies” method from the pandas library. See the code example below.
import pandas as pd

# Categorical columns to expand into binary dummy columns
dummy_variable_columns = ['Day_of_week', 'Month', 'Status']
sales_dataframe_new = pd.get_dummies(sales_dataframe_old, columns=dummy_variable_columns)
One of the most common, and often frustrating, feature engineering steps was the modification of the date dimensions. I extracted various time slices (e.g. month, week, day, year) from the order date to enable testing for seasonality in consumer behavior across different intervals of time. The time series of sales looked so much like a sine graph that I had flashbacks to trigonometry class in high school*. This inspired me to research and create sine and cosine transformations of time to be tested as features in the model.
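A sketch of what those cyclical transformations might look like is below; the column names and the annual period are assumptions on my part.

import numpy as np

# Extract time slices from the order date
sales_dataframe_new['Order_Month'] = sales_dataframe_new['Order_Date'].dt.month
sales_dataframe_new['Order_Day_of_year'] = sales_dataframe_new['Order_Date'].dt.dayofyear

# Encode the yearly cycle with sine/cosine so the end of one year and the
# start of the next sit close together in feature space
sales_dataframe_new['sin_annual'] = np.sin(2 * np.pi * sales_dataframe_new['Order_Day_of_year'] / 365.25)
sales_dataframe_new['cos_annual'] = np.cos(2 * np.pi * sales_dataframe_new['Order_Day_of_year'] / 365.25)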
With sales appearing as a sine curve, I decided to test aggregations of the dataframe (i.e. feature vector and sales) across different cycle lengths of time. This somewhat complex task was easily accomplished thanks to another method in pandas, called “resample”. This method transformed the dataframe so that each column would be aggregated across any slice of time I desired (e.g. every 45 days). See the below example of code.
# Aggregate every numeric column over 45-day cycles anchored on the order date
sales_dataframe_new_45daycycle = sales_dataframe_new.resample('45D', on='Order_Date', origin='start', label='right', closed='right').sum()
The result of the code snippet above is that I could now see the total sales for each record across 45-day cycles rather than orders on specific dates. Instead of 5 different orders of the same red wine, I would see the total sales value, total volume, total order count, total order count in each month, etc. for each 45-day cycle. Converting all columns to numeric format was a required prerequisite to this reaggregation.
Now that the dataframe was in this new structure, I calculated the average price per volume for each record and performed a log transformation on this value and on the total quantity sold. Because we need to estimate future demand, I also had to create a column holding the future period’s log-transformed quantity sold. This was facilitated by the “shift” method in pandas. See the code example below.
sales_dataframe_new_45daycycle['log_quantity_sold_future'] = sales_dataframe_new_45daycycle['log_quantity_sold_current'].shift(periods=-1)
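For reference, the log-transformed columns referenced above could be created along these lines; the price-per-volume calculation and the column names are assumptions.

# Average price per unit of volume in each 45-day cycle (assumed column names)
sales_dataframe_new_45daycycle['avg_price_per_volume'] = sales_dataframe_new_45daycycle['total_sales_value'] / sales_dataframe_new_45daycycle['total_volume']

# Log transforms of the price and the current period's quantity sold
sales_dataframe_new_45daycycle['log_price'] = np.log(sales_dataframe_new_45daycycle['avg_price_per_volume'])
sales_dataframe_new_45daycycle['log_quantity_sold_current'] = np.log(sales_dataframe_new_45daycycle['total_quantity_sold'])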
With the shift applied, each record in our dataframe had the current period’s price alongside the following period’s sales. This setup is required to train the regression model so that the current price, a known feature, is used to explain future quantity demanded, an unknown target, since you cannot know future sales at prediction time.
*Any readers remember SOHCAHTOA, the fun acronym to help remember the different trig functions?
While communicating the model training stage to a non-technical audience, it is easy to present the illusion of a complex task. In reality, training the model is often one of the easier steps in the dev cycle thanks to the availability of mature machine learning libraries. The code below mirrors what I did to create a training and test set, standardize the data, and fit a linear regression model to the scaled feature vector (i.e. log transformed price, time dimensions, etc.) and the target vector (i.e. log transformed future demand).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

# Set up the train and test sets; the target is the future log quantity sold
train_x, test_x, train_y, test_y = train_test_split(
    sales_dataframe_new_45daycycle.loc[:, sales_dataframe_new_45daycycle.columns != 'log_quantity_sold_future'].to_numpy(),
    sales_dataframe_new_45daycycle['log_quantity_sold_future'].to_numpy(),
    test_size=0.20,
    random_state=20)

# Standardize the feature vectors using statistics learned from the training set only
scaler = preprocessing.StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)

# Build and fit the linear regression model
lr = LinearRegression()
lr.fit(train_x_scaled, train_y)
The code snippet above omits the feature selection/regularization process. The feature selection process was primarily driven by excluding features that were poorly correlated with future demand.
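One simple way to perform that screening is sketched below; the correlation threshold is an assumption for illustration.

# Keep only features whose absolute correlation with the target clears a threshold
correlations = sales_dataframe_new_45daycycle.corr()['log_quantity_sold_future'].abs()
selected_features = correlations[correlations >= 0.1].index.drop('log_quantity_sold_future')
sales_dataframe_selected = sales_dataframe_new_45daycycle[selected_features.tolist() + ['log_quantity_sold_future']]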
While R-squared and root mean squared error (RMSE) are the most common evaluation metrics taught in school, I have often found mean absolute error (MAE) and mean absolute percentage error (MAPE) to be more explainable evaluation metrics. The latter metrics provide a more intuitive measure of the distance between actual and predicted values.
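As a sketch, these metrics can be computed on the held-out test set roughly as follows (mean_absolute_percentage_error requires a reasonably recent version of scikit-learn):

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score

# Evaluate predictions on the held-out test set
predictions = lr.predict(test_x_scaled)
mae = mean_absolute_error(test_y, predictions)
mape = mean_absolute_percentage_error(test_y, predictions)
rmse = np.sqrt(mean_squared_error(test_y, predictions))
r_squared = r2_score(test_y, predictions)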
For one of the multi-linear regression models, here were the model evaluation metrics:
Using the MAPE, I was able to communicate to stakeholders that our model’s predictions were off by an average of 1.4% compared to the actual sales values. Additionally, the feature vector explained 75% of the variation in future demand. This model was successful in terms of explainability and was extremely accurate in terms of predictions.
Overall, I concluded that wine was a mostly inelastic good; however, consumer behavior varied depending on each subset of wine (i.e. different sales channels, varietals, etc.). Assuming that consumers are inelastic in all situations and will continue to be inelastic in the future would be a mistake.
Additionally, there were subsets of data that had too small a sample size to have confidence in their respective elasticity estimates. To improve this analysis, I would add substitute product prices (e.g. beer) into the feature vector, as well as research alternative ways to estimate elasticity. It would be interesting to repeat this process with a more elastic good (e.g. fast food) to see what the estimates reveal.
Note: All text, data, and code were modified to obscure any sensitive information.
If you enjoyed reading my blog, **please vote for me** for the "HackerNoon Contributor of the Year" in their INVESTING category by searching for “Steven Finkelstein” at the following link: https://www.noonies.tech/2022/internet-heroes/2022-hackernoon-contributor-of-the-year-investing.
Image Sources:
Wine bottle photo by Rebecca Matthews on Unsplash