Bridging the stats gap with acceptance criteria for data science
A tale of two recommender systems
Here are two familiar products:
- TripAdvisor, a website that combines travel-related content with accommodation booking. Here I will focus on booking a hotel room.
- Stitch Fix, a subscription service that sends members several clothing items. Members pay only for what they keep.
Both of these products rely on recommender systems, trying to match each user with the items he or she is most likely to buy. If you are a machine learning engineer, your mind may be going towards stuff like collaborative filtering and matrix factorization.
But before rushing to build a fancy recommender system, let’s think about these two products again from the business perspective. It turns out that they are very different.
When you are looking for a hotel on TripAdvisor, you usually have a pretty good sense of what you are looking for (“a 4-star hotel in downtown Boston”). So you are choosing between fairly similar options. And if all goes well, you will book exactly one hotel.
With Stitch Fix, the company is choosing the items for you. And it wants you to keep as many items as possible. This means that they can’t be too similar. After all, how many new plaid flannel shirts do you need this month?
Also, Stitch Fix is sending you exactly five items. If the item that was ranked sixth was your favorite, tough luck. You will never see it and will never buy it.
Once you start thinking along these lines, the machine learning tasks start to look pretty different. You would train and evaluate models differently. You would track different metrics and weigh them differently.
The stats gap
But how would the data scientists on your team get to think about these considerations in the first place? This is the role of the product manager: to be the voice of the business, the users, and other stakeholders.
However, many product managers are not fully aware of the implications for machine learning, which can be very technical:
- Clothing items shouldn’t be too similar → you should model the joint probability of purchase, not just the individual probabilities for each item.
- Members never see the sixth item → precision (how many of the top 5 items a member buys) is more important than recall (how many of the items the member would potentially buy ended up in the top 5).
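To make the second point concrete, here is a minimal sketch of precision and recall at a cutoff of five items. The function names, item names, and purchase data are all made up for illustration; they are not Stitch Fix's actual metrics or data.

```python
def precision_at_k(ranked_items, relevant_items, k=5):
    """Fraction of the top-k shipped items that the member would buy."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


def recall_at_k(ranked_items, relevant_items, k=5):
    """Fraction of all items the member would buy that made it into the top k."""
    if not relevant_items:
        return 0.0
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / len(relevant_items)


# Hypothetical ranking produced by a model, best item first.
ranked = ["shirt", "jeans", "scarf", "boots", "jacket", "dress", "belt"]
# Hypothetical set of items this member would actually buy.
would_buy = {"shirt", "boots", "dress", "belt"}

p5 = precision_at_k(ranked, would_buy, k=5)  # 2 of the 5 shipped items are kept
r5 = recall_at_k(ranked, would_buy, k=5)     # 2 of the 4 wanted items were shipped
print(f"precision@5 = {p5:.2f}, recall@5 = {r5:.2f}")
```

The dress and the belt were ranked sixth and seventh, so they hurt recall but the member never sees them. What the business feels directly is how many of the five shipped items are kept.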
This communication challenge can create a gap between the product definitions and modeling metrics. It can lead to suboptimal or outright incorrect models. Ultimately, it all boils down to talking about business goals in statistical terms. Let’s call it the stats gap.
How to deal with it? All data scientists are told that they should understand the business they are operating in. That is very true, so hiring data scientists who show curiosity about the real-world implications of their work is a very good idea.
But to bridge the stats gap, product managers also need to have more technical understanding. They don’t need to be stats whizzes or get 100% of what models are doing under the hood. But some concepts are important to understand: What is the relation between business metrics and statistical metrics? What are the inputs into a machine learning model? How do they interact with each other?
Acceptance criteria: the data science edition
Beyond hiring product managers who understand stats, what can your team do today to improve the situation?
The stats gap is about communication between product and data science. Communication gaps are also common in software engineering, and people have come up with various processes to reduce them. Specifically, agile software development teams use the concept of acceptance criteria.
Acceptance criteria are an agreement between product managers and software engineers on what exactly a software feature should include. Defining acceptance criteria is a joint effort that can take significant time, but it’s worth the investment. Good acceptance criteria ensure that both sides understand the task in the same way.
Acceptance criteria for data science tasks and machine learning models need to do the same thing. They must translate business requirements to clear modeling metrics.
This process is critical to ensure that your models are satisfying product needs and to achieve product/data fit.
Here is what it looks like:
- The product manager defines business metrics and business goals in clear and concise terms.
- The data scientist translates these metrics into rigorous statistical terms and defines the modeling metrics and modeling goals.
- To complete the cycle and confirm the requirements, the team translates the modeling metrics back to business metrics.
- This conversation has to happen for every modeling iteration, just like it does with any improvement to a software feature.
If you have some experience with agile development, you may notice that this process is incomplete. Our acceptance criteria don’t have a definition of done. This is a very important topic that I will address in a separate post. Right now our goal is to make sure that the why and the what are aligned.
From business metrics to modeling metrics
Let’s see this in action with another example. If you get a parking ticket in New York City you may submit an appeal online. The appeal is examined by a city employee, who decides if the ticket is valid.
Imagine that the city contracts your team to build a machine learning system to automate the processing of routine appeals using historical data. When a driver submits an appeal, the system can decide to accept it or make no decision and let the city agent handle it. For legal reasons it’s not possible to reject an appeal without confirmation from an agent.
The product manager is responsible for defining the business metrics. Metrics have to be quantifiable: dollars, clicks, conversion rates, etc. In our case:
- If the system approves an invalid appeal, the city loses revenue from the parking ticket. The metric is the number of invalid appeals approved. We can quantify this exactly in dollar terms: in NYC, each ticket is worth $65.
- If the system passes a valid appeal to the agent, the city incurs the cost of the agent’s additional wages. So the metric is the number of valid appeals examined by an agent. To quantify it in dollars, you would have to figure out the cost of handling an appeal, which would be a good thing to do.
Now that we have the business metrics, data scientists can translate them to statistical terms:
- The percentage of invalid appeals approved by the system is called the false positive rate.
- The percentage of valid appeals handled by an agent is called the false negative rate.
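As a rough illustration, both rates can be computed from a log of past appeals. This is only a sketch: the function name `error_rates` and the ten appeals below are invented, and "positive" is assumed to mean "the system approves the appeal".

```python
def error_rates(is_valid, is_approved):
    """Return (false positive rate, false negative rate).

    "Positive" means the system approves the appeal, so a false positive
    is an invalid appeal that was approved, and a false negative is a
    valid appeal that was passed on to an agent.
    """
    pairs = list(zip(is_valid, is_approved))
    fp = sum(1 for valid, approved in pairs if not valid and approved)
    tn = sum(1 for valid, approved in pairs if not valid and not approved)
    fn = sum(1 for valid, approved in pairs if valid and not approved)
    tp = sum(1 for valid, approved in pairs if valid and approved)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Made-up history: 4 valid appeals and 6 invalid ones.
is_valid    = [True, True, True, False, True, False, False, False, False, False]
is_approved = [True, True, True, True, False, False, False, False, False, False]

fpr, fnr = error_rates(is_valid, is_approved)
print(f"false positive rate = {fpr:.2f}, false negative rate = {fnr:.2f}")
```

Here one of the six invalid appeals was approved and one of the four valid appeals went to an agent.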
This is a very clear translation, because the model is a simple classifier: each appeal is either accepted or not. There are many standard metrics related to classifiers. As in this example, most of them have to do with what kind of wrong calls the model makes.
In more complex cases like a recommender system, defining business metrics and translating them to data science terms may not be so clean and simple. But then it’s even more important to nail down the right metrics, even if it takes a significant amount of time.
Business goals drive modeling goals
So we have two statistical metrics: the false positive and false negative rates. But that is not enough. There is a trade-off between these two metrics:
- You can build a very conservative model that only approves a small number of appeals that it’s very confident about. The false positive rate would be low, but the false negative rate would be high.
- You can build a very permissive model that approves many appeals, even if it also approves some invalid appeals. The false negative rate would be low, but the false positive rate would be high.
Should you build a conservative model or a permissive model? That is not a data science question. It is a business question.
So the product manager has to define the business goals. Then, business goals are translated to modeling goals. Here are some examples:
Business goal #1: Maximize profit
Suppose the city wants to maximize profit from parking tickets. Then the model should balance the false positive and false negative rates according to their dollar value: the lost revenue for a false approval against the cost of handling an appeal.
If the hourly wage of an agent is low, the model should be more conservative and approve fewer cases. But if the hourly rate is high, the model should be more permissive and approve more cases.
The modeling metric to optimize will end up being a weighted average of the false positive and false negative rates, with the weights reflecting the dollar cost of each type of error.
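Here is a minimal sketch of that idea, assuming the model outputs a confidence score for each appeal and the system approves everything above a threshold. The $65 ticket value comes from the example above; the scores, labels, and agent handling costs are made up, and only the two error types discussed in this post are costed.

```python
TICKET_VALUE = 65.0  # dollars lost per falsely approved appeal (figure from the post)


def total_cost(scores, is_valid, threshold, handling_cost):
    """Dollar cost of approving every appeal whose score is >= threshold."""
    cost = 0.0
    for score, valid in zip(scores, is_valid):
        approved = score >= threshold
        if approved and not valid:
            cost += TICKET_VALUE   # false positive: lost ticket revenue
        elif not approved and valid:
            cost += handling_cost  # false negative: agent time on a valid appeal
    return cost


def best_threshold(scores, is_valid, handling_cost):
    """Pick the candidate threshold with the lowest total cost."""
    # float("inf") represents "approve nothing automatically".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, is_valid, t, handling_cost))


# Made-up model scores (confidence that the appeal is valid) and true labels.
scores   = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
is_valid = [True, True, False, True, True, False, True, False]

# Hypothetical handling costs per appeal, in dollars.
cheap_agents = best_threshold(scores, is_valid, handling_cost=5.0)
costly_agents = best_threshold(scores, is_valid, handling_cost=80.0)
print(f"cheap agents: approve at >= {cheap_agents}")
print(f"costly agents: approve at >= {costly_agents}")
```

With cheap agents the lowest-cost threshold is high, so the system approves only the appeals it is most sure about. With costly agents the threshold drops and the system approves more cases, accepting some lost ticket revenue.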
Business goal #2: Maximize profit, but without too many false approvals
But maybe if too many appeals are falsely approved, drivers will submit frivolous appeals or park illegally more often. Ideally, this concern would be addressed in a data-driven manner and quantified. If it’s valid, the modeling goals should reflect that.
The modeling metric to optimize is still a weighted average of the false positive and false negative rates. But in addition, there will be a cap on the false positive rate.
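The cap can be sketched as a constraint on the same threshold search. As before, this is an illustration under assumptions: the scores, labels, the $80 handling cost, and the 40% cap are all invented, and only the $65 ticket value comes from the example above.

```python
TICKET_VALUE = 65.0  # dollars lost per falsely approved appeal (figure from the post)


def evaluate(scores, is_valid, threshold, handling_cost):
    """Return (total dollar cost, false positive rate) at a given threshold."""
    fp = sum(1 for s, v in zip(scores, is_valid) if s >= threshold and not v)
    fn = sum(1 for s, v in zip(scores, is_valid) if s < threshold and v)
    n_invalid = sum(1 for v in is_valid if not v)
    cost = fp * TICKET_VALUE + fn * handling_cost
    fpr = fp / n_invalid if n_invalid else 0.0
    return cost, fpr


def best_threshold_with_cap(scores, is_valid, handling_cost, max_fpr):
    """Lowest-cost threshold among those whose false positive rate is <= max_fpr."""
    # float("inf") approves nothing, so at least one candidate always meets the cap.
    candidates = sorted(set(scores)) + [float("inf")]
    allowed = [t for t in candidates
               if evaluate(scores, is_valid, t, handling_cost)[1] <= max_fpr]
    return min(allowed,
               key=lambda t: evaluate(scores, is_valid, t, handling_cost)[0])


# Made-up model scores (confidence that the appeal is valid) and true labels.
scores   = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
is_valid = [True, True, False, True, True, False, True, False]

uncapped = best_threshold_with_cap(scores, is_valid, handling_cost=80.0, max_fpr=1.0)
capped = best_threshold_with_cap(scores, is_valid, handling_cost=80.0, max_fpr=0.4)
print(f"uncapped threshold = {uncapped}, capped threshold = {capped}")
```

On this toy data the cap pushes the threshold up: the system gives up a little profit in exchange for approving fewer invalid appeals.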
Business goal #3: Reduce turnaround time when caseload is high
What if the agents are salaried and unionized, so they can’t be laid off? The city is not going to save any money by reducing their workload. So it doesn’t make sense for the system to approve too many cases while the agents are just sitting around. But the city may still want to use the system to reduce the turnaround time when the number of appeals is very high.
In this situation, we know how many cases we want the agents to handle. The false negative rate is not a particularly important metric. Instead, the model should minimize false approvals with the volume of cases handled by agents as a given input.
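One simple way to sketch this is to rank appeals by model score, approve the highest-scoring ones automatically, and leave the agents exactly as many cases as they can handle. The function name `route_appeals`, the scores, and the capacity of six are hypothetical. Ranking by score only minimizes false approvals to the extent that higher scores really do mean more likely valid.

```python
def route_appeals(scores, agent_capacity):
    """Split appeals into auto-approved and agent-handled.

    Agents get the lowest-scoring appeals, up to their capacity.
    Everything above that is approved automatically.
    Returns (approved_indices, agent_indices).
    """
    # Indices of appeals, highest model score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_auto = max(len(scores) - agent_capacity, 0)
    return order[:n_auto], order[n_auto:]


# Made-up model scores (confidence that the appeal is valid) and true labels.
scores   = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
is_valid = [True, True, False, True, True, False, True, False]

# Hypothetical: agents can handle 6 of today's 8 appeals.
approved, to_agents = route_appeals(scores, agent_capacity=6)
false_approvals = sum(1 for i in approved if not is_valid[i])
print(f"auto-approved: {approved}, false approvals: {false_approvals}")
```

When the caseload is within capacity, nothing is approved automatically and the agents handle every appeal.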
These three examples have somewhat different modeling goals, so each case will result in a different optimal model. Also, these modeling goals are variations on standard metrics, but none of them is exactly a textbook use case. If a data scientist just built a model that optimizes one of the standard metrics such as AUC or F1 score, the city would likely be leaving money on the table. In other words: business goals must drive modeling goals.
To derive business value from machine learning models, business goals have to be framed in statistical terms. That takes considerable expertise and often creates a communication gap, the stats gap.
To eliminate the stats gap, adapt the concept of acceptance criteria to data science tasks. The product manager defines business metrics and business goals. Then, data scientists translate them to modeling metrics and modeling goals. This is an iterative process that should be repeated for every modeling iteration.
The most important thing to keep in mind is that machine learning models are not built in a vacuum. They must satisfy product needs. Business goals must drive modeling goals.
PS: How New York actually solves the parking ticket problem
In reality, if you appeal a parking ticket in New York City, in most cases you are automatically offered a discount of about 30% if you just pay the ticket. If you choose to contest it, you may end up paying more in court.
So the city found a pretty good business solution without using any machine learning. Instead, it is using a model so simple that it’s trivial to implement.
Importantly, this easy solution requires shifting the focus from deciding every individual appeal on the merits. Does this seem obvious? Many machine learning teams naturally concentrate on their modeling tasks. It can be hard to shift the attention to explore a different product approach. I will write more about this in a future post.