Beyond Recall and Precision: Rethinking Metrics in Fraud Prevention

by Aleksei Toshchakov, November 11th, 2023

Too Long; Didn't Read

Understanding fraud prevention demands a balance between classic metrics and business outcomes. Opt for a comprehensive set of metrics beyond Recall and Precision to capture real-world impacts on your product.

In the digital age, safeguarding business operations from fraud is more complex than meets the eye. It's not just about identifying deceit; it's about protecting brand integrity and customer trust. This demands a shift in focus from traditional metrics like Recall and Precision to a more nuanced approach.


This article explores the advanced metrics that help enterprises safeguard their operational integrity and brand reputation.

Binary Classification

We often solve the binary classification problem when we need to mark objects or events (users, clicks, reviews, ratings, transactions, etc.) as legit or fraudulent. But how do we measure its accuracy? If you ask a random data scientist the two most important metrics for a binary classification task, they will name Recall and Precision.


There is the classic confusion matrix (with 1 for fraud and 0 for legitimate) and everyone's favorite metrics:

Precision measures the percentage of identified positives that are actually positive. Recall, on the other hand, measures the share of actual positives that were correctly identified. The F1-Score is the harmonic mean of Precision and Recall and is often used as a single measure of test accuracy.
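
As a quick illustration, these metrics can be computed directly from the four cells of the matrix; the counts below are made up:

# Hypothetical confusion-matrix counts (1 = fraud, 0 = legitimate).
TP = 80    # fraud correctly flagged
FP = 20    # legitimate events flagged as fraud
FN = 40    # fraud that slipped through
TN = 860   # legitimate events correctly passed

precision = TP / (TP + FP)                            # 0.80
recall = TP / (TP + FN)                               # ~0.67
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean, ~0.73

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")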


Sure, these are the most convenient metrics to compare two models on the same dataset, but from a business point of view, I would single out other metrics that need monitoring as part of the anti-fraud task.

Class Imbalance

The problem is class imbalance, which occurs when one data category significantly outweighs another in a dataset. For example, in a dataset of a million credit card transactions, if only 100 are fraudulent, the data is heavily imbalanced.


It can lead models to overlook rare but crucial categories like fraud.


As a fraud prevention expert, I have dealt with class imbalance across different categories of fraud. Here are some examples:

  • Similar shares. This can happen with e-commerce parsing, when the number of requests from parsers is comparable to the number of requests from real people, which gives roughly equal class balance.


  • Overwhelming fraud activity. These are DDoS attacks, where legitimate users may account for 0.00001% of traffic or less during the attack. One of the largest attacks I've encountered peaked at 21 million requests per second from the attacker.


    Another example of this imbalance is CAPTCHA services: with a good algorithm for deciding who should be shown a CAPTCHA, it will go to 99.9% of robots and to very few real people.


  • Minor fraud instances. This covers the transactional anti-fraud already mentioned, or abuse of promo codes when reasonable product restrictions are in place.


It is important to note that fraud can be unstable: there may be little fraud for weeks, and then an unexpected attack shifts the class balance.


There can also be systemic shifts in fraud. Say our fraud share is 10% now, while a year ago it was 5%. If we previously considered 95% Precision a good value, we may now need to raise the bar.

Precision and Recall Problem

Let's explore a real-world example from my experience dealing with DDoS attacks. Imagine an attack where we recorded 1,000,000 bot requests on top of the usual human traffic of 1,000 requests, so bot traffic was 1,000 times greater than legitimate traffic.


Now, imagine having two defense strategies to pick from:


The first boasts a Precision of 99.9%. By implementing this algorithm, we blocked all incoming traffic. As a result, our precision stands at 1,000,000 divided by 1,001,000, which is approximately 99.9%. On paper, this looks impressive. However, in practical terms, this algorithm is flawed as it blocks genuine human users as well.


In the second approach, we blocked 99.9% of malicious traffic without impacting genuine human traffic. However, this still means that the remaining 0.1% of malicious traffic, equating to 1,000 requests, reached our service. That is as much as our entire usual traffic (1,000 requests), so the load on the service effectively doubled and its functionality was still at risk.

Product Focus

From the business point of view, it is better to look at the True Negative Rate (what share of real users we leave unaffected) and the Negative Predictive Value, or NPV (what share of the traffic we let through is actually good). The former is a good alternative to Precision, while the latter is a viable substitute for Recall.


To remind you, the True Negative Rate gauges the accuracy of identifying legitimate transactions as non-fraudulent. It's the percentage of correctly classified legitimate transactions out of all the legitimate transactions.


A high True Negative Rate indicates that your system can accurately discern genuine transactions from potential threats, contributing to a more efficient and reliable fraud prevention mechanism.


The Negative Predictive Value measures how clean the traffic we let through is. Essentially, it's the percentage of correctly classified legitimate transactions out of all transactions predicted as legitimate.


A high Negative Predictive Value indicates that when your system predicts a transaction is legitimate, it is likely correct, enhancing the efficacy of your fraud detection mechanism.
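
To make this concrete, here is a minimal sketch assuming the DDoS scenario above (1,000,000 bot requests and 1,000 human requests); the confusion_metrics helper is my own shorthand, not a library function:

def confusion_metrics(TN, FP, TP, FN):
    # Guard against empty denominators (e.g., when nothing is let through).
    div = lambda a, b: a / b if b else None
    return {
        "precision": div(TP, TP + FP),
        "recall": div(TP, TP + FN),
        "TNR": div(TN, TN + FP),
        "NPV": div(TN, TN + FN),
    }

bots, humans = 1_000_000, 1_000

# Strategy 1: block everything. Precision looks great (~0.999), but TNR is 0
# (every real user is blocked) and NPV is undefined (nothing gets through).
print(confusion_metrics(TN=0, FP=humans, TP=bots, FN=0))

# Strategy 2: block 99.9% of bots and no humans. Recall is 0.999 and TNR is 1.0,
# yet NPV is only 0.5: half of the traffic reaching the service is still bots.
print(confusion_metrics(TN=humans, FP=0, TP=bots * 0.999, FN=bots * 0.001))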

How Does It Work

For simplicity, we will take two algorithms.


The first one is random. With an equal balance of classes, our error matrix will look like this:



We will find half of the fraud and classify half of the legitimate events as fraud.


The second one is almost perfect. With an equal balance of classes, our error matrix will look like this:


Among all fraud, we will find 98% of objects. And we will also mark 98% of all legitimate events correctly.


Now, let's calculate the error matrices for these algorithms across different class balances. This can be implemented as follows:


def random_algorithm(count_0, count_1):
    # A coin-flip classifier: half of each class lands on either side.
    TN = count_0 * 0.5
    FP = count_0 * 0.5
    TP = count_1 * 0.5
    FN = count_1 * 0.5
    return TN, FP, TP, FN

def perfect_algorithm(count_0, count_1):
    # An almost perfect classifier: 98% of each class is labeled correctly.
    TN = count_0 * 0.98
    FP = count_0 * 0.02
    TP = count_1 * 0.98
    FN = count_1 * 0.02
    return TN, FP, TP, FN


We will also calculate the main metrics that are interesting to us:

recall = TP / (TP + FN)
precision = TP / (TP + FP)
TNR = TN / (TN + FP)
NPV = TN / (TN + FN)



Essentially, all of these metrics can be read as “the more, the better.”


Now, let's compare Recall and Negative Predictive Value, which show how well we detect fraud, and Precision and True Negative Rate, which show how accurately we treat legitimate traffic.


We will take different proportions of the classes and assume that the costs of the two error types are approximately equal. If they are not, we can add them as weights to the class balance; the graphs will then be stretched at the beginning and end of the x-axis, but the general idea remains the same.
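
Here is a minimal sketch of that sweep, reusing the two functions defined above and printing a few points of the curves instead of plotting them:

total = 100_000

for algo in (random_algorithm, perfect_algorithm):
    print(algo.__name__)
    for fraud_share in (0.01, 0.10, 0.50, 0.90, 0.99):
        count_1 = total * fraud_share      # fraud events
        count_0 = total - count_1          # legitimate events
        TN, FP, TP, FN = algo(count_0, count_1)
        print(f"  fraud share {fraud_share:.2f}: "
              f"recall={TP / (TP + FN):.2f}, precision={TP / (TP + FP):.2f}, "
              f"TNR={TN / (TN + FP):.2f}, NPV={TN / (TN + FN):.2f}")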



For a random algorithm, Recall is constant and equal to 0.5, while NPV falls: it gets worse as the share of fraud in the overall flow increases. In other words, a lot of fraud remains among the events that survive the cleanup.


As for the metrics on "good" traffic, TNR remains constant, since the algorithm passes the same share of good traffic regardless of the balance, while Precision approaches 1. In reality, we still impact half of the legitimate traffic.


Let's look at an algorithm that works much better than a random one.


From the fraud point of view, Recall remains the same, but NPV drops: despite the very good quality of the algorithm, with a large amount of fraud in the stream we still let a lot of fraud through. We banned 99.9% of the traffic from the DDoS attack, but the remaining 0.1% was still so large that the service suffered, and the 99.9% ban did not save us.


From the legitimate traffic point of view, TNR remains unchanged, but precision increases. If it is essential for a business to understand what share of users “suffers” from our anti-fraud algorithm, then the TNR metric will more accurately show this problem.


Here's a practical example: an online service with reviews and ratings, where we ask users to leave reviews that are then used to form an object's rating. On this service, we see attempts to inflate ratings using fake reviews.


Let's imagine we have a model of almost perfect quality: it catches 98% of all fraud, and only 2% of genuine reviews are excluded from forming the rating and the list of reviews.


In the following graph, good reviews grow slightly and show a weekly pattern. Fake reviews have no clear trend and usually stay low, but there is a roughly 30-day period with a big spike.



Let's calculate metrics for our algorithm.


Here, Recall is stable: we consistently find 98% of all fraud. But NPV shows that the share of genuine reviews among those we keep is shrinking. Previously, the published set consisted of 0.99 legitimate and 0.01 fake reviews, and that 0.01 could not significantly affect the overall picture or the user experience.


When the balance shifted to 0.94 and 0.06, we had a much greater risk that inflated reviews could mislead our users. Recall does not show this picture.

And here, TNR is stable: we continue to discard 2% of genuine reviews. This 2% does not significantly reduce our review coverage or our ability to form a rating. Precision increases as fraud traffic grows, even though nothing about the product has improved.
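
Here is a minimal numeric sketch of this effect using the 98%/2% model; the daily review volumes are invented, chosen only so that the published mix matches the 0.99/0.01 and 0.94/0.06 shares described above:

def review_day(genuine, fake):
    # Near-perfect model: catches 98% of fake reviews, drops 2% of genuine ones.
    TP, FN = fake * 0.98, fake * 0.02
    TN, FP = genuine * 0.98, genuine * 0.02
    recall = TP / (TP + FN)
    NPV = TN / (TN + FN)   # share of genuine reviews among those we publish
    return recall, NPV

# Hypothetical volumes: a normal day vs. a day inside the fake-review spike.
for label, genuine, fake in (("normal day", 1_000, 500), ("spike day", 1_000, 3_000)):
    recall, NPV = review_day(genuine, fake)
    print(f"{label}: recall={recall:.2f}, NPV={NPV:.2f}")

# normal day: recall=0.98, NPV=0.99
# spike day:  recall=0.98, NPV=0.94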



Recall and Precision are excellent metrics for fraud prevention. Still, since the class balance can change sharply or smoothly throughout the year, I recommend using more product metrics based on the error matrix.


So, we've discussed binary classification metrics, but there are other approaches you can use to measure your success. Let's look at them briefly.


Impact on the product. For example, you can allow some fake reviews but rank them very low, so that almost no one scrolls down to them. If such reviews make up 0.1% of the entire flow and only a tiny share of users scrolls that deep, then only a vanishing fraction of users, on the order of 0.0001%, will ever see them. Or you can use the model's scores as weights when forming the rating: estimates you are less confident about will have less impact.
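
As a sketch of that weighting idea, a confidence-weighted average rating could look like the snippet below; the (stars, legitimacy score) input format is invented for illustration:

def weighted_rating(reviews):
    # reviews: iterable of (stars, legit_score) pairs, where legit_score is the
    # model's confidence (0..1) that the review is genuine. Hypothetical format.
    total_weight = sum(score for _, score in reviews)
    if total_weight == 0:
        return None
    return sum(stars * score for stars, score in reviews) / total_weight

# Two suspicious 5-star reviews barely move the rating when their scores are low.
reviews = [(4, 0.95), (5, 0.90), (3, 0.90), (5, 0.10), (5, 0.10)]
print(round(weighted_rating(reviews), 2))   # ~4.07 instead of a plain mean of 4.4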


Cost of fraudulent services. Often, there is an open market for rating-cheating services where fraudsters can buy fake reviews. You can take the price of these services as a metric: your task is then to push it above the economic benefit of inflating a rating in your system.

Conclusion

Navigating fraud prevention requires a nuanced understanding of classic metrics and those directly tied to business outcomes. In a perfect scenario, various metrics would be used, considering absolute values, diverse data sources, and fundamental quality indicators.


But often, we are forced to choose a minimal set of metrics, for example, for our commitments and project goals.


I recommend not stopping at Recall and Precision alone. It's crucial to also consider metrics that reflect real-world consequences, and to look at other options when choosing the right metric to measure impact on the product.