How Three ML Models Transform Product Analytics

Written by alekseiterentev | Published 2025/10/10
Tech Story Tags: machine-learning | product-analytics | data-science | user-behavior | predictive-modeling | causal-inference | prescriptive-analytics | personalization

TL;DR: Traditional analytics explains what users did. Machine learning goes further — predicting, measuring, and optimizing user behavior through predictive, causal, and prescriptive models.


In product research, a recurring challenge is not merely to describe user behavior, but to influence it — to understand who is likely to churn, who is ready for an upsell, and who should be offered a discount or a new feature. Machine learning models enable us to formalize behavioral patterns in data and to forecast how a specific user will act in the future — or how their behavior might change in response to our interventions.

In this article, I outline three types of models commonly used to address these challenges. The first helps predict a user’s target action and identify those most susceptible to churn, upsell, or activation. The second measures the causal impact of our actions, allowing us to distinguish true intervention effects from organic user behavior. The third selects the optimal intervention for each individual, transforming analytics into a tool for personalized behavioral management.

As you read on, both the reasoning and the models will grow in sophistication — evolving from direct predictions and probabilistic assessments to causal analysis and, ultimately, to selecting the optimal intervention. This evolution reflects the natural progression of machine learning in product analytics — from forecasting events to managing behavior and making data-driven decisions.

Let’s begin.


Model 1: Predicting the Target Action

In product analytics, one of the most in-demand tasks is predicting what action a specific user will take in the future. Will they make a purchase? Renew their subscription? Return to the service — or churn?

To answer these questions, machine learning models are used to calculate the probability of a target event occurring. The target event can be almost anything — from a purchase to product abandonment. The value of these models lies in enabling audience segmentation: identifying who is at risk of churn, who is ready for an upsell, and who can be effectively reactivated.

Often, the introduction of machine learning in product analytics begins with such models. They are relatively simple to implement, easy for the business to understand, and deliver quick, measurable impact without requiring complex architectures or lengthy experimentation.

Examples

There are many examples of such models — here are some of the most common ones:

Churn prediction. The model estimates the probability that a user will stop using the product in the near future. This helps launch retention campaigns targeting those predicted to churn.

Upsell and cross-sell. The model identifies users who are ready to upgrade to a more expensive plan or purchase an additional product. These scenarios directly increase LTV and ARPU, enabling growth not only through new users but also through the existing customer base.

Activation probability. A classic task for new users: predicting whether a person will reach their first target action (e.g., place an order or make a purchase). This allows for targeted support of users who “get stuck” on their path to activation.

From an ML perspective, all these tasks differ only in the target event being predicted and the set of features used; the underlying logic of model construction remains the same. Therefore, it’s convenient to consider them together as a single group.

Data

The main advantage of such tasks is that no additional experiments are required — historical user activity logs are enough. We take a sample, label whether the user performed the target action, and build features that could have influenced this outcome.

Obviously, the specific features depend heavily on the task, but overall, feature engineering usually relies on a few fundamental approaches:

1. User characteristics. Everything that describes the user both in the real world and within the product: age, gender, region, device, as well as subscription status, plan type, account status, etc.

2. Aggregates over time windows. We calculate user activity for various time periods (e.g., 7, 30, 90 days): number of events, total amounts, averages. This helps capture both short-term engagement and long-term behavioral patterns.

3. Interval features. Time since the last or first action, average gap between events, and growth/decline rates when comparing different windows.

4. Calendar factors. Day of the week, month, and seasonality. Many behaviors are cyclical, and such features help account for recurring patterns.

5. Derived features. In addition to simple aggregates, it’s often useful to calculate more “intelligent” metrics: ratios between windows, share of active days within a period, normalization of metrics per user, ratio of total purchases to their average, and so on.
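
To make this concrete, here is a minimal feature-engineering sketch in Python (pandas), assuming a hypothetical event log with user_id, event_time, and amount columns; the file path and snapshot date are placeholders as well.

```python
import pandas as pd

# Minimal feature-engineering sketch. The event log, column names (user_id,
# event_time, amount), file path, and snapshot date are placeholders.
events = pd.read_parquet("events.parquet")
snapshot = pd.Timestamp("2025-09-01")                 # features are computed "as of" this date
events = events[events["event_time"] < snapshot]

# Interval feature: time since the last action
features = events.groupby("user_id")["event_time"].max().rename("last_event").to_frame()
features["days_since_last_event"] = (snapshot - features["last_event"]).dt.days
features = features.drop(columns=["last_event"])

# Aggregates over several time windows (7 / 30 / 90 days)
for window in (7, 30, 90):
    recent = events[events["event_time"] >= snapshot - pd.Timedelta(days=window)]
    agg = recent.groupby("user_id").agg(
        **{f"events_{window}d": ("event_time", "count"),
           f"amount_{window}d": ("amount", "sum")})
    features = features.join(agg, how="left")

features = features.fillna(0)

# Derived feature: short-term vs. long-term activity ratio
features["events_7d_vs_90d"] = features["events_7d"] / (features["events_90d"] + 1)
```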

Model

Once the data is prepared, the task becomes a classic binary classification problem with probabilistic output: given a set of features, the model must estimate the likelihood of the target event occurring.

There are many models that can handle this type of problem. In practice, gradient boosting (XGBoost, LightGBM, CatBoost) tends to be the optimal choice. Other approaches are used less frequently: logistic regression (when interpretability is a priority), random forest (slower to train but effective), and neural networks (rarely, and typically in more complex setups).

To train and evaluate the model properly, the dataset is usually split into several parts: train — for fitting the model; validation — for hyperparameter tuning and overfitting control; test — for final performance evaluation.

In such tasks, the split is often done chronologically: the model should learn from the past and be tested on the future. This approach may not be crucial for the train/validation split, but it becomes essential when defining the test set.

Deployment and Maintenance

One of the key advantages of these models is that their effectiveness can often be evaluated without running dedicated A/B tests. It’s usually enough to monitor whether the model remains representative — that is, whether it continues to correctly identify users with the desired behavioral patterns and solve the original task.

However, two important factors must be considered when using such models:

First, these models are typically used not for their predictions alone, but for forming user segments that the business then targets — through promotions, recommendations, or new features. Such interventions inevitably change user behavior, meaning the model can no longer observe their natural (organic) patterns.

Second, even without external interventions, a model’s quality tends to degrade over time: users with the same characteristics begin to behave differently under the influence of hidden factors. This phenomenon is known as drift — a shift in data distribution or concept.

To control these effects, it’s common to include a control group of users when deploying the model to production — users who do not participate in promotional activities. This group makes it possible to measure the uplift caused by interventions, to evaluate the model’s baseline prediction quality, and to retain unbiased data for updating the model in the future.

There are two main approaches to maintaining such models:

Regular retraining on fresh data — reliable but resource-intensive;

Prediction calibration, where the model structure remains the same but predicted probabilities are adjusted based on new data — faster and cheaper.
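
As one possible way to implement the second option, the sketch below freezes the trained model and fits an isotonic mapping from its raw scores to outcomes observed on a fresh labelled sample; it reuses model and feature_cols from the training sketch above, and the file paths are placeholders.

```python
import pandas as pd
from sklearn.isotonic import IsotonicRegression

# Minimal calibration sketch: the trained model stays frozen, and only its output
# probabilities are adjusted using a fresh labelled sample (paths are placeholders).
fresh = pd.read_parquet("fresh_labelled_sample.parquet")       # recent data with known outcomes
new_users = pd.read_parquet("current_scoring_batch.parquet")   # users to be scored today

raw_scores = model.predict_proba(fresh[feature_cols])[:, 1]    # model, feature_cols from the sketch above
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, fresh["target"])                    # learn the score -> observed rate mapping

calibrated = calibrator.predict(model.predict_proba(new_users[feature_cols])[:, 1])
```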


Model 2: Uplift Models

In product analytics, the goal is often not just to identify who will perform a target action, but who will do so as a result of our intervention — whether it’s an email, push notification, discount, or new feature. After all, if a user would have made the purchase anyway, the discount is wasted. But if they wouldn’t have purchased without it, then we’ve truly influenced their behavior.

To address this, uplift models are used. These models don’t predict the probability of an event itself, but rather the difference in that probability with and without the intervention. In essence, the model estimates how “susceptible” a given user is to the treatment, helping identify those who are most likely to respond positively.

In other words, uplift models help create relevant user segments, filtering out the “organic” audience — those who would have acted the same way regardless of any intervention. This approach is extremely valuable, as it helps optimize resources (budget) and avoid bothering users who don’t need to be targeted.

Examples

Uplift models are especially valuable when a business needs to understand the true incremental impact of its interventions. Here are a few common use cases:

User reactivation. The model identifies churned users who are genuinely likely to change their behavior in response to an offer — helping avoid spending the budget on those who would have returned anyway or wouldn’t respond regardless.

Promotional campaigns in e-commerce. Discounts and cashback offers are often distributed broadly, but uplift models help distinguish “sensitive” users from “organic” buyers. As a result, only the segment where the offer is economically justified receives it.

Data

To build an uplift model, data is collected through an A/B experiment with randomization: users are randomly split into treatment and control groups, and the intervention is assigned strictly at random. Whenever possible, it’s useful to combine data from several experiments with the same mechanics — this increases the diversity of the training dataset and improves model robustness.

Typically, such experiments are run on a relevant audience, selected either by business criteria (customer type, activity level, region) or based on the results of a previous predictive model (e.g., users with a high probability of churn).

It’s crucial that the experiment reflects the same conditions under which the uplift model will later be applied — the same type of intervention, duration, and metric logic.

After the experiment concludes, for each user the following are recorded: whether they performed the target action, whether they belonged to the treatment or control group, and all relevant features describing their behavior prior to the intervention.

The uplift model is then trained on this dataset.


Model

An uplift model aims to estimate the difference in the probability of performing the target action with and without the intervention:

uplift(x) ≈ P(Y=1 | X=x, T=1) − P(Y=1 | X=x, T=0),

where T represents the presence of the intervention, Y is the target action, and X denotes the user’s characteristics (for a binary outcome).

There are several approaches to building such models.


T-learner

Perhaps the most popular approach — it builds two independent models:

a₁(x) = P(Y=1 | X=x, T=1) — predicts the probability of the target action for users who received the intervention.

a₂(x) = P(Y=1 | X=x, T=0) — predicts the probability of the target action for users who did not receive the intervention.

The uplift is then estimated as: uplift(x) = a₁(x) − a₂(x)

This method is simple, flexible, and performs well in many cases. Its main drawback is that each model is trained independently and therefore “doesn’t see” the data from the other group.
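
A minimal T-learner sketch, assuming an experiment dataset with placeholder columns Y (target action) and T (treatment flag); gradient boosting is used here only as one possible base learner:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Minimal T-learner sketch. The dataset and column names are placeholders:
# Y is the target action, T the treatment flag (1/0), feature_cols the user features.
df = pd.read_parquet("uplift_experiment.parquet")
feature_cols = [c for c in df.columns if c not in ("user_id", "Y", "T")]

treated, control = df[df["T"] == 1], df[df["T"] == 0]
a1 = LGBMClassifier().fit(treated[feature_cols], treated["Y"])   # P(Y=1 | x, T=1)
a2 = LGBMClassifier().fit(control[feature_cols], control["Y"])   # P(Y=1 | x, T=0)

uplift = (a1.predict_proba(df[feature_cols])[:, 1]
          - a2.predict_proba(df[feature_cols])[:, 1])
```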


S-learner

This approach builds a single model to predict the target action, where the presence of the intervention (T) is included as one of the features: a(x) = P(Y=1 | X=x, T=t)

The uplift is estimated as: uplift(x) = a(x, T=1) − a(x, T=0)

However, this method may produce artifacts when the effect of the treatment (T) is weak — the model may simply ignore the treatment feature, collapsing the predicted uplift toward zero — or when T is highly correlated with other features.
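
For comparison, here is a minimal S-learner sketch under the same placeholder dataset assumptions as the T-learner example:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Minimal S-learner sketch, same placeholder dataset as the T-learner example.
df = pd.read_parquet("uplift_experiment.parquet")
feature_cols = [c for c in df.columns if c not in ("user_id", "Y", "T")]

s_model = LGBMClassifier().fit(df[feature_cols + ["T"]], df["Y"])

# Score every user twice: once as if treated, once as if not
uplift = (s_model.predict_proba(df[feature_cols].assign(T=1))[:, 1]
          - s_model.predict_proba(df[feature_cols].assign(T=0))[:, 1])
```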


X-learner

An enhancement of the T- and S-learner methods — a hybrid approach that works in several stages:

1. Train T-learner models:

a₁(x) — predicts the probability of the target action for users who received the intervention.

a₂(x) — predicts the probability of the target action for users who did not receive the intervention.


2. Compute pseudo-effects

For each user, we calculate the difference between their actual response and the prediction from the model trained on the opposite group:

• For users who received the treatment (T = 1): D₁ = Y − a₂(x)

• For users who did not receive the treatment (T = 0): D₂ = a₁(x) − Y

Here, D₁ and D₂ represent local estimates of the gain or loss in the probability of the target action caused by the intervention: a positive value means the treatment increased the likelihood of the target event; a negative value means it decreased the likelihood; a zero value means the treatment had no effect.


3. Train pseudo-effect models

We then train two regression models on the computed pseudo-effects:

m₁(x) — trained on data where T = 1 with target D₁,

m₂(x) — trained on data where T = 0 with target D₂.

In essence, these models learn under which conditions the intervention helps and under which it doesn’t.


4. Combine effects and estimate uplift

The final uplift estimate is obtained as a weighted combination of the two models:

uplift(x) = w(x)·m₁(x) + (1 − w(x))·m₂(x)

where w(x) is a weight depending on the probability of belonging to the treatment or control group.

If the treatment and control groups were randomly assigned, then w(x) is simply a constant (for equal-sized groups, w(x) = 0.5).

However, if the assignment was non-random, it’s better to train a separate model to predict the probability of being in the treatment group (T) — with 1 for treatment and 0 for control.
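
Putting the four stages together, here is a minimal X-learner sketch under the same placeholder assumptions as the previous examples, with a randomized 50/50 experiment so that the weight is a constant:

```python
import pandas as pd
from lightgbm import LGBMClassifier, LGBMRegressor

# Minimal X-learner sketch, same placeholder dataset as in the previous examples,
# assuming a randomized 50/50 experiment so that w(x) is a constant 0.5.
df = pd.read_parquet("uplift_experiment.parquet")
feature_cols = [c for c in df.columns if c not in ("user_id", "Y", "T")]
treated, control = df[df["T"] == 1], df[df["T"] == 0]

# Stage 1: T-learner models
a1 = LGBMClassifier().fit(treated[feature_cols], treated["Y"])
a2 = LGBMClassifier().fit(control[feature_cols], control["Y"])

# Stage 2: pseudo-effects
d1 = treated["Y"] - a2.predict_proba(treated[feature_cols])[:, 1]   # D1 = Y − a2(x) for T=1
d2 = a1.predict_proba(control[feature_cols])[:, 1] - control["Y"]   # D2 = a1(x) − Y for T=0

# Stage 3: regressions on the pseudo-effects
m1 = LGBMRegressor().fit(treated[feature_cols], d1)
m2 = LGBMRegressor().fit(control[feature_cols], d2)

# Stage 4: weighted combination (a propensity model would replace the constant
# weight if the assignment were non-random)
w = 0.5
uplift = w * m1.predict(df[feature_cols]) + (1 - w) * m2.predict(df[feature_cols])
```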



Deployment and Maintenance

To evaluate the quality of an uplift model, a new experiment is required. Users are divided into several groups:

Random sample without model involvement — serves as a baseline control. In this group, some users receive the intervention while others do not.

Group selected based on the model’s predictions (for example, the top 20% of users by predicted uplift among the remaining population). Here too, some users receive the intervention and others don’t — this allows for measuring the actual uplift predicted by the model.

This setup enables the calculation of uplift metrics and helps verify whether the model truly identifies “susceptible” users.
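
A rough sketch of this check, assuming an evaluation dataset with placeholder columns for the predicted uplift score, the treatment flag T, and the outcome Y: compare the observed treated-vs-control difference inside the model-selected segment with the same difference in a random sample.

```python
import pandas as pd

# Rough sketch of the evaluation above. Placeholder columns: "score" is the model's
# predicted uplift, T the treatment flag, Y the observed target action.
eval_df = pd.read_parquet("uplift_eval_experiment.parquet")

def observed_uplift(segment):
    # difference in conversion between treated and untreated users in the segment
    return segment[segment["T"] == 1]["Y"].mean() - segment[segment["T"] == 0]["Y"].mean()

top = eval_df[eval_df["score"] >= eval_df["score"].quantile(0.8)]   # top 20% by predicted uplift
random_sample = eval_df.sample(frac=0.2, random_state=42)           # baseline without the model

print("Uplift in the model-selected segment:", observed_uplift(top))
print("Uplift in a random sample:", observed_uplift(random_sample))
```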

Maintaining uplift models is more challenging than standard predictive models.

First, experiments must be repeated regularly to ensure the uplift effect persists and that the model continues to correctly identify responsive users.

Second, any change in the mechanics of the intervention effectively creates a new causal relationship — meaning the model needs to be retrained from scratch.

In addition, uplift models are prone to drift effects: over time, users start responding differently to the same stimuli. Therefore, in production, it’s essential to keep a random control sample that is not influenced by the model. This control group helps track changes in treatment effects and provides a reference for retraining and recalibrating the model.


Model 3: Optimal Treatment Models

In the first two model types, we focused on predicting what the user would do or how they would respond to an intervention. Now, the task is different — to determine which intervention to choose.

This applies to situations where multiple offers or communication scenarios are available, and the goal is to show each user the best possible option (such as a promo code, discount, push notification, banner, or new feature).

Naturally, different users respond differently — there is no single “best” offer for everyone.

Such models are designed to select the optimal offer for each individual user based on their likely response or expected value.

In essence, this is an extension of the uplift modeling approach into a multi-treatment setting: instead of simply deciding “to show or not to show,” the goal is to choose the best action among several alternatives.

This setup is known as multi-treatment uplift modeling — the next step in the evolution of machine learning for personalized communications and product analytics.


Examples

Optimal treatment models are especially useful when a business has multiple communication or promotional options and needs to decide who should receive which one to maximize a target metric — such as conversion, retention, or LTV.

Promotions and discount campaigns.

In e-commerce, dozens of promotions may run simultaneously: discounts, cashback offers, free delivery, or category-based promo codes. The model helps determine which offer will yield the greatest effect for each user, rather than sending everything to everyone. This improves the economic efficiency of promotions and reduces communication noise.

Engagement channels.

Users respond differently to push notifications, emails, banners, or SMS messages.

A multi-uplift model selects the channel with the highest expected response probability, effectively turning communication management into an optimization problem.


Data

As with uplift models, training begins with an experiment, but this time it involves multiple treatment groups — one for each intervention — plus a control group.

Users are randomly assigned to these groups so the model can accurately estimate the differences in effects between treatments.

Moreover, running several independent experiments is preferable to a single large one: it increases sample diversity, helps the model distinguish between different contexts, and reduces the risk of overfitting to the specifics of a single campaign.

For each user, the following are recorded:

• group assignment (control or a specific offer),

• whether the target action occurred (e.g., purchase, activation, renewal),

• and a set of features describing the user’s behavior before the intervention.

Conceptually, the logic remains similar to that of uplift models — we aim to capture the causal effect of an intervention rather than mere correlations between features and outcomes.

However, with multiple offers, the goal shifts from simply estimating a “treatment vs. no treatment” effect to evaluating the differences between alternative interventions.

Model

In a classic uplift model, we compare two groups — those who received the treatment and those who did not.

Here, the situation is more complex: there are multiple treatments, and the model must predict the probability of the target action under each possible intervention, then choose the best one:

a(x) = argmax(P(Y=1 | X=x, T=t)) for t ∈ T

where X represents user features, T is the set of all possible treatments (offers), including the “no treatment” option, and Y is the target action.

In essence, for each user, the model compares the expected outcome across different interventions and selects the one likely to produce the greatest benefit.

It’s important to note that this formula makes a strong assumption: that maximizing the probability of the target action is equivalent to choosing the best treatment. We’ll revisit this point later. However, the core idea remains the same — the model selects the intervention that yields the best expected result for a given user.

There are several methods to solve this type of problem — here are the main ones.


One-vs-Rest

The most straightforward approach (analogous to the T-learner) is to train a separate uplift model for each treatment relative to the control group.

For example, if we have three offers A, B, and C, we train:

• uplift_A(x) = effect of treatment A compared to control

• uplift_B(x) = effect of treatment B compared to control

• uplift_C(x) = effect of treatment C compared to control

At inference time, we simply select the offer with the highest uplift.

This approach is easy to implement, interpretable, and performs well when the number of treatments is moderate. However, with many offers, it becomes computationally expensive and requires large amounts of data for proper training.
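
A minimal One-vs-Rest sketch for three offers, assuming a placeholder dataset whose treatment column takes the values "control", "A", "B", and "C":

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Minimal One-vs-Rest sketch. Placeholder dataset: Y is the target action and
# "treatment" takes the values "control", "A", "B", "C".
df = pd.read_parquet("multi_treatment_experiment.parquet")
feature_cols = [c for c in df.columns if c not in ("user_id", "Y", "treatment")]

control = df[df["treatment"] == "control"]
a_control = LGBMClassifier().fit(control[feature_cols], control["Y"])
control_pred = a_control.predict_proba(df[feature_cols])[:, 1]

uplifts = {}
for offer in ("A", "B", "C"):
    grp = df[df["treatment"] == offer]
    a_offer = LGBMClassifier().fit(grp[feature_cols], grp["Y"])
    uplifts[offer] = a_offer.predict_proba(df[feature_cols])[:, 1] - control_pred

# The best offer for each user is the one with the highest predicted uplift
best_offer = pd.DataFrame(uplifts, index=df.index).idxmax(axis=1)
```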


Joint Multi-Treatment Model

In this approach, the treatment T is included as a categorical feature in the model (similar to the S-learner), and the model is trained on the data from all treatment groups simultaneously rather than on each group separately.

The uplift for each treatment is then estimated as:

uplift(x, t) = P(Y=1 | X=x, T=t) − P(Y=1 | X=x, T=0)

where T=0 denotes the control (no treatment).

This method is efficient for a large number of treatments and scales well, but requires careful handling of class imbalance between treatment groups.
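
A minimal sketch of the joint model under the same placeholder assumptions as the One-vs-Rest example, with the treatment passed to LightGBM as a pandas categorical feature:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Minimal joint multi-treatment sketch, same placeholder dataset as above;
# the treatment enters the model as a pandas categorical feature.
df = pd.read_parquet("multi_treatment_experiment.parquet")
feature_cols = [c for c in df.columns if c not in ("user_id", "Y", "treatment")]
all_treatments = sorted(df["treatment"].unique())

X = df[feature_cols].assign(
    treatment=pd.Categorical(df["treatment"], categories=all_treatments))
joint = LGBMClassifier().fit(X, df["Y"])   # the category dtype is handled automatically

# Score every user under every treatment, then subtract the control prediction
preds = pd.DataFrame({
    t: joint.predict_proba(
        df[feature_cols].assign(
            treatment=pd.Categorical([t] * len(df), categories=all_treatments)))[:, 1]
    for t in all_treatments
}, index=df.index)

uplift_df = preds.drop(columns=["control"]).sub(preds["control"], axis=0)
best_offer = uplift_df.idxmax(axis=1)
```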


Generalizations and Extended Approaches

The definition of “optimal” treatment can vary significantly depending on the business objective.

For example, imagine a company offering two discounts — 5% and 50%.

From a purely probabilistic perspective, the larger discount almost always increases purchase likelihood.

However, from an economic standpoint, that may not be optimal — the incremental conversion gain may not offset the loss in margin from the high discount.

Therefore, the optimization objective can be defined in different ways.

In practice, one can optimize any business metric R that serves as the key decision criterion for choosing the treatment:

a(x) = argmax(R(X=x, T=t)) for t ∈ T

This logic integrates well into the classic One-vs-Rest uplift and Joint multi-treatment modeling frameworks.

In some cases, the model directly predicts business metrics such as profit, retention, or LTV; in others, additional models are built to estimate missing components — for example, purchase probability or expected costs.
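
As a rough illustration of optimizing a business metric R instead of raw conversion probability, the sketch below combines the per-treatment predictions (the preds table from the joint-model sketch) with made-up per-offer margins and costs:

```python
import pandas as pd

# Rough sketch of choosing the treatment by expected business value rather than
# by conversion probability. "preds" is the per-treatment P(Y=1 | x, t) table from
# the joint-model sketch above; margins and costs are made-up illustrative numbers.
economics = {
    "control": {"margin": 20.0, "cost": 0.0},
    "A":       {"margin": 19.0, "cost": 0.5},   # e.g., a 5% discount
    "B":       {"margin": 10.0, "cost": 0.5},   # e.g., a 50% discount
    "C":       {"margin": 20.0, "cost": 1.0},   # e.g., free delivery
}

# R(x, t) = P(Y=1 | x, t) * margin(t) − cost(t)
expected_value = pd.DataFrame({
    t: preds[t] * e["margin"] - e["cost"] for t, e in economics.items()
})
best_offer = expected_value.idxmax(axis=1)   # argmax over treatments of expected value
```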

Over time, such solutions can evolve into comprehensive decision-making systems, where multiple models describe different aspects of user behavior and jointly form a unified optimization mechanism.

But that’s already a topic for another article.


Deployment and Maintenance

After training, the model is typically validated through a dedicated A/B test. Users are divided into three groups:

Control: some users receive no treatment.

Test 1: treatments are assigned using the current business logic or at random.

Test 2: treatments are assigned based on the model’s predictions.

This setup allows for a direct comparison of the new strategy’s effectiveness against the existing one, helping assess whether the model truly improves the target metric — such as conversion, profit, or retention — and to measure uplift from promotional mechanisms.

Maintaining such models is more complex than maintaining standard uplift models.

Any change in treatment mechanics — new campaigns, discount rules, or communication logic — effectively creates a new environment, requiring model retraining.

Moreover, user responses to the same stimuli may change over time, leading to data drift.

To monitor these effects, it’s common practice in production to maintain a small random control group of users whose offers are assigned randomly.

This allows the team to track model performance, measure uplift from the new strategy, and preserve a solid foundation for future model updates.


Conclusion

Each of the models described above addresses its own class of product challenges — from predicting user behavior to selecting the optimal intervention.

Together, they form a practical framework for building next-level analytics — where decisions are based not only on metrics, but also on probabilistic estimates and causal effects.

Of course, this list is far from exhaustive. Product analytics also makes active use of other approaches — such as LTV prediction models, Propensity Score Matching for estimating effects without experiments, and ranking models, among others.

All of these methods pursue different goals, but share the same underlying idea: to better understand why users behave the way they do, and how that behavior can be influenced.

Moreover, the models discussed here can be further expanded and combined.

In practice, a single standalone model rarely works — it’s usually a system of interconnected components that evaluate different aspects of user behavior.

If this material proves useful, I’ll prepare a follow-up — in the second part, we’ll explore more advanced and specialized approaches.

In the meantime, I’d love to hear which machine learning models you use in your products — and which ones have actually delivered results. Share your experience in the comments!

