Like it or not, the reality is this: just the fact that your product or feature uses AI/ML isn't enough. At least not if you want a lasting, positive effect. Beyond the hype and the viral buzz, business is still about profit, ROI, and growing metrics. That's its nature, in most cases. Which means that for us, the people building these projects, it's essential to know how to measure the impact of AI/ML - both at the early stage and after every improvement. That's what we'll talk about today.

In this article, we'll look at the toolkit that helps us measure the effect of AI adoption and get a sense of how these methods work. I'll simplify a lot of things and keep many details behind the curtain (or for separate sections), just to lower the entry barrier. By the end, you'll have a grasp of the main approaches to measuring the impact of your project on business. You'll have a rough map of methods and know how to orient yourself - what fits your case best. From there, you'll be ready for a deeper dive.

Narrative map - to help you navigate

We'll start with the basics - why and when it's worth measuring at all. Then we'll look at what can go wrong and why not every change deserves an experiment. From there, we'll dive into the gold standard - A/B testing - and what to do when you can't run one. Finally, we'll go beyond experiments to causal inference methods - and how to trust what you find.

Measuring Impact: The Why, When, and How Not To

When it's worth measuring

First, the main reasons you might actually want to bother.

Risk & Value

We already touched on value at the start. We want to measure whether a feature delivers enough to justify further investment - how much, and what its future should be. And those are quite pleasant chores. But don't forget the critical factor: risk management. If your shiny new change actually makes things worse - conversion rates have dropped, users are leaving in frustration, there's less money - you definitely want to know that quickly enough to react. Avoiding a failure can matter even more than catching an upside.

Innovation

Since the early Agile days, product development has been about rapid iterations, market arms races, and endless searches for product–market fit. Businesses do dozens of things simultaneously to stay competitive, and you might want to prioritize the truly brilliant solutions among the merely good ones: features that can truly change the game, things users truly need, or areas where a positive impact can be achieved with minimal investment. Numbers are much easier to compare than feelings, aren't they?

Optimization

The beauty of a conscious, measurement-driven approach is that it lets you dig deeper. You start to understand the nature of your results. Maybe revenue didn't jump immediately, but users love the new functionality and engage with it daily. Maybe it resonates with a particular segment but annoys others. These and other insights open up opportunities for further optimization. You're becoming better.

Organization

Do you work at a place that talks about being "data-driven"? Or maybe you're personally the type who trusts facts over gut feeling? Then this is where you want to be: learning how to measure effect, and making sure your results actually lead you toward the goals you set.

When Not to Test

That said, there are cases where experiments don't make sense - or aren't possible at all. Let's go through the big ones.
Methodological limits

Sometimes it's simply impossible to apply the methods. Too little data to get a result. Changes too small to detect. Or no hypothesis at all about what should improve (in which case - why was it necessary to develop anything at all?).

Non-optional work

Some changes are mandatory. Legal or compliance requirements are the classic case. Not AI-specific, but clear: you need to comply with GDPR, adapt your product for accessibility, and so on. You're not expecting conversion lifts here - you're doing it because you must. The same goes for critical fixes or infrastructure updates. The site doesn't return a 502 error. How's that for business value?

Ethical red lines

Some measurement practices cross ethical boundaries, carrying risks of user harm or manipulative design. Think experiments that could trigger financial loss or compromise user data. Not everything is fair game.

Better alternatives

Sometimes it's just not worth it. If the effort (resources) spent on measurement is likely to be higher than the value itself, skip it. Time, money, opportunity cost - all matter. The simplest example: young, fast-moving startups with only a handful of clients are usually better off chasing product–market fit through short iterations and direct user feedback. By the time they'd run a full A/B test, they could have built something much better already.

How not to measure

Before/After

Intuitively, the urge is to do the following:

See how it was
Launch the new feature
See how it is now
See a positive result
Profit

But "it seems better now" has a dark side. Many things change over time (seasonality, external events, traffic shifts). You can't isolate the effect of one feature.

Core issue: Confounds environment changes with feature impact.

YoY (Year-over-Year) comparison

This familiar and traditional business trick is perfect for answering the question, "Are we growing as a business overall?". It's useful in many situations, but not for an isolated assessment of a feature's implementation or improvement. Just imagine how much has happened in the business this year. Marketers, SEOs, salespeople, you name it - everyone around you has been working tirelessly to ensure growth. The world around us isn't standing still either. Tariffs, Covid-19, and wars are happening. Bloggers and celebrities express their opinions. TikTok trends are changing consumer behavior, and your competitors are doing their part, too. But that 10% increase in turnover this January was only thanks to your AI chatbot (seriously?).

Core issue: Too long a window - dozens of other changes happen in parallel. Any YoY difference reflects everything, not your feature. It misattributes long-term business trends to a single change.

Correlation <> Causation

You've probably heard the phrase, "Correlation does not mean causation." But what does it really mean in practice? Imagine you launch an AI chatbot, and a little while after that, the number of completed purchases increases. Sounds like the chatbot caused the increase, right? Maybe - or maybe not. When usage and outcomes move together, it looks like cause and effect. But. At the same time, your marketing team launched a new campaign. Or there was a seasonal peak - there's always a sales spike this time of year. Or a competitor ran out of stock. Or...
you know, there could be a lot of reasons. And they could all affect the numbers together with, or instead of, your bot. The tricky part is that data can look related simply because two things are happening at the same time. Our brains are good at recognizing patterns, but business is full of parallel events and noise. If we don't separate cause from coincidence, we risk making bad decisions - like investing more in a feature that wasn't actually responsible for success.

Core issue: Correlation only shows that two things changed at the same time; it does not promise that one caused the other.

The Gold Standard of the industry - Randomized Controlled Experiments (RCE)

10 times out of 10 you want to be here. Luckily, 8-9 times out of 10 you will be here. And it's because of the cases where RCE is not enough that this article came about. Nevertheless, let's start with the good stuff.

Classic A/B tests

You're probably familiar with this method. It is summarized as follows:

1. We form a hypothesis. For example, that a description of goods and services generated with GenAI will be as good as (or better than) one written by a human. Or that a "Customers Also Bought / Frequently Bought Together" block in an online store will stimulate customers to buy more stuff. Or that personalized emails engage users more. And so on.

2. Define one or more metrics by which to determine the success or failure of the hypothesis.

3. Calculate the sample size and duration of the experiment. Consider possible cycles of product use.

4. Randomly split the traffic into two (or more) groups and run the experiment. The control group (A) sees the product without the new feature, and the experimental group sees the new version of the product containing the change we are testing. We check that the groups differ only in the presence or absence of the new feature.

5. Analysis. We apply statistical methods, calculate the difference in metrics, and make sure that it is statistically significant. Stage 0 could be an A/A test (both groups see the same version of the product, and we should see no difference in their behavior) to make sure that the traffic split and methodology work correctly.

6. Decision making and iteration. Based on the analysis, a decision is made: use, refine, or reject.
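To make the sample-size and analysis steps (3 and 5) concrete, here is a minimal sketch using statsmodels, assuming a plain conversion-rate metric. The baseline rate, the lift we want to detect, and the post-experiment counts are made-up illustrative numbers, not recommendations.

```python
# A minimal sketch of steps 3 and 5, assuming a conversion-rate metric.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

# Step 3: sample size per group to detect 5.0% -> 5.5% with 80% power at alpha = 0.05
effect = proportion_effectsize(0.055, 0.050)
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8, ratio=1.0)
print(f"Users needed per group: {n_per_group:,.0f}")

# Step 5: analysis after the experiment (conversion counts and group sizes are illustrative)
conversions = [1450, 1580]   # control, treatment
visitors = [28000, 28100]
stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p-value = {p_value:.4f}")  # "significant" here means p < 0.05
```

In practice you would also account for multiple metrics, peeking, and the product-usage cycles mentioned in step 3.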
It's a magical, wonderful world where there is a lot of control, a chance to double-check yourself, and a way to measure your confidence in your results. Plenty of learning resources and experienced colleagues around. What could possibly go wrong?

The main reasons why we may have to give up cozy classical A/B tests:

1. Spillover Effect - when the behavior of one group affects another. That means the control group also changes, even though it shouldn't. A textbook example is the Facebook friend recommendation block. We hope that it will help users build social connections. Let's imagine that group A doesn't have such a block, but group B sees it. User John from group B sees the block, spots user Michael from group A in it, and adds him as a friend. Both users have +1 friend, although Michael should have been in a group that is not affected by the tested feature.

2. Few users or rare events. If we have very few users (an unpopular part of the product, B2B, etc.) or we work with a very rare event (buying a very expensive product, or someone actually reading the Terms & Conditions), it will take a huge amount of time to get even a marginally significant result.

3. Impact on external factors. If we launch a change that affects the environment and cannot be isolated to individual users. For example, we are testing an advertising auction algorithm - prices will change for all advertisers, not just for those we try to isolate into an experimental group.

4. Brand Effect. Our feature can change the composition of the groups - for example, it repels or attracts certain types of users. If a new feature starts to attract newcomers (which is not our goal) and increases their share in the test group while the control group remains unchanged, the groups will no longer be comparable.
The good news is that part of the problem can be solved without going outside of RCE, using basically the same mechanics.

There's more to split than traffic!

Some of the above problems can be solved by changing only part of the overall test design. Let's look at an actual case.

According to many summaries and analyst reports, various copilots and assistants come out on top among LLM-based products. They lead both in popularity and in "survival rate", i.e. they have a chance to live longer than MVPs. The common feature of this type of project is that we have a solution designed to simplify or accelerate the work of an employee - call center operators, salespeople, finance people, and so on. But most often we don't have enough employees to divide them into two groups and measure their speed and efficiency with and without a copilot.

Here (link) is a real-life example. As part of the experiment, the researchers wanted to see how the use of AI tools affects the work of engineers. Would they close tasks faster if they were given a modern arsenal? But only 16 developers took part in the experiment, which is desperately too small to hope for confident results.

The authors instead split tasks and compared completion times. So the sample here is not 16 developers, but 246 tasks. It's still not a huge sample, but:

The p-value is OK.
The authors analyzed and marked up screen recordings and conducted interviews. In short, they did qualitative research. When the results of qualitative and quantitative research are consistent, it is a strong signal.

You can read the results and details of the methodology at the link above. What matters for our topic is not the study itself, but that it is an understandable example of the approach.

Let's give this idea a skeleton.

Case: AI Copilots (Contact Centers / Dev Teams / etc.)

Why not user-split? "Users" here are agents/devs; small populations + spillovers (shared macros, coaching, shift effects).

Instead, randomize: Ticket / conversation (assign treatment at intake). Or queue / intent as the cluster (billing, tech, returns, etc.).

Design notes: stratify by channel (chat/email/voice) and priority/SLA; monitor automation bias; analyze with cluster-robust SE.

Once you understand this principle, you can apply it to other entities as well. You can split time, geoposition, and more. Look for similar cases, get inspired, and adapt.
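To make the "cluster-robust SE" note concrete, here is a hedged sketch of how the ticket-level randomization above could be analyzed. The column names (agent_id, copilot, channel, handle_time) and the simulated data are purely illustrative.

```python
# Randomize at the ticket level, then account for the fact that tickets handled by
# the same agent are correlated by clustering standard errors on agent_id.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_tickets = 2000
df = pd.DataFrame({
    "agent_id": rng.integers(0, 40, n_tickets),            # ~40 support agents
    "copilot": rng.integers(0, 2, n_tickets),               # treatment assigned per ticket at intake
    "channel": rng.choice(["chat", "email", "voice"], n_tickets),
})
# Simulated handle time in minutes: the copilot shaves ~2 minutes on average (toy effect)
df["handle_time"] = 20 - 2 * df["copilot"] + rng.normal(0, 5, n_tickets)

model = smf.ols("handle_time ~ copilot + C(channel)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["agent_id"]}
)
print(model.summary().tables[1])  # the 'copilot' coefficient is the estimated effect
```

The same pattern works for queue- or shift-level clusters: change what you randomize, keep the clustering consistent with it.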
I'll leave a note on another frequent type of task where the classic A/B test may not fit - pricing algorithms.

Case: Dynamic Pricing (Retail)

Why not user-split? In-store it's impossible (and confusing) to show different prices to different people. Online it's often illegal/unethical and triggers fairness issues.

Instead, randomize: Time (switchback) for the same SKU×store (e.g., by shifts/days). (Optional) Clusters - SKU×store (or store clusters), stratified by category/traffic.

Design notes: balance days of week/seasonality; use cluster-robust SE; guardrails on promo/stock overlaps.

When randomization isn't an option

How do you measure the impact of your core AI feature when it's already live for everyone, or you can't run an experiment with a control group? We've established that RCE is the gold standard for a reason, but the clean world of controlled experiments often gives way to the messy reality of business. As we've seen, not all limitations of RCE can be solved even with specialized techniques. Sooner or later, every product team faces a critical question that a classic A/B test can't answer. The only way forward is to expand your arsenal with quasi-experiments. Let's explore some of the most popular ones and try to capture their essence. When the time comes, you'll know where to dig.

Methods Overview

Propensity Score Matching (PSM)

The Gist: You can consider this method when exposure to a treatment is not random (for example, when a user decides for themselves whether to use a feature you developed). For every user who received the treatment, we find a user who did not but had the same probability of receiving it. In essence, this creates a "statistical twin." We then compare these pairs to determine the effect.

Use Case: Imagine you've created a very cool, gamified onboarding for your product - for instance, an interactive tutorial with a mascot. You expect this to impact future user efficiency and retention. In this case, motivation is a key factor. Users who choose to complete the onboarding are likely already more interested in exploring the product. To measure the "pure" effect of the onboarding itself, you need to compare them with similar users.

Decision Guide

Technical Notes (For the Advanced):

Matching Strategy Matters: There are several ways to form pairs, each with its own trade-offs. Common methods include one-to-one matching, one-to-many matching, and matching with or without replacement. The choice depends on your data and research question.

Always Check for Balance: After matching, you must verify that the characteristics (the covariates used to calculate the propensity score) are actually balanced between the treated and the newly formed control group. If they aren't, you may need to adjust your propensity score model or matching strategy.

The Effect is Not for Everyone: The causal effect estimated with PSM is technically the "average treatment effect on the treated" (ATT). This means the result applies only to the types of users who were able to be matched, not necessarily to the entire population.

The Result is Sensitive to the Model: The final estimate is highly dependent on how the propensity score (the probability of treatment) was calculated. A poorly specified model will lead to biased results.

It's Not Always the Best Tool: PSM is intuitive, but sometimes simpler methods like regression adjustment or more advanced techniques (e.g., doubly robust estimators) can be more powerful or reliable. It's a good tool to have, but it's not a silver bullet.
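Here is a minimal sketch of the PSM recipe for the onboarding example, using scikit-learn for both the propensity model and the matching. Every column name (completed_onboarding, retention_d30, the covariate list) is hypothetical, and a real analysis would add the balance checks described above.

```python
# 1) model the probability of treatment, 2) match each treated user to the
# nearest untreated "twin" on that probability, 3) compare outcomes (ATT).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def att_via_psm(df: pd.DataFrame, covariates: list[str]) -> float:
    # 1. Propensity scores: P(completed_onboarding | covariates)
    ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["completed_onboarding"])
    df = df.assign(ps=ps_model.predict_proba(df[covariates])[:, 1])

    treated = df[df["completed_onboarding"] == 1]
    control = df[df["completed_onboarding"] == 0]

    # 2. One-to-one nearest-neighbor matching on the propensity score (with replacement)
    nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
    _, idx = nn.kneighbors(treated[["ps"]])
    matched_control = control.iloc[idx.ravel()]

    # 3. ATT = average outcome difference across matched pairs
    return float(treated["retention_d30"].mean() - matched_control["retention_d30"].mean())

# att = att_via_psm(users_df, ["age", "sessions_before", "acquisition_channel_paid"])
```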
Synthetic Control (SC)

The Gist: The goal is to find several untreated units that are similar to the one that received the treatment. From this pool, we create a "synthetic" control group by combining them in a way that makes their characteristics closely resemble the treated unit.
This "combination" is essentially a weighted average of the units from the control group (often called the "donor pool"). The weights are chosen to minimize the difference between the treated unit and the synthetic version during the pre-treatment period.

Use Case: Imagine your food delivery company is implementing a new AI-based logistics system to reduce delivery times across an entire city, like Manchester. A classic A/B test is impossible because the system affects all couriers and customers at once. You also can't simply compare Manchester's performance to another city, such as Birmingham, because unique local events or economic trends there would skew the comparison. To measure the true impact, you need to build a "synthetic" control that mirrors Manchester's pre-launch trends.

Here's how that "synthetic twin" is built. The method looks at the period before the launch and uses a "donor pool" of other cities (e.g., Birmingham, Leeds, and Bristol) to create the perfect "recipe" for replicating Manchester's past. By analyzing historical data on key predictors (like population or past delivery times), the algorithm finds the ideal weighted blend. It might discover, for instance, that a combination of "40% Birmingham + 35% Leeds + 25% Bristol" had a performance history that was a near-perfect match for Manchester's own.

Once this recipe is locked in, it's used to project what would have happened without the new system. From the launch day forward, the model calculates the "Synthetic Manchester's" performance by applying the recipe to the actual, real-time data from the donor cities. This synthetic version represents the most likely path the real Manchester would have taken. The difference between the real Manchester's improved delivery times and the performance of its synthetic twin is the true, isolated effect of your new AI system.

Decision Guide

Technical Notes (For the Advanced):

Weight Transparency and Diagnostics: Always inspect the weights assigned to the units in the donor pool. If one unit receives almost all the weight (e.g., 99%), your "synthetic control" has essentially collapsed into a simple Difference-in-Differences (DiD) model with a single, chosen control unit. This can indicate that your donor pool is not diverse enough.

Modern Extensions Exist: The original Synthetic Control method has inspired more advanced versions. Two popular ones are:

Generalized Synthetic Control (GSC): An extension that allows for multiple treated units and can perform better when a perfect pre-treatment fit is not achievable.

Synthetic Difference-in-Differences (SDID): A hybrid method that combines the strengths of both synthetic controls (for weighting control units) and difference-in-differences (for weighting time periods). It is often more robust to noisy data.
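Here is a bare-bones sketch of the weight-finding step from the Manchester example, run on simulated pre-launch data. Production implementations (and the extensions above) add predictors, regularization, and proper inference on top of this.

```python
# Choose non-negative weights that sum to 1 so the donor cities' pre-launch series
# best reproduces the treated city's series over the pre-treatment period.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_weeks, n_donors = 52, 3                                   # e.g. Birmingham, Leeds, Bristol
donors = rng.normal(30, 3, size=(n_weeks, n_donors))        # pre-launch delivery times (minutes)
true_w = np.array([0.4, 0.35, 0.25])
treated = donors @ true_w + rng.normal(0, 0.5, n_weeks)     # "Manchester" before launch

def pre_treatment_gap(w: np.ndarray) -> float:
    return float(np.sum((treated - donors @ w) ** 2))

result = minimize(
    pre_treatment_gap,
    x0=np.full(n_donors, 1 / n_donors),                     # start from equal weights
    bounds=[(0, 1)] * n_donors,                             # weights are non-negative
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],  # and sum to 1
    method="SLSQP",
)
weights = result.x
print("Donor weights:", np.round(weights, 2))
# After launch: synthetic = post_launch_donors @ weights; the gap between the real
# series and this projection is the estimated effect.
```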
Difference-in-Differences (DiD)

The Gist: We take a group where something has changed (e.g., it got the new feature) and a group where everything remains the same. The second group should be one where, historically, the trend of the key metric was the same as in the group with the feature. On this basis, we assume that without our intervention the trends of the metrics would have stayed parallel. We look at the before-and-after differences in the two groups, and then compare those two differences (that's why the method is called Difference-in-Differences). The idea is simple: without us, both groups would have developed the same way; with us, the difference between their changes is the "net" effect of implementing our feature.

Use Case(s): The method is very popular, so let's look at a few case studies.

One region (country, city) gets the new discount system (or AI service), while another doesn't. We compare the change in sales or engagement between the two.

An LLM is used to generate an optimized XML feed for Google Shopping for one product category. This includes creating more engaging titles and detailed product descriptions.
A second, similar category with a standard, template-based feed is used as a control group. We then compare the change in metrics like CTR or conversions between the two groups. Similar mechanics may be at work with SEO.

Caveat: A good and understandable case, but it requires careful group selection. Organic traffic trends for different categories (e.g., "laptops" and "dog food") can differ greatly due to seasonality or competitor actions. The method will be reliable if the categories are very similar (e.g., "men's running shoes" and "women's running shoes").

Measuring the impact of a feature launched only on Android, using iOS users as a control group to account for general market trends.

Caveat: A very common case in practice, but methodologically risky. Android and iOS audiences often have different demographics, purchasing power, and behavioral patterns. Any external event (e.g., a marketing campaign targeting iOS users) can break the parallel trends and distort the results.

Decision Guide

Technical Notes (For the Advanced):

The Core Strength: The power of DiD lies in shifting the core assumption from the often-unrealistic "the groups are identical" to the more plausible "the groups' trends are identical." A simple post-launch comparison between Android and iOS is flawed because the user bases can be fundamentally different.
A simple before-and-after comparison on Android alone is also flawed due to seasonality and other time-based factors. DiD elegantly addresses both issues by assuming that while the absolute levels of a metric might differ, their "rhythm" or dynamics would have been the same in the absence of the intervention. This makes it a robust tool for analyzing natural experiments.

Deceptive Simplicity: While DiD is simple in its basic 2x2 case, it can become quite complex. Challenges arise when dealing with multiple time periods, different start times for the treatment across groups (staggered adoption), and when using machine learning techniques to control for additional covariates.

The Problem of "Staggered Adoption": The classical DiD model is ideal for cases where one group receives the intervention at one point in time. But in real life, as you know, different subgroups (e.g., different regions or user groups) often receive the feature at different times, and this is when applying standard DiD regression can lead to highly biased results. This is because groups already treated may be implicitly used as controls for groups treated later, which can sometimes even change the sign of the estimated effect.

Heterogeneity of the Treatment Effect: A simple DiD model implicitly assumes that the treatment effect is constant across all units and over time. In reality, the effect may evolve (e.g., it may increase as users become accustomed to the feature) or vary between different subgroups. There are studies that show this, and there are specific estimation methods that take this effect into account. At least we think so until a new study comes out, right?
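As a sketch of how the basic 2x2 case is usually estimated, the DiD effect can be read off the interaction coefficient of a simple regression. The Android/iOS framing and all numbers below are simulated for illustration only.

```python
# The coefficient on treated:post is the difference-in-differences estimate.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 4000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),   # 1 = Android (gets the feature), 0 = iOS
    "post": rng.integers(0, 2, n),      # 1 = after launch, 0 = before
})
# Toy outcome: different baseline levels per platform, a common time trend,
# and a +1.5 lift only for Android after launch (the effect we want to recover).
df["orders"] = (
    10 + 2 * df["treated"] + 1 * df["post"]
    + 1.5 * df["treated"] * df["post"]
    + rng.normal(0, 2, n)
)

did = smf.ols("orders ~ treated * post", data=df).fit()
print(did.params["treated:post"])   # should land near 1.5
```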
Regression Discontinuity Design (RDD)

The Gist: If a user gets a treatment based on a rule with a cutoff value (e.g., "made 100 orders" or "account is 1 month old"), we assume that those just below the cutoff are very similar to those just above it. For example, a user with 99 orders is almost identical to a user with 101 orders. The only difference is that the person with 101 got the treatment, and the person with 99 didn't. This means we can try to compare them to see the effect.

Use Case(s):

A loyalty program gives "Gold Status" to users who have spent over $1000 in a year. RDD would compare the behavior (e.g., retention, future spending) of users who spent $1001 with those who spent $999. A sharp difference in their behavior right at the $1000 mark would be the effect of receiving "Gold Status."

An e-commerce site offers customers different shipping options based on their arrival time. Any customer arriving before noon gets 2-day shipping, while any customer arriving just after noon gets a 3-day shipping window. The site wants to measure the causal effect of this policy on the checkout probability.
Decision Guide

Technical Notes (For the Advanced):

This article focuses on Sharp RDD, where crossing the cutoff guarantees the treatment. A variation called Fuzzy RDD exists for cases where crossing the cutoff only increases the probability of receiving the treatment.

The first step in any RDD analysis is to plot the data. Plot the outcome variable against the running variable: the "jump" or discontinuity at the cutoff should be clearly visible to the naked eye.

A crucial step is choosing the right bandwidth, i.e. how far from the cutoff you look for data. It's a trade-off between bias and variance:

Narrow Bandwidth: A more credible assumption (users are very similar), but fewer data points (high variance, low power).

Wide Bandwidth: More data points (low variance, high power), but a riskier assumption (users might be too different).
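Here is a rough Sharp RDD sketch for the "Gold Status" example, with a hand-picked $100 bandwidth and simulated data. A real analysis would plot the data first and choose the bandwidth more carefully, as noted above.

```python
# Keep users within a bandwidth around the $1000 cutoff and fit one regression
# whose 'gold' term captures the jump at the cutoff.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20000
spend = rng.uniform(500, 1500, n)
gold = (spend >= 1000).astype(int)                 # treatment assigned by the cutoff rule
# Toy outcome: future spending rises smoothly with past spend, plus a +40 jump from Gold Status
future_spend = 100 + 0.2 * spend + 40 * gold + rng.normal(0, 30, n)
df = pd.DataFrame({"spend": spend, "gold": gold, "future_spend": future_spend})

cutoff, bandwidth = 1000, 100
window = df[(df["spend"] > cutoff - bandwidth) & (df["spend"] < cutoff + bandwidth)].copy()
window["dist"] = window["spend"] - cutoff          # running variable centered at the cutoff

# Local linear regression with separate slopes on each side; 'gold' is the discontinuity
rdd = smf.ols("future_spend ~ gold * dist", data=window).fit()
print(rdd.params["gold"])                          # should land near 40
```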
Bayesian Structural Time Series (BSTS)

In Simple Terms: Based on pre-event data, the model builds a forecast of what would have happened without our intervention. To do this, it relies on other, similar time series that were not affected by the change. The difference between this forecast and reality is the estimated effect. We looked at Synthetic Control earlier; think of BSTS as that same idea of estimating impact via similar, unaffected units, but on steroids.

Key Idea: To build an "alternate universe" where your feature never existed. The main difference from Synthetic Control is that the forecast is built with a Bayesian model instead of a simple weighted combination of units.

Use Case: You changed the pricing policy for one product category. To measure the effect, the model uses sales from other, similar categories to forecast what the sales in your category would have been without the price change.

There are excellent ready-made libraries for working with BSTS (like Google's CausalImpact), with which you can get it done in 10-20 lines of code. Just don't forget to run the tests (see the block below).

Instrumental Variables (IV)

In Simple Terms: A method for situations where a hidden factor (like motivation) influences both the user's choice and the final outcome. We find an external factor (an "instrument") that pushes the user towards the action but doesn't directly affect the outcome itself.

Key Idea: To find an "indirect lever" that moves only what's needed.

Use Case (academic): You want to measure the effect of TV ads on sales, but the ads are shown in regions where people already buy more. The instrument could be the weather: on rainy days, people watch more TV (and see the ad), but the weather itself doesn't directly make them buy your product. This allows you to isolate the ad's effect from the region's wealth factor.

Double Machine Learning (DML)

In Simple Terms: A modern approach that uses two ML models to "cleanse" both the treatment and the outcome from the influence of hundreds of other factors. By analyzing only what's left after this "cleansing" (the residuals), the method finds the pure cause-and-effect impact. DML's main strength is in settings where an A/B test is impossible or very difficult to conduct - most often self-selection situations, when users decide for themselves whether to use a feature or not.

Key Idea: To use ML to remove all the "noise" and leave only the pure cause-and-effect signal.

Use Case: For example, in a fintech application. You launch a new premium feature: an AI assistant that analyzes spending and gives personalized savings advice. The service is not enabled by default; the user has to activate it themselves in the settings.

DML works great in tandem with other methods and can often be applied when simpler approaches are not suitable.
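To give the DML idea some shape, here is a hedged partialling-out sketch for the AI savings assistant example, built only on scikit-learn and statsmodels. Dedicated libraries (e.g., EconML or DoubleML) implement the full procedure with proper cross-fitting and inference; all column names here are hypothetical.

```python
# One ML model predicts the outcome from user features, another predicts adoption
# of the feature; regressing residual on residual isolates the treatment effect.
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def dml_effect(df: pd.DataFrame, features: list[str]) -> float:
    X = df[features]
    T = df["assistant_enabled"]      # did the user switch the AI assistant on? (0/1)
    Y = df["savings_rate"]           # outcome we care about

    # Out-of-fold predictions, so each model never scores the rows it was trained on
    t_hat = cross_val_predict(GradientBoostingClassifier(), X, T, cv=5, method="predict_proba")[:, 1]
    y_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)

    # Regress the "cleansed" outcome on the "cleansed" treatment
    t_res, y_res = T - t_hat, Y - y_hat
    ols = sm.OLS(y_res, sm.add_constant(t_res)).fit()
    return float(ols.params.iloc[1])   # estimated effect of enabling the assistant

# effect = dml_effect(users_df, ["income", "age", "app_sessions", "prior_savings_rate"])
```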
How do I make sure everything is working correctly?

Congratulations, you've come a long way by reading this entire review. Fair enough, you may have had a thought: these methods are quite complex, so how can I be sure I've done it right? How can I trust the final results? And heck, that's exactly the right attitude. The general idea of checking the correctness of estimation methods is summarized as follows: we measure the effect where it clearly shouldn't be - just to make sure it isn't there.

With RCE, it's pretty simple - we need an A/A test. We run the experiment according to our design: exactly the same metrics, splitting, etc., except that we do NOT show our new feature to either group. As a result, we shouldn't see any difference between them. Sometimes it makes sense to do backtesting in the same way: after the feature has worked for a while, roll it back for some traffic and check that the effect is still the same as what we saw when we ran the A/B test the first time.

But quasi-experiments are a bit more complicated. Each method has its own specifics and may have its own special ways to check the correctness of the implementation. Here we will talk about relatively universal checks, which I recommend in most cases.

Robustness Checks

To make sure that the effect we have found is not an accident or a model error, we conduct a series of "stress tests". The idea is the same: we create conditions in which the effect should not occur. If our method doesn't find it there either, our confidence in the result grows. Here are some key checks:

Placebo Tests

This test checks the uniqueness of your effect compared to other objects within your dataset.

How to do it: Take, for example, the synthetic control method. We have one "treated" subject (which was exposed) and many "clean" subjects in the control group (no exposure). We pretend, in turn, that each of the objects in the control group was affected, and construct a "synthetic control" for it.

What to expect: In an ideal world, none of these "fake" tests should show as strong an effect as our real case. The graph of the real effect should stand out prominently against the "placebo effects".

Why it's needed: This test shows whether our result is unique. If our method finds significant effects in subjects where nothing happened, it is likely that our main finding is just noise or a statistical anomaly, not a real effect.

In-time Placebo

How to do it: We artificially shift the date of our intervention into the past. For example, if the actual ad campaign started on May 1st, we "tell" the model that it started on April 1st, when nothing actually happened.

What to expect: The model should not detect any meaningful effect on this fake date.

Why: This helps ensure that the model is responding to our event and not to random fluctuations in the data or some seasonal trend that coincidentally occurred on the date of our intervention.

In-space Placebo

This test checks the reliability of your model by testing its tendency to produce false positives on completely independent data.

How to do it: If you have data that is similar to your target data but was definitely not affected by the intervention, use it. For example, you launched a promotion in one region. Take sales data from another region where the promotion did not take place and apply your model to it with the same actual intervention date.

What to expect: The model should find no effect for this "control" data.

Why: If your model finds effects everywhere you apply it, you can't trust its conclusions on the target series. This test shows that the model is not "hallucinating" by creating effects from nothing.
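As a small illustration of the in-time placebo, here is a sketch that re-runs a DiD-style estimate (like the earlier sketch) with a fake launch date inside the pre-period. The column names, metric, and dates are made up.

```python
# Re-run the same estimator with a fake launch date and check that the "effect"
# on that date is near zero and not statistically significant.
import pandas as pd
import statsmodels.formula.api as smf

def did_estimate(df: pd.DataFrame, launch_date: str) -> tuple[float, float]:
    d = df.copy()
    d["post"] = (d["date"] >= pd.Timestamp(launch_date)).astype(int)
    fit = smf.ols("daily_metric ~ treated * post", data=d).fit()
    return fit.params["treated:post"], fit.pvalues["treated:post"]

# real_effect, real_p = did_estimate(df, "2024-05-01")                       # the actual launch
# fake_effect, fake_p = did_estimate(df[df["date"] < "2024-05-01"], "2024-04-01")  # placebo date
# Expect fake_effect ~ 0 and fake_p well above 0.05; if the placebo "finds" an effect,
# the model is reacting to noise or seasonality rather than to your feature.
```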
Decision Map (Instead of conclusions)

If you've read (or scrolled) all the way down here, I guess you don't need another nice outline of why it's so important to measure the results of implementing an AI/ML feature. It is much more valuable for you to get a useful decision-making tool. And I have one. The framework looks like this:

Measure through an A/B test.
Measure through an A/B test. Seriously.
Think about different split units and clusters to still apply RCE.
Below is a cheat sheet on choosing a Causal Inference method, to quickly figure out which one is right for you. Go back to the part of the article where I explain it in layman's terms. After that, go to the manuals and guides on that method.

Helpful materials: used in writing this article and highly recommended for a deeper dive into the topic.

Understand the full cycle of creating AI/ML solutions: Machine Learning System Design by Valerii Babushkin and Arseny Kravchenko

The path to the world of RCE: Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang, and Ya Xu

Where to understand Causal Inference in detail:
Causal Inference: What If by Miguel Hernán and James Robins
Causal Inference for the Brave and True
Causal ML Book