Like it or not, the reality is this: it is not enough for your product or feature to simply use AI/ML. At least not if you want a lasting, positive effect. Beyond the hype and viral buzz, business is still about profit, returns, and growth metrics. That is its nature, in most cases. Which means that for us, the builders of these projects, it matters to know how to measure the impact of AI/ML.

In this article, we will look at the toolkit that helps us measure the impact of AI adoption and see how these methods work. By the end, you will know the main ways to measure your project's business impact, you will have a rough map of the methods, and you will know how to orient yourself: what fits your case best.

A narrative map to help you navigate

We will start with the basics: why and when measurement is worth doing. Next, we will look at what can go wrong, and why not every change deserves an experiment. From there, we will dive into the gold standard, A/B testing, and what to do when you cannot run it. Finally, we will go beyond experiments to causal inference methods, and how to trust what you find.

Measuring impact: why, when, and how not to

When it's worth measuring

First, the main reasons you might actually want to bother.

Risk & Value

We already touched on value at the start: we want to measure whether a feature delivers enough value to justify further investment. But don't forget the flip side. If your shiny new change actually makes things worse (conversion drops, users leave in frustration, there is less money), you want to know quickly enough to react; avoiding a failure can matter even more than catching an upside. This is, quite literally, risk management.

Innovation

Since the early days of Agile, product development has been about rapid iteration, a market arms race, and the endless search for product-market fit. Companies do dozens of things at once to stay competitive, and you may want to prioritize the truly outstanding solutions over the merely good ones.

Optimization

The beauty of a deliberate, measurement-oriented approach is that it lets you dig deeper. Maybe revenue didn't jump immediately, but users love the new feature and engage with it daily. Maybe it resonates with one particular segment but annoys everyone else.

Organization

Do you work at a place that talks about being "data-driven"? Or maybe you're personally the type who trusts facts over gut feeling? Then this is where you want to be: learning how to measure effect, and making sure your results actually lead you toward the goals you set.

When Not to Test

That said, there are cases where experimentation makes no sense, or is outright impossible.

Methodological limits

Sometimes the methods simply cannot be applied: too little data to get a result, a change too small to detect, or no hypothesis at all about what should improve (in which case, why build anything?).

Non-optional work

Some changes are mandatory. Legal or compliance requirements are the classic case. Not AI-specific, but clear enough: you need to comply with GDPR, adapt your product for accessibility, and so on. The same goes for critical fixes or infrastructure updates: what is the business value of the website not returning 502 errors?

Ethical red lines

Some measurement practices cross ethical boundaries, risking user harm or manipulative design. Think of experiments that could trigger financial losses or damage user data.

Better alternatives

If the effort (resources) spent on measurement is likely to exceed the value itself, skip it. The simplest example: a young, fast-moving startup with only a handful of customers is usually better off chasing product-market fit through short iterations and direct user feedback.

How not to measure

Before/after comparison

Intuitively, the temptation is to do the following:

1. See how it was
2. Launch the new feature
3. See how it is now
4. See a positive result
5. Profit

But "it looks better now" has a dark side: many things change over time (seasonality, external events, traffic shifts). Core issue: confusing environmental change with feature impact.

YoY (year-over-year) comparison

This familiar and traditional business trick is perfect for answering the question, "Are we growing as a business overall?".
It is useful in many situations, but not for an isolated evaluation of a feature launch or improvement. Imagine how much has happened in the business over a year: marketers, SEO, salespeople, you name it. Everyone around you works relentlessly to ensure growth, and the world around you doesn't stand still either. And yet that 10% revenue increase in January was solely thanks to your AI chatbot (seriously?). Core issue: the window is too long; dozens of other changes happen at the same time. Any YoY difference reflects everything, not your feature.

Correlation vs. causation

You have probably heard that "correlation does not imply causation", but what does it actually mean? Imagine you launched an AI chatbot, and some time later the number of completed purchases went up. Sounds like the chatbot caused the increase, right? Maybe. Or maybe not. When usage and outcome move together, it looks causal. But at the same time, your marketing team launched a new campaign. The tricky part is that data can merely look related because two things happened at the same time. Our brains are great at spotting patterns, but business is full of parallel events and noise. Core issue: correlation only shows that two things changed together; it does not guarantee that one caused the other.

The industry's gold standard: the Randomized Controlled Experiment (RCE)

Ten times out of ten, this is where you want to be. Fortunately, 8-9 times out of 10 you will be. It is because of the remaining cases, where an RCE is not enough, that this article exists. Still, let's start with the good stuff.

The classic A/B test

You are probably already familiar with this method; in summary:

1. We form a hypothesis. For example, that product and service descriptions generated with GenAI will be as good as (or better than) those written by a human. Or that a "Customers Also Bought / Frequently Bought Together" block in an online store will nudge customers to buy more stuff. Or that personalized emails engage users more. And so on.
2. Define one or more metrics by which to determine the success or failure of the hypothesis.
3. Calculate the sample size and duration of the experiment. Consider possible cycles of product use.
4. Randomly split the traffic into two (or more) groups and run the experiment. The control group (A) sees the product without the new feature, and the experimental group sees the new version of the product containing the change we are testing. We check that the groups differ only in the presence or absence of the new feature. Stage 0 could be an A/A test (both groups see the same version of the product, and we expect no difference in their behavior) to make sure the traffic split and methodology work correctly.
5. Analysis. We apply statistical methods, calculate the difference in metrics, and make sure it is statistically significant.
6. Decision making and iteration. Based on the analysis, a decision is made: adopt, refine, or reject.
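The analysis step often boils down to a two-proportion z-test. Here is a minimal, self-contained sketch with invented conversion counts; in practice you would reach for a statistics library and also compute the required sample size up front:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Step 5 in miniature: is the difference in conversion rates between
    control (a) and treatment (b) statistically significant?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Invented numbers: 5.0% vs 5.6% conversion on 10,000 users per arm.
z, p = two_proportion_ztest(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # p > 0.05 here: not significant yet
```

In real work you would likely use something like `proportions_ztest` from statsmodels plus a power calculation for step 3, rather than rolling your own.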
It's a magical, wonderful world with plenty of control, a chance to double-check yourself, and ways to quantify confidence in your results. Plenty of learning resources and experienced colleagues around. What could possibly go wrong? The main reasons we may have to give up cozy classical A/B tests:

1. Spillover Effect: when the behavior of one group affects another. That means the control group also changes, even though it shouldn't. A textbook example is the Facebook friend-recommendation block. We hope it will help users build social connections. Imagine that group A doesn't have the block, but group B sees it. User John from group B sees user Michael from group A in it and adds him as a friend. Both users got +1 friend, although Michael was supposed to be in a group unaffected by the tested feature. Let's look at a few more situations where classic user-level splitting breaks down.

2. Few users or rare events. If we have very few users (an unpopular part of the product, B2B, etc.), or we work with a very rare event (buying a very expensive product, or someone actually reading the Terms & Conditions). In such cases, it will take a huge amount of time to get even a marginally significant result.

3. Impact on external factors. If we launch a change that affects the environment and cannot be isolated to individual users. For example, we are testing an advertising auction algorithm: prices will change for all advertisers, not just for those we try to isolate into an experimental group.

4. Brand effect. Our feature can change the composition of the groups, for example by repelling or attracting certain types of users. If a new feature starts to attract newcomers (not our goal) and increases their share in the test group while the control group remains unchanged, the groups will no longer be comparable.

The good news is that some of these problems can be solved without going beyond the RCE, using essentially the same machinery.

There's more to split than traffic!
Some of the problems above can be solved simply by changing part of the overall test design.

According to many surveys and analysts, the various copilots and assistants built on top of LLMs lead both in popularity and in "survival rate", that is, in their chances of living longer than the MVP. The common trait of such projects is that we have a solution designed to simplify or speed up employees' work: call-center operators, salespeople, finance staff, and so on.

In one study (link), the researchers wanted to see how AI tools affect engineers' work. Would developers complete tasks faster if given a modern arsenal? But only 16 developers took part in the experiment, hopelessly few to hope for confident results. Instead, the authors split by tasks. So the sample here is not 16 developers but 246 tasks, which is still not a huge sample, but: the p-value came out fine. The authors also analyzed and annotated screen recordings and conducted interviews; in short, they ran a qualitative study alongside the quantitative one, and when the results of the two agree, that is a strong signal. What matters for us now is the takeaway within our topic: we are interested not in the study itself but in it as a clear example of the method.

Let's give this idea a skeleton.

Case: AI copilots (contact center / dev team / etc.)

Why not user-split? The "users" here are agents/devs; small populations plus spillovers (shared macros, coaching, shift effects).

Instead, randomize: tickets / conversations (assign treatment at intake), or queues / intents as clusters (billing, tech, returns, etc.).

Design notes: stratify by channel (chat/email/voice) and priority/SLA; monitor automation bias; analyze with cluster-robust SEs.

Once you understand the principle, you can apply it to other units as well: you can split by time, geography, and so on. Look for similar cases, get inspired, and adapt.

Let me leave you one more common task type where a classic A/B test may not fit: pricing algorithms.

Case: dynamic pricing (retail)

Why not user-split? In a physical store it is impossible (and confusing) to show different people different prices. Online it is often illegal/unethical and raises fairness concerns.

Instead, randomize: time within the same SKU×store (switchbacks, e.g., by shift/day); optionally cluster by SKU×store (or store clusters), stratified by category/traffic.

Design notes: balance day-of-week and seasonality; use cluster-robust SEs; beware of overlapping promos and stockouts.

When randomization is not an option

How do you measure the impact of your core AI feature when it's already live for everyone, or you can't run an experiment with a control group? We have established why the RCE is the gold standard, but the clean world of controlled experiments often gives way to the messy reality of business. As we have seen, not all of the RCE's limitations can be worked around, even with specialized techniques. Sooner or later, every product team faces a key question that a classic A/B test cannot answer. Let's explore some of the most popular causal inference methods and try to capture their essence. When the time comes, you will know where to dig.

Overview of methods

Propensity Score Matching (PSM)

The gist: consider this method when exposure to the treatment is not random, for example, when users themselves decide whether to use the feature you built.

Use case: Imagine you've created a very cool, gamified onboarding for your product, for instance, an interactive tutorial with a mascot. You expect this to impact future user efficiency and retention. Here, motivation is the key factor: users who chose to complete the onboarding were probably already more interested in exploring the product. To measure the "pure" effect of the onboarding itself, you need to compare them with similar users who didn't go through it.

Decision guide

Technical notes (for the advanced):

Matching strategy matters: There are several ways to form pairs, each with its own trade-offs.
Common methods include one-to-one matching, one-to-many matching, and matching with or without replacement. The choice depends on your data and research question.

Always check for balance: After matching, you must verify that the characteristics (the covariates used to calculate the propensity score) are actually balanced between the treated group and the newly formed control group. If they aren't, you may need to adjust your propensity score model or matching strategy.

The effect is not for everyone: The causal effect estimated with PSM is technically the "average treatment effect on the treated" (ATT). This means the result applies only to the types of users who could be matched, not necessarily to the entire population.

The result is sensitive to the model: The final estimate depends heavily on how the propensity score (the probability of treatment) was calculated. A poorly specified model will lead to biased results.

It's not always the best tool: PSM is intuitive, but sometimes simpler methods like regression adjustment or more advanced techniques (e.g., doubly robust estimators) can be more powerful or reliable. It's a good tool to have, but it's not a silver bullet.

Synthetic Control (SC)

The gist: the goal is to find several untreated units similar to the treated one; from this pool, we create a "synthetic" control group by combining them so that their characteristics closely resemble the treated unit. This "combination" is essentially a weighted average: the weights of the units from the control pool (often called the "donor pool") are chosen to minimize the difference between the treated unit and its synthetic version over the pre-treatment period.

Use case: Imagine your food delivery company is implementing a new AI-based logistics system to reduce delivery times across an entire city, like Manchester. A classic A/B test is impossible because the system affects all couriers and customers at once.
You also can't simply compare Manchester's performance to another city, such as Birmingham, because unique local events or economic trends there would skew the comparison. To measure the true impact, you need to build a "synthetic" control that mirrors Manchester's pre-launch trends.

Here is how the "synthetic twin" is built. The algorithm looks at the period before the launch and uses a "donor pool" of other cities (e.g., Birmingham, Leeds, and Bristol) to create the perfect "recipe" for replicating Manchester's past. By analyzing historical data on key predictors (like population or past delivery times), the algorithm finds the ideal weighted blend. It might discover, for instance, that a combination of "40% Birmingham + 35% Leeds + 25% Bristol" is a near-perfect match for Manchester's own performance history. Once this recipe is locked in, it is used to forecast what would have happened without the new system: from launch day onward, the model computes the performance of "synthetic Manchester" by applying the recipe to the actual, live data of the donor cities.

Decision guide

Technical notes (for the advanced):

Weight transparency and diagnostics: Always inspect the weights assigned to the units in the donor pool. If one unit receives almost all the weight (e.g., 99%), your "synthetic control" has essentially collapsed into a simple Difference-in-Differences (DiD) model with a single, chosen control unit. This can indicate that your donor pool is not diverse enough.

Modern extensions exist: The original Synthetic Control method has inspired more advanced versions. Two popular ones are:

Generalized Synthetic Control (GSC): An extension that allows for multiple treated units and can perform better when a perfect pre-treatment fit is not achievable.

Synthetic Difference-in-Differences (SDID): A hybrid method that combines the strengths of both synthetic controls (for weighting control units) and difference-in-differences (for weighting time periods). It is often more robust to noisy data.
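To make the "recipe" idea concrete, here is a toy sketch that searches for non-negative donor weights summing to one that best reproduce the treated city's pre-launch metric. All numbers are invented, and a real study would use constrained quadratic optimization (or a ready-made library) rather than a grid search:

```python
# Toy "synthetic Manchester". We search for donor weights (non-negative,
# summing to 1) that best reproduce the treated city's PRE-launch metric.
pre = {  # average delivery time over 6 pre-launch weeks (invented)
    "manchester": [44.3, 43.7, 45.0, 43.2, 43.9, 44.9],
    "birmingham": [50, 48, 52, 47, 49, 51],
    "leeds":      [38, 40, 37, 39, 38, 40],
    "bristol":    [44, 42, 45, 43, 44, 42],
}
donors = ["birmingham", "leeds", "bristol"]

def pre_period_mse(weights):
    """How well the weighted donor average matches the treated unit."""
    synth = [sum(w * pre[d][t] for w, d in zip(weights, donors))
             for t in range(len(pre["manchester"]))]
    return sum((s - m) ** 2 for s, m in zip(synth, pre["manchester"]))

best_w, best_loss = None, float("inf")
steps = 20  # grid resolution of 0.05 over the weight simplex
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        w = (i / steps, j / steps, (steps - i - j) / steps)
        loss = pre_period_mse(w)
        if loss < best_loss:
            best_w, best_loss = w, loss

print("recipe:", dict(zip(donors, best_w)))  # -> 40% / 35% / 25%
```

From launch day on, the same weights applied to the donors' live data give the "synthetic Manchester" counterfactual.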
Difference-in-Differences (DiD)

The gist: we take one group where something changed (e.g., it got a new feature) and one group where everything stayed the same. The second group should be such that, historically, the trend of the key metric in it was the same as in the group with the feature. On this basis, we assume that without our intervention the metrics' trends would have remained parallel. We look at the before-and-after differences in the two groups, then compare these two differences (that's why the method is called Difference-in-Differences). The idea is simple: without us, both groups would have evolved the same way; the difference between their changes is therefore the "pure" effect of shipping our feature.

Use case(s): The method is very popular, so let's even look at a few case studies.

One region (country, city) gets a new discount system (or AI service), while another does not.

An LLM is used to create an optimized XML feed for Google Shopping for one product category: more attractive titles and detailed product descriptions. A second category with a standard, template-based feed serves as the control group; we then compare the change in metrics such as CTR or conversion between the two groups. Caveat: organic traffic trends for dissimilar categories (e.g., "laptops" and "dog food") can diverge sharply due to seasonality or competitor behavior; the method is reliable if the categories are very similar (e.g., "men's running shoes" and "women's running shoes").

Measuring the impact of a feature launched only on Android, using iOS users as a control group to account for general market trends. Caveat: a very common case in practice, but methodologically risky. Android and iOS audiences often have different demographics, purchasing power, and behavioral patterns. Any external event (e.g., a marketing campaign targeting iOS users) can break the parallel trends and distort the results.
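In its basic 2x2 form the arithmetic is trivial, which is part of the appeal. A toy sketch with invented numbers for the Android/iOS example (the result is the same as the interaction coefficient in the regression outcome ~ treated + post + treated×post):

```python
# Toy 2x2 difference-in-differences. All numbers are invented:
# average weekly orders per user, before vs. after an Android-only launch,
# with iOS as the control platform.
metric = {
    ("android", "before"): 4.0, ("android", "after"): 5.1,
    ("ios",     "before"): 6.0, ("ios",     "after"): 6.4,
}

diff_treated = metric[("android", "after")] - metric[("android", "before")]
diff_control = metric[("ios", "after")] - metric[("ios", "before")]

# The DiD estimate: the treated group's change, net of the market-wide
# change captured by the control group.
did_effect = diff_treated - diff_control

print(f"naive before/after on Android: +{diff_treated:.1f}")
print(f"DiD estimate of the feature:   +{did_effect:.1f}")
```

The naive before/after view would credit the feature with the full +1.1, while DiD attributes +0.4 of it to the overall market trend visible on iOS.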
Decision guide

Technical notes (for the advanced):

The core strength: The power of DiD lies in shifting the core assumption from the often-unrealistic "the groups are identical" to the more plausible "the groups' trends are identical." A simple post-launch comparison between Android and iOS is flawed because the user bases can be fundamentally different. A simple before-and-after comparison on Android alone is also flawed due to seasonality and other time-based factors. DiD elegantly addresses both issues by assuming that while the absolute levels of a metric might differ, their "rhythm" or dynamics would have been the same in the absence of the intervention. This makes it a robust tool for analyzing natural experiments.

Deceptive simplicity: While DiD is simple in its basic 2x2 case, it can become quite complex. Challenges arise when dealing with multiple time periods, different start times for the treatment across groups (staggered adoption), and when using machine learning techniques to control for additional covariates.

The problem of "staggered adoption": the classical DiD model is ideal for cases where one group receives the intervention at one point in time. But in life, as you know, different subgroups (e.g., different regions or user groups) often receive the feature at different times, and this is when applying standard DiD regression can lead to highly biased results. This is because groups already treated may be implicitly used as controls for groups treated later, which can sometimes even flip the sign of the estimated effect.

Heterogeneity of the treatment effect: a simple DiD model implicitly assumes that the treatment effect is constant across all groups and over time. In reality, the effect may evolve (e.g., it may grow as users become accustomed to the feature) or vary between subgroups. There are studies that show this, and there are specific estimation methods that take this effect into account. At least we think so until a new study comes out, right?
Regression Discontinuity Design (RDD)

The gist: when users receive the treatment based on a rule with a cutoff value (e.g., "completed 100 orders" or "has been with us for 1 month"), we assume that people just below and just above the cutoff are very similar; a user with 99 orders is almost identical to a user with 101 orders.

Use case(s):

A loyalty program grants "Gold Status" at $1000 of spend. RDD would compare the behavior (e.g., retention, future spending) of users who spent $1001 with those who spent $999. A clear jump in their behavior at the $1000 mark would be the effect of receiving "Gold Status".

An e-commerce site offers customers different shipping options depending on arrival time: a customer arriving just before noon gets 2-day shipping, while any customer arriving just after noon gets a 3-day shipping window. The site wants to measure the causal effect of this policy on checkout probability.

Decision guide

Technical notes (for the advanced):

Sharp vs. fuzzy: This article focuses on Sharp RDD, where crossing the cutoff guarantees the treatment. A variation called Fuzzy RDD exists for cases where crossing the cutoff only increases the probability of receiving the treatment.

Plot the data: The first step in any RDD analysis is to plot the data. You should plot the outcome variable against the running variable. The "jump" or discontinuity at the cutoff should be clearly visible to the naked eye.
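Estimating the jump itself can be sketched as two local linear fits, one on each side of the cutoff, compared at the threshold. The data below is synthetic with a known +0.15 jump built in, so the sketch recovers it exactly; real data would be noisy, and the estimate would come with a confidence interval:

```python
# Toy sharp RDD: users with >= 100 completed orders get "Gold Status".
# Synthetic data with a known jump of +0.15 in retention at the cutoff.
cutoff, bandwidth = 100, 20

def ols_line(points):
    """Least-squares fit y = a + b*x; returns (a, b)."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b

# retention = 0.3 + 0.002*orders, plus +0.15 above the cutoff (true effect)
data = [(orders, 0.3 + 0.002 * orders + (0.15 if orders >= cutoff else 0.0))
        for orders in range(cutoff - bandwidth, cutoff + bandwidth)]

# center the running variable so each intercept is the value AT the cutoff
left  = [(x - cutoff, y) for x, y in data if x <  cutoff]
right = [(x - cutoff, y) for x, y in data if x >= cutoff]

jump = ols_line(right)[0] - ols_line(left)[0]
print(f"estimated effect of Gold Status at the cutoff: {jump:.3f}")
```

Note how the `bandwidth` parameter decides which users enter the comparison, which is exactly the bias/variance trade-off discussed next.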
Bandwidth: A crucial step is choosing the right bandwidth, i.e., how far from the cutoff you look for data. It's a trade-off between bias and variance. Narrow bandwidth: a more accurate assumption (users are very similar), but fewer data points (high variance, low power). Wide bandwidth: more data points (low variance, high power), but a riskier assumption (users might be too different).

Bayesian Structural Time Series (BSTS)

The gist: based on pre-event data, the model builds a forecast of what would have happened without our intervention. To do this, it relies on other, similar time series that were not affected by the change. The difference between this forecast and reality is the estimated effect. We saw Synthetic Control earlier; think of BSTS as the same idea of estimating impact through similar, unaffected units, but on steroids.
In simple terms: build an "alternate universe" in which your feature never existed. The key difference from Synthetic Control is that to build the forecast it uses a Bayesian model rather than a weighted sum.

Key idea: to measure the effect, the model predicts your category's sales using sales from other, similar categories unaffected by the change.

Use case: measuring the effect of a price change when no clean control group exists. There are good, ready-made libraries for working with BSTS (such as Google's CausalImpact); you can get it done in 10-20 lines of code. Just don't forget to run the checks (see the block below).

Instrumental Variables (IV)

The gist: when a hidden factor (like motivation) influences both the user's choice and the final outcome, we find an external factor (an "instrument") that pushes users toward the action but does not directly affect the outcome itself.

In simple terms: find an "indirect lever" that moves only the thing you need.

Use case (academic): you want to measure the effect of TV advertising on sales, but ads are shown more in regions where people already buy more. The instrument could be the weather: on rainy days people watch more TV (and see the ads), but the weather itself does not directly make them buy your product.

Double Machine Learning (DML)

The gist: a modern approach that uses two ML models to "clean" both the treatment and the outcome from the influence of hundreds of other factors. By analyzing only what remains after this cleaning (the residuals), the method uncovers the pure causal effect. DML's main strength is exactly where A/B tests are impossible or very hard to run; most often these are self-selection situations, where users themselves decide whether to use a feature.

In simple terms: use ML to strip away all the "noise", leaving only the pure "causal" signal.

Use case: in a FinTech app, you launch a new premium feature, an AI assistant that analyzes spending and gives personalized saving advice. DML works well alongside other methods and can often be used when simpler approaches don't fit.

How do you make sure it all actually works?

Congratulations, you've come a long way by reading this whole overview. Fair enough, you may be thinking: these methods are quite complex; how can I be sure I did it right, and how can I trust the final result? That is exactly the right question.

The general idea behind checking the correctness of an estimation method is this: we measure the effect where it clearly shouldn't exist, just to make sure it isn't there. In A/B testing this is the A/A test: we run the experiment exactly as designed (same metrics, same split, and so on), except we don't show the new feature to either group, so we should see no difference between them. Quasi-experiments are a bit more complicated: each method has its own specifics and may come with its own ways to validate the implementation.

Robustness checks

To make sure the effect we found is not a fluke or a model error, we run a series of "stress tests". The idea is always the same: we create conditions under which the effect should not appear. Here are some key checks.

Placebo tests

How to do it: This test checks how unique your effect is compared to the other units in the dataset. We have one "treated" unit (exposed) and many "clean" units in a control pool (not exposed). One at a time, pretend each clean unit was the treated one and re-run the estimate.

What to expect: In an ideal world, none of these "fake" tests should show an effect as strong as our real case.

Why it's needed: If our method finds significant effects in units where nothing happened, our main finding may also be just noise or a statistical anomaly rather than a real effect.

Placebo in time

How to do it: For example, if the actual marketing campaign started on May 1, we "tell" the model it started on April 1, when nothing happened.

What to expect: The model should not detect any meaningful effect at this fake date.

Why: This helps ensure that the model is responding to our event and not to random fluctuations in the data or some seasonal trend that coincidentally occurred on the date of our intervention.
Placebo in space

How to do it: This check validates your model's reliability by testing its tendency to produce false positives on completely independent data. If you have data similar to the target data but unaffected by the intervention, use it. For example, if you launched a promotion in one region, take sales data from another region where the promotion did not run, and apply your model to it with the same actual intervention date.

What to expect: The model should find no effect on this "control" data.

Why: If your model finds an effect everywhere you apply it, you cannot trust its conclusions about the target series.

A decision map (instead of a conclusion)

If you have read (or scrolled) this far, I'm guessing you don't need another eloquent overview of why measuring the results of AI/ML features matters. A useful decision tool will be worth more to you. The framework looks like this:

1. If you can, measure with an A/B test. Seriously.
2. Think about different split units and clusters that keep an RCE applicable.
3. Below is a cheat sheet on choosing a causal inference method, so you can quickly find the one that fits. Go back to the section of the article where I explain it in plain terms; after that, move on to that method's handbooks and guides.

Useful materials

Used in writing this article and highly recommended for a deeper dive into the topic:

- "Machine Learning System Design" by Valerii Babushkin and Arseny Kravchenko: the full cycle of building AI/ML solutions.
- "Trustworthy Online Controlled Experiments" by Ron Kohavi, Diane Tang, and Ya Xu: the road into the world of RCEs.
- "Causal Inference: What If" by Miguel Hernán and James Robins: how to study causal inference in detail.
- "Causal Inference for the Brave and True"
- Causal ML Book