Roughly Half of Data Scientists Consider Model Monitoring a Major Nuisance: Does It Have to Be So?

Written by fedulov | Published 2022/11/08
Tech Story Tags: model-monitoring | machine-learning | human-in-the-loop | survey-research | ml | ai | api | predictive-analytics

TL;DR: The Machine Learning (ML) life cycle involves six key stages: data collection, data processing, data labeling, model training/evaluation, model deployment, and model monitoring. The last stage of the chain (i.e., model monitoring) is one of the least liked and most dreaded stages of the whole life cycle, and many ML engineers working on AI products are often forced to “face the music” in the final stages. The big question is: how do you know if your model is doing a good job when new user data starts pouring in?

Typically, the Machine Learning (ML) life cycle involves six key stages: data collection, data processing, data labeling, model training/evaluation, model deployment, and model monitoring.

A recent large-scale project by one of the bigger data science communities produced a novel ML value chain landscape that explored the entire ML life cycle in detail. Shaped by numerous data scientists and ML engineers, the study – among other things – revealed that the last stage of the chain (i.e., model monitoring) was one of the least liked and most dreaded stages of the whole life cycle.

Facing the music

Few practitioners need to stop and wonder why that’s the case. The obvious answer has to do with the additive effect: each stage of the chain may contain problems, and by the time you get to the very end, they all tend to come together to rain on your parade. In fact, as some respondents of the study explained, this troublesome pattern emerges as early as development because, it turns out, “most solutions simply aren’t optimized at all for monitoring models past the initial stage.”

As a result, many ML engineers working on AI products are often forced to “face the music” in the final stages of the life cycle, which can sometimes look more like coded cacophony. To top it off, there’s the problem of project records. Model monitoring, as one of my colleagues put it, involves “a lot of sifting through logs to find intelligent insights quickly,” which can be exceptionally difficult because different specialists tend to have different working preferences and habits.

Live data

But even these issues can be overcome, so long as you’re diligent, disciplined, and consistently on the ball. Even so, the biggest problem for ML engineers still stands – model monitoring isn’t static. Say your logs are clean, your data is adequately labeled, and your model is supposedly up to the task. Be that as it may, it doesn’t mean you can take a breather. On the contrary, in many ways, that’s the start of the most important part of the process – the monetizable, live-product stage, which is presumably why you began this arduous journey in the first place. The big question is: how do you know if your model is doing a good job when new user data starts pouring in?

A telling example of how new user data may affect your deployed model is scented candle reviews during COVID-19. Let’s imagine you manufacture or sell scented candles and receive user feedback claiming that your candles have little or no scent. Today, we know that this happened because people infected with the coronavirus would temporarily lose their sense of smell – that is, it wasn’t about the product at all.

In other words, we’re witnessing a clear case of data (or distribution) drift due to external factors: the same reviews under different conditions can lead to very different conclusions. But if we unknowingly use these erroneous conclusions to retrain our model – perhaps compounding them with label drift (changes in the task’s details) or training skew (non-representative data) – the outcome can be genuinely undesirable.
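If you log your model’s predicted labels over time, even a simple statistical check can surface this kind of drift early. Below is a minimal sketch that compares two periods’ label counts with a chi-square test; the sentiment labels and counts are invented purely for illustration.

```python
# Sketch: flag distribution drift by comparing predicted-label counts
# across two periods with a chi-square test. The counts are made up.
from scipy.stats import chi2_contingency

baseline = {"positive": 820, "neutral": 130, "negative": 50}   # pre-drift period
current = {"positive": 610, "neutral": 150, "negative": 240}   # current period

labels = sorted(baseline)
table = [
    [baseline[label] for label in labels],
    [current[label] for label in labels],
]

chi2, p_value, dof, _ = chi2_contingency(table)
if p_value < 0.01:
    print(f"Possible drift: label distribution has shifted (p={p_value:.4f})")
else:
    print(f"No significant shift detected (p={p_value:.4f})")
```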

Human-in-the-loop monitoring

With all of that in mind, the prospect of model monitoring can make us ML engineers feel uneasy. After all, not only do we have to fix any problems left over from earlier stages, but we also have to process tons of new data, with every hurdle that brings along. One way I’ve been able to make the process less tedious and frantic is by adopting the so-called human-in-the-loop approach.

Typically, your ML inference process may look something like this:

[Diagram: incoming user data is fed to the deployed model, and the model’s predictions are saved to a database]
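In code, that might boil down to something like the sketch below: serve a prediction and write the input/prediction pair to a database for later inspection. The model object, the SQLite database, and the table layout are stand-ins for whatever your own stack uses.

```python
# Sketch: a bare-bones inference step that logs every prediction for later review.
# The model object, database choice, and table layout are stand-ins.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("predictions.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS predictions "
    "(id INTEGER PRIMARY KEY, input TEXT, label TEXT, created_at TEXT)"
)

def predict_and_log(model, user_input: str) -> str:
    label = model.predict(user_input)  # your deployed model
    conn.execute(
        "INSERT INTO predictions (input, label, created_at) VALUES (?, ?, ?)",
        (user_input, label, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return label
```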
In this scenario, we can make use of the predictions saved in the database. The trouble is that there’s no way to verify their validity within the current pipeline. What we can do to determine where our model stands is obtain ground-truth answers from human annotators and compare them to the machine’s answers. Our monitoring process will then look something more like this:

[Diagram: the same pipeline, extended so that the model’s input data is also sent to human annotators, whose answers serve as ground truth for the saved predictions]
In this expanded scenario, the data is given to real people – in the case of crowdsourcing platforms, crowd contributors selected for the specific task.
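Mechanically, that usually means exporting a sample of the logged inputs as annotation tasks. Here’s a minimal sketch that reuses the hypothetical predictions database from above; the file and column names are assumptions rather than any particular platform’s format.

```python
# Sketch: sample recent predictions and export them as annotation tasks.
# File and column names are assumptions, not any particular platform's format.
import sqlite3
import pandas as pd

conn = sqlite3.connect("predictions.db")
logged = pd.read_sql("SELECT id, input, label FROM predictions", conn)

# Audit a random sample rather than the full stream to keep labeling costs down.
tasks = logged.sample(n=min(500, len(logged)), random_state=42)
tasks = tasks.rename(columns={"input": "task_text", "label": "model_label"})
tasks.to_csv("annotation_tasks.csv", index=False)
```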

From here, there are two ways to go about it. The first is a two-step process: (a) get the annotators to answer the same questions the model faced (i.e., produce “golden sets”), then (b) compare these new sets to the model’s predictions. The second is a one-step process: get the annotators to look at the model’s input data and, based on that, rate the model’s predictions pulled from the database. Both approaches will tell you how your model is performing; the second is simpler and more efficient, but it can only be used to gauge the model, not to fine-tune it (unlike the first).
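As a rough sketch of the first, two-step route, here’s what the comparison itself can look like once the human answers have been aggregated into one golden label per task (the file and column names are, again, illustrative).

```python
# Sketch: approach (a) - score the model's stored predictions against
# aggregated human "golden" labels. File and column names are illustrative.
import pandas as pd

model_preds = pd.read_csv("annotation_tasks.csv")  # id, task_text, model_label
golden = pd.read_csv("golden_labels.csv")          # id, human_label (aggregated)

merged = model_preds.merge(golden, on="id")
accuracy = (merged["model_label"] == merged["human_label"]).mean()
print(f"Model agrees with human ground truth on {accuracy:.1%} of audited tasks")

# The disagreements are the interesting part: they point at drift or labeling issues.
merged[merged["model_label"] != merged["human_label"]].to_csv(
    "disagreements_for_review.csv", index=False
)
```

The disagreement file is also exactly the material you’d feed back into retraining if you go the fine-tuning route.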

In both cases, the evaluation data you gather can be plotted on a dashboard, where patterns and mistakes usually surface in ways that are easy to spot. Luckily, some platforms let you use their API alongside Python, which sits well with most ML stacks – there are no incompatibility issues or extra work involved.
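For the dashboard itself, even a few lines of matplotlib go a long way. The weekly numbers in the sketch below are made up purely for illustration.

```python
# Sketch: plot audited agreement over time with simple binomial error bars.
# The weekly figures are made-up illustration data.
import numpy as np
import matplotlib.pyplot as plt

weeks = ["W40", "W41", "W42", "W43"]
agreement = np.array([0.93, 0.92, 0.88, 0.81])  # share of tasks where model == human
n_audited = np.array([500, 500, 480, 510])      # audited tasks per week

# Normal-approximation standard error for a proportion.
stderr = np.sqrt(agreement * (1 - agreement) / n_audited)

plt.errorbar(weeks, agreement, yerr=1.96 * stderr, fmt="-o", capsize=4)
plt.ylabel("Agreement with human labels")
plt.title("Model vs. ground truth, weekly audits")
plt.ylim(0.5, 1.0)
plt.show()
```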

Test-drive and caveats

Practice makes perfect, so here’s a mock example to show how the whole process works in real life. It contains a hands-on demo with an API for model prediction and ready-made code to set up and run a short monitoring project.

In this purposefully simplified and by now “classical” example, the task is to identify images of cats and dogs using a hypothetical ML model, with the options being: (1) cat, (2) dog, and (3) neither – the last one being a red flag that something isn’t right, either with your model or with the incoming data.
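A handy side effect of that third option is that a sudden spike in “neither” answers is itself an alarm worth automating. Here’s a tiny sketch of that check; the file name and the 10% threshold are arbitrary assumptions.

```python
# Sketch: treat a spike in "neither" answers as an early-warning signal.
# The file name and the 10% threshold are arbitrary assumptions.
import pandas as pd

answers = pd.read_csv("human_answers.csv")  # columns: task, worker, label
share_neither = (answers["label"] == "neither").mean()

if share_neither > 0.10:
    print(f"Warning: {share_neither:.1%} of answers are 'neither'; "
          "check the incoming data or the model's inputs")
```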

As you’ll be able to see from the example, human-in-the-loop monitoring does provide a neat solution; however, it’s no magic pill. It’s important to consider the following caveats to ensure the best results if you decide to utilize this approach:

  • Since your model’s metrics depend on ongoing evaluation of user data, it’s crucial to also keep an eye on the quality of the human-produced labels you’ll be treating as ground truth. You can, for example, assign a small number of vetted annotators to act as quality assurance for the rest and/or use other quality control techniques.

  • Another way to ensure trustworthy results is to give the same data point not to one but to three or five human annotators, and then aggregate their answers using one of the methods provided in Crowd-kit, for example (see the sketch after this list).

  • Make sure you check aggregation metrics such as aggregation confidence, consistency, uncertainty, and/or inter-annotator agreement. For classification, Crowd-kit already includes these metrics.

  • Don’t forget to calculate margins of error and add error bars when plotting your graphs.

  • Always test whether changes to the model’s performance are statistically (or at least marginally) significant before moving forward.
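Putting a few of these points together, here’s a rough sketch: overlapping human answers are aggregated with Crowd-kit’s majority vote, and two audit periods are then compared with a chi-square test. The file names, the ID matching, and the counts in the test are illustrative assumptions.

```python
# Sketch: aggregate overlapping answers with Crowd-kit, then check whether the
# change between two audit periods is statistically significant.
# Column names follow Crowd-kit's task/worker/label convention (older versions
# used "performer"); the counts in the test are illustrative.
import pandas as pd
from crowdkit.aggregation import MajorityVote
from scipy.stats import chi2_contingency

answers = pd.read_csv("human_answers.csv")    # columns: task, worker, label
golden = MajorityVote().fit_predict(answers)  # one aggregated label per task

# Assumes task IDs in the answers file match the stored prediction IDs.
model_preds = pd.read_csv("annotation_tasks.csv").set_index("id")["model_label"]
agreement = (golden == model_preds.reindex(golden.index)).mean()
print(f"Aggregated agreement with the model: {agreement:.1%}")

# Compare correct/incorrect counts for two audit periods (made-up numbers).
last_week = [465, 35]   # [matches, mismatches]
this_week = [410, 90]
_, p_value, _, _ = chi2_contingency([last_week, this_week])
verdict = "statistically significant" if p_value < 0.05 else "not significant"
print(f"Week-over-week change is {verdict} (p={p_value:.3f})")
```

Majority vote is the simplest aggregation method in Crowd-kit; for harder tasks you may prefer something like Dawid-Skene, which weights annotators by their estimated skill.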

Bottom line

As we’ve seen, the whole ML life cycle may often be a rocky road for ML engineers. My advice is to use a combination of methods to combat ongoing challenges and get the highest quality results. It seems that in the model monitoring phase, the human-in-the-loop approach can offer a practical solution, which also happens to be quite fast and affordable.

Based on the results from my own projects, human-in-the-loop model monitoring can improve your AI product, provided no issues in the previous stages were swept under the carpet. At the same time, if something important was missed along the way, this approach may also help reveal these problems by having human annotators as your additional point of reference.


Written by fedulov | Data Scientist at Toloka, MSc in Applied Mathematics and Computer Science
Published by HackerNoon on 2022/11/08