Typically, the Machine Learning (ML) life cycle involves six key stages: data collection, data processing, data labeling, model training/evaluation, model deployment, and model monitoring.
A recent large-scale project by one of the bigger data science communities looked at how practitioners experience these stages, and the later ones, model monitoring above all, came out as the most troublesome.
Few practitioners stop to wonder why that’s the case. The obvious answer has to do with the additive effect: each stage of the chain may contain problems, and by the time you get to the very end, they all tend to come together to rain on your parade. In fact, as some respondents of the study explained, this troublesome pattern emerges as early as development because, it turns out, “most solutions simply aren’t optimized at all for monitoring models past the initial stage.”
As a result, ML engineers working on AI products are often forced to “face the music” in the final stages of the life cycle, which can sometimes look more like coded cacophony. To top it off, there’s the problem of project records. Model monitoring, as one of my colleagues put it, involves “a lot of sifting through logs to find intelligent insights quickly,” which can be exceptionally difficult because different specialists tend to have different working preferences and habits.
But even these issues can be overcome, so long as you’re diligent, disciplined, and consistently on the ball. Even then, however, the biggest problem for ML engineers still stands: model monitoring isn’t static. Say your logs are clean, your data is adequately labeled, and your model is supposedly up to the task. That still doesn’t mean you can take a breather. On the contrary, in many ways it’s the start of the most important part of the process: the monetizable, live-product stage, which is presumably why you began this arduous journey in the first place. The big question is: how do you know your model is doing a good job when new user data starts pouring in?
A telling example of how new user data may affect your deployed model is a system that draws conclusions from user reviews: once outside circumstances change, the same kind of review no longer means what it used to.
In other words, this is a clear case of data (or distribution) drift due to external factors: the same reviews under varying conditions can lead to very different conclusions. And if we unknowingly use these erroneous conclusions to retrain our model, possibly compounded by label drifts (changes in the task’s details) or training skews (non-representative data), it can easily lead to undesirable outcomes.
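To make the drift idea concrete, here is a minimal sketch of how a shift in a single input feature could be flagged, assuming you keep a sample of training-time inputs and a window of recent production inputs; the feature, the numbers, and the alert threshold are all made up for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical numeric feature (say, review length in characters),
# sampled once at training time and again from recent production traffic.
rng = np.random.default_rng(42)
training_sample = rng.normal(loc=120, scale=30, size=5_000)    # what the model was trained on
production_sample = rng.normal(loc=150, scale=45, size=5_000)  # what users send now

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the two
# samples no longer come from the same distribution, i.e. possible drift.
result = ks_2samp(training_sample, production_sample)

DRIFT_ALPHA = 0.01  # assumed alert threshold
if result.pvalue < DRIFT_ALPHA:
    print(f"Possible data drift (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
else:
    print("No significant drift detected in this feature")
```

A test like this won’t explain why the distribution moved, but it’s a cheap early-warning signal to run alongside the human checks described below.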
With all of that in mind, the prospect of model monitoring can make us ML engineers feel uneasy. After all, we not only have to fix any problems left over from earlier stages but also process tons of new data, with every hurdle that brings along. One way I’ve been able to make the process less tedious and frantic is by adopting the so-called human-in-the-loop approach.
Typically, your ML inference process may look something like this:
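In code, a bare-bones version of that pipeline might look like the sketch below; the scikit-learn-style model, the SQLite database, and the table layout are assumptions made purely for illustration:

```python
import json
import sqlite3

def run_inference(model, items, db_path="predictions.db"):
    """Run the model on incoming items and persist its predictions for later review."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(item_id TEXT PRIMARY KEY, input_data TEXT, prediction TEXT)"
    )
    for item_id, features in items:
        prediction = model.predict([features])[0]  # scikit-learn-style predict() assumed
        conn.execute(
            "INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)",
            (item_id, json.dumps(features), str(prediction)),
        )
    conn.commit()
    conn.close()
```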
In this scenario, the model’s predictions are saved to a database, but there’s no way of verifying their validity within the current pipeline. What we can do to determine where our model stands is obtain ground-truth answers from human annotators and compare them to the machine’s answers. Our monitoring process will then look something more like this:
In this expanded scenario, the data is given to real people, who in the case of a crowdsourcing setup are independent annotators working with the same inputs the model received.
From there, there are two ways of doing it. The first is a two-step process: (a) have the annotators answer the same questions the model faced, producing “golden sets,” and then (b) compare those sets to the model’s predictions. The second is a one-step process: have the annotators look at the model’s input data and, with that in front of them, rate the model’s predictions from the database. Both approaches tell you how your model is performing; the second is simpler and more efficient, but unlike the first, it can only be used to gauge the model, not to fine-tune it.
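Here is what the first, two-step comparison could boil down to once the golden sets are ready; the item IDs, labels, and helper name are hypothetical:

```python
def compare_to_golden_set(model_predictions, golden_labels):
    """Compare the model's stored predictions against human-made golden labels.

    Both arguments are dicts mapping item_id -> label. Returns the share of
    matching answers plus the disagreements, which are worth inspecting
    (and can feed back into fine-tuning).
    """
    shared_ids = set(model_predictions) & set(golden_labels)
    if not shared_ids:
        raise ValueError("No overlapping items between predictions and the golden set")

    disagreements = [
        (item_id, model_predictions[item_id], golden_labels[item_id])
        for item_id in shared_ids
        if model_predictions[item_id] != golden_labels[item_id]
    ]
    accuracy = 1 - len(disagreements) / len(shared_ids)
    return accuracy, disagreements


# Hypothetical usage with three annotated items:
accuracy, errors = compare_to_golden_set(
    {"img_1": "cat", "img_2": "dog", "img_3": "cat"},
    {"img_1": "cat", "img_2": "cat", "img_3": "cat"},
)
print(f"Agreement with the golden set: {accuracy:.0%}, disagreements: {errors}")
```

The one-step variant would instead record a simple accept/reject verdict per prediction, which gives you an approval rate to track but no corrected labels to retrain on.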
In both cases, the evaluation data you gather can be plotted on a dashboard, where patterns and mistakes usually emerge in ways that are easy to spot. Luckily, some annotation platforms let you use their API from Python, which sits well with most ML workflows: there are no incompatibility issues or much extra work involved.
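For instance, once the human verdicts start coming in, a rough dashboard can begin as nothing more than the daily agreement rate plotted over time; the numbers and the alert threshold below are invented for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical daily agreement between model predictions and human verdicts.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
agreement = [0.94, 0.93, 0.95, 0.91, 0.86, 0.84, 0.80]  # a downward trend worth investigating

plt.figure(figsize=(8, 4))
plt.plot(days, agreement, marker="o", label="model vs. human agreement")
plt.axhline(0.9, linestyle="--", color="red", label="alert threshold (assumed)")
plt.ylim(0.5, 1.0)
plt.ylabel("Agreement rate")
plt.title("Human-in-the-loop monitoring: daily agreement")
plt.legend()
plt.tight_layout()
plt.show()
```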
Practice makes perfect, so here’s a simple example to walk through.
In this purposefully simplified, by now “classical” example, the task is to identify images of cats and dogs with a hypothetical ML model, the options being: (1) cat, (2) dog, and (3) neither (the last being a red flag that something isn’t right, either with your model or with the incoming data).
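For this toy task, the monitoring check could be as small as the sketch below: count how often the humans and the model agree, and treat a growing share of “neither” verdicts as an early warning (all labels, data, and thresholds here are hypothetical):

```python
from collections import Counter

def summarize_verdicts(model_labels, human_labels):
    """Return overall agreement and the share of items humans marked as 'neither'."""
    pairs = list(zip(model_labels, human_labels))
    agreement = sum(model == human for model, human in pairs) / len(pairs)
    neither_rate = Counter(human_labels)["neither"] / len(human_labels)
    return agreement, neither_rate

# Hypothetical batch of verdicts on the same images.
model_labels = ["cat", "dog", "dog", "cat", "dog", "cat"]
human_labels = ["cat", "dog", "neither", "cat", "neither", "cat"]

agreement, neither_rate = summarize_verdicts(model_labels, human_labels)
print(f"Agreement: {agreement:.0%}, 'neither' rate: {neither_rate:.0%}")
if neither_rate > 0.1:  # assumed threshold
    print("High 'neither' rate: check the incoming data or the model itself")
```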
As you’ll see from the example, human-in-the-loop monitoring does provide a neat solution; however, it’s no magic pill. If you decide to use this approach, it’s important to consider the following caveats to get the best results:
As we’ve seen, the ML life cycle can be a rocky road for ML engineers. My advice is to use a combination of methods to combat the ongoing challenges and get the highest-quality results. In the model monitoring phase, the human-in-the-loop approach offers a practical solution that also happens to be quite fast and affordable.
Based on the results from my own projects, human-in-the-loop model monitoring can improve your AI product, provided no issues in the previous stages were swept under the carpet. At the same time, if something important was missed along the way, this approach may also help reveal these problems by having human annotators as your additional point of reference.