Let me share a story that I’ve heard too many times.
“…We were developing an ML model with my team, we ran a lot of experiments and got promising results…
…unfortunately, we couldn’t tell exactly what performed best because we forgot to save some model parameters and dataset versions…
…after a few weeks, we weren’t even sure what we had actually tried, so we needed to re-run pretty much everything.”
– unfortunate ML researcher.
And the truth is, when you develop ML models you will run a lot of experiments.
Those experiments may differ in the model, the parameters, the data, or the code, and as a result, they can produce completely different evaluation metrics.
Keeping track of all that information can very quickly become really hard. Especially if you want to organize and compare those experiments and feel confident that you know which setup produced the best result.
This is where ML experiment tracking comes in.
Experiment tracking is the process of saving all experiment-related information that you care about for every experiment you run. This “metadata you care about” will strongly depend on your project, but it may include things like code and dataset versions, hyperparameters, evaluation metrics, and model files.
Of course, you want to have this information available after the experiment has finished, but ideally, you’d like to see some of it as your experiment is running as well.
Why?
Because for some experiments, you can see (almost) right away that there is no way they will get you better results. Instead of letting them run (which can take days or weeks), you are better off simply stopping them and trying something different.
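To make the “metadata you care about” concrete, here is a minimal, made-up sketch of what a single run’s record could contain. The field names below are purely illustrative, not tied to any specific tool, and you would adapt them to your own project:

# Illustrative only: the kind of metadata you might want to capture for one run.
experiment_metadata = {
    "run_id": "exp-042",
    "code_version": "git commit a1b2c3d",                    # which code produced this run
    "dataset_version": "train_v3 (hash of the data files)",  # what the model was trained on
    "parameters": {"learning_rate": 0.001, "batch_size": 64, "epochs": 20},
    "metrics": {"val_accuracy": 0.92, "val_loss": 0.31},
    "environment": "python 3.9, torch 1.10, cuda 11.3",
    "artifacts": ["model_weights.h5", "confusion_matrix.png"],
}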
To do experiment tracking properly, you need some sort of system that deals with all this metadata. Typically, such a system has three components: an experiment database where the metadata is stored, a client library you call from your training code, and a dashboard where you browse and compare experiments.
Of course, you can implement each component in many different ways, but the general picture will be very similar.
Wait, so isn’t experiment tracking like MLOps or something?
Experiment tracking (also referred to as experiment management) is a part of MLOps: a larger ecosystem of tools and methodologies that deals with the operationalization of machine learning.
MLOps deals with every part of the ML project lifecycle: developing models, scheduling distributed training jobs, managing model serving, monitoring the quality of models in production, and re-training those models when needed.
That is a lot of different problems and solutions.
Experiment tracking focuses on the iterative model development phase when you try many things to get your model performance to the level you need.
So how is experiment tracking different from ML model management?
ML model management starts when models go to production.
But not every model gets deployed.
Experiment tracking is useful even if your models don’t make it to production (yet). And in many projects, especially those that are research-focused, they may never actually get there. But having all the metadata about every experiment you run ensures that you will be ready when this magical moment happens.
Ok, if you are a bit like me, you may be thinking:
Cool, so I know what experiment tracking is. …but why should I care?
Let me explain.
Building a tool for ML practitioners has one huge benefit. You get to talk to a lot of them.
And after talking to hundreds of people who track their experiments in Neptune, I saw 4 ways in which experiment tracking can actually improve your workflow.
There are many ways to run your ML experiments or model training jobs:
Sometimes you just want to test something quickly and run an experiment in a notebook. Sometimes you want to spin up a distributed hyperparameter tuning job.
Either way, during the course of a project (especially when there are more people working on it), you can end up having your experiment results scattered across many machines.
With an experiment tracking system, all of your experiment results are logged to one experiment repository by design. Keeping all of your experiment metadata in a single place, regardless of where you run your experiments, makes your experimentation process so much easier to manage.
“[experiment tracking system] allows us to keep all of our experiments organized in a single space. Being able to see my team’s work results any time I need makes it effortless to track progress and enables easier coordination.” – Michael Ulin, VP of Machine Learning @Zesty.ai
Specifically, a centralized experiment repository makes it easy to search, organize, and compare everything your team has tried.
Additionally, you can sleep peacefully knowing that all the ideas you tried are safely stored, and you can always go back to them later.
Whether you are debugging training runs, looking for improvement ideas, or auditing your current best models, comparing experiments is important.
But when you don’t have any experiment tracking system in place, and your results are scattered across machines and notebooks in different formats, something as simple as comparing and analyzing experiments can get difficult or even impossible.
With an experiment tracking system, your experiments are stored in a single place and logged using the same protocol, so those comparisons can go really deep, and you don’t have to do much extra work to get them.
“Tracking and comparing different approaches has noticeably boosted our productivity, allowing us to focus more on the experiments [and] develop new, good practices within our team…” – Tomasz Grygiel, Data Scientist @idenTT
Proper experiment tracking makes it easy to compare experiments side by side, parameter by parameter and metric by metric.
Modern experiment tracking tools will give you many of those comparison features (almost) for free. Some tools even go as far as to automatically find diffs between experiments or show you which parameters have the biggest impact on model performance.
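To give a rough sense of what such a comparison amounts to, here is a minimal sketch assuming your runs have been exported to a CSV file where each row is one run and the columns hold its parameters and metrics. The file and column names are made up for illustration:

import pandas as pd

# Assumed layout: one row per run, columns for parameters and metrics.
runs = pd.read_csv("experiment_runs.csv")

# Rank runs by the metric you care about.
best = runs.sort_values("val_accuracy", ascending=False).head(5)
print(best[["run_id", "learning_rate", "batch_size", "val_accuracy"]])

# A rough look at how a single parameter relates to performance.
print(runs.groupby("batch_size")["val_accuracy"].mean())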
When you have all the pieces in one place, you might be able to find new insights and ideas just by looking at all the metadata you logged. That is especially true when you are not working alone.
Speaking of which…
When you are part of a team, and many people are running experiments, having one source of truth for your entire team is really important.
“[An experiment tracking system] makes it easy to share results with my teammates. I’m sending them a link and telling what to look at, or I’m building a view on the experiments dashboard. I don’t need to generate it by myself, and everyone in my team has access to it.” – Maciej Bartczak, Research Lead @Banacha Street
Experiment tracking lets you not only organize and compare your past experiments but also see what everyone else was trying and how that worked out.
Sharing results becomes easier, too.
Modern experiment tracking tools let you share your work by sending a link to a particular experiment or dashboard view. You don’t have to send screenshots or “have a quick meeting” to explain what is going on in your experiment. It saves a ton of time and energy.
For example, here is a link to an experiment comparison I did months ago. Pretty easy, right?
Apart from sharing things you see in a web UI, most experiment tracking setups let you access experiment metadata programmatically. This comes in handy when your experiments and models go from experimentation to production.
For example, you can connect your experiment tracking tool to a CI/CD framework and integrate ML experimentation into your teams’ workflow. A visual comparison between the models on branches `master` and `develop` (and a way to explore details) adds another sanity check before you update your production model.
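As a rough sketch of what such a programmatic check could look like with the Neptune client used later in this article (the project name, run IDs, metric name, and the CI context are all placeholders; the details depend on your setup):

import neptune.new as neptune

# Reopen two previously logged runs in read-only mode (IDs are placeholders).
candidate = neptune.init(project="my-team/my-project", run="PROJ-123", mode="read-only")
production = neptune.init(project="my-team/my-project", run="PROJ-101", mode="read-only")

# Compare the metric that matters before promoting the new model, e.g. in a CI job.
if candidate["accuracy"].fetch() <= production["accuracy"].fetch():
    raise SystemExit("Candidate model is not better than the current production model.")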
When you are training a model on your local computer, you can see what is going on at any time. But if your model is running on a remote server at work, at a university, or in the cloud, it may not be as easy to see what the learning curve looks like or even whether the training job crashed.
Experiment tracking systems solve this problem because, while it may be a big security no-no to allow remote access to all of your data and servers, letting people see ONLY their experiment metadata is usually fine.
When you can see your running experiments right next to your previous runs, you can compare them quickly and decide whether it makes sense to continue. You can see that your cloud training job has crashed, and you can close it (or fix the bug and re-run).
Why waste those precious GPU hours on something that is not converging?
Speaking of GPU, some experiment tracking tools keep track of hardware consumption as well. This can help you see whether you are using your resources efficiently.
For example, looking at GPU consumption over time can help you see that your data loaders are not working correctly or that your multi-GPU setup is actually using just one card (which happened to me more times than I’d like to admit).
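If your tool of choice doesn’t record hardware metrics for you, a crude do-it-yourself version might look like the sketch below, which polls nvidia-smi from a background thread. The polling interval is arbitrary, and the print stands in for whatever logging call you actually use:

import subprocess
import threading
import time

def log_gpu_utilization(interval_seconds=30):
    """Periodically query nvidia-smi for GPU utilization and memory usage."""
    while True:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        print("GPU usage:", result.stdout.strip())  # replace with your tracker's logging call
        time.sleep(interval_seconds)

# Start the monitor alongside training without blocking it.
threading.Thread(target=log_gpu_utilization, daemon=True).start()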
“Without the information I have in the monitoring section, I wouldn’t know that my experiments are running 10 times slower than they could.” – Michał Kardas, Machine Learning Researcher @TensorCell
So far, we’ve covered what experiment tracking is and why it matters.
It’s time to get into details.
As I said initially, the kind of information you may want to track depends on the characteristics of your project.
That said, there are some things you should keep track of regardless of the project you are working on: code versions, data versions, parameters, evaluation metrics, and model files. Keeping track of those will let you reproduce experiments, do basic debugging, and understand what happened at a high level.
That said, you can always log more things to gain even more insights.
What else you could keep track of
The additional things you may want to keep track of are related to the type of project you are working on.
Below are some of my recommendations for various ML project types.
Machine Learning
Deep Learning
Computer Vision
Natural Language Processing
Structured Data
Reinforcement Learning
Hyperparameter Optimization
Ok, those are nice guidelines, but how do you actually implement experiment tracking in your project?
There are (at least) a few options, the most popular being spreadsheets with naming conventions, versioning experiment metadata in GitHub, and dedicated experiment tracking tools.
Let’s talk about those now.
A common approach is to simply create a big spreadsheet where you put all of the information that you can (metrics, parameters, etc.) and a directory structure where things are named in a certain way. Those names usually end up being really long, like ‘model_v1_lr01_batchsize64_no_preprocessing_result_accuracy082.h5’.
Whenever you run an experiment, you look at the results and copy them to the spreadsheet.
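In code, that manual workflow usually boils down to something like the sketch below; the file name and columns are made up for illustration:

import csv
from pathlib import Path

def record_run(run_name, params, metrics, log_file="experiments.csv"):
    """Append one experiment's parameters and metrics as a row in a shared CSV file."""
    row = {"run_name": run_name, **params, **metrics}
    write_header = not Path(log_file).exists()
    with open(log_file, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if write_header:
            writer.writeheader()
        writer.writerow(row)

record_run(
    "model_v1_lr01_batchsize64",
    params={"learning_rate": 0.01, "batch_size": 64},
    metrics={"accuracy": 0.82},
)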
What is wrong with that?
To be honest, in some situations, it can be just enough to solve your experiment tracking problems. It may not be the best solution but it is quick and simple.
But things can fall apart really quickly. There are (at least) a few major reasons why tracking experiments in spreadsheets doesn’t work for many people, starting with the fact that every result has to be copied over by hand.
Another option is to version all of your experiment metadata in GitHub.
The way you can go about it is to commit metrics, parameters, charts, and whatever else you want to keep track of to GitHub when running your experiment. It can be done with post-commit hooks, where you create or update some files (configs, charts, etc.) automatically after your experiment finishes.
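A rough sketch of that idea: at the end of your training script, dump whatever you want to keep to files and commit them, so the results are tied to a specific commit. The file names, values, and commit message are placeholders:

import json
import subprocess

# After training finishes, write the metadata you want to keep under version control.
with open("metrics.json", "w") as f:
    json.dump({"accuracy": 0.92, "val_loss": 0.31}, f, indent=2)
with open("params.json", "w") as f:
    json.dump({"learning_rate": 0.001, "batch_size": 64}, f, indent=2)

# Commit the updated files so this run lives in the repository history.
subprocess.run(["git", "add", "metrics.json", "params.json"], check=True)
subprocess.run(["git", "commit", "-m", "Log experiment results"], check=True)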
It can work in some setups, but GitHub simply wasn’t built for machine learning.
What should you do instead?
While you can try and adjust general tools to work for machine learning experiments, you could just use one of the solutions built specifically for tracking, organizing, and comparing experiments.
“Within the first few tens of runs, I realized how complete the tracking was – not just one or two numbers, but also the exact state of the code, the best-quality model snapshot stored to the cloud, the ability to quickly add notes on a particular experiment. My old methods were such a mess by comparison.” – Edward Dixon, Data Scientist @intel
They have slightly different interfaces, but they usually work in a similar way:
Step 1
Connect to the tool by adding a snippet to your training code.
For example:
import neptune.new as neptune

run = neptune.init(...)  # create a run; pass your project name and API token (credentials) here
Step 2
Specify what you want to log (or use an ML framework integration that does it for you):
from neptune.new.types import File

run['accuracy'] = 0.92  # log a single metric value

for prediction_image in worst_predictions:
    run['worst predictions'].log(File.as_image(prediction_image))  # log a series of images
Step 3
Run your experiment as you normally would:
python train.py
And that’s it!
Your experiment is logged to a central experiment database and displayed in the experiment dashboard, where you can search, compare, and drill down to whatever information you need.
Today, there are at least a few good tools for experiment tracking, and I would strongly recommend using one of them. They were designed to treat machine learning experiments as first-class citizens, and they keep all of your experiment metadata organized, comparable, and shareable in one place.
Experiment tracking is a practice even more than a tool or a logging method, and it will take some time to really understand and implement in your workflow.
Hopefully, after reading this article, you have a good idea of whether experiment tracking can improve your (or your teams’) machine learning workflow.
Do you want to start tracking your experiments?
Happy experimenting!
This article was originally written by Jakub Czakon and posted on the Neptune blog. You can find more in-depth articles for machine learning practitioners there.