Let me share a story that I've heard too many times.

"... We were developing an ML model with my team, we ran a lot of experiments and got promising results... unfortunately, we couldn't tell exactly what performed best because we forgot to save some model parameters and dataset versions... after a few weeks, we weren't even sure what we had actually tried, and we needed to re-run pretty much everything." – unfortunate ML researcher

And the truth is, when you develop ML models, you will run a lot of experiments.

Those experiments may:

- use different models and model hyperparameters
- use different training or evaluation data
- run different code (including that small change you wanted to test quickly)
- run the same code in a different environment (not knowing which PyTorch or TensorFlow version was installed)

And as a result, they can produce completely different evaluation metrics.

Keeping track of all that information can very quickly become really hard. Especially if you want to organize and compare those experiments and feel confident that you know which setup produced the best result.

This is where ML experiment tracking comes in.

In this article, you will learn:

- What ML experiment tracking is
- 4 ways in which it can improve your work
- What the best practices of ML experiment tracking are
- How to add experiment tracking to your workflow

What is ML experiment tracking?

Experiment tracking is the process of saving all the experiment-related information that you care about for every experiment you run.

This "metadata you care about" will strongly depend on your project, but it may include:

- Scripts used for running the experiment
- Environment configuration files
- Versions of the data used for training and evaluation
- Parameter configurations
- Evaluation metrics
- Model weights
- Performance visualizations (confusion matrix, ROC curve)
- Example predictions on the validation set (common in computer vision)

Of course, you want to have this information available after the experiment has finished, but ideally, you'd like to see some of it while your experiment is running as well.

Why? Because for some experiments, you can see (almost) right away that there is no way they will get you better results. Instead of letting them run (which can take days or weeks), you are better off simply stopping them and trying something different.

To do experiment tracking properly, you need some sort of a system that deals with all this metadata. Typically, such a system has 3 components:

- Experiment database: a place where experiment metadata is stored and can be logged and queried
- Experiment dashboard: a visual interface to your experiment database, where you can look at your experiment metadata
- Client library: gives you methods for logging and querying data from the experiment database

Of course, you can implement each component in many different ways, but the general picture will be very similar.
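To make those three components a bit more concrete, here is a minimal sketch (not any particular tool's API) of a toy client library that logs metadata to a local folder acting as the experiment database; a dashboard would simply read and visualize those files. The `ExperimentLogger` class, its methods, and the `runs/` directory are all made up for illustration.

```python
# A hypothetical "client library" that writes experiment metadata to a local
# folder (the "experiment database"). A real tool adds a server, a UI
# (the "experiment dashboard"), querying, and collaboration on top of this.
import json
import time
import uuid
from pathlib import Path


class ExperimentLogger:
    def __init__(self, root="runs"):
        self.run_id = uuid.uuid4().hex[:8]
        self.dir = Path(root) / self.run_id
        self.dir.mkdir(parents=True, exist_ok=True)
        self.metadata = {"run_id": self.run_id, "created": time.time(),
                         "params": {}, "metrics": {}}

    def log_params(self, params: dict):
        self.metadata["params"].update(params)
        self._flush()

    def log_metric(self, name: str, value: float):
        # keep the full history so learning curves can be plotted later
        self.metadata["metrics"].setdefault(name, []).append(value)
        self._flush()

    def _flush(self):
        (self.dir / "metadata.json").write_text(json.dumps(self.metadata, indent=2))


# usage
logger = ExperimentLogger()
logger.log_params({"lr": 0.01, "batch_size": 64})
for epoch in range(3):
    logger.log_metric("val_accuracy", 0.7 + 0.05 * epoch)  # placeholder values
```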
Wait, so isn't experiment tracking like MLOps or something?

Experiment tracking vs ML model management vs MLOps

Experiment tracking (also referred to as experiment management) is a part of MLOps: a larger ecosystem of tools and methodologies that deals with the operationalization of machine learning. MLOps covers everything from developing models and scheduling distributed training jobs, through managing model serving, to monitoring the quality of models in production and re-training those models when needed.

MLOps deals with every part of the ML project lifecycle.

That is a lot of different problems and solutions. Experiment tracking focuses on the iterative model development phase, when you try many things to get your model performance to the level you need.

So how is experiment tracking different from ML model management? ML model management starts when models go to production. It:

- streamlines moving models from experimentation to production
- helps with model versioning
- organizes model artifacts in an ML model registry
- helps with testing various model versions in the production environment
- enables rolling back to an old model version if the new one seems to be going crazy

But not every model gets deployed. Experiment tracking is useful even if your models don't make it to production (yet). And in many projects, especially those that are research-focused, they may never actually get there. But having all the metadata about every experiment you run ensures that you will be ready when this magical moment happens.

Ok, if you are a bit like me, you may be thinking: "Cool, so I know what experiment tracking is... but why should I care?"

Let me explain.

Why does experiment tracking matter?

Building a tool for ML practitioners has one huge benefit. You get to talk to a lot of them. And after talking to hundreds of people who track their experiments in Neptune, I saw 4 ways in which experiment tracking can actually improve your workflow.

All of your ML experiments are organized in a single place

There are many ways to run your ML experiments or model training jobs:

- Private laptop
- PC at work
- A dedicated instance in the cloud
- University cluster
- Kaggle kernel or Google Colab
- And many more

Sometimes you just want to test something quickly and run an experiment in a notebook. Sometimes you want to spin up a distributed hyperparameter tuning job.

Either way, during the course of a project (especially when there are more people working on it), you can end up having your experiment results scattered across many machines.

With an experiment tracking system, all of your experiment results are logged to one experiment repository by design. And keeping all of your experiment metadata in a single place, regardless of where you run your experiments, makes your experimentation process so much easier to manage.

"[An experiment tracking system] allows us to keep all of our experiments organized in a single space. Being able to see my team's work results any time I need makes it effortless to track progress and enables easier coordination." – Michael Ulin, VP, Machine Learning @Zesty.ai

Specifically, a centralized experiment repository makes it easy to:

- Search and filter experiments to find the information you need quickly (see the sketch after this list)
- Compare their metrics and parameters with no additional work
- Drill down and see what exactly it was that you tried (code, data versions, architectures)
- Reproduce or re-run experiments when you need to
- Access experiment metadata even if you don't have access to the server where you ran them

Additionally, you can sleep peacefully knowing that all the ideas you tried are safely stored, and you can always go back to them later.
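As a rough illustration of searching and filtering a centralized repository, the sketch below loads every run written by the hypothetical `ExperimentLogger` from the earlier sketch into a pandas DataFrame and filters it. The column names (`batch_size`, `val_accuracy`) are assumptions carried over from that sketch, and pandas is assumed to be installed.

```python
# Load every runs/<id>/metadata.json from the hypothetical local store into a
# DataFrame, then search, filter, and sort across all experiments at once.
import json
from pathlib import Path

import pandas as pd

rows = []
for meta_file in Path("runs").glob("*/metadata.json"):
    meta = json.loads(meta_file.read_text())
    rows.append({
        "run_id": meta["run_id"],
        **meta["params"],
        # keep only the last logged value of each metric for quick comparison
        **{name: values[-1] for name, values in meta["metrics"].items()},
    })

df = pd.DataFrame(rows)

# e.g. find the best-performing small-batch runs
best = df[df["batch_size"] <= 64].sort_values("val_accuracy", ascending=False)
print(best.head())
```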
Compare experiments, analyze results, debug model training with little extra work

Whether you are debugging training runs, looking for improvement ideas, or auditing your current best models, comparing experiments is important.

But when you don't have any experiment tracking system in place:

- the way you log things can change,
- you may forget to log something important,
- you may simply lose some information accidentally.

In those situations, something as simple as comparing and analyzing experiments can get difficult or even impossible.

With an experiment tracking system, your experiments are stored in a single place and you follow the same protocol for logging them, so those comparisons can go really deep. And you don't have to do much extra work.

"Tracking and comparing different approaches has noticeably boosted our productivity, allowing us to focus more on the experiments [and] develop new, good practices within our team..." – Tomasz Grygiel, Data Scientist @idenTT

Proper experiment tracking makes it easy to:

- Compare parameters and metrics
- Overlay learning curves
- Group and compare experiments based on data versions or parameter values
- Compare confusion matrices, ROC curves, or other performance charts
- Compare best/worst predictions on test or validation sets
- View code diffs (and/or notebook diffs)
- Look at hardware consumption during training runs for various models
- Look at prediction explanations like feature importance, SHAP, or LIME
- Compare rich-format artifacts like video or audio
- ...Compare anything else you logged

Modern experiment tracking tools will give you many of those comparison features (almost) for free. Some tools even go as far as to automatically find diffs between experiments or show you which parameters have the biggest impact on model performance.

When you have all the pieces in one place, you might be able to find new insights and ideas just by looking at all the metadata you logged. That is especially true when you are not working alone.

Speaking of which...

Improve collaboration: see what everyone is doing, share experiment results easily, access experiment data programmatically

When you are part of a team, and many people are running experiments, having one source of truth for your entire team is really important.

"[An experiment tracking system] makes it easy to share results with my teammates. I'm sending them a link and telling what to look at, or I'm building a view on the experiments dashboard. I don't need to generate it by myself, and everyone in my team has access to it." – Maciej Bartczak, Research Lead @Banacha Street

Experiment tracking lets you organize and compare not only your past experiments but also see what everyone else was trying and how that worked out.

Sharing results becomes easier, too. Modern experiment tracking tools let you share your work by sending a link to a particular experiment or dashboard view. You don't have to send screenshots or "have a quick meeting" to explain what is going on in your experiment. It saves a ton of time and energy.

For example, here is a link to an experiment comparison I did months ago. Pretty easy, right?

Apart from sharing things you see in a web UI, most experiment tracking setups let you access experiment metadata programmatically. This comes in handy when your experiments and models go from experimentation to production.

For example, you can connect your experiment tracking tool to a CI/CD framework and integrate ML experimentation into your team's workflow. A visual comparison between the models on branches `master` and `develop` (and a way to explore details) adds another sanity check before you update your production model.
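To give a flavour of that kind of programmatic access, here is a hypothetical CI sanity check (not any specific tool's API) that compares a candidate run logged on `develop` against the production run behind `master`, using the local metadata files from the earlier sketches. The `fetch_run` helper, the run ids passed on the command line, and the `val_accuracy` metric name are all assumptions.

```python
# Hypothetical CI gate: load metadata for the candidate and production runs
# from the local store used in the earlier sketches, and fail the pipeline
# if the candidate model is worse on the tracked validation metric.
import json
import sys
from pathlib import Path


def fetch_run(run_id: str) -> dict:
    return json.loads(Path("runs", run_id, "metadata.json").read_text())


candidate = fetch_run(sys.argv[1])   # e.g. the run id logged on `develop`
production = fetch_run(sys.argv[2])  # e.g. the run id behind `master`

cand_acc = candidate["metrics"]["val_accuracy"][-1]
prod_acc = production["metrics"]["val_accuracy"][-1]

print(f"candidate={cand_acc:.4f} production={prod_acc:.4f}")
if cand_acc < prod_acc:
    sys.exit("Candidate model is worse than production - blocking the merge.")
```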
See your ML runs live: manage experiments from anywhere and anytime

When you are training a model on your local computer, you can see what is going on at any time. But if your model is running on a remote server at work, at a university, or in the cloud, it may not be as easy to see what the learning curve looks like or even whether the training job crashed.

Experiment tracking systems solve this problem because, while it may be a big security no-no to allow remote access to all of your data and servers, letting people see ONLY their experiment metadata is usually fine.

When you can see your running experiments right next to your previous runs, you can compare them quickly and decide whether it makes sense to continue. You can also see that your cloud training job has crashed and close it (or fix the bug and re-run). Why waste those precious GPU hours on something that is not converging?

Speaking of GPUs, some experiment tracking tools keep track of hardware consumption as well. This can help you see whether you are using your resources efficiently.

For example, looking at GPU consumption over time can help you see that your data loaders are not working correctly or that your multi-GPU setup is actually using just one card (which happened to me more times than I'd like to admit).

"Without information I have in the monitoring section I wouldn't know that my experiments are running 10 times slower than they could." – Michał Kardas, Machine Learning Researcher @TensorCell

Experiment tracking best practices

So far, we've covered what experiment tracking is and why it matters. It's time to get into details.

What you should keep track of in any ML experiment

As I said initially, the kind of information you may want to track depends on the project characteristics. That said, there are some things you should keep track of regardless of the project you are working on. Those are:

- Code: all the code that is needed to run (and re-run) the experiment: preprocessing, training, and evaluation scripts, notebooks used for designing features, other utilities.
- Environment: the easiest way to keep track of the environment is to save environment configuration files like `Dockerfile` (Docker), `requirements.txt` (pip), or `conda.yml` (conda). You can also save the Docker image on Docker Hub, but I find saving configuration files easier.
- Data: saving data versions (as a hash or locations of data files) makes it easy to see what your model was trained on. You can also use modern data versioning tools like DVC (and save the .dvc files to your experiment tracking tool).
- Parameters: saving your run configuration is absolutely crucial. Be especially careful when you pass hyperparameters via the command line (argparse, click, hydra), as this is a place where you can easily forget to track them (I have some horror stories to share).
- Metrics: logging evaluation metrics on train, validation, and test sets for every run is pretty obvious. But different frameworks do it differently.

Keeping track of those things will let you reproduce experiments, do basic debugging, and understand what happened at a high level. That said, you can always log more things to gain even more insights. A minimal sketch of capturing these basics follows.
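Here is that sketch: a stdlib-only illustration of capturing the code version (git commit), the data version (a file hash), the command-line parameters (argparse), and the final metrics in a single record. The file names, paths, and the placeholder metric value are assumptions; in a real setup you would send this record to your experiment tracking tool instead of a local JSON file.

```python
# Minimal sketch of recording the basics: code version, data version,
# parameters, and metrics for a single run.
import argparse
import hashlib
import json
import subprocess
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--data", default="data/train.csv")
args = parser.parse_args()

record = {
    # code version: the exact commit the experiment ran on
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip(),
    # data version: a content hash of the training file
    "data_hash": hashlib.md5(Path(args.data).read_bytes()).hexdigest(),
    # parameters: everything that came from the command line
    "params": vars(args),
    # metrics: filled in after training/evaluation
    "metrics": {"val_accuracy": 0.92},  # placeholder value
}

Path("experiment_record.json").write_text(json.dumps(record, indent=2))
# the environment itself is captured separately, e.g.:
#   pip freeze > requirements.txt
```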
What else you could keep track of

The additional things you may want to keep track of are related to the type of project you are working on. Below are some of my recommendations for various ML project types.

Machine learning

- Model weights
- Evaluation charts (ROC curve, confusion matrix)
- Prediction distributions

Deep learning

- Model checkpoints (both during and after training)
- Gradient norms (to control for vanishing or exploding gradient problems)
- Best/worst predictions on the validation/test set after training
- Hardware resources: especially useful for debugging data loaders and multi-GPU setups

Computer vision

- Model predictions after every epoch (labels, overlaid masks or bounding boxes)

Natural language processing

- Prediction explanations on evaluation/test data (the eli5 text explainer is good)

Structured data

- Input data snapshot (`.head()` on training data if you are using pandas)
- Feature importances (permutation importance)
- Prediction explanations like SHAP or partial dependence plots (they are all available in DALEX)

Reinforcement learning

- Episode return and episode length
- Total environment steps, wall time, steps per second
- Value and policy function losses
- Aggregate statistics over multiple environments and/or runs

Hyperparameter optimization

- Run score: the metric you are optimizing with HPO after every iteration
- Run parameters: the parameters tried at each iteration
- Best parameters: the best parameters so far and the best parameters after the HPO sweep is finished
- Parameter comparison charts: there are various visualizations that you may want to log during or after training, like a parallel coordinates plot or a slice plot (they are all available in Optuna, by the way); see the sketch after this list
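Here is the promised sketch of tracking an HPO sweep with Optuna: a callback records the score and parameters of every trial, plus the best parameters so far, into a JSON log. The toy objective and the `hpo_log.json` file name are assumptions; in practice, you would log these values to your experiment tracking tool instead of a local file.

```python
# Track an Optuna sweep: record run score, run parameters, and the best
# parameters so far after every trial. The objective is a toy stand-in
# for real training code.
import json
from pathlib import Path

import optuna

history = []


def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    # ... train a model here and return its validation score ...
    return 1.0 - abs(lr - 0.01) - 0.001 * batch_size  # toy score


def tracking_callback(study, trial):
    history.append({
        "number": trial.number,
        "params": trial.params,        # run parameters for this iteration
        "score": trial.value,          # run score for this iteration
        "best_so_far": study.best_params,
    })
    Path("hpo_log.json").write_text(json.dumps(history, indent=2))


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20, callbacks=[tracking_callback])

print("Best parameters after the sweep:", study.best_params)
# parameter comparison charts, e.g.:
# optuna.visualization.plot_parallel_coordinate(study)
```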
How to set up experiment tracking

Ok, those are nice guidelines, but how do you actually implement experiment tracking in your project?

There are (at least) a few options, the most popular being:

- Spreadsheets + naming conventions
- Versioning configuration files with GitHub
- Using modern experiment tracking tools

Let's talk about those now.

You can use spreadsheets and naming conventions (but please don't)

A common approach is to simply create a big spreadsheet where you put all of the information that you can (metrics, parameters, etc.) and a directory structure where things are named in a certain way. Those names usually end up being really long, like 'model_v1_lr01_batchsize64_no_preprocessing_result_accuracy082.h5'.

Whenever you run an experiment, you look at the results and copy them to the spreadsheet.

What is wrong with that?

To be honest, in some situations, it can be just enough to solve your experiment tracking problems. It may not be the best solution, but it is quick and simple.

But things can fall apart really quickly. There are (at least) a few major reasons why tracking experiments in spreadsheets doesn't work for many people:

- You have to remember to track them. If something doesn't happen automatically, things get messy, especially with more people involved.
- You have to be sure that you or your team will not overwrite things in the spreadsheet by accident. Spreadsheets are not easy to version, so if this happens, you are in trouble.
- You have to remember to use the naming conventions. If someone on your team messes this up, you may not know where the experiment artifacts (model weights, performance charts) for the experiments you ran are.
- You have to back up your artifact directories (remember that things break).
- When your spreadsheet grows, it becomes less and less usable. Searching for things and comparing hundreds of experiments in a spreadsheet (especially if you have multiple people who want to use it at the same time) is not a great experience.

You can version metadata files in GitHub

Another option is to version all of your experiment metadata in GitHub.

The way you can go about it is to commit metrics, parameters, charts, and whatever you want to keep track of to GitHub when running your experiment. It can be done with post-commit hooks, where you create or update some files (configs, charts, etc.) automatically after your experiment finishes.

It can work in some setups, but:

- Git and GitHub weren't built for comparing machine learning objects. Git was designed for comparing two branches, `master` and `develop`, for example. If you want to compare multiple experiments, look at metrics, and overlay learning curves, you are out of luck. Comparing more than two experiments is not going to work.
- Organizing many experiments is difficult (if not impossible). You can have branches with ideas or a branch per experiment, but the more experiments you run, the less usable it becomes.
- You will not be able to monitor your experiments live; the information is only saved after your experiment has finished.

What should you do instead?

You can use one of the modern experiment tracking tools

While you can try and adjust general tools to work for machine learning experiments, you could just use one of the solutions built specifically for tracking, organizing, and comparing experiments.

"Within the first few tens of runs, I realized how complete the tracking was – not just one or two numbers, but also the exact state of the code, the best-quality model snapshot stored to the cloud, the ability to quickly add notes on a particular experiment. My old methods were such a mess by comparison." – Edward Dixon, Data Scientist @intel

They have slightly different interfaces, but they usually work in a similar way:

Step 1

Connect to the tool by adding a snippet to your training code. For example:

```python
import neptune.new as neptune

run = neptune.init(...)  # create a Run (pass your credentials)
```

Step 2

Specify what you want to log (or use an ML framework integration that does it for you):

```python
from neptune.new.types import File

run['accuracy'] = 0.92
for prediction_image in worst_predictions:
    run['worst predictions'].log(File.as_image(prediction_image))
```

Step 3

Run your experiment as you normally would:

```
python train.py
```

And that's it! Your experiment is logged to a central experiment database and displayed in the experiment dashboard, where you can search, compare, and drill down to whatever information you need.

Today there are at least a few good tools for experiment tracking, and I would strongly recommend using one of them. They were designed to treat machine learning experiments as first-class citizens, and they will always:

- be easier to use for a machine learning person than general tools
- have better integrations with the ML ecosystem
- have more experiment-focused features than the general solutions

Next steps

Experiment tracking is a practice even more than a tool or a logging method. It will take some time to really understand and implement:

- what to keep track of for your project,
- how to use that information to improve future experiments,
- how to improve your team's unique workflow with it,
- when to even use experiment tracking.

Hopefully, after reading this article, you have a good idea of whether experiment tracking can improve your (or your team's) machine learning workflow.

Do you want to start tracking your experiments?
- Create a free Neptune account
- Try Neptune on Colab (zero setup, no registration)

Are you hungry for more on the subject? Here are some additional resources:

- Article: A Complete Guide to Monitoring ML Experiments Live in Neptune
- Article: How to Set Up Continuous Integration for Machine Learning with Github Actions and Neptune

Happy experimenting!

This article was originally written by Jakub Czakon and posted on the Neptune blog. You can find more in-depth articles for machine learning practitioners there.