Consider a scenario - a lone Data Scientist works away at her system, trying to wade through a huge amount of data: cleaning, sorting, processing, and then building a model to run predictions on the newly processed data. The scientist has a bunch of tools at her disposal - Jupyter Notebooks, Airflow, Anaconda, Pandas, data storage, and a cloud virtual machine.
She trains the model for hours and hours, only to fall short of perfection - it doesn’t perform as well as it should. She looks out the window - it’s nightfall already. She has yet to test her model with a different set of parameters and to track a different set of metrics across her experiments.
She switches off her system, calls it a day, and will try the next day with another model, a different approach with a bunch of new data and parameters. This is a long process that might stretch for days…weeks…and months.
It is difficult to jump back to the point where she tried a specific combination of parameters for an experiment. Knowledge is sometimes lost, because not every experiment and every artifact related to the model gets saved. Tracking is crucial for improving an ML model.
I think this lone ranger scenario could be avoided if we had a comprehensive IDE-style environment where we could run multiple experiments, manage data, and track our code, experiment metrics, plots, models, and data artifacts. How cool would that be?
It sounds too good to be true, but this is what the DVC VS Code extension is attempting to do.
DVC is an excellent tool for tracking your experiments, models, and related artifacts, but it’s a CLI tool - something many in the data science community might not be comfortable or familiar with.
Gone are the days when you had to learn a bunch of pesky CLI commands.
Using DVC just got a whole lot easier and more fun.
The Iterative team brings you a VS Code extension that combines the power of the DVC CLI commands for data management, versioning, and experimentation with the sleek, elegant coding experience of the Visual Studio Code IDE.
The extension in its current form provides you with the following features:
The extension is integrated into the VS Code command palette. Press F1 to open the palette and type DVC to see the DVC-related commands at your disposal.
Gives you an in-depth view of the experiments run in the workspace, the equivalent of the dvc exp show command in CLI mode.
You can view the plots generated by the experiments run in the workspace, compare the plots of different experiments, and even watch the plots update in real time.
You can check the status of the workspace using this feature. You can also dvc push and dvc pull from this view.
A small window for tracking your resources in the workspace. From here you can perform file actions, pull specific resources, and manage the data within tracked datasets.
The View Container can be activated by clicking the DVC icon in the VS Code activity bar. It gives general information about the experiments and resources in the workspace.
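Since the experiments table mirrors dvc exp show, the same comparison can be scripted from the CLI output, which dvc exp show can emit as CSV with the --csv flag. The following is only a rough sketch of picking the best run by one metric. The helper name best_experiment, the column names (Experiment, accuracy), and the sample rows are all made up for illustration; real output will have different columns and values.

```python
import csv
import io
from typing import Dict, Optional


def best_experiment(csv_text: str, metric: str, higher_is_better: bool = True) -> Optional[Dict[str, str]]:
    """Return the row with the best value for `metric`, skipping rows where it is empty."""
    rows = [row for row in csv.DictReader(io.StringIO(csv_text)) if row.get(metric)]
    if not rows:
        return None
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: float(row[metric]))


# Made-up sample standing in for the output of `dvc exp show --csv`.
sample = """Experiment,accuracy
exp-a,0.81
exp-b,0.86
exp-c,
"""
print(best_experiment(sample, "accuracy"))
```

Inside VS Code the experiments table does this kind of sorting and comparison for you; the script is only useful when you want the same answer outside the editor.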
Here are some advantages compared to CLI alone when you use the extension:
Using the DVC extension can be summarized in four steps:
Make sure you have DVC installed on your system. You can run the following command in your terminal:
$ pip3 install dvc
Or you can follow the guide given here for OS-specific installation.
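Because the extension relies on finding the DVC binary, it can help to confirm that dvc is reachable before opening VS Code. Here is a minimal sketch using only the Python standard library; the helper name find_dvc is invented for this example.

```python
import shutil
import subprocess
from typing import Optional


def find_dvc(executable: str = "dvc") -> Optional[str]:
    """Return the version string printed by the DVC CLI, or None if it cannot be found."""
    path = shutil.which(executable)
    if path is None:
        return None
    # `dvc --version` prints the installed version and exits.
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=False)
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = find_dvc()
    print(f"DVC found: {version}" if version else "DVC not found on PATH; install it first.")
```

If this reports that DVC is missing even though you installed it, the binary probably lives in a virtual environment that is not active in the current shell.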
Open VS Code, go to the extensions menu, search for DVC, and click Install.
Now you have the DVC extension ready to go. To get familiar with the extension, we will download a sample ML project.
You can download the sample project from the repo. Open the folder in VS Code. The DVC extension should detect the DVC binary and the Python environment.
If you have a specific environment, you can press F1 and select DVC: Setup The Workspace. Provide the DVC binary path and the Python environment binary path.
You can view the DVC experiments in the current workspace in the DVC view container tab.
To begin our experimentation, we need to pull the data. Press F1 to open the VS Code command palette and run the DVC pull command from the list.
You can view the output by selecting DVC: Show DVC Output.
Note: As of now, the team is still working on the DVC remote storage option in the VS Code plugin, so you will have to set your storage remote via the command line or the config file.
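On the command line, a default remote is added with dvc remote add. The sketch below only assembles that command so you can review it before running it. The helper name remote_add_command, the remote name storage, and the s3://example-bucket/dvc-store URL are placeholders, not values from the sample project.

```python
import shlex
from typing import List


def remote_add_command(name: str, url: str, default: bool = True) -> List[str]:
    """Build the argv for `dvc remote add`, optionally marking the remote as default."""
    cmd = ["dvc", "remote", "add"]
    if default:
        cmd.append("--default")  # same as -d: make this the remote used by dvc push / dvc pull
    cmd += [name, url]
    return cmd


# Placeholder name and URL: replace them with your own storage location.
cmd = remote_add_command("storage", "s3://example-bucket/dvc-store")
print(shlex.join(cmd))
# To execute it from the project root: subprocess.run(cmd, check=True)
```

Running the resulting command records the remote in the project's .dvc/config file, which is the config file route mentioned in the note.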
You can change the parameters in the params.yaml file and select DVC: Modify Experiment Param(s), Reset and Run in the VS Code command palette.
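For comparison, the CLI counterpart of changing a parameter and running is dvc exp run with --set-param. Below is a rough sketch that expands a small grid of values into one command per combination. The helper name exp_run_commands and the parameter names train.epochs and train.lr are invented for illustration; use the keys that actually appear in your params.yaml.

```python
import itertools
import shlex
from typing import Dict, Iterator, List, Sequence


def exp_run_commands(grid: Dict[str, Sequence[object]], queue: bool = False) -> Iterator[List[str]]:
    """Yield one `dvc exp run` argv per combination of parameter values."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run"]
        if queue:
            cmd.append("--queue")  # queue the experiment instead of running it right away
        for name, value in zip(names, values):
            cmd += ["--set-param", f"{name}={value}"]
        yield cmd


# Hypothetical parameter names; they must match keys in your own params.yaml.
grid = {"train.epochs": [10, 20], "train.lr": [0.01]}
for cmd in exp_run_commands(grid):
    print(shlex.join(cmd))
```

The extension wraps this same idea in a prompt, so you pick the parameter and type the new value instead of assembling the flags by hand.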
You can check your experiments and view the plotted graphs using the extension as well.
And the cherry on top is that the extension allows you to cherry-pick your experiments. Pun intended!
That’s not all, you can run individual experiments and change specific parameters.
If you wish to view your graphs live for experiments that take a lot of time - say, a DL model with a lot of epochs - you can view them in real time as well. Just run your experiment and click on the plots button in the DVC tray.
When all is said and done, you can commit and push your changes as well.
The Iterative team is going to add more exciting features to the extension soon. Stay tuned.
Don’t let us keep you. Go ahead and start experimenting. Happy DVC time!
As an MLOps practitioner, I deal with various challenges when working with different data science teams. There are various tools available in the market - both paid and open-source. I tend to lean towards open-source tools, as there is a kinship with a community that is actively helping strangers across the world solve similar problems.
This approach is of great significance for the ML community, as we are still in the adoption stage, where a good tool can help you solve your problems faster and with more confidence. A centralized tool integrated with multiple stages of the ML pipeline goes a long way in helping data science teams solve problems; they can focus more on model improvement than on infrastructure and setup - this is what drew me to DVC.
A shout out to the team at Iterative for creating this wonderful extension, hoping to see more magic in the future.