One of the wonders of software development is the invention of Git. With Git, you can manage different versions of your code base. The benefit of this is that you can introduce and test changes in the code with the assurance that if things go wrong you can always revert to the previous working version.
Another benefit of Git is how easy it makes collaboration. A project can be organized around a central repository. Each developer or subteam working on a particular feature can push changes into that repository through a specific branch. On top of this come GitHub and GitLab, where the project repositories can be managed remotely.
Data scientists and engineers have the same needs for their data. They need a way to manage different versions of data and to collaborate. Git, technically speaking, can do the job. However, it’s not ideal, largely because a Git repository stores every version of every file, which becomes unwieldy for large datasets.
This is where Data Version Control (DVC) comes in. Simply put, DVC is a data-focused version of Git. In fact, its features and workflows are almost exactly like Git’s.
Whereas a Git repository keeps the full contents of each version, DVC keeps only information (or metadata) about each version of the data. The actual data can be hosted remotely on data storage platforms.
What follows is an overview of how to start using DVC for your data science/engineering projects. By no means is this intended to be a comprehensive introduction/manual. But I hope this is enough to help you hit the ground running with DVC.
Installation
DVC works on Windows, Mac, and Linux. The official documentation provides more detailed installation instructions. For our purposes here, though, we’ll install it on Linux.
Installing DVC is very simple. Fire up a terminal and type the following command:
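The command itself is missing from this copy of the article. One common route, assuming Python 3 and pip are already available on the machine, is to install DVC from PyPI; the official documentation lists other options.

```shell
# Assumption: Python 3 and pip are already installed.
pip install dvc

# Confirm the install by printing the installed version.
dvc --version
```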
Initialize DVC
DVC works alongside a Git-tracked project. So if you want to use DVC, make sure you first have a project that has already been initialized with Git.
Inside such a Git-tracked repository, you can initialize DVC by running:
You will see that a couple of files are created. Checking with Git, we should find them:
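A minimal sketch of that step, assuming the current directory is the root of a repository that Git already tracks:

```shell
# Run from the root of the Git repository; this creates the .dvc/ directory.
dvc init
```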
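The check can be done with a plain status call. The exact file list depends on the DVC version; it typically includes `.dvc/config` and `.dvc/.gitignore`, and newer releases may also add a `.dvcignore` file.

```shell
# List new and changed files; the files created by DVC should appear here.
git status
```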
We now need to commit these files to Git:
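A sketch of the commit follows. The commit message is only a suggestion, and the `git add` is there in case the DVC files are not staged yet.

```shell
# Stage the DVC bookkeeping directory (harmless if it is already staged).
git add .dvc

# Record the DVC setup in Git history.
git commit -m "Initialize DVC"
```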
At this point, we have successfully installed and initialized DVC. We can now use it to track our data and changes to it.
In this section, we will cover just the two most basic tasks you need to know when using DVC: tracking data and accessing/reading data.
You can do much more with DVC, and the official documentation is the best guide to all of its other features.
Start Tracking Data
Every piece of data tracked by DVC will have its information stored in a .dvc file. This file has information specific to the data stored but not the data itself. Git tracks this .dvc file (i.e., new versions of the data create new versions of this same file).
You can run the `dvc add` command to start tracking a specific data file or a whole directory. For instance, to track a data file called `names.json` in the `data` directory of your repository, you do the following:
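For the `names.json` example described above, the call would look like this, assuming it is run from the repository root:

```shell
# Start tracking the file with DVC; the path is relative to the repository root.
dvc add data/names.json
```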
This creates a `data/names.json.dvc` file.
You can now commit this new file into the Git repository:
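A sketch of that commit, with a commit message chosen only for illustration. Besides the `.dvc` file, `dvc add` also writes a `.gitignore` entry next to the data so that Git ignores the raw file; committing both keeps the repository consistent.

```shell
# Stage the metadata file and the .gitignore that dvc add wrote in data/.
git add data/names.json.dvc data/.gitignore

# Commit the metadata; the data file itself stays out of Git.
git commit -m "Track data/names.json with DVC"
```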
While GitHub alone works for smaller projects, larger projects may require remote storage for data versioning. I personally use the free tool offered by DAGsHub. Think of DAGsHub as a GitHub for data science. DAGsHub Storage, a new feature of theirs, is a DVC remote that can be configured with just five commands.
While we could also use other cloud storage options such as AWS S3 or Google Cloud, these services require more setup and configuration. DAGsHub Storage offers the convenience of an out-of-the-box solution; you don’t need much configuration before you can start pushing your data into it.
First, of course, you need to go to DAGsHub and create an account.
Next, just like in GitHub, you need to create a new repository:
All the images in this section have been sourced from DAGsHub’s documentation.
A dialog then opens where you enter information about the repository, such as its name and description.
If you have an existing GitHub repository, you may connect it to your DAGsHub repository. You can do this by clicking on Create from the DAGsHub navigation:
In the option that follows, just fill in the necessary details about your GitHub repository, and your two repos are connected.
With DAGsHub Storage, we automatically have a DVC remote that we can push data to. As the preceding steps show, setting it up is easy.
Let’s start by running this command:
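The command is missing from this copy of the article. A plausible sketch is to register the DAGsHub repository as a DVC remote. The remote name `origin` and the URL pattern are assumptions, and `<username>` and `<repo-name>` are placeholders for your own account and repository.

```shell
# Assumption: the DVC remote URL is the DAGsHub repository URL with a .dvc suffix.
# Replace <username> and <repo-name> with your own values.
dvc remote add origin "https://dagshub.com/<username>/<repo-name>.dvc"
```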
After this, we must ensure that DVC can ask us for our DAGsHub credentials:
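One way to do this, assuming the remote is named `origin`, is to set DVC’s HTTP authentication options in the local configuration. The `--local` flag keeps these settings out of the config file that Git tracks, and `<username>` is a placeholder.

```shell
# Use HTTP basic authentication for this remote.
dvc remote modify origin --local auth basic

# Placeholder: replace <username> with your DAGsHub user name.
dvc remote modify origin --local user "<username>"

# Prompt for the password at push time instead of storing it on disk.
dvc remote modify origin --local ask_password true
```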
Now, to send our data to the cloud storage, simply run:
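Assuming the remote was registered under the name `origin`, the upload is a single call:

```shell
# Upload all DVC-tracked data to the remote named origin.
dvc push -r origin
```

Together with the one command that adds the remote and the three that configure credentials, this accounts for the five commands mentioned earlier.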
The last part of this is making sure DAGsHub Storage is directly connected.
Fortunately for us, DAGsHub automatically detects that bucket. Since we’re already tracking our files with DVC, we can now view them in our file viewer and pipeline views:
The files now have links that let us view or download them. Before connecting our remote storage to DAGsHub, the links to these files appear greyed out:
The benefit of this is that, to share these stored files with collaborators, we can simply send them the link to our DAGsHub repository. Unlike with GitHub, they don’t need to clone the whole repository.
DAGsHub Storage is evidently a great tool for managing large data science and modeling projects. It provides the benefits of version control and storage without the hassle of too much setup and configuration. With DAGsHub Storage, we can seamlessly integrate our data sets and models with our codebase repository. Moreover, we can store any kind of data (e.g., text data and images).
DVC is a marvelous tool for data scientists and engineers. It’s therefore essential to master it as you would Git or other development tools. The best step forward is to explore DVC’s official documentation to get a grip on the different commands and features.