Optimize Model Training with a Data Streaming Client

by Eryk Lewinson, February 1st, 2023

Too Long; Didn't Read

I will show you how to solve potential bottlenecks in training models using DagsHub’s new data streaming client.

As data scientists, we often want to start training models as soon as possible, and not only because we are impatient by nature. For example, we might want to test a very small change in someone’s project without sitting through a lengthy setup phase just to validate such a minor modification.


That is especially apparent when working with computer vision and large image datasets. Were you ever annoyed when you had to pull a massive dataset (versioned using DVC) before training your model? Or maybe you had to pull the entire dataset just to inspect/modify a single image or a few of them? If that sounds familiar, I will show you how to solve those potential bottlenecks using DagsHub’s new data streaming client.

What is the data streaming client?

DagsHub’s open-source streaming client is called Direct Data Access (DDA), and it essentially allows us to stream data from any DagsHub repository. In other words, our Python code will act as if all the files exist locally, while in reality the client dynamically downloads them on request. We can also upload data to any project hosted on DagsHub. Most importantly, using the streaming client does not require extensive changes to our codebase: one or two additional lines of code are often enough to fully enjoy the benefits of the new client.


In a nutshell, DDA offers the following:

  • the ability to stream data in batches from any DagsHub repository,
  • appending new data to existing datasets without pulling the entire dataset first,
  • smart caching - files remain available after they are streamed,
  • saving time, resources, and memory.


After the high-level overview of the new client, let’s dive a bit deeper into how it actually works. Under the hood, there are two main implementations of DDA.


The first one employs Python hooks: it detects calls to Python’s built-in file operations (such as open, write, etc.) and modifies them to retrieve files from the DagsHub repository if they are not found locally. The biggest advantage of this approach is that most popular Python libraries will work with it automatically.


While this already sounds great, we should be aware of some limitations of this approach:

  • The method will not work with frameworks that rely on I/O routines written in C/C++, for example, OpenCV.
  • DVC commands such as dvc repro and dvc run won’t work for stages that have DVC-tracked files in deps.


The second implementation is called Mounted Filesystem, and it is based on FUSE (Filesystem in USErspace). Behind the scenes, the client creates a virtual mounted filesystem reflecting our DagsHub repository. That filesystem behaves like part of our local filesystem: for example, we can browse all the files in a directory and view them.


As always, the burning question is which implementation to use. The documentation suggests that if you are working on Windows or macOS and none of your frameworks/libraries rely on C, you should use the Python Hooks approach; otherwise, use the Mounted Filesystem. Please refer to the documentation for a comprehensive comparison of the two approaches.

Use cases of the streaming client

We already know what the streaming client is and how it works. Let’s now cover a few of its potential use cases to show how we can use it in our workflows.


First and foremost, we can use it to reduce the time needed to start a training run. Imagine we are working on an image classification problem. With the streaming client, we do not have to download the entire dataset before training starts: Python behaves as if the files were already available, so training begins immediately. Since data loaders are frequently optimized to load the next batch while the current one is being used for training, the streaming client simply downloads images in batches as they are requested and caches them for later use. With this approach, we can potentially save a lot of time and resources; the latter can be measured by, for example, the runtime of a virtual machine.
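The batching pattern above can be sketched as a plain Python generator: files are opened only when their batch is requested, which is exactly the access pattern the streaming client turns into on-demand downloads. This is a toy loader for illustration, not DagsHub or Keras code; in a real project you would use your framework’s data loader, which behaves the same way under the hood.

```python
import os
import tempfile

def batch_stream(file_paths, batch_size):
    """Yield batches of raw file contents. Each file is opened only when
    its batch is requested, so with install_hooks() active every open()
    would trigger the download of just that file (and cache it)."""
    for start in range(0, len(file_paths), batch_size):
        batch = []
        for path in file_paths[start:start + batch_size]:
            with open(path, "rb") as f:  # fetched on demand, then cached
                batch.append(f.read())
        yield batch

# Demo with local placeholder files standing in for streamed images:
root = tempfile.mkdtemp()
paths = []
for i in range(5):
    p = os.path.join(root, f"img_{i}.png")
    with open(p, "wb") as f:
        f.write(b"pixels")
    paths.append(p)

print([len(b) for b in batch_stream(paths, batch_size=2)])  # -> [2, 2, 1]
```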


Second, we can train our models only on subsets of the entire dataset. Imagine having a dataset with images from thousands of classes, but for your particular experiment, you only want to train using a few of the selected categories. With the streaming client, your Python code will only load the data that is actually requested for training.
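To make this concrete, here is a small sketch of selecting only a few classes before training. It assumes a hypothetical <root>/<class_name>/<image> directory layout (adapt it to your dataset); the key point is that listing paths is cheap, and with the streaming client files are only downloaded once they are actually opened.

```python
import os
import tempfile

def select_classes(root, wanted):
    """Collect image paths only for the requested classes, assuming a
    <root>/<class_name>/<image> layout (hypothetical). Listing paths
    does not download anything; files are fetched only when opened."""
    paths = []
    for cls in wanted:
        class_dir = os.path.join(root, cls)
        paths.extend(os.path.join(class_dir, name)
                     for name in sorted(os.listdir(class_dir)))
    return paths

# Demo with a synthetic directory tree standing in for the repository:
root = tempfile.mkdtemp()
for cls in ["cardiomegaly", "effusion", "pneumonia"]:
    os.makedirs(os.path.join(root, cls))
    for i in range(2):
        open(os.path.join(root, cls, f"{i}.png"), "w").close()

print(len(select_classes(root, ["pneumonia", "effusion"])))  # -> 4
```

Feeding only this filtered list to the data loader means the other classes are never downloaded at all.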


Lastly, we can also use the streaming client to upload data to our DVC-versioned repositories. In an especially frustrating scenario, we might have tens of thousands of images in our dataset and we would like to add 10 more images that we gathered. Normally, we would have to pull all the data, add the new images and version the entire modified dataset again. With DDA, we can just directly add the 10 new images without downloading the entire dataset.

A hands-on example

I hope that the mentioned use cases got you excited about giving the streaming client a go! To use the benefits of the streaming client, we need to store our project on DagsHub and version the data with DVC. Please refer to this article to see how to set everything up.


The goal of this tutorial is to show how quickly we can get a project up and running using the streaming client. That is why we forked an already existing project and we will show how to quickly start the training of our neural network in the cloud. For that purpose, we will use Google Colab, but you can use other cloud platforms such as AWS, etc.


The project we forked is a TensorFlow implementation of CheXNet, a 121-layer CNN used for detecting pneumonia from chest X-rays. The original implementation is trained on the ChestX-ray14 dataset (available on Kaggle), currently the largest publicly available chest X-ray dataset, with over 100,000 images covering 14 diseases. The full dataset is over 45 GB, so to reduce the training time of this experiment, we use a subsampled version that is only 1 GB.


Before starting, it is worth mentioning that using DDA is not always beneficial. Before this project, I tried DDA on my Mario vs. Wario classifier and hit an edge case: a training step on a single batch finished faster than the on-demand download of the next batch. That made the first epoch take much longer than simply downloading the entire dataset up front and then training normally. Hence, the images used for training must be large enough (in terms of resolution/file size) that a training step on one batch takes longer than downloading the next one.
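This break-even condition can be written down as a simple rule of thumb. The numbers below are purely illustrative, not measurements from the project: streaming hides download latency only when a training step takes at least as long as fetching the next batch in the background.

```python
def streaming_pays_off(step_seconds, batch_download_seconds):
    """True if downloads can be fully hidden behind compute: the training
    step on one batch takes at least as long as fetching the next batch."""
    return step_seconds >= batch_download_seconds

# Small, low-resolution images: steps are too fast, downloads dominate.
print(streaming_pays_off(step_seconds=0.05, batch_download_seconds=0.4))  # -> False

# Larger images: longer steps, so downloads overlap with compute.
print(streaming_pays_off(step_seconds=1.2, batch_download_seconds=0.8))  # -> True
```

In the first case, pulling the whole dataset up front is the better choice, which matches the Mario vs. Wario experience described above.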


To follow along, you can use this Notebook in Google Colab.

Setting up the project

I chose to fork the project to my own account in order to use features such as experiment tracking with MLflow. However, you can just as well directly clone a repo and train the model without creating a fork.


After opening a Notebook, we need to install some libraries:

!pip install dagshub omegaconf mlflow


Then, we import the required libraries:

import datetime
import os
from IPython.display import display, Image


The list of imports is rather short, as most of the libraries (for example, keras used for training the NNs) are loaded in the individual scripts. Then, we define a few constants containing values such as the owner of the repository, the name of the repository we want to clone, and the branch we want to use.

# repo to use
REPO_OWNER = "eryk.lewinson"
REPO_NAME = "CheXNet"
REPO_BRANCH = "streaming"


Having defined the required constants, we can use the handy init functionality to initialize the DagsHub repository. It lets us easily log experiments with MLflow, get access tokens, etc. We run the following snippet to set up the project in Colab:

import dagshub
dagshub.init(repo_name=REPO_NAME, repo_owner=REPO_OWNER)


After running the cell, we will be prompted to authenticate in our browser. Then, we can clone the repository using the git clone command:

!git clone https://{USER_NAME}:{dagshub.auth.get_token()}@dagshub.com/{REPO_OWNER}/{REPO_NAME}.git


As the next step, we change the directory to the CheXNet one:

%cd /content/{REPO_NAME}


And then we change the branch to the one that contains the codebase we can use together with the streaming client:

!git checkout {REPO_BRANCH}


As we have mentioned before, the master branch uses the full 45GB dataset, while the streaming branch trains the neural network on a subsampled dataset.


At this point, we are done with the setup and are ready to take the streaming client for a spin.

Inspecting the dataset

First, let’s explore our dataset. After cloning the codebase, we can tell that the data is versioned with DVC from the corresponding tracking files. However, we cannot actually inspect it from Colab. At least not yet :)


The directory of our project


Let’s use the streaming client for the task. In this tutorial, we use the Python Hooks implementation of DDA. As such, we need to run the following two lines:

from dagshub.streaming import install_hooks
install_hooks()


At this point, we can inspect the images using code, even though the folders are not visible in the directory tree. As we know the structure of the directories containing the images, we can use the following snippet with a relative path to print all the images in a requested directory:

os.listdir("data_labeling/data/images_001/images")


As you can see, this is regular Python code and the streaming client handles everything for us in the background. Executing the snippet prints the following list:

['00000417_005.png',
 '00000583_047.png',
 '00000683_002.png',
 '00000796_001.png',
 '00001088_024.png',
 '00000013_006.png',
 '00000547_009.png',
 '00000640_000.png',
 '00000732_004.png',
…]


We can also display the image using the following snippet:

display(Image(filename="data_labeling/data/images_001/images/00000417_005.png"))


An example image from the chest X-ray dataset


As we have mentioned before, the streaming client downloads the data on request and caches it for later use. As such, the image we have displayed was actually downloaded and we can see it in the directory tree:

The directory after downloading the first image


Now we can take the next step and start training the neural network!

Training a neural network

As we have already mentioned, we do not have to spend time modifying the code used for training the NN. We just need to add the already familiar two lines of code to the train.py script. You might ask why we are doing this again when we have already executed that cell before. That is because a script called from Colab runs in a separate Python context, so the streaming setup no longer applies to it.


To make the codebase even more generic, we can use the following conditional statement combined with the install_hooks setup:

import sys

if "STREAM" in sys.argv:
    from dagshub.streaming.filesystem import install_hooks
    install_hooks()


sys.argv is essentially a list containing the command-line arguments, that is, the values passed while calling the script. To illustrate, we can run the training script using the following snippet:

%run modeling/src/train.py STREAM


To run the script without utilizing the streaming client, we simply have to remove the STREAM argument.
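To see exactly how the flag check behaves, we can simulate the sys.argv list that the interpreter would build for each invocation (a simulation only; in the real script, Python populates sys.argv for us):

```python
# Simulate how Python populates sys.argv for:
#   python modeling/src/train.py STREAM
with_flag = ["modeling/src/train.py", "STREAM"]
without_flag = ["modeling/src/train.py"]

# The same membership test used inside train.py:
print("STREAM" in with_flag)     # -> True
print("STREAM" in without_flag)  # -> False
```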


As soon as we run the command, we will see the directory tree start populating with the images requested for a particular batch. And that was it! By adding only two lines of code, we can leverage the streaming client to download the versioned data on demand.


Additionally, you might want to track your experiment with MLflow. To do so using DagsHub’s setup, you also need to modify the MLFLOW_TRACKING_URL constant in the src/const.yaml file.

Wrapping up

In this article, we demonstrated how to use DagsHub’s streaming client to download a dataset on demand. This way, we can start training our models almost immediately, without first pulling all of the versioned data to our machine, be it local or in the cloud. As you could see, the streaming client really shines for tasks such as computer vision, but it can come in handy for other problems as well.


You can find the Notebook used in this article here. The project’s codebase is available here. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.
