Image Classification on Azure With Dagshub Direct Data Access

Written by aiswaryaram | Published 2022/11/11
Tech Story Tags: mlops | artificial-intelligence | image-classification | microsoft-azure | machine-learning | dagshub | datastreaming

TLDR: Using Dagshub Direct Data Access and the Azure ML SDK, we train a model on Azure without storing data on-prem. We use the data from Kaggle's Mayo Clinic - STRIP AI challenge to show how we can stream data in batches from a Dagshub repo, saving GPU cost and time.

Training an Image Classification Model on Azure without storing data on-prem using Dagshub Direct Data Access and Azure ML SDK

Introduction

Machine Learning and Artificial Intelligence have become ubiquitous, and one of the key skills of a Data Scientist is building models that scale, handle sensitive data, and rely on reproducible data and model pipelines. While cloud technologies allow us to build scalable pipelines, we typically need to store all the files on the cloud before we can use them for training, which incurs GPU cost and the time needed to pull the data.

In this article, we are going to train a model on Azure with data stored in a Dagshub repo, using the Direct Data Access (DDA) feature from Dagshub, which allows us to pull data in batches only when required. This reduces the time taken to pull and store the data on Azure, thereby reducing GPU cost as well.

We will use the Azure ML SDK to create a training job right from our local machine and push it to Azure. We will also see how we can monitor the job right from our Jupyter notebook.

We will use the data from Kaggle's Mayo Clinic - STRIP AI challenge, where the focus is on differentiating between the two major acute ischemic stroke (AIS) etiology subtypes: cardiac and large artery atherosclerosis. Detecting the etiology of strokes can help physicians mitigate recurrent strokes.

The code and the data used in this analysis can be found here.

About the Data

The data originally provided for the challenge is in TIFF format and is 356GB in size. For the purpose of this analysis, we downscale the images and store them in PNG format. The code for this downscaling can be found here, and the downscaled images can be found here. To map the images to the classes they belong to, the competition provides a CSV file that contains information about the data. This file can be found here.

The “label” column in the training dataset contains the class each image belongs to. The train data folder consists of 754 images, out of which 547 belong to class CE (cardiac) and 207 belong to class LAA (large artery atherosclerosis). For our analysis, we have created a column called “int_labels” that assigns a value of 1 if the image belongs to class “LAA” and 0 otherwise.

Approach

To use Azure to train our models, we will use the Azure ML SDK and run our code on Azure. This requires us to create an ML Workspace and a GPU compute on Azure. Generally, we would also upload our data to Azure Storage and create Datastores so that the ML Workspace can access it. Here, we are not going to store the data on Azure; instead, we stream it from the Dagshub repository.

To set up an Azure Workspace and create a compute using the Azure ML SDK, refer here.

Dagshub is often referred to as “GitHub for ML”, as it allows us not only to maintain our code but also to version-control our data.

Data Version Control (DVC) allows us to track changes in large data files, similar to how Git helps us track changes in our code. To enable DVC, we can use any cloud storage such as AWS, GCP, or Azure, which can be expensive. Just as we do a git pull to get the latest code from Git, we do a dvc pull to get the data versioned with DVC. Similar to Git commands, we also have dvc push, pull, and commit commands.

For this project, I created a Repo on Dagshub and uploaded all my data files and code. All the files and folders that use DVC are marked as DVC in the repo.


Once we have the data pushed to Dagshub, we will create scripts that use the streaming functionality from DDA to fetch data from the repository without having to do a dvc pull, which can take much longer.

Data Streaming using Dagshub’s Direct Data Access

Traditionally, if we want to build a model or read data that is present in a Data Version Control system, we do a pull and extract all the data files. But often we do not need all the files, and pulling large files can be time-consuming and incur CPU/GPU time as well. To avoid this, Dagshub introduced the Streaming API, which is part of Dagshub's Direct Data Access (DDA) feature.

pip install dagshub # Install the Dagshub package for DDA.

We then clone the Dagshub repo using the GitPython library.

## Cloning the Repo
import git

git_url = (
    "https://" + DAGSHUB_USER_NAME + ":" + DAGSHUB_TOKEN
    + "@dagshub.com/" + DAGSHUB_USER_NAME + "/" + DAGSHUB_REPO_NAME + ".git"
)

git.Git().clone(git_url)

Once you clone the repo, you can see it in your local system, but you will not find the folders that are versioned using DVC (the data and models folders) in it.

There are two ways of using the streaming client.

The first is the Python-only “Lite” hooks, which automatically detect calls to Python’s built-in file operations like open() and listdir() and are compatible with most Python ML libraries. After calling install_hooks(), one can simply access the files as if they were stored on the local machine.

from dagshub.streaming import install_hooks
import pandas as pd

install_hooks(DAGSHUB_REPO_NAME,password=DAGSHUB_TOKEN)


train_data=pd.read_csv(DAGSHUB_REPO_NAME+"/data/raw/train.csv")
train_data.shape

Using this method, when we call os.listdir() we can also see the data and models folders, which are DVC versioned.

On calling install_hooks() we can see that the “data” and “models” folders are visible using os.listdir()

This method may not work with TensorFlow and OpenCV, since their input/output routines are written in low-level languages like C/C++ and need to be handled differently than a library like PyTorch. In such cases, you can use file streaming by loading the Dagshub filesystem.

File Streaming using DDA is as simple as loading the Dagshub FileSystem.

from dagshub.streaming import DagsHubFilesystem
fs = DagsHubFilesystem(project_root=DAGSHUB_REPO_NAME,username=DAGSHUB_USER_NAME,
                        password=DAGSHUB_TOKEN)

We can then replace any use of open(), os.stat(), os.listdir(), and os.scandir() with fs.open(), fs.stat(), fs.listdir(), and fs.scandir() respectively. Using fs.listdir(), we can now see that the data and models folders are present, even though they are not on our machine.

Just as a check, let us use both fs and os to list files in the directory. As we can see below, we cannot access the data folders when using os, which confirms that we are accessing files on the Dagshub repo through the streaming functionality.
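A minimal sketch of this check, assuming fs and DAGSHUB_REPO_NAME are defined as above:

import os

## Files visible through the streaming filesystem (includes DVC-versioned folders)
print(fs.listdir(DAGSHUB_REPO_NAME))

## Files actually present on disk (DVC-versioned folders are missing here)
print(os.listdir(DAGSHUB_REPO_NAME))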

To read, say, train.csv present in the data/raw/ folder, we can use the open function.
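For example, a short sketch of reading train.csv through the streaming client with pandas:

import pandas as pd

csv_path = DAGSHUB_REPO_NAME + "/data/raw/train.csv"
fs.open(csv_path)                 # streams the file from the Dagshub repo and caches it locally
train_data = pd.read_csv(csv_path)
print(train_data.shape)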

Once you open a file, it is stored in the cache on your local machine and hence will be visible when you use the os.listdir() function.

The train.csv file that was opened is stored in the cache and becomes visible in the repo on the local system.

The main advantage of the streaming functionality is that you do not download all the files at once; you access only the files you require. This means we do not have to wait for all the files to download before we start training our models, saving GPU cost and time.


Visualize the Images using DDA

We create a show_img function that takes the streaming client and an image id as input, reads the image, resizes it to (512,512), and flips it horizontally. Here, we use fs.open() to cache the file onto our local system and then use cv2.imread() to read the image.

## Code to visualise images by reading them from Dagshub
import os
import cv2
import matplotlib.pyplot as plt


def show_img(fs, img_id):
    IMAGE_FOLDER = DAGSHUB_REPO_NAME + "/data/raw/train/"
    img_path = os.path.join(IMAGE_FOLDER, '{}.png'.format(img_id))
    print(img_path)
    fs.open(img_path)  ## This will cache the image locally
    slide = cv2.imread(img_path, cv2.IMREAD_UNCHANGED)
    print(type(slide))
    print(slide.shape)
    img = cv2.resize(slide, (512, 512))
    flipHorizontal = cv2.flip(img, 1)

    plt.figure(figsize=(8, 8))
    plt.imshow(flipHorizontal)
    plt.show()


show_img(fs, "008e5c_0")

Creating Scripts for Training Image Classification Models on Azure

The first step to train a model on Azure is to create your own workspace and compute. This can be done using the Azure ML SDK (refer here), or you can go to portal.azure.com and create a workspace. To ensure reproducibility, I created my workspace and compute clusters using the Azure ML SDK.
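For reference, a minimal sketch of loading the workspace and creating a GPU compute cluster with the SDK is shown below; the cluster name and VM size are assumptions, not necessarily the ones used in this project.

## Load the workspace and create a GPU compute cluster
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # reads the config.json downloaded from the Azure portal

gpu_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6",   # assumed GPU VM size
    max_nodes=1,
    idle_seconds_before_scaledown=1800,
)
gpu_cluster = ComputeTarget.create(ws, "gpu-cluster", gpu_config)
gpu_cluster.wait_for_completion(show_output=True)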

The next step is to create the Python scripts used for training. All of these scripts must be placed in the same folder, because that folder will be uploaded to Azure.

## Creating Script Folder to store all the scripts that are needed to run job on Azure
import pathlib
script_folder="../src/azure_ml_scripts"
pathlib.Path(script_folder).mkdir(parents=True,exist_ok=True)
print(script_folder)


In the scripts folder, since I need to be able to access the Dagshub repo, I create dagshub_config.py, which contains the Dagshub username, token, and repo name. The token is to be kept secret and should not be shared.
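A sketch of what dagshub_config.py might contain; the placeholder values are, of course, yours to fill in:

## dagshub_config.py - keep this file private, the token is secret
DAGSHUB_USER_NAME = "<your-dagshub-username>"
DAGSHUB_TOKEN = "<your-dagshub-token>"
DAGSHUB_REPO_NAME = "<your-repo-name>"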

util.py contains the code to create a streaming client, clone the git repo, list the images in the training folder using data streaming, read the train.csv data frame, split the data into train and test sets, and download the EfficientNet model using data streaming.
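As an illustration, the train/test split in util.py could look like the sketch below; the function name, the split ratio, and the way "int_labels" is recreated from "label" are assumptions.

## A possible train/test split helper, stratified on the binary label
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(fs, repo_name, test_size=0.2):
    csv_path = repo_name + "/data/raw/train.csv"
    fs.open(csv_path)                      # cache train.csv locally via streaming
    train_data = pd.read_csv(csv_path)
    # Recreate the binary label described earlier: 1 for LAA, 0 for CE
    train_data["int_labels"] = (train_data["label"] == "LAA").astype(int)
    train_df, test_df = train_test_split(
        train_data,
        test_size=test_size,
        stratify=train_data["int_labels"],
        random_state=42,
    )
    return train_df, test_df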

Now for the last and final script (train.py), where all the magic happens. To train an image classification model, we would normally use the ImageDataGenerator class, but that requires the files to be present on the system. Since we are going to use the streaming client to read the images as well as the train.csv file (to get the image filename and class), we create a custom DataGenerator that reads the images in batches from the Dagshub repo using the streaming functionality.
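A minimal sketch of such a generator, built on tf.keras.utils.Sequence; the class name and arguments are illustrative rather than the exact ones used in train.py.

## A streaming data generator that pulls image batches from the Dagshub repo
import math
import cv2
import numpy as np
from tensorflow.keras.utils import Sequence


class StreamingDataGenerator(Sequence):
    def __init__(self, fs, image_dir, image_ids, labels, batch_size=8, img_size=(512, 512)):
        self.fs = fs                  # Dagshub streaming filesystem
        self.image_dir = image_dir    # e.g. "<repo>/data/raw/train/"
        self.image_ids = image_ids
        self.labels = labels
        self.batch_size = batch_size
        self.img_size = img_size

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(len(self.image_ids) / self.batch_size)

    def __getitem__(self, idx):
        batch_ids = self.image_ids[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = []
        for img_id in batch_ids:
            path = self.image_dir + "{}.png".format(img_id)
            self.fs.open(path)        # cache the file locally via streaming
            img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
            img = cv2.resize(img, self.img_size)
            images.append(img)
        return np.array(images), np.array(batch_labels)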

We will use the EfficientNet B5 model for our blood clot prediction. EfficientNet provides a systematic way to scale CNN models by balancing network depth, width, and resolution. We have stored this model in “data/raw/Pretrained_Efficient_Models” in the repo, which contains multiple EfficientNet models.

But since we only want the B5 model, we read the weights of that model alone. This is possible because of the Streaming API; otherwise, we would have to download the entire folder and then read that particular file.

def download_EfficientNet(fs, model_filename):
    '''
    Download the model file into the local system via streaming and return its path.
    '''
    fs.open(os.path.join(EFFICIENT_NET_MODEL_PATH, model_filename))

    return os.path.join(EFFICIENT_NET_MODEL_PATH, model_filename)


efficientWeight=download_EfficientNet(fs,'efficientnet-b5_tf24_imagenet_1000_notop.h5')

Resizing the images to (512,512) and creating the training and validation data generators from our custom DataGenerator are also handled in the training script. We monitor the model metrics with MLflow on Azure by calling mlflow.autolog().
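Roughly, the training section of train.py could look like the sketch below. The classifier head, the "image_id" column name, IMAGE_FOLDER, and the epoch count are assumptions, and StreamingDataGenerator, train_df, and test_df refer to the earlier sketches.

## Enable MLflow autologging and train with the streaming generators
import mlflow
import tensorflow as tf

mlflow.autolog()  # automatically logs parameters, metrics, and the trained model to the run

# Assumption: the cached weights file is compatible with this Keras EfficientNetB5 build
base_model = tf.keras.applications.EfficientNetB5(
    include_top=False, weights=efficientWeight, input_shape=(512, 512, 3)
)
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary CE vs. LAA output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Generator arguments and epoch count are illustrative assumptions
train_gen = StreamingDataGenerator(fs, IMAGE_FOLDER, train_df["image_id"].values,
                                   train_df["int_labels"].values)
val_gen = StreamingDataGenerator(fs, IMAGE_FOLDER, test_df["image_id"].values,
                                 test_df["int_labels"].values)
model.fit(train_gen, validation_data=val_gen, epochs=10)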


Creating Environment and Running Script on Azure

Now that we have our scripts ready, we need an environment on Azure where we can run them; this means installing our libraries and other dependencies. While we can create our own environment with conda and pip packages, we can also use curated environments.

In this case, I created my own environment using an existing Docker image that has CUDA installed.

## Creating Environment using CUDA docker image and installing dependencies
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies


env = Environment('blood-clot-env')

conda_dep = CondaDependencies()
conda_dep.add_conda_package("scikit-learn==0.22.2.post1")

for package in ['azureml-dataset-runtime[pandas,fuse]', 
               'azureml-defaults',
              'dagshub',
              'GitPython','opencv-python-headless','Pillow','mlflow']:
    conda_dep.add_pip_package(package)

other_pip_packages=['azureml-core==1.45.0' ,'azureml-defaults==1.45.0' ,
                                          'azureml-mlflow==1.45.0' ,
                                          'azureml-telemetry==1.45.0' ,
                                          'tensorboard~=2.7.0' ,
                                          'tensorflow-gpu~=2.7.0' ]
for package in other_pip_packages:
    conda_dep.add_pip_package(package)

                                
env.python.conda_dependencies=conda_dep

## Loading the Docker image with CUDA
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04:20220902.v1"
)
# Register environment to re-use later
env.register(workspace = ws)

# Save the environment
env.save_to_directory("Environment/",overwrite=True)

The next step is to create an experiment. Think of an experiment as a folder within which your jobs are organized. We then create a ScriptRunConfig object. You can think of ScriptRunConfig as a configuration that saves all the details required to run the job on Azure.

## Creating an Experiment and ScriptRunConfig to save details of the job to run on Azure
from azureml.core import ScriptRunConfig
from azureml.core import Experiment

experiment_name = 'blood-clot-classification-mlflow'


exp = Experiment(workspace=ws, name=experiment_name)

src = ScriptRunConfig(source_directory=script_folder,
                      script='train.py', 
                      arguments=None,
                      compute_target=gpu_cluster,
                      environment=env)
 


To run the job on Azure, we submit the experiment with the ScriptRunConfig object and monitor its progress right from our notebook using the RunDetails widget.
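A short sketch of this submission and monitoring step, using the exp and src objects created above:

## Submit the job and monitor it from the notebook
from azureml.widgets import RunDetails

run = exp.submit(config=src)
RunDetails(run).show()                     # live widget with logs and metrics
run.wait_for_completion(show_output=True)  # block until the run finishes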

We can track the metrics from our notebook using run.get_metrics().

We can see that the model achieved a validation accuracy of 0.75 and a training accuracy of 0.76.

We can also download the saved model using run.download_files().

## Download the model folder saved after training into our local system 

run.download_files(output_directory="outputs/efficientNet_Model")

Oh!! But what if I wanted to write the metrics to a file and push it to my Dagshub Repo — do I need to pull all the data and then push it??

SURPRISE!!!

Dagshub introduced the upload functionality as part of the DDA feature, which allows us to upload files to our repo without having to do a pull.

## Uploading files to Dagshub Repo using upload() functionality of DDA
from dagshub.upload import Repo

repo = Repo(DAGSHUB_USER_NAME, DAGSHUB_REPO_NAME, branch="main",
            username=DAGSHUB_USER_NAME, token=DAGSHUB_TOKEN)

repo.upload(file="../metrics.txt",path="efficientNet_Metrics.txt",
            commit_message="Updating Metrics File",versioning="dvc")

VOILA!!! The file is uploaded to Dagshub Repo with DVC versioning.


Wrapping it Up…

In this article, we saw how we can deploy our code on Azure and also train a model without having to spend a lot of time transferring the data to Azure.

While I have taken the example of image classification to show how we can combine cloud technologies like Azure with Dagshub’s DDA feature, this can be extended to other data types as well, such as audio or video. Video files in particular are large, take up a lot of storage space, and pulling them every single time can be a time-consuming activity.

Interestingly, the streaming functionality is not limited to deep learning models. It can also be used when you have data from multiple tables stored as frequently updated files and your analysis requires only a few of them; with data streaming, you can avoid downloading all the files and simply re-run the script to get updated results.

Using Azure, we can create our own environment with its dependencies without much familiarity with Docker containers. And for those accustomed to Python, the Azure ML SDK gives us a way to run our models on Azure without ever leaving the familiar Jupyter notebook environment.



Written by aiswaryaram | Data Enthusiast, A mother, Obsessed with learning and believe every challenge is an opportunity
Published by HackerNoon on 2022/11/11