How to Build a Training Pipeline on Multiple GPUs

by Mahesh Chandra Mukkamala, May 20th, 2022

Too Long; Didn't Read

In the current big data regime, it is hard to handle all the data with a single CPU. In such a case, one relies on multiple workers (CPUs) to handle the data. Even with the cloud, configuring infrastructure can be quite painful, and automating it can save a lot of time and make the process far more reproducible. In this blog post, we’ll use dstack to build a pipeline for continuous training in the cloud using multiple GPUs. Once you define your workflows and their infrastructure requirements in code, you’ll be able to run them at any time, and dstack will automatically provision all the required infrastructure in the configured cloud.



DISCLAIMER: I am a developer advocate at dstack.

Introduction


In the current big data regime, it is hard to handle all the data with a single CPU.


In such a case, one relies on multiple workers (CPUs) to handle the data. Even with the cloud, configuring infrastructure can be quite painful, and automating it can save a lot of time and make the process far more reproducible.


In this blog post, we’ll use dstack to build a pipeline for continuous training in the cloud using multiple GPUs.


Once you define your workflows and their infrastructure requirements in code, you’ll be able to run them at any time, and dstack will automatically provision all the required infrastructure in the configured cloud.


At the same time, it will track all the code, parameters, and outputs. To do this, we’ll use the following tools:


  • PyTorch DDP,

  • PyTorch Lightning,

  • dstack,

  • WandB.


Let me briefly explain what these tools do.


PyTorch DDP: DDP stands for “Distributed Data-Parallel”. The idea behind it is that the model is replicated on every worker, while each replica works on a different set of data samples, which enables scalability.


Furthermore, the gradients are computed independently on each worker and later accumulated via communication between the workers.
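
To make the data-parallel idea more concrete, below is a minimal, hypothetical sketch of raw PyTorch DDP. It is only an illustration (the tutorial itself relies on PyTorch Lightning’s strategy="ddp" instead), and it assumes the script is launched with torchrun so that each process receives its rank from the environment.


# Hypothetical sketch of the DDP idea in raw PyTorch.
# Launch with e.g.: torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, etc.; init_process_group reads them
    dist.init_process_group(backend="gloo")  # use "nccl" when training on GPUs

    model = torch.nn.Linear(10, 1)  # the model is replicated on every worker
    ddp_model = DDP(model)          # gradients are all-reduced across workers on backward()

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # in a real pipeline, a DistributedSampler would give each worker a different shard of the data
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()  # gradient synchronization happens here
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()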


PyTorch Lightning: The main advantage of PyTorch Lightning compared to plain PyTorch is that there is no need to write a lot of boilerplate code.


dstack: dstack is a framework to automate machine learning pipelines. In particular, the user defines workflows and their details via declarative configuration files: a workflow provider (the program that runs your workflow), the script to run, input and output artifacts (e.g. other workflows that the current workflow depends on and the folders with output files), required resources (such as the amount of memory or the number of GPUs), etc. Once defined, workflows can be run via the dstack CLI, and dstack takes care of provisioning the required resources using one of the linked clouds (such as AWS, GCP, or Azure). The submitted runs can be monitored in dstack’s user interface.


WandB: WandB is great for tracking various metrics, including accuracy, training loss, validation loss, GPU utilization, memory usage, etc.


Now that we understand the tools involved, let's have a look into the steps required to build our pipeline.

Steps

dstack Configuration

Once you’ve signed in to your account at dstack.ai, click the Settings tab on the left-hand side, and then click the AWS tab.


Here, you have to provide your AWS credentials and specify which AWS instance types dstack is allowed to use.


Since multiple GPUs are required for our workflow, we may want to add a p3.8xlarge instance type. Make sure to select the region where you have GPU quotas to use this type of instance.

To add allowed instance types, click the Add a limit button. After you’re done, the user interface should look like the image below.




Keep the dstack UI open, as we will keep coming back to check the progress of our running workflows.

WandB Configuration


As we are going to use WandB, we’ll have to specify our WandB API key as a secret in the Settings of dstack. Your WandB API key can be found in your WandB account’s “Settings” page, as shown below.


Copy the appropriate API key, and add it as a secret in the Settings of your dstack account.


Click the “Add secret“ button, set the key to WANDB_API_KEY, and paste the WandB API key you copied earlier into the value field. Finally, your dstack Settings should look like this:


Install Required Packages


Here’s the requirements.txt file that we are going to use in our project:


dstack
pytorch-lightning
torch
torchvision
wandb


Go ahead and install it using the following command:


pip install -r requirements.txt


Now, all the required packages are installed, including the dstack CLI.


Directory Structure

For this project, we follow the directory structure provided below:


<project folder>/
   .dstack/
       workflows.yaml
   train.py
   requirements.txt


Model and Trainer

The file train.py handles the model and the trainer objects of the machine learning training pipeline.
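
As a rough idea of what the model part might look like, here is a minimal, hypothetical LightningModule for MNIST classification; the actual train.py in the repository may differ, but the sketch shows how little boilerplate PyTorch Lightning requires.


# Hypothetical minimal LightningModule for MNIST; the real train.py may differ.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class MNISTClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # a deliberately tiny model: one linear layer over the flattened image
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)  # picked up by the configured logger (WandB here)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)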


Depending on whether a GPU is available (which can be checked via torch.cuda.is_available()) and how many GPUs there are (torch.cuda.device_count()), we have to set the arguments of the trainer object appropriately.


For the CPU-only case, we set accelerator = 'cpu', and in the rest of the cases we set accelerator = 'gpu'; we store this choice in the variable accelerator_name.
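
A minimal sketch of how accelerator_name and num_devices could be derived is shown here; the exact logic in train.py may differ, but the variable names match the trainer snippet below.


# Hypothetical device-detection logic feeding the pl.Trainer arguments below.
import torch

if torch.cuda.is_available():
    accelerator_name = "gpu"
    num_devices = torch.cuda.device_count()  # e.g. 4 on a p3.8xlarge
else:
    accelerator_name = "cpu"
    num_devices = 1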


   # trainer instance with appropriate settings
   trainer = pl.Trainer(accelerator=accelerator_name,
                        limit_train_batches=0.5, 
                        max_epochs=10,
                        logger=wandb_logger,
                        devices=num_devices, 
                        strategy="ddp")


Note that we use strategy="ddp" to ensure that PyTorch Lightning relies on the DDP training strategy mentioned earlier in the introduction of this blog.


We also set logger=wandb_logger in the pl.Trainer object to use WandB for tracking the metrics and other system information.
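
For completeness, the wandb_logger object can be created with PyTorch Lightning’s built-in WandB integration; the project name below is just a placeholder. The logger authenticates via the WANDB_API_KEY environment variable, which is why we added it as a dstack secret earlier.


# Hypothetical logger setup; the project name is a placeholder.
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="train-mnist-multi-gpu")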


The full code from this tutorial can be found here.

dstack Workflows

Now, we specify the workflow via the .dstack/workflows.yaml file, whose contents look like this:


workflows:
 - name: train-mnist-multi-gpu
   provider: python
   version: 3.9
   requirements: requirements.txt
   script: train.py
   artifacts:
     - data
     - model
   resources:
     gpu: 4


dstack Runs

In order to run the workflow, all we have to do is execute the following command in the terminal:


dstack run train-mnist-multi-gpu


Now, open dstack.ai (after logging in) to see the workflows. The contents of the “Runs” tab will look like this:


You can click the run to see the progress of the workflow. In the “Logs” tab, a few minutes after the job starts, you will see the cloud server running train.py.


In the “Jobs” tab, you will see information like below.


After the run is over, we can inspect the artifacts by clicking the “data + 1” button; after clicking the “data” and “model” folders, the output looks like this:


It is also possible to download the contents of the artifacts via the dstack CLI.


In the “Runners” tab on the left side, you will find information on the specific instances being used.

WandB Monitoring

After the run is completed, you’ll find the metrics you tracked in WandB, along with information regarding GPU utilization and various other useful details.


In principle, you can use any other experiment tracking service together with dstack.

Conclusion

I hope you enjoyed reading this blog.


The code used in this blog post can be found in the git repository here.




