DISCLAIMER: I am a developer advocate at dstack.
In the current big-data regime, it is hard to fit all the data onto a single machine. In such cases, one relies on multiple workers to handle the data in parallel. Even with the cloud, configuring the infrastructure for this can be quite painful, and automating it can save a lot of time and make the process far more reproducible.
In this blog post, we’ll use dstack to build a pipeline for continuous training in the cloud using multiple GPUs.
Once you define your workflows and their infrastructure requirements in code, you can run them at any time, and dstack will automatically provision all the required infrastructure in the configured cloud. At the same time, it tracks all the code, parameters, and outputs. To do this, we’ll use the following tools:
Let me briefly explain what these tools do.
PyTorch DDP: DDP stands for “Distributed Data-Parallel”. The idea behind it is that the model is replicated on every worker, while each replica works on a different subset of the data samples, which enables scalability.
Furthermore, the gradients are computed independently on each worker and then averaged across workers via communication between them, so every replica applies the same update.
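To make that gradient step concrete, here is a toy, pure-Python simulation of the all-reduce averaging that DDP performs after the backward pass (no real process groups or torch involved; the gradient values are made up for illustration):

```python
# Toy simulation of DDP's gradient all-reduce.
# Each "worker" computes gradients on its own data shard;
# DDP then averages them so every replica applies the same update.

def all_reduce_mean(per_worker_grads):
    """Average per-parameter gradients across workers (what DDP's all-reduce does)."""
    num_workers = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    return [
        sum(worker[p] for worker in per_worker_grads) / num_workers
        for p in range(num_params)
    ]

# Illustrative gradients from 4 workers for a model with 2 parameters:
grads = [
    [0.2, -0.4],
    [0.6,  0.0],
    [0.2, -0.4],
    [0.2,  0.0],
]

averaged = all_reduce_mean(grads)
print(averaged)  # every worker now applies this same averaged gradient
```

In real DDP this averaging happens inside the framework during `loss.backward()`, overlapped with gradient computation.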
PyTorch Lightning: The main advantage of PyTorch Lightning over plain PyTorch is that it removes the need to write a lot of boilerplate code (training loops, device placement, distributed setup, and so on).
dstack: dstack is a framework to automate machine learning pipelines. In particular, the user defines workflows and their details via declarative configuration files: a workflow provider (the program that runs the workflow), the script to run, input and output artifacts (e.g., other workflows that the current workflow depends on, and the folders with output files), required resources (such as the amount of memory or the number of GPUs), and so on. Once defined, workflows can be run via the dstack CLI. dstack takes care of provisioning the required resources using one of the linked clouds (such as AWS, GCP, or Azure), and the submitted runs can be monitored in dstack’s user interface.
WandB: WandB is great for tracking various metrics, including accuracy, training loss, validation loss, GPU utilization, memory usage, etc.
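To build intuition for what such a tracker records, here is a minimal, hypothetical stand-in (this is not the real wandb API, just an illustration of the idea): it keeps a history of named metrics per logging step, which is what we later browse in the dashboard as loss curves, accuracy, and system charts.

```python
# A toy stand-in for an experiment tracker such as WandB (NOT its real API):
# it records a history of named metrics per step.

class ToyTracker:
    def __init__(self):
        self.history = []  # list of {metric_name: value} dicts, one per step

    def log(self, metrics):
        """Record one step's worth of metrics."""
        self.history.append(dict(metrics))

    def last(self, name):
        """Most recent value logged for a metric."""
        for step in reversed(self.history):
            if name in step:
                return step[name]
        raise KeyError(name)

tracker = ToyTracker()
tracker.log({"train_loss": 1.2, "epoch": 0})
tracker.log({"train_loss": 0.7, "val_acc": 0.81, "epoch": 1})
print(tracker.last("train_loss"))  # 0.7
```

The real service adds persistence, dashboards, and automatic system metrics on top of this basic logging pattern.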
Now that we understand the tools involved, let's have a look into the steps required to build our pipeline.
Once you’ve signed in to your account at dstack.ai, click the Settings tab on the left-hand side, and further click on the AWS tab.
Here, you have to provide your AWS credentials and specify which AWS instance types dstack is allowed to use.
Since multiple GPUs are required for our workflow, we may want to add the p3.8xlarge instance type. Make sure to select a region where you have GPU quotas for this type of instance.
To add allowed instance types, click the Add a limit button. After you’re done, the user interface should look like the image provided below.
Keep the dstack UI open, as we will keep coming back to it to see the progress of our running workflows.
As we are going to use WandB to track metrics, you need to provide dstack with your WandB API key.
Copy the API key from your WandB account, and add it as a secret in the Settings of your dstack account.
Click the “Add secret” button, set the key to WANDB_API_KEY, and paste the WandB API key you copied earlier into the value field. Finally, your dstack Settings should look like below:
Here is the requirements.txt file that we are going to use in our project:
dstack
pytorch-lightning
torch
torchvision
wandb
Go ahead and install it using the following command:
pip install -r requirements.txt
Now, all the required packages are installed, including the dstack CLI.
For the project, we have to follow the directory structure provided below:
<project folder>/
    .dstack/
        workflows.yaml
    train.py
    requirements.txt
train.py handles the model and the trainer objects of the machine learning training pipeline.
Depending on whether GPUs are available on the device (which can be checked via torch.cuda.is_available()), we have to set the arguments of the trainer object appropriately.
For the CPU-only case, we need to set accelerator = 'cpu'; in all other cases, we set accelerator = 'gpu'. We do this via the variable accelerator_name:
# trainer instance with appropriate settings
trainer = pl.Trainer(
    accelerator=accelerator_name,
    limit_train_batches=0.5,
    max_epochs=10,
    logger=wandb_logger,
    devices=num_devices,
    strategy="ddp",
)
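The device-selection step feeding this trainer can be sketched as a small helper. This is a pure-Python sketch under the assumption that, in the actual script, the GPU count would come from torch.cuda.device_count(); here it is passed in explicitly so the logic is easy to follow:

```python
# Sketch of the accelerator-selection logic described above (an assumption
# about how train.py derives accelerator_name and num_devices, not a copy
# of the tutorial's exact code). In practice num_gpus would come from
# torch.cuda.device_count().

def select_accelerator(num_gpus):
    """Return (accelerator_name, num_devices) for pl.Trainer."""
    if num_gpus == 0:
        # No GPU available: fall back to a single CPU process.
        return "cpu", 1
    # One device per available GPU; strategy="ddp" replicates the model on each.
    return "gpu", num_gpus

print(select_accelerator(0))  # ('cpu', 1)
print(select_accelerator(4))  # ('gpu', 4)
```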
Note that we use strategy="ddp" to ensure that PyTorch Lightning relies on the DDP training strategy mentioned earlier in the introduction of this blog.
We also set logger=wandb_logger in the pl.Trainer object to use WandB for tracking the metrics and other system information.
The full code from this tutorial can be found
Now, we specify the workflow via the .dstack/workflows.yaml file. Its contents look like below:
workflows:
  - name: train-mnist-multi-gpu
    provider: python
    version: 3.9
    requirements: requirements.txt
    script: train.py
    artifacts:
      - data
      - model
    resources:
      gpu: 4
In order to run the workflow, all we have to do is execute the following command in the terminal:
dstack run train-mnist-multi-gpu
You can click the run to see the progress of the workflow. In the “Logs” tab, you will see the cloud server running train.py a few minutes after the job starts.
In the “Jobs” tab, you will see information like below.
After the run is over, we can inspect the artifacts by clicking the “data + 1” button; after clicking the “data” and “model” folders, the output looks like below.
It is also possible to download the contents of the artifacts via the dstack CLI.
In the “Runners” tab on the left side, you will find information on the specific instances being used.
After the run is completed, you'll find the metrics you tracked in WandB, along with GPU utilization and various other crucial details.
In principle, you can use any other experiment tracking service together with dstack.
I hope you enjoyed reading this blog.
The code used in this blog post can be found at the git repository
The references I used are the following.