DISCLAIMER: I am a developer advocate at dstack.
In the current big-data regime, it is hard to fit all the data onto a single machine. In such cases, one relies on multiple workers to handle the data in parallel. Even with the cloud, configuring the infrastructure for this can be quite painful, and automating it can save a lot of time and make the process far more reproducible.
In this blog post, we’ll use dstack to build a pipeline for continuous training in the cloud using multiple GPUs.
Once you define your workflows and their infrastructure requirements in code, you can run them at any time, and dstack will automatically provision all the required infrastructure in the configured cloud. At the same time, it tracks all the code, parameters, and outputs. To do this, we’ll use the following tools:
Let me briefly explain what these tools do.
PyTorch DDP: DDP stands for “Distributed Data-Parallel”. The idea behind it is that the model is replicated on every worker, while each replica works on a different subset of the data samples, which enables scalability.
Furthermore, the gradients are computed independently on each worker and then averaged across workers via communication between them, so every replica applies the same update.
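To make that gradient step concrete, here is a toy, pure-Python simulation of the all-reduce averaging that DDP performs after the backward pass (no real process groups or torch involved; the gradient values are made up for illustration):

```python
# Toy simulation of DDP's gradient all-reduce.
# Each "worker" computes gradients on its own data shard;
# DDP then averages them so every replica applies the same update.

def all_reduce_mean(per_worker_grads):
    """Average per-parameter gradients across workers (what DDP's all-reduce does)."""
    num_workers = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    return [
        sum(worker[p] for worker in per_worker_grads) / num_workers
        for p in range(num_params)
    ]

# Illustrative gradients from 4 workers for a model with 2 parameters:
grads = [
    [0.2, -0.4],
    [0.6,  0.0],
    [0.2, -0.4],
    [0.2,  0.0],
]

averaged = all_reduce_mean(grads)
print(averaged)  # every worker now applies this same averaged gradient
```

In real DDP this averaging happens inside the framework during `loss.backward()`, overlapped with gradient computation.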
PyTorch Lightning: The main advantage of PyTorch Lightning over plain PyTorch is that it removes the need to write a lot of boilerplate code (training loops, device placement, distributed setup, and so on).
dstack: dstack is a framework to automate machine learning pipelines. In particular, the user defines workflows and their details via declarative configuration files: a workflow provider (the program that runs the workflow), the script to run, input and output artifacts (e.g., other workflows that the current workflow depends on, and the folders with output files), required resources (such as the amount of memory or the number of GPUs), and so on. Once defined, workflows can be run via the dstack CLI. dstack takes care of provisioning the required resources using one of the linked clouds (such as AWS, GCP, or Azure), and the submitted runs can be monitored in dstack’s user interface.
WandB: WandB is great for tracking various metrics, including accuracy, training loss, validation loss, GPU utilization, memory usage, etc.
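To build intuition for what such a tracker records, here is a minimal, hypothetical stand-in (this is not the real wandb API, just an illustration of the idea): it keeps a history of named metrics per logging step, which is what we later browse in the dashboard as loss curves, accuracy, and system charts.

```python
# A toy stand-in for an experiment tracker such as WandB (NOT its real API):
# it records a history of named metrics per step.

class ToyTracker:
    def __init__(self):
        self.history = []  # list of {metric_name: value} dicts, one per step

    def log(self, metrics):
        """Record one step's worth of metrics."""
        self.history.append(dict(metrics))

    def last(self, name):
        """Most recent value logged for a metric."""
        for step in reversed(self.history):
            if name in step:
                return step[name]
        raise KeyError(name)

tracker = ToyTracker()
tracker.log({"train_loss": 1.2, "epoch": 0})
tracker.log({"train_loss": 0.7, "val_acc": 0.81, "epoch": 1})
print(tracker.last("train_loss"))  # 0.7
```

The real service adds persistence, dashboards, and automatic system metrics on top of this basic logging pattern.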
Now that we understand the tools involved, let's have a look into the steps required to build our pipeline.
Once you’ve signed in to your account at dstack.ai, click the Settings tab on the left-hand side, and further click on the AWS tab.
Here, you have to provide your AWS credentials and specify which AWS instance types dstack is allowed to use.
Since multiple GPUs are required for our workflow, we may want to add the p3.8xlarge instance type. Make sure to select a region where you have GPU quotas for this type of instance.
To add allowed instance types, click the Add a limit button. After you’re done, the user interface should look like the image provided below.
Keep the dstack UI open, as we will keep coming back to it to see the progress of our running workflows.
As we are going to use WandB to track metrics, you need to provide dstack with your WandB API key.
Copy the API key from your WandB account, and add it as a secret in the Settings of your dstack account.
Click the “Add secret” button, set the key to WANDB_API_KEY, and paste the WandB API key you copied earlier into the value field. Finally, your dstack Settings should look like below:
Here is the requirements.txt file that we are going to use in our project:
dstack
pytorch-lightning
torch
torchvision
wandb
Go ahead and install it using the following command:
pip install -r requirements.txt
Now, all the required packages are installed, including the dstack CLI.
For the project, we have to follow the directory structure provided below:
<project folder>/
    .dstack/
        workflows.yaml
    train.py
    requirements.txt
train.py handles the model and the trainer objects of the machine learning training pipeline.
Depending on whether GPUs are available on the device (which can be checked via torch.cuda.is_available()), we have to set the arguments of the trainer object appropriately.
For the CPU-only case, we need to set accelerator = 'cpu'; in all other cases, we set accelerator = 'gpu'. We do this via the variable accelerator_name:
# trainer instance with appropriate settings
trainer = pl.Trainer(
    accelerator=accelerator_name,
    limit_train_batches=0.5,
    max_epochs=10,
    logger=wandb_logger,
    devices=num_devices,
    strategy="ddp",
)
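The device-selection step feeding this trainer can be sketched as a small helper. This is a pure-Python sketch under the assumption that, in the actual script, the GPU count would come from torch.cuda.device_count(); here it is passed in explicitly so the logic is easy to follow:

```python
# Sketch of the accelerator-selection logic described above (an assumption
# about how train.py derives accelerator_name and num_devices, not a copy
# of the tutorial's exact code). In practice num_gpus would come from
# torch.cuda.device_count().

def select_accelerator(num_gpus):
    """Return (accelerator_name, num_devices) for pl.Trainer."""
    if num_gpus == 0:
        # No GPU available: fall back to a single CPU process.
        return "cpu", 1
    # One device per available GPU; strategy="ddp" replicates the model on each.
    return "gpu", num_gpus

print(select_accelerator(0))  # ('cpu', 1)
print(select_accelerator(4))  # ('gpu', 4)
```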
Note that we use strategy="ddp" to ensure that PyTorch Lightning relies on the DDP training strategy mentioned earlier in the introduction of this blog.
We also set logger=wandb_logger in the pl.Trainer object to use WandB for tracking the metrics and other system information.
The full code from this tutorial can be found
Now, we specify the workflow via the .dstack/workflows.yaml file. Its contents look like below:
workflows:
  - name: train-mnist-multi-gpu
    provider: python
    version: 3.9
    requirements: requirements.txt
    script: train.py
    artifacts:
      - data
      - model
    resources:
      gpu: 4
In order to run the workflow, all we have to do is execute the following command in the terminal:
dstack run train-mnist-multi-gpu
You can click the run to see the progress of the workflow. In the “Logs” tab, you will see the cloud server running train.py a few minutes after the job starts.
In the “Jobs” tab, you will see information like below.
After the run is over, we can inspect the artifacts by clicking the “data + 1” button; after clicking the “data” and “model” folders, the output looks like below.
It is also possible to download the contents of the artifacts via the dstack CLI.
In the “Runners” tab on the left side, you will find information on the specific instances being used.
After the run is completed, you'll find the metrics you tracked in WandB, along with GPU utilization and various other crucial details.
In principle, you can use any other experiment tracking service together with dstack.
I hope you enjoyed reading this blog.
The code used in this blog post can be found at the git repository
The references I used are the following.