Reproducible ML Training Pipelines With dstack And WandB

by Mahesh Chandra Mukkamala, June 23rd, 2022

Introduction To Training Pipelines

The typical machine learning training process consists of tasks such as data preparation, model training, model validation, and model deployment.


Because you may want to iterate on your model even after you deploy it to production, it is paramount to automate these tasks, e.g. to rerun them with new data, new hyper-parameters, or changes in the code. Training pipelines are what automate them.


At the same time, it is crucial to keep the pipeline reproducible, in the sense that both you and others can replicate it and obtain similar results. To make that possible, you need to track infrastructure, code, data, hyper-parameters, experiment metrics, etc.


In this article, I’ll give an overview of how WandB and dstack together can help you build reproducible training pipelines.

Experiment tracking

First, we’ll start with experiment tracking.


Experiment tracking involves recording

  • hyper-parameters (e.g., step-size, batch size, etc.)
  • experiment metrics (e.g., accuracy, training loss, validation loss, etc.)
  • hardware system metrics (e.g., the utilization of GPUs, memory, etc.)


WandB provides one of the easiest ways to perform experiment tracking, as it

  • is easy to integrate into the ML code (only a few lines of code need to be added)
  • stores the results in the cloud, so you can always find your metrics by run name (don’t underestimate the importance of this; keeping metrics only on your local machine is rather an anti-pattern with regard to reproducibility)
  • automatically visualizes the various metrics with charts


WandB is very easy to integrate into your existing codebase. Below is an example of how to do it if you use PyTorch Lightning.


First, import the WandbLogger class. Then, instantiate it and pass a project name as its argument, as shown below.


from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-test-project")


The created wandb_logger can then be passed to the logger argument of the Trainer object, as shown below.


trainer = Trainer(logger=wandb_logger)
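
Putting these pieces together, here is a minimal end-to-end sketch; the LitModel module and the toy dataset are hypothetical stand-ins used only for illustration, not code from the post:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)  # sent to WandB via the attached logger
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Toy data just to make the sketch runnable
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
train_loader = DataLoader(dataset, batch_size=8)

wandb_logger = WandbLogger(project="my-test-project")
trainer = pl.Trainer(logger=wandb_logger, max_epochs=2)
trainer.fit(LitModel(), train_loader)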


Once the run is completed, its information is logged under my-test-project.


You can then open wandb.ai and click your project to see the results (loss, accuracy, system metrics, etc.) for its various runs, as shown below.

To store hyper-parameters or other configuration parameters, use wandb.config; integrating it into the ML code is just as easy, as shown below.


import wandb
wandb.init()
wandb.config.epochs = 10
wandb.config.batch_size = 32


The experiment configuration for each run is automatically stored in the cloud, and it can be found in the config.yaml file in the Files tab of a run at wandb.ai.
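
Alternatively, the same configuration can be passed as a dictionary directly to wandb.init, which keeps all hyper-parameters in one place; a minimal sketch:

import wandb

# Equivalent to setting the wandb.config fields one by one
wandb.init(project="my-test-project", config={"epochs": 10, "batch_size": 32})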

Automating workflows

Now, let’s talk about the automation of tasks and tracking the rest of our pipeline, which may include data, code, and infrastructure.

For this purpose, we use dstack. Here’s a brief list of what dstack can do:

  • version data, code, and infrastructure
  • automate training tasks through declarative configuration files
  • automatically provision infrastructure in a linked cloud account (it supports AWS, GCP, and Azure)
  • most importantly, be used right from your favorite IDE (or terminal)


To automate workflows with dstack, you define them in the .dstack/workflows.yaml file.


Here’s a very simple example:


workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]

  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare
    artifacts: ["model"]
    resources:
      gpu: 4


Here, we can define multiple workflows (i.e. tasks) and configure dependencies between them.


In this particular example, we have two workflows: prepare and train.


The train workflow depends on the prepare workflow.


As you see, each workflow defines how to run the code, including what folders to store as output artifacts, and what infrastructure is needed (e.g. the number of GPUs, amount of memory, etc).
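
For context, here is what a hypothetical prepare.py could look like; it simply writes its output into the data folder that the workflow declares as an artifact (this script is an illustration, not code from the post):

# prepare.py (hypothetical illustration)
# Creates a toy dataset and writes it into the "data" folder,
# which the prepare workflow declares as an artifact.
import json
import os

os.makedirs("data", exist_ok=True)

samples = [{"x": i, "y": 2 * i} for i in range(100)]
with open(os.path.join("data", "train.json"), "w") as f:
    json.dump(samples, f)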


The dstack CLI is used to run workflows from your local terminal. To run the train workflow defined above, execute the following command:


dstack run train


Once the run is submitted, you can access the relevant logs, code changes, artifacts, etc. in the dstack UI or CLI.


To monitor the run, log in to your dstack.ai account; the dstack UI should look like the screenshot below.

Artifacts can be browsed through the user interface.


Moreover, it is possible to download the contents of artifacts using the dstack CLI with the following command.


dstack artifacts download <run-name>


Versioning data


In order to ensure reproducibility of the training pipelines, it is crucial to track data too.

With dstack, data artifacts can be versioned by assigning tags, which can later be referenced in other workflows.


In the example above, the train workflow depended on the prepare workflow. Each time you ran the train workflow, dstack also ran the prepare workflow first, and then passed its output artifacts to the train workflow.
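
To make the artifact hand-off concrete, a hypothetical train.py counterpart to the prepare.py sketch above could read the data artifact and write its output into the model folder (again, an illustration rather than code from the post):

# train.py (hypothetical illustration)
# Reads the "data" artifact produced by the prepare workflow and
# writes the fitted parameter into the "model" folder.
import json
import os

with open(os.path.join("data", "train.json")) as f:
    samples = json.load(f)

# Fit a trivial least-squares slope y ~ w * x as a stand-in for real training
num = sum(s["x"] * s["y"] for s in samples)
den = sum(s["x"] ** 2 for s in samples) or 1
w = num / den

os.makedirs("model", exist_ok=True)
with open(os.path.join("model", "weights.json"), "w") as f:
    json.dump({"w": w}, f)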


Now let’s imagine that we’d like to run the prepare workflow independently and then reuse the output artifacts of that particular run across subsequent runs of the train workflow.


In order to do that, you have to run the prepare workflow, and then assign a tag to it (e.g. through the UI as shown below, or via the CLI).


Then, you can refer to this tag from the train workflow:


workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]

  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare:latest
    artifacts: ["model"]
    resources:
      gpu: 4


By decoupling the workflows of preparing data and training the model, it becomes easier to train the model iteratively and keep every run reproducible.


dstack + WandB Configuration

It is possible to seamlessly use dstack and WandB together.


To do that, obtain the WandB API key from “Settings” in WandB, as shown below.

And add it to dstack Secrets.


You can do this via “Settings” in dstack.ai: click the “Add secret” button and add a secret with the key WANDB_API_KEY and the copied WandB API key as its value.


The dstack settings should look like below.

To use the same run names across dstack and WandB, you can read the dstack run name from the RUN_NAME environment variable and pass it as the display name to WandbLogger, as shown below:


import os

from pytorch_lightning.loggers import WandbLogger

# Get the dstack run name from the environment
run_name = os.environ['RUN_NAME']

# Log results to a WandB project under the same run name
wandb_logger = WandbLogger(name=run_name, project="my-test-project")
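
If you are not using PyTorch Lightning, the same trick works with plain wandb; a minimal sketch (assuming the script is started by dstack, which sets RUN_NAME):

import os
import wandb

# Reuse the dstack run name as the WandB run name
wandb.init(project="my-test-project", name=os.environ['RUN_NAME'])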


That’s it for now. The source code used in this blog post can be found here.


You’re very welcome to share your feedback, ask questions, and of course give WandB and dstack a spin yourself.


I want to try it. Where do I start?


dstack Quickstart

WandB Quickstart