The typical machine learning training process consists of various tasks such as data preparation, model training, model validation, model deployment and others.
Because you may want to iterate on your model even after you deploy it to production, it is paramount to automate these tasks so that they can be rerun with new data, new hyper-parameters, or changes in the code. Training pipelines help automate these tasks.
At the same time, the pipeline must stay reproducible, in the sense that you as well as others can rerun it and obtain similar results. Ensuring this requires tracking infrastructure, code, data, hyper-parameters, experiment metrics, etc.
In this article, I’ll give an overview of how WandB and dstack together can help build reproducible training pipelines.
First, we’ll start with experiment tracking.
Experiment tracking involves tracking the hyper-parameters and metrics of each training run.
WandB provides one of the easiest ways to perform experiment tracking, and it is very easy to integrate into your existing codebase. Below is an example of how to do it if you use PyTorch Lightning.
We need to import the WandbLogger object. Then, instantiate the object and pass a project name as its argument, as shown below.
from pytorch_lightning.loggers import WandbLogger
wandb_logger = WandbLogger(project="my-test-project")
The created wandb_logger can be passed to the logger argument of the Trainer object, as shown below.
from pytorch_lightning import Trainer
trainer = Trainer(logger=wandb_logger)
Once the run is completed, its information is available under the my-test-project project.
You may check
To store hyper-parameters or other configuration parameters, use wandb.config. Integrating it into the ML code is easy, as shown below.
import wandb
wandb.init()
wandb.config.epochs = 10
wandb.config.batch_size = 32
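In practice, hyper-parameters often arrive as command-line arguments rather than hard-coded values. Here is a minimal sketch under that assumption: the flags --epochs and --batch-size and the helper names parse_hparams and init_tracking are hypothetical and not part of WandB. The parsed values are collected in a dict and handed to wandb.init through its config argument, which fills wandb.config in one step.

```python
import argparse


def parse_hparams(argv=None):
    """Parse hyper-parameters from the command line into a plain dict."""
    parser = argparse.ArgumentParser(description="Training hyper-parameters")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    # argparse stores "--batch-size" under the key "batch_size"
    return vars(parser.parse_args(argv))


def init_tracking(hparams, project="my-test-project"):
    """Start a WandB run whose config holds the given hyper-parameters."""
    import wandb  # imported lazily so parse_hparams works without wandb installed

    return wandb.init(project=project, config=hparams)
```

A training script could then call init_tracking(parse_hparams()) once at start-up and read wandb.config.epochs and wandb.config.batch_size afterwards.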
The experiment configuration for each run is automatically stored in the cloud, and it can be found in the config.yaml file in the Files tab of a run.
Now, let’s talk about the automation of tasks and tracking the rest of our pipeline, which may include data, code, and infrastructure.
For this purpose, we use dstack. Here’s a brief list of what dstack can do:
To automate workflows with dstack, one needs to define them in the ./dstack/workflows.yaml file.
Here’s a very simple example:
workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]
  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare
    artifacts: ["model"]
    resources:
      gpu: 4
Here, we can define multiple workflows (i.e. tasks) and configure dependencies between them.
In this particular example, we have two workflows: prepare and train. The train workflow depends on the prepare workflow.
As you can see, each workflow defines how to run the code, including which folders to store as output artifacts and what infrastructure is needed (e.g. the number of GPUs, the amount of memory, etc.).
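To make the artifact folders concrete, here is a rough sketch of what the two scripts might do. Everything in it is hypothetical: the toy dataset, the file names train.csv and model.json, and the assumption that the folders listed under artifacts ("data" and "model") are resolved relative to the working directory. The real prepare.py and train.py would hold your own data preparation and training code.

```python
import csv
import json
from pathlib import Path


def prepare(data_dir="data"):
    """Stand-in for prepare.py: write a tiny dataset into the artifact folder."""
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = [(x, 2 * x + 1) for x in range(5)]  # toy data following y = 2x + 1
    with open(out / "train.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y"])
        writer.writerows(rows)
    return out / "train.csv"


def train(data_dir="data", model_dir="model"):
    """Stand-in for train.py: read the data folder, write the model folder."""
    with open(Path(data_dir) / "train.csv", newline="") as f:
        rows = [(float(r["x"]), float(r["y"])) for r in csv.DictReader(f)]
    # Least-squares fit of y = slope * x + intercept
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x, _ in rows
    )
    intercept = mean_y - slope * mean_x
    out = Path(model_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "model.json").write_text(
        json.dumps({"slope": slope, "intercept": intercept})
    )
    return slope, intercept
```

The point of the sketch is the contract between the two steps: the training step only needs the contents of the data folder, and everything it produces lands in the model folder.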
The dstack CLI is used to run workflows from your local terminal. To run the train workflow defined above, execute the following command.
dstack run train
Once the run is submitted, you can access the relevant logs, code changes, and artifacts in the dstack UI or CLI.
In order to monitor the run, login to your
Artifacts can be browsed through the user interface.
Moreover, it is possible to download the contents of artifacts using the dstack CLI with the following command.
dstack artifacts download <run-name>
In order to ensure reproducibility of the training pipelines, it is crucial to track data too.
With dstack, data artifacts can be versioned by assigning tags, which can later be referenced in other workflows.
In the example above, the train workflow depends on the prepare workflow. Each time you run the train workflow, dstack also runs the prepare workflow, and then passes the output artifacts of the prepare workflow to the train workflow.
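The effect of depends-on can be pictured as a dependency-first ordering of workflows. The sketch below is only an illustration of that idea in plain Python, using dicts that mirror the YAML above. It is not how dstack is implemented, and execution_order is a hypothetical helper.

```python
def execution_order(workflows, target):
    """Return the workflows to run for `target`, dependencies first."""
    by_name = {w["name"]: w for w in workflows}
    order, visiting = [], set()

    def visit(name):
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"circular dependency involving {name!r}")
        visiting.add(name)
        for dep in by_name[name].get("depends-on", []):
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(target)
    return order


workflows = [
    {"name": "prepare", "artifacts": ["data"]},
    {"name": "train", "depends-on": ["prepare"], "artifacts": ["model"]},
]
print(execution_order(workflows, "train"))  # prints ['prepare', 'train']
```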
Now let’s imagine that we’d like to run the prepare workflow independently and then reuse the output artifacts of that particular run with the train workflow.
In order to do that, you have to run the prepare workflow, and then assign a tag to that run (e.g. through the UI as shown below or the CLI).
Then, you can refer to this tag from the train workflow:
workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]
  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare:latest
    artifacts: ["model"]
    resources:
      gpu: 4
By decoupling the workflows of preparing data and training the model, it becomes easier to train the model iteratively and keep every run reproducible.
It is possible to seamlessly use dstack and WandB together.
To do that, obtain the WandB API key from “Settings” in WandB as shown below.
Then add it to dstack Secrets.
You can do this via “Settings” in dstack.ai. Click the “Add secret” button and add a secret named WANDB_API_KEY with the copied WandB API key as its value.
The dstack settings should look like below.
In order to use the same run names across dstack and WandB, you can use the environment variable RUN_NAME to get the dstack run name and pass it as a display name to WandbLogger as shown below:
import os
run_name = os.environ['RUN_NAME']
# wandb: log results to a project
wandb_logger = WandbLogger(name=run_name, project="my-test-project")
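When the same training script is also started outside dstack, RUN_NAME will not be set and os.environ['RUN_NAME'] raises a KeyError. A small sketch of one way to handle that follows. The helper names and the "local-run" fallback are hypothetical, and the key check assumes that dstack exposes the WANDB_API_KEY secret to the run as an environment variable, which is where WandB looks for it.

```python
import os


def resolve_run_name(env=None, default="local-run"):
    """Return the dstack run name, or a fallback when RUN_NAME is not set."""
    env = os.environ if env is None else env
    return env.get("RUN_NAME") or default


def has_wandb_key(env=None):
    """Check that a WANDB_API_KEY value is visible to the process."""
    env = os.environ if env is None else env
    return bool(env.get("WANDB_API_KEY"))
```

The logger can then be created with WandbLogger(name=resolve_run_name(), project="my-test-project"), and has_wandb_key() offers an early check before the training starts.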
That’s it for now. The source code used in this blog post can be found here.
You’re very welcome to share your feedback, ask questions, and of course give WandB and dstack a spin yourself.
