The typical machine learning training process consists of various tasks, such as data preparation, model training, model validation, and model deployment.
Because you may want to iterate on your model even after it is deployed to production, it is paramount to automate these tasks so that they can be re-run with new data, new hyper-parameters, changes in the code, and so on. Training pipelines help automate these tasks.
At the same time, it is crucial to keep the pipeline reproducible, meaning that both you and others can replicate it and obtain similar results. Ensuring that pipelines are reproducible requires tracking infrastructure, code, data, hyper-parameters, experiment metrics, etc.
In this article, I’ll give an overview of how WandB and dstack together can help build reproducible training pipelines.
First, we’ll start with experiment tracking.
Experiment tracking means recording the hyper-parameters, metrics, and other metadata of every run so that experiments can be compared and reproduced later.
WandB provides one of the easiest ways to perform experiment tracking, as it integrates with popular ML frameworks in just a few lines of code.
WandB is very easy to integrate into your existing codebase. Below is an example of how to do it if you use PyTorch Lightning.
We need to import the WandbLogger object, instantiate it, and pass a project name as its argument, as shown below.
from pytorch_lightning.loggers import WandbLogger
wandb_logger = WandbLogger(project="my-test-project")
The created wandb_logger can then be passed into the logger argument of the Trainer object, as shown below.
from pytorch_lightning import Trainer
trainer = Trainer(logger=wandb_logger)
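To see the logger in the context of a full (if toy) training run, here is a minimal, self-contained sketch; the model and data below are made up purely for illustration and are not part of the original example.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# A toy LightningModule, used purely for illustration
class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)  # sent to WandB via the attached logger
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random data, just to make the example runnable end to end
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=16)

wandb_logger = WandbLogger(project="my-test-project")
trainer = pl.Trainer(logger=wandb_logger, max_epochs=2)
trainer.fit(ToyModel(), loader)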
Once the run is completed, the information of the run is assigned to the my-test-project project. You may check the logged metrics and metadata of the run in the WandB web UI.
In order to store hyper-parameters or other configuration parameters, wandb.config is used; integrating it into the ML code is easy, as shown below.
import wandb
wandb.init()
wandb.config.epochs = 10
wandb.config.batch_size = 32
The experiment configuration for each run is automatically stored in the cloud, and it can be found in the config.yaml file under the Files tab of a run at wandb.ai.
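Equivalently, the same values can be passed as a dictionary to wandb.init, which is convenient when the configuration is already collected in one place; the parameter names below are just example values:

import wandb

# Passing the configuration up front; the values end up in the same
# config.yaml file under the run’s Files tab
wandb.init(project="my-test-project", config={"epochs": 10, "batch_size": 32})
print(wandb.config.epochs)      # 10
print(wandb.config.batch_size)  # 32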
Now, let’s talk about the automation of tasks and tracking the rest of our pipeline, which may include data, code, and infrastructure.
For this purpose, we use dstack. In brief, dstack lets you define workflows and the dependencies between them, run them from the CLI, provision the infrastructure they need (e.g. GPUs), and track the code and output artifacts of every run.
To automate workflows with dstack, one needs to define them in the ./dstack/workflows.yaml file.
Here’s a very simple example:
workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]

  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare
    artifacts: ["model"]
    resources:
      gpu: 4
Here, we can define multiple workflows (i.e. tasks) and configure dependencies between them.
In this particular example, we have two workflows: prepare and train. The train workflow depends on the prepare workflow.
As you see, each workflow defines how to run the code, including what folders to store as output artifacts, and what infrastructure is needed (e.g. the number of GPUs, amount of memory, etc).
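To make the role of the artifact folders concrete, here is a hypothetical sketch of what prepare.py and train.py might look like. The file contents are made up; the only assumption carried over from the config above is that prepare writes into a local data folder, which dstack then makes available to train, and train writes its output into a model folder.

# prepare.py (hypothetical): write the prepared dataset into the "data"
# folder, which the workflow above declares as an output artifact
import os
import numpy as np

os.makedirs("data", exist_ok=True)
np.save("data/train.npy", np.random.randn(1000, 8))

# train.py (hypothetical): read the "data" artifact produced by the prepare
# workflow and save the trained model into the "model" artifact folder
import os
import numpy as np

features = np.load("data/train.npy")
os.makedirs("model", exist_ok=True)
# ... real training would happen here; we save a dummy weight vector instead
np.save("model/weights.npy", np.zeros(features.shape[1]))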
The dstack CLI is used to run workflows from your local terminal. To run the train workflow defined above, execute the following command.
dstack run train
Once the run is submitted, you can access the logs, code changes, artifacts, etc. of the run in the dstack UI or CLI. In order to monitor the run, log in to your dstack account in the web UI.
Artifacts can be browsed through the user interface.
Moreover, it is possible to download the contents of artifacts using the dstack CLI with the following command.
dstack artifacts download <run-name>
In order to ensure reproducibility of the training pipelines, it is crucial to track data too.
With dstack, data artifacts can be versioned by assigning tags, which can later be referenced in other workflows.
In the example above, the train workflow depended on the prepare workflow. Each time you ran the train workflow, dstack also ran the prepare workflow and then passed the output artifacts of the prepare workflow to the train workflow.
Now let’s imagine that we’d like to run the prepare workflow independently and then reuse the output artifacts of that particular run in the train workflow. In order to do that, you have to run the prepare workflow and then assign a tag to it (e.g. through the UI as shown below, or the CLI). Then, you can refer to this tag from the train workflow:
workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]

  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare:latest
    artifacts: ["model"]
    resources:
      gpu: 4
By decoupling the workflows of preparing data and training the model, it becomes easier to train the model iteratively and keep every run reproducible.
It is possible to seamlessly use dstack and WandB together.
To do that, obtain the WandB API key from “Settings” in WandB, as shown below.
Then add it to dstack Secrets. You can do this via “Settings” in dstack.ai: click the “Add secret” button and add a secret with the key WANDB_API_KEY and the copied WandB API key as its value.
The dstack settings should look like below.
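Assuming dstack exposes the secret to the workflow as an environment variable with the same name (which is what the WANDB_API_KEY key suggests), wandb picks it up automatically, so no interactive login is needed inside the training script:

import os
import wandb

# wandb reads the WANDB_API_KEY environment variable when it is set,
# so wandb.init() authenticates without an interactive prompt
assert "WANDB_API_KEY" in os.environ
wandb.init(project="my-test-project")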
In order to use the same run names across dstack and WandB, you can use the RUN_NAME environment variable to get the dstack run name and pass it as a display name to WandbLogger, as shown below:
import os
run_name = os.environ['RUN_NAME']
# log results to the "my-test-project" WandB project
wandb_logger = WandbLogger(name=run_name, project="my-test-project")
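Putting the pieces together, a training script run by dstack could wire the run name into the logger and the trainer like this; the local-run fallback is just a convenience for running the script outside of dstack:

import os
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Use the dstack run name when available; fall back to a local name otherwise
run_name = os.environ.get("RUN_NAME", "local-run")
wandb_logger = WandbLogger(name=run_name, project="my-test-project")
trainer = Trainer(logger=wandb_logger)
# trainer.fit(...) as in the earlier example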
That’s it for now. The source code used in this blog post can be found here.
You’re very welcome to share your feedback, ask questions, and of course give WandB and dstack a spin yourself.
I want to try it. Where do I start?