Quite some time ago, I experimented with combining two of my interests: data science and video games. For a small personal project, I scraped images from two of my favorite Game Boy games (Super Mario Land 2: 6 Golden Coins and Wario Land: Super Mario Land 3) and built an image classifier detecting which game an image comes from. It was lots of fun!
But this project pretty much never left a Jupyter Notebook 🙈 And while I personally like Notebooks as a storytelling tool for a structured narrative (and some ad-hoc experimentation), such a setup comes with quite a few limitations. For starters, building any ML/DL project is an iterative process, often involving dozens or hundreds of experiments. Each of those preprocesses the data slightly differently, adds some new feature(s), uses a different model or set of hyperparameters, and so on.
How do we keep track of all of those? We could create a spreadsheet and manually note down all the details, but I am sure it would become very annoying and cumbersome after just a few iterations. And while this could work for some of the points I have mentioned, it would not solve the issue with data. We simply cannot track data and its transformations that easily in a spreadsheet.
That is why this time I would like to approach the project differently and use the available tools to build a properly versioned project (data included) with experiment tracking and some sanity checks for the data. Sounds much better than experiment tracking in a spreadsheet, right?
In this article, I will show you how to use tools such as DagsHub, DVC, MLFlow, and GitHub Actions to create a full-fledged ML/DL project. We will cover the following topics:
Let’s jump right into it!
As I have already mentioned in the introduction, we will be trying to solve a binary image classification problem. When I previously worked on this project, I compared the performance of a logistic regression model to that of a Convolutional Neural Network. In this project, we will implement the latter using keras.
Below you can find an example of the images that we will be working with.
I will not go into much detail in terms of getting the data, processing it, building the CNN, or even evaluating the models, as I have already covered those quite extensively in my previous articles.
For the v2 approach to this image classification task, we use the following project structure:
.
├── README.md
├── .github
│   └── workflows
│       └── data_validation.yml
├── data
│   ├── processed
│   │   ├── test
│   │   │   ├── mario
│   │   │   └── wario
│   │   └── train
│   │       ├── mario
│   │       └── wario
│   ├── raw
│   │   ├── mario
│   │   └── wario
│   └── videos
├── data_validation.html
├── dvc.lock
├── dvc.yaml
├── metrics.csv
├── models
├── notebooks
│   ├── 1_downloading_data.ipynb
│   ├── 2_extracting_images.ipynb
│   ├── 3_train_test_split.ipynb
│   ├── 4_cnn_classifier.ipynb
│   ├── 5_deepchecks.ipynb
│   └── README.md
├── params.yml
├── requirements.txt
└── src
    ├── config.py
    ├── create_train_test_split.py
    ├── extract_frames.py
    ├── get_videos.py
    ├── train.py
    ├── utils.py
    └── validate_data.py
We will cover all the elements throughout the article, but for now we can briefly mention the following:
- The data directory stores the data, with each subdirectory holding the data from a different stage of the pipeline.
- The hidden .github directory contains the GitHub Actions workflows.
- The notebooks directory contains the Notebooks used for exploration; those are not relevant for the project’s functionalities.
- The src directory contains the project’s entire codebase. Each script covers a different part of the pipeline.
- requirements.txt contains the list of libraries required for running the project. poetry could work just as well.
Setting a clear project structure is definitely helpful in keeping everything organized and facilitates running experiments that modify only parts of the entire pipeline. There are still quite a few things we could add to the structure, but let’s keep it simple and focus on the other elements of the project’s setup. We will mention some potential extensions at the end of the article.
As the first building block of our project, we will use DagsHub. In a nutshell, it is something like GitHub, but tailor-made for data scientists and ML engineers (as opposed to software engineers). On DagsHub we can easily host and version not only our code but also our data, models, experiments, etc.
You might be thinking now “sounds great, but we need to sign up for one more service and, on top of that, our entire codebase is already on GitHub”. Fortunately, that is not a problem. We can either fully migrate a repository or — which is even more convenient — mirror an existing one. This way, we can continue working using the existing GitHub repository and it will be mirrored in real-time to DagsHub. For this project, we will be using the repo mirroring option — the main GitHub repository will be mirrored to this one on DagsHub.
As you can see in the image below, the UI of DagsHub is very similar to that of GitHub. This way, we do not have to learn yet another tool from scratch, as everything feels familiar from the very beginning. You can already see that there are some new tabs available (experiments, annotations), which we will cover later on. In the image, we display all the files in the repository, but we can easily filter them, for example, to display only Notebooks or files tracked with DVC.
We will not cover all of the functionalities of DagsHub, but it is also worth mentioning that it offers the following:
You are certainly already familiar with the concept of versioning code with Git. Unfortunately, GitHub does not work that well with data, as it has a file size limit of 100MB. This means that uploading a single binary file (a video file, in our case) can easily exceed this limit. On top of that, comparing different versions of data sets of any kind is also not the most pleasant experience. That’s why we need another tool for the job.
DVC (data version control) is an open-source Python library that essentially serves the same purpose as Git (even with the same syntax) but for data instead of code. The idea of DVC is that we keep the information about different versions of our data in Git, while the original data is stored somewhere else (cloud storage like AWS, GCS, Google Drive, etc.).
Such a setup requires a bit of DevOps know-how. Thankfully, DagsHub can save us quite some hassle, as each DagsHub account comes together with 10GB of free storage for DVC.
First, we need to install the library:
pip install dvc
Then, we need to instantiate a DVC repo:
dvc init
Running this command creates 3 files: .dvc/.gitignore, .dvc/config, and .dvcignore. You can find more information about what they contain here. Then, we need to connect our freshly created DVC repo to DagsHub. As we have mentioned before, by using DagsHub we do not have to manually set up the connection to cloud storage. The only thing we need to do is run the following commands in the terminal:
dvc remote add origin https://dagshub.com/eryk.lewinson/mario_vs_wario_v2.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user XXXX
dvc remote modify origin --local password XXXX
To make it even easier, DagsHub provides us with all of those under the Remote tab. We just need to copy-paste them into the terminal.
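After running these commands, the remote’s URL ends up in .dvc/config (which is committed), while the credentials passed with the --local flag land in .dvc/config.local (which is git-ignored), so they never leave our machine. The resulting files look roughly like this — treat it as a sketch rather than the exact contents:

# .dvc/config
['remote "origin"']
    url = https://dagshub.com/eryk.lewinson/mario_vs_wario_v2.dvc

# .dvc/config.local (not committed)
['remote "origin"']
    auth = basic
    user = XXXX
    password = XXXX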
Having connected to the DVC remote, we commit the previously mentioned files to Git:
git add .
git commit -m "Initialized DVC"
git push
Thanks to the mirroring functionality, we only need to push our files to GitHub, as DagsHub will automatically sync all the changes from there.
Now it is time to build a full pipeline, where the intermediate data outputs are tracked by DVC. To make the process easier, we created a separate .py file for each step of the pipeline. In our case, the steps are as follows:

- get_videos.py — downloads the videos of the two games (full gameplays, from start to finish) from YouTube. The downloaded videos are stored in the data/videos directory.
- extract_frames.py — extracts the images from the mp4 video files. The output is stored in the data/raw directory.
- create_train_test_split.py — splits the extracted images into training and test sets. The outputs of this stage are stored in the data/processed directory.
- train.py — trains the CNN to classify the images. Outputs the trained model to the models directory and some other files (metrics.csv and params.yml) to the root directory.
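To give an idea of what one of those steps looks like in practice, below is a minimal sketch of an extract_frames.py-style script. It is not the exact code from the repository — it assumes OpenCV (cv2) is used to read the videos, and the VIDEOS_DIR and RAW_IMAGES_DIR constants are hypothetical stand-ins for whatever config.py actually defines.

import os

import cv2  # assuming OpenCV is used to read the videos

# hypothetical constants standing in for the values defined in config.py
VIDEOS_DIR = "data/videos"
RAW_IMAGES_DIR = "data/raw"


def extract_frames(video_path: str, output_dir: str, every_n_frames: int = 60) -> None:
    """Save every n-th frame of the video as a JPG image in output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_id = 0
    while True:
        success, frame = cap.read()
        if not success:
            break
        if frame_id % every_n_frames == 0:
            cv2.imwrite(os.path.join(output_dir, f"frame_{frame_id}.jpg"), frame)
        frame_id += 1
    cap.release()


if __name__ == "__main__":
    for game in ("mario", "wario"):
        extract_frames(
            os.path.join(VIDEOS_DIR, f"{game}.mp4"),
            os.path.join(RAW_IMAGES_DIR, game),
        )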
Based on those steps, we can create a pipeline using the dvc run command. For readability, the commands are divided into 4 chunks, each corresponding to a separate stage of the pipeline. In practice, you do not have to commit and push after each step. We did it for full transparency and traceability.
dvc run -n get_videos -d src/config.py -o data/videos python src/get_videos.py
git add data/.gitignore dvc.lock dvc.yaml
git commit -m "added get_videos step"
git push
dvc push -r origin
dvc run -n extract_frames -d src/config.py -d src/utils.py -d data/videos -o data/raw python src/extract_frames.py
git add dvc.yaml dvc.lock data/.gitignore
git commit -m "added extract_frames step"
git push
dvc push -r origin
dvc run -n create_train_test_split -d src/config.py -d data/raw -o data/processed python src/create_train_test_split.py
git add dvc.lock data/.gitignore dvc.yaml
git commit -m "executed train_test_split stage"
git push
dvc push -r origin
dvc run -n train -d src/config.py -d data/processed/ -o models -o metrics.csv -o params.yml python src/train.py
git add dvc.lock data/.gitignore dvc.yaml
git commit -m "executed train step"
git push
dvc push -r origin
As you can see in the committed files, DVC saves the pipeline stages into two files: dvc.yaml (stored in a human-readable format) and dvc.lock (pretty much unreadable). While creating the pipeline, we used the following DVC flags:
- -n — the name of the stage,
- -d — the dependency of the stage,
- -o — the output of the stage.

Below, you can see what the pipeline looks like in the YAML file.
stages:
  get_videos:
    cmd: python src/get_videos.py
    deps:
    - src/config.py
    outs:
    - data/videos
  extract_frames:
    cmd: python src/extract_frames.py
    deps:
    - data/videos
    - src/config.py
    - src/utils.py
    outs:
    - data/raw
  create_train_test_split:
    cmd: python src/create_train_test_split.py
    deps:
    - data/raw
    - src/config.py
    outs:
    - data/processed
  train:
    cmd: python src/train.py
    deps:
    - data/processed/
    - src/config.py
    outs:
    - metrics.csv
    - models
    - params.yml
DVC will automatically track all directories and files listed under outs.
On top of that, DagsHub offers a visual preview of the DVC pipeline. You can find it under the list of files in the repository. As you can see below, it makes it much easier to understand the entire pipeline than reading the dvc.yaml file.
Having defined the entire DVC pipeline, we can use the dvc repro command to reproduce the complete pipeline, or parts of it, by executing the stages defined in dvc.yaml.
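For example, assuming we modified something in train.py or its hyperparameters, the following commands (a sketch, not taken from the repository) would re-run only the affected stages and push the new outputs to the remote:

dvc repro            # re-run all stages whose dependencies changed
dvc repro train      # or: reproduce only the train stage (plus its changed upstream stages)
dvc push -r origin   # push the new outputs to the DVC remote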
Lastly, it is worth mentioning that we can access and inspect all data stored with DVC on DagsHub, together with quite a bit of metadata. You can see an example below.
The next point on our wish list for this project is experiment tracking. To do so, we will use another open-source library — MLFlow (MLFlow Tracking, to be precise). With MLFlow’s functionalities, we will log quite a lot of details about our experiments — from the experiment’s name, through the model’s hyperparameters, to the corresponding scores.
Similar to DVC, we also need a server to host MLFlow. And just as before, that is also facilitated by DagsHub. We can find the details required for authentication under the Remote tab in our DagsHub repository. After we set up the MLFlow remote, all of our experiments will be logged under the Experiments tab in our DagsHub repository.
In the script below, you can see how to implement tracking with MLFlow. It is an abbreviated version of the full training script, which excludes some not strictly relevant elements (creating data generators, defining the NN’s architecture, etc.). However, a few things about the MLFlow implementation are worth pointing out:

- we set the tracking URI and the DagsHub credentials (via environment variables) before starting the run,
- we enable MLFlow’s TensorFlow autologger, which automatically captures most of the training details for us,
- everything we want to track explicitly is logged within the run’s context (the with statement).
from config import PROCESSED_IMAGES_DIR, MODELS_DIR

import os

import tensorflow.keras
import mlflow
from dagshub import dagshub_logger

# USER_NAME and PASSWORD are defined in the omitted part of the script
mlflow.set_tracking_uri("https://dagshub.com/eryk.lewinson/mario_vs_wario_v2.mlflow")
os.environ['MLFLOW_TRACKING_USERNAME'] = USER_NAME
os.environ['MLFLOW_TRACKING_PASSWORD'] = PASSWORD

if __name__ == "__main__":

    # let MLFlow automatically log the TensorFlow/keras training details
    mlflow.tensorflow.autolog()

    IMG_SIZE = 128
    LR = 0.001
    EPOCHS = 10

    with mlflow.start_run():

        # get_datasets and get_model are defined in the omitted part of the script
        training_set, valid_set, test_set = get_datasets(validation_ratio=0.2,
                                                         target_img_size=IMG_SIZE,
                                                         batch_size=32)

        model = get_model(IMG_SIZE, LR)

        print("Training the model...")
        model.fit(training_set,
                  validation_data=valid_set,
                  epochs=EPOCHS)
        print("Training completed.")

        print("Evaluating the model...")
        test_loss, test_accuracy = model.evaluate(test_set)
        print("Evaluating completed.")

        # dagshub logger
        with dagshub_logger() as logger:
            logger.log_metrics(loss=test_loss, accuracy=test_accuracy)
            logger.log_hyperparams({
                "img_size": IMG_SIZE,
                "learning_rate": LR,
                "epochs": EPOCHS
            })

        # mlflow logger
        mlflow.log_params({
            "img_size": IMG_SIZE,
            "learning_rate": LR,
            "epochs": EPOCHS
        })
        mlflow.log_metrics({
            "test_set_loss": test_loss,
            "test_set_accuracy": test_accuracy,
        })

        print("Saving the model...")
        model.save(MODELS_DIR)
        print("done.")
As a small bonus, it is worth mentioning that there is an alternative, more lightweight way of tracking experiments — using Git and DagsHub. To do so, we have to use the dagshub library. The dagshub_logger block in the script above shows how to log some metrics and hyperparameters this way.
The dagshub logger creates two files in the project’s root directory (unless specified otherwise): metrics.csv and params.yml. These are the two files we indicated in the last step of our DVC pipeline as outputs of the train.py script. When we commit those two files to Git, DagsHub will automatically recognize them and put their values under the Experiments tab. We can clearly locate them when looking at experiments marked with the Git label in the source column.
The biggest advantage of using the dagshub client is that those experiments are fully reproducible — as long as we are using DVC to track data, we can switch to the project’s state at the time of finishing an experiment with a single git checkout. Such a thing is also possible with MLFlow, but not as simple.
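As a sketch of what that looks like in practice (the commit reference below is a placeholder), restoring an experiment’s state boils down to:

git checkout <commit-of-the-experiment>
dvc checkout        # restore the data/model files matching dvc.lock at that commit
dvc pull -r origin  # only needed if the data is not in the local DVC cache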
You can also write your own custom logger which combines the best of the two approaches to tracking experiments. You can find an example here.
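As an illustration only (not the example linked above), such a combined helper could look roughly like the sketch below — it simply forwards the same parameters and metrics to both MLFlow and the dagshub logger.

import mlflow
from dagshub import dagshub_logger


def log_experiment(params: dict, metrics: dict) -> None:
    # assumes it is called inside an active MLFlow run (mlflow.start_run())
    mlflow.log_params(params)
    mlflow.log_metrics(metrics)

    # also write metrics.csv / params.yml, which are then committed with Git
    with dagshub_logger() as logger:
        logger.log_hyperparams(params)
        logger.log_metrics(**metrics)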
That would be all regarding the implementation details of experiment tracking. Moving on to inspecting some results, the following image presents the state while an experiment called dream hookworm was still running — we can see the accuracy and loss being updated as the model is trained.
In the Experiments tab, we can mark the variants we want to compare and press the compare button.
In the following image, we see some of the hyperparameters tracked by MLFlow. As you might have guessed, we see that many hyperparameters because we are using MLFlow’s TensorFlow autologger. At the end of the list, you can also see the hyperparameter added by us manually — img_size. After that, we can see the relevant metrics.
The two analyzed experiments differ by two hyperparameters — the number of epochs and the considered image size (the size of the square image being passed into the first layer of the NN). You can also see the values of the hyperparameters and the corresponding training set accuracies in the following parallel coordinate plot.
Lastly, we can dive even deeper into analyzing various performance metrics of the experiments.
As the goal of this part was just to showcase the functionalities of MLFlow tracking, we will not spend more time analyzing the results of the experiments.
As the very last step of our project, we would like to create automated sanity checks for our data. Let’s illustrate that with a hypothetical example that could apply to our project. In video games such as the ones considered here (side-scrolling platformers), the vast majority of time is spent trying to finish some level or just exploring. We see our main character running around and doing things (mostly jumping).
However, as in all games, there are some other screens as well (menus, black/white transition or loading screens, end credits, etc.). We could argue that those should not be included in the data. So let’s imagine we manually went through images and deleted the ones we deemed unsuitable for our data sample. This could raise a question: did our actions significantly change something in the data, for example, the balance of classes? Or maybe we introduced some other bias?
That could also be very relevant when some of our data transformations involve cropping the images — we could cut the HUD (gaming lingo for heads-up display, or simply the status bar) from one class while keeping it for the other. This would lead to a classifier which simply checks whether a particular pixel has a value of X and then confidently decides which game the image comes from.
For exactly such a scenario it would be great to have some automated data sanity checks. We will show how to build those using GitHub Actions and Deepchecks. But first, we need to answer some helper questions.
GitHub Actions is a tool used for automating software workflows. For example, software engineers use GitHub Actions to automate actions such as merging branches, handling issues, running unit or application tests, etc.
However, that does not mean that they are useless for data scientists. We can use GitHub Actions for many things, including:
Some things to keep in mind about GitHub Actions:
In short, deepchecks is an open-source Python library for testing ML/DL models and data. The library can help us out with various testing and validation needs throughout our projects — we can verify the data’s integrity, inspect the distributions, confirm valid data splits (for example, the train/test split), evaluate the performance of our model, and more!
At this point, it is time to combine the two building blocks to automatically generate a data validity report every time we push some changes to the codebase.
First, we create a script generating the data validity report with deepchecks. We will use the library’s default suite (in deepchecks’ lingo, a suite corresponds to a collection of checks) used to verify the correctness of the train/test split. Currently, deepchecks provides suites for tabular and computer vision tasks; however, the company is also working on an NLP variant.
from deepchecks.vision.simple_classification_data import load_dataset
from deepchecks.vision.suites import train_test_validation

from config import PROCESSED_IMAGES_DIR

# load the train and test image folders as deepchecks VisionData objects
train_ds = load_dataset(PROCESSED_IMAGES_DIR, train=True, object_type="VisionData", image_extension="jpg")
test_ds = load_dataset(PROCESSED_IMAGES_DIR, train=False, object_type="VisionData", image_extension="jpg")

# run the default train/test validation suite and save the report to HTML
suite = train_test_validation()
result = suite.run(train_ds, test_ds)
result.save_as_html("data_validation.html")
Additionally, we could have specified an instance of a fitted model while using the run method. However, at the time of writing this article, deepchecks only supports fitted scikit-learn and PyTorch models. Unfortunately, this means that our keras CNN will not work. But this will probably change in the near future, as deepchecks is still a relatively new and constantly developed library. Running the script results in an HTML report being generated in the root directory.
Now it is time to schedule running the sanity checks. Thankfully, using GitHub Actions is very easy. We just have to add a .yaml file to the .github/workflows directory. In this .yaml file, we specify things such as the workflow’s name, the events that trigger it, and the steps of the job to execute:
name: Data validation with deepchecks
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

permissions: write-all

jobs:
  run_data_checks:
    runs-on: ubuntu-latest
    env:
      DVC_USERNAME: ${{ secrets.DVC_USERNAME }}
      DVC_PASSWORD: ${{ secrets.DVC_PASSWORD }}
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v3
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Get data
        run: |
          # dvc remote add origin https://dagshub.com/eryk.lewinson/mario_vs_wario_v2.dvc
          dvc remote modify origin --local auth basic
          dvc remote modify origin --local user $DVC_USERNAME
          dvc remote modify origin --local password $DVC_PASSWORD
          dvc pull create_train_test_split -r origin
      - name: Remove the old validation report
        run: |
          rm -f data_validation.html
      - name: Validate data with deepchecks
        run: |
          python src/validate_data.py
      - name: Commit the validation report file
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: Added the data validation file
      - name: Create a comment
        uses: peter-evans/commit-comment@v1
        with:
          body: |
            Please check the data_validation.html file for a full validation report.
The workflow executes the following steps:

- checks out the repository and sets up Python 3.10,
- installs the dependencies listed in requirements.txt,
- pulls the data using DVC,
- removes the old validation report and generates a new one with deepchecks,
- commits the new report to the repository (using the stefanzweifel/git-auto-commit-action@v4 action),
- posts a comment to the commit (using the peter-evans/commit-comment@v1 action).
After committing this file to GitHub, we will see our new workflow in the Actions tab.
We can dive deeper to see all the steps of the workflow, together with detailed logs. Those definitely come in handy when something goes wrong and we need to debug the pipeline.
After the workflow is finished, we can see that the GitHub bot posted a comment to our commit.
And now for the most interesting part — we can find the data_validation.html report in our repository. It was automatically added and committed by the GitHub bot. To have it locally as well, we just need to pull from the repo.
For brevity, we present only some parts of the data validation report. Fortunately for us, the library also generates useful comments about the checks — what they are and what to look out for.
In the first image, we can see that the classes are perfectly balanced. This should come as no surprise, given this was exactly how we defined the splits.
In the second image, we can see the distribution plot of the brightness of the images over the train and test sets. It seems the split was successful, as the two distributions are very similar. According to the documentation, it would be alarming if the Drift score (Earth Mover’s Distance) were above 0.1.
Lastly, we look at the Predictive Power Score of the images’ area. As you can see in the comments, the difference in the PPS should be no greater than 0.2. That condition is satisfied, as we have a perfect PPS of 1 for all classes and datasets. Why is that? Simply because the images from the two video games are of different sizes. The images from the Mario game are 160x144, while the Wario ones are 320x288 (double the size). This is probably due to the fact that the videos were recorded using different emulator settings (originally, the games were made for the same console, so they had the same output size). While this means that we could use the area of the image to perfectly predict the class, that is not the case in the actual model, as we resize the images while loading them using ImageDataGenerator.
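To make that last point concrete, here is a minimal sketch (paths and parameter values are illustrative, not taken from the repository) of how ImageDataGenerator resizes every image to a common target size while loading it, which removes the size difference between the two games before training:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = 128  # illustrative value

# rescale pixel values and resize every image to the same target size
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_set = datagen.flow_from_directory(
    "data/processed/train",            # illustrative path
    target_size=(IMG_SIZE, IMG_SIZE),  # every image becomes 128x128
    color_mode="grayscale",
    class_mode="binary",
    batch_size=32,
)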
We have covered quite a lot in this article already. From everything residing in a single Jupyter Notebook to a fully reproducible project with data versioning and experiment tracking. Having said that, there are still a few things we could add on top of our project. The ones that come to mind are:
In this article, we demonstrated a modern approach to creating ML/DL projects with code and data versioning, experiment tracking, and automating parts of the activities such as creating data sanity check reports. It definitely requires more work than having a single Jupyter Notebook for everything, but the extra effort quickly pays off.
At this point, I wanted to thank the DagsHub support team for helping out with some questions.
You can find the code used for this article on my GitHub or DagsHub. Also, any constructive feedback is welcome. You can reach out to me on Twitter.