Train ML Models in a Docker Container and Configure Access to On-Premise GPU Devices

Written by keviny | Published 2023/09/01


In today’s machine learning development, it is common to package the training application into a container, which is then deployed to a compute infrastructure for training. However, before distributing the container image, it is crucial to perform a test locally to ensure everything works correctly.

In this guide, I will explain how to configure your local machine to run a Docker container with access to your on-premise GPU devices. I will demonstrate the setup process on an Ubuntu 20.04 machine equipped with an Nvidia RTX 2060 GPU, CUDA version 11.8, and cuDNN version 8.6.0.

Prerequisites

  • Docker
  • Nvidia driver
  • CUDA Toolkit
  • cuDNN
  • NVIDIA Container Toolkit

Here’s a step-by-step guide to achieving this:

Install Docker

You can follow the official documentation to install Docker Desktop. This application includes Docker Engine, Docker CLI client, Docker Compose, and other tools that enable you to build and share containerized apps.
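
To confirm that Docker is working before moving on, you can check the installed version and run the standard hello-world smoke test:

docker --version
sudo docker run hello-world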

Install Nvidia Driver

Before installing the Nvidia driver, ensure that the driver version is compatible with the CUDA Toolkit you intend to install. You can check the official documentation for information on compatibility. To determine the required version of the CUDA Toolkit, refer to the machine learning framework you will be using. More details are provided in the CUDA Toolkit section below.

After that, you can proceed to the driver download page, where you should specify your machine’s specifications. Once you have entered the details, click the “Search” button to initiate the driver download.
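
Once the driver is installed, it is worth running a quick sanity check. The nvidia-smi utility ships with the driver and should list your GPU model and the driver version:

nvidia-smi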

Install CUDA Toolkit

On an Ubuntu machine, it is advisable to install the necessary system packages, such as "build-essential", before proceeding with the CUDA Toolkit installation.

sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev

Before installing the CUDA Toolkit, make sure to double-check the required versions of cuDNN and CUDA Toolkit specified by the machine learning framework you intend to use. The versions used in this guide follow the requirements of the TensorFlow library: TensorFlow 2.12, for example, expects CUDA 11.8 and cuDNN 8.6.

After verifying the version requirements, proceed to the download page and select your machine’s specifications. Once you have selected the appropriate settings, click the “deb(network)” button to obtain the script for installing the CUDA Toolkit.

At this point, you need to modify the installation script to specify the version of the CUDA Toolkit you want to download. Below is the original script you are likely to receive after specifying your machine specifications:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

Below is an example of installing CUDA Toolkit 11.8. You will need to change the last line by appending the version number to cuda. Note that the apt package name separates the major and minor versions with a dash (cuda-11-8, not cuda-11.8).

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-11-8

Afterward, you will need to set your PATH and LD_LIBRARY_PATH to point to the CUDA Toolkit that you just installed, which, in this case, is cuda-11.8. If you are installing a different version, be sure to update it accordingly to the corresponding version. This will ensure that your system can locate and use the installed CUDA Toolkit correctly.

echo 'export PATH=/usr/local/cuda-11.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

Once you have completed these steps and set up the environment variables, it is essential to reboot the machine. Rebooting ensures that all the changes and configurations related to the CUDA Toolkit and environment variables take effect. After the reboot, your machine should be ready to utilize the installed CUDA Toolkit and GPU for machine learning tasks.
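
After the reboot, you can confirm that the toolkit is picked up from your PATH by querying the compiler version. With the setup in this guide, the output should report release 11.8:

nvcc --version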

Install cuDNN

The installation of cuDNN is relatively straightforward, involving copying specific files to the CUDA Toolkit’s include and lib64 directories. To download cuDNN, you can visit the cuDNN Archive page. Ensure that the cuDNN version you download matches the one specified by your machine learning framework.

To download the cuDNN package, obtain it as a tar file and extract its contents once the download is complete. After extraction, run the following script to copy the necessary files into the appropriate CUDA Toolkit directories. Make sure that the specified path points to the correct CUDA Toolkit installation directory. Note that in cuDNN 8.x the version macros are split across several headers, so copy all cudnn*.h files rather than cudnn.h alone; also, depending on the archive, the extracted library directory may be named lib rather than lib64.

sudo cp -P <extracted_cudnn_path>/include/cudnn*.h /usr/local/cuda-11.8/include
sudo cp -P <extracted_cudnn_path>/lib64/libcudnn* /usr/local/cuda-11.8/lib64/
sudo chmod a+r /usr/local/cuda-11.8/include/cudnn*.h /usr/local/cuda-11.8/lib64/libcudnn*

Now, you have both CUDA Toolkit and cuDNN installed.
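
As a quick check that the copy succeeded, you can read the cuDNN version macros directly from the installed headers. In cuDNN 8.x, these live in cudnn_version.h:

grep -E 'CUDNN_(MAJOR|MINOR|PATCHLEVEL)' /usr/local/cuda-11.8/include/cudnn_version.h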

Install Nvidia Container Toolkit

To configure your Docker container to utilize the on-premise GPU devices, you need to set up the Nvidia Container Toolkit. If you are using a container engine other than Docker, the official guide covers those setups as well.

To install the NVIDIA Container Toolkit, first set up the package repository and the GPG key:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Then, install the toolkit package. Installing the full nvidia-container-toolkit package (rather than only the base package) pulls in the container runtime needed for the --runtime=nvidia flag used later:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

To validate your installation, run the following command:

nvidia-ctk --version

Then, you will need to configure the Docker daemon to recognize the NVIDIA Container Runtime. The following command updates the Docker daemon configuration file (/etc/docker/daemon.json) for you:

sudo nvidia-ctk runtime configure --runtime=docker
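
If the command succeeded, /etc/docker/daemon.json should now contain a runtime entry similar to the following:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}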

Finally, to restart your docker daemon, run the following command:

sudo systemctl restart docker

Test the Setup

After completing the configuration of the Nvidia Container Toolkit and Docker, you can test your setup by running a base CUDA container. Below are the Dockerfile for my training application and the command I use to verify GPU access:

Dockerfile

# Base image ships the CUDA 11.8 and cuDNN 8 runtime libraries
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04

# Install common build and development tooling.
# Note: nvidia-driver-525 is generally unnecessary inside the container,
# since the NVIDIA Container Toolkit mounts the host driver at runtime.
RUN apt-get update --yes --quiet && DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
    software-properties-common \
    build-essential apt-utils \
    wget curl vim git ca-certificates kmod \
    nvidia-driver-525 \
    && rm -rf /var/lib/apt/lists/*

# Install Python 3.10 from the deadsnakes PPA
RUN add-apt-repository --yes ppa:deadsnakes/ppa && apt-get update --yes --quiet
RUN DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
    python3.10 \
    python3.10-dev \
    python3.10-distutils \
    python3.10-lib2to3 \
    python3.10-gdbm \
    python3.10-tk \
    pip

# Make Python 3.10 the default python3 and expose it as python
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 999 \
    && update-alternatives --config python3 && ln -s /usr/bin/python3 /usr/bin/python

# Bootstrap the latest pip for Python 3.10
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10

# Copy the training script and its dependency list into the image
COPY requirements.txt /requirements.txt
COPY finetune.py /finetune.py

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install --no-cache-dir -r /requirements.txt

# Run the fine-tuning script on container start
ENTRYPOINT [ "python3", "finetune.py" ]
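
The contents of requirements.txt depend on your training code and are not shown here. As an illustrative sketch only, a TensorFlow project targeting the CUDA 11.8 / cuDNN 8.6 stack installed above could pin a matching release:

tensorflow==2.12.*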

Run the following command to verify that a container can access the GPU. It runs nvidia-smi inside a base CUDA image, and the output should match what you see on the host.

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
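
If the nvidia-smi table appears, GPU passthrough is working. You can then build and run the training image defined by the Dockerfile above; the tag finetune-app below is just an example name:

sudo docker build -t finetune-app .
sudo docker run --rm --gpus all finetune-app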

