In today’s machine learning development, it is common to package the training application into a container, which is then deployed to a compute infrastructure for training. However, before distributing the container image, it is crucial to perform a test locally to ensure everything works correctly.
In this guide, I will explain how to configure your local machine to run a Docker container with access to your on-premise GPU devices. I will demonstrate the setup process on an Ubuntu 20.04 machine equipped with an Nvidia RTX 2060 GPU, CUDA version 11.8, and cuDNN version 8.6.0.
Here’s a step-by-step guide to achieving this:
You can follow the
Before installing the Nvidia driver, ensure that the driver version is compatible with the CUDA Toolkit you intend to install. You can check the
After that, you can proceed to the
On an Ubuntu machine, it is advisable to install the necessary system packages, such as build-essential, before proceeding with the CUDA Toolkit installation.
sudo apt-get install g++ freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev
Before installing the CUDA Toolkit, double-check the cuDNN and CUDA Toolkit versions required by the machine learning framework you intend to use. The versions used in this guide, CUDA Toolkit 11.8 and cuDNN 8.6, follow the requirements of the TensorFlow library. If you use a different framework or release, make sure its version requirements are met instead.
After verifying the version requirements, proceed to the
At this point, you need to modify the installation script to specify the version of the CUDA Toolkit you want to download. Below is the original script you are likely to receive after selecting your machine specifications:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
Below is an example of installing CUDA Toolkit 11.8. You will need to change the last line by appending the version number to cuda, written with a hyphen instead of a dot (cuda-11-8), which is how the versioned packages are named in NVIDIA's apt repository.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-11-8
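If you switch CUDA versions often, it can help to derive the package name from the version number in one place. The helper below is a minimal sketch: cuda_package_name is a hypothetical name of my own, and it assumes the cuda-X-Y naming convention of NVIDIA's apt repository.

```shell
# Hypothetical helper: turn a CUDA version such as 11.8 into the
# versioned apt meta-package name (cuda-11-8).
cuda_package_name() {
  printf 'cuda-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Usage (not executed here):
#   sudo apt-get -y install "$(cuda_package_name 11.8)"
```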
Afterward, you will need to set your PATH and LD_LIBRARY_PATH to point to the CUDA Toolkit that you just installed, which, in this case, is cuda-11.8. If you are installing a different version, update the paths to match. This ensures that your system can locate and use the installed CUDA Toolkit.
echo 'export PATH=/usr/local/cuda-11.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
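Appending with echo adds a new line to ~/.bashrc every time the commands are run again. As a sketch of one way to avoid duplicates, the hypothetical append_once helper below (my own name, not part of any tool) writes a line only when that exact line is not already in the file.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so re-running the setup does not duplicate it.
append_once() {
  # $1 = line to add, $2 = target file
  touch "$2"
  grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Usage (not executed here):
#   append_once 'export PATH=/usr/local/cuda-11.8/bin:$PATH' ~/.bashrc
#   source ~/.bashrc   # apply the change to the current shell
```

The same call works for the LD_LIBRARY_PATH line; only the quoted string changes.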
Once you have completed these steps and set up the environment variables, reboot the machine so that the newly installed Nvidia driver is loaded. The environment variables themselves take effect in every new shell session. After the reboot, your machine should be ready to use the installed CUDA Toolkit and GPU for machine learning tasks.
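After the reboot, it is worth confirming that the toolkit and driver are the versions you expect. The sketch below defines two small helpers of my own (cuda_release and version_ge are hypothetical names): one extracts the release number from the output of nvcc --version, the other compares two version strings. The 520.61.05 value in the usage comment is the minimum Linux driver that, to my knowledge, NVIDIA's release notes list for CUDA 11.8; treat it as an assumption and confirm it against the official compatibility table.

```shell
# Extract the release number (e.g. 11.8) from the output of
# nvcc --version, read on standard input.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on sort -V (GNU coreutils), which is available on Ubuntu.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (not executed here):
#   nvcc --version | cuda_release
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   version_ge "$driver" 520.61.05 && echo "driver is new enough for CUDA 11.8"
```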
The installation of cuDNN is relatively straightforward, involving copying specific files to the CUDA Toolkit’s include and lib64 directories. To download cuDNN, you can visit the
Download the cuDNN package as a tar file and extract its contents once the download is complete. After extraction, run the following commands to copy the necessary files into the appropriate CUDA Toolkit directories. cuDNN 8 splits its API across several header files, so copy every cudnn*.h file rather than cudnn.h alone. Recent cuDNN archives place the libraries under lib; if your archive has a lib64 directory instead, adjust the source path. Make sure that the destination path points to the correct CUDA Toolkit installation directory.
sudo cp -P <extracted_cudnn_path>/include/cudnn*.h /usr/local/cuda-11.8/include
sudo cp -P <extracted_cudnn_path>/lib/libcudnn* /usr/local/cuda-11.8/lib64/
sudo chmod a+r /usr/local/cuda-11.8/include/cudnn*.h /usr/local/cuda-11.8/lib64/libcudnn*
Now, you have both CUDA Toolkit and cuDNN installed.
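To confirm which cuDNN version ended up on disk, you can read the version macros from the header. This is a sketch under two assumptions: that your cuDNN release ships a cudnn_version.h file with CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL macros (cuDNN 8 does), and that you point the helper at a copy of that file, either in the extracted archive or in the CUDA include directory. The function name cudnn_version is my own.

```shell
# Print the cuDNN version (major.minor.patch) found in a cudnn_version.h file.
# Exits with a non-zero status if the macros are not present.
cudnn_version() {
  # $1 = path to cudnn_version.h
  awk '
    $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
    $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
    $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
    END {
      if (major == "") exit 1
      printf "%s.%s.%s\n", major, minor, patch
    }
  ' "$1"
}

# Usage (not executed here):
#   cudnn_version /usr/local/cuda-11.8/include/cudnn_version.h
```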
To configure your Docker container to utilize the on-premise GPU devices, you need to set up the Nvidia Container Toolkit. If you are not using Docker, you can follow the
To set up the NVIDIA Container Toolkit, first add NVIDIA's package repository and its GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Then refresh the package index and install the toolkit. For Docker, install the full nvidia-container-toolkit package, which also provides the nvidia-ctk command used in the next steps:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
To validate your installation, run the following command:
nvidia-ctk --version
Then, configure the Docker daemon to recognize the NVIDIA Container Runtime. The following command updates the Docker daemon configuration file (/etc/docker/daemon.json) for you:
sudo nvidia-ctk runtime configure --runtime=docker
Finally, restart the Docker daemon:
sudo systemctl restart docker
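Before testing with a container, you can check that the runtime was actually registered. The sketch below assumes the configuration lives in /etc/docker/daemon.json and simply looks for an "nvidia" entry in it; has_nvidia_runtime is a hypothetical helper name, and the text match is a quick sanity check rather than a full JSON validation.

```shell
# Succeed if the given Docker daemon config mentions an "nvidia" runtime.
# Defaults to /etc/docker/daemon.json, Docker's standard location on Linux.
has_nvidia_runtime() {
  grep -q '"nvidia"' "${1:-/etc/docker/daemon.json}"
}

# Usage (not executed here):
#   has_nvidia_runtime && echo "nvidia runtime is configured"
#   sudo docker info | grep -i runtimes
```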
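After completing the configuration of the Nvidia Container Toolkit and Docker, you can test your setup by running a base CUDA container. Below is the Dockerfile I use for my training image, followed by a command that runs nvidia-smi in a stock CUDA base image to confirm that containers can see the GPU: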
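# CUDA 11.8 runtime image with cuDNN 8, matching the versions used in this guide
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04
# System utilities and build tools. The NVIDIA Container Toolkit mounts the host's
# driver libraries at run time, so nvidia-driver-525 is normally not needed in the image.
RUN apt-get update --yes --quiet && DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
    software-properties-common \
    build-essential apt-utils \
    wget curl vim git ca-certificates kmod \
    nvidia-driver-525 \
    && rm -rf /var/lib/apt/lists/*
# Python 3.10 from the deadsnakes PPA (Ubuntu 20.04 ships Python 3.8)
RUN add-apt-repository --yes ppa:deadsnakes/ppa && apt-get update --yes --quiet
RUN DEBIAN_FRONTEND=noninteractive apt-get install --yes --quiet --no-install-recommends \
    python3.10 \
    python3.10-dev \
    python3.10-distutils \
    python3.10-lib2to3 \
    python3.10-gdbm \
    python3.10-tk
# Make python3 and python resolve to Python 3.10 (--set avoids the interactive prompt of --config)
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 999 \
    && update-alternatives --set python3 /usr/bin/python3.10 && ln -s /usr/bin/python3 /usr/bin/python
# Install pip for Python 3.10 (pip is not installed through apt above)
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
COPY requirements.txt /requirements.txt
COPY finetune.py /finetune.py
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install --no-cache-dir -r /requirements.txt
# Absolute path, so the entrypoint does not depend on the working directory
ENTRYPOINT [ "python3", "/finetune.py" ]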
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
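Once nvidia-smi prints your GPU from inside the container, the same flags apply to the training image built from the Dockerfile above. The sketch below assumes the Dockerfile, requirements.txt, and finetune.py sit in the current directory; the image tag finetune-gpu:local and the helper names are hypothetical choices of mine. Setting DRY_RUN=1 prints the commands without running them.

```shell
# Hypothetical image tag; any name works.
IMAGE_TAG="${IMAGE_TAG:-finetune-gpu:local}"

# Print a command, and execute it unless DRY_RUN=1 is set.
run_cmd() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Build the image from the Dockerfile in the current directory, then start
# the training entrypoint with every GPU visible to the container.
build_and_train() {
  run_cmd sudo docker build -t "$IMAGE_TAG" . &&
    run_cmd sudo docker run --rm --gpus all "$IMAGE_TAG"
}

# Usage (not executed here):
#   DRY_RUN=1 build_and_train   # only print the commands
#   build_and_train             # build the image and run the training script
```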