Setting up your GPU machine to be Deep Learning ready

Written by saurabhbodhe | Published 2018/07/07
Tech Story Tags: machine-learning | deep-learning | gpu | nvidia | gpu-machine


Hi there,

This tutorial is a loose continuation of my previous article, so do take a look at it first:

GPU-enabled Deep Learning with Google Cloud Platform: "I know, high end deep learning GPU-enabled systems are hell expensive to build and not easily available unless you are…" (hackernoon.com)

This is written assuming you have a bare machine with a GPU available; feel free to skip some parts if yours came partially set up. I’ll also assume you have an NVIDIA card. We’ll only cover setting up TensorFlow in this tutorial, as it is the most popular Deep Learning framework (kudos to Google!).

Installing the CUDA drivers

CUDA is a parallel computing platform by NVIDIA, and a basic prerequisite for TensorFlow. But as we will see shortly, it is actually better to work in reverse, so let’s come back to this part later.

Installing TensorFlow

Fire up your terminal (or SSH in, if it is a remote machine). Find the version of TensorFlow you need for your particular application, if any. If you have no such restriction, let’s just go for TensorFlow 1.8.0, which I currently use.

pip install tensorflow-gpu==1.8.0

Let it install. Now move to a Python shell by running:

python

In your Python shell, type in:

import tensorflow as tf

At this point, since we haven’t installed CUDA yet, you should see an error similar to this:

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

The version (9.0 here) and the filename may differ depending on the version of TensorFlow you chose. The whole point of working in reverse was to find out which version of CUDA we need, which is 9.0 in this case. The official documentation is not clear on which CUDA version corresponds to which TF version, so I have always found this reverse-engineering method better.
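The same reading of the error message can be done in code. This is only a minimal sketch and not part of TensorFlow: the helper name `missing_library` and the regular expression are my own assumptions, based on the format of the error shown above.

```python
import re

# Matches versioned shared-object names such as "libcublas.so.9.0".
# The pattern is an assumption based on the error format shown above.
_SO_PATTERN = re.compile(
    r"lib(?P<name>[A-Za-z0-9_]+)\.so\.(?P<version>\d+(?:\.\d+)*)"
)


def missing_library(error_message):
    """Return (library_name, version) parsed from an ImportError message,
    or None if the message does not mention a versioned .so file."""
    match = _SO_PATTERN.search(error_message)
    if match is None:
        return None
    return match.group("name"), match.group("version")


# The message below is the one from this tutorial.
message = ("libcublas.so.9.0: cannot open shared object file: "
           "No such file or directory")
print(missing_library(message))
```

In practice you would pass it `str(err)` from an `except ImportError as err:` block wrapped around `import tensorflow`.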

Let’s go back to installing CUDA.

Use

exit()

to exit the Python shell.

Installing the CUDA drivers (this time we’ll really do it, promise)

Navigate to https://developer.nvidia.com/cuda-toolkit-archive and choose the version you just determined above. Then select the following options:

Linux -> x86_64 -> Ubuntu -> 16.04 (or 17.04) -> deb (network)

Download the deb to your machine, and follow the instructions given on the NVIDIA page to install CUDA. Once that completes, let’s check if everything went well.

Reopen the Python shell and run:

import tensorflow as tf

We are not done yet: you should now see a slightly different error message. (If you see the same one as earlier, refer to “Troubleshooting” below.)

ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory

We need one more NVIDIA library called cuDNN, which provides GPU acceleration for deep neural networks. Again, note the version of cuDNN we need: 7.x in this case (the error message only names the major version).

Navigate to https://developer.nvidia.com/cudnn and register for an account (it’s free). After you have made your account, log in and go to https://developer.nvidia.com/rdp/cudnn-archive

Choose the required cuDNN version, making sure you pick the build made for your CUDA version. In this case we need:

Download cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0

and in the drop-down, choose:

cuDNN v7.0.5 Library for Linux

The tgz file will start downloading. Move it to your machine and extract it using:

tar -xzvf <CUDNN_TAR_FILENAME>

A folder named “cuda” will be extracted. cd into that directory and execute both of these:

sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/

And we are done (hopefully). Again fire up the Python shell, and you know what to do.

If it doesn’t throw any error this time around, we are good.
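If the import still fails, it can help to test the two libraries directly, without going through TensorFlow. Here is a minimal sketch using `ctypes` from the Python standard library. `can_load` is a hypothetical helper name of my own, and the library names are simply the ones from the two error messages above, so adjust them to whatever your TensorFlow version asked for.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic linker can find and load the library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False


# Names taken from the ImportError messages seen earlier in this tutorial.
for name in ("libcublas.so.9.0", "libcudnn.so.7"):
    print(name, "OK" if can_load(name) else "NOT FOUND")
```

A "NOT FOUND" line here points to the same problem TensorFlow would hit: the file is either missing or not on the linker's search path.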

Just to make sure our GPU is being detected by TensorFlow, run this in the same Python shell:

tf.test.gpu_device_name()

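It should return the name of a GPU device (something like '/device:GPU:0'), or an empty string if TensorFlow found none. This may not work on older versions of TF. If so, run the following in your terminal instead, which at least confirms that the driver can see your GPUs: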

nvidia-smi
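If you would rather run the nvidia-smi check from Python (for example inside a setup script), a minimal sketch could look like the one below. The helper names `parse_gpu_list` and `list_gpus` are my own. The sketch assumes nvidia-smi's CSV query mode and only tells you what the driver sees, not what TensorFlow sees.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Turn 'name, memory' lines into a list of (name, memory) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus():
    """Ask nvidia-smi for GPU names and total memory.

    Returns an empty list if nvidia-smi is missing or exits with an error.
    """
    command = ["nvidia-smi",
               "--query-gpu=name,memory.total",
               "--format=csv,noheader"]
    try:
        output = subprocess.check_output(command, universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_list(output)


print(list_gpus())
```

An empty list means either that nvidia-smi is not installed or that it could not talk to the driver.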

Troubleshooting

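If you did everything correctly and TensorFlow still throws the same error even though nvidia-smi works, it is probably an issue with the environment path variables. Execute these to fix it (they only apply to the current shell session, so add them to your ~/.bashrc to make them permanent):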

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
export PATH=/usr/local/cuda/bin:${PATH}

If even the nvidia-smi command does not work, the NVIDIA driver and CUDA were not installed properly. Go through the installation steps again in case you missed something.
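To confirm that the path fix above actually took effect in the environment your Python process sees, you can inspect the variable from Python. This is a minimal sketch. The helper name `dir_in_path_variable` is hypothetical, and the directory is the one used in the export line above.

```python
import os


def dir_in_path_variable(directory, variable_value):
    """Return True if `directory` is one of the entries of a path variable
    whose entries are separated by os.pathsep (':' on Linux)."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in variable_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


# Same directory as in the LD_LIBRARY_PATH export line above.
cuda_lib_dir = "/usr/local/cuda/lib64"
current_value = os.environ.get("LD_LIBRARY_PATH", "")
print(dir_in_path_variable(cuda_lib_dir, current_value))
```

If this prints False in the shell where TensorFlow fails, the export has not reached that shell, for example because it was run in a different terminal session.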

Let me know about any issues, suggestions or criticism.

Cheers.

Saurabh is an undergraduate Computer Science major at National Institute of Technology, Warangal, India and currently a research intern at Indian Institute of Science, Bangalore.


Published by HackerNoon on 2018/07/07