A big question for developers of Machine Learning and Deep Learning applications is whether or not to use a computer with a GPU; after all, GPUs are still very expensive. To give you an idea, a typical GPU for AI processing costs, in Brazil, between US$ 1,000.00 and US$ 7,000.00 (or more).
The purpose of this tutorial is to demonstrate the need for a GPU in Deep Learning processing, and to show that you can use Java for this, without writing any C++!
If you don't want to invest in a CUDA GPU, Amazon offers instances suited to GPU processing. Look at the price comparison between CPU-only and GPU configurations:
Instance     vCPUs  GPUs  RAM      Hourly price (US$)
c5.2xlarge       8     0  16 GiB    0.34
p3.2xlarge       8     1  61 GiB    3.06
p3.8xlarge      32     4  244 GiB  12.24
Anyone who has ever trained a Machine Learning or Deep Learning model knows that using a GPU can cut training time from days or hours down to minutes or seconds, right?
But is it really necessary? Can't we just use a cluster of cheap machines, as we do with Big Data?
The simplest and most direct answer is: YES, GPUs are needed to train models, and nothing will replace them. However, you have to program properly to get the most out of the GPU, and not all libraries and frameworks do this efficiently.
Let's start with an analogy, adapted from one I saw in a data science training course and really liked.
Imagine a huge motorcycle, like... 1000 cc... I don't know, a Kawasaki. It's a very fast bike, right? Now imagine that you have 8 of these bikes and you want to deliver pizza. Each motorcycle can take one order to a customer, so if there are more than 8 orders, someone will have to wait for one of the bikes to become available.
This is how the CPU works: very fast and focused on sequential processing. Each core is a very fast bike. Of course, you can arrange for each motorcycle to deliver more than one pizza per trip, but it is still sequential processing: deliver one pizza, then the next, and so on.
Now suppose you have 2,000 bikes and 2,000 delivery people. Although each of these bikes is much slower, you have far more of them and can deliver many orders at once, avoiding queues. The slowness of each bike is compensated by the parallelism.
GPU is parallel processing oriented!
If we compare the processing time of a single task, the CPU wins; but when we consider parallelism and overall throughput, the GPU is unbeatable. That is why it is used for computation-intensive tasks such as cryptocurrency mining and Deep Learning.
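The latency-versus-throughput trade-off can be sketched in plain Java (a minimal illustration; the class and method names here are my own, not part of the tutorial's project): a batch of independent tasks finishes sooner when spread across cores, even though no single task gets any faster.

```java
import java.util.stream.LongStream;

public class ThroughputDemo {
    // One independent "pizza delivery": a fixed amount of busy work.
    static long work(long n) {
        long acc = 0;
        for (long i = 0; i < 100_000; i++) acc += (n + i) % 7;
        return acc;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long seq = LongStream.range(0, 64).map(ThroughputDemo::work).sum();
        long t1 = System.nanoTime();
        long par = LongStream.range(0, 64).parallel().map(ThroughputDemo::work).sum();
        long t2 = System.nanoTime();

        System.out.println("Sequential: " + (t1 - t0) + " ns");
        System.out.println("Parallel:   " + (t2 - t1) + " ns");
        // Same total either way; only the wall-clock time changes.
        System.out.println("Same result: " + (seq == par)); // prints "Same result: true"
    }
}
```

On a multi-core machine the parallel batch typically finishes several times sooner, which is exactly the GPU idea taken to the extreme: thousands of slow "bikes" instead of eight fast ones.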
Programming for the GPU is not simple. To start, you have to consider that there is more than one GPU vendor, and that there are two well-known programming frameworks: CUDA (NVIDIA-specific) and OpenCL (an open standard).
The CUDA programming interface is written in C, but there are bindings for Python, such as PyCUDA, and for Java, such as JCuda. They are, however, somewhat harder to learn and program with.
And you need to understand the CUDA platform well, along with its individual components, such as cuDNN and cuBLAS.
However, there are easier and more interesting alternatives that use the GPU, such as Deeplearning4j and its associated project, ND4J. ND4J is like Java's NumPy, on steroids! It lets you use the available GPU(s) in a simple and practical way, and that is what we will use in this tutorial.
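As a quick taste of the ND4J API (a minimal sketch; the class name is mine), here is the NumPy-style usage the library offers. The same code runs on the CPU or the GPU, depending only on which ND4J backend is on the classpath:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class Nd4jHello {
    public static void main(String[] args) {
        // 2x2 matrix, filled row by row
        INDArray a = Nd4j.create(new double[]{1, 2, 3, 4}, new int[]{2, 2});
        INDArray b = Nd4j.eye(2);   // 2x2 identity matrix
        INDArray c = a.mmul(b);     // matrix product; dispatched to the GPU with a CUDA backend
        System.out.println(c);      // same values as a (product with the identity)
    }
}
```

Note that no CUDA call appears in the code: the backend choice is purely a build-time decision.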
You must have an NVIDIA GPU in your machine, with the appropriate drivers installed. Find out which GPU you have, make sure the correct NVIDIA driver is installed, and then install the CUDA Toolkit. If everything is correct, you can run the command below:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce MX110 On | 00000000:01:00.0 Off | N/A |
| N/A 50C P0 N/A / N/A | 666MiB / 2004MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1078 G /usr/lib/xorg/Xorg 302MiB |
| 0 1979 G /usr/bin/gnome-shell 125MiB |
| 0 2422 G ...quest-channel-token=7684432457067004974 108MiB |
| 0 19488 G ...-token=7864D1BD51E7DFBD5D19F40F0E37669D 47MiB |
| 0 20879 G ...-token=8B052333281BD2F7FF0CBFF6F185BA98 1MiB |
| 0 24967 G ...-token=62FCB4B2D2AE1DC66B4AF1A0693122BE 40MiB |
| 0 25379 G ...equest-channel-token=587023958284958671 35MiB |
+-----------------------------------------------------------------------------+
What does an AI job, i.e. Deep Learning, actually compute? It is based on two complex mathematical operations: the feedforward pass (chains of matrix multiplications) and backpropagation (partial derivatives of the error with respect to every weight). In summary: vector calculations and derivatives over many values simultaneously.
That is why GPUs are needed for development and training, and even for inference, depending on the complexity of the model.
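To make that cost concrete, here is what a single matrix product looks like in plain sequential Java (an illustrative sketch of my own, not the tutorial's code): three nested loops, i.e. rows × columns × inner-dimension multiply-adds, all on one core.

```java
import java.util.Arrays;

public class NaiveMatMul {
    // Plain triple-loop product. For the 500x300 * 300x400 case used later,
    // this is 500 * 400 * 300 = 60 million multiply-adds, one after the other.
    static double[][] mmul(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                double s = 0;
                for (int p = 0; p < k; p++) s += a[i][p] * b[p][j];
                c[i][j] = s;
            }
        return c;
    }

    public static void main(String[] args) {
        double[][] c = mmul(new double[][]{{1, 2}, {3, 4}},
                            new double[][]{{5, 6}, {7, 8}});
        System.out.println(Arrays.deepToString(c)); // prints [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

Every one of those 60 million cell contributions is independent of the others, which is precisely what makes the operation a perfect fit for a massively parallel device.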
The project for this tutorial is a Java application that performs matrix multiplication, a common operation in Deep Learning jobs. It multiplies the matrices only once, first on the CPU, then on the GPU (using ND4J and the CUDA Toolkit). Note that this is not even a Machine Learning model, just a single basic operation.
The pom.xml file configures ND4J to use the GPU through the CUDA platform:
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-10.1</artifactId>
    <version>1.0.0-beta4</version>
</dependency>
The main class, MatMul, is a simple application that defines two matrices and calculates their product, first on the CPU, then on the GPU, using ND4J.
I'm working with two matrices of 500 x 300 and 300 x 400, nothing much for a typical neural network.
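I'm not reproducing the project's exact source here, but the GPU side of MatMul can be sketched roughly like this with ND4J (the variable names and the warm-up step are my own additions):

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MatMulSketch {
    public static void main(String[] args) {
        INDArray a = Nd4j.rand(500, 300);  // random 500x300 matrix
        INDArray b = Nd4j.rand(300, 400);  // random 300x400 matrix

        // Warm-up: the first call pays one-time JIT and CUDA-context costs
        a.mmul(b);

        long t0 = System.nanoTime();
        INDArray c = a.mmul(b);            // runs on the GPU with the nd4j-cuda backend
        long t1 = System.nanoTime();

        System.out.println("Result shape: " + java.util.Arrays.toString(c.shape()));
        System.out.println("GPU (nanoseconds): " + (t1 - t0));
    }
}
```

The warm-up call matters for a fair measurement: without it, the GPU timing would include driver and context initialization, which happens only once per JVM.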
My laptop has an 8th-generation Intel i7 and an NVIDIA MX110 chip, which is very entry level: 256 CUDA cores and compute capability 5.0, that is, nothing much... A K80 card, by comparison, has almost 5,000 CUDA cores.
Let's see the application execution:
CPU Interativo (nanossegundos): 111.203.589
...
1759 [main] INFO org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner - Device Name: [GeForce MX110]; CC: [5.0]; Total/free memory: [2101870592]
GPU Paralelo (nanossegundos): 9.905.426
Percentual de tempo no cálculo da GPU com relação à CPU: 8.907469704057842
OK, the application's output is still in Portuguese, so here is a quick translation: "CPU Interativo (nanossegundos)" is the CPU time in nanoseconds, "GPU Paralelo (nanossegundos)" is the GPU time in nanoseconds, and the last line is the GPU time as a percentage of the CPU time.
Even using an entry-level GPU like mine, the matrix product on the GPU took only 8.9% of the time it took on the CPU. An abysmal difference.
Considering that the matrix product is only ONE operation, and that a feedforward pass performs it thousands of times, it is reasonable to believe the difference would be even greater if we were actually training a neural network.
And there is no point in clustering or RDMA: nothing, NOTHING, matches the performance of a single GPU on this kind of workload.
Well, I hope I have demonstrated two things here: that a GPU is essential, and that we can use it directly from a Java application. If you want, you can even convert that MLP model we built to run on the GPU and impress your boss (or boyfriend/girlfriend).