A big question for Machine Learning and Deep Learning application developers is whether or not to use a computer with a GPU, since, after all, GPUs are still very expensive. To give you an idea, a typical GPU for AI processing in Brazil costs between US$ 1,000.00 and US$ 7,000.00 (or more).

The purpose of this tutorial is to demonstrate the need for a GPU in Deep Learning processing, and I'll show you that you can use Java for this, without C++!

If you don't want to invest in a GPU, Amazon has instances suitable for CUDA GPU processing. Look at the price comparison of these configurations:

    Instance     vCPUs   GPUs   RAM       Hourly price (US$)
    c5.2xlarge   8       0      16 GiB    0.34
    p3.2xlarge   8       1      61 GiB    3.06
    p3.8xlarge   32      4      244 GiB   12.24

Anyone who has ever trained a Machine or Deep Learning model knows that using a GPU can cut training time from days or hours down to minutes or seconds, right? But is it really necessary? Can't we use a cluster of cheap machines, as we do with Big Data?

The simplest and most direct answer is: YES, GPUs are needed to train models, and nothing will replace them. However, you have to program properly to get the best out of the GPU, and not all libraries and frameworks do this efficiently.

How the GPU works

Let's start with an analogy, adapted from one I saw in a data science training course, and which I really liked. Imagine a huge motorcycle, 1000 cc, I don't know... a Kawasaki. It's a very fast bike, right? Now imagine that you have 8 of these bikes and you want to deliver pizza. Each motorcycle can take one order to a customer, so if there are more than 8 orders, someone will have to wait for one of the bikes to become available.

This is how the CPU works: very fast and focused on sequential processing. Each core is a very fast bike. Of course, you can adapt it so that each motorcycle delivers more than one pizza at a time, but in any case it will still be sequential processing: deliver one pizza, then the next, and so on.
Now imagine you have 2,000 bikes and 2,000 delivery people. Although each bike is much slower, you have far more of them and can deliver many orders at once, avoiding queues. The slowness of each bike is compensated by the parallelism.

The GPU is oriented toward parallel processing! If we compare the processing time of a single task, the CPU wins, but if we consider the parallelism, in overall throughput, the GPU is unbeatable. That is why it is used for computation-intensive tasks such as virtual currency mining and Deep Learning.

How can we program for the GPU

Programming for the GPU is not simple. To start, consider that there is more than one GPU vendor and that there are two well-known programming frameworks:

- CUDA: Compute Unified Device Architecture, for Nvidia chips;
- OpenCL: used in GPUs from other vendors, such as AMD.

The CUDA programming interface is written in C, but there are bindings for Python, like PyCuda, and for Java, like JCuda. They are, however, a little harder to learn and program with, and you need to understand the CUDA platform well, including individual components such as cuDNN and cuBLAS.

However, there are easier and more interesting alternatives that use the GPU, such as Deeplearning4J and its associated project, ND4J. ND4J is like the numpy of Java, only on steroids! It lets you use the available GPU(s) in a simple and practical way, and that is what we will use in this tutorial.

First of all

You must have an Nvidia GPU in your machine, with the appropriate drivers installed. Find out which GPU you have. Then, make sure you have installed the correct Nvidia driver. Finally, install the CUDA Toolkit.
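Before we get to the GPU itself, the throughput idea behind the bikes analogy can be sketched in plain Java with parallel streams. This runs only on CPU cores, of course, but it illustrates the same principle of data parallelism that the GPU pushes to thousands of cores; the class and method names are illustrative, not part of the tutorial's project.

```java
import java.util.stream.IntStream;

public class PizzaThroughput {

    // Simulate one "delivery": a small, independent unit of work.
    static double deliver(int order) {
        double x = order;
        for (int i = 0; i < 50_000; i++) {
            x = Math.sqrt(x + i); // arbitrary busy-work
        }
        return x;
    }

    public static void main(String[] args) {
        int orders = 2_000;

        // One bike at a time: each delivery waits for the previous one.
        long t0 = System.nanoTime();
        double seq = IntStream.range(0, orders)
                .mapToDouble(PizzaThroughput::deliver).sum();
        long sequentialNs = System.nanoTime() - t0;

        // Many bikes at once: deliveries are split across CPU cores.
        t0 = System.nanoTime();
        double par = IntStream.range(0, orders).parallel()
                .mapToDouble(PizzaThroughput::deliver).sum();
        long parallelNs = System.nanoTime() - t0;

        System.out.printf("Sequential: %,d ns%n", sequentialNs);
        System.out.printf("Parallel:   %,d ns%n", parallelNs);
        // The totals agree up to floating-point rounding; only wall-clock time changes.
        System.out.printf("Difference in totals: %.2e%n", Math.abs(seq - par));
    }
}
```

The work itself is identical in both cases; only the scheduling changes, which is exactly the CPU-versus-GPU trade-off: per-worker speed versus number of workers.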
If everything is correct, you can run the nvidia-smi command:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce MX110       On   | 00000000:01:00.0 Off |                  N/A |
    | N/A   50C    P0    N/A /  N/A |    666MiB /  2004MiB |      4%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1078      G   /usr/lib/xorg/Xorg                           302MiB |
    |    0      1979      G   /usr/bin/gnome-shell                         125MiB |
    |    0      2422      G   ...quest-channel-token=7684432457067004974   108MiB |
    |    0     19488      G   ...-token=7864D1BD51E7DFBD5D19F40F0E37669D    47MiB |
    |    0     20879      G   ...-token=8B052333281BD2F7FF0CBFF6F185BA98     1MiB |
    |    0     24967      G   ...-token=62FCB4B2D2AE1DC66B4AF1A0693122BE    40MiB |
    |    0     25379      G   ...equest-channel-token=587023958284958671    35MiB |
    +-----------------------------------------------------------------------------+

AI jobs

What is an AI job? Deep Learning? It is based on two complex mathematical operations:

- Feedforward: basically the linear combination of the weight matrices with the values in each layer, from the input to the output;
- Backpropagation: the differential calculation of the gradient of each neuron (including the bias), from the last layer back to the first, in order to adjust the weights.

Feedforward is repeated for each record in the input set, multiplied by the number of iterations or epochs we want to train, that is, many times.
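To make the feedforward step above concrete, here is a minimal sketch of a single layer in plain Java (no ND4J yet): it computes the linear combination of a weight matrix with the layer's input values, adds the bias, and applies a sigmoid activation. The class name, weights, and sizes are illustrative only.

```java
public class Feedforward {

    // Sigmoid activation, applied element-wise to each neuron's output.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One feedforward layer: out[i] = sigmoid( sum_j W[i][j] * in[j] + bias[i] ).
    // This is the "linear combination of the weight matrices" described above.
    static double[] layer(double[][] weights, double[] bias, double[] input) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double z = bias[i];
            for (int j = 0; j < input.length; j++) {
                z += weights[i][j] * input[j];
            }
            out[i] = sigmoid(z);
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny 2-input, 2-neuron layer with made-up weights.
        double[][] w = { { 0.5, -0.2 }, { 0.1, 0.4 } };
        double[] b = { 0.0, 0.1 };
        double[] x = { 1.0, 2.0 };
        double[] y = layer(w, b, x);
        System.out.printf("out = [%.4f, %.4f]%n", y[0], y[1]);
    }
}
```

Training repeats this matrix work for every record, in every layer, over many epochs, which is why it dominates the cost and why the GPU pays off.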
And Backpropagation can be done at the same frequency, or at regular intervals, depending on the learning algorithm used.

In summary: vector and differential calculations over many values simultaneously. That is why GPUs are necessary for development and training, and also for inference, depending on the complexity of the model.

Demonstration

The project for this tutorial is a Java application that performs matrix multiplication, a common operation in deep learning jobs. It multiplies the matrices only once, first on the CPU, then on the GPU (using ND4J and the CUDA Toolkit). Note that it is not even a machine learning model, just a single basic operation.

The pom.xml file configures ND4J to use the GPU through the CUDA platform:

    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-10.1</artifactId>
        <version>1.0.0-beta4</version>
    </dependency>

The main class, MatMul, is a simple application that defines two matrices and calculates their product, first on the CPU, then on the GPU, using ND4J.

I'm working with two matrices of 500 x 300 and 300 x 400, nothing much for a typical neural network. My laptop has an eighth-generation Intel i7 and an Nvidia MX110 chipset, which is very "entry level", with 256 CUDA cores and CUDA compute capability 5.0, that is, nothing much... A K80 card, for comparison, has more than 3,500 CUDA cores.

Let's see the application's output:

    CPU Interativo (nanossegundos): 111.203.589
    1759 [main] INFO org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner - Device Name: [GeForce MX110]; CC: [5.0]; Total/free memory: [2101870592]
    GPU Paralelo (nanossegundos): 9.905.426
    Percentual de tempo no cálculo da GPU com relação à CPU: 8.907469704057842

Ok, the application's output is still in Portuguese, so here is a quick translation:

- "CPU Interativo (nanossegundos)": Iterative CPU (nanoseconds);
- "GPU Paralelo (nanossegundos)": Parallel GPU (nanoseconds);
- "Percentual de tempo no cálculo da GPU com relação à CPU": GPU time as a percentage of CPU time: 8.9%.

Conclusion

Even using an entry-level GPU like mine, the matrix product run on the GPU took only 8.9% of the time it took on the CPU. An abysmal difference. Check it out:

- CPU time: 111,203,589 nanoseconds;
- GPU time: 9,905,426 nanoseconds.

Considering that the matrix product is only ONE operation, and that feedforward involves this operation thousands of times, it is reasonable to believe that this difference would be much greater if we were really training a neural network.

And there is no point in clustering or RDMA, because nothing, NOTHING, is able to match the performance of a single GPU.

Well, I hope I have demonstrated two things here: the GPU is essential, and we can use it directly from a Java application. If you want, you can even convert that MLP model we made to run on the GPU and impress your boss (or boyfriend/girlfriend).

Cleuton Sampaio, M.Sc.

Also available in Spanish.