As technology advances and more organizations implement machine learning operations (MLOps), people are looking for ways to speed up processes. This is especially true for organizations with deep learning (DL) workloads, which can take an extremely long time to run. You can speed up these processes by using graphical processing units (GPUs) on-premises or in the cloud.
GPUs are specialized processors designed for highly parallel workloads. They enable parallel processing of tasks and can be optimized to increase performance in artificial intelligence and deep learning processes.
A GPU is a powerful tool for speeding up a data pipeline that includes a deep neural network (DNN). The first reason to use a GPU is that DNN inference runs 3-4 times faster on a GPU than on a central processing unit (CPU) at the same price point. The second reason is that taking some of the load off the CPU lets you do more work on the same instance and reduces overall network load.
A typical deep learning pipeline with a GPU consists of:
Data preprocessing (CPU)
DNN execution: training or inference (GPU)
Data post-processing (CPU)
Data transfer between CPU RAM and GPU RAM is the most common bottleneck. Therefore, there are two main aims when designing the data science pipeline architecture. The first is to reduce the number of data transfers by aggregating several samples (for example, images) into a batch. The second is to reduce the size of each sample by filtering the data before transferring it.
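As a minimal sketch of this CPU-GPU-CPU pipeline (the dataset and model below are placeholders, not part of any real project), batching in PyTorch might look like this:

# Minimal sketch of the CPU -> GPU -> CPU pipeline described above (PyTorch).
# The dataset and model are placeholders; pinned memory and non-blocking copies
# help hide the cost of moving batches between CPU RAM and GPU RAM.
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset = TensorDataset(torch.randn(1024, 3, 224, 224))       # stand-in for preprocessed images
loader = DataLoader(dataset, batch_size=64, pin_memory=True)  # aggregate samples into batches

model = torch.nn.Conv2d(3, 8, kernel_size=3).to(device).eval()  # stand-in for a real DNN

with torch.no_grad():
    for (batch,) in loader:
        batch = batch.to(device, non_blocking=True)  # one transfer per batch, not per sample
        output = model(batch)                        # DNN inference on the GPU
        result = output.cpu()                        # back to CPU RAM for post-processing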
Training and implementing DL models requires deep neural networks (DNNs) and datasets with hundreds of thousands of data points. These networks require significant resources, including memory, storage, and processing power. While central processing units (CPUs) can provide this power, graphical processing units (GPUs) can substantially speed up the process.
GPUs are much faster than CPUs, and for many AI applications they are a must-have. In some cases, however, a GPU is overkill, and you can at least temporarily use CPUs to save your budget.
A few words need to be said about the cost of GPU computation. As mentioned before, GPUs are significantly faster than CPUs, but the cost of GPU compute may outweigh the speedup you gain by switching to a GPU.
So, at the beginning of development, for example, while building a proof of concept (PoC) or a minimum viable product (MVP), you can use CPUs for development and staging servers. If your users can tolerate longer response times, you can even use CPUs for production servers, but only for a short period of time.
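A minimal sketch of this idea, using PyTorch's standard device-selection pattern so that the same code runs on a CPU-only staging server and automatically switches to the GPU once one is available:

# Minimal sketch: the same code path runs on a CPU-only PoC/staging server
# and automatically uses the GPU once one is available.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(32, 2).to(device)  # placeholder model
x = torch.randn(8, 32, device=device)      # placeholder input
print(device, model(x).shape)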
When using GPUs for on-premises implementations, you have multiple vendor options. Two popular choices are NVIDIA and AMD.
NVIDIA is a popular option, at least in part because of the libraries it provides, known as the CUDA toolkit. These libraries make it straightforward to set up deep learning workflows and have fostered a strong machine learning community around NVIDIA products. This can be seen in the widespread support that many DL libraries and frameworks provide for NVIDIA hardware.
In addition to GPUs, the company also offers libraries supporting popular DL frameworks, including PyTorch. The Apex library, in particular, is useful and includes several fused, fast optimizers, such as FusedAdam.
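As a hedged sketch (assuming NVIDIA Apex is installed alongside PyTorch and a CUDA GPU is available; the toy model and data are placeholders), a training step with FusedAdam might look like this:

# Minimal sketch assuming NVIDIA Apex (https://github.com/NVIDIA/apex) is installed
# and a CUDA-capable GPU is available; FusedAdam works only with GPU tensors.
import torch
import torch.nn as nn
from apex.optimizers import FusedAdam

model = nn.Linear(256, 10).cuda()                    # placeholder model on the GPU
optimizer = FusedAdam(model.parameters(), lr=1e-3)   # fused CUDA implementation of Adam

x = torch.randn(64, 256, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()   # the fused CUDA kernel performs the Adam update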
The downside of NVIDIA is that it has recently placed licensing restrictions on how its software can be used. These restrictions effectively require that data center deployments use Tesla GPUs rather than the less expensive RTX or GTX hardware.
This has serious budget implications for organizations training DL models. It is especially problematic considering that Tesla GPUs do not offer significantly more performance than the consumer options, yet cost up to 10x as much.
AMD provides its own set of libraries, known as ROCm. These are supported by TensorFlow and PyTorch, as well as all major network architectures. However, support for the development of new networks is limited, as is community support.
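A small generic check (standard PyTorch attributes, not AMD-specific tooling) shows which GPU stack a given PyTorch installation targets:

# Minimal sketch: check whether the installed PyTorch build targets ROCm (AMD)
# or CUDA (NVIDIA); torch.version.hip is set only in ROCm builds.
import torch

if torch.version.hip is not None:
    print("ROCm/HIP build:", torch.version.hip)
elif torch.version.cuda is not None:
    print("CUDA build:", torch.version.cuda)
else:
    print("CPU-only build")
print("GPU available:", torch.cuda.is_available())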
Another issue is that AMD does not invest as much in its deep learning software as NVIDIA does. Because of this, AMD GPUs provide limited functionality compared to NVIDIA's, with their lower price being the main advantage.
An option that is growing in popularity with organizations training DL models is the use of cloud resources. These resources can provide pay-for-use access to GPUs in combination with optimized machine learning services. All three major providers offer GPU resources along with a host of configuration options.
Microsoft Azure offers a variety of instance options for GPU access. These instances are optimized for compute-intensive tasks, including visualization, simulations, and deep learning.
Within Azure, there are three main series of GPU instances you can choose from: the NC, ND, and NV series.
In AWS, you can choose from four different options, each with a variety of instance sizes. Options include EC2 P3, P2, G4, and G3 instances. These options enable you to choose between NVIDIA Tesla V100, K80, T4 Tensor, or M60 GPUs. In addition, you can scale up to 16 GPUs depending on the instance.
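As a hedged sketch (the AMI ID and key pair name below are placeholders, not real values), launching a single V100-backed P3 instance with the boto3 SDK might look like this:

# Hedged sketch: launch one NVIDIA Tesla V100-backed P3 instance with boto3.
# The AMI ID and key pair name are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # placeholder: use a Deep Learning AMI for your region
    InstanceType="p3.2xlarge",         # one NVIDIA Tesla V100 GPU
    KeyName="my-key-pair",             # placeholder key pair
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])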
To enhance these instances, AWS also offers Amazon Elastic Graphics, a service that enables you to attach low-cost GPU options to your EC2 instances. This enables you to use GPUs with any compatible instance as needed. This service provides greater flexibility for your workloads. Elastic Graphics provides support for OpenGL 4.3 and can offer up to 8GB of graphics memory.
Rather than dedicated GPU instances, Google Cloud enables you to attach GPUs to your existing instances. So, for example, if you are using Google Kubernetes Engine, you can create node pools with access to a range of GPUs. These include NVIDIA Tesla K80, P100, P4, V100, and T4 GPUs.
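As a hedged illustration (the container image and names are placeholders), a pod scheduled onto such a GKE node pool could request a single GPU through the official Kubernetes Python client:

# Hedged sketch: request one GPU for a pod on a GKE node pool with GPUs attached.
# The container image and resource names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl is already configured for the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="dnn",
                image="my-registry/dnn-inference:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}          # schedules the pod onto a GPU node
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)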
Google Cloud also offers the tensor processing unit (TPU). This is a custom accelerator, not a GPU, built around multiple chips designed for fast matrix multiplication. It provides performance similar to Tesla V100 instances with Tensor Cores enabled. The benefit of the TPU is that it can provide cost savings through parallelization.
Each TPU device contains four chips, making it roughly the equivalent of four GPUs and enabling comparatively larger deployments. Additionally, TPUs are now at least partially supported by PyTorch through the torch_xla package.
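A minimal sketch (assuming the torch_xla package is installed on a Cloud TPU VM; the model and data are placeholders) of a single training step on a TPU core:

# Minimal sketch assuming the torch_xla package is installed on a Cloud TPU VM.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()               # default TPU core exposed as a torch device
model = nn.Linear(128, 10).to(device)  # placeholder model placed on the TPU
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
xm.optimizer_step(optimizer, barrier=True)  # steps the optimizer and triggers TPU execution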
When choosing your infrastructure, you need to decide between an on-premises and a cloud approach. Cloud resources can significantly lower the financial barrier to building a DL infrastructure.
These services can also provide scalability and provider support. However, these infrastructures are best for short-term projects since consistent resource use can cause costs to balloon.
In contrast, on-premises infrastructures are more expensive upfront but provide you with greater flexibility. You can use your hardware for as many experiments as you want over as long a period as you want with stable costs. You also retain full control over your configurations, security, and data.
For organizations that are just getting started, cloud infrastructures make more sense. These deployments enable you to start running with minimal upfront investment and give you time to refine your processes and requirements. However, once your operation grows large enough, switching to on-premises could be the better choice.
Our AI team has significant computational resources at its disposal, such as a set of V100 GPUs. All of these GPUs are accessible via our internal computation service.
The computation service is just a computer with a lot of disk space, RAM, and GPUs installed on it and running Linux. We use the service for training AI solutions and for research purposes.
Deep learning frameworks like TensorFlow, PyTorch, or ONNX can't directly access GPU cores to solve deep learning problems. Between an AI application and the GPU there are several complex layers of specialized software, such as CUDA and the GPU drivers.
In an oversimplified schema, it can be shown as follows:
This scheme seems legitimate and robust, even when a whole team of AI engineers shares a computation service built like that.
But in real-life AI software development, new versions of AI applications, AI frameworks, CUDA, and GPU drivers keep emerging, and new versions are often incompatible with old ones. For example, suppose we have to use a new version of an AI framework that is not compatible with the version of CUDA currently installed on our computation service. What should we do in such a situation? Should we update CUDA?
We definitely can't, because other AI engineers require the old version of CUDA for their projects. That's the problem.
So, the problem is that we can't have two different versions of CUDA installed on our computation service, just as on any other system. What if there were a trick that could isolate our applications from each other, so that they never interfere with one another and don't even know about each other's existence? Thankfully, we do have such a trick nowadays: containerization with Docker.
We use Docker and nvidia-docker to package AI applications together with all necessary dependencies, such as the AI framework and CUDA, at the required versions. Only the GPU driver stays on the host; each container ships its own CUDA runtime. This approach allows us to maintain different versions of TensorFlow, PyTorch, and CUDA on the same computation service machine.
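A minimal sketch of a sanity check that can be run inside each container (a generic snippet, not a MobiDev-specific script) to confirm which framework and CUDA runtime that container actually sees:

# Minimal sketch: run inside a container to confirm which PyTorch version and
# CUDA runtime the container sees, independently of other containers on the host.
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime used by PyTorch:", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))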
A simple schema of AI solution dockerization is shown below:
Machine learning workloads require substantial processing capability to advance quickly. Compared to CPUs, GPUs provide more processing power, higher memory bandwidth, and a far greater capacity for parallelism.
You can use GPUs on-premises or in the cloud. Popular on-premises GPU vendors include NVIDIA and AMD. Cloud-based GPUs are offered by many vendors, including the top three: Azure, AWS, and Google Cloud. When choosing between on-premises and cloud GPU resources, you should consider both budget and skills.
On-premises resources typically come with a high upfront cost, but that cost stabilizes in the long term. However, if you do not have the necessary skills to operate on-premises resources, you should consider cloud offerings, which can be easier to scale and often come with managed options.
Written by Evgeniy Krasnokutsky, AI/ML Solution Architect at MobiDev.