GPU databases are the hottest new thing in the database world, and arguably the most innovative trend since Hadoop showed up over 5 years ago.
Below is a guide on how to find the right GPU database for you.
To put it simply, a GPU database is a database, relational or non-relational, that uses a GPU (graphics processing unit) to perform some database operations. As a result, GPU databases are typically fast, and they are often more flexible in processing many different types of data, or much larger amounts of data.
Most of the GPU databases tend to focus on analytics, and they’re offering it to a market that was oversold on Hadoop for Big Data analytics.
The truth is Hadoop was not designed for relational data analytics, and yet the widespread use of Hadoop led to the creation of many projects that aimed to do exactly that, with mixed results. Hadoop is now past its hype cycle and is entering the “Plateau of Productivity” stage.
GPUs and CPUs can cooperate quite nicely when each is used for its unique benefits, but this is not just for performance’s sake.
In reality, the different architecture of the GPU makes it unsuitable for certain operations, like text manipulations whose code path depends on the content. Because the (thousands of) GPU cores like to work the same way, content that causes the code to behave differently from one thread to the next will cause a significant drop in performance: execution becomes partially sequential rather than strictly parallel (what Nvidia’s CUDA calls branch divergence).
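To make the cost of branch divergence concrete, here is a minimal sketch in plain Python (not actual GPU code). It assumes a simplified model in which a group of 32 threads (a warp, on Nvidia hardware) runs in lockstep and has to execute every distinct branch that any of its threads takes. The branch names and cycle costs are hypothetical, chosen only for illustration.

```python
# Toy model of lockstep execution: when threads in a warp take different
# branches, the warp runs each taken branch one after the other.
WARP_SIZE = 32  # warp width on Nvidia GPUs


def warp_cost(branch_ids, branch_costs):
    """Cost of one warp: sum of the costs of every distinct branch taken."""
    return sum(branch_costs[b] for b in set(branch_ids))


def total_cost(branch_ids, branch_costs, warp_size=WARP_SIZE):
    """Split per-thread branch choices into warps and add up their costs."""
    return sum(
        warp_cost(branch_ids[i:i + warp_size], branch_costs)
        for i in range(0, len(branch_ids), warp_size)
    )


# Hypothetical costs (arbitrary cycles) for three text-handling code paths.
BRANCH_COSTS = {"ascii": 10, "utf8": 25, "escape": 40}

# 64 rows whose content all takes the same path: no divergence.
uniform = ["ascii"] * 64
# 64 rows whose content alternates between paths: each warp sees all three.
mixed = ["ascii", "utf8", "escape"] * 21 + ["ascii"]

uniform_cycles = total_cost(uniform, BRANCH_COSTS)
mixed_cycles = total_cost(mixed, BRANCH_COSTS)
print(uniform_cycles, mixed_cycles)  # prints: 20 150
```

In this toy model the mixed input costs several times more than the uniform one, even though both contain the same number of rows. Real hardware is more nuanced, but the shape of the penalty is the same.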
A GPU excels at performing relatively simple, repetitive operations on large amounts of data in many streams. It just won’t perform very well if the branches diverge, if the operations launch too few streams (kernels), or if the operations can’t be parallelized.
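The last point, operations that can’t be parallelized, can be quantified with Amdahl’s law, which is not mentioned in the original text but is the standard way to reason about it. The sketch below assumes a workload where a fixed fraction of the work is perfectly parallel and the rest is strictly sequential; the function name and the example numbers are illustrative only.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work can run in parallel.

    parallel_fraction: share of the work (0.0 to 1.0) that parallelizes.
    workers: number of parallel execution units (e.g. GPU cores).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Even with 1,000 cores, a half-sequential operation barely doubles in speed,
# while a 99% parallel one gets roughly a 91x speedup.
for fraction in (0.5, 0.99, 1.0):
    print(fraction, round(amdahl_speedup(fraction, 1000), 2))
```

This is why a GPU database typically leaves hard-to-parallelize work to the CPU and keeps the wide, uniform operations on the GPU.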
One of the reasons Nvidia GPUs are so popular among developers is that packages like Thrust, CUB and Modern GPU have made writing performant, parallel GPU code quite easy, in contrast with optimized, vectorized, efficiently threaded CPU code.
In general, GPUs can be used at a variety of different stages in the analytics pipeline. They can power the main database, part of the processing pipeline, or just the work done on the resulting analytic dataset (for example, with popular frameworks like TensorFlow).
Let’s take a look at two of the main areas where GPUs can help in the analytics pipeline.
New stream processing solutions, like FASTDATA.io’s Plasma Engine, can take advantage of GPUs to process streams of data coming in and out of databases (GPU or not). This tool can be used to perform analysis and/or transformation of streaming data on the GPU.
The main competitor to FASTDATA’s engine is GPU-enabled Spark, which is available as an open-source add-on from IBM.
With the exception of PG-Strom and Brytlyt, which retrofit the open-source Postgres RDBMS by augmenting it with GPU-aware parts, all other GPU databases are purpose-built for analytics.
Blazegraph is another exception, because it is designed for graph operations.
This leaves us with four players, dealing mostly with relational, structured analytics, with an SQL interface.
The chart below should help you understand which of these GPU databases is right for you: