Share

Linkedin iconFacebook iconTwitter icon

More deployment options

Throughout my career, database workloads have almost exclusively been run on CPUs on traditional servers in the cloud or on premises. Although the research community has been publishing papers on running database workloads on GPUs for several decades, this work has been mostly at the fringes of the database community and used for niche applications in the industry.

Recently, however, GPU vendors such as NVIDIA have been marketing their chips specifically for use for database workloads, and several leading data processing vendors — including Starburst — have announced support for GPU environments along with significant performance improvements

In truth, GPUs have in fact emerged as a legitimate tool for accelerating certain types of SQL processing. However, the expected performance improvement is very much dependent on the specifics of the SQL queries and the datasets being processed. For some workloads, the gain is enormous. For other workloads, the gain is nonexistent or even negative (to the point where performance decreases when using GPUs). 

This post is designed to give the reader a brief tutorial on which database workloads are good fits for GPUs. The intention is that this information can be used as a reference for deciding whether upgrading the hardware to include GPUs would be expected to yield improved SQL performance. We will start by explaining what GPUs are and what they are designed for. Once we understand how they work, it will become easier to predict what workloads they will perform well on. We will also give some advice on what a user can do to accelerate an existing GPU database deployment. 

What are GPUs good at?

Early days

Historically, GPUs were designed for the computational requirements of graphics rendering use cases. The process of rendering 3D scenes—such as calculating the color and position of millions of individual pixels and vertices—requires performing the same mathematical operations (typically floating-point operations) on large datasets simultaneously. For example, transforming a 3D model into a 2D image requires applying projection matrices to every single vertex in the scene. A CPU, which is designed for complex, sequential tasks and branching logic, would have to process these vertices one by one (or in small batches), which is extremely inefficient and unable to meet the high-throughput, low-latency requirements of real-time graphics.

To solve this, hardware engineers developed GPUs. Instead of a few powerful cores with complex control logic (like a CPU), GPUs were built with thousands of smaller, simpler cores designed to execute the same instruction on different pieces of data simultaneously (SIMD – Single Instruction, Multiple Data). Early GPUs were often “fixed-function,” meaning their pipelines were hardwired specifically to perform graphics math (like rasterization and texture mapping). This specialization allowed them to achieve orders of magnitude higher throughput for these specific tasks compared to general-purpose CPUs.

In other words, in the early days of GPUs, they were almost entirely unusable for SQL workloads. They were wired specifically to perform mathematical functions relevant for an entirely different use case. Although there were some efforts to perform some database operations (especially spatial operators) via the rasterization capabilities of the hardware via languages such as OpenGL, the code was complex and limited to narrow sets of SQL operators. 

Turn of the century

Everything started to change with Ian Buck’s PhD thesis work on Brook in 2004 and the subsequent release of CUDA from NVIDIA in 2007. All of a sudden, it became possible to program GPUs for general-purpose tasks. This advancement opened the door to a new era of high-performance computing, where developers could offload compute-intensive and highly parallelizable tasks—such as complex scientific simulations, financial modeling, and eventually, the training of large-scale machine learning models—to the GPU. By leveraging the thousands of small, specialized cores, programmers could solve massive problems in parallel that were previously bottlenecked by the sequential execution of traditional CPUs.

The exact details on what had to be changed in the hardware architecture and in the programming language to facilitate this translation to general-purpose processors is beyond the scope of this post. The final result is what is relevant to us — ultimately GPUs became a tool for performing any operation in a massively parallel fashion. Any time you wanted to perform the same operation on many different data items in parallel, the GPU became a great solution.   

SQL, by its nature, does exactly this. Take, for example, the following query:

SELECT name, price * 1.05 
FROM products 
WHERE type = 'electronics'; 

The above query performs some processing on a table called products. This table may have millions of rows. The query requests to perform five operations per row:

  1. Extract the name of the product
  2. Extract the price of the product
  3. Extract the type of the product
  4. Check to see if the extracted type is “electronics”. (if not, the whole row can be ignored)
  5. Multiply the extracted price by 1.05 (e.g.  to show the price after sales tax)

These five operations can be done on each of the million products in the products table entirely independently. In theory, each GPU core can work separately performing these 5 operations on the row(s) assigned to it, in an embarrassingly parallel fashion.

In other words, SQL is seemingly a perfect use case for general-purpose GPU processing.

The database community indeed immediately jumped on this development, with SIGMOD and VLDB (the top two database systems publication venues) publishing papers on using CUDA for advanced database operators such as joins almost as soon as CUDA became available.  

The catch

There was, however, a major issue that prevented GPUs from wide-scale deployment for SQL use cases. CPU processing speed is often not the bottleneck that limits SQL query performance. For example, the SQL query above indeed does perform 5 operations per row, but these five operations are quite simple. Even in a theoretical world where these operations could be processed a billion times faster, end-to-end performance will not improve if there is a different bottleneck in the system.

The bottleneck in database systems is usually not the speed at which the processor processes data but rather the speed at which data can arrive at the processor to be processed. Even two decades ago, CPUs were already capable of processing multiple gigabytes per second. However, when the raw table being processed was stored in a single magnetic disk (HDD), the fastest that data could get to the CPU was approximately 100 megabytes per second (and have only increased to a maximum of approximately 300 megabytes per second today). It did not matter how fast the CPU was, or whether GPUs were being used; it took 10 seconds to process a gigabyte of HDD data. 

In most database textbooks, when they describe query operators and discuss their costs (for the purposes of query optimization), they focus almost exclusively on disk IOs. For example, they compare how join algorithm 1 requires X reads while join algorithm 2 requires Y reads. They state that since X > Y, algorithm 2 should be used. These textbooks ignore processing costs almost entirely. Even if algorithm 2 requires a ton more processing, it is deemed preferable if it has fewer disk reads.   

For this reason, the excitement at the possibility of using GPUs for SQL operations has historically been muted (until recently). Yes, they have the potential to improve processing performance, but only for situations where disk IO is not a bottleneck. 

The real advantage of GPUs: memory bandwidth

There is another key feature of GPU hardware — and this is the one that is enabling the current breakout of GPUs for SQL workloads memory bandwidth.

Graphics processing requires moving vast amounts of texture and geometry data into the processing cores. GPUs were designed with high-bandwidth, dedicated on-device memory to keep these thousands of cores fed with data, avoiding the memory bottlenecks inherent in general-purpose CPU architectures.

For example, the NVIDIA Blackwell GPU can provide an astonishing 8 terabytes of data per second from its memory to the processing cores. This is approximately 40 times faster than high-end server CPUs can get data from memory to its processing cores (around 200GB/sec).

This means that if the data to be processed is already sitting in the GPU memory, Blackwell totally eliminates the SQL processing bottleneck we described above (ie., getting the data to the processing cores to be processed). That same gigabyte HDD dataset that took 10 seconds to be processed in the example above, would take less than a millisecond on Blackwell if it was already sitting inside the GPU memory. 

This is the kind of performance improvement that can totally revolutionize an industry.

But wait, there’s another catch

The above mentioned 8 terabytes per second of data processing speed is only achievable if the data to be processed is already sitting inside the GPU memory. If it is not already there, it needs to get there first before it can be processed. To get there, data is typically transferred over a PCIe connection that typically is limited to hundreds of gigabytes per second at the high end (128 GB/second on Blackwell). 

In short, if data is arriving from HDD (or even SSD) disk, performance will often be limited by the time it takes to read data from disk (hundreds of MBs or single digit GBs per second). If it is already in CPU memory (but not GPU memory), disk IO is not necessary, but performance will still be limited by the time it takes to get that data to the GPU over the PCIe connection (3 digit GBs per second). If it is already in memory on the GPU, performance is in the TBs per second.  

The bottom line is that there are four tiers of performance one can expect from using GPUs to process SQL queries.

Tier 1: The data to be processed is sitting on a single HDD. The bottleneck will usually be the time to read data from the disk. For this scenario, it is futile to use GPUs. They will not make a difference in processing this data set. It is expected that it will take around 10 seconds to process a 1GB dataset.

Tier 2: The data to be processed is sitting on faster storage — such a large array of HDDs or SDDs (locally or across a fast network connection). Such “fast” storage can often read multiple gigabytes of data per second. However, if this number is less than the rate that the CPU can process data (which is dependent on the CPU, the database system, and the SQL query, but can reach 20GB per second in some cases but is often closer to 1-5GB/sec), there is still no benefit from using GPUs. Simply allowing the CPU to process data as it is read from storage will yield approximately equivalent performance.  It is expected that it will take around 50-700 milliseconds to process a 1GB dataset.

Tier 3: The data to be processed is still on high-end fast storage or RAM memory that is capable of delivering data at a bandwidth higher than the aforementioned rate that the database system CPUs can process data. In this case, using GPUs is going to yield a performance improvement. However, this improvement will be limited by the 128GB/sec that data can be sent over the PCIe connection to the GPU. In other words, one should not expect more than a factor of ~6-20 performance improvement over processing data on the CPU. Thus, it is expected that it will take around 10-100 milliseconds to process a 1GB dataset.

Tier 4: The data to be processed is already sitting in GPU memory. The enormous performance improvements described above are now possible. It is expected that it will take less than a millisecond to process a 1GB dataset.

What can the end-user do to get the most impact from GPUs?

From the above discussion, it is clear that the most important factor in maximizing impact  from the availability of GPUs to perform SQL processing is to make sure that the storage layer is able to deliver data to the SQL engine on the order of 10s of gigabytes per second. This means, increasing the RAM memory on the servers running the database system so that more data sits in memory, and upgrading the network and storage hardware to make sure they can deliver data to the system at the required scale.

There are some other factors that also need to be taken under consideration.

Query operators

Most SQL processing engines that work with GPUs have multiple sets of database operator code. They have the operator implementations that run on the CPU, and then they have the operator implementations that run the GPU, written in the GPU-specific programming language. Most engines only implement a subset of operators for GPU processing (specifically the ones that are expected to benefit from the GPU) and leave the rest to be performed on the CPU.

For example, Starburst currently supports scans, filters, projections, joins, data exchanges, and certain types of aggregations on the GPU. Sometimes rewriting a query in such a way that the query optimizer will produce plans that use operators implemented on the GPU will enable the performance benefits of GPUs to be realized. 

Column types

What I said above about query operators also applies to data types of columns being accessed in a query. Only those types supported by the GPU operators will be able to run there.

For example, Starburst’s GPU acceleration currently supports the most commonly used types: integers of all sizes (TINYINT, SMALLINT, INTEGER, BIGINT), floating-point numbers (REAL, DOUBLE), BOOLEAN, DECIMAL at all precisions, VARCHAR and CHAR strings, VARBINARY, DATE, and TIMESTAMP up to nanosecond precision. 

Data partitioning and sorting

For tier 3 scenarios described above, where the PCIe connection to the GPU is the bottleneck, it is important to send as little data as possible over this connection. If data is partitioned or sorted by attributes appearing in the WHERE clause of SQL queries, in many cases it is possible to perform an lightweight initial filter of data prior to being sent over this PCIe connection. This reduces the impact of the PCIe bottleneck, and can significantly improve performance.

Conclusion

Maximizing the performance improvements of using GPU hardware for SQL workloads is a shared task between the SQL vendor and the user. Today, many vendors still do not offer GPU support. Even the ones that do, the quality of the implementation may vary.

The SQL vendor is tasked with more than simply supporting the GPU. They must provide seamless usage of the GPU so that the user does not have to be aware of where data is processed — the only difference should be the performance numbers. Operators that do not run on the GPU need to fall-back to the CPU without user intervention. And the vendor is in charge of managing memory on the GPU to push more workloads from tier 3 to tier 4. This memory-management task is something that I’m personally focused on both in my academic research and in my discussions with the technical team at Starburst as part of my duties as a technical advisor.  

Meanwhile, the user is tasked with setting up the storage environment to ensure adequate data throughput to the SQL engine. Furthermore, they usually have full control over how data is partitioned and sorted that can alleviate the PCIe bottleneck for tier 3 scenarios. 

When the user and vendor both do their jobs well, the performance improvements from using GPUs can be enormous. 

 

 

Start for Free with Starburst Galaxy

Try our free trial today and see how you can improve your data performance.
Start Free