
Executive Summary
Starburst is actively integrating NVIDIA cuDF, an open-source GPU DataFrame library, directly into its SQL query engine, enabling GPU-accelerated query processing without changes to SQL or application code. This paper describes the architecture, the current state of development, and benchmark results across two industry-standard workloads.
Early benchmark results:
- TPC-H: up to 6x speedup across accelerated queries at production scale (4.6x geometric mean over the 18 queries tested). Individual queries up to 11.4x faster.
- ClickBench: 3.4x overall GPU speedup across 43 queries, with individual queries up to 9x faster.
- Price/performance: GPU instances deliver 3–4x more analytical throughput per dollar compared to CPU-only instances at comparable cloud pricing.
This is active development. The results reported here reflect the current state of the integration. Additional operator coverage and architectural improvements are underway and are expected to improve these numbers.
Introduction
Modern analytical workloads (terabyte-scale joins, complex aggregations, concurrent queries from AI pipelines) are pushing CPU-based query engines toward their throughput limits. Vectorized execution and SIMD acceleration help, but the fundamental constraint is raw compute and memory bandwidth.
GPUs offer a different compute model: thousands of CUDA cores, high-bandwidth on-device memory, and architectures purpose-built for massively parallel data transformation. The challenge has been bridging GPU capability to SQL execution without rebuilding the query engine.
Starburst’s approach integrates NVIDIA cuDF at the physical operator level inside the Trino-based query engine. Trino is an independent open-source query engine governed by its own foundation. Starburst ships a proprietary distribution of Trino; the GPU integration described here is developed and maintained within Starburst’s distribution and is not part of open-source Trino.
Why Query Acceleration Matters for the Agentic Era
Agentic AI systems execute sequences of tool calls to retrieve, transform, and reason over data. Unlike a single analyst query, an agent may issue dozens of queries within one task, each blocking the next step until the model receives a response. Query latency compounds directly into total agent response time.
At ten sequential queries averaging 2 seconds each, the agent loop carries 20 seconds of query wait time per task. At 4.6x average GPU acceleration, that drops to roughly 4.3 seconds (20 / 4.6). For production systems handling many concurrent agent sessions, this difference determines whether the system is responsive or impractical.
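The arithmetic above can be reproduced with a short sketch. It assumes every query in the loop is sped up by the same factor, which is a simplification: real per-query speedups vary, as the benchmark tables show. The function name `agent_query_wait` is illustrative only.

```python
def agent_query_wait(num_queries: int, avg_latency_s: float, speedup: float = 1.0) -> float:
    """Total query wait time when queries run strictly one after another.

    Assumes a uniform speedup for every query (a simplification).
    """
    return num_queries * avg_latency_s / speedup


cpu_wait = agent_query_wait(10, 2.0)               # ten queries, 2 s each
gpu_wait = agent_query_wait(10, 2.0, speedup=4.6)  # same loop at 4.6x

print(f"CPU: {cpu_wait:.1f} s, GPU: {gpu_wait:.1f} s")
```

Because the queries are sequential, the saving scales linearly with the number of queries an agent issues per task.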
Beyond latency, several workload categories that appear inside agentic pipelines are naturally GPU-bound:
- Data preparation for AI: aggregations, joins, and transformations over large datasets that feed model inputs are the same operations that GPUs already accelerate in analytical SQL.
- Text and pattern operations at scale: regular expression matching, string extraction, and entity filtering over large corpora. ClickBench results on regexp queries (4x speedup) are directly applicable to these workloads.
- Concurrent query density: agents operating in parallel fire independent query streams. GPUs handle high-concurrency analytical compute more efficiently than CPUs at scale.
- Shared infrastructure: the same GPU instance that accelerates SQL can serve AI inference workloads, reducing the number of distinct infrastructure tiers an organization must operate.
GPU-accelerated SQL is more than faster analytics. For organizations building agentic systems on a data layer, it reduces a compounding latency bottleneck that is difficult to remove through further CPU-side optimization alone.
Starburst and NVIDIA: Joint Development
This work is the result of a technical collaboration between Starburst and NVIDIA engineering teams. NVIDIA contributed deep expertise in cuDF, GPU memory management, and optimization guidance specific to Blackwell architecture. Starburst drove the integration architecture, operator-level implementation within the Starburst query engine, and benchmark methodology.
The collaboration reflects a shared goal: making GPU-accelerated SQL a production capability for enterprise data platforms, not a research prototype. Both teams continue to work jointly on the roadmap, including GPU RemoteExchange, expanded operator coverage, and Blackwell-specific optimizations.
Architecture: GPU Acceleration at the Physical Operator Level
Rather than offloading entire queries or rewriting the query engine, Starburst targets individual physical operators, the lowest-level execution primitives in the query plan. NVIDIA cuDF is integrated directly at this layer.
How It Works
The Starburst query engine compiles SQL into a physical plan composed of execution operators: Table Scan, Filter, Aggregation, Hash Join, TopN, and others. At execution time, each operator checks whether its input data and operation are GPU-capable. If so, it hands off to cuDF; if not, it falls back to CPU transparently. No query fails due to incomplete GPU coverage.
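The per-operator decision described above can be sketched as follows. This is a minimal illustration, not Starburst's implementation: the capability sets `GPU_OPERATORS` and `GPU_TYPES`, the `Operator` class, the operator name `UnsupportedOp`, and the function `choose_device` are all hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical capability tables. The engine's real checks are not
# described in this paper; these sets exist only for illustration.
GPU_OPERATORS = {"TableScan", "Filter", "Aggregation", "HashJoin", "TopN"}
GPU_TYPES = {"bigint", "double", "decimal", "varchar", "date"}


@dataclass(frozen=True)
class Operator:
    name: str
    input_types: tuple


def choose_device(op: Operator) -> str:
    """Return "gpu" only if the operator and all of its input types are supported."""
    if op.name in GPU_OPERATORS and all(t in GPU_TYPES for t in op.input_types):
        return "gpu"
    return "cpu"  # transparent fallback: the operator still runs, just on CPU


plan = [
    Operator("TableScan", ("bigint", "varchar")),
    Operator("Filter", ("bigint", "varchar")),
    Operator("UnsupportedOp", ("bigint",)),  # hypothetical operator without GPU support
    Operator("Aggregation", ("bigint",)),
]

placement = [choose_device(op) for op in plan]
print(placement)
```

The key property is that the fallback is a placement decision, not an error path: an unsupported operator changes where the work runs, never whether the query completes.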
When consecutive operators are both GPU-capable, data is passed between them as a reference to GPU memory. It never leaves the GPU. This eliminates two categories of overhead: the PCIe transfer between host and device memory, and the marshalling cost of converting between Starburst’s in-memory format and the columnar layout required by cuDF. Both overheads are significant; eliminating them together is what makes operator chaining on GPU compound in value.
The operator-level approach carries two distinct architectural advantages:
- Flexibility: operators compose freely in any combination. A Scan → Filter → Hash Join → Aggregation pipeline uses the same GPU building blocks as Scan → Aggregation. Any combination the planner produces is handled with no bespoke code paths.
- Performance: each additional GPU-capable operator in a pipeline eliminates another round of data movement and format conversion. As operator coverage grows, acceleration compounds: each new operator improves performance across all queries that contain it.
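To make the compounding effect concrete, the sketch below counts host/device boundary crossings for a linear pipeline. It assumes, purely for illustration, that data starts in host memory and that results return to host memory, so every switch between CPU and GPU placement costs one transfer plus one format conversion. The function name `host_device_transfers` is hypothetical.

```python
def host_device_transfers(placement):
    """Count CPU<->GPU boundary crossings in a linear operator pipeline.

    Assumption: input starts on the host and the final result returns
    to the host, so the pipeline is bracketed by "cpu" on both ends.
    """
    transfers = 0
    current = "cpu"
    for device in list(placement) + ["cpu"]:
        if device != current:
            transfers += 1
            current = device
    return transfers


# One CPU-only operator in the middle of the pipeline forces two extra crossings.
with_gap = host_device_transfers(["gpu", "cpu", "gpu"])
all_gpu = host_device_transfers(["gpu", "gpu", "gpu"])
print(with_gap, all_gpu)
```

Under these assumptions, moving a single mid-pipeline operator to the GPU removes two crossings, which is why coverage gains benefit every query containing that operator.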
GPU Operators — Complete
The following operators are complete and integrated into Starburst’s query engine:
| Operator | SQL Coverage |
| Table Scan (Parquet) | FROM clause, Parquet reads |
| Filter | WHERE, HAVING, predicates (=, !=, >, <, BETWEEN, LIKE, IN) |
| Aggregation | GROUP BY, COUNT, SUM, AVG, MIN, MAX, GROUPING SETS, ROLLUP |
| Scan + Filter (fused) | Eliminates intermediate materialization between scan and filter |
| TopN | ORDER BY + LIMIT |
| Join | INNER, LEFT/RIGHT/FULL OUTER JOIN, CROSS JOIN |
| REGEXP_REPLACE | String transformation functions |
| Key SQL functions | length, IF, date/time extractions |
A single operator implementation covers an entire category of SQL syntax. GPU Aggregation accelerates GROUP BY, COUNT, SUM, AVG, MIN, MAX, GROUPING SETS, and ROLLUP simultaneously, because all compile to the same AggregationOperator in the physical plan. Hash Join covers all join types in one implementation.
Benchmark Setup
Hardware
TPC-H benchmarks were conducted on AWS g7e.4xlarge instances (NVIDIA Blackwell):
| Component | Specification |
| GPU | NVIDIA GB202 (Blackwell), 96 GB GPU memory |
| CPU | Intel Xeon (Emerald Rapids), 16 vCPU |
| RAM | 128 GB |
| Network | 50 Gbps |
ClickBench benchmarks were conducted on AWS g6.4xlarge instances (NVIDIA L4):
| Component | Specification |
| GPU | NVIDIA L4 (Ada Lovelace), 24 GB GDDR6 |
| CPU | AMD EPYC, 16 vCPU |
| RAM | 64 GB |
| Network | 25 Gbps |
| On-demand cost | $1.323/hr (US East) |
Workloads
TPC-H
A 22-query supply-chain analytics benchmark. TPC-H exercises multi-table joins, complex aggregations, sorting, and subqueries across an 8-table schema. It is one of the most demanding benchmarks for GPU-accelerated databases due to its join-heavy profile.
ClickBench
A 43-query web analytics benchmark derived from a real-world production schema. Queries cover filters, string matching (LIKE, regexp), aggregations, and projections over a wide-column, single-table dataset. The profile is well-suited to GPU memory bandwidth and vectorized compute.
Methodology
- Multiple warmup runs followed by measured runs; mean reported per query
- Single-node execution
- Speedup reported as the geometric mean of per-query ratios, the appropriate average for multiplicative comparisons and the one used by SPEC and TPC
- CPU baseline uses the CPU side of the same instance for consistent hardware conditions
Note: Results are from a development environment on single-node configurations. Production performance will vary by workload and deployment.
TPC-H Results
TPC-H is historically one of the most challenging benchmarks for GPU-accelerated databases. Its multi-table join complexity and mixed operator profile push every layer of the execution engine.
Summary
| Metric | Result |
| Geometric mean speedup (accelerated queries) | 4.6x |
| Max single-query speedup | 11.4x (Q13) |
| Queries exceeding 6x speedup | 5 of 18 tested |
| Queries exceeding 4x speedup | 13 of 18 tested |
| Hardware | NVIDIA GB202 Blackwell, 96 GB GPU memory |
Top Query Results
| Query | CPU (ms) | GPU (ms) | Speedup |
| Q13 | 32,827 | 2,871 | 11.4x |
| Q04 | 11,407 | 1,517 | 7.5x |
| Q01 | 11,890 | 1,610 | 7.4x |
| Q03 | 12,000 | 1,704 | 7.0x |
| Q05 | 13,146 | 2,091 | 6.3x |
| Q08 | 13,257 | 2,982 | 4.4x |
| Q09 | 61,208 | 23,999 | 2.6x |
Note: Q17 shows a GPU regression (0.3x) due to a gap in the dynamic row filtering optimization on GPU. This gap has been addressed. Q18, Q20, Q21, Q22 were excluded from this run due to memory configuration and will be included in subsequent runs.
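The Speedup column can be recomputed from the CPU and GPU timings in the table, and the same ratios show how a geometric mean is formed. The sketch below uses only the seven queries listed above, so its geometric mean is not expected to match the 4.6x figure, which covers all 18 tested queries.

```python
import math

# (CPU ms, GPU ms) copied from the Top Query Results table.
timings_ms = {
    "Q13": (32_827, 2_871),
    "Q04": (11_407, 1_517),
    "Q01": (11_890, 1_610),
    "Q03": (12_000, 1_704),
    "Q05": (13_146, 2_091),
    "Q08": (13_257, 2_982),
    "Q09": (61_208, 23_999),
}

# Per-query speedup is the ratio of CPU time to GPU time.
speedups = {q: cpu / gpu for q, (cpu, gpu) in timings_ms.items()}


def geometric_mean(values):
    """Exponential of the mean log, i.e. the n-th root of the product."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


subset_geomean = geometric_mean(speedups.values())
subset_arithmetic = sum(speedups.values()) / len(speedups)

for query, ratio in speedups.items():
    print(f"{query}: {ratio:.1f}x")
print(f"geometric mean (listed queries only): {subset_geomean:.1f}x")
print(f"arithmetic mean (listed queries only): {subset_arithmetic:.1f}x")
```

The arithmetic mean comes out higher than the geometric mean because a single large ratio such as Q13 pulls it upward, which is why the geometric mean is the more conservative summary for ratios.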
What Drove Performance
Hash Join on GPU
Hash Join was the single largest step-change in TPC-H performance. With Hash Join on GPU, data from Table Scan and Filter flows directly into the join on-device, eliminating both PCIe transfers and format conversion at the operator boundary. Critically, aggregation operators following the join can now also stay in GPU memory throughout the pipeline. The compounding effect is a property of the architecture: once multiple consecutive operators are on-device, each additional one eliminates another round of data movement for all queries that contain it.
Decimal Arithmetic Support
Queries involving decimal column arithmetic previously fell back to CPU. With decimal arithmetic now GPU-supported, a broader set of query plans execute entirely on-device.
Remote Exchange Optimization
A complementary optimization eliminates unnecessary data movement across pipeline stages, delivering an additional 42% improvement on top of per-query GPU gains across the full TPC-H suite.
ClickBench Results
ClickBench exercises wide-column scans, string matching, regexp operations, and aggregations over a single large table, a profile well-suited to GPU memory bandwidth and vectorized compute.
Summary
| Metric | Result |
| Overall GPU speedup | 3.4x |
| Queries at 2x or better | 26 of 43 |
| Max single-query speedup | 9x |
| Query coverage | All 43 queries execute on GPU |
Notable Results
| Query type | GPU Time (ms) | vs CPU | Speedup |
| Aggregation-heavy (Q09) | 611 | -81% | 5.3x |
| Filter + aggregate (Q36) | 881 | -80% | 5.0x |
| String matching (Q21) | 1,013 | -75% | 4.1x |
| Regexp (Q28) | 1,040 | -75% | 4.0x |
| 26 queries total | — | -54% avg | 2x+ |
Regexp and string-matching queries show a particularly strong GPU advantage. Operations that are computationally expensive on CPU vectorize efficiently on CUDA cores with cuDF. All 43 ClickBench queries execute end-to-end on GPU.
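The "vs CPU" and "Speedup" columns in the table express the same measurement in two ways. A minimal sketch of the conversion, with the hypothetical helper name `speedup_from_reduction`:

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a run-time reduction in percent (81 for "-81%") into a speedup factor."""
    remaining = 1.0 - reduction_pct / 100.0  # fraction of CPU time still needed
    if remaining <= 0:
        raise ValueError("reduction must be below 100%")
    return 1.0 / remaining


for pct in (81, 80, 75):
    print(f"-{pct}% -> {speedup_from_reduction(pct):.1f}x")
```

Because the percentages in the table are rounded, the converted value can differ from the listed speedup in the last digit, as with Q21 (listed at 4.1x with a 75% reduction).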
Price/Performance
GPU instances are often assumed to carry a significant cost premium over CPU-only alternatives. The benchmark results challenge that assumption.
Cost per Unit of Analytical Work
On ClickBench, the GPU benchmark instance (g6.4xlarge, $1.323/hr) delivered 3.4x the analytical throughput of the CPU baseline instance (m8g.8xlarge, $1.436/hr) at 8% lower hourly cost. Because the GPU instance is also cheaper per hour, the effective cost per query drops by slightly more than the 3.4x throughput ratio.
Framed differently: for the same dollar spent, the GPU configuration processes at least 3.4x as many queries. For continuously running workloads (analytics pipelines, concurrent query loads, AI data preparation), this translates directly to lower infrastructure cost or the ability to serve significantly higher query volumes without scaling out.
| Configuration | Cost/hr | Relative Throughput | Cost per Query |
| CPU baseline (m8g.8xlarge) | $1.436 | 1.0x (baseline) | 1.0x (baseline) |
| GPU (g6.4xlarge, NVIDIA L4) | $1.323 | 3.4x | ~0.27x |
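The Cost per Query column follows from the hourly prices and the relative throughput. The sketch below shows the calculation using the rounded figures from the table; the function name `relative_cost_per_query` is illustrative.

```python
def relative_cost_per_query(cost_per_hr, baseline_cost_per_hr, relative_throughput):
    """Cost per query relative to the baseline.

    Relative hourly price divided by relative throughput: a cheaper and
    faster instance lowers the result on both counts.
    """
    return (cost_per_hr / baseline_cost_per_hr) / relative_throughput


gpu_cost_per_query = relative_cost_per_query(1.323, 1.436, 3.4)
queries_per_dollar = 1.0 / gpu_cost_per_query

print(f"relative cost per query: {gpu_cost_per_query:.2f}x")
print(f"relative queries per dollar: {queries_per_dollar:.1f}x")
```

With these rounded inputs the relative cost per query is about 0.27x, close to the approximate figure in the table, and throughput per dollar lands between 3x and 4x.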
For TPC-H on Blackwell (g7e.4xlarge), the 4.6x throughput improvement further improves the cost-per-query metric relative to CPU. A full price/performance comparison for that configuration is in progress as Blackwell instance pricing stabilizes.
Note: All figures are hardware costs only, excluding software licensing. Large customers typically receive significant discounts on GPU instances. CPU baseline is m8g.8xlarge (Graviton 4, US East on-demand).
GPU Execution: Operational Details
Query Planning Is Unchanged
SQL queries go through Starburst’s standard parser, planner, and optimizer unchanged. No GPU-specific syntax, query hints, or application changes are required. The GPU acceleration layer is transparent to the query author.
Graceful CPU Fallback
If an operator encounters an unsupported data type, expression, or connector, it falls back to CPU execution for that operation. No query fails due to incomplete GPU coverage. The fallback is per-operator, so partial GPU execution still delivers partial benefit for queries that mix GPU-capable and CPU-only operators.
One gap remains at exchange boundaries, where data is handed between pipeline stages through CPU memory. For aggregation-heavy queries, this means rows pass unnecessarily through CPU memory between the PARTIAL and FINAL aggregation stages. Implementing GPU LocalExchange to keep data on-device across these boundaries is the next expected step-change in TPC-H performance and is actively under development.
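For readers unfamiliar with two-stage aggregation, the sketch below shows the general technique for an AVG grouped by key: a PARTIAL stage reduces each split to intermediate state, and a FINAL stage merges those states. It is plain CPU-side Python for illustration only, not the engine's code, and the function names are hypothetical. The `partials` list stands in for the data that crosses the exchange between the two stages.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """PARTIAL stage: reduce one split to {key: (sum, count)}."""
    state = defaultdict(lambda: (0.0, 0))
    for key, value in rows:
        total, count = state[key]
        state[key] = (total + value, count + 1)
    return dict(state)


def final_aggregate(partials):
    """FINAL stage: merge partial states, then finish AVG per key."""
    merged = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for key, (total, count) in partial.items():
            merged_total, merged_count = merged[key]
            merged[key] = (merged_total + total, merged_count + count)
    return {key: total / count for key, (total, count) in merged.items()}


splits = [
    [("a", 1.0), ("b", 4.0), ("a", 3.0)],
    [("a", 5.0), ("b", 6.0)],
]

# This intermediate state is what moves between the two stages.
partials = [partial_aggregate(split) for split in splits]
result = final_aggregate(partials)
print(result)
```

Keeping the intermediate state on-device between the two stages is the data movement that a GPU-resident exchange would avoid.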
Implications for Data Teams
- No SQL changes required. Existing queries run faster without modification.
- Incremental deployment. GPU acceleration is operator-by-operator. Workloads benefit immediately for supported operators; coverage expands with each release.
- Cost-effective at current cloud pricing. GPU instances deliver 3–4x more analytical throughput for comparable or lower hourly cost. The common assumption that GPU instances carry a 2–3x cost premium does not hold for the instance types tested here.
- Shared infrastructure for analytics and AI. GPU instances that accelerate SQL queries can simultaneously support AI inference workloads, reducing the number of distinct infrastructure tiers required.
Conclusion
Starburst’s GPU acceleration work, integrating NVIDIA cuDF at the physical operator level, demonstrates that GPU-native SQL execution is achievable within an existing query engine without rebuilding it. The operator-level architecture provides both flexibility and compounding performance benefits as coverage grows.
Current results: Up to 6x on TPC-H (Blackwell), 3.4x on ClickBench (L4), with individual queries exceeding 11x. GPU instances deliver this throughput at comparable or lower cost per query than CPU alternatives.
This is active development. GPU RemoteExchange, additional operator coverage, and Blackwell-specific optimizations are in progress. Each new operator compounds the gains already measured.
Note: Benchmarks conducted on AWS with NVIDIA GPU instances in a development environment on single-node configurations. Production performance characteristics will vary by workload and deployment.



