GPU-Accelerated SQL Analytics: How Starburst and NVIDIA Deliver Industry-Benchmark Speedups on GPU Infrastructure

GPU-Native query execution in the Starburst SQL engine

June 1, 2026

Jitender Aswani

Senior Vice President, Engineering

Starburst Data

Martin Traverso

Co-creator of Trino (formerly PrestoSQL), CTO of Starburst

Starburst

Piotr Findeisen

Co-founder and Distinguished Software Engineer

Starburst

Raunaq Morarka

Senior Staff Software Engineer

Starburst

Piotr Rzysko

Software Engineer

Starburst

Executive Summary

Starburst is actively integrating NVIDIA cuDF, an open-source data processing toolkit, directly into its SQL query engine, enabling GPU-accelerated query processing without changes to SQL or application code. This paper describes the architecture, the current state of development, and benchmark results across two industry-standard workloads.

Early benchmark results:

TPC-H: up to 6x speedup across accelerated queries at production scale. Individual queries up to 11.4x faster.
ClickBench: 3.4x overall GPU speedup across 43 queries, with individual queries up to 9x faster.
Price/performance: GPU instances deliver 3–4x more analytical throughput per dollar compared to CPU-only instances at comparable cloud pricing.

This is active development. The results reported here reflect the current state of the integration. Additional operator coverage and architectural improvements are underway and will improve these numbers.

Introduction

Modern analytical workloads (terabyte-scale joins, complex aggregations, concurrent queries from AI pipelines) are pushing CPU-based query engines toward their throughput limits. Vectorized execution and SIMD acceleration help, but the fundamental constraint is raw compute and memory bandwidth.

GPUs offer a different compute model: thousands of CUDA cores, high-bandwidth on-device memory, and architectures purpose-built for massively parallel data transformation. The challenge has been bridging GPU capability to SQL execution without rebuilding the query engine.

Starburst’s approach integrates NVIDIA cuDF at the physical operator level inside the Trino-based query engine. Trino is an independent open-source query engine governed by its own foundation. Starburst ships a proprietary distribution of Trino; the GPU integration described here is developed and maintained within Starburst’s distribution and is not part of open-source Trino.

Why Query Acceleration Matters for the Agentic Era

Agentic AI systems execute sequences of tool calls to retrieve, transform, and reason over data. Unlike a single analyst query, an agent may issue dozens of queries within one task, each blocking the next step until the model receives a response. Query latency compounds directly into total agent response time.

With ten sequential queries averaging 2 seconds each, the agent loop carries 20 seconds of query wait time per task. At 4.6x average GPU acceleration, that drops to approximately 4.3 seconds. For production systems handling many concurrent agent sessions, this difference determines whether the system is responsive or impractical.
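The arithmetic behind that estimate can be sketched in a few lines of Python. This is a simplified model, not a measurement: it assumes every query in the chain is strictly sequential and receives the same speedup, and the function name `agent_query_wait` is hypothetical.

```python
def agent_query_wait(num_queries: int, avg_latency_s: float, speedup: float = 1.0) -> float:
    """Total seconds an agent spends blocked on strictly sequential queries.

    Assumption: every query gets the same uniform speedup, which real
    workloads will not match exactly.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return num_queries * avg_latency_s / speedup


cpu_wait = agent_query_wait(10, 2.0)               # ten queries, 2 s each
gpu_wait = agent_query_wait(10, 2.0, speedup=4.6)  # same chain at 4.6x
print(f"CPU: {cpu_wait:.1f} s, GPU: {gpu_wait:.1f} s")
```

Because the queries block one another, the saving scales linearly with the length of the chain: longer agent tasks gain proportionally more.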

Beyond latency, several workload categories that appear inside agentic pipelines are naturally GPU-bound:

Data preparation for AI: aggregations, joins, and transformations over large datasets that feed model inputs are the same operations that GPUs already accelerate in analytical SQL.
Text and pattern operations at scale: regular expression matching, string extraction, and entity filtering over large corpora. ClickBench results on regexp queries (4x speedup) are directly applicable to these workloads.
Concurrent query density: agents operating in parallel fire independent query streams. GPUs handle high-concurrency analytical compute more efficiently than CPUs at scale.
Shared infrastructure: the same GPU instance that accelerates SQL can serve AI inference workloads, reducing the number of distinct infrastructure tiers an organization must operate.

GPU-accelerated SQL is not only faster analytics. For organizations building agentic systems on a data layer, it reduces a compounding latency bottleneck that CPU-based query engines struggle to address through further optimization alone.

Starburst and NVIDIA: Joint Development

This work is the result of a technical collaboration between the Starburst and NVIDIA engineering teams. NVIDIA contributed deep expertise in cuDF, GPU memory management, and optimization guidance specific to the Blackwell architecture. Starburst drove the integration architecture, operator-level implementation within the Starburst query engine, and benchmark methodology.

The collaboration reflects a shared goal: making GPU-accelerated SQL a production capability for enterprise data platforms, not a research prototype. Both teams continue to work jointly on the roadmap, including GPU RemoteExchange, expanded operator coverage, and Blackwell-specific optimizations.

Architecture: GPU Acceleration at the Physical Operator Level

Rather than offloading entire queries or rewriting the query engine, Starburst targets individual physical operators, the lowest-level execution primitives in the query plan. NVIDIA cuDF is integrated directly at this layer.

How It Works

The Starburst query engine compiles SQL into a physical plan composed of execution operators: Table Scan, Filter, Aggregation, Hash Join, TopN, and others. At execution time, each operator checks whether its input data and operation are GPU-capable. If so, it hands off to cuDF; if not, it falls back to CPU transparently. No query fails due to incomplete GPU coverage.
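The per-operator decision described above can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `Operator` class, the `GPU_SUPPORTED_TYPES` set, and the plain Python function standing in for a cuDF call are assumptions made for illustration, not Starburst's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical set of column types the GPU path accepts (illustrative only).
GPU_SUPPORTED_TYPES = {"bigint", "double", "varchar", "date", "decimal"}


@dataclass
class Operator:
    name: str
    cpu_impl: Callable[[list], list]
    gpu_impl: Optional[Callable[[list], list]] = None  # None means no GPU version yet

    def is_gpu_capable(self, column_types) -> bool:
        # GPU-capable only if an implementation exists and every input type is supported.
        return self.gpu_impl is not None and set(column_types) <= GPU_SUPPORTED_TYPES

    def execute(self, rows, column_types):
        # Decide at execution time; fall back to CPU instead of failing the query.
        if self.is_gpu_capable(column_types):
            return "gpu", self.gpu_impl(rows)
        return "cpu", self.cpu_impl(rows)


def keep_positive(rows):
    # Stand-in for both code paths; a real GPU path would hand off to cuDF here.
    return [r for r in rows if r > 0]


filter_op = Operator("Filter", cpu_impl=keep_positive, gpu_impl=keep_positive)
print(filter_op.execute([3, -1, 7], ["bigint"]))  # supported type: GPU path
print(filter_op.execute([3, -1, 7], ["map"]))     # unsupported type: CPU fallback
```

Both calls return the same rows; only the execution path differs, which is the property that keeps the fallback invisible to the query author.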

When consecutive operators are both GPU-capable, data is passed between them as a reference to GPU memory and never leaves the GPU. This eliminates two categories of overhead: the PCIe transfer between host and device memory, and the marshalling cost of converting between Starburst’s in-memory format and the columnar layout required by cuDF. Both overheads are significant, and removing them together is why chaining operators on the GPU compounds the benefit.

The operator-level approach carries two distinct architectural advantages:

Flexibility: operators compose freely in any combination. A Scan → Filter → Hash Join → Aggregation pipeline uses the same GPU building blocks as Scan → Aggregation. Any combination the planner produces is handled with no bespoke code paths.
Performance: each additional GPU-capable operator in a pipeline eliminates another round of data movement and format conversion. As operator coverage grows, acceleration compounds: each new operator improves performance across all queries that contain it.
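A toy model makes the compounding effect concrete. The sketch below counts host-to-device and device-to-host copies for a linear pipeline, under the simplifying assumptions that data starts in host memory, results must end in host memory, and each CPU/GPU boundary costs exactly one copy. The function and operator names are illustrative only.

```python
def count_transfers(pipeline, gpu_capable):
    """Count host<->device copies for a linear operator pipeline.

    Assumptions of this toy model: data starts on the host, results must
    end on the host, and each switch between CPU and GPU costs one copy.
    """
    transfers = 0
    on_gpu = False
    for op in pipeline:
        wants_gpu = op in gpu_capable
        if wants_gpu != on_gpu:
            transfers += 1  # crossing the PCIe boundary
            on_gpu = wants_gpu
    if on_gpu:
        transfers += 1      # final results return to host memory
    return transfers


PIPELINE = ["scan", "filter", "join", "aggregation"]

without_join = count_transfers(PIPELINE, {"scan", "filter", "aggregation"})
with_join = count_transfers(PIPELINE, {"scan", "filter", "join", "aggregation"})
print(without_join, with_join)
```

In this model, moving the join onto the GPU halves the number of copies for the pipeline, because the one CPU-only operator in the middle was forcing a round trip.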

GPU Operators — Complete

The following operators are complete and integrated into Starburst’s query engine:

Operator	SQL Coverage
Table Scan (Parquet)	FROM clause, Parquet reads
Filter	WHERE, HAVING, predicates (=, !=, >, <, BETWEEN, LIKE, IN)
Aggregation	GROUP BY, COUNT, SUM, AVG, MIN, MAX, GROUPING SETS, ROLLUP
Scan + Filter (fused)	Eliminates intermediate materialization between scan and filter
TopN	ORDER BY + LIMIT
Join	INNER, LEFT/RIGHT/FULL OUTER JOIN, CROSS JOIN
REGEXP_REPLACE	String transformation functions
Key SQL functions	length, IF, date/time extractions

A single operator implementation covers an entire category of SQL syntax. GPU Aggregation accelerates GROUP BY, COUNT, SUM, AVG, MIN, MAX, GROUPING SETS, and ROLLUP simultaneously, because all compile to the same AggregationOperator in the physical plan. Hash Join covers all join types in one implementation.
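To see why one operator can serve several aggregate functions at once, consider a single-pass hash aggregation written in plain Python. This is a CPU-side illustration of the general technique, not the cuDF or Starburst implementation, and `hash_aggregate` is a hypothetical name.

```python
def hash_aggregate(rows, key, value):
    """Group rows by `key`; compute COUNT, SUM, MIN, MAX, AVG of `value` in one pass."""
    state = {}
    for row in rows:
        k, v = row[key], row[value]
        s = state.get(k)
        if s is None:
            # First row for this group initializes every accumulator.
            state[k] = {"count": 1, "sum": v, "min": v, "max": v}
        else:
            s["count"] += 1
            s["sum"] += v
            s["min"] = min(s["min"], v)
            s["max"] = max(s["max"], v)
    # AVG is derived from SUM and COUNT once all rows are consumed.
    return {k: {**s, "avg": s["sum"] / s["count"]} for k, s in state.items()}


orders = [
    {"region": "EU", "amount": 10.0},
    {"region": "EU", "amount": 30.0},
    {"region": "US", "amount": 5.0},
]
result = hash_aggregate(orders, "region", "amount")
print(result["EU"])
```

All five aggregates share the same grouping hash table and the same scan of the input, which is why accelerating that one structure benefits every function built on it.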

Benchmark Setup

Hardware

TPC-H benchmarks were conducted on AWS g7e.4xlarge instances (NVIDIA Blackwell):

Component	Specification
GPU	NVIDIA GB202 (Blackwell), 96 GB GPU memory
CPU	Intel Xeon (Emerald Rapids), 16 vCPU
RAM	128 GB
Network	50 Gbps

ClickBench benchmarks were conducted on AWS g6.4xlarge instances (NVIDIA L4):

Component	Specification
GPU	NVIDIA L4 (Ada Lovelace), 24 GB GDDR6
CPU	AMD EPYC, 16 vCPU
RAM	64 GB
Network	25 Gbps
On-demand cost	$1.323/hr (US East)

Workloads

TPC-H

A 22-query supply-chain analytics benchmark. TPC-H exercises multi-table joins, complex aggregations, sorting, and subqueries across an 8-table schema. It is one of the most demanding benchmarks for GPU-accelerated databases due to its join-heavy profile.

ClickBench

A 43-query web analytics benchmark derived from a real-world production schema. Queries cover filters, string matching (LIKE, regexp), aggregations, and projections over a wide-column, single-table dataset. The profile is well-suited to GPU memory bandwidth and vectorized compute.

Methodology

Multiple warmup runs followed by measured runs; mean reported per query
Single-node execution
Speedup reported as geometric mean across per-query ratios, the statistically appropriate metric for multiplicative comparisons, used by SPEC and TPC
CPU baseline uses the CPU side of the same instance for consistent hardware conditions
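The measurement procedure above could be sketched as follows. The warmup and run counts are assumptions (the text says only "multiple"), `measure_ms` is a hypothetical helper, and the ratios in the usage example are made up to show how the geometric mean differs from the arithmetic mean.

```python
import statistics
import time


def measure_ms(run_query, warmups=2, runs=3):
    """Run `run_query` untimed `warmups` times, then return the mean of `runs` timed runs (ms)."""
    for _ in range(warmups):
        run_query()  # warm caches; result discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


def summarize_speedups(ratios):
    """Geometric mean of per-query CPU/GPU time ratios."""
    return statistics.geometric_mean(ratios)


# Illustrative ratios only, not benchmark data.
example_ratios = [2.0, 8.0]
print(f"geometric mean: {summarize_speedups(example_ratios):.2f}")
print(f"arithmetic mean: {statistics.mean(example_ratios):.2f}")
```

The geometric mean weights a 2x and an 8x result as the multiplicative midpoint, so a single outlier query pulls the summary less than it would with an arithmetic mean.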

Note: Results are from a development environment on single-node configurations. Production performance will vary by workload and deployment.

TPC-H Results

TPC-H is historically one of the most challenging benchmarks for GPU-accelerated databases. Its multi-table join complexity and mixed operator profile push every layer of the execution engine.

Summary

Metric	Result
Geometric mean speedup (accelerated queries)	4.6x
Max single-query speedup	11.4x (Q13)
Queries exceeding 6x speedup	5 of 18 tested
Queries exceeding 4x speedup	13 of 18 tested
Hardware	NVIDIA GB202 Blackwell, 96 GB GPU memory

Top Query Results

Query	CPU (ms)	GPU (ms)	Speedup
Q13	32,827	2,871	11.4x
Q04	11,407	1,517	7.5x
Q01	11,890	1,610	7.4x
Q03	12,000	1,704	7.0x
Q05	13,146	2,091	6.3x
Q08	13,257	2,982	4.4x
Q09	61,208	23,999	2.6x
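The Speedup column is the ratio of CPU time to GPU time. A short Python check, using only the timings from the table above, recomputes it:

```python
# (CPU ms, GPU ms) per query, copied from the table above.
results_ms = {
    "Q13": (32827, 2871),
    "Q04": (11407, 1517),
    "Q01": (11890, 1610),
    "Q03": (12000, 1704),
    "Q05": (13146, 2091),
    "Q08": (13257, 2982),
    "Q09": (61208, 23999),
}

# Speedup = CPU time / GPU time.
speedups = {query: cpu / gpu for query, (cpu, gpu) in results_ms.items()}

for query, speedup in speedups.items():
    print(f"{query}: {speedup:.1f}x")
```

These seven rows are only the queries listed in this table, so a summary statistic over them would not be expected to match a figure computed over all 18 tested queries.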

Note: Q17 shows a GPU regression (0.3x) due to a gap in the dynamic row filtering optimization on GPU. This gap has been addressed. Q18, Q20, Q21, Q22 were excluded from this run due to memory configuration and will be included in subsequent runs.

What Drove Performance

Hash Join on GPU

Hash Join was the single largest step-change in TPC-H performance. With Hash Join on GPU, data from Table Scan and Filter flows directly into the join on-device, eliminating both PCIe transfers and format conversion at the operator boundary. Critically, aggregation operators following the join can now also stay in GPU memory throughout the pipeline. The compounding effect is a property of the architecture: once multiple consecutive operators are on-device, each additional one eliminates another round of data movement for all queries that contain it.
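For readers less familiar with the operator itself, a hash join has a build phase and a probe phase. The sketch below shows the classic algorithm for an inner equi-join in plain Python; it is a CPU-side illustration of the general technique, not a description of how cuDF or Starburst implement the join on the GPU, and the sample tables are invented.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: build a hash table on one side, probe it with the other."""
    # Build phase: index the (usually smaller) side by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: stream the other side and emit one output row per match.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


customers = [{"custkey": 1, "name": "Ada"}, {"custkey": 2, "name": "Lin"}]
orders = [
    {"orderkey": 10, "o_custkey": 1},
    {"orderkey": 11, "o_custkey": 1},
    {"orderkey": 12, "o_custkey": 3},  # no matching customer: dropped by INNER JOIN
]
rows = hash_join(customers, orders, "custkey", "o_custkey")
print(len(rows), rows[0])
```

Both phases touch every input row, which is why keeping the join's inputs and outputs in device memory removes so much data movement from join-heavy plans.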

Decimal Arithmetic Support

Queries involving decimal column arithmetic previously fell back to CPU. With decimal arithmetic now GPU-supported, a broader set of query plans execute entirely on-device.

Remote Exchange Optimization

A complementary optimization eliminates unnecessary data movement across pipeline stages, delivering an additional 42% improvement on top of per-query GPU gains across the full TPC-H suite.

ClickBench Results

ClickBench exercises wide-column scans, string matching, regexp operations, and aggregations over a single large table, a profile well-suited to GPU memory bandwidth and vectorized compute.

Summary

Metric	Result
Overall GPU speedup	3.4x
Queries at 2x or better	26 of 43
Max single-query speedup	9x
Query coverage	All 43 queries execute on GPU

Notable Results

Query type	GPU Time (ms)	vs CPU	Speedup
Aggregation-heavy (Q09)	611	-81%	5.3x
Filter + aggregate (Q36)	881	-80%	5.0x
String matching (Q21)	1,013	-75%	4.1x
Regexp (Q28)	1,040	-75%	4.0x
26 queries total	—	-54% avg	2x+
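The "vs CPU" and "Speedup" columns express the same measurement in two ways: a speedup of s corresponds to a run-time reduction of 1 − 1/s. A minimal Python sketch of the conversion, with hypothetical function names:

```python
def time_reduction_pct(speedup: float) -> float:
    """Percentage reduction in run time implied by a given speedup factor."""
    return (1.0 - 1.0 / speedup) * 100.0


def speedup_from_reduction(reduction_pct: float) -> float:
    """Inverse conversion: speedup factor implied by a percentage reduction."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


for s in (5.0, 4.0, 2.0):
    print(f"{s:.1f}x faster -> {time_reduction_pct(s):.0f}% less time")
```

Because the table's figures are rounded independently, recomputing one column from the other can differ from the printed value by about a percentage point.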

Regexp and string-matching queries show a particularly strong GPU advantage. Operations that are computationally expensive on CPU vectorize efficiently on CUDA cores with cuDF. All 43 ClickBench queries execute end-to-end on GPU.

Price/Performance

GPU instances are often assumed to carry a significant cost premium over CPU-only alternatives. The benchmark results challenge that assumption.

Cost per Unit of Analytical Work

On ClickBench, the GPU benchmark instance (g6.4xlarge, $1.323/hr) delivered 3.4x the analytical throughput of the CPU baseline instance (m8g.8xlarge, $1.436/hr) at 8% lower hourly cost. Combining the two, the effective cost per query drops by approximately 3.7x (3.4 × 1.436 / 1.323).

Framed differently: for the same dollar spent, the GPU configuration processes roughly 3.7x more queries. For continuously running workloads (analytics pipelines, concurrent query loads, AI data preparation), this translates directly to infrastructure cost reduction or the ability to serve significantly higher query volumes without scaling out.

Configuration	Cost/hr	Relative Throughput	Cost per Query
CPU baseline (m8g.8xlarge)	$1.436	1.0x (baseline)	1.0x (baseline)
GPU (g6.4xlarge, NVIDIA L4)	$1.323	3.4x	~0.27x
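The relationship between hourly price, throughput, and cost per query can be worked through in a few lines of Python. The sketch uses only the prices and the 3.4x throughput figure quoted above, and defines cost per query as hourly cost divided by relative throughput, which is an assumption about how the metric is meant. The function name is hypothetical.

```python
def relative_cost_per_query(cost_hr, throughput, base_cost_hr, base_throughput=1.0):
    """Cost per query relative to a baseline: (cost / throughput) ratios."""
    return (cost_hr / throughput) / (base_cost_hr / base_throughput)


CPU_COST_HR = 1.436     # m8g.8xlarge, from the text
GPU_COST_HR = 1.323     # g6.4xlarge, from the text
GPU_THROUGHPUT = 3.4    # relative to the CPU baseline, from the text

hourly_saving_pct = (1.0 - GPU_COST_HR / CPU_COST_HR) * 100.0
cost_ratio = relative_cost_per_query(GPU_COST_HR, GPU_THROUGHPUT, CPU_COST_HR)
queries_per_dollar = 1.0 / cost_ratio

print(f"hourly cost difference: {hourly_saving_pct:.1f}% lower")
print(f"relative cost per query: {cost_ratio:.2f}x")
print(f"queries per dollar: {queries_per_dollar:.2f}x")
```

The throughput gain and the lower hourly price multiply, so the per-dollar advantage is somewhat larger than the raw speedup alone.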

For TPC-H on Blackwell (g7e.4xlarge), the 4.6x throughput improvement further improves the cost-per-query metric relative to CPU. A full price/performance comparison for that configuration is in progress as Blackwell instance pricing stabilizes.

Note: All figures are hardware costs only, excluding software licensing. Large customers typically receive significant discounts on GPU instances. CPU baseline is m8g.8xlarge (Graviton 4, US East on-demand).

GPU Execution: Operational Details

Query Planning Is Unchanged

SQL queries go through Starburst’s standard parser, planner, and optimizer unchanged. No GPU-specific syntax, query hints, or application changes are required. The GPU acceleration layer is transparent to the query author.

Graceful CPU Fallback

If an operator encounters an unsupported data type, expression, or connector, it falls back to CPU execution for that operation. No query fails due to incomplete GPU coverage. The fallback is per-operator, so partial GPU execution still delivers partial benefit for queries that mix GPU-capable and CPU-only operators.

One remaining gap is the exchange between pipeline stages, which does not yet keep data on the GPU. For aggregation-heavy queries, this means rows pass unnecessarily through CPU memory between the PARTIAL and FINAL aggregation stages. Implementing GPU LocalExchange to keep data on-device across these boundaries is the next expected step-change in TPC-H performance and is actively under development.
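The PARTIAL and FINAL stages mentioned above follow the standard two-phase aggregation pattern. The Python sketch below illustrates it for AVG; the function names and sample splits are hypothetical, and the list handed from one stage to the other stands in for the data that crosses the exchange.

```python
def partial_agg(split):
    """PARTIAL stage: per-split running (sum, count) for each group key."""
    state = {}
    for key, value in split:
        total, count = state.get(key, (0.0, 0))
        state[key] = (total + value, count + 1)
    return state


def final_agg(partials):
    """FINAL stage: merge partial states from all splits, then finish AVG."""
    merged = {}
    for part in partials:
        for key, (total, count) in part.items():
            m_total, m_count = merged.get(key, (0.0, 0))
            merged[key] = (m_total + total, m_count + count)
    return {key: total / count for key, (total, count) in merged.items()}


splits = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0)],
]
# The list of partial states is what travels across the exchange between stages.
exchanged = [partial_agg(split) for split in splits]
print(final_agg(exchanged))
```

The partial states are typically far smaller than the input rows, but they still have to move between stages, and that hand-off is the boundary the text describes.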

Implications for Data Teams

No SQL changes required. Existing queries run faster without modification.
Incremental deployment. GPU acceleration is operator-by-operator. Workloads benefit immediately for supported operators; coverage expands with each release.
Cost-effective at current cloud pricing. GPU instances deliver 3–4x more analytical throughput for comparable or lower hourly cost. The commonly assumed 2–3x cost premium does not hold for current GPU instance generations.
Shared infrastructure for analytics and AI. GPU instances that accelerate SQL queries can simultaneously support AI inference workloads, reducing the number of distinct infrastructure tiers required.

Conclusion

Starburst’s GPU acceleration work, integrating NVIDIA cuDF at the physical operator level, demonstrates that GPU-native SQL execution is achievable within an existing query engine without rebuilding it. The operator-level architecture provides both flexibility and compounding performance benefits as coverage grows.

Current results: Up to 6x on TPC-H (Blackwell), 3.4x on ClickBench (L4), with individual queries exceeding 11x. GPU instances deliver this throughput at comparable or lower cost per query than CPU alternatives.

This is active development. GPU RemoteExchange, additional operator coverage, and Blackwell-specific optimizations are in progress. The trajectory is clear: each new operator compounds the gains already measured.

Note: Benchmarks conducted on AWS with NVIDIA GPU instances in a development environment on single-node configurations. Production performance characteristics will vary by workload and deployment.
