
Executive Summary
Modern analytics and AI pipelines rely heavily on large-scale SQL processing engines operating on data lake architectures. As datasets grow into the terabyte and petabyte range, the performance characteristics of underlying CPU architectures become increasingly important.
To evaluate modern server CPU performance for large-scale analytics workloads, we benchmarked the Starburst Enterprise Intelligence Platform running on NVIDIA Vera against two widely deployed server platforms:
- Intel Xeon 6
- AMD EPYC Zen 5 (Turin)
The benchmarks used TPC-DS analytical workloads executed through Starburst’s query engine on Iceberg tables, representing a modern enterprise data stack.
Across these workloads, the NVIDIA Vera platform demonstrated:
- ~3× faster query throughput
- Up to 1.85× better CPU efficiency than Intel Xeon 6
- CPU efficiency comparable to AMD EPYC Zen 5
These improvements translate directly into:
- Faster analytics
- Improved infrastructure utilization
- Higher throughput for AI and data workloads
Benchmark goals
This study aims to evaluate:
- Query execution performance across modern CPU architectures
- CPU efficiency under large-scale analytics workloads
- System throughput at realistic dataset sizes
- Parallel execution characteristics
The goal is to provide a transparent view of how these systems behave under large-scale analytical query patterns commonly seen in modern data platforms.
Software stack
We executed these benchmarks using the following stack:
| Component | Description |
| Data Platform | Starburst Enterprise Platform |
| Execution engine | Starburst Query Engine |
| Storage format | Apache Iceberg |
| Dataset | TPC-DS |
| Benchmark scale | SF10 and SF1000 |
| Storage | local NVMe / local SSD |
Hardware platforms
We performed the following two primary comparisons:
Intel Xeon 6 vs NVIDIA Vera
| Platform | CPU | Cores | Memory | Storage |
| Intel Xeon 6 | Intel Xeon 6 | 176 vCPUs | 708 GB | NVMe |
| NVIDIA Vera | NVIDIA Vera | 176 cores | 708 GB | NVMe |
AMD EPYC Zen 5 vs NVIDIA Vera
| Platform | CPU | Cores | SMT | Memory | Storage |
| AMD EPYC Zen 5 | Turin | 88 cores | disabled | 708 GB | NVMe |
| NVIDIA Vera | Vera | 88 cores | disabled | 708 GB | NVMe |
Disabling SMT ensures a true core-to-core comparison.
Dataset and workloads
We used TPC-DS, a well-known analytical SQL benchmark designed to model complex decision-support workloads.
While synthetic benchmarks cannot capture every characteristic of production workloads, TPC-DS provides a standardized way to evaluate complex analytical query execution across systems.
TPC-DS includes:
- Large fact tables
- Multi-way joins
- Nested aggregations
- Complex predicates
Two scale factors were evaluated:
| Scale | Data Size | Purpose |
| SF10 | ~10 GB | CPU efficiency testing |
| SF1000 | ~1 TB | Realistic analytics workload |
Data generation
TPC-DS requires synthetic dataset generation prior to query execution.
Data generation itself stresses:
- CPU
- Memory bandwidth
- Storage throughput
| Platform | SF10 Data Generation Time |
| NVIDIA Vera | 102.2 s (1.7 m) |
| Intel Xeon 6 | 143.3 s (2.4 m) |
| AMD Zen 5 | 131.7 s (2.2 m) |
| Platform | SF1000 Data Generation Time |
| NVIDIA Vera | 684.0 s (11.4 m) |
| Intel Xeon 6 | 1,750.1 s (29.2 m) |
| AMD Zen 5 | 1,375.6 s (22.9 m) |
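As a quick sanity check, the SF1000 data-generation speedups implied by the table above can be reproduced with a few lines of Python (values copied directly from the table):

```python
# SF1000 data generation times (seconds) from the table above.
sf1000_gen = {
    "NVIDIA Vera": 684.0,
    "Intel Xeon 6": 1750.1,
    "AMD Zen 5": 1375.6,
}

# Speedup of each platform relative to Vera (higher = slower than Vera).
for platform, seconds in sf1000_gen.items():
    speedup = seconds / sf1000_gen["NVIDIA Vera"]
    print(f"{platform}: {speedup:.2f}x")  # Vera 1.00x, Xeon 2.56x, AMD 2.01x
```

At SF1000, Vera generated the dataset roughly 2.6× faster than Xeon 6 and 2× faster than AMD Zen 5.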
Query workloads
We executed the following TPC-DS analytical queries as part of the benchmark workload. These TPC-DS queries stress different parts of the SQL execution pipeline. The table below summarizes the dominant characteristics of each query.
| Query | Workload Type | Dominant Operators |
| q4 | Large fact table scan with filtering | Scan, Filter, Aggregation |
| q9 | Multi-table join with filtering | Scan, Join, Aggregation |
| q14a | Join-heavy analytical pipeline | Scan, Hash Join, Aggregation |
| q14b | Join-heavy analytical pipeline | Scan, Hash Join, Aggregation |
| q23a | Nested joins with aggregation | Scan, Join, Aggregation |
| q23b | Nested joins with aggregation | Scan, Join, Aggregation |
| q47 | Dimensional joins with filtering | Scan, Join |
| q57 | Large aggregation workload | Scan, Aggregation |
| q67 | Mixed join and aggregation | Scan, Join, Aggregation |
| q78 | Aggregation-heavy analytical query | Scan, Aggregation |
We selected these queries because they exercise a diverse set of analytical SQL patterns commonly seen in large-scale data platforms.
Specifically, the query set includes:
Scan-heavy workloads
Queries such as q4 and q9 perform large fact table scans with predicate filtering and projections. These queries stress memory bandwidth and vectorized execution pipelines.
Join-heavy pipelines
Queries including q14a, q14b, q23a, and q23b involve multiple joins between fact and dimension tables. These workloads stress hash table construction, memory locality, and join probe performance.
Aggregation-heavy analytics
Queries such as q57, q67, and q78 contain multi-stage aggregations and grouping operations over large datasets. These workloads stress CPU arithmetic throughput and aggregation pipeline efficiency.
Together, this query set provides a representative mix of scan, join, and aggregation workloads, allowing us to evaluate how different CPU architectures perform across common analytical execution patterns.
Why analytics workloads are CPU intensive
Analytics queries process massive datasets and involve:
- Scanning billions of rows
- Evaluating predicates
- Building hash tables
- Performing aggregations
These operations stress:
- Memory bandwidth
- SIMD units
- Cache hierarchies.
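To make these operations concrete, the build/probe pattern behind a hash join with aggregation can be sketched as follows. This is a toy Python illustration with made-up data, not the engine's actual vectorized implementation:

```python
# Illustrative fact table rows: (customer_id, amount). Data is invented.
orders = [
    (1, 10.0), (2, 5.0), (1, 7.5), (3, 2.0), (2, 1.0),
]
# Illustrative dimension table: customer_id -> name.
customers = {1: "alice", 2: "bob", 3: "carol"}

# Build phase: a hash table keyed on the join column (here, the dict itself).
build_side = customers

# Probe phase with aggregation: total order amount per customer name.
totals = {}
for customer_id, amount in orders:
    name = build_side.get(customer_id)   # hash probe
    if name is not None:
        totals[name] = totals.get(name, 0.0) + amount

print(totals)  # {'alice': 17.5, 'bob': 6.0, 'carol': 2.0}
```

At scale, the probe loop runs over billions of rows, which is why memory bandwidth and cache behavior dominate performance.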
Execution methodology
To generate a realistic analytical workload and avoid artifacts caused by sequential query execution, we used Apache JMeter to drive concurrent query execution against the Starburst cluster.
Query execution model
Each benchmark run used the following configuration:
| Parameter | Value |
| Load generator | Apache JMeter |
| Virtual users | 5 |
| Queries per user | full query set |
| Query set | q4, q9, q14a, q14b, q23a, q23b, q47, q57, q67, q78 |
| Execution pattern | sequential queries per user |
| Total concurrent query streams | 5 |
Each virtual user executed the full query sequence in order. With five concurrent users, this produced a continuous analytical workload that better reflects real production query environments compared to single-query benchmarks.
This approach ensures that:
- The query engine operates under sustained load
- Worker threads remain active
- CPU utilization remains representative of production workloads
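The execution model above can be sketched in Python. This is a simplified stand-in for the JMeter test plan, not the actual harness; `run_query` is a hypothetical stub for submitting a query to the Starburst coordinator:

```python
from concurrent.futures import ThreadPoolExecutor

QUERY_SET = ["q4", "q9", "q14a", "q14b", "q23a",
             "q23b", "q47", "q57", "q67", "q78"]
VIRTUAL_USERS = 5

def run_query(query_id):
    # Hypothetical stub: in the real benchmark, JMeter submits the query
    # to the Starburst coordinator and waits for it to complete.
    return query_id

def run_user_stream(user_id):
    # Each virtual user executes the full query set sequentially, in order.
    return [run_query(q) for q in QUERY_SET]

# Five users run concurrently, producing five parallel query streams.
with ThreadPoolExecutor(max_workers=VIRTUAL_USERS) as pool:
    results = list(pool.map(run_user_stream, range(VIRTUAL_USERS)))

total_queries = sum(len(r) for r in results)
print(total_queries)  # 5 users x 10 queries = 50
```

Each stream is sequential internally, but the five streams overlap in time, keeping the cluster under sustained concurrent load.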
Query runtimes were captured directly from the Starburst telemetry and query logs and aggregated across multiple runs to produce the final results reported in this paper.
System monitoring and validation
To ensure the benchmark results accurately reflect system performance, we monitored system behavior throughout the benchmark runs using multiple tools.
The goal of this monitoring was to confirm that:
- CPU cores were actively utilized
- The system was not bottlenecked by I/O
- Worker threads were executing queries in parallel
- No resource throttling occurred during execution.
Monitoring tools
The following tools were used during the benchmark:
| Tool | Purpose |
| ntop / ntopng | network and system traffic monitoring |
| top / htop | real-time CPU utilization |
| vmstat | memory and CPU activity |
| iostat | disk throughput monitoring |
| Starburst Telemetry | query stage execution timing |
These tools provided visibility into system utilization across CPU, memory, and storage subsystems.
CPU utilization observations
During the benchmark runs:
- CPU utilization remained consistently high during query execution.
- Multiple worker threads were active simultaneously.
- Query stages executed in parallel across available cores.
This confirmed that the workloads exercised the full CPU parallelism available on the systems.
Parallel execution behavior
Using query execution metrics and system monitoring, we observed that the Starburst query engine maintained multiple active execution stages across worker threads.
This behavior is expected for large analytical queries, where operations such as:
- Table scans
- Hash joins
- Aggregations
are each executed in parallel across many tasks.
Sustained parallel execution allows the system to process large datasets efficiently and is a key factor in overall query throughput.
Benchmark results
SF1000 results
Average runtime across the workload:
| Platform | Avg Query Runtime |
| AMD Zen 5 | ~131 s |
| Intel Xeon 6 | ~104 s |
| NVIDIA Vera | ~39–43 s |
Relative performance:
| Platform | Relative |
| AMD Zen 5 | 1.0× |
| Intel Xeon 6 | 1.1× |
| NVIDIA Vera | ~3.0× |
Vera completed the workload approximately 3× faster.
Total workload runtime
| Platform | Total Runtime |
| AMD Zen 5 (88 cores, SMT off) | ~6,559 s |
| NVIDIA Vera (88 cores, SMT off) | ~2,130 s |
Total Workload Runtime represents the cumulative runtime across all JMeter virtual users executing the full query set.
CPU efficiency
CPU time analysis reveals how efficiently each platform executes the workload.
Xeon comparison
| Platform | CPU Time |
| Intel Xeon 6 | 1.0× |
| NVIDIA Vera | 0.54× |
This corresponds to approximately 1.85× better CPU efficiency (1 / 0.54 ≈ 1.85).
AMD comparison
| Platform | CPU Time |
| AMD Zen 5 | 0.84× |
| NVIDIA Vera | 1.0× |
CPU efficiency is comparable between the architectures.
SF10 vs SF1000
| Platform | SF10 | SF1000 |
| AMD Zen 5 | ~1.8 s | 131 s |
| Xeon 6 | ~2.9 s | 104 s |
| Vera | ~1.4 s | 39–43 s |
Small datasets highlight CPU efficiency, while larger datasets expose system throughput differences.
Operator-level analysis
TPC-DS queries stress three major operator types:
- Scans
- Joins
- Aggregations
Scan operators
Scan operators perform:
- Column decoding
- Predicate evaluation
- Projection
These operations stress:
- Memory bandwidth
- Vectorized execution
- Cache hierarchy.
Scan-heavy queries: q4, q9, q57
Join operators
Join-heavy queries include: q14a, q14b, q23a, q23b
These operations stress:
- Hash table construction
- Memory locality
- Pointer chasing.
Aggregation operators
Aggregation workloads include: q57, q67, q78
These rely heavily on:
- SIMD execution
- CPU arithmetic throughput.
Parallel execution
Starburst Query Engine executes queries using parallel tasks distributed across CPU cores.
Performance depends on:
- Task scheduling
- Memory bandwidth
- Thread scaling.
During the benchmark, Vera sustained higher parallel throughput, which contributed to faster overall execution.
Conclusion
Across both Intel Xeon 6 and AMD EPYC Zen 5 comparisons, the Starburst Enterprise Intelligence Platform running on NVIDIA Vera demonstrated:
- ~3× faster query performance
- Up to 1.85× better CPU efficiency
- Higher sustained system throughput
These improvements enable organizations to run larger analytical workloads while reducing infrastructure requirements.
Future work
Additional benchmarking areas that could further illuminate performance differences include:
- Larger dataset scales (SF3000+)
- Concurrent multi-query workloads
- Mixed analytics and AI pipelines.
Final remarks
This study demonstrates that the Starburst Enterprise Intelligence Platform running on NVIDIA Vera delivers significantly higher query throughput for large-scale analytics workloads.
These improvements can help organizations:
- Accelerate analytics
- Support AI workloads
- Improve infrastructure efficiency
Appendix A — Query list
q4, q9, q14a, q14b, q23a, q23b, q47, q57, q67, q78
Appendix B — Benchmark configuration
- Starburst Enterprise Platform
- Apache Iceberg tables
- Starburst query execution
- Local NVMe / SSD storage
- 708 GB system memory
Appendix C — Per-query results (SF1000)
The following table shows the average query runtime for each query in the SF1000 benchmark suite.
These values represent the mean of multiple runs to minimize noise from system variability.
| Query | AMD Zen 5 (s) | Intel Xeon 6 (s) | NVIDIA Vera (s) |
| q4 | 241.4 | 141.7 | 48.5 |
| q9 | 102.1 | 98.6 | 46.7 |
| q14a | 214.6 | 153.6 | 49.7 |
| q14b | 210.1 | 116.6 | 38.1 |
| q23a | 141.7 | 142.7 | 60.9 |
| q23b | 142.4 | 134.8 | 66.1 |
| q47 | 75.6 | 64.7 | 27.3 |
| q57 | 44.5 | 38.0 | 19.9 |
| q67 | 49.8 | 75.3 | 40.2 |
| q78 | 89.5 | 74.4 | 28.7 |
Average query runtime:
| Platform | Average Runtime |
| AMD Zen5 | ~131 s |
| Xeon 6 | ~104 s |
| Vera | ~39–43 s |
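The average runtimes above can be reproduced directly from the per-query table. A minimal Python check (values copied from the table):

```python
# Per-query SF1000 runtimes (seconds) from the table above.
runtimes = {
    "AMD Zen 5":    [241.4, 102.1, 214.6, 210.1, 141.7, 142.4, 75.6, 44.5, 49.8, 89.5],
    "Intel Xeon 6": [141.7,  98.6, 153.6, 116.6, 142.7, 134.8, 64.7, 38.0, 75.3, 74.4],
    "NVIDIA Vera":  [ 48.5,  46.7,  49.7,  38.1,  60.9,  66.1, 27.3, 19.9, 40.2, 28.7],
}

for platform, values in runtimes.items():
    avg = sum(values) / len(values)
    print(f"{platform}: {avg:.1f} s")  # 131.2 s, 104.0 s, 42.6 s
```

The computed means (≈131 s, ≈104 s, ≈43 s) match the summary table.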
Appendix D — SF10 benchmark results
The SF10 dataset represents a small dataset scenario where CPU efficiency dominates performance.
Unlike SF1000, SF10 places minimal pressure on memory bandwidth or storage throughput.
| Platform | Avg Query Runtime (SF10) |
| AMD Zen5 | 1.82 s |
| Xeon 6 | 2.93 s |
| Vera | 1.43 s |
Because the dataset is small, the performance differences primarily reflect:
- CPU pipeline efficiency
- Vector execution performance
- Cache behavior
Appendix E — CPU utilization and parallelism
To better understand the observed performance differences, we examined CPU utilization during query execution.
The Starburst query engine schedules work across multiple worker threads. The amount of active CPU parallelism can be estimated using:
Effective Parallelism = Total CPU Time / Wall Clock Time
This metric reflects how many CPU cores are effectively utilized during execution.
AMD Zen5 vs Vera parallelism
| Platform | CPU Time | Wall Time | Effective Parallelism |
| AMD Zen5 | ~18,963 s | ~1312 s | ~14.5 cores |
| NVIDIA Vera | ~22,628 s | ~426 s | ~54 cores |
These results indicate that the Vera system sustained significantly higher parallel execution during the workload.
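Applying the effective-parallelism formula above to the measured values is straightforward:

```python
# CPU time and wall-clock time (seconds) from the table above.
measurements = {
    "AMD Zen 5":   {"cpu_time": 18963.0, "wall_time": 1312.0},
    "NVIDIA Vera": {"cpu_time": 22628.0, "wall_time": 426.0},
}

for platform, m in measurements.items():
    # Effective parallelism = total CPU time / wall-clock time.
    parallelism = m["cpu_time"] / m["wall_time"]
    print(f"{platform}: ~{parallelism:.1f} cores effectively utilized")
```

This yields roughly 14.5 cores for AMD Zen 5 and 53 cores for Vera, consistent with the approximate values reported in the table.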
Higher sustained parallelism allows:
- Faster completion of query stages
- Improved resource utilization
- Higher overall throughput.
Xeon6 vs Vera CPU efficiency
| Platform | Relative CPU Time |
| Intel Xeon 6 | 1.0× |
| NVIDIA Vera | 0.54× |
This indicates that Vera required approximately 54% of the CPU time of Xeon 6, or roughly 1.85× better CPU efficiency, to complete the workload.
Appendix F — Workload throughput comparison
The full SF1000 workload runtime across platforms is summarized below.
| Platform | Total Runtime |
| AMD Zen5 | ~1312 s |
| Intel Xeon 6 | ~1040 s |
| NVIDIA Vera | ~426 s |
Relative throughput:
| Platform | Relative Performance |
| AMD Zen5 | 1.0× |
| Intel Xeon 6 | ~1.1× |
| NVIDIA Vera | ~3.0× |
Appendix G — Reproducibility notes
Benchmark runs were conducted under the following conditions:
- Identical dataset
- Identical query set
- Identical Starburst configuration
- Multiple query repetitions
- Average runtime used for reporting.
This methodology helps reduce noise from:
- Caching effects
- Background system processes
- Query scheduling variability.
Appendix H — Additional observations
Several patterns emerged from the benchmark results.
1. Scan-heavy queries dominate runtime
Queries performing large fact table scans show the largest performance differences across architectures.
2. Join pipelines amplify CPU differences
Join-heavy queries stress memory locality and hash table construction.
3. Large datasets reveal architectural differences
At larger scale factors, such as SF1000, differences in the following become more pronounced:
- Memory bandwidth
- Task scheduling
- Parallelism



