
Executive Summary
Modern analytics and AI pipelines rely heavily on large-scale SQL processing engines operating on data lake architectures. As datasets grow into the terabyte and petabyte range, the performance characteristics of underlying CPU architectures become increasingly important.
To evaluate modern server CPU performance for large-scale analytics workloads, we benchmarked the Starburst Enterprise Intelligence Platform running on NVIDIA Vera against two widely deployed server platforms:
- Intel Xeon 6
- AMD EPYC Zen 5 (Turin)
The benchmarks used TPC-DS analytical workloads executed through Starburst’s query engine on Iceberg tables, representing a modern enterprise data stack.
Across these workloads, the NVIDIA Vera platform demonstrated:
- ~3× faster query throughput
- Up to 1.85× better CPU efficiency than Intel Xeon 6
- CPU efficiency comparable to AMD EPYC Zen 5
These improvements translate directly into:
- Faster analytics
- Improved infrastructure utilization
- Higher throughput for AI and data workloads
Benchmark goals
This study aims to evaluate:
- Query execution performance across modern CPU architectures
- CPU efficiency under large-scale analytics workloads
- System throughput at realistic dataset sizes
- Parallel execution characteristics
The goal is to provide a transparent view of how these systems behave under large-scale analytical query patterns commonly seen in modern data platforms.
Software stack
We executed these benchmarks using the following stack:
| Component | Description |
| Data Platform | Starburst Enterprise Platform |
| Execution engine | Starburst Query Engine |
| Storage format | Apache Iceberg |
| Dataset | TPC-DS |
| Benchmark scale | SF10 and SF1000 |
| Storage | local NVMe / local SSD |
Hardware platforms
We performed the following two primary comparisons:
Intel Xeon 6 vs NVIDIA Vera
| Platform | CPU | Cores | Memory | Storage |
| Intel Xeon 6 | Intel Xeon 6 | 176 vCPUs | 708 GB | NVMe |
| NVIDIA Vera | NVIDIA Vera | 176 cores | 708 GB | NVMe |
AMD EPYC Zen 5 vs NVIDIA Vera
| Platform | CPU | Cores | SMT | Memory | Storage |
| AMD EPYC Zen 5 | Turin | 88 cores | disabled | 708 GB | NVMe |
| NVIDIA Vera | Vera | 88 cores | disabled | 708 GB | NVMe |
Disabling SMT ensures a true core-to-core comparison.
Dataset and workloads
We used TPC-DS, a well-known analytical SQL benchmark designed to model complex decision-support workloads.
While synthetic benchmarks cannot capture every characteristic of production workloads, TPC-DS provides a standardized way to evaluate complex analytical query execution across systems.
TPC-DS includes:
- Large fact tables
- Multi-way joins
- Nested aggregations
- Complex predicates
Two scale factors were evaluated:
| Scale | Data Size | Purpose |
| SF10 | ~10 GB | CPU efficiency testing |
| SF1000 | ~1 TB | Realistic analytics workload |
Data generation
TPC-DS requires synthetic dataset generation prior to query execution.
Data generation itself stresses:
- CPU
- Memory bandwidth
- Storage throughput
| Platform | SF10 Data Generation Time |
| NVIDIA Vera | 102.2 s (1.7 m) |
| Intel Xeon 6 | 143.3 s (2.4 m) |
| AMD Zen 5 | 131.7 s (2.2 m) |
| Platform | SF1000 Data Generation Time |
| NVIDIA Vera | 684.0 s (11.4 m) |
| Intel Xeon 6 | 1,750.1 s (29.2 m) |
| AMD Zen 5 | 1,375.6 s (22.9 m) |
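As a quick sanity check, the SF1000 data-generation speedups implied by the table above can be reproduced with a few lines of Python (values copied directly from the table):

```python
# SF1000 data generation times (seconds) from the table above.
sf1000_gen = {
    "NVIDIA Vera": 684.0,
    "Intel Xeon 6": 1750.1,
    "AMD Zen 5": 1375.6,
}

# Speedup of each platform relative to Vera (higher = slower than Vera).
for platform, seconds in sf1000_gen.items():
    speedup = seconds / sf1000_gen["NVIDIA Vera"]
    print(f"{platform}: {speedup:.2f}x")  # Vera 1.00x, Xeon 2.56x, AMD 2.01x
```

At SF1000, Vera generated the dataset roughly 2.6× faster than Xeon 6 and 2× faster than AMD Zen 5.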
Query workloads
We executed the following TPC-DS analytical queries as part of the benchmark workload. These TPC-DS queries stress different parts of the SQL execution pipeline. The table below summarizes the dominant characteristics of each query.
| Query | Workload Type | Dominant Operators |
| q4 | Large fact table scan with filtering | Scan, Filter, Aggregation |
| q9 | Multi-table join with filtering | Scan, Join, Aggregation |
| q14a | Join-heavy analytical pipeline | Scan, Hash Join, Aggregation |
| q14b | Join-heavy analytical pipeline | Scan, Hash Join, Aggregation |
| q23a | Nested joins with aggregation | Scan, Join, Aggregation |
| q23b | Nested joins with aggregation | Scan, Join, Aggregation |
| q47 | Dimensional joins with filtering | Scan, Join |
| q57 | Large aggregation workload | Scan, Aggregation |
| q67 | Mixed join and aggregation | Scan, Join, Aggregation |
| q78 | Aggregation-heavy analytical query | Scan, Aggregation |
We selected these queries because they exercise a diverse set of analytical SQL patterns commonly seen in large-scale data platforms.
Specifically, the query set includes:
Scan-heavy workloads
Queries such as q4 and q9 perform large fact table scans with predicate filtering and projections. These queries stress memory bandwidth and vectorized execution pipelines.
Join-heavy pipelines
Queries including q14a, q14b, q23a, and q23b involve multiple joins between fact and dimension tables. These workloads stress hash table construction, memory locality, and join probe performance.
Aggregation-heavy analytics
Queries such as q57, q67, and q78 contain multi-stage aggregations and grouping operations over large datasets. These workloads stress CPU arithmetic throughput and aggregation pipeline efficiency.
Together, this query set provides a representative mix of scan, join, and aggregation workloads, allowing us to evaluate how different CPU architectures perform across common analytical execution patterns.
Why analytics workloads are CPU intensive
Analytics queries process massive datasets and involve:
- Scanning billions of rows
- Evaluating predicates
- Building hash tables
- Performing aggregations
These operations stress:
- Memory bandwidth
- SIMD units
- Cache hierarchies.
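To make these operations concrete, the build/probe pattern behind a hash join with aggregation can be sketched as follows. This is a toy Python illustration with made-up data, not the engine's actual vectorized implementation:

```python
# Illustrative fact table rows: (customer_id, amount). Data is invented.
orders = [
    (1, 10.0), (2, 5.0), (1, 7.5), (3, 2.0), (2, 1.0),
]
# Illustrative dimension table: customer_id -> name.
customers = {1: "alice", 2: "bob", 3: "carol"}

# Build phase: a hash table keyed on the join column (here, the dict itself).
build_side = customers

# Probe phase with aggregation: total order amount per customer name.
totals = {}
for customer_id, amount in orders:
    name = build_side.get(customer_id)   # hash probe
    if name is not None:
        totals[name] = totals.get(name, 0.0) + amount

print(totals)  # {'alice': 17.5, 'bob': 6.0, 'carol': 2.0}
```

At scale, the probe loop runs over billions of rows, which is why memory bandwidth and cache behavior dominate performance.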
Execution methodology
To generate a realistic analytical workload and avoid artifacts caused by sequential query execution, we used Apache JMeter to drive concurrent query execution against the Starburst cluster.
Query execution model
Each benchmark run used the following configuration:
| Parameter | Value |
| Load generator | Apache JMeter |
| Virtual users | 5 |
| Queries per user | full query set |
| Query set | q4, q9, q14a, q14b, q23a, q23b, q47, q57, q67, q78 |
| Execution pattern | sequential queries per user |
| Total concurrent query streams | 5 |
Each virtual user executed the full query sequence in order. With five concurrent users, this produced a continuous analytical workload that better reflects real production query environments compared to single-query benchmarks.
This approach ensures that:
- The query engine operates under sustained load
- Worker threads remain active
- CPU utilization remains representative of production workloads
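The execution model above can be sketched in Python. This is a simplified stand-in for the JMeter test plan, not the actual harness; `run_query` is a hypothetical stub for submitting a query to the Starburst coordinator:

```python
from concurrent.futures import ThreadPoolExecutor

QUERY_SET = ["q4", "q9", "q14a", "q14b", "q23a",
             "q23b", "q47", "q57", "q67", "q78"]
VIRTUAL_USERS = 5

def run_query(query_id):
    # Hypothetical stub: in the real benchmark, JMeter submits the query
    # to the Starburst coordinator and waits for it to complete.
    return query_id

def run_user_stream(user_id):
    # Each virtual user executes the full query set sequentially, in order.
    return [run_query(q) for q in QUERY_SET]

# Five users run concurrently, producing five parallel query streams.
with ThreadPoolExecutor(max_workers=VIRTUAL_USERS) as pool:
    results = list(pool.map(run_user_stream, range(VIRTUAL_USERS)))

total_queries = sum(len(r) for r in results)
print(total_queries)  # 5 users x 10 queries = 50
```

Each stream is sequential internally, but the five streams overlap in time, keeping the cluster under sustained concurrent load.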
Query runtimes were captured directly from the Starburst telemetry and query logs and aggregated across multiple runs to produce the final results reported in this paper.
System monitoring and validation
To ensure the benchmark results accurately reflect system performance, we monitored system behavior throughout the benchmark runs using multiple tools.
The goal of this monitoring was to confirm that:
- CPU cores were actively utilized
- The system was not bottlenecked by I/O
- Worker threads were executing queries in parallel
- No resource throttling occurred during execution.
Monitoring tools
The following tools were used during the benchmark:
| Tool | Purpose |
| ntop / ntopng | network and system traffic monitoring |
| top / htop | real-time CPU utilization |
| vmstat | memory and CPU activity |
| iostat | disk throughput monitoring |
| Starburst Telemetry | query stage execution timing |
These tools provided visibility into system utilization across CPU, memory, and storage subsystems.
CPU utilization observations
During the benchmark runs:
- CPU utilization remained consistently high during query execution.
- Multiple worker threads were active simultaneously.
- Query stages executed in parallel across available cores.
This confirmed that the workloads exercised the full CPU parallelism available on the systems.
Parallel execution behavior
Using query execution metrics and system monitoring, we observed that the Starburst query engine maintained multiple active execution stages across worker threads.
This behavior is expected for large analytical queries, where operations such as:
- Table scans
- Hash joins
- Aggregations
are each executed in parallel across many tasks.
Sustained parallel execution allows the system to process large datasets efficiently and is a key factor in overall query throughput.
Benchmark results
SF1000 results
Average runtime across the workload:
| Platform | Avg Query Runtime |
| AMD Zen 5 | ~131 s |
| Intel Xeon 6 | ~104 s |
| NVIDIA Vera | ~39–43 s |
Relative performance:
| Platform | Relative |
| AMD Zen 5 | 1.0× |
| Intel Xeon 6 | 1.1× |
| NVIDIA Vera | ~3.0× |
Vera completed the workload approximately 3× faster.
Total workload runtime
| Platform | Total Runtime |
| AMD Zen 5 (88 cores, SMT off) | ~6,559 s |
| NVIDIA Vera (88 cores, SMT off) | ~2,130 s |
Total Workload Runtime represents the cumulative runtime across all JMeter virtual users executing the full query set.
CPU efficiency
CPU time analysis reveals how efficiently each platform executes the workload.
Xeon comparison
| Platform | CPU Time |
| Intel Xeon 6 | 1.0× |
| NVIDIA Vera | 0.54× |
This corresponds to approximately 1.85× better CPU efficiency (1 / 0.54 ≈ 1.85).
AMD comparison
| Platform | CPU Time |
| AMD Zen 5 | 0.84× |
| NVIDIA Vera | 1.0× |
CPU efficiency is comparable between the architectures.
SF10 vs SF1000
| Platform | SF10 | SF1000 |
| AMD Zen 5 | ~1.8 s | 131 s |
| Xeon 6 | ~2.9 s | 104 s |
| Vera | ~1.4 s | 39–43 s |
Small datasets highlight CPU efficiency, while larger datasets expose system throughput differences.
Operator-level analysis
TPC-DS queries stress three major operator types:
- Scans
- Joins
- Aggregations
Scan operators
Scan operators perform:
- Column decoding
- Predicate evaluation
- Projection
These operations stress:
- Memory bandwidth
- Vectorized execution
- Cache hierarchy.
Scan-heavy queries: q4, q9, q57
Join operators
Join-heavy queries include: q14a, q14b, q23a, q23b
These operations stress:
- Hash table construction
- Memory locality
- Pointer chasing.
Aggregation operators
Aggregation workloads include: q57, q67, q78
These rely heavily on:
- SIMD execution
- CPU arithmetic throughput.
Parallel execution
Starburst Query Engine executes queries using parallel tasks distributed across CPU cores.
Performance depends on:
- Task scheduling
- Memory bandwidth
- Thread scaling.
During the benchmark, Vera sustained higher parallel throughput, which contributed to faster overall execution.
Conclusion
Across both Intel Xeon 6 and AMD EPYC Zen 5 comparisons, the Starburst Enterprise Intelligence Platform running on NVIDIA Vera demonstrated:
- ~3× faster query performance
- Up to 1.85× better CPU efficiency
- Higher sustained system throughput
These improvements enable organizations to run larger analytical workloads while reducing infrastructure requirements.
Future work
Additional benchmarking areas that could further illuminate performance differences include:
- Larger dataset scales (SF3000+)
- Concurrent multi-query workloads
- Mixed analytics and AI pipelines.
Final remarks
This study demonstrates that the Starburst Enterprise Intelligence Platform running on NVIDIA Vera delivers significantly higher query throughput for large-scale analytics workloads.
These improvements can help organizations:
- Accelerate analytics
- Support AI workloads
- Improve infrastructure efficiency
Appendix A — Query list
q4, q9, q14a, q14b, q23a, q23b, q47, q57, q67, q78
Appendix B — Benchmark configuration
- Starburst Enterprise Platform
- Apache Iceberg tables
- Starburst query execution
- Local NVMe / SSD storage
- 708 GB system memory
Appendix C — Per-query results (SF1000)
The following table shows the average query runtime for each query in the SF1000 benchmark suite.
These values represent the mean of multiple runs to minimize noise from system variability.
| Query | AMD Zen 5 (s) | Intel Xeon 6 (s) | NVIDIA Vera (s) |
| q4 | 241.4 | 141.7 | 48.5 |
| q9 | 102.1 | 98.6 | 46.7 |
| q14a | 214.6 | 153.6 | 49.7 |
| q14b | 210.1 | 116.6 | 38.1 |
| q23a | 141.7 | 142.7 | 60.9 |
| q23b | 142.4 | 134.8 | 66.1 |
| q47 | 75.6 | 64.7 | 27.3 |
| q57 | 44.5 | 38.0 | 19.9 |
| q67 | 49.8 | 75.3 | 40.2 |
| q78 | 89.5 | 74.4 | 28.7 |
Average query runtime:
| Platform | Average Runtime |
| AMD Zen5 | ~131 s |
| Xeon 6 | ~104 s |
| Vera | ~39–43 s |
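The average runtimes above can be reproduced directly from the per-query table. A minimal Python check (values copied from the table):

```python
# Per-query SF1000 runtimes (seconds) from the table above.
runtimes = {
    "AMD Zen 5":    [241.4, 102.1, 214.6, 210.1, 141.7, 142.4, 75.6, 44.5, 49.8, 89.5],
    "Intel Xeon 6": [141.7,  98.6, 153.6, 116.6, 142.7, 134.8, 64.7, 38.0, 75.3, 74.4],
    "NVIDIA Vera":  [ 48.5,  46.7,  49.7,  38.1,  60.9,  66.1, 27.3, 19.9, 40.2, 28.7],
}

for platform, values in runtimes.items():
    avg = sum(values) / len(values)
    print(f"{platform}: {avg:.1f} s")  # 131.2 s, 104.0 s, 42.6 s
```

The computed means (≈131 s, ≈104 s, ≈43 s) match the summary table.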
Appendix D — SF10 benchmark results
The SF10 dataset represents a small dataset scenario where CPU efficiency dominates performance.
Unlike SF1000, SF10 places minimal pressure on memory bandwidth or storage throughput.
| Platform | Avg Query Runtime (SF10) |
| AMD Zen5 | 1.82 s |
| Xeon 6 | 2.93 s |
| Vera | 1.43 s |
Because the dataset is small, the performance differences primarily reflect:
- CPU pipeline efficiency
- Vector execution performance
- Cache behavior
Appendix E — CPU utilization and parallelism
To better understand the observed performance differences, we examined CPU utilization during query execution.
The Starburst query engine schedules work across multiple worker threads. The amount of active CPU parallelism can be estimated using:
Effective Parallelism = Total CPU Time / Wall Clock Time
This metric reflects how many CPU cores are effectively utilized during execution.
AMD Zen5 vs Vera parallelism
| Platform | CPU Time | Wall Time | Effective Parallelism |
| AMD Zen5 | ~18,963 s | ~1312 s | ~14.5 cores |
| NVIDIA Vera | ~22,628 s | ~426 s | ~54 cores |
These results indicate that the Vera system sustained significantly higher parallel execution during the workload.
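Applying the effective-parallelism formula above to the measured values is straightforward:

```python
# CPU time and wall-clock time (seconds) from the table above.
measurements = {
    "AMD Zen 5":   {"cpu_time": 18963.0, "wall_time": 1312.0},
    "NVIDIA Vera": {"cpu_time": 22628.0, "wall_time": 426.0},
}

for platform, m in measurements.items():
    # Effective parallelism = total CPU time / wall-clock time.
    parallelism = m["cpu_time"] / m["wall_time"]
    print(f"{platform}: ~{parallelism:.1f} cores effectively utilized")
```

This yields roughly 14.5 cores for AMD Zen 5 and 53 cores for Vera, consistent with the approximate values reported in the table.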
Higher sustained parallelism allows:
- Faster completion of query stages
- Improved resource utilization
- Higher overall throughput.
Xeon6 vs Vera CPU efficiency
| Platform | Relative CPU Time |
| Intel Xeon 6 | 1.0× |
| NVIDIA Vera | 0.54× |
This indicates that Vera required approximately 54% of the CPU time of Xeon 6, or roughly 1.85× better CPU efficiency, to complete the workload.
Appendix F — Workload throughput comparison
The full SF1000 workload runtime across platforms is summarized below.
| Platform | Total Runtime |
| AMD Zen5 | ~1312 s |
| Intel Xeon 6 | ~1040 s |
| NVIDIA Vera | ~426 s |
Relative throughput:
| Platform | Relative Performance |
| AMD Zen5 | 1.0× |
| Intel Xeon 6 | ~1.1× |
| NVIDIA Vera | ~3.0× |
Appendix G — Reproducibility notes
Benchmark runs were conducted under the following conditions:
- Identical dataset
- Identical query set
- Identical Starburst configuration
- Multiple query repetitions
- Average runtime used for reporting.
This methodology helps reduce noise from:
- Caching effects
- Background system processes
- Query scheduling variability.
Appendix H — Additional observations
Several patterns emerged from the benchmark results.
1. Scan-heavy queries dominate runtime
Queries performing large fact table scans show the largest performance differences across architectures.
2. Join pipelines amplify CPU differences
Join-heavy queries stress memory locality and hash table construction.
3. Large datasets reveal architectural differences
At larger scale factors, such as SF1000, differences in the following become more pronounced:
- Memory bandwidth
- Task scheduling
- Parallelism



