
Executive Summary

Modern analytics and AI pipelines rely heavily on large-scale SQL processing engines operating on data lake architectures. As datasets grow into the terabyte and petabyte range, the performance characteristics of underlying CPU architectures become increasingly important.

To evaluate modern server CPU performance for large-scale analytics workloads, we benchmarked the Starburst Enterprise Intelligence Platform running on NVIDIA Vera against two widely deployed server platforms:

  • Intel Xeon 6
  • AMD EPYC Zen 5 (Turin)

The benchmarks used TPC-DS analytical workloads executed through Starburst’s query engine on Iceberg tables, representing a modern enterprise data stack.

Across these workloads, the NVIDIA Vera platform demonstrated:

  • ~3× faster query throughput than both comparison platforms
  • Up to 1.85× better CPU efficiency vs. Intel Xeon 6
  • Comparable CPU efficiency vs. AMD EPYC Zen 5

These improvements translate directly into:

  • Faster analytics
  • Improved infrastructure utilization
  • Higher throughput for AI and data workloads.

Benchmark goals

This study aims to evaluate:

  1. Query execution performance across modern CPU architectures
  2. CPU efficiency under large-scale analytics workloads
  3. System throughput at realistic dataset sizes
  4. Parallel execution characteristics

The goal is to provide a transparent view of how these systems behave under large-scale analytical query patterns commonly seen in modern data platforms.

Software stack

We executed these benchmarks using the following stack:

Component Description
Data Platform Starburst Enterprise Platform
Execution engine Starburst Query Engine
Storage format Apache Iceberg
Dataset TPC-DS
Benchmark scale SF10 and SF1000
Storage local NVMe / local SSD

 

Hardware platforms

We performed the following two primary comparisons:

Intel Xeon 6 vs NVIDIA Vera

Platform CPU Cores Memory Storage
Intel Xeon 6 Intel Xeon 6 176 vCPUs 708 GB NVMe
NVIDIA Vera NVIDIA Vera 176 cores 708 GB NVMe

 

AMD EPYC Zen 5 vs NVIDIA Vera

Platform CPU Cores SMT Memory Storage
AMD EPYC Zen 5 Turin 88 cores disabled 708 GB NVMe
NVIDIA Vera Vera 88 cores disabled 708 GB NVMe

Disabling SMT ensures a true core-to-core comparison.

Dataset and workloads

We used TPC-DS, a well-known analytical SQL benchmark designed to model complex decision-support workloads.

While synthetic benchmarks cannot capture every characteristic of production workloads, TPC-DS provides a standardized way to evaluate complex analytical query execution across systems.

TPC-DS includes:

  • Large fact tables
  • Multi-way joins
  • Nested aggregations
  • Complex predicates

Two scale factors were evaluated.

Scale Data Size Purpose
SF10 ~10 GB CPU efficiency testing
SF1000 ~1 TB realistic analytics workload

 

Data generation

TPC-DS requires synthetic dataset generation prior to query execution.

Data generation itself stresses:

  • CPU
  • Memory bandwidth
  • Storage throughput

Platform SF10 Data Generation Time
NVIDIA Vera 102.2 s (1.7 m)
Intel Xeon 6 143.3 s (2.4 m)
AMD Zen 5 131.7 s (2.2 m)

 

Platform SF1000 Data Generation Time
NVIDIA Vera 684.0 s (11.4 m)
Intel Xeon 6 1,750.1 s (29.2 m)
AMD Zen 5 1,375.6 s (22.9 m)
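As a quick check, the speedups implied by the SF1000 generation times above can be computed directly from the table (ratios rounded to two decimal places):

```python
# SF1000 data generation times in seconds, from the table above.
sf1000_gen = {"NVIDIA Vera": 684.0, "Intel Xeon 6": 1750.1, "AMD Zen 5": 1375.6}

# Speedup of Vera relative to each comparison platform.
for platform, seconds in sf1000_gen.items():
    if platform != "NVIDIA Vera":
        speedup = seconds / sf1000_gen["NVIDIA Vera"]
        print(f"{platform}: {speedup:.2f}x slower than Vera")
```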

 

Query workloads

We executed the following TPC-DS analytical queries as part of the benchmark workload. These TPC-DS queries stress different parts of the SQL execution pipeline. The table below summarizes the dominant characteristics of each query.

Query Workload Type Dominant Operators
q4 Large fact table scan with filtering Scan, Filter, Aggregation
q9 Multi-table join with filtering Scan, Join, Aggregation
q14a Join-heavy analytical pipeline Scan, Hash Join, Aggregation
q14b Join-heavy analytical pipeline Scan, Hash Join, Aggregation
q23a Nested joins with aggregation Scan, Join, Aggregation
q23b Nested joins with aggregation Scan, Join, Aggregation
q47 Dimensional joins with filtering Scan, Join
q57 Large aggregation workload Scan, Aggregation
q67 Mixed join and aggregation Scan, Join, Aggregation
q78 Aggregation-heavy analytical query Scan, Aggregation

We selected these queries because they exercise a diverse set of analytical SQL patterns commonly seen in large-scale data platforms.

Specifically, the query set includes:

Scan-heavy workloads

Queries such as q4 and q9 perform large fact table scans with predicate filtering and projections. These queries stress memory bandwidth and vectorized execution pipelines.

Join-heavy pipelines

Queries including q14a, q14b, q23a, and q23b involve multiple joins between fact and dimension tables. These workloads stress hash table construction, memory locality, and join probe performance.

Aggregation-heavy analytics

Queries such as q57, q67, and q78 contain multi-stage aggregations and grouping operations over large datasets. These workloads stress CPU arithmetic throughput and aggregation pipeline efficiency.

Together, this query set provides a representative mix of scan, join, and aggregation workloads, allowing us to evaluate how different CPU architectures perform across common analytical execution patterns.
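For readers unfamiliar with TPC-DS, the following is a deliberately tiny, self-contained sketch of the scan → join → aggregate pattern these queries share. It is not an actual TPC-DS query; the table names and data are toy stand-ins, and an in-memory SQLite database substitutes for the Iceberg tables:

```python
import sqlite3

# Toy fact/dimension schema standing in for TPC-DS tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE store_sales (item_id INT, store_id INT, amount REAL);
    CREATE TABLE item (item_id INT, category TEXT);
    INSERT INTO store_sales VALUES (1, 10, 5.0), (1, 11, 7.5), (2, 10, 3.0);
    INSERT INTO item VALUES (1, 'books'), (2, 'music');
""")

# Scan the fact table, join to the dimension, filter, and aggregate --
# the same operator pipeline (Scan, Join, Aggregation) in miniature.
rows = con.execute("""
    SELECT i.category, SUM(s.amount) AS total
    FROM store_sales s
    JOIN item i ON s.item_id = i.item_id
    WHERE s.amount > 1.0
    GROUP BY i.category
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('books', 12.5), ('music', 3.0)]
```

At SF1000 the same pattern runs over billions of rows, which is what makes the operator mix, rather than any single query, the interesting unit of analysis.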

Why analytics workloads are CPU intensive

Analytics queries process massive datasets and involve:

  • Scanning billions of rows
  • Evaluating predicates
  • Building hash tables
  • Performing aggregations

These operations stress:

  • Memory bandwidth
  • SIMD units
  • Cache hierarchies.

Execution methodology

To generate a realistic analytical workload and avoid artifacts caused by sequential query execution, we used Apache JMeter to drive concurrent query execution against the Starburst cluster.

Query execution model

Each benchmark run used the following configuration:

Parameter Value
Load generator Apache JMeter
Virtual users 5
Queries per user full query set
Query set q4, q9, q14a, q14b, q23a, q23b, q47, q57, q67, q78
Execution pattern sequential queries per user
Total concurrent query streams 5

Each virtual user executed the full query sequence in order. With five concurrent users, this produced a continuous analytical workload that better reflects real production query environments compared to single-query benchmarks.
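The load shape can be pictured as five independent workers, each submitting the query list in order. The sketch below mimics that structure in Python; `run_query` is a placeholder (an assumption for illustration), where a real harness would submit SQL to the Starburst coordinator:

```python
from concurrent.futures import ThreadPoolExecutor

QUERIES = ["q4", "q9", "q14a", "q14b", "q23a", "q23b", "q47", "q57", "q67", "q78"]
VIRTUAL_USERS = 5

def run_query(query_id: str) -> str:
    # Placeholder: a real harness would submit the SQL text to the
    # Starburst coordinator here and block until completion.
    return f"done:{query_id}"

def virtual_user(user_id: int) -> list:
    # Each virtual user executes the full query set sequentially, in order.
    return [run_query(q) for q in QUERIES]

# Five users run concurrently, producing five sustained query streams.
with ThreadPoolExecutor(max_workers=VIRTUAL_USERS) as pool:
    results = list(pool.map(virtual_user, range(VIRTUAL_USERS)))

total = sum(len(r) for r in results)
print(f"{total} queries executed")  # 50 queries across 5 streams
```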

This approach ensures that:

  • The query engine operates under sustained load
  • Worker threads remain active
  • CPU utilization remains representative of production workloads

Query runtimes were captured directly from the Starburst telemetry and query logs and aggregated across multiple runs to produce the final results reported in this paper.

System monitoring and validation

To ensure the benchmark results accurately reflect system performance, we monitored system behavior throughout the benchmark runs using multiple tools.

The goal of this monitoring was to confirm that:

  • CPU cores were actively utilized
  • The system was not bottlenecked by I/O
  • Worker threads were executing queries in parallel
  • No resource throttling occurred during execution.

Monitoring tools

The following tools were used during the benchmark:

Tool Purpose
ntop / ntopng network and system traffic monitoring
top / htop real-time CPU utilization
vmstat memory and CPU activity
iostat disk throughput monitoring
Starburst Telemetry query stage execution timing

These tools provided visibility into system utilization across CPU, memory, and storage subsystems.

CPU utilization observations

During the benchmark runs:

  • CPU utilization remained consistently high during query execution.
  • Multiple worker threads were active simultaneously.
  • Query stages executed in parallel across available cores.

This confirmed that the workloads exercised the full CPU parallelism available on the systems.

Parallel execution behavior

Using query execution metrics and system monitoring, we observed that the Starburst query engine maintained multiple active execution stages across worker threads.

This behavior is expected for large analytical queries, where operations such as:

  • Table scans
  • Hash joins
  • Aggregations

are each executed in parallel across many tasks.

Sustained parallel execution allows the system to process large datasets efficiently and is a key factor in overall query throughput.

Benchmark results

SF1000 results

Average runtime across the workload:

Platform Avg Query Runtime
AMD Zen 5 ~131 s
Intel Xeon 6 ~104 s
NVIDIA Vera ~39–43 s

Relative performance:

Platform Relative
AMD Zen 5 1.0×
Intel Xeon 6 1.1×
NVIDIA Vera ~3.0×

Vera completed the workload approximately 3× faster.

Total workload runtime

Platform Total Runtime
AMD Zen 5 (88 cores, SMT off) ~6,559 s
NVIDIA Vera (88 cores, SMT off) ~2,130 s

Total Workload Runtime represents the cumulative runtime across all JMeter virtual users executing the full query set.

CPU efficiency

CPU time analysis reveals how efficiently each platform executes the workload.

Xeon comparison

Platform CPU Time
Intel Xeon 6 1.0×
NVIDIA Vera 0.54×

This represents 1.85× better CPU efficiency.
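The 1.85× figure follows directly from the relative CPU times in the table above:

```python
# Relative CPU time from the table: Xeon 6 = 1.0, Vera = 0.54.
xeon_cpu_time = 1.0
vera_cpu_time = 0.54

# Efficiency advantage = how many times less CPU time Vera needed.
advantage = xeon_cpu_time / vera_cpu_time
print(f"{advantage:.2f}x")  # 1.85x
```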

AMD comparison

Platform CPU Time
AMD Zen 5 0.84×
NVIDIA Vera 1.0×

CPU efficiency is comparable between the architectures.

SF10 vs SF1000

Platform SF10 SF1000
AMD Zen 5 ~1.8 s 131 s
Xeon 6 ~2.9 s 104 s
Vera ~1.4 s 39–43 s

Small datasets highlight CPU efficiency, while larger datasets expose system throughput differences.

Operator-level analysis

TPC-DS queries stress three major operator types:

  • Scans
  • Joins
  • Aggregations

Scan operators

Scan operators perform:

  • Column decoding
  • Predicate evaluation
  • Projection

These operations stress:

  • Memory bandwidth
  • Vectorized execution
  • Cache hierarchy.

Scan-heavy queries: q4, q9, q57

Join operators

Join-heavy queries include: q14a, q14b, q23a, q23b

These operations stress:

  • Hash table construction
  • Memory locality
  • Pointer chasing.
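As a concrete illustration of why joins stress memory access patterns: a hash join builds a table on one input, then probes it once per row of the other. The sketch below is toy data and plain Python dictionaries, not Starburst's implementation:

```python
# Build side (dimension) and probe side (fact) as toy row lists.
dim_rows = [(1, "books"), (2, "music"), (3, "video")]
fact_rows = [(1, 5.0), (2, 3.0), (1, 7.5), (4, 9.0)]

# Build phase: construct the hash table keyed on the join column
# (stresses allocation and memory locality in a real engine).
hash_table = {}
for key, category in dim_rows:
    hash_table.setdefault(key, []).append(category)

# Probe phase: each fact row performs a hash lookup (pointer chasing on
# cache misses at scale); unmatched keys are dropped, as in an inner join.
joined = [(category, amount)
          for key, amount in fact_rows
          for category in hash_table.get(key, [])]
print(joined)  # [('books', 5.0), ('music', 3.0), ('books', 7.5)]
```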

Aggregation operators

Aggregation workloads include: q57, q67, q78

These rely heavily on:

  • SIMD execution
  • CPU arithmetic throughput.

Parallel execution

Starburst Query Engine executes queries using parallel tasks distributed across CPU cores.

Performance depends on:

  • Task scheduling
  • Memory bandwidth
  • Thread scaling.

During the benchmark, Vera sustained higher parallel throughput, which contributed to faster overall execution.

Conclusion

Across both Intel Xeon 6 and AMD EPYC Zen 5 comparisons, the Starburst Enterprise Intelligence Platform running on NVIDIA Vera demonstrated:

  • ~3× faster query performance
  • Up to 1.85× better CPU efficiency
  • Higher sustained system throughput

These improvements enable organizations to run larger analytical workloads while reducing infrastructure requirements.

Future work

Additional benchmarking areas that could further illuminate performance differences include:

  • Larger dataset scales (SF3000+)
  • Concurrent multi-query workloads
  • Mixed analytics and AI pipelines.

Final remarks

This study demonstrates that the Starburst Enterprise Intelligence Platform running on NVIDIA Vera delivers significantly higher query throughput for large-scale analytics workloads.

These improvements can help organizations:

  • Accelerate analytics
  • Support AI workloads
  • Improve infrastructure efficiency

Appendix A — Query list

q4
q9
q14a
q14b
q23a
q23b
q47
q57
q67
q78

Appendix B — Benchmark configuration

  • Starburst Enterprise Platform
  • Apache Iceberg tables
  • Starburst query execution
  • Local NVMe / SSD storage
  • 708 GB system memory

Appendix C — Per-query results (SF1000)

The following table shows the average query runtime for each query in the SF1000 benchmark suite.

These values represent the mean of multiple runs to minimize noise from system variability.

Query AMD Zen5 Intel Xeon 6 NVIDIA Vera
q4 241.4 141.7 48.5
q9 102.1 98.6 46.7
q14a 214.6 153.6 49.7
q14b 210.1 116.6 38.1
q23a 141.7 142.7 60.9
q23b 142.4 134.8 66.1
q47 75.6 64.7 27.3
q57 44.5 38.0 19.9
q67 49.8 75.3 40.2
q78 89.5 74.4 28.7

Average query runtime:

Platform Average Runtime
AMD Zen5 ~131 s
Xeon 6 ~104 s
Vera ~39–43 s
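The workload averages reported above can be reproduced directly from the per-query table:

```python
# Per-query SF1000 runtimes in seconds, copied from the table in this appendix.
runtimes = {
    "AMD Zen5":     [241.4, 102.1, 214.6, 210.1, 141.7, 142.4, 75.6, 44.5, 49.8, 89.5],
    "Intel Xeon 6": [141.7, 98.6, 153.6, 116.6, 142.7, 134.8, 64.7, 38.0, 75.3, 74.4],
    "NVIDIA Vera":  [48.5, 46.7, 49.7, 38.1, 60.9, 66.1, 27.3, 19.9, 40.2, 28.7],
}

averages = {p: sum(v) / len(v) for p, v in runtimes.items()}
for platform, avg in averages.items():
    print(f"{platform}: {avg:.1f} s")

# Relative speedup of Vera over the slowest platform.
print(f"{averages['AMD Zen5'] / averages['NVIDIA Vera']:.1f}x")  # 3.1x
```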

Appendix D — SF10 benchmark results

The SF10 dataset represents a small dataset scenario where CPU efficiency dominates performance.

Unlike SF1000, SF10 places minimal pressure on memory bandwidth or storage throughput.

Platform Avg Query Runtime (SF10)
AMD Zen5 1.82 s
Xeon 6 2.93 s
Vera 1.43 s

Because the dataset is small, the performance differences primarily reflect:

  • CPU pipeline efficiency
  • Vector execution performance
  • Cache behavior

Appendix E — CPU utilization and parallelism

To better understand the observed performance differences, we examined CPU utilization during query execution.

The Starburst query engine schedules work across multiple worker threads. The amount of active CPU parallelism can be estimated using:

Effective Parallelism = Total CPU Time / Wall Clock Time

This metric reflects how many CPU cores are effectively utilized during execution.
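Applying the formula to the measured CPU and wall-clock times reproduces the effective-parallelism figures:

```python
# Effective Parallelism = Total CPU Time / Wall Clock Time,
# using the measured values for each platform.
measurements = {
    "AMD Zen5":    {"cpu_s": 18963, "wall_s": 1312},
    "NVIDIA Vera": {"cpu_s": 22628, "wall_s": 426},
}

for platform, m in measurements.items():
    parallelism = m["cpu_s"] / m["wall_s"]
    print(f"{platform}: ~{parallelism:.1f} cores effectively busy")
```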

AMD Zen5 vs Vera parallelism

Platform CPU Time Wall Time Effective Parallelism
AMD Zen5 ~18,963 s ~1312 s ~14.5 cores
NVIDIA Vera ~22,628 s ~426 s ~54 cores

These results indicate that the Vera system sustained significantly higher parallel execution during the workload.

Higher sustained parallelism allows:

  • Faster completion of query stages
  • Improved resource utilization
  • Higher overall throughput.

Xeon6 vs Vera CPU efficiency

Platform Relative CPU Time
Intel Xeon 6 1.0×
NVIDIA Vera 0.54×

This indicates that Vera required approximately 1.85× fewer CPU cycles to complete the workload.

Appendix F — Workload throughput comparison

The full SF1000 workload runtime across platforms is summarized below.

Platform Total Runtime
AMD Zen5 ~1312 s
Intel Xeon 6 ~1040 s
NVIDIA Vera ~426 s

Relative throughput:

Platform Relative Performance
AMD Zen5 1.0×
Intel Xeon 6 ~1.1×
NVIDIA Vera ~3.0×

 

Appendix G — Reproducibility notes

Benchmark runs were conducted under the following conditions:

  • Identical dataset
  • Identical query set
  • Identical Starburst configuration
  • Multiple query repetitions
  • Average runtime used for reporting.

This methodology helps reduce noise from:

  • Caching effects
  • Background system processes
  • Query scheduling variability.

Appendix H — Additional observations

Several patterns emerged from the benchmark results.

1. Scan-heavy queries dominate runtime

Queries performing large fact table scans show the largest performance differences across architectures.

2. Join pipelines amplify CPU differences

Join-heavy queries stress memory locality and hash table construction.

3. Large datasets reveal architectural differences

At larger scale factors, such as SF1000, the following differences become more pronounced:

  • Memory bandwidth
  • Task scheduling
  • Parallelism

 
