Lie #5 — Vendor Benchmarks Measure Real-World Performance

Datanova 2023: The data lies (and truths)

February 7, 2023

Justin Borgman
Co-Founder & CEO
Starburst

Professional athletes aren’t remembered for how well they practiced. They’re remembered for how well they perform in the big game.

Technology should be held to the same standard. For too long, database vendors have promised faster insights while touting their own query benchmark numbers as proof. However, when it comes time to perform in production, the customer struggles to find the promised value.

A product that works well in one environment fails in another because it isn't fast in real-world scenarios. It's like a race car driver who clocks a stunner alone on the track, but can't get anywhere close to that time once opponents are introduced or there's a new obstacle on the track.

The most commonly used benchmarks in data analytics are TPC-H and TPC-DS. The TPC (Transaction Processing Performance Council) is an independent organization created to hold systems providers accountable for publishing accurate performance results. The first test was completed in 1989. Unfortunately, while well intentioned, TPC benchmarks do not reflect the reality of customers' production environments in 2023.

TPC ignores the time and money it takes to prepare data for analysis

TPC benchmarks assume the data is already perfectly loaded and optimized in a central data warehouse. They ignore the complex pipelines that need to be maintained, and they presume that moving data to one location for analysis is the way to go. Moving all data for analytics into a central data warehouse was the predominant strategy for decades. About 10 years ago, the gravity started shifting towards data lakes (later enhanced by data lakehouse functionality such as updates/deletes), and most recently scalable distributed data fabric / data mesh architectures have become more prevalent.

What happens in reality? If you have to move data to Teradata or Snowflake before you can run a test, and the ETL project takes you 3 months to complete, then your real response time is 3 months plus the time the query itself takes to run.
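The arithmetic above can be written down as a tiny sketch. The function name and the 12-second query time are illustrative assumptions, not measurements from any vendor; only the 3-month ETL figure comes from the example above (approximated here as 90 days).

```python
from datetime import timedelta


def time_to_first_answer(pipeline_build: timedelta, query_runtime: timedelta) -> timedelta:
    """Elapsed time from asking a question to seeing its answer.

    Hypothetical helper: it adds the data-preparation time that a
    benchmark leaves out to the query time that a benchmark reports.
    """
    return pipeline_build + query_runtime


# Illustrative numbers only (assumptions, not benchmark results).
etl_project = timedelta(days=90)           # roughly 3 months of ETL work
benchmark_runtime = timedelta(seconds=12)  # the number a benchmark would show

total = time_to_first_answer(etl_project, benchmark_runtime)
print(total)  # 90 days, 0:00:12
```

Seen this way, the query time that a benchmark reports is a rounding error next to the time spent getting the data into place.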

Cost/Performance is more important than performance alone

Not only did your query take much longer than you thought, you also paid engineers to build pipelines, you were charged cloud egress fees, and you are now paying to store duplicate data in multiple places. Your price/performance? Nowhere near what the TPC benchmarks may indicate.
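To make the gap concrete, here is a rough sketch of a cost-per-query calculation that includes the items listed above. The class, field names, and every dollar figure are hypothetical placeholders; substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass
class WorkloadCost:
    """Costs over one period, in dollars. All fields are illustrative."""
    compute: float            # query engine spend
    engineering: float        # labor to build and maintain pipelines
    egress: float             # cloud data transfer fees
    duplicate_storage: float  # storing extra copies of the same data

    def total(self) -> float:
        return self.compute + self.engineering + self.egress + self.duplicate_storage


def cost_per_query(cost: WorkloadCost, queries_run: int) -> float:
    """Total cost divided by the number of queries actually served."""
    if queries_run <= 0:
        raise ValueError("queries_run must be positive")
    return cost.total() / queries_run


# Made-up figures: the benchmark view counts compute only.
benchmark_view = WorkloadCost(compute=10_000, engineering=0, egress=0, duplicate_storage=0)
full_view = WorkloadCost(compute=10_000, engineering=60_000, egress=5_000, duplicate_storage=5_000)

print(cost_per_query(benchmark_view, 100_000))  # 0.1
print(cost_per_query(full_view, 100_000))       # 0.8
```

With these placeholder inputs the full cost per query is eight times the compute-only figure, even though the query engine itself is unchanged.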

Vendors cheat

Vendor benchmarks reflect pristine lab conditions that do not in any way reflect the real world. True production workloads are concurrent and vary in size. Small queries run at the same time as large ones, consuming system resources and impacting performance.

Vendors often test with small datasets and make extensive use of caching, often without even disclosing it, in order to juice up their performance. This is equivalent to using steroids in professional sports. Simply put, it's cheating. In the real world, customers have terabytes or petabytes of data, and it's impossible to cache it all. To be clear, caching is a wonderful feature in the right circumstances, but you simply can't afford to cache the universe in real production environments. Furthermore, ad hoc queries and canned reports are different types of workloads, and both need to be run in order to fully understand a system's performance.
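One way to approximate the mixed, concurrent load described above is a small harness that runs short and long queries at the same time and records latency per query class. This is a minimal sketch under stated assumptions: `fake_run_query` is a stand-in that only sleeps, and the labels, SQL strings, and concurrency level are hypothetical. Replace the stand-in with a call to your own database client and your own production queries.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple


def fake_run_query(sql: str) -> None:
    """Placeholder (assumption): swap in a call to your own database client."""
    time.sleep(0.001 if "small" in sql else 0.01)


def timed(run_query: Callable[[str], None], label: str, sql: str) -> Tuple[str, float]:
    """Run one query and return its label and wall-clock latency in seconds."""
    start = time.perf_counter()
    run_query(sql)
    return label, time.perf_counter() - start


def run_mixed_workload(
    run_query: Callable[[str], None],
    workload: Sequence[Tuple[str, str]],
    concurrency: int = 8,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Run (label, sql) pairs concurrently and group latencies by label."""
    jobs = list(workload)
    random.Random(seed).shuffle(jobs)  # interleave small and large queries
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda job: timed(run_query, *job), jobs))
    latencies: Dict[str, List[float]] = {}
    for label, seconds in results:
        latencies.setdefault(label, []).append(seconds)
    return latencies


# Hypothetical mix: many ad hoc lookups alongside a few large reports.
workload = [("ad_hoc", "-- small lookup")] * 20 + [("report", "-- large scan")] * 5
latencies = run_mixed_workload(fake_run_query, workload)
for label, values in sorted(latencies.items()):
    print(label, len(values), f"median={statistics.median(values):.4f}s")
```

For a fair comparison, it helps to run such a test against production-sized data and to record whether each run started with a cold or warm cache, since the two can differ substantially.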

Don’t take our word for it. Try it yourself.

We always advise our customers to run their own performance benchmarks to ensure results are vendor-neutral and reflect the reality of their production workloads. If you'd like to read more, one of my favorite Gartner analysts of all time, Merv Adrian, has an excellent report on this topic called Six Reasons to Ignore Vendor DBMS Benchmarks.

At Starburst, we are happy to help. Starburst is data source agnostic and can read from nearly any system and format, so we are well positioned to play a trusted advisor role as you evolve your architecture. Data lakes and open data formats like Iceberg and Hudi are great choices for most of your workloads, providing high performance at low cost. Other workloads may be better suited for a key-value store, RDBMS, or CDW (cloud data warehouse). Regardless of your architecture, think about performance holistically, consider how the costs of your infrastructure scale over time, and most importantly, test it yourself!
