Testing the boundaries of partitioning for data lake analytics

How to avoid wasting 80% of compute resources on scanning and filtering

Last Updated: April 12, 2024

Query engines have quickly become the tool of choice for data-driven companies looking to adopt the agility and flexibility of the modern data lake architecture. As analytics use cases grow in demand across almost every business unit, data teams are desperate for a way to balance cost and performance. The single most common problem in data lake analytics ROI is handling the sheer volume of data that needs to be scanned and processed in the data lake to serve users’ queries.

Best practices tell us that a good partitioning strategy as well as techniques like z-ordering and clustering, aligned with the dimensions that are frequently used in the filters of the queries, can go a long way to help limit the volume of data scanned and dramatically accelerate queries and reduce costs. But since query engines still have to cull through all of the data from the partitions that were not filtered out, partitioning can only take you so far. 

In fact, we’ve discovered that on average 80% of customers using open source Trino or AWS Athena spend 80% of compute resources on ScanFilter operation, i.e. on reading and filtering the underlying data, regardless of their partitioning strategy.

This means filtering by only a few columns and partitions is not enough to reduce data reads and optimize query performance for certain workloads.

Why partitioning doesn’t work

Partitioning is the closest you can get to indexing in a cloud data lake. In order to reduce query time and cost, the best partitioning strategy will make queries filter out as many partitions as possible, enabling the query engine to scan as little data as possible. Common examples include partitioning by  time, and using up to 3 levels of partitioning (i.e. year, month, day) to facilitate queries that filter by time. In multi-tenant use cases, there is another level of partitioning by the tenant as well. 

However, despite the best partitioning efforts, queries still read massive amounts of data. Limitations of partitioning include:

  • Partitioning does not reduce the data reads for queries that filter on other columns different from the partition columns. 
  • Partitioning doesn’t help in cases where query patterns are dynamic and involve predicates on more than a few columns. 

Most people think a common solution to the above limitations should be adding more partitions – partitioning by multiple dimensions or by high cardinality columns. However, this may result in long-tailed data distribution across partitions and extremely small files per partition. Such small files on the data lake can significantly degrade query performance.

This is a huge challenge in cases where SLA requirements are strict (i.e. interactive performance), several workloads coexist, or the partitioning strategy needs to enable fast analytics for all use cases. 

Limitations of partitioning in practice

As data-driven companies use their data as one of their most strategic assets that drive their competitive edge, data consumers constantly look for deeper insights. This means running more sophisticated complex queries that filter the data by tens sometimes hundreds of columns to drive smart decisions.

Consider a scenario where customer-facing apps and internal analytics workloads coexist on the data lake. The data might be partitioned by date and by the tenant in order to serve customer-facing app workload in a timely manner, but it does not support cross-tenant event analysis workloads (to answer business questions such as “get all tenants who had event X in the last week”, for example). Queries in the event analysis workload won’t benefit from the partitioning strategy, which will result in reading all tenants’ partitions, leading to a long query response time.

If a company wants both internal analytics across multiple customers, and external analytics that present data to each customer separately, it can make sense to duplicate the table data and use time-based and event partitioning for internal analytics, and time and tenant partitioning for the customer-facing analytics — but that means that the data lake will no longer exist as a single source of truth, and two replicas of the same data need to be maintained. 

Another option would be to consolidate both use cases and partition by date, tenant, and event. In the case of 50 tenants and 40 event types, we will end up with 730,000 partitions for last year’s data alone (50X40X365 = 730,000). This enormous number of partitions will most likely result in very small files, data skew, long listing time by the query engine, and overall poor query performance.

In short, in today’s modern data analytics era, data challenges aren’t necessarily related to the number of rows — a complexity that can be solved through several approaches, including partitioning. Rather, data challenges arise from the multi-dimensionality of data — hundreds and often thousands of columns — making partitioning simply ineffective in reducing data read.

Eliminate extensive data reads with Smart Indexing

Indexing is the tried and true method of creating separate files that can be used to quickly identify where to retrieve records from the data set. Indexing is multi-dimensional by nature and is extremely effective for queries that require filtering the data in many columns. 

Starburst uses an indexing mechanism that is uniquely optimized for high-performance analytics called nanoblock indexing. Instead of storing one large index for each column that the user selects, Warp Speed (Starburst’s smart indexing and caching technology) dynamically creates millions of nanoblocks – a few dozen kilobyte-sized sub-sections of the indexed column. This removes the need to worry about data partitioning and layout.

Because query patterns keep changing (and we can’t keep updating the data layout), and because partitioning can’t really reduce the amount of data being read, Warp Speed can help serve multiple workloads without compromising on the performance provided with the existing partitioning strategy and data layout. You can keep your existing partitioning layout (unlike liquid clustering), and let Warp Speed indexing do all the heavy lifting.

How to get started with Warp Speed in Galaxy

It’s easy to get started with Warp Speed. Simply create a free Starburst Galaxy account and select “accelerated” as the execution mode in the cluster creation dialog. 

Once you start running queries, Warp Speed will begin creating index and caching elements for you automatically.

Try Starburst Galaxy today

The analytics platform for your data lake

Start Free

Start for Free with Starburst Galaxy

Up to $500 in usage credits included

Please fill in all required fields and ensure you are using a valid email address.

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.