Improving performance with Iceberg sorted tables

April 24, 2023

Tom Nats
Director of Customer Solutions
Starburst

This post is part of the Iceberg blog series.

What are sorted Iceberg tables?

Sorted Iceberg tables can greatly improve query times and can also reduce cloud object storage request costs. The table is sorted by at least one column, and the data is written to your object storage in files that are sorted by that column or columns.

How do they work?

Below are two tables that show a very simple example of how sorted tables work.

Not sorted:

File Name	Column Name	Min	Max
1	custkey	54	432029
2	custkey	2329	145292
3	custkey	101067	410918

sorted_by=ARRAY['custkey']:

File Name	Column Name	Min	Max
1	custkey	5	12321
2	custkey	23229	112019
3	custkey	113219	420001

Now, the following query was issued:

select * from customer_iceberg where custkey = 111029;

For the “not sorted” table, the engine checks the min and max values stored in the Iceberg manifest files. Because the data is not sorted, the range of every file includes the custkey we’re looking for (111029), so all 3 files need to be read, since the value could be in any of them.

For the sorted table, only file 2 needs to be read, because it is the only file whose range contains 111029. This saves 2 files from being read, but now imagine there are thousands or even millions of files. A sorted table can avoid reading a very large number of files, which greatly reduces query times.
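
The pruning decision described above can be sketched in plain SQL. This is only an illustration, not how the engine is implemented: file_stats is a hypothetical inline copy of the min/max values from the two tables, and the query counts the files whose range could contain custkey 111029.

```sql
-- Hypothetical inline copy of the per-file min/max statistics shown above.
WITH file_stats (layout, file_name, min_custkey, max_custkey) AS (
    VALUES
        ('not sorted', 1, 54, 432029),
        ('not sorted', 2, 2329, 145292),
        ('not sorted', 3, 101067, 410918),
        ('sorted', 1, 5, 12321),
        ('sorted', 2, 23229, 112019),
        ('sorted', 3, 113219, 420001)
)
SELECT
    layout,
    -- A file must be read only if the value falls inside its [min, max] range.
    count_if(111029 BETWEEN min_custkey AND max_custkey) AS files_to_read,
    count(*) AS total_files
FROM file_stats
GROUP BY layout
ORDER BY layout;
```

For the values above, this reports 3 of 3 files to read for the “not sorted” layout and 1 of 3 for the sorted layout.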

Creating sorted Iceberg tables

Creating a sorted Iceberg table is easy: add the sorted_by property with the column name(s):

CREATE TABLE
catalog_sales_sorted (
   cs_sold_date_sk bigint
   -- , more columns...
)
WITH
(
   format = 'PARQUET',
   sorted_by = ARRAY['cs_sold_date_sk'],
   type = 'ICEBERG'
)
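
One way to check that the files really cover sorted ranges is to look at Iceberg's per-file statistics. This is a minimal sketch, assuming the Trino Iceberg connector's "$files" metadata table is exposed for the catalog_sales_sorted table created above. The lower_bounds and upper_bounds columns are maps keyed by Iceberg column field ID, so you need to look up which ID corresponds to cs_sold_date_sk.

```sql
-- Per-file statistics for the sorted table
-- (assumes the "$files" metadata table is available in your environment).
SELECT
    file_path,
    record_count,
    lower_bounds,  -- map of column field ID -> minimum value in the file
    upper_bounds   -- map of column field ID -> maximum value in the file
FROM "catalog_sales_sorted$files"
ORDER BY file_path;
```

On a well-sorted table, the ranges for the sort column should overlap very little from file to file.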

Performance and cost benefits

Let’s look at a simple example of how sorting can greatly improve query performance for your end users and, as a bonus, reduce your cloud storage costs.

In this very basic example, I took a TPC-DS table named catalog_sales and created a sorted version of it on cs_sold_date_sk. I did this because many TPC-DS queries use this table and column in their predicates.

Here is the simple query based on the non-sorted Iceberg table:

SELECT
   d.d_date,
   sum(s.cs_ext_sales_price)
FROM
   s3lakehouse.iceberg_ds_sf1000.catalog_sales s,
   s3lakehouse.iceberg_ds_sf1000.date_dim d
WHERE
   s.cs_sold_date_sk = d.d_date_sk
   AND d.d_date BETWEEN (CAST('2002-06-01' AS DATE) - INTERVAL '30' DAY)
                    AND (CAST('2002-06-01' AS DATE) + INTERVAL '30' DAY)
GROUP BY
   d.d_date

Here is the query based on the sorted Iceberg table:

SELECT
   d.d_date,
   sum(s.cs_ext_sales_price)
FROM
   s3lakehouse.iceberg_ds_sf1000.catalog_sales_sorted s,
   s3lakehouse.iceberg_ds_sf1000.date_dim d
WHERE
   s.cs_sold_date_sk = d.d_date_sk
   AND d.d_date BETWEEN (CAST('2002-06-01' AS DATE) - INTERVAL '30' DAY)
                    AND (CAST('2002-06-01' AS DATE) + INTERVAL '30' DAY)
GROUP BY
   d.d_date

Even in this simple example, the query against the sorted table reads almost half as much data. This can result in a large performance improvement and reduced costs for object store GET requests.
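
To compare the two tables yourself, one option is EXPLAIN ANALYZE, which runs the query and reports runtime statistics, including the input rows and data size of each stage. The sketch below reuses the sorted-table query from above; run it once per table and compare the input figures for the catalog_sales scan. The exact output layout depends on your Trino or Starburst version.

```sql
-- Runs the query and prints the plan annotated with runtime statistics.
EXPLAIN ANALYZE
SELECT d.d_date, sum(s.cs_ext_sales_price)
FROM s3lakehouse.iceberg_ds_sf1000.catalog_sales_sorted s,
     s3lakehouse.iceberg_ds_sf1000.date_dim d
WHERE s.cs_sold_date_sk = d.d_date_sk
  AND d.d_date BETWEEN (CAST('2002-06-01' AS DATE) - INTERVAL '30' DAY)
                   AND (CAST('2002-06-01' AS DATE) + INTERVAL '30' DAY)
GROUP BY d.d_date;
```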

More complex queries also see a huge performance improvement. For my test, I took a 1 TB TPC-DS table named catalog_sales and created a sorted version based on the cs_sold_date_sk column:

create table iceberg_tpcds_sf1000_parquet_sorted.catalog_sales
with (format='parquet',sorted_by=ARRAY['cs_sold_date_sk'])
as select * from iceberg_tpcds_sf1000_parquet.catalog_sales;

The query I used is a standard TPC-DS one:

use iceberg_tpcds_sf1000_parquet_sorted;
with v1 as(
 select i_category, i_brand, cc_name, d_year, d_moy,
       sum(cs_sales_price) sum_sales,
       avg(sum(cs_sales_price)) over
         (partition by i_category, i_brand,
                    cc_name, d_year)
         avg_monthly_sales,
       rank() over
         (partition by i_category, i_brand,
                    cc_name
          order by d_year, d_moy) rn
 from item, catalog_sales, date_dim, call_center
 where cs_item_sk = i_item_sk and
      cs_sold_date_sk = d_date_sk and
      cc_call_center_sk= cs_call_center_sk and
      (
        d_year = 1999 or
        ( d_year = 1999-1 and d_moy =12) or
        ( d_year = 1999+1 and d_moy =1)
      )
 group by i_category, i_brand,
         cc_name , d_year, d_moy),
v2 as(
 select v1.i_category ,v1.d_year, v1.d_moy ,v1.avg_monthly_sales
       ,v1.sum_sales, v1_lag.sum_sales psum, v1_lead.sum_sales nsum
 from v1, v1 v1_lag, v1 v1_lead
 where v1.i_category = v1_lag.i_category and
      v1.i_category = v1_lead.i_category and
      v1.i_brand = v1_lag.i_brand and
      v1.i_brand = v1_lead.i_brand and
      v1.cc_name = v1_lag.cc_name and
      v1.cc_name = v1_lead.cc_name and
      v1.rn = v1_lag.rn + 1 and
      v1.rn = v1_lead.rn - 1)
select  *
from v2
where  d_year = 1999 and
       avg_monthly_sales > 0 and
       case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1
order by sum_sales - avg_monthly_sales, 3
limit 100;

For the unsorted version, the query read 1.4 billion rows and 8.09 GB.

For the sorted version, it read 387 million rows and 2.4 GB, which is roughly 72% fewer rows and 70% fewer bytes.

Sorted tables, if used correctly, are one of the biggest performance features I’ve seen in the big data space in a long time.

Materialized views

Materialized views (MVs) also support sorted columns. The view is stored as an Iceberg table on object storage and is sorted by the column(s) in the MV DDL:

CREATE MATERIALIZED VIEW s3lakehouse.demo_tpch.iceberg_vw_sorted
with (storage_schema='mv_storage',sorted_by = ARRAY['custkey'])
AS select * from
"tpch"."sf1000"."customer";
REFRESH MATERIALIZED VIEW s3lakehouse.demo_tpch.iceberg_vw_sorted;

Here are the results of a very basic query:

select * from s3lakehouse.demo_tpch.iceberg_vw where custkey = 77746;

In this very simple example, the sorted MV read 100K rows versus 370K for the non-sorted MV. Now imagine a very large table with many partitions: the performance gain would be much larger, and the amount of data read from your cloud object store would be greatly reduced (saving $$$ as well).

Optimizing sorted Iceberg tables

In Iceberg, the optimize command looks for small files and combines them into larger ones, which improves performance for queries of all types.

Luckily, the optimize command will sort the data based on the DDL of the table:

ALTER TABLE catalog_sales_sorted EXECUTE optimize

This command optimizes the catalog_sales_sorted table by combining smaller files into larger ones that are sorted by the cs_sold_date_sk column, as noted above. This is very handy when you are streaming or micro-batching data into an Iceberg table, need to optimize it at given intervals, and still want to benefit from the sorting.
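
If the table receives many small writes, optimize can be limited to the small files. This is a sketch, assuming the file_size_threshold parameter of the Trino Iceberg connector's optimize procedure is available in your version. The 128MB value is only an illustrative choice, not a recommendation from this post.

```sql
-- Rewrite only files smaller than the threshold; larger files are left untouched.
ALTER TABLE catalog_sales_sorted EXECUTE optimize(file_size_threshold => '128MB');
```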

Wrapping it up

There are many benefits to using Iceberg tables with Starburst Galaxy. Sorting tables by the columns that are most often used for filtering can provide a huge boost in performance and also save a bit on the pocketbook when it comes to cloud storage costs.

If you have any questions, please feel free to reach out to us. We have also launched Starburst Academy with many free courses, including Data Foundations, our self-paced, hands-on learning course that covers data lakes extensively.
