
You’ve heard the promise. Apache Iceberg delivers warehouse-like performance on data lake infrastructure. It handles tables at petabyte scale. It’s faster than Hive.
But there’s a catch: potential doesn’t equal actual. Getting peak performance from Iceberg isn’t automatic. It requires intentional data architecture choices, a robust maintenance strategy, and consistent upkeep. Without them, even the most promising data platform can fall short of expectations.
In this article, we’ll look at why Iceberg has become central to lakehouse architecture and how to ensure you continue to get the best performance from your implementation. We’ll also explore automation options that reduce the heavy lifting for data engineering teams.
The benefits of Apache Iceberg
Let’s start at the beginning. Apache Iceberg is a data lakehouse open table format. It’s not a storage system itself but rather a layer on top of your existing data lake storage.
Iceberg adds several critical features that address key shortcomings of its predecessors, most notably Apache Hive.
Specifically:
- A metadata-driven design that yields better overall performance than Hive. Iceberg’s rich table metadata enables efficient query planning and aggressive file pruning.
- ACID transactions to guarantee consistent operations across partitions, enabling reliable real-time data ingestion and updates.
- Easy schema evolution, allowing you to change your table structures without expensive migrations as your business evolves over time.
- Time travel and rollbacks, enabled by detailed snapshotting whenever an Iceberg table changes. This functionality allows you to query historical states in a dataset easily.
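To make the last point concrete, here is what time travel looks like when queried through an engine like Trino. The table name and snapshot ID are illustrative:
-- Query the table as it existed at a point in time (illustrative names)
SELECT * FROM catalog.schema.orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
-- Or pin the query to a specific snapshot ID taken from the table history
SELECT * FROM catalog.schema.orders FOR VERSION AS OF 8954597067493422955;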
Iceberg + Trino
Now for the interesting part: connecting Iceberg and Trino. Iceberg on its own is impressive, but to really shine, it needs to live in the right data ecosystem. This is particularly true when it’s combined with a distributed SQL query engine like Trino. Iceberg is designed to support tables with petabytes of data; Trino is designed to process them at scale.
Together, they create the Icehouse data architecture (which we’ve written on extensively in the past).
How Iceberg optimizes query performance
How does Iceberg achieve its half of the optimization equation? The answer lies in its rich metadata, which speeds up data discovery and query planning, combined with ACID guarantees that let tables adapt and change safely. The result is a format that is not only versatile but also performant.
How much faster is Iceberg? According to the Apache Iceberg project, when managed well, Iceberg can yield 10x performance improvements over Hive. We’ve written on Iceberg performance before, and we continue to see Iceberg outperform other data architectures, especially when coupled with Trino.
Key performance optimizations for Apache Iceberg
Again, though, this isn’t a given. Iceberg has the inherent advantages to deliver high performance, but realizing that potential requires deliberate choices in your data architecture.
What steps do you need to take to ensure that Iceberg performs optimally?
There are many ways to fine-tune Iceberg. One starting point is to ensure that all tables are structured properly. That includes partitioning, which Iceberg handles very differently than technologies like Hive. From there, the right combination of partitioning strategy, query patterns, and maintenance routines makes all the difference.
Files can be written with organization strategies such as sorting and bucketing to improve performance when access patterns are well understood.
Why you need to maintain Iceberg properly
Additionally, you need to ensure that performance does not degrade over time due to a phenomenon known as the small files problem. This problem occurs due to the frequency and volume of data modifications and how those changes are implemented with versioning strategies, including Merge-on-Read (MOR). While data updates via MOR help maintain lineage for the table snapshots that make Iceberg so valuable, they do represent a maintenance challenge if left unchecked.
Without regular maintenance, you can encounter performance decreases as tables change. Understanding these trade-offs between flexibility and optimization is essential for data engineering teams.
We love Iceberg. And we’ve spent a lot of time leveraging its power for our customers. We’ve consistently identified the following optimization strategies as yielding the biggest performance bang for your buck:
- Partitioning
- Sorted tables
- File management and compaction
- Snapshot management
- Monitoring and measuring performance
Let’s look at each one in turn.
Partitioning
Good partitioning acts as a coarse-grained filter over your data. When you partition tables correctly, Iceberg prunes entire partitions during query planning, dramatically reducing the files that need to be read.
Not all tables are big enough to partition. Generally, a table’s overall file size footprint should be at least 1TB before utilizing partitions.
Choose efficient partition columns
Pick low-cardinality columns with fairly uniform distributions. Time-series data works well: partition by month or day using timestamp columns, depending on your data volume. Columns like region, product_category, or customer_segment can also be effective. Avoid high-cardinality columns, such as user IDs or transaction IDs, as they create too many partitions.
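As a rough sketch of what this looks like in Trino DDL, assuming a hypothetical events table (all names are illustrative):
-- Partition by month of the event timestamp plus a low-cardinality region column
CREATE TABLE catalog.schema.events (
    event_id BIGINT,
    region VARCHAR,
    event_ts TIMESTAMP(6) WITH TIME ZONE,
    payload VARCHAR
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(event_ts)', 'region']
);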
Avoid over-partitioning
At a minimum, the total file size per partition should be in the 1GB – 10GB range, and ideally 10GB – 100GB. Individual files should target 100MB or more to maintain optimal read performance. Together, this keeps each partition to roughly 100 – 1,000 files.
Too many small partitions hurt more than they help. If your daily data is only 2GB, partition by month instead.
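To check whether your partitions land in these ranges, you can query the $partitions metadata table, shown here against the hypothetical events table from above (exact columns can vary by engine version):
SELECT "partition", file_count, total_size FROM "catalog"."schema"."events$partitions" ORDER BY total_size;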
Leverage hidden partitioning
Unlike Hive, Iceberg doesn’t force users to think about partition columns in their queries. Hidden partitioning automatically routes queries to the right partitions based on filter predicates. This prevents user errors and simplifies query writing.
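For example, with the hypothetical events table above partitioned by month(event_ts), a plain filter on the timestamp column is enough for Iceberg to prune down to the relevant months:
-- No partition column is referenced; the predicate on event_ts alone drives partition pruning
SELECT count(*) FROM catalog.schema.events
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00 UTC' AND event_ts < TIMESTAMP '2024-07-01 00:00:00 UTC';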
Consider partition evolution
As your use cases change, Iceberg allows you to modify your partitioning strategy without rewriting data, adapting to new query patterns as your workloads evolve.
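In Trino, for instance, this is a metadata-only change to the partitioning table property; existing files keep the old layout while new writes use the new one. A sketch, assuming a recent connector version and the hypothetical table from earlier:
ALTER TABLE catalog.schema.events SET PROPERTIES partitioning = ARRAY['day(event_ts)', 'region'];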
Sorted tables and bucketing
Data written in sorted order by one or more columns enables aggressive file skipping during queries. Sorted tables are one of Iceberg’s most powerful performance features, and they are especially effective when queries look for a range of values on the sorted column.
An alternative strategy is bucketing. Bucketing helps eliminate data skew by hashing a high-cardinality column (such as customer ID) into a fixed number of “bucket” files. Each bucket contains many distinct values, but all rows for a given value always land in the same bucket.
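Here is a minimal sketch of both techniques in Trino DDL, assuming a hypothetical orders table and a connector version that supports the sorted_by property (names and bucket count are illustrative):
CREATE TABLE catalog.schema.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    total DECIMAL(12, 2)
)
WITH (
    partitioning = ARRAY['bucket(customer_id, 16)'],  -- spread a high-cardinality key across 16 buckets
    sorted_by = ARRAY['order_date']                   -- write files sorted by order_date for file skipping
);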
Consider a real-world example from TPC-DS 1TB benchmark testing:
- Unsorted table: 1.4 billion rows read, 8.09GB data scanned
- Sorted table: 387 million rows read, 2.4GB data scanned
- Result: A roughly 70% reduction in rows read and data scanned
Use sorted tables and bucketing on:
- Columns frequently used in WHERE clauses
- Join keys
- Filter predicates in common queries
- Date/timestamp columns for range queries
Maintain sorted tables through automation. When you stream or micro-batch data into Iceberg, use the OPTIMIZE command to preserve sort order. This combines small files while maintaining the sorted structure. It’s essential for keeping sorted tables effective as data grows.
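In Trino, that means running the same optimize statement shown in the compaction section below; when the table declares a sort order (for example via the sorted_by property above), the rewritten files should come out sorted (table name is illustrative):
ALTER TABLE catalog.schema.orders EXECUTE optimize;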
The benefits compound. Less data scanned means lower storage costs from fewer Amazon S3 GET operations. It also means lower compute requirements. Your queries run faster and cheaper.
File management and compaction
Iceberg offers a number of high-level features, including time travel and cross-version analysis. Tracking these changes requires keeping old versions of files intact through manifest lists and metadata files.
The downside is that the file layout degrades over time. Iceberg generates new metadata and data files whenever a dataset is modified. This leads to a proliferation of small files that must be opened and merged on the fly to assemble the complete picture of a dataset, and that slows queries down over time.
This makes file management one of the central tenets of good Iceberg performance. There are several ways to address the problem.
Small files hurt performance. Each file has a fixed open/read cost. Thousands of small files mean thousands of I/O operations. This becomes a bottleneck even with good metadata.

Picture a query engine that has to read 26 small files instead of 2 larger files representing exactly the same data, just spread out differently.
How to detect the problem. Use the $files metadata table to identify files under 100MB:
SELECT COUNT(*) as small_files FROM "catalog"."schema"."table$files" WHERE file_size_in_bytes < 100000000;
Using compaction. You can solve this problem with compaction. In Trino, use the optimize command to consolidate small files into larger ones:
ALTER TABLE catalog.schema.table EXECUTE optimize(file_size_threshold => '100MB');
You should target 100MB file sizes as a best practice, though you can adjust the target file size based on your specific workloads. The optimize command is smart—it won’t compact files that don’t need it.
Best practices for compaction:
- Prioritize frequently-queried tables over recently-added data
- Use filters to compact specific partitions (see the example after this list)
- Record filters used during compaction to avoid overlap in future runs
- Consider running compaction on a separate cluster to avoid impacting query workloads
- With time-based partitioning, you can stop compacting a partition once its data is no longer being modified
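As a sketch of the filter-based approach mentioned above, Trino accepts a WHERE clause on optimize; note that the predicate generally needs to target the table’s partition columns (names and values are illustrative):
ALTER TABLE catalog.schema.events EXECUTE optimize(file_size_threshold => '100MB') WHERE region = 'emea';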
Snapshot management
Over time, Iceberg accumulates old snapshots and associated metadata. While these enable time travel functionality, they also consume storage and can impact query planning performance. Managing snapshots is a critical but often overlooked aspect of Iceberg maintenance.
Expire snapshots regularly. Use the expire_snapshots procedure to remove old snapshots beyond your retention requirements:
ALTER TABLE catalog.schema.table EXECUTE expire_snapshots(retention_threshold => '7d');
Snapshot expiration removes metadata for old versions while preserving the current state and recent history. It cleans up the stored metadata and, more importantly, deletes the underlying data files that only the expired snapshots referenced. Without this, your data lakehouse would keep accumulating data files.
Remove orphaned files. After snapshot expiration, you may have data files no longer referenced by any snapshot. Use remove_orphan_files to reclaim this space:
ALTER TABLE catalog.schema.table EXECUTE remove_orphan_files(retention_threshold => '3d');
Rewrite manifests periodically. As tables evolve, manifest files can become fragmented. Use rewrite_manifests to consolidate them:
ALTER TABLE catalog.schema.table EXECUTE rewrite_manifests;
This reduces the number of manifest files the query engine must read, improving query planning efficiency.
Monitoring and measuring performance
You can’t improve what you don’t measure. Track these metrics consistently to understand your table health and query performance:
File statistics. Monitor file counts and sizes over time. Watch for the proliferation of small files using the $files metadata table.
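For example, a quick health check on overall file count and average file size (hypothetical table name):
SELECT COUNT(*) AS file_count, AVG(file_size_in_bytes) AS avg_file_size_bytes FROM "catalog"."schema"."events$files";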
Query performance metrics. Track query execution times, data scanned, and rows processed. Look for degradation patterns—operations steadily slowing down over time—that indicate maintenance is needed.
Table health indicators. Use Iceberg’s metadata tables to understand table state:
- $files – View data file sizes and counts
- $manifests – Track manifest file health and consolidation needs
- $snapshots – Monitor snapshot accumulation and plan expiration schedules
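For instance, a quick look at how snapshots are accumulating (column names per the Trino Iceberg connector; table name is illustrative):
SELECT snapshot_id, committed_at, operation FROM "catalog"."schema"."events$snapshots" ORDER BY committed_at DESC LIMIT 20;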
Regular monitoring helps you catch performance problems before they impact users and allows you to optimize proactively rather than reactively.
Not all data needs to be in Iceberg
This leaves one question unaddressed.
Should your data even be in Iceberg?
Years of focusing on data centralization as a panacea for data silos have led many of us to view it as the default position. In practice, mass centralization is a massive endeavor that often runs over time and budget.
Ironically, mass centralization often results in more data silos. That’s because business teams, frustrated by the slow pace of migration, go off and invent their own solutions.
Centralization may not even be practical given today’s legal realities. For example, data sovereignty laws mean some data can never leave a given country’s borders.
We advocate a more incremental approach. Use query tools like Trino to access distributed data by default. From there, identify high-value datasets to move over. These are datasets that:
- Are critical to the business.
- Are frequently accessed by multiple users across different use cases.
- Will most benefit from Iceberg’s advanced features.
In other words, leave data federated if it doesn’t need the performance boost. Don’t let migration become a blocker to new projects.
The Icehouse: Delivering performance + flexibility
The Starburst Icehouse architecture delivers precisely this kind of choice about when and where to centralize data.
Driven by the combination of Iceberg and Trino, the Icehouse serves as the central nervous system of your data platform. It connects your data applications and key data processing tools to your data, no matter where it lives.
In short, the Icehouse is all about choice. Your data, your way, when and how you need it.
You can build an Icehouse architecture yourself. Or you can use Starburst, which implements the Icehouse along with several performance-boosting features for Iceberg:
Automated Iceberg data maintenance. Starburst data maintenance scheduling enables you to configure regular maintenance tasks—including compaction, snapshot expiration, and orphan file removal—so your tables remain query-ready at all times. This automation eliminates the manual overhead that data engineering teams typically face.
Warp Speed. A proprietary caching and indexing layer that boosts query performance by up to 7x and reduces cloud computing costs by as much as 40% through intelligent read performance optimization.
Managed Iceberg Pipelines. When you are ready to migrate data to Iceberg, Starburst makes it easy with zero-ops managed pipelines. Managed Iceberg Pipelines produce production-ready workflows that result in 10x faster queries and a 66% reduction in data costs. Ingest data from Kafka Streaming or from AWS S3 using our new file ingest, which makes it easier than ever to keep your Iceberg tables hydrated with fresh data.
To learn more about how Starburst can give you both choice and performance for Apache Iceberg, contact us today.



