
You’ve heard the promise. Apache Iceberg delivers warehouse-like performance on data lake infrastructure. It handles tables at petabyte scale. It’s faster than Hive.
But there’s a catch: potential doesn’t equal actual. Getting peak performance from Iceberg isn’t automatic. It requires intentional data architecture choices, a robust maintenance strategy, and consistent upkeep. Without them, even the most promising data platform can fall short of expectations.
In this article, we’ll look at why Iceberg has become central to lakehouse architecture and how to ensure you continue to get the best performance from your implementation. We’ll also explore automation options that reduce the heavy lifting for data engineering teams.
The benefits of Apache Iceberg
Let’s start at the beginning. Apache Iceberg is a data lakehouse open table format. It’s not a storage system itself but rather a layer on top of your existing data lake storage.
Iceberg adds several critical features that address key shortcomings of its predecessors, most notably Apache Hive.
Specifically:
- A metadata-driven design that yields better overall performance than Hive. Iceberg’s rich table metadata enables efficient query planning and aggressive file pruning.
- ACID transactions to guarantee consistent operations across partitions, enabling reliable real-time data ingestion and updates.
- Easy schema evolution, allowing you to change your table structures without expensive migrations as your business evolves over time.
- Time travel and rollbacks, enabled by detailed snapshotting whenever an Iceberg table changes. This functionality allows you to query historical states in a dataset easily.
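To make the last point concrete, here is what time travel looks like when queried through an engine like Trino. The table name and snapshot ID are illustrative:
-- Query the table as it existed at a point in time (illustrative names)
SELECT * FROM catalog.schema.orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
-- Or pin the query to a specific snapshot ID taken from the table history
SELECT * FROM catalog.schema.orders FOR VERSION AS OF 8954597067493422955;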
Iceberg + Trino
Now for the interesting part: connecting Iceberg and Trino. Iceberg on its own is impressive, but to really shine, it needs to live in the right data ecosystem. This is particularly true when it’s combined with a distributed SQL query engine like Trino. Iceberg is designed to support tables with petabytes of data; Trino is designed to process them at scale.
Together, they create the Icehouse data architecture (which we’ve written on extensively in the past).
How Iceberg optimizes query performance
How does Iceberg achieve its half of the optimization equation? The answer lies in its rich metadata, which speeds up data discovery and query planning, combined with ACID guarantees that let tables adapt and change safely. The result is a format that is not only versatile but also performant.
How much faster is Iceberg? According to the Apache Iceberg project, when managed well, Iceberg can yield 10x performance improvements over Hive. We’ve written on Iceberg performance before, and we continue to see Iceberg outperform other data architectures, especially when coupled with Trino.
Key performance optimizations for Apache Iceberg
Again, though, this isn’t a given. Iceberg has the inherent advantages to deliver high performance, but realizing that potential requires deliberate choices in your data architecture.
What steps do you need to take to ensure that Iceberg performs optimally?
There are many ways to fine-tune Iceberg. One starting point is to ensure that all tables are structured properly. That includes partitioning, which Iceberg handles very differently than technologies like Hive. From there, the right combination of partitioning strategy, query patterns, and maintenance routines makes all the difference.
Files can be written with organization strategies such as sorting and bucketing to improve performance when access patterns are well understood.
Why you need to maintain Iceberg properly
Additionally, you need to ensure that performance does not degrade over time due to a phenomenon known as the small files problem. This problem occurs due to the frequency and volume of data modifications and how those changes are implemented with versioning strategies, including Merge-on-Read (MOR). While data updates via MOR help maintain lineage for the table snapshots that make Iceberg so valuable, they do represent a maintenance challenge if left unchecked.
Without regular maintenance, you can encounter performance decreases as tables change. Understanding these trade-offs between flexibility and optimization is essential for data engineering teams.
We love Iceberg. And we’ve spent a lot of time leveraging its power for our customers. We’ve consistently identified the following optimization strategies as yielding the biggest performance bang for your buck:
- Partitioning
- Sorted tables
- File management and compaction
- Snapshot management
- Monitoring and measuring performance
Let’s look at each one in turn.
Partitioning
Good partitioning acts as a coarse-grained filter over your data. When you partition tables correctly, Iceberg prunes entire partitions during query planning, dramatically reducing the files that need to be read.
Not all tables are big enough to partition. Generally, a table’s overall file size footprint should be at least 1TB before utilizing partitions.
Choose efficient partition columns
Pick low-cardinality columns with fairly uniform distributions. Time-series data works well: partition by month or day using timestamp columns, depending on your data volume. Columns like region, product_category, or customer_segment can also be effective. Avoid high-cardinality columns, such as user IDs or transaction IDs, as they create too many partitions.
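As a rough sketch of what this looks like in Trino DDL, assuming a hypothetical events table (all names are illustrative):
-- Partition by month of the event timestamp plus a low-cardinality region column
CREATE TABLE catalog.schema.events (
    event_id BIGINT,
    region VARCHAR,
    event_ts TIMESTAMP(6) WITH TIME ZONE,
    payload VARCHAR
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(event_ts)', 'region']
);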
Avoid over-partitioning
At a minimum, the total file size per partition should be in the 1GB – 10GB range, and ideally 10GB – 100GB. Individual files should target 100MB or more to maintain optimal read performance. Together, this keeps each partition to roughly 100 – 1,000 files.
Too many small partitions hurt more than they help. If your daily data is only 2GB, partition by month instead.
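To check whether your partitions land in these ranges, you can query the $partitions metadata table, shown here against the hypothetical events table from above (exact columns can vary by engine version):
SELECT "partition", file_count, total_size FROM "catalog"."schema"."events$partitions" ORDER BY total_size;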
Leverage hidden partitioning
Unlike Hive, Iceberg doesn’t force users to think about partition columns in their queries. Hidden partitioning automatically routes queries to the right partitions based on filter predicates. This prevents user errors and simplifies query writing.
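For example, with the hypothetical events table above partitioned by month(event_ts), a plain filter on the timestamp column is enough for Iceberg to prune down to the relevant months:
-- No partition column is referenced; the predicate on event_ts alone drives partition pruning
SELECT count(*) FROM catalog.schema.events
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00 UTC' AND event_ts < TIMESTAMP '2024-07-01 00:00:00 UTC';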
Consider partition evolution
As your use cases change, Iceberg allows you to modify your partitioning strategy without rewriting data, adapting to new query patterns as your workloads evolve.
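In Trino, for instance, this is a metadata-only change to the partitioning table property; existing files keep the old layout while new writes use the new one. A sketch, assuming a recent connector version and the hypothetical table from earlier:
ALTER TABLE catalog.schema.events SET PROPERTIES partitioning = ARRAY['day(event_ts)', 'region'];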
Sorted tables and bucketing
Data written in sorted order by one or more columns enables aggressive file skipping during queries. Sorted tables are one of Iceberg’s most powerful performance features, and they are especially effective when queries look for a range of values on the sorted column.
An alternative strategy is bucketing. Bucketing helps eliminate data skew by hashing a high-cardinality column (such as customer ID) into a fixed number of “bucket” files. Each bucket contains many distinct values, but all rows for a given value always land in the same bucket.
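Here is a minimal sketch of both techniques in Trino DDL, assuming a hypothetical orders table and a connector version that supports the sorted_by property (names and bucket count are illustrative):
CREATE TABLE catalog.schema.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    total DECIMAL(12, 2)
)
WITH (
    partitioning = ARRAY['bucket(customer_id, 16)'],  -- spread a high-cardinality key across 16 buckets
    sorted_by = ARRAY['order_date']                   -- write files sorted by order_date for file skipping
);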
Consider a real-world example from TPC-DS 1TB benchmark testing:
- Unsorted table: 1.4 billion rows read, 8.09GB data scanned
- Sorted table: 387 million rows read, 2.4GB data scanned
- Result: A roughly 70% reduction in rows read and data scanned
Use sorted tables and bucketing on:
- Columns frequently used in WHERE clauses
- Join keys
- Filter predicates in common queries
- Date/timestamp columns for range queries
Maintain sorted tables through automation. When you stream or micro-batch data into Iceberg, use the OPTIMIZE command to preserve sort order. This combines small files while maintaining the sorted structure. It’s essential for keeping sorted tables effective as data grows.
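In Trino, that means running the same optimize statement shown in the compaction section below; when the table declares a sort order (for example via the sorted_by property above), the rewritten files should come out sorted (table name is illustrative):
ALTER TABLE catalog.schema.orders EXECUTE optimize;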
The benefits compound. Less data scanned means lower storage costs from fewer Amazon S3 GET operations. It also means lower compute requirements. Your queries run faster and cheaper.
File management and compaction
Iceberg offers a number of high-level features, including time travel and cross-version analysis. Tracking these changes requires keeping old versions of files intact through manifest lists and metadata files.
The downside is that the file layout degrades over time. Iceberg generates new metadata and data files whenever a dataset is modified. This leads to a proliferation of small files that must be opened and merged on the fly to assemble the complete picture of a dataset, and that slows queries down over time.
This makes file management one of the central tenets of good Iceberg performance. There are several ways to address the problem.
Small files hurt performance. Each file has a fixed open/read cost. Thousands of small files mean thousands of I/O operations. This becomes a bottleneck even with good metadata.

Picture a query engine that has to read 26 small files instead of 2 larger files representing exactly the same data, just spread out differently.
How to detect the problem. Use the $files metadata table to identify files under 100MB:
SELECT COUNT(*) as small_files FROM "catalog"."schema"."table$files" WHERE file_size_in_bytes < 100000000;
Using compaction. You can solve this problem with compaction. In Trino, use the optimize command to consolidate small files into larger ones:
ALTER TABLE catalog.schema.table EXECUTE optimize(file_size_threshold => '100MB');
You should target 100MB file sizes as a best practice, though you can adjust the target file size based on your specific workloads. The optimize command is smart—it won’t compact files that don’t need it.
Best practices for compaction:
- Prioritize frequently-queried tables over recently-added data
- Use filters to compact specific partitions (see the example after this list)
- Record filters used during compaction to avoid overlap in future runs
- Consider running compaction on a separate cluster to avoid impacting query workloads
- With time-based partitioning, you can stop compacting a partition once its data is no longer being modified
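As a sketch of the filter-based approach mentioned above, Trino accepts a WHERE clause on optimize; note that the predicate generally needs to target the table’s partition columns (names and values are illustrative):
ALTER TABLE catalog.schema.events EXECUTE optimize(file_size_threshold => '100MB') WHERE region = 'emea';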
Snapshot management
Over time, Iceberg accumulates old snapshots and associated metadata. While these enable time travel functionality, they also consume storage and can impact query planning performance. Managing snapshots is a critical but often overlooked aspect of Iceberg maintenance.
Expire snapshots regularly. Use the expire_snapshots procedure to remove old snapshots beyond your retention requirements:
ALTER TABLE catalog.schema.table EXECUTE expire_snapshots(retention_threshold => '7d');
Snapshot expiration removes metadata for old versions while preserving the current state and recent history. It cleans up the stored metadata and, more importantly, deletes the underlying data files that only the expired snapshots referenced. Without this, your data lakehouse would keep accumulating data files.
Remove orphaned files. After snapshot expiration, you may have data files no longer referenced by any snapshot. Use remove_orphan_files to reclaim this space:
ALTER TABLE catalog.schema.table EXECUTE remove_orphan_files(retention_threshold => '3d');
Rewrite manifests periodically. As tables evolve, manifest files can become fragmented. Use rewrite_manifests to consolidate them:
ALTER TABLE catalog.schema.table EXECUTE rewrite_manifests;
This reduces the number of manifest files the query engine must read, improving query planning efficiency.
Monitoring and measuring performance
You can’t improve what you don’t measure. Track these metrics consistently to understand your table health and query performance:
File statistics. Monitor file counts and sizes over time. Watch for the proliferation of small files using the $files metadata table.
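For example, a quick health check on overall file count and average file size (hypothetical table name):
SELECT COUNT(*) AS file_count, AVG(file_size_in_bytes) AS avg_file_size_bytes FROM "catalog"."schema"."events$files";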
Query performance metrics. Track query execution times, data scanned, and rows processed. Look for degradation patterns—operations steadily slowing down over time—that indicate maintenance is needed.
Table health indicators. Use Iceberg’s metadata tables to understand table state:
- $files – View data file sizes and counts
- $manifests – Track manifest file health and consolidation needs
- $snapshots – Monitor snapshot accumulation and plan expiration schedules
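For instance, a quick look at how snapshots are accumulating (column names per the Trino Iceberg connector; table name is illustrative):
SELECT snapshot_id, committed_at, operation FROM "catalog"."schema"."events$snapshots" ORDER BY committed_at DESC LIMIT 20;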
Regular monitoring helps you catch performance problems before they impact users and allows you to optimize proactively rather than reactively.
Not all data needs to be in Iceberg
This leaves one question unaddressed.
Should your data even be in Iceberg?
Years of focusing on data centralization as a panacea for data silos have led many of us to view it as the default position. In practice, mass centralization is a massive endeavor that often runs over time and budget.
Ironically, mass centralization often results in more data silos. That’s because business teams, frustrated by the slow pace of migration, go off and invent their own solutions.
Centralization may not even be practical given today’s legal realities. For example, data sovereignty laws mean some data can never leave a given country’s borders.
We advocate a more incremental approach. Use query tools like Trino to access distributed data by default. From there, identify high-value datasets to move over. These are datasets that:
- Are critical to the business.
- Are frequently accessed by multiple users across different use cases.
- Will most benefit from Iceberg’s advanced features.
In other words, leave data federated if it doesn’t need the performance boost. Don’t let migration become a blocker to new projects.
The Icehouse: Delivering performance + flexibility
The Starburst Icehouse architecture delivers precisely this kind of choice about when and where to centralize data.
Driven by the combination of Iceberg and Trino, the Icehouse serves as the central nervous system of your data platform. It connects your data applications and key data processing tools to your data, no matter where it lives.
In short, the Icehouse is all about choice. Your data, your way, when and how you need it.
You can build an Icehouse architecture yourself. Or you can use Starburst, which implements the Icehouse along with several performance-boosting features for Iceberg:
Automated Iceberg data maintenance. Starburst data maintenance scheduling enables you to configure regular maintenance tasks—including compaction, snapshot expiration, and orphan file removal—so your tables remain query-ready at all times. This automation eliminates the manual overhead that data engineering teams typically face.
Warp Speed. A proprietary caching and indexing layer that boosts query performance by up to 7x and reduces cloud computing costs by as much as 40% through intelligent read performance optimization.
Managed Iceberg Pipelines. When you are ready to migrate data to Iceberg, Starburst makes it easy with zero-ops managed pipelines. Managed Iceberg Pipelines produce production-ready workflows that result in 10x faster queries and a 66% reduction in data costs. Ingest data from Kafka Streaming or from AWS S3 using our new file ingest, which makes it easier than ever to keep your Iceberg tables hydrated with fresh data.
To learn more about how Starburst can give you both choice and performance for Apache Iceberg, contact us today.



