
Everyone is talking about Apache Iceberg these days.
Why? One reason is velocity. While most of the world’s data isn’t on Apache Iceberg today, if adoption continues at its current pace, it eventually will be. In that sense, understanding Iceberg is about understanding the future direction of data technology.
Another reason is performance and feature enhancements. As an open-source table format, Iceberg has taken the world by storm thanks to its high performance, schema evolution capabilities, and full CRUD support, which brings SQL-like flexibility to object storage. That makes it suited to almost any workload, including analytics and AI. Additionally, Apache Iceberg supports both large-scale batch processing and real-time data ingestion (when paired with the right data platform).
If you haven’t jumped on the Iceberg bandwagon, there’s never been a better time.
Feeling hesitant? We understand. For many data architects and data engineers, the words “data migration” set off alarm bells.
Luckily, moving to Iceberg isn’t an all-or-nothing proposition. With the right foundations, you can gradually move over your most critical workloads, while leaving the rest of your data where it is.
Interested? Let’s get started.
The roadblocks to adopting Apache Iceberg
Apache Iceberg’s benefits are well-documented. By using metadata differently, Apache Iceberg delivers many features that the more traditional Hive architecture can’t match. In fact, metadata is what allows Iceberg to support numerous critical capabilities, including time travel, rollback, schema evolution, snapshots, and more.
Iceberg tables provide functionality that traditional table formats simply couldn’t match, making them ideal for modern data platforms with growing, changeable needs.
Given the list of features, it’s no wonder so many teams want to migrate to Iceberg. So why isn’t all of the world’s data on Iceberg? The reasons might include some of the following.
All-or-nothing thinking
This is the bane of every data migration effort. This attitude comes from a fraught legacy of indiscriminate centralization that has characterized so many organizational data projects. Instead of focusing on a single critical dataset and running a pilot, teams try to boil the ocean and migrate everything to Iceberg in one shot.
This type of thinking also pops up with datasets that are technically (or even legally) hard to move. You may, for example, have data in Delta Lake format that you’re convinced needs to stay there. Some teams point to these datasets as reasons they can’t move anything to Apache Iceberg.
Maintenance concerns
Iceberg performance does degrade over time if you’re not watchful. Like any other data toolchain, Iceberg tables must be monitored and maintained to remain optimized. Regular compaction of data files, optimizing partitioning strategies, and metadata management are essential. Teams that don’t realize this might see disappointing results from Apache Iceberg over the first few months of usage, discouraging further adoption.
Governance concerns
Apache Iceberg contains features that make it easier to run secure, compliant workloads. But it needs to be properly integrated into your data stack to realize them.
Also, as a relatively new technology, Apache Iceberg use likely requires formal organizational approval. This can slow down efforts at adoption unless a case can be made for it internally.
Training requirements
Supporting a new technology requires more than just making it available. Data engineering teams need education on how to use and support it, and when to employ it for specific use cases. Understanding when to use INSERT INTO versus other operations, how partitioning works, and how to leverage Iceberg features like time travel requires hands-on experience and training to get right.
Organizational constraints
Finally, you may run into organizational inertia. For example, Sales might like their data just the way it is. A department’s data engineers may believe their data is perfectly tuned. Until every key stakeholder understands the value of Iceberg for them, it can be hard to make a case for change.
Getting started with Iceberg: A step-by-step approach to adoption
Organizational adoption takes time. At Starburst, we work with our customers to help engage key stakeholders and overcome any obstacle. These problems are often nuanced and particular to the organization involved, and we pride ourselves on partnering with our customers to unblock them.
Technical problems are, in a way, simpler to solve than organizational ones. When it comes to technical obstacles, we’ve got some good news. Modern distributed query engines, such as Trino, mean there’s no need to pursue a “centralize everything” strategy with Apache Iceberg. The Trino query engine, along with other engines, such as Apache Spark, supports Apache Iceberg natively through connectors and APIs.
Collectively, these technologies allow a more flexible approach: federation with selective centralization. With this method, you migrate critical workloads onto Iceberg one at a time, then evaluate and improve on each migration before moving to the next.
Here’s a practical roadmap for getting started with your first migration and laying the foundation for subsequent efforts.
Evaluate your current architecture
First, assess how you’re storing data. Where is your data currently stored? And more importantly, why is it stored that way? Apache Iceberg is designed to work best with cloud object storage on platforms like AWS, Azure, and GCP (but it also supports on-premises Hadoop HDFS when needed).
You’ll also want to migrate to a modern, distributed query engine to reap the benefits of data federation. Using an engine like Trino will set you up for success, as Apache Iceberg was designed with compatibility for Trino in mind. That gives you out-of-the-box support for all of Iceberg’s key features, including full Data Manipulation Language (DML) support for operations like DELETE, INSERT INTO, ALTER TABLE, and upserts, as well as native support for schema evolution. You’ll also get the best performance for federated queries across multiple data sources.
If you’re working with data pipelines that involve Apache Spark, PySpark, or Python-based workflows, Apache Iceberg provides excellent compatibility with these ecosystems as well.
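To make that concrete, here’s a minimal sketch of the kind of DML Trino supports against an Iceberg table. The catalog, schema, table, and column names below are hypothetical; adjust them to your own setup.

```sql
-- Assumes an Iceberg catalog named "iceberg" and a hypothetical orders table
INSERT INTO iceberg.sales.orders
VALUES (1001, 'ACME', DATE '2024-05-01', 250.00);

-- Update and delete rows in place, which traditional Hive tables can't do transactionally
UPDATE iceberg.sales.orders SET amount = 275.00 WHERE order_id = 1001;
DELETE FROM iceberg.sales.orders WHERE order_date < DATE '2020-01-01';

-- Upsert from a staging table with MERGE
MERGE INTO iceberg.sales.orders t
USING iceberg.staging.order_updates s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (order_id, customer, order_date, amount)
  VALUES (s.order_id, s.customer, s.order_date, s.amount);

-- Evolve the schema without rewriting data files
ALTER TABLE iceberg.sales.orders ADD COLUMN sales_region VARCHAR;
```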
Identify any roadblocks
There aren’t many technical roadblocks to adopting Apache Iceberg that can’t be overcome with strong planning. You may, however, run into some of the organizational stumbling blocks we alluded to earlier.
To help fix this, make sure the runway is clear for Iceberg before you start taxiing the plane. At a minimum, get buy-in from data stakeholders. Ensure they understand what you’re changing and how it’ll benefit them (better performance, compliance, etc.). If you work in a large enterprise, obtain key stakeholder sign-off for Apache Iceberg usage prior to launch.
Identify workloads that are good migration candidates
This is where you’ll want to be selective and discerning. The best workloads to start with are usually your most critical ones. This generally means:
- Workloads with large data volumes (a single Iceberg table can contain tens of petabytes of data)
- Data that’s frequently used and accessed
- Real-world use cases requiring scalability and high performance
You should also migrate data that would benefit from Iceberg’s more advanced features, such as:
- ACID transaction support and optimistic concurrency
- Time travel queries for accessing historical data snapshots
- Schema evolution (i.e., data that changes frequently because business realities change frequently)
- Partition evolution for adapting to changing data patterns
- Efficient handling of delete operations and data updates
Consider implementations involving streaming data from Kafka or real-time data ingestion, as Apache Iceberg excels in these use cases.
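To illustrate why these features matter in practice, here’s roughly what time travel, snapshot inspection, and schema and partition evolution look like in Trino SQL. The table name, snapshot ID, and timestamp below are placeholders.

```sql
-- Query the table as it existed at a point in time (time travel)
SELECT count(*)
FROM iceberg.sales.orders FOR TIMESTAMP AS OF TIMESTAMP '2024-05-01 00:00:00 UTC';

-- Or pin a query to a specific snapshot ID
SELECT *
FROM iceberg.sales.orders FOR VERSION AS OF 8954597067493422955;

-- List available snapshots via the hidden metadata table
SELECT snapshot_id, committed_at, operation
FROM iceberg.sales."orders$snapshots"
ORDER BY committed_at DESC;

-- Schema and partition evolution are metadata-only changes
ALTER TABLE iceberg.sales.orders ADD COLUMN discount_code VARCHAR;
ALTER TABLE iceberg.sales.orders SET PROPERTIES partitioning = ARRAY['month(order_date)'];
```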
Define a maintenance strategy
As noted earlier, Iceberg tables require regular upkeep to maintain top performance. Common operations include:
- File compaction: Merging small data files into larger ones
- Snapshot expiration: Removing old table versions to control storage
- Orphan file cleanup: Deleting data files no longer referenced by any snapshot
- Statistics refresh: Keeping the query optimizer informed
- Metadata optimization: Keeping metadata compact to improve query planning performance
Luckily, most of these tasks can be automated. We recommend using an Iceberg host that supports Managed Iceberg, which takes the responsibility of implementing these chores off your shoulders. With Managed Iceberg, you can automate key operations with a few clicks as well as configure retention policies for your data.
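If you need to run these chores yourself before a managed option is in place, Trino’s Iceberg connector exposes most of them as table procedures. A rough sketch, with a hypothetical table name and retention thresholds you should tune to your own requirements:

```sql
-- Compact small data files into larger ones
ALTER TABLE iceberg.sales.orders EXECUTE optimize(file_size_threshold => '128MB');

-- Expire old snapshots to control storage (time travel stays available within the window)
ALTER TABLE iceberg.sales.orders EXECUTE expire_snapshots(retention_threshold => '7d');

-- Remove files no longer referenced by any snapshot
ALTER TABLE iceberg.sales.orders EXECUTE remove_orphan_files(retention_threshold => '7d');

-- Refresh table statistics for the cost-based optimizer
ANALYZE iceberg.sales.orders;
```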
Start with a migration pilot
Identify your pilot dataset to get started. This will ideally be a table with most of the following characteristics:
- Medium to large table size
- Frequently queried (high impact if successful)
- A frequently changing schema that’s costly to alter repeatedly on Hive
- Experiencing current pain points (slow queries, difficult updates)
- No complex external dependencies
- Not mission-critical (but important enough to matter)
- Stored in common file formats like Parquet, ORC, or Avro
Don’t overdo Iceberg feature adoption immediately out of the gate. Instead, start with basic features, such as using hidden partitioning for timestamp-based data and using sorted tables for frequently filtered columns. When you create table definitions, focus on establishing proper partitioning strategies from the start. In other words, keep it simple at first, and then iterate.
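As a sketch of what that might look like in Trino, here’s a simple pilot table definition with hidden partitioning and a sort order, followed by a straightforward copy from an existing Hive table. The schema, partition column, and source table are hypothetical.

```sql
-- Hidden partitioning on the event timestamp, plus a sort order for a frequently filtered column
CREATE TABLE iceberg.pilot.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP(6) WITH TIME ZONE,
    amount      DECIMAL(12, 2)
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(order_ts)'],
    sorted_by = ARRAY['customer_id']
);

-- One simple way to copy data out of the existing Hive table
INSERT INTO iceberg.pilot.orders
SELECT order_id, customer_id, order_ts, amount
FROM hive.legacy_sales.orders;
```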
Validate the migration and create your migration roadmap
Validate all elements of the migration, including data accuracy (row counts, checksums, sample comparisons), query performance, and all SQL operations, including writes. This is also the time to ensure your compliance and security measures are airtight.
Test that your Iceberg tables work correctly with your existing tools and ecosystem. Verify that metadata is properly tracked, that partitioning is optimized, and that compaction processes are functioning as expected.
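The validation itself doesn’t need to be elaborate. Here are a few examples of the kind of checks you might run against the pilot table, again with hypothetical table and column names:

```sql
-- Row counts should match between the Hive source and the migrated Iceberg table
SELECT
  (SELECT count(*) FROM hive.legacy_sales.orders) AS hive_rows,
  (SELECT count(*) FROM iceberg.pilot.orders)     AS iceberg_rows;

-- A cheap checksum-style comparison on a numeric column
SELECT
  (SELECT sum(amount) FROM hive.legacy_sales.orders) AS hive_total,
  (SELECT sum(amount) FROM iceberg.pilot.orders)     AS iceberg_total;

-- Spot-check keys that exist in the source but not in the new table
SELECT order_id FROM hive.legacy_sales.orders
EXCEPT
SELECT order_id FROM iceberg.pilot.orders
LIMIT 100;
```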
From there, create a roadmap for migrating other tables and workloads. At this point, you can start exploring support for advanced Iceberg features, such as time travel and schema evolution. You can also consider using data products to simplify packaging, deploying, and maintaining governed datasets implemented in Apache Iceberg.
Next, understand how Apache Iceberg can integrate with your existing data warehouse, lakehouse, or data lake architecture. Many organizations find that Apache Iceberg serves as the open table format that bridges traditional data warehouse capabilities with modern data lake flexibility.
How to make getting started with Apache Iceberg easier
Once you move beyond an all-or-nothing mindset, migrating to Apache Iceberg is easy. And once your first dataset migration is complete, subsequent migrations will become even easier as your team learns from each effort.
That said, data migration always takes time and effort. The less heavy lifting you have to do yourself, the faster you’ll realize value.
We built Starburst Galaxy to minimize the pain points of converting to Iceberg. Our fully managed lakehouse platform offers multiple features that make Apache Iceberg easy:
- Use Managed Iceberg Pipelines for a zero ops approach to importing data with no infrastructure complexity.
- Leverage Managed Iceberg to simplify table maintenance, automate compaction, and reduce operational overhead.
- Integrate seamlessly with leading platforms, including Snowflake, Databricks, and over 50 other data sources.
Whether you’re migrating from ETL workflows, implementing large-scale data management, or optimizing existing data engineering processes, Starburst provides the tools and compatibility you need.
Contact us today to learn how to get started with Apache Iceberg using a proven, real-world approach.
Want to know more about Iceberg for data engineers? We wrote this eBook to help you.



