
The Icehouse Manifesto: Building an Open Lakehouse

Yusuf Cattaneo
Solutions Architect
Starburst
Emma Lullo
Senior Product Marketing Manager, Starburst Galaxy
Starburst
This post is part of our Apache Iceberg blog series.
Apache Hive has long been a popular choice for storing and processing large amounts of data in Hadoop environments. However, as data engineering requirements have evolved, new technologies have emerged that offer improved performance, flexibility, and workload capabilities.
In this blog post, we’ll walk through the differences between Hive and Iceberg, the use cases for both formats, and how to start planning your migration strategy.
Apache Hive is an open-source data warehouse project built on top of Apache Hadoop that provides data query and analysis capabilities via a SQL-like interface. Hive stores data on the Hadoop Distributed File System (HDFS) and, through Hadoop-compatible file system connectors, on cloud object stores such as AWS S3, ADLS, and GCS. With Hive, non-programmers familiar with SQL can read, write, and manage petabytes of big data.
Apache Hive has four main components. For the purposes of comparison to Apache Iceberg, we will focus strictly on the Hive data model and the Hive Metastore.
Data in Hive is organized into tables similar to a relational database and data about each table is stored in a directory in HDFS. The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions that operates independently of Apache Hive. The HMS has become a building block for data lakes providing critical data abstraction and data discovery capabilities.
The majority of the challenges associated with Apache Hive stem from the fact that data in tables is tracked at the folder level rather than the file level. On cloud object storage this means expensive directory listings at query planning time, no atomic changes that span multiple files, and a physical directory layout that is tightly coupled to the table's partitioning.
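To make the folder-level model concrete, here is a minimal sketch in plain Python (all paths and partition values are hypothetical, and this is a simplification of what Hive actually does): a Hive-style table maps each partition to a directory, and a query engine prunes partitions by matching directory names before listing the files inside.

```python
# Toy model of a Hive-style table: each partition is a directory, and the
# table's data is "whatever files live under those directories".
# Paths and partition values below are hypothetical.
hive_partitions = {
    "ds=2024-01-01": ["s3://bucket/sales/ds=2024-01-01/part-0000.parquet"],
    "ds=2024-01-02": ["s3://bucket/sales/ds=2024-01-02/part-0000.parquet",
                      "s3://bucket/sales/ds=2024-01-02/part-0001.parquet"],
    "ds=2024-01-03": ["s3://bucket/sales/ds=2024-01-03/part-0000.parquet"],
}

def prune(partitions, predicate):
    """Hive-style partition pruning: keep whole directories whose partition
    value satisfies the predicate, then list the files they contain."""
    files = []
    for spec, contents in partitions.items():
        value = spec.split("=", 1)[1]
        if predicate(value):
            files.extend(contents)
    return files

# Roughly what a query like SELECT ... WHERE ds >= '2024-01-02' triggers:
selected = prune(hive_partitions, lambda ds: ds >= "2024-01-02")
print(len(selected))  # 3 files, from the two matching partitions
```

Note that the engine only ever knows about directories up front; the actual file listing happens at query time, which is exactly the expensive step on object storage.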
Apache Iceberg is an open table format that was designed with modern cloud infrastructure in mind. It was created at Netflix to overcome the limitations of Apache Hive and includes key features like efficient updates and deletes, snapshot isolation, and partitioning.
Check out Ryan Blue’s talk on creating the Apache Iceberg table format at Netflix.
As Tom Nats mentions in his “Introduction to Apache Iceberg in Trino” blog, Apache Iceberg is made up of three layers: the catalog, which points to the current metadata file for each table; the metadata layer, consisting of metadata files, manifest lists, and manifest files; and the data layer, the data files themselves.
The key difference is that Iceberg defines the data in the table at the file level, rather than having the table point to a directory or a set of directories.
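As a rough illustration, the file-level tracking can be pictured as a chain from a table's current metadata file down to an explicit list of data files. This is a toy Python sketch with made-up names, not the real Iceberg metadata format:

```python
# Toy sketch of Iceberg's three layers (illustrative names and paths only):
# catalog -> metadata file -> manifest list -> manifests -> data files.
catalog = {"sales": "metadata/v3.metadata.json"}  # catalog points at current metadata

metadata_files = {
    "metadata/v3.metadata.json": {"current-snapshot": "snap-101"},
}

manifest_lists = {  # one manifest list per snapshot
    "snap-101": ["manifests/m1.avro", "manifests/m2.avro"],
}

manifests = {  # each manifest names concrete data files
    "manifests/m1.avro": ["data/f1.parquet", "data/f2.parquet"],
    "manifests/m2.avro": ["data/f3.parquet"],
}

def data_files(table):
    """Resolve a table name to the exact files in its current snapshot."""
    meta = metadata_files[catalog[table]]
    snapshot = meta["current-snapshot"]
    files = []
    for manifest in manifest_lists[snapshot]:
        files.extend(manifests[manifest])
    return files

print(data_files("sales"))  # explicit file list -- no directory listing needed
```

Because every snapshot resolves to an exact file list, the engine never has to list directories, and swapping the catalog pointer to a new metadata file is what makes commits atomic.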
Apache Iceberg brings new capabilities to the data lake – including warehouse-like DML capabilities and data consistency. Specifically, Apache Iceberg offers the following advantages:
Migrating your Hive tables to Iceberg might seem like a quick fix for turning your data lake into a lakehouse, but it can create more problems than it solves when not done correctly. This webinar will compare and contrast the architectures of Apache Hive and Apache Iceberg, as well as walk through examples of when migrations would or would not be helpful.
Now that we have looked at the architecture for Hive and Iceberg, we understand that both are efficient technologies for querying large datasets. The choice depends on the requirements of your use case.
Let’s look at how the capabilities of the two compare:
| Capability | Hive Tables | Iceberg Tables |
|---|---|---|
| Open source | Yes | Yes |
| Read object storage using SQL | Yes | Yes |
| File formats | Parquet, ORC, Avro | Parquet, ORC, Avro |
| Performant at scale | Yes | Yes |
| ACID transactions | No | Yes |
| Table versioning | No | Yes |
| Time travel | No | Yes |
| Schema evolution | No | Yes |
| Partition evolution | No | Yes |
| Partition pruning | Yes | Yes |
As you can see, Iceberg unlocks traditional data warehousing capabilities on cost-effective cloud storage.
I’m often asked by customers and prospects, “When should I consider migrating to Apache Iceberg?” While I do prefer working with Apache Iceberg, the decision to migrate is not one to take lightly. It often requires several months of experimentation to structure your Iceberg tables correctly and optimize your queries.
That’s why I take a use-case-based approach when weighing the value of a migration. If you are looking to run more use cases directly from your object storage, including any of the following, I would highly recommend exploring the migration, and Starburst can help:
- Organizations building latency-sensitive data applications on top of cloud object storage
- Teams that need collaborative data workflows, since Iceberg provides a shared, consistent representation of the data at all times
- Organizations performing historical analysis or root cause analysis using time travel
- Organizations that require data in object storage to be easily modifiable for GDPR compliance
With its snapshot and time-travel capabilities, Iceberg can help leave accidental overwrites and unrecoverable deletes in the past.
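For a sense of how snapshots make that possible, here is a toy Python sketch of a table that keeps immutable snapshots so an old state can always be read back. This is illustrative only, not Iceberg's actual API; real Iceberg records snapshots in metadata files on object storage, not in memory:

```python
# Toy append-only table with snapshots (illustrative only).
class SnapshotTable:
    def __init__(self):
        self._snapshots = [[]]  # snapshot 0 is the empty table

    def commit(self, rows):
        """Each commit creates a new immutable snapshot; old ones stay readable."""
        current = self._snapshots[-1]
        self._snapshots.append(current + list(rows))

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or 'time travel' to an older one."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])

t = SnapshotTable()
t.commit([("a", 1)])
t.commit([("b", 2)])          # latest snapshot now has two rows
print(t.read())               # [('a', 1), ('b', 2)]
print(t.read(snapshot_id=1))  # time travel: [('a', 1)]
```

Because commits never rewrite earlier snapshots, a bad write can be undone simply by reading, or rolling back to, an older snapshot.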
If you’re exploring a potential migration, check out our migration tutorial or get free migration guidance from our experts.
Migrate Hive tables to Apache Iceberg with Starburst Galaxy