This post is part of the Iceberg blog series. Read the entire series:
- Introduction to Apache Iceberg in Trino
- Iceberg Partitioning and Performance Optimizations in Trino
- Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
- Apache Iceberg Schema Evolution in Trino
- Apache Iceberg Time Travel & Rollbacks in Trino
- Automated maintenance for Apache Iceberg tables in Starburst Galaxy
- Improving performance with Iceberg sorted tables
- How to migrate your Hive tables to Apache Iceberg
Introduction
Apache Hive has long been a popular choice for storing and processing large amounts of data in Hadoop environments. However, as data engineering requirements have evolved, new technologies have emerged that offer improved performance, flexibility, and workload capabilities.
In this blog post, we’ll walk through the rationale behind migrating from Hive to Iceberg, the steps needed to complete a successful migration, and highlight some of the key benefits of doing so.
What is Apache Hive?
Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets stored in the Hadoop Distributed File System (HDFS), one component of the larger Hadoop ecosystem. Hive was originally created as a way to query HDFS with SQL instead of writing MapReduce jobs.
Today, Hive can be used to query any type of cloud object storage and is useful for running infrequent, batch SQL queries. Hive can handle large datasets with billions of rows, but query latencies can be high due to the overhead of listing files during query planning.
What is Apache Iceberg?
Apache Iceberg is a high-performance open table format that provides fast and scalable analytics on large datasets. Iceberg’s design allows for faster queries and more efficient data processing by providing features like efficient updates and deletes, snapshot isolation, and partitioning.
Why is Iceberg better than Hive?
Both Iceberg and Hive are efficient technologies for querying large datasets, but the choice depends on the requirements of your use case.
Let’s look at how the capabilities of the two compare:
| Capability | Hive Tables | Iceberg Tables |
|---|---|---|
| Open source | Yes | Yes |
| Read object storage using SQL | Yes | Yes |
| File format | Parquet, ORC, Avro | Parquet, ORC, Avro |
| Performant at scale | Yes | Yes |
| ACID transactions | No | Yes |
| Table versioning | No | Yes |
| Time travel | No | Yes |
| Schema evolution | No | Yes |
| Partition evolution | No | Yes |
| Partition pruning | Yes | Yes |
As you can see, Iceberg unlocks traditional data warehousing capabilities on cost-effective cloud storage.
Migrating a Hive table to an Iceberg table
Now that we have a solid understanding of Iceberg’s benefits and how it differs from Hive, let’s walk through how we would upgrade a Hive table to the Iceberg format.
I wanted to take the simplest route possible and decided to leverage Starburst Galaxy – a fully managed data lake analytics platform built on top of the open source Trino engine.
Step 1: Create an S3 catalog in Starburst Galaxy
First, we will create an S3 catalog in Galaxy, which enables you to connect to and query data residing in S3. In Starburst Galaxy, data sources are defined as catalogs, which are then used as part of an object’s fully-qualified name in queries.
Follow these steps to begin creating a catalog for S3:
- In the navigation menu, select Catalogs.
- Click Create catalog.
- On the Select a data source pane, click the S3 icon.
You will need to define a catalog name, description, authentication method, metastore configuration, and a default table format. Iceberg is selected by default, but for the purposes of this tutorial, select Hive.
For more detailed instructions on connecting to S3, visit the Galaxy documentation.
Step 2: Connect your S3 catalog to a cluster
Once your catalog has been configured, you will need to connect it to a cluster so that you can access the necessary compute resources.
Use the following steps to add your S3 catalog to an existing cluster or create a new cluster in the same cloud region as your S3 instance:
- In the Add to cluster section, expand the menu in the Select cluster field.
- Click Create a new cluster to create a new cluster in the same region, and add it to the cluster selection menu.
- Click Add to cluster to view your new catalog’s configuration.
Step 3: Run schema discovery
Now that the catalog has been assigned to a cluster, we can go ahead and run schema discovery to “crawl” the S3 URI and generate the necessary DDL statements to create a table.
For this example, we will use a Hive table called “ds_salaries_slim”.
This Hive table is currently partitioned by the “usage_month” column, with the following structure within S3:
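Schema discovery proposes DDL similar to the statement below. Treat this as an illustrative sketch only: the column list is hypothetical, the bucket path simply mirrors the one used later in this post, and the generated statement should always be reviewed before you run it.
CREATE TABLE demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim (
    job_title varchar,
    salary_in_usd bigint,
    work_year varchar,
    usage_month varchar -- partition column, listed last
)
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['usage_month'],
    external_location = 's3://yusuf-cattaneo-bootcamp-nov2022/kaggle/blog_posts/ds_salaries_slim/'
);
Because the catalog’s default table format was set to Hive in Step 1, no explicit table type needs to be specified here.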
Step 4: Choose your migration method
There are two potential methods we can use to migrate this data into an Iceberg table:
- Upgrade via the shadow migrate process
- Upgrade via the in-place method
We will explore the pros and cons of both approaches.
Upgrading via the shadow migrate process
This approach creates a second Iceberg table off of the original Hive table. By leveraging a “shadow” process, we are afforded the following benefits:
- The schema and partition modifications are communicated in advance, enabling better management of data files.
- By conducting regular audits, validation, and counts of the data during the migration process, the likelihood of copying over corrupt data decreases. This is because you can clean any imperfect data present in the old table and ensure it does not corrupt the new table.
To convert this table to Iceberg format, we will leverage a CTAS statement with a few different parameters added in the “WITH” clause:
- format: File format of our dataset
- type: The table format for the new table ('iceberg' in our case)
- location: An explicit URI specifying where the table data should be stored
- partitioning: The column(s) used to partition the table
- sorted_by: Outlining the order of the table (This is especially useful when the sorted columns show high cardinality and are used as a filter for selective reads.)
CREATE TABLE demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim_ice
WITH (
    format = 'ORC',
    type = 'iceberg',
    location = 's3://yusuf-cattaneo-bootcamp-nov2022/kaggle/blog_posts/ds_salaries_slim_ice/',
    partitioning = ARRAY['work_year'],
    sorted_by = ARRAY['salary_in_usd']
)
AS SELECT * FROM demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim;
*Note: the ds_salaries_slim table above does not contain any timestamp(3) columns; if it did, we would need to cast those columns to timestamp(6), since the Iceberg connector stores timestamps with microsecond precision.
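To perform the audits and counts mentioned earlier, you can compare the original Hive table against the new Iceberg table before pointing any workloads at it. A minimal sketch of such a check, assuming both tables live in the catalog and schema used above:
-- row counts should match before downstream jobs are switched to the new table
SELECT
    (SELECT count(*) FROM demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim) AS hive_row_count,
    (SELECT count(*) FROM demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim_ice) AS iceberg_row_count;
More thorough validation might also compare per-partition counts or aggregates of key columns.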
Upgrading via the in-place method
An in-place migration avoids rewriting the data. Instead, new Iceberg metadata is written that references the already existing data files in your S3 bucket. By leveraging an “in-place” process, we are afforded the following benefits:
- Data lineage is preserved, as the metadata is preserved.
- This process can be less time-consuming because the data does not need to be restated, and it avoids data-duplication issues.
- If any type of error arises during the metadata writing process, you only need to re-write the metadata, not the underlying data.
To leverage an in-place migration, we simply run an ALTER TABLE statement that sets the table's type property, as seen below:
ALTER TABLE demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim SET PROPERTIES type = 'ICEBERG';
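Once the statement completes, you can confirm that the conversion took effect by inspecting the table definition, which should now report Iceberg table properties:
SHOW CREATE TABLE demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim;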
Step 5: Explore the benefits of Iceberg
Now that we have created an Iceberg table, we can dig into the magic of the Iceberg architecture and how it leverages metadata. In the below figure, you will see how a single table is broken down into the following metadata components in the Iceberg world:
- Manifest File: Describes a set of data files in the table, including each file’s location, size, format, and other statistics. Each time the table is updated with new data, a new manifest file is created to describe the new data files. This allows Iceberg to track changes to the table over time, making time travel feasible.
- Manifest List: A file that lists all of the manifest files making up a given snapshot of the table.
- Metadata File: Contains table-level information such as the schema, partitioning, and snapshot history, and is stored as a JSON file.
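Trino also exposes this metadata through hidden metadata tables that can be queried with ordinary SQL. As a quick illustration (the columns selected below are a small subset of what is available):
-- list the snapshots that have been created for the table
SELECT committed_at, snapshot_id, operation
FROM demo_aws_s3.yusuf_cattaneo_demo."ds_salaries_slim_ice$snapshots";

-- list the manifest files backing the table
SELECT *
FROM demo_aws_s3.yusuf_cattaneo_demo."ds_salaries_slim_ice$manifests";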
Let’s see how our Iceberg table looks in S3 after running the CTAS statement above:
By having our data laid out in this fashion we are afforded the following benefits:
- Increased query performance by avoiding full directory scanning
- ACID transaction support
- Time travel functionality to view data as it existed at an earlier point in time
- Table rollback to revert the table to a previous state (see the example after this list)
- Seamless schema evolution
- On-the-fly partition evolution
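As a quick illustration of the time travel and rollback points above, you can query an Iceberg table as of a past snapshot or timestamp and roll it back with the Iceberg connector's rollback_to_snapshot procedure. The snapshot ID below is a placeholder; real IDs can be found in the table's $snapshots metadata table shown earlier:
-- query the table as of a specific snapshot (placeholder ID)
SELECT * FROM demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim_ice FOR VERSION AS OF 1234567890123456789;

-- query the table as of a point in time
SELECT * FROM demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim_ice FOR TIMESTAMP AS OF TIMESTAMP '2023-03-01 00:00:00 UTC';

-- roll the table back to a previous snapshot (placeholder ID)
CALL demo_aws_s3.system.rollback_to_snapshot('yusuf_cattaneo_demo', 'ds_salaries_slim_ice', 1234567890123456789);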
Wrapping up
Overall, migrating from Hive to Iceberg offers several advantages, including faster query performance on large datasets thanks to optimizations such as partition pruning and predicate pushdown.
Furthermore, Iceberg supports ACID transactions to ensure data consistency and integrity, and schema evolution to modify table schemas without breaking existing queries. This is all in addition to my favorite feature, built-in versioning, which tracks changes to data over time and improves auditing and reproducibility.
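For example, evolving the schema of the migrated table is a metadata-only operation; adding a column (the employee_level column here is purely hypothetical) does not rewrite any data files, and existing queries continue to work:
ALTER TABLE demo_aws_s3.yusuf_cattaneo_demo.ds_salaries_slim_ice ADD COLUMN employee_level varchar;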
In summary, Iceberg can help leave situations like the below in the past (via its snapshot and time-travel capabilities):
Do you have suggestions or feedback? Feel free to contact me at yusuf.cattaneo@starburstdata.com.