Last Updated: 2024-03-22

Background

Apache Hive has long been a popular choice for storing and processing large amounts of data in Hadoop environments. However, as data engineering requirements have evolved, new technologies have emerged that offer improved performance, flexibility, and broader workload support. One such technology is Apache Iceberg, an open table format.

In this tutorial, we'll walk through the rationale behind migrating from Hive to Iceberg, the steps needed to complete a successful migration, and some of the key benefits of doing so. Two migration methods will be shown, and the advantages of each method will be discussed.

Prerequisites

Learning outcomes

Upon successful completion of this tutorial, you will be able to:

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Background

Before diving into the steps in this tutorial, let's review the basic definitions of Hive and Iceberg and compare their features and capabilities.

Apache Hive

Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets stored in the Hadoop Distributed File System (HDFS), one component of the larger Hadoop ecosystem. Hive was originally created as a way to query HDFS using SQL instead of handwritten MapReduce jobs.

Today, Hive can be used to query any type of cloud object storage and is well suited to infrequent, batch SQL queries. Hive can handle large datasets with billions of rows, but query latency can be high because the engine must list files in storage as part of query planning.

Apache Iceberg table format

Apache Iceberg is a high-performance open table format that provides fast and scalable analytics on large datasets. Iceberg's design allows for faster queries and more efficient data processing by providing features like efficient updates and deletes, snapshot isolation, and partitioning.
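To make these features concrete, here is a hedged sketch in Starburst Galaxy/Trino SQL syntax. The table name `orders_iceberg` and the snapshot ID are hypothetical placeholders, not objects from this tutorial:

```sql
-- Row-level delete: Iceberg records the change as a new snapshot
-- rather than rewriting the entire table.
DELETE FROM orders_iceberg WHERE order_status = 'CANCELLED';

-- Time travel: query the table as of an earlier snapshot.
-- 1234567890 is a placeholder; list real snapshot IDs with
-- SELECT snapshot_id FROM "orders_iceberg$snapshots";
SELECT count(*) FROM orders_iceberg FOR VERSION AS OF 1234567890;
```

Hive tables offer no comparable snapshot isolation or row-level change tracking, which is a key motivation for migrating.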

Hive vs. Iceberg comparison

Both Iceberg and Hive are efficient technologies for querying large datasets, but the choice depends on the requirements of your use case.

Let's compare their features:

Background

There are two potential methods we can use to migrate data from a Hive table to an Iceberg table:

- Shadow migration, which copies the Hive table's data into a new Iceberg table
- In-place migration, which converts the existing table to Iceberg without rewriting its data files

Let's explore the definition and benefits of each method.

Shadow migration process

This approach uses a CTAS statement to create a second, Iceberg-format copy of the original Hive table. Because the original table is left untouched, a "shadow" process affords the following benefits:

The in-place method

The in-place data migration method avoids rewriting the data. Instead, new Iceberg metadata is written over the existing data files in your S3 bucket, converting the table without copying it. By leveraging an "in-place" process, we are afforded the following benefits:

Choosing a method

Ultimately, the method you choose will depend on many factors. The good news is that you can use Starburst Galaxy to migrate from Hive to Iceberg regardless of which method you choose.

Background

It's time to build and populate the Hive tables you'll be using in this tutorial. You'll create two tables – one for each migration method. The tables will be built in the catalog and schema that you created as a prerequisite to this tutorial.

Note: In practice, you would use existing Hive tables to migrate to Iceberg.

Step 1: Sign into Starburst Galaxy

Step 2: Verify that your role is set to accountadmin

Creating and modifying tables requires a role with advanced permissions. In this tutorial, we will utilize the accountadmin role.

Step 3: Select Query editor

In Starburst Galaxy, SQL is run from the Query editor.

Step 4: Create first table

The first table you create will use JSON files to store its data. You will use a CREATE TABLE AS SELECT (CTAS) statement to create the table, pulling its data from the tpch sample catalog.

For example:

- cluster: aws-us-east-1-free
- catalog: tmp_cat
- schema: tmp_erin_rosas_02152

-- Use CTAS to build a Hive table stored as JSON files,
-- populated from the tpch sample catalog
CREATE TABLE cust_json
WITH (type='hive', format='json')
AS SELECT * FROM tpch.sf1.customer;

SHOW CREATE TABLE cust_json;

Step 5: Create second table

This table will be almost identical to the one you just created. The difference is that it will use ORC files to store its data.

-- Same CTAS pattern, but store the data as ORC files
CREATE TABLE cust_orc
WITH (type='hive', format='orc')
AS SELECT * FROM tpch.sf1.customer;

SHOW CREATE TABLE cust_orc;

Background

Let's explore the shadow migration process first. We will use the table called cust_json for testing.

To convert this table to Iceberg format, we will leverage a CTAS statement with a single parameter added in the WITH clause to identify that the new table will be created using the Iceberg table format.

Step 1: Execute CTAS statement

It's time to run the SQL to migrate your table from Hive to Iceberg via the shadow migration method.

-- Shadow migration: CTAS copies the Hive table's data into a
-- new table created with the Iceberg table format
CREATE TABLE cust_iceberg_shadow
WITH (type='iceberg')
AS SELECT * FROM cust_json;

SHOW CREATE TABLE cust_iceberg_shadow;
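A key advantage of the shadow method is that both tables now exist side by side, so you can validate the new Iceberg table before retiring the Hive original. One simple sanity check, sketched here as a row-count comparison, is:

```sql
-- Both tables remain queryable, so their contents can be
-- compared before the Hive table is retired.
SELECT
  (SELECT count(*) FROM cust_json)           AS hive_rows,
  (SELECT count(*) FROM cust_iceberg_shadow) AS iceberg_rows;
```

If the counts match (and any further spot checks pass), the Hive table can safely be dropped or archived.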

Background

Now it's time to try the in-place method for Hive to Iceberg migration. This method requires an ALTER TABLE statement with the SET PROPERTIES parameter identifying a change to the Iceberg table format, as detailed in our documentation.

Step 1: Run ALTER TABLE statement

You'll be working in the Starburst Galaxy query editor again for this step.

-- In-place migration: convert the existing Hive table to
-- Iceberg without rewriting its data files
ALTER TABLE cust_orc
SET PROPERTIES type = 'ICEBERG';

SHOW CREATE TABLE cust_orc;
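To confirm the conversion succeeded, you can inspect the table's Iceberg metadata. One option in Trino/Starburst syntax is the `$snapshots` metadata table, which is only available for Iceberg-format tables:

```sql
-- The $snapshots metadata table exists only once the table uses
-- the Iceberg format; the migration appears as the first snapshot.
SELECT snapshot_id, committed_at, operation
FROM "cust_orc$snapshots";
```

If this query returns a snapshot rather than an error, the table is now managed as Iceberg.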

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've completed this tutorial, you should have a better understanding of how and when to use each of the Hive to Iceberg migration methods.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
