Migrate Hive tables to Apache Iceberg with Starburst Galaxy


1. Tutorial overview

Last Updated: 2024-03-22

Background

Apache Hive has long been a popular choice for storing and processing large amounts of data in Hadoop environments. However, as data engineering requirements have evolved, new technologies have emerged that offer improved performance, flexibility, and workload capabilities. One such technology is Iceberg, an open table format.

In this tutorial, we'll walk through the rationale behind migrating from Hive to Iceberg, the steps needed to complete a successful migration, and some of the key benefits of doing so. Two migration methods will be shown, and the advantages of each method will be discussed.

Prerequisites

This tutorial assumes that you have a Starburst Galaxy account with a running cluster, and that you have already created the catalog and schema used in the prerequisite tutorial.

Learning outcomes

Upon successful completion of this tutorial, you will be able to:

  • Outline the benefits of migrating from Hive to Iceberg.
  • Use the shadow migration process to migrate a table from Hive to Iceberg.
  • Use the in-place method to migrate a table from Hive to Iceberg.

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

2. Hive vs. Iceberg: What is the difference?

Background

Before diving into the steps in this tutorial, let's review the basic definitions of Hive and Iceberg and compare their features and capabilities.

Apache Hive

Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets stored in the Hadoop Distributed File System (HDFS), one component of the larger Hadoop ecosystem. Hive was originally created as a way to query HDFS using SQL instead of MapReduce.

Today, Hive can be used to query any type of cloud object storage and is useful for running infrequent, batch SQL queries. Hive can handle large datasets with billions of rows, but query latencies can be high due to the overhead of listing files during query planning.

Apache Iceberg table format

Apache Iceberg is a high-performance open table format that provides fast and scalable analytics on large datasets. Iceberg's design allows for faster queries and more efficient data processing by providing features like efficient updates and deletes, snapshot isolation, and partitioning.
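For instance, once a table uses the Iceberg format, row-level changes can be expressed directly in SQL. The statements below are a hypothetical sketch; the orders table, its columns, and the filter values are illustrative names, not part of this tutorial:

-- Hypothetical Iceberg table: row-level updates and deletes
UPDATE orders SET status = 'shipped' WHERE order_id = 1001;
DELETE FROM orders WHERE status = 'cancelled';

With a Hive table, changes like these would typically require rewriting whole partitions rather than individual rows.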

Hive vs. Iceberg comparison

Both Iceberg and Hive are efficient technologies for querying large datasets, but the choice depends on the requirements of your use case.

Let's compare their features:

  • Hive tracks data at the directory and partition level, so queries must list files in object storage; Iceberg tracks individual data files in table metadata, avoiding expensive file listings.
  • Iceberg supports schema and partition evolution without rewriting existing data; comparable changes to a Hive table often require restating the data.
  • Iceberg provides snapshot isolation, time travel, and row-level updates and deletes; Hive offers only limited transactional support.
  • Hive remains well suited to infrequent, large batch queries; Iceberg is designed for frequent, lower-latency analytics on the same datasets.

3. Compare migration methods

Background

There are two potential methods we can use to migrate data from a Hive table to an Iceberg table:

  • The shadow migration process
  • The in-place method

Let's explore the definition and benefits of each method.

Shadow migration process

This approach creates a new Iceberg table from the data in the original Hive table. Because the original table remains untouched while its "shadow" is built, we are afforded the following benefits:

  • Schema and partition modifications can be made up front on the new table, enabling better management of data files.
  • Regular audits, validation, and row counts can be conducted during the migration, reducing the likelihood of copying over corrupt data. During the process, you can clean any imperfect data present in the old table to ensure it does not corrupt the new table.

The in-place method

The in-place data migration method avoids rewriting the data. Instead, you create new Iceberg table metadata over the existing data files in your S3 bucket. By leveraging an "in-place" process, we are afforded the following benefits:

  • Data lineage is preserved, as the metadata is preserved.
  • This process can be less time-consuming because the data files do not need to be rewritten.
  • Data duplication issues are avoided.
  • If any type of error arises during the metadata writing process, you only need to re-write the metadata, not the underlying data.

Choosing a method

Ultimately, the method you choose will depend on many factors. The good news is that you can use Starburst Galaxy to migrate from Hive to Iceberg regardless of which method you choose.

4. Build Hive tables

Background

It's time to build and populate the Hive tables you'll be using in this tutorial. You'll create two tables – one for each migration method. The tables will be built in the catalog and schema that you created as a prerequisite to this tutorial.

Note: In practice, you would use existing Hive tables to migrate to Iceberg.

Step 1: Sign into Starburst Galaxy

  • Sign into Starburst Galaxy in the usual way.

Step 2: Verify that your role is set to accountadmin

Creating and modifying tables requires a role with advanced permissions. In this tutorial, we will utilize the accountadmin role.

  • Your current role appears below your email address in the top right-hand corner of the browser window.
  • If your role is not set to accountadmin, click your username and choose accountadmin from the list under Your roles & privileges.

Step 3: Select Query editor

In Starburst Galaxy, SQL is run from the Query editor.

  • Using the left-hand navigation menu, select Query>>Query editor.

Step 4: Create first table

The first table you create will use JSON files to store its data. You will use a CTAS (CREATE TABLE AS SELECT) statement to create the table and populate it with data from the tpch catalog.

  • At the top of the query editor window, set your cluster, catalog, and schema to the ones you configured in the prerequisite tutorial.

For example:

cluster: aws-us-east-1-free

catalog: tmp_cat

schema: tmp_erin_rosas_02152

  • Copy and paste the following SQL into the query editor window.
  • Click the Run (limit 1000) button.
CREATE TABLE cust_json
WITH (type='hive', format='json')
AS SELECT * FROM tpch.sf1.customer;
  • Run the following SQL to confirm the table was created successfully:
SHOW CREATE TABLE cust_json;

Step 5: Create second table

This table will be almost identical to the one you just created. The difference is that it will use ORC files to store its data.

  • Copy and paste the following SQL into the query editor window.
  • Click the Run (limit 1000) button.
CREATE TABLE cust_orc
WITH (type='hive', format='orc')
AS SELECT * FROM tpch.sf1.customer;
  • Run the following SQL to confirm the table was created successfully:
SHOW CREATE TABLE cust_orc;
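
As an optional sanity check, you can confirm that both tables were populated. The TPC-H customer table at scale factor sf1 contains 150,000 rows, so both counts should match that figure:

SELECT
  (SELECT count(*) FROM cust_json) AS json_rows,
  (SELECT count(*) FROM cust_orc) AS orc_rows;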

5. Use the shadow migration process to migrate from Hive to Iceberg

Background

Let's explore the shadow migration process first. We will use the table called cust_json for testing.

To convert this table to Iceberg format, we will leverage a CTAS statement with a single parameter added in the WITH clause to identify that the new table will be created using the Iceberg table format.

Step 1: Execute CTAS statement

It's time to run the SQL to migrate your table from Hive to Iceberg via the shadow migration method.

  • Copy and paste the SQL statement below into the query editor window.
  • Run the SQL by clicking the Run (limit 1000) button.
CREATE TABLE cust_iceberg_shadow
WITH (type='iceberg')
AS SELECT * FROM cust_json;
  • Run the following SQL to confirm the table was created successfully:
SHOW CREATE TABLE cust_iceberg_shadow;
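
One advantage of the shadow process is that the original table is still available for validation. A simple audit is to compare row counts between the old Hive table and the new Iceberg table:

SELECT
  (SELECT count(*) FROM cust_json) AS hive_rows,
  (SELECT count(*) FROM cust_iceberg_shadow) AS iceberg_rows;

Once you are satisfied that the data matches, you could drop the original Hive table to avoid storing the data twice; in this tutorial, we leave it in place.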

6. Use the in-place method to migrate from Hive to Iceberg

Background

Now it's time to try the in-place method for Hive to Iceberg migration. This method requires an ALTER TABLE statement with the SET PROPERTIES parameter identifying a change to the Iceberg table format, as detailed in our documentation.

Step 1: Run ALTER TABLE statement

You'll be working in the Starburst Galaxy query editor again for this step.

  • Copy and paste the SQL statement below into the query editor window.
  • Run the SQL by clicking the Run (limit 1000) button.
ALTER TABLE cust_orc
SET PROPERTIES type = 'ICEBERG';
  • Run the following SQL to confirm the table format was altered successfully:
SHOW CREATE TABLE cust_orc;
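
Because the table is now in Iceberg format, you can also inspect its snapshot history. The query below is a sketch that assumes the Iceberg $snapshots metadata table, as exposed by Trino-based engines, is available in your catalog:

SELECT snapshot_id, committed_at, operation
FROM "cust_orc$snapshots";

You should see the snapshot created by the migration; subsequent writes to the table will each add a new snapshot.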

7. Tutorial wrap-up

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've completed this tutorial, you should have a better understanding of how and when to use each of the Hive to Iceberg migration methods.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!