Last Updated: 2024-04-19

Background

A "table format" is an open-source mechanism that manages and tracks all the files and metadata that make up a table. Apache Hive is the first-generation table format which over time has been found to have many limitations.

To address the limitations of Hive, modern table formats such as Apache Iceberg, Delta Lake, and Apache Hudi were designed. These modern formats offer features like ACID transactions, schema evolution, time travel, and improved performance for certain types of operations.

Although Starburst supports all three modern table formats, we recommend Iceberg as our top choice.

Scope of tutorial

In this tutorial, you will learn about the Iceberg table format and its numerous benefits. You'll discover how to create Iceberg tables, add records, and query values using Iceberg. Additionally, you'll explore the metadata tables to understand the information stored for enabling versioning with snapshots.

Learning objectives

Once you've completed this tutorial, you will be able to:

Prerequisites

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Background

You're going to begin by signing in to Starburst Galaxy and setting your role.

This is a quick step, but an important one.

Step 1: Sign into Starburst Galaxy

Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.

Step 2: Set your role

Your current role is listed in the top right-hand corner of the screen.

Background

In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.

Step 1: Set catalog and schema

You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.

Step 2: Create Iceberg table

Now you're going to use DDL, or data definition language, to create a new table called my_iceberg_tbl.

CREATE TABLE my_iceberg_tbl (
   id integer,
   name varchar(55),
   description varchar(255)
) WITH (
   TYPE = 'iceberg', FORMAT = 'parquet'
);

Step 3: Add records to table

It's time to populate the table. We're going to add three records for characters from the popular film series Dune.

INSERT INTO my_iceberg_tbl
 (id, name, description)
VALUES
 (101, 'Leto', 'Ruler of House Atreides'),
 (102, 'Jessica', 'Consort of the Duke'),
 (103, 'Paul', 'Son of Leto (aka Dale Cooper)');
SELECT * FROM my_iceberg_tbl;

Background

Iceberg stores table metadata such as schema, partitioning details, statistics, and versioning information on the data lake alongside the actual data files. The following diagram presents the richness of this architectural approach.

A detailed explanation of this architecture can be found in the Apache Iceberg Specification. The key takeaways are as follows:

You can easily query the Iceberg metadata tables to avoid having to inspect the metadata files directly from the data lake.

Step 1: Query $history metadata table

The $history table provides a log of the metadata changes performed on the Iceberg table. Let's take a look at how the changes to my_iceberg_tbl were captured.

SELECT made_current_at,
       snapshot_id, parent_id
  FROM "my_iceberg_tbl$history"
 ORDER BY made_current_at;

The first snapshot_id in the list refers to the snapshot that was created when the table was created. The next one is from the INSERT statement execution. Notice that its parent_id value is the same as the snapshot_id from the row before it.

Step 2: Query $files metadata table

The $files metadata table provides a detailed overview of the data files in the current snapshot.

SELECT
 substring(file_path, position('/data/' IN file_path) + 6)
   AS file_path,
 record_count,
 value_counts,
 null_value_counts,
 lower_bounds,
 upper_bounds
FROM
 "my_iceberg_tbl$files";

Here is an explanation of some of the information that was returned by this query:

Step 3: Add additional records to table

You can see snapshots in action by adding records to the Iceberg table and then querying the metadata tables again.

INSERT INTO my_iceberg_tbl
 (id, name, description)
VALUES
 (104, 'Thufir', 'Mentat'),
 (201, 'Vladimir', 'Ruler of House Harkonnen'),
 (202, 'Rabban', 'Ruthless nephew of Vladimir'),
 (203, 'Feyd-Rautha', 'Savvy nephew of Vladimir (played by Sting)'),
 (301, 'Reverend Mother Gaius Helen Mohiam', null);
SELECT * FROM my_iceberg_tbl

Step 5: Query $files metadata table again

It's time to query the $files metadata table again to see what has changed now that you've added five records to your Iceberg table.

SELECT
 substring(file_path, position('/data/' IN file_path) + 6)
   AS file_path,
 record_count,
 value_counts,
 null_value_counts,
 lower_bounds,
 upper_bounds
FROM
 "my_iceberg_tbl$files";

Here is a summary of the information returned by the new file:

Step 6: Verify additional snapshot was created

You can execute the query against the $history table again to verify that a new snapshot was created.

SELECT made_current_at,
       snapshot_id, parent_id
  FROM "my_iceberg_tbl$history"
 ORDER BY made_current_at;

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of how to create and populate Iceberg tables and some of the basic information available via the metadata tables.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.