1. Tutorial overview
Last Updated: 2024-04-19
Background
A "table format" is an open-source mechanism that manages and tracks all the files and metadata that make up a table. Apache Hive is the first-generation table format which over time has been found to have many limitations.
To address the limitations of Hive, modern table formats such as Apache Iceberg, Delta Lake, and Apache Hudi were designed. These modern formats offer features like ACID transactions, schema evolution, time travel, and improved performance for certain types of operations.
Although Starburst supports all three modern table formats, we recommend Iceberg as our top choice.
Scope of tutorial
In this tutorial, you will learn about the Iceberg table format and its numerous benefits. You'll discover how to create Iceberg tables, add records, and query values using Iceberg. Additionally, you'll explore the metadata tables to understand the information stored for enabling versioning with snapshots.
Learning objectives
Once you've completed this tutorial, you will be able to:
- Use Starburst Galaxy to create and populate Iceberg tables.
- Use Starburst Galaxy to query Iceberg tables.
- Explain what information is stored in the Iceberg metadata tables.
Prerequisites
- You need a Starburst Galaxy account to complete this tutorial. Please see Starburst Galaxy: Getting started for instructions on setting up a free account.
- You must complete the prerequisite tutorial titled Configure a Starburst Galaxy data lake catalog and schema before attempting this tutorial. It will guide you in setting up the required environment for this tutorial.
About Starburst tutorials
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
2. Video: Create and populate Apache Iceberg tables
The following video walks through all the steps in this tutorial.
You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.
3. Sign into Starburst Galaxy and set Admin role
Background
You're going to begin by signing in to Starburst Galaxy and setting your role.
This is a quick step, but an important one.
Step 1: Sign into Starburst Galaxy
Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Step 2: Set your role
Your current role is listed in the top right-hand corner of the screen.
- Check your role, to ensure that it is set to accountadmin.
- If it is set to anything else, use the drop-down menu to select the correct role.
4. Create new table
Background
In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.
Step 1: Set catalog and schema
You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.
- Hover over the left-hand navigation menu to expand it.
- Select Query>>Query editor.
- Select the
aws-us-east-1
cluster,tmp_cat
catalog, andtmp_first_last_postalcode
schema.
Step 2: Create Iceberg table
Now you're going to use DDL, or data definition language, to create a new table called my_iceberg_tbl
.
- Copy and paste the following SQL into the Query editor window:
CREATE TABLE my_iceberg_tbl (
id integer,
name varchar(55),
description varchar(255)
) WITH (
TYPE = 'iceberg', FORMAT = 'parquet'
);
- Note the table
TYPE
and fileFORMAT
specified in theWITH
clause. If you set your default table format to Iceberg during catalog creation, you could safely omit theTYPE
. - Click the Run (limit 1000) button to execute the SQL.
Step 3: Add records to table
It's time to populate the table. We're going to add three records for characters from the popular film series Dune.
- Run the following SQL to add the records to your table:
INSERT INTO my_iceberg_tbl
(id, name, description)
VALUES
(101, 'Leto', 'Ruler of House Atreides'),
(102, 'Jessica', 'Consort of the Duke'),
(103, 'Paul', 'Son of Leto (aka Dale Cooper)');
- Run a simple query to verify that the new rows are present:
SELECT * FROM my_iceberg_tbl;
5. Explore Iceberg table metadata
Background
Iceberg stores table metadata such as schema, partitioning details, statistics, and versioning information on the data lake alongside the actual data files. The following diagram presents the richness of this architectural approach.
A detailed explanation of this architecture can be found in the Apache Iceberg Specification. The key takeaways are as follows:
- Data and metadata live together inside the table's data lake table location.
- The metadata subdirectory contains JSON and AVRO files to represent multiple versions of a table.
- The data subdirectory contains the actual data files (supports ORC, Parquet, and AVRO) that span all the various versions identified in the metadata.
- Snapshots (i.e. versions) are created anytime the structure or the data changes.
You can easily query the Iceberg metadata tables to avoid having to inspect the metadata files directly from the data lake.
Step 1: Query $history
metadata table
The $history
table provides a log of the metadata changes performed on the Iceberg table. Let's take a look at how the changes to my_iceberg_tbl
were captured.
- Run the following SQL in the Query editor window:
SELECT made_current_at,
snapshot_id, parent_id
FROM "my_iceberg_tbl$history"
ORDER BY made_current_at;
The first snapshot_id
in the list refers to the snapshot that was created when the table was created. The next one is from the INSERT
statement execution. Notice that its parent_id
value is the same as the snapshot_id
from the row before it.
Step 2: Query $files
metadata table
The $files
metadata table provides a detailed overview of the data files in the current snapshot.
- Run the following query to see that there is currently only one data file:
SELECT
substring(file_path, position('/data/' IN file_path) + 6)
AS file_path,
record_count,
value_counts,
null_value_counts,
lower_bounds,
upper_bounds
FROM
"my_iceberg_tbl$files";
Here is an explanation of some of the information that was returned by this query:
file_path
: A specific file name. The remainder of the columns relate to this particular file.record_count
: The number of records in the file. Ours has three.value_counts
: The number of values in each column. Each of our columns has three values.null_value_counts
: The number ofNULL
values in each column. The count is zero for each of our columns.lower_bounds
: The lowest value in each column. For example, theid
field has101
as a lower bound.upper_bounds
: The highest value in each column. For example, theid
field has103
as an upper bound.
Step 3: Add additional records to table
You can see snapshots in action by adding records to the Iceberg table and then querying the metadata tables again.
- Run the following SQL to add records to your Iceberg table:
INSERT INTO my_iceberg_tbl
(id, name, description)
VALUES
(104, 'Thufir', 'Mentat'),
(201, 'Vladimir', 'Ruler of House Harkonnen'),
(202, 'Rabban', 'Ruthless nephew of Vladimir'),
(203, 'Feyd-Rautha', 'Savvy nephew of Vladimir (played by Sting)'),
(301, 'Reverend Mother Gaius Helen Mohiam', null);
- Verify that the new records were added:
SELECT * FROM my_iceberg_tbl
Step 5: Query $files
metadata table again
It's time to query the $files
metadata table again to see what has changed now that you've added five records to your Iceberg table.
- Run the following query to see that a new file has been added:
SELECT
substring(file_path, position('/data/' IN file_path) + 6)
AS file_path,
record_count,
value_counts,
null_value_counts,
lower_bounds,
upper_bounds
FROM
"my_iceberg_tbl$files";
Here is a summary of the information returned by the new file:
record_count
: The record count is now 5.value_counts
: Each column now has 5 values.null_value_counts
: One of the columns now has aNULL
value.lower_bounds
: Theid
field now has a lower bound of104
.upper_bounds
: Theid
field now has an upper bound of301
.
Step 6: Verify additional snapshot was created
You can execute the query against the $history
table again to verify that a new snapshot was created.
- Run the following query:
SELECT made_current_at,
snapshot_id, parent_id
FROM "my_iceberg_tbl$history"
ORDER BY made_current_at;
6. Tutorial wrap-up
Tutorial complete
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of how to create and populate Iceberg tables and some of the basic information available via the metadata tables.
Continuous learning
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Next steps
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Tutorials available
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!