Last Updated: 2024-05-03

Background

Apache Iceberg is an open-source table format for huge analytic datasets. It offers several benefits for managing and querying large datasets efficiently, including, but not limited to:

Scope of tutorial

In this tutorial, you will explore Iceberg's advanced features using an airplane dataset. You will learn how tables can evolve in specific ways, including renaming columns, adding additional columns, renaming partitions, and eliminating partitions. At each step, you will see what impact these changes have on the records in the table and investigate how these changes are tracked by Iceberg's metadata files. Finally, you will learn how to compact small files into fewer/larger ones.

Learning objectives

Once you've completed this tutorial, you will be able to:

Prerequisites

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Background

You're going to begin by signing in to Starburst Galaxy and setting your role.

This is a quick step, but an important one.

Step 1: Sign into Starburst Galaxy

Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.

Step 2: Set your role

Your current role is listed in the top right-hand corner of the screen.

Background

In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.

Step 1: Set catalog and schema

You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.

Step 2: Create Iceberg table

Now you're going to use DDL, or data definition language, to create a new table called planes.

CREATE TABLE planes (
   tail_number varchar(15),
   name varchar(150),
   color varchar(15)
) WITH ( type = 'iceberg');

Step 3: Add records to table

Let's add two records to the planes table.

INSERT INTO planes (tail_number, name)
VALUES
 ('N707JT', 'John Travolta''s Boeing 707'),
 ('N1KE', 'Nike corp jet');
SELECT * FROM planes;

Background

Schema evolution refers to the modification of tables as business rules and source systems are modified over time. Starburst Galaxy supports different modifications to tables including the table name itself, column, and partition changes.

Step 1: Change column name

You can use the ALTER TABLE command to change the name of a column in an Iceberg table.

Notice that the name column from the planes table is really more of a "description". Let's change the name to reflect this.

ALTER TABLE planes RENAME COLUMN name TO description;
SELECT * FROM planes;

Step 2: Add new record

Add another plane to the table so that you have more data to work with.

INSERT INTO planes (tail_number, color, description)
VALUES
 ('N89TC', 'white',
  '1975 Bombardier Learjet 35 w/Light Jet classification');

Step 3: Add additional columns

You can also use the ALTER TABLE command to add additional columns to your table. Let's add four new columns.

ALTER TABLE planes ADD COLUMN class varchar(50);
ALTER TABLE planes ADD COLUMN year integer;
ALTER TABLE planes ADD COLUMN make varchar(100);
ALTER TABLE planes ADD COLUMN model varchar(100);
ALTER TABLE planes DROP COLUMN color;

Step 4: Modify existing records

In the previous step, you added and deleted columns, which introduced some NULL values to the existing records. You can use the UPDATE command to modify the records accordingly.

UPDATE planes
   SET class = 'Jet Airliner',
       year = 1964,
       make = 'Boeing',
       model = '707-138B'
 WHERE tail_number = 'N707JT';

UPDATE planes
   SET class = 'Heavy Jet',
       year = 2021,
       make = 'Gulfstream',
       model = 'G650'
 WHERE tail_number = 'N1KE';

UPDATE planes
   SET class = 'Light Jet',
       year = 1975,
       make = 'Bombardier',
       model = 'Learjet 35',
       description = null
 WHERE tail_number = 'N89TC';
SELECT * FROM planes;

Background

Partitioning is a way to organize and manage large volumes of data within a data lake. It involves dividing the data into logical segments based on certain criteria, such as date, location, or any other relevant attribute.

One of the big advantages of Iceberg is how it handles partitions, including allowing for partition evolution.

Step 1: Rename and partition table

Imagine that your company has now decided to carry other types of aircraft, not just planes.

You can use ALTER TABLE to rename the planes table to allow for additional types, such as helicopters, to be added. The new name for the table will be aircraft.

ALTER TABLE planes RENAME TO aircraft;
ALTER TABLE aircraft
SET PROPERTIES partitioning = ARRAY['class'];

Step 2: Add new records to table

Add two new helicopter records to the updated table.

INSERT INTO aircraft
 (tail_number, class, year, make, model, description)
VALUES
 ('N535NA', 'Helicopter', 1969, 'Sikorsky', 'UH-19D', 'NASA'),
 ('N611TV', 'Helicopter', 2022, 'Robinson', 'R66', null);

Step 3: Check partition metadata

Let's check the $partitions metadata table to ensure that the metadata for the new records is being collected properly. There should be two records in the new partition.

SELECT partition, record_count, file_count
  FROM "aircraft$partitions";

The second row in the image above indicates that 6 rows were added when there was no partitioning defined. This proves that, as stated previously, changes to the partitioning strategy do not automatically repartition the data previously persisted.

Step 4: Change partition strategy

Suppose that the needs of the business change. It is now determined that analysts are querying across specific years of manufacturing more than the class of the aircraft. You can change the partition strategy to account for this.

ALTER TABLE aircraft
SET PROPERTIES partitioning = ARRAY['year'];
SELECT partition, record_count, file_count
  FROM "aircraft$partitions";

Step 5: Add records to table

At this point, the aircraft table only has one aircraft for each year of manufacture. You can verify this with a quick query, then add some records to the table so that each year has more than one aircraft.

SELECT year, count() nbr_for_year
  FROM aircraft
 GROUP BY year ORDER BY year;

INSERT INTO aircraft
 (tail_number, class, year, make, model, description)
VALUES
 ('dummy', 'unknown', 1964, 'acme', 'cool', null),
 ('dummy', 'unknown', 1969, 'acme', 'cool', null),
 ('dummy', 'unknown', 1975, 'acme', 'cool', null),
 ('dummy', 'unknown', 2021, 'acme', 'cool', null),
 ('dummy', 'unknown', 2022, 'acme', 'cool', null);

Step 6: Verify partitioning for new records

Because you added the new records after you updated the partition strategy, they should be persisted with the year partition strategy. You can verify this by querying the $partitions table.

SELECT partition, record_count, file_count, data
  FROM "aircraft$partitions"
 ORDER BY record_count;

Background

Iceberg includes some file management features that help with performance. Traditional data lakes have scenarios where data is constantly being ingested. This data is written in small files because of the need to have it available to be queried immediately. This can hurt performance in any system that needs to read many small files, especially in cloud storage. Iceberg includes an optimize feature that combines small files into larger ones ensuring maximum performance when it comes to querying. This is known as compaction.

Step 1: Review $files metadata table

The $files metadata table references all files used by the current snapshot. Let's take a look at how many files are in the current snapshot for the aircraft table.

SELECT file_path, record_count, file_size_in_bytes
 FROM "aircraft$files";

The "merge on read" strategy leaves a sprawl of files as changes continue to be made. If the number of rows being updated or deleted is small for each operation, the size of the delete files and data files can be quite small as we see in the results above.

Even without running modification operations, the frequency and size of periodic, or streaming, ingestion can also create small files. This "small files problem" will ultimately affect performance and scalability. The solution to this is compaction, which is essentially reading multiple, smaller files and rebuilding and replacing them with fewer, larger ones.

Fortunately, we have a simple operation to trigger this maintenance that will eventually be needed. This operation uses a default value (100MB) to decide which files are considered small. Any file with a size below that threshold is included in the compaction operation.

Additionally, this compaction process is aware of the delete files and accounts for these by excluding the original row when rebuilding a file, eliminating the delete file, and rolling the net-new records that were created into larger files.

Step 2: Perform compaction

Let's perform compaction to optimize the size of the underlying files.

ALTER TABLE aircraft EXECUTE optimize;
SELECT file_path, record_count, file_size_in_bytes
  FROM "aircraft$files";

There are fewer files present now, as shown in the image below. These files are still very small in size due to the small number of records inserted into them. Nonetheless, the problem has been greatly reduced.

Step 3: Review partitions again

It might seem like fewer/larger files should have been written. Compaction created 5 files because the table's current partition strategy was utilized.

SELECT partition, record_count, file_count, data
  FROM "aircraft$partitions"
 ORDER BY record_count;

Did we just discover a relatively easy way to repartition pre-existing data after changing the partitioning strategy?

Hint: Yes we did!

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of schema and partition evolution with Apache Iceberg. You've also begun exploring data file compaction with Apache Iceberg.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.