Explore data modifications, snapshots, and time travel with Apache Iceberg

46 mins remaining

1. Tutorial overview

Last Updated: 2024-04-24

Background

One of the most powerful features when you're working with Iceberg and Starburst is the ability to modify data that resides in object storage. As you may know, files stored in object stores such as S3 are immutable. This means that after they are initially created, they can be deleted, but not modified. However, Starburst allows full DML (data manipulation language) on Iceberg tables, which means full support for UPDATE, DELETE and MERGE commands. Metadata files are persisted to create a comprehensive picture of changes made to the tables.

Scope of tutorial

In this tutorial, you will explore how Iceberg uses metadata to create a comprehensive picture of structural and content changes made to tables. To do this, you will set up a table, make changes, then see how the metadata tracks the versions of a table. Versions are identified by a snapshot identifier (ID). You will then leverage a snapshot ID in a query to use Iceberg's time travel feature to view results from prior versions of the table. You can even roll the table back to a previous version with a snapshot ID.

Learning objectives

Once you've completed this tutorial, you will be able to:

  • Use Starburst Galaxy to modify Iceberg tables.
  • Use Starburst Galaxy to query Iceberg's metadata tables.
  • Explain how snapshots are used and what information they contain.
  • Use snapshot identifiers to "time travel," or query previous versions of an Iceberg table.

Prerequisites

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

2. Video: Apache Iceberg data modifications, snapshots, and time travel

The following video walks through all the steps in this tutorial.

You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.

3. Sign into Starburst Galaxy and set Admin role

Background

You're going to begin by signing in to Starburst Galaxy and setting your role.

This is a quick step, but an important one.

Step 1: Sign into Starburst Galaxy

Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.

Step 2: Set your role

Your current role is listed in the top right-hand corner of the screen.

  • Check your role, to ensure that it is set to accountadmin.
  • If it is set to anything else, use the drop-down menu to select the correct role.

4. Create a testing table

Background

In Starburst Galaxy, you can create and populate new tables from the Query editor. Your new table will go in the catalog and schema that you created in the prerequisite tutorial.

Step 1: Set catalog and schema

You can set the catalog and schema for a session in the Query editor so that you don't have to use the fully-qualified table name in your queries.

  • Hover over the left-hand navigation menu to expand it.
  • Select Query>>Query editor.
  • Select the aws-us-east-1-free cluster, tmp_cat catalog, and tmp_first_last_postalcode schema.

Step 2: Create Iceberg table

Now you're going to use DDL, or data definition language, to create a new table called phone_provisioning.

  • Copy and paste the following SQL into the Query editor window:
CREATE TABLE phone_provisioning (
 phone_nbr bigint,
 event_time timestamp(6),
 action varchar(15),
 notes varchar(150)
)
WITH (
 type='iceberg',
 partitioning=ARRAY['day(event_time)']
);
  • Click the Run (limit 1000) button to execute the SQL.

5. Explore Iceberg snapshots

Background

Iceberg stores table metadata such as schema, partitioning details, statistics, and versioning information on the data lake alongside the actual data files. As part of this metadata, snapshots (i.e. versions) are created anytime the structure of the table or the data changes.

You can easily query the Iceberg $snapshots metadata table for detailed information on the table changes.

Step 1: Verify snapshot creation

You can verify that a snapshot was created when you created a new Iceberg table by querying the $snapshots metadata table. If you completed our tutorial on creating and populating Apache Iceberg tables, you learned how to query the $history metadata table. The $snapshots table is similar but provides a more detailed view.

  • Run the following SQL to query the $snapshots table:
SELECT * FROM "phone_provisioning$snapshots";
  • The query output shows a single row representing the first snapshot which captures the table creation itself. Click on the value of the summary column to see the details in a pop-up.

  • The summary information indicates that no data changes happened as part of this snapshot. In fact, it also shows there are zero total-records as you would expect having only run the DDL.
  • Before you close the summary window, copy the trino_query_id. You'll need it in the next step.

Step 2: Trace snapshots back to specific queries

Starburst Galaxy includes a Query insights page that allows you to review your query history. You can use the query ID provided by the $snapshots table as a filter to trace back to the query that produced a specific version of the table.

  • Use the left-hand navigation menu to select Query>>Query insights.
  • Click Show filters.
  • Click the Add filter button.
  • Select Query ID from the dropdown menu.
  • Paste the trino_query_id you copied from the last step into the Query id field.
  • Click the Apply filter button.

Step 3: Add records to table

Let's add two records from a week ago to capture the initial orders for two new phone numbers. You'll verify that a new snapshot was created after you add the records.

  • Run the following SQL to add the records:
INSERT INTO
 phone_provisioning (phone_nbr, event_time, action, notes)
VALUES
 (
   1111111, current_timestamp(6) - interval '7' day, 'ordered', null
 ),
 (
   2222222, current_timestamp(6) - interval '7' day, 'ordered', null
 );
  • Now verify that the two records are present, as expected:
SELECT * FROM phone_provisioning ORDER BY event_time DESC;

  • Finally, verify that a new snapshot was created:
SELECT * FROM "phone_provisioning$snapshots";

Step 4: Add activation records to table

Now let's add historical records from six days ago to capture the activation activity for the same two phone numbers. Once again, you can check the snapshots to verify that a new one was created.

  • Run the following SQL to add records to your Iceberg table:
INSERT INTO
 phone_provisioning (phone_nbr, event_time, action, notes)
VALUES
 (
   1111111, current_timestamp(6) - interval '6' day, 'activated', null
 ),
 (
   2222222, current_timestamp(6) - interval '6' day, 'activated', null
 );
  • Verify that the new records were added:
SELECT * FROM phone_provisioning ORDER BY event_time DESC;
  • Verify that a new snapshot was created:
SELECT * FROM "phone_provisioning$snapshots";
  • Review the summary column for the most recent snapshot. You should see that two records were added to the table.

Step 5: Add one more record to table

Now let's add one more record to the table. This one is from five days ago and captures an error that was reported on phone number 2222222.

  • Run the following SQL to add the error record:
INSERT INTO
  phone_provisioning (phone_nbr, event_time, action, notes)
VALUES
  (2222222, current_timestamp(6) - interval '5' day, 'errorReported',
  'customer reports unable to initiate call');
  • Using what you have learned from the previous steps, query the phone_provisioning table to ensure that the new record is present and query the snapshots metadata table to verify that a new snapshot has been created.

6. Use UPDATE to modify Iceberg tables

Background

As you've seen, Iceberg records changes to a table as snapshots. Thus far the only modifications you've made to your table are INSERTs. In this section, you'll see how UPDATEs are handled by Iceberg.

How Iceberg handles updates

Iceberg cannot perform an in-place update to the underlying immutable data files. Rather, when an update occurs, Iceberg has to create a delete file referencing the location in the existing file(s) that contain the record(s) to be updated. For this reason, it is more appropriate to think of updates as overwrites.

As part of an atomic operation, Iceberg also creates a new data file that has the full record being updated. It is essentially an "add" of the record with all updated columns as well as the existing values for columns not updated.

The new delete files and data files that are created will be read after the preceding data files. This allows Iceberg to modify the data prior to returning it by applying the delete files (i.e. delete the records) and then including the new data files (which will look like net-new records).

Comparison with RDBMS

This is somewhat analogous to RDBMS transaction logs that store a running history of modifications. The difference is that the classical databases are creating their transaction logs for recovery & replication purposes. Iceberg creates a series of deltas that will be used in a "merge on read" strategy when the table is queried.

Step 1: Modify previous records

Four days ago, a system error prevented the notes column from being populated correctly before it was fixed. Upon review of the problem, it was determined that two UPDATE commands were needed to modify the affected records.

  • Run the following two commands, one at a time, to modify the table:
UPDATE phone_provisioning
  SET notes = 'customer requested new number'
WHERE action = 'ordered'
  AND notes is null;

UPDATE phone_provisioning
  SET notes = 'number successfully activated'
WHERE action = 'activated'
  AND notes is null;
  • Run the following SQL to review the changes to the table:
SELECT * FROM phone_provisioning ORDER BY event_time DESC;

Step 2: Query snapshots table

Review the snapshots table now that you've made some modifications to the phone_provisioning table.

  • Run the following query:
SELECT * FROM "phone_provisioning$snapshots";
  • Note the two new snapshots that were created from the UPDATE statements. Pay attention to the operation column that lists the operation type as overwrite.

Step 3: Review snapshot summary

To get a better understanding of the process Iceberg uses to modify tables, you can review the summary of either of the snapshots that were created from the UPDATE statements.

  • Click the summary value for the second to last snapshot in the list.

The following is a subset of the properties shown in the summary. They are the ones most important to this discussion.

{
 total-position-deletes = 2, 
 total-delete-files = 1, 

 added-records = 2, 
 added-data-files = 1, 
}

These values are for the UPDATE statement that had action = ‘ordered' AND notes is null within it. It logically changed two records.

With Iceberg's inability to perform in-place updates on the underlying files, total-position-deletes = 2 indicates that 2 records were marked for deletion. Fortunately, all of these were placed in a single delete file.

Closely following those deletes, added-records = 2 indicates the records deleted are being re-added as essentially new inserts. As before, these were assembled into a single new data file.

7. Use Iceberg's time travel feature

Background

What can you do with snapshots? One useful feature is called "time travel". This allows you to query prior versions of the table via the snapshot_id or a timestamp.

We recommend that you familiarize yourself with the documentation on time travel. Many unique use cases make this feature invaluable.

Step 1: Time travel using snapshot_id

It's time to check out time travel in action, first by using the snapshot_id to query a prior version of the phone_provisioning table.

  • Using the output of the most recent query on the $snapshots table, copy the snapshot_id for the second snapshot. This is located in the second row of the output, as shown in the image below.

  • Run the following query, replacing snapshot_id with the value you just copied.
SELECT * FROM phone_provisioning
FOR VERSION AS OF snapshot_id
ORDER BY event_time DESC;
  • You should see the first two records that you added, similar to the image below.

Step 2: Time travel against point in time

Another way to leverage time travel is to run a query based on a past point in time. You can exercise this by swapping VERSION in the previous query with TIMESTAMP and replacing the snapshot_id with a timestamp.

Previously, you added a record with a timestamp that was 5 days in the past.

  • Run the following query to see what that table's contents were 5.5 days ago.
SELECT * FROM phone_provisioning
FOR TIMESTAMP AS OF current_timestamp(6) - interval '132' hour
ORDER BY event_time DESC;

You should receive an error like the one shown in the following image:

Do you understand what happened here? The query above assumed that the event_time column's timestamp was being used, but it is only a timestamp-based column and is not directly related to the versioning information. The rows from the $snapshots metadata table have a committed_at timestamp which is leveraged when FOR TIMESTAMP AS OF is utilized.

Run the last query again after changing the interval type from hour to minute and plugging in a single-digit number instead of 132. Increment and/or decrement the number until you get the results you are looking for.

Hint: use an appropriate number of minutes that would make the timestamp slightly before the committed_at column value from $snapshots for the snapshot you are trying to read the data from.

Step 3: Rollback to previous table version

Time travel also provides the ability to roll a table back to a previous snapshot. Once again the snapshot_id is required.

In this step, you are going to simulate a typical use case of time travel by "mistakenly" deleting some records from your table, then rolling it back to the previous version before they were deleted.

  • Begin by copying the snapshot_id for the most recent version of your table. Be sure to paste it in a text editor for later reference.
SELECT * FROM "phone_provisioning$snapshots";

  • Now, run the following SQL to remove all records for one of the phone numbers:
DELETE FROM phone_provisioning
 WHERE phone_nbr = 2222222;
  • Verify that 3 of the 5 rows were deleted:
SELECT * FROM phone_provisioning ORDER BY event_time DESC;

  • Run the following SQL to roll the table back to the version before the deletions. Be sure to replace snapshot_id with the snapshot_id you copied earlier and myschema with your schema name.
CALL tmp_cat.system.rollback_to_snapshot(
   'myschema', 'phone_provisioning',
    snapshot_id);
  • Finally, confirm that the rows have been recovered:
SELECT * FROM phone_provisioning ORDER BY event_time DESC;

8. Tutorial wrap-up

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You should now have a better understanding of how Iceberg uses snapshots to manage table modifications. You've also gained some experience using the powerful Iceberg time travel feature.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!