This post is part of the Iceberg blog series. Read the entire series:
- Introduction to Apache Iceberg in Trino
- Iceberg Partitioning and Performance Optimizations in Trino
- Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
- Apache Iceberg Schema Evolution in Trino
- Apache Iceberg Time Travel & Rollbacks in Trino
- Automated maintenance for Apache Iceberg tables in Starburst Galaxy
- Improving performance with Iceberg sorted tables
- How to migrate your Hive tables to Apache Iceberg
Apache Iceberg is an open source table format that brings database functionality to object storage such as S3, Azure’s ADLS, Google Cloud Storage and MinIO. This allows an organization to take advantage of low-cost, high performing cloud storage while providing data warehouse features and experience to their end users without being locked into a single vendor.
What is Apache Iceberg?
Apache Iceberg is a table format, originally created by Netflix, that provides database type functionality on top of object stores such as Amazon S3. Iceberg allows organizations to finally build true data lakehouses in an open architecture, avoiding vendor and technology lock-in.
The excitement around Iceberg began last year and has greatly increased in 2022. Most of the customers and prospects I speak with on a weekly basis are either considering migrating their existing Hive tables to it or have already started. They are excited a true open source table format has been created with many engines both open source and proprietary jumping on board.
Advantages of Apache Iceberg
One of the best things about Iceberg is the vast adoption by many different engines. In the diagram below, you can see many different technologies can work the same set of data as long as they use the open-source Iceberg API. As you can see, the popularity and work that each engine has done is a great indicator of the popularity and usefulness that this exciting technology brings.
With more and more technologies jumping on board, Iceberg isn’t a passing fad. It has been growing in popularity, not only because of how useful it is, but also because it’s truly an open source table format, many companies have contributed and helped improve the specification making it a true community based effort.
Here is a list of the many features Iceberg provides:
|Choose your engine
|As you can see from the diagram above, there are many engines that support Iceberg. This offers the ultimate flexibility to own your own data and choose the engine that fits your use cases.
|Avoid Data Lock-in
|The data Iceberg and these engines work on, is YOUR data in YOUR account which avoids data lock-in.
|Avoid Vendor Lock-out
|Iceberg metadata is always available to all engines. So you can guarantee consistency, even with multiple writers.
|DML (modifying table data)
|Modifying data in Hadoop was a huge challenge. With Iceberg, data can easily be modified to adhere to use cases and compliance such as GDPR.
|Much like a database, Iceberg supports full schema evolution including columns and even partitions.
|Since Iceberg stores a table state in a snapshot, the engine simply needs to read the metadata in that snapshot then start retrieving the data from storage saving valuable time and reduced cloud object store retrieval costs.
|Partitioning is performed on any column and end users query Iceberg tables just like they would a database.
Iceberg is a layer of metadata over your object storage. It provides a transaction log per table very similar to a traditional database. This log keeps track of the current state of the table including any modifications. It also keeps a current “snapshot” of the files that belong to the table and statistics about them in order to reduce the amount of data that is needed to be read during queries greatly, improving performance.
Everytime a modification to an Iceberg table is performed, (insert, update, delete, etc.) a new snapshot of the table is created. When an Iceberg client (Trino let’s say) wants to query a table, the latest snapshot is read and the files that “belong” to that snapshot are read. This makes a very powerful feature called time travel available because the table at any given point contains a set of snapshots over time which can be queried with the proper syntax.
Under the covers, Iceberg uses a set of avro based files to keep track of this metadata. A Hive compatible metastore is used to “point” to the latest metadata file that has the current state of the table. All engines that want to interact with the table first get the latest “pointer” from the metastore then start interacting with Iceberg metadata files from there.
Here is a very basic diagram of the different files that are created during a CTAS (create table as select):
Metadata File Pointer (fp1) – This is an entry in a Hive compatible metastore (AWS Glue for example) that points to the current metadata file. This is the start to any query against an Iceberg table.
Metadata File (mf1) – A json file that contains the latest version of a table. Any changes made to a table create a new metadata file. The contents of this file are simply lists of manifest list files with some high level metadata.
Manifest List (ml1) – List of manifest files that make up a snapshot. This also includes metadata such as partition bounds in order to skip files that do not need to be read for the query.
Manifest File (mf1) – Lists a set of files and metadata about these files. This is the final step for a query as only files that need to be read are determined using these files saving valuable querying time.
Here is a sample table name customer_iceberg that was created on S3
Table directory – this is the name of the table with a unique uuid in order to support table renames
Data directory – this holds orc, parquet or avro files and could contain subdirectories depending on if the table is partitioned
Metadata directory – ths directory holds the manifest files as covered above
Again, this might be too nitty-gritty for the average user but the point is a tremendous amount of thought and work has been put into Iceberg to ensure it can handle many different types of analytical queries along with real-time ingestion. It was built to fill the gap between low-cost, cloud object stores and the demanding processing engines such as Trino and Spark.
Using partitions in Iceberg is just like with a database. Most data you ingest into your data lake has a timestamp and partitioning by that column is very easy:
Example – partition by month from a timestamp column:
create table orders_iceberg
Querying using a standard where clause against the partitioned column will result in partition pruning and much higher performance:
select * from orders_iceberg
WHERE CAST(create_date AS date) BETWEEN date '1993-06-01' AND date '1993-11-30';
Trino Iceberg Support
Trino has full support for Iceberg with a feature matrix listed below:
|Modify Table (update/delete/merge)
|Add/Drop/Modify table column
|Rollback to previous snapshot
|View support (includes AWS Glue)
|Maintenance (Optimize/Expire Snapshots)
|Alter table partition
Using Iceberg in Trino is very easy. There is a dedicated connector page located here. If you are using Starburst Galaxy which is a fully managed Trino platform, Iceberg is already included in the Great Lakes Connector so there isn’t anything you need to add. It’s even the default table format so that’s how excited and confident we are here at Starburst about it.
We highly encourage our customers, prospects, open-source Trino users to explore and start using this new technology. The need to use a proprietary cloud data warehouse is greatly reduced with the addition of Iceberg and robust ecosystem of engines. I believe we’re at the “tip of the iceberg” on wide adoption of this open source project.
Follow and take part in the quickly growing Trino Slack community and don’t forget to join the #iceberg channel as well.