Open table formats


Understanding open table formats
An open table format provides a layer of abstraction on top of a data lake. This allows data to be managed and optimized more efficiently. At the same time, the increased structure allows for additional features.
Understanding open table formats begins with Apache Hive, the first data lake table format. However, Hive’s architecture imposed many functional limitations. In recent years, new table formats have pushed data lakes in new directions, increasing functionality and performance.
Importantly, these table formats allow data lakes to achieve some of the efficiencies and compliance standards more typically associated with data warehouses or databases while retaining the versatility of a data lake.
This includes enhanced ACID compliance, the ability to record transactional data efficiently, improved scalability, and the ability to update or delete records. These advancements are so significant that data lakes using these technologies become more like data lakehouses, mixing the versatility of data lakes to handle raw and semi-structured data with the ability to process transactional workloads like a data warehouse.
Let’s explore how newer table formats have improved the versatility of data lakes by adding functionality, with particular attention to improvements in transactional data collection.
Features of modern open table formats
While the table formats vary, they all extend the features of a data lake in similar ways.
These include:
Full CRUD operations
Data lakes use either HDFS or object storage. Both hold data in immutable files, and neither has typically provided an easy way to update files incrementally. A database normally supports create, read, update, and delete (CRUD) operations, whereas a data lake has often supported only the first two. Modern table formats close this gap by making it possible to update and delete records.
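As a rough illustration of how updates and deletes can be layered on top of immutable storage, the sketch below uses a copy-on-write approach: instead of changing a file in place, it writes a new file and changes which files the table points to. The `Table` class, the file names, and the record layout are hypothetical and do not correspond to any specific table format’s API.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table: immutable 'files' plus a list of the files that are live."""
    storage: dict = field(default_factory=dict)     # file name -> tuple of rows
    live_files: list = field(default_factory=list)  # files the table points to
    _counter: int = 0

    def _write_file(self, rows):
        # Files are only ever created, never changed in place.
        name = f"data-{self._counter:04d}.parquet"
        self._counter += 1
        self.storage[name] = tuple(rows)
        return name

    def insert(self, rows):
        self.live_files.append(self._write_file(rows))

    def read(self):
        return [row for name in self.live_files for row in self.storage[name]]

    def delete_where(self, predicate):
        # Copy-on-write: rewrite only the files that contain matching rows.
        new_live = []
        for name in self.live_files:
            rows = self.storage[name]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) == len(rows):
                new_live.append(name)  # untouched file is reused as-is
            elif kept:
                new_live.append(self._write_file(kept))
        self.live_files = new_live

    def update_where(self, predicate, change):
        new_live = []
        for name in self.live_files:
            rows = self.storage[name]
            if any(predicate(r) for r in rows):
                rewritten = [change(r) if predicate(r) else r for r in rows]
                new_live.append(self._write_file(rewritten))
            else:
                new_live.append(name)
        self.live_files = new_live


table = Table()
table.insert([{"id": 1, "city": "Boston"}, {"id": 2, "city": "Austin"}])
table.insert([{"id": 3, "city": "Denver"}])

table.update_where(lambda r: r["id"] == 2, lambda r: {**r, "city": "Dallas"})
table.delete_where(lambda r: r["id"] == 1)
rows = table.read()
```

Note that the original file is still present in `storage` after the update and delete; only the list of live files changed. Real formats may also use other strategies, such as recording deletes separately and merging them at read time.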
Improved performance and scalability
Data lakes often grow very large, and their analytic capabilities need to scale to match. Newer table formats scale better than Hive because they track data at the file level, whereas Hive’s record-keeping approach organized data by folder. This means that if a query needs only a specific subset of the data, it can read just the files containing that information rather than searching the whole folder. This can substantially improve performance and efficiency.
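The difference between folder-level and file-level tracking can be sketched in a few lines. The example below assumes the table format keeps a minimum and maximum value per file for one column; the `FileEntry` record, the column name, and the paths are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical per-file metadata record kept by the table format."""
    path: str
    min_order_date: str  # ISO dates compare correctly as strings
    max_order_date: str


def folder_scan(entries):
    # Folder-based approach: every file under the folder must be opened.
    return [e.path for e in entries]


def pruned_scan(entries, start, end):
    # File-level approach: skip files whose value range cannot overlap the query.
    return [
        e.path
        for e in entries
        if e.max_order_date >= start and e.min_order_date <= end
    ]


entries = [
    FileEntry("orders/a.parquet", "2024-01-01", "2024-01-31"),
    FileEntry("orders/b.parquet", "2024-02-01", "2024-02-29"),
    FileEntry("orders/c.parquet", "2024-03-01", "2024-03-31"),
]

all_files = folder_scan(entries)
needed = pruned_scan(entries, "2024-02-10", "2024-02-20")
```

Here the folder scan touches all three files, while the pruned scan reads only `orders/b.parquet`.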
Transactional support and ACID compliance
With ACID capabilities in table formats like Iceberg and Delta Lake, users can now achieve transactional awareness within a data lake. This doesn’t necessarily mean that a data lake would be a replacement for an OLTP system. However, it does ensure that a group of updates either completes together or is rolled back if it cannot be completed. This is useful for some of today’s evolving ETL pipelines.
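The all-or-nothing behavior described above can be sketched as follows. This is a minimal in-memory model, not how any particular format implements commits: changes are staged on a private copy, and a single assignment stands in for the atomic swap of a metadata pointer. The `Table`, `upsert`, and `delete` names are assumptions made for the example.

```python
import copy


class TransactionFailed(Exception):
    pass


class Table:
    def __init__(self):
        self.committed = {}  # record id -> record; the state readers see

    def transaction(self, operations):
        # Work on a private copy so readers never see partial results.
        staged = copy.deepcopy(self.committed)
        try:
            for op in operations:
                op(staged)
        except Exception as exc:
            # Any failure discards the staged copy; committed state is untouched.
            raise TransactionFailed(str(exc)) from exc
        # Single assignment stands in for an atomic metadata pointer swap.
        self.committed = staged


def upsert(record):
    def op(state):
        state[record["id"]] = record
    return op


def delete(record_id):
    def op(state):
        del state[record_id]  # raises KeyError if the record does not exist
    return op


table = Table()
table.transaction([upsert({"id": 1, "total": 10}), upsert({"id": 2, "total": 20})])

try:
    # The second operation fails, so the first must not become visible either.
    table.transaction([upsert({"id": 3, "total": 30}), delete(99)])
    rolled_back = False
except TransactionFailed:
    rolled_back = True
```

After the failed transaction, the table still holds only records 1 and 2.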
Three types of modern open table formats
Let’s look more closely at three modern open table formats: Apache Iceberg, Delta Lake, and Apache Hudi.
1. Apache Iceberg
Apache Iceberg is an open-source table format used to structure the data held in data lakes. Like the other table formats listed, it was developed to solve the challenges of performance, data modification, and CRUD operations in the data lake. It can be used with HDFS or any object-based cloud platform, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.
Iceberg also offers schema evolution, partition evolution, and time travel. This allows users to apply and update schemas, apply and update partitions, and use version control to roll a table back to a previous state. All of these adaptations push the data lake to a new level of functionality and create new use cases for data lakes.
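One way a table format can rename or add columns without rewriting data files is to track columns by a stable ID rather than by name, which is the general approach Iceberg takes. The sketch below is a simplified illustration of that idea using plain dictionaries; the structures and names are hypothetical and are not Iceberg’s actual metadata layout.

```python
# Data files store values keyed by column ID, not by column name.
data_file = [
    {1: 101, 2: "alice"},
    {1: 102, 2: "bob"},
]

# Each schema version maps column IDs to the names in use at that time.
schema_v1 = {1: "id", 2: "name"}
schema_v2 = {1: "id", 2: "customer_name", 3: "email"}  # rename column 2, add column 3


def read_with_schema(rows, schema):
    # Columns that an older file does not contain are returned as None.
    return [
        {name: row.get(col_id) for col_id, name in schema.items()}
        for row in rows
    ]


old_view = read_with_schema(data_file, schema_v1)
new_view = read_with_schema(data_file, schema_v2)
```

The same unchanged data file can be read under either schema version: the rename only changes the name attached to column 2, and the added column reads as empty for older rows.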
Demo: Iceberg and Trino
This demo explores the combination of Apache Iceberg and Trino, two tools reshaping the big data landscape. It uses Starburst Galaxy and compares it with AWS Athena.
2. Delta Lake
Delta Lake is an open-source framework developed by Databricks. Like other modern table formats, it employs file-level listings, considerably improving the speed of queries compared to Hive’s directory-level listing.
Like Iceberg, Delta Lake offers enhanced CRUD operations, including the ability to update and delete records in a data lake, which would previously have been immutable. It is ACID compliant and is often used for transactional workloads. This allows a data lake to take on some work that previously required a traditional transactional database while retaining the cost and storage benefits of a data lake.
Delta Lake can be used with Starburst via the Delta Lake connector.
3. Apache Hudi
Apache Hudi is another open table format, used less often than Iceberg or Delta Lake. It addresses the same problems discussed above, including record-level updates and deletes and ACID transactions.
Open table format architecture
Metadata captures changes in state
Architecturally, modern table formats are composed of a set of hierarchically structured metadata files. These files capture changes in the state of the data in the data lake. In this sense, a table format acts like a database transaction log, recording all of the changes made over the life of a table. This metadata is stored in a structured, readily accessible format. How does this work? Let’s explore how Iceberg uses enhanced metadata collection to deliver additional functionality.
The image below shows how metadata tracks the changes to the dataset. The files held in the Data layer are captured by the metadata files held in the Metadata layer. As the files change, the metadata files attached to them track these changes.
Record snapshots
To achieve this metadata capture, modern table formats create records that point to individual data file locations. In Iceberg, such a record is known as a manifest file; it lists a set of data files together with metadata about each one. The manifest files that make up the table at a given point in time are tracked together in a Manifest list.
In the image below, changes in the Data layer have been detected in the Metadata layer. A new Manifest file and corresponding Manifest list have been created to capture these changes.
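The relationship between data files, manifest files, and manifest lists can be pictured as nested records. The sketch below is a simplified model that loosely follows Iceberg’s terminology; the class names, field names, and paths are illustrative assumptions, and real manifests carry considerably more detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str
    record_count: int


@dataclass(frozen=True)
class ManifestFile:
    """Lists a group of data files."""
    path: str
    data_files: List[DataFile]


@dataclass(frozen=True)
class ManifestList:
    """Lists the manifest files that make up one version of the table."""
    path: str
    manifests: List[ManifestFile]

    def data_file_paths(self):
        return [f.path for m in self.manifests for f in m.data_files]

    def total_records(self):
        return sum(f.record_count for m in self.manifests for f in m.data_files)


m1 = ManifestFile(
    "metadata/manifest-1.avro",
    [DataFile("data/a.parquet", 100), DataFile("data/b.parquet", 250)],
)
before = ManifestList("metadata/manifest-list-0.avro", [m1])

# A new data file arrives: write one new manifest and a new manifest list.
# The existing manifest is reused rather than rewritten.
m2 = ManifestFile("metadata/manifest-2.avro", [DataFile("data/c.parquet", 50)])
after = ManifestList("metadata/manifest-list-1.avro", [m1, m2])
```

Because `after` reuses `m1`, recording the change only requires writing metadata for the new file, while `before` remains available as a description of the earlier state.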
Create up-to-date record of changes
Manifest files and Manifest lists provide an accurate, up-to-date account of changes over time. This includes inserts, deletions, updates, schema migrations, and partition changes. The state of the table after each change is recorded in the Metadata layer as a Snapshot. Each snapshot is like a slice in time, allowing the dataset to be queried as it was at multiple points in time or rolled back to a previous state.
The image below shows how the changes in the Data layer have created a new Snapshot file, Snapshot 1. The original Snapshot file, Snapshot 0, is also retained. This creates a series of snapshots, each tracking changes to the data and recording those changes in the Metadata layer.
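A series of retained snapshots is what makes time travel and rollback possible. The sketch below models this with a hypothetical `SnapshotTable` class in which each snapshot is simply the list of data files that were live at that point; snapshot IDs are list indexes for simplicity, which is an assumption of this example and not how real formats assign them.

```python
class SnapshotTable:
    def __init__(self):
        self.snapshots = []     # index = snapshot id; value = tuple of file paths
        self.current_id = None  # which snapshot readers see by default

    def commit(self, add=(), remove=()):
        removed = set(remove)
        files = [f for f in self.files() if f not in removed] + list(add)
        self.snapshots.append(tuple(files))  # earlier snapshots are kept
        self.current_id = len(self.snapshots) - 1
        return self.current_id

    def files(self, snapshot_id=None):
        # Time travel: read the file list as of any retained snapshot.
        if snapshot_id is None:
            snapshot_id = self.current_id
        if snapshot_id is None:
            return ()
        return self.snapshots[snapshot_id]

    def rollback(self, snapshot_id):
        # Rolling back only moves the current pointer; no data is rewritten.
        if not 0 <= snapshot_id < len(self.snapshots):
            raise ValueError("unknown snapshot")
        self.current_id = snapshot_id


table = SnapshotTable()
s0 = table.commit(add=["data/a.parquet"])
s1 = table.commit(add=["data/b.parquet"], remove=["data/a.parquet"])

current = table.files()     # state after the second commit
as_of_s0 = table.files(s0)  # the table as it was at Snapshot 0

table.rollback(s0)
after_rollback = table.files()
```

Both snapshots remain in `table.snapshots` after the rollback, so the later state can still be inspected.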
Open table formats vs open file formats
Table and file formats are different open-source elements of an open data lakehouse. Columnar open file formats like Parquet and ORC ensure data within an object gets written in ways that optimize query performance, while open table formats like Iceberg sit above the files and objects, providing a layer of rich metadata to enable analytics on the underlying data lake.
Open table format feature | Apache Iceberg | Delta Lake | Apache Hudi | Apache Hive |
Transaction support (ACID) | Yes | Yes | Yes | Limited |
File format | Parquet, ORC, Avro | Parquet | Parquet, ORC, Avro | Parquet, ORC, Avro, and more |
Schema evolution | Full | Partial | Full | Partial |
Partition evolution | Yes | No | No | No |
Data versioning | Yes | Yes | Yes | No |
Time travel queries | Yes | Yes | Yes | No |
Concurrency control | Optimistic locking | Optimistic locking | Optimistic locking | Pessimistic locking |
Object store cost optimization | Yes | Yes | Yes | No |
Community and ecosystem | Growing | Growing | Growing | Established |
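The optimistic approach listed in the concurrency control row can be sketched as a compare-and-swap with retry: a writer reads the current version without taking a lock, prepares its change, and commits only if the version has not moved in the meantime. The `Catalog` class, the `interfere` hook, and the retry count below are assumptions made for this illustration, not any format’s actual commit protocol.

```python
class CommitConflict(Exception):
    pass


class Catalog:
    """Holds the current table version; stands in for an atomic pointer."""

    def __init__(self):
        self.version = 0
        self.rows = ()

    def compare_and_swap(self, expected_version, new_rows):
        # The commit succeeds only if nobody else committed in the meantime.
        if self.version != expected_version:
            raise CommitConflict(
                f"expected v{expected_version}, found v{self.version}"
            )
        self.rows = new_rows
        self.version += 1


def append_with_retry(catalog, new_row, max_attempts=3, interfere=None):
    for attempt in range(1, max_attempts + 1):
        # Read the current state without locking it.
        base_version, base_rows = catalog.version, catalog.rows
        if interfere and attempt == 1:
            interfere()  # simulate another writer committing first
        try:
            catalog.compare_and_swap(base_version, base_rows + (new_row,))
            return attempt
        except CommitConflict:
            continue  # re-read the latest state and try again
    raise CommitConflict("gave up after retries")


catalog = Catalog()


def other_writer():
    catalog.compare_and_swap(catalog.version, catalog.rows + ("from-writer-B",))


attempts = append_with_retry(catalog, "from-writer-A", interfere=other_writer)
```

In this run the first attempt conflicts with the other writer, and the second attempt succeeds on top of the other writer’s commit, so neither change is lost.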
Choosing the right open table format based on the vendor
The data industry has woken up to the reality that customers should have direct, open access to the underlying files that store their data. Storing data in open formats, specifically Apache Iceberg, in an object storage lake has enabled this. This is ultimately good news for customers and helps to reduce vendor lock-in.
Starburst was founded on the idea that data should always belong to the customer. With recent announcements (Databricks acquiring Tabular, the company founded by the creators of Apache Iceberg, and Snowflake unveiling Polaris, an open-source implementation of the Iceberg REST catalog), there is now huge potential to improve price-performance by choosing the best engine for the job. Databricks and Snowflake have finally recognized this basic customer need.
However, Starburst believes in optionality. That’s only possible with a truly open data lakehouse, a.k.a. the Icehouse. An Icehouse architecture is based on Trino for high-performance SQL querying (read and write) at scale and Iceberg for storage. It complements your Snowflake + Iceberg solution and can help you significantly lower your operating costs.