Open Table Formats

Open table formats are designed to provide enhanced performance and compliance capabilities for data lakes using cloud-based object storage.

An open table format provides a layer of abstraction on top of a data lake. This allows data to be managed and optimized more efficiently. At the same time, the increased structure allows for additional features.

Apache Hive introduced the first data lake table format. However, this format had a number of functionality limitations owing to its architecture. In recent years, new table formats have pushed data lakes in new directions, increasing both functionality and performance.

Importantly, these table formats allow data lakes to achieve some of the efficiencies and compliance standards more typically associated with data warehouses or databases, while retaining the versatility of a data lake.

This includes enhanced ACID compliance, the ability to record transactional data efficiently, improved scalability, and the ability to update or delete records. These advancements are so significant that it is often said that data lakes using these technologies become more like data lakehouses, mixing the versatility of data lakes to handle raw and semi-structured data with the ability to process transactional workloads like a data warehouse.

Let’s explore how newer table formats have improved the versatility of data lakes by adding additional functionality. Particular attention will be paid to improvements in transactional data collection within data lakes.

Features of modern open table formats

While each table format varies, they all extend the features of a data lake in similar ways.

These include:

Full CRUD operations

Data lakes use either HDFS or object storage. Both hold data in an  immutable format. These have not typically provided an easy way to update files incrementally. If you consider that a database typically includes the ability to Create, Read, Update, or Delete (CRUD), a data lake has often only included the first two. Modern table formats help fix this by allowing the ability to update and delete records.

Improved performance and scalability

Data lakes tend to easily grow in size, and many are very, very large. With this, comes the need to scale their analytic capabilities to match. Newer table formats allow increased scalability when compared to Hive by introducing a new way of recording data at the file level. This is a marked improvement over Hive’s record keeping approach, which organized data by folder. This means that if a query requires data on a specific subset of data, it can search the specific files containing the information rather than searching the whole folder. This vastly improves performance and efficiency.

Transactional support and ACID compliance

With the inclusion of ACID capabilities in table formats like Iceberg and Delta Lake, users can now achieve a level of transactional awareness within a data lake. This does not necessarily mean that a data lake would be a replacement for an OLTP system. However, it does ensure that groups of updates transactionally complete together or are rolled back if they cannot be completed. This is helpful with some of today’s evolving ETL pipelines.

3 Types of modern open table formats

We take a closer look at modern open table formats: Apache Iceberg, Delta Lake, and Apache Hudi below.

1. Apache Iceberg

Apache Iceberg is an open source table format used to structure the data held in data lakes. Like the other table formats listed, it was developed to solve the challenges of performance, data modification, and CRUD operations in the data lake. It can be used with either HDFS or any object-based cloud platform, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.

Notably, Iceberg also offers schema evolution, schema partitioning, and time travel. This allows users to apply and update schemas, apply and update partitions, and enact version control to roll back changes to a system to a previous state. All of these adaptations push the data lake to a new level of functionality and create new use-cases for data lakes.

Demo: Iceberg and Trino

In this exciting exploration, we’re delving into the powerful combination of Apache Iceberg and Trino, two dynamic tools reshaping the landscape of big data. To do this, we’ll use Starburst Galaxy and compare its use to AWS Athena.

2. Delta lake

Delta Lake is an open source framework developed by Databricks. Like other modern table formats, it employs file-level listings, improving the speed of queries considerably compared to the  directory-level listing of Hive.

Like Iceberg, Delta Lake offers enhanced CRUD operations, including the ability to update and delete records in a data lake which would previously have been immutable. It is ACID compliant and often used in transactional systems. This use-case makes data lakes a viable replacement for traditional transactional databases, while retaining the cost and storage benefits of other data lakes.

Delta Lake can be used with Starburst via the Delta Lake connector.

3. Hudi

Apache Hudi is another table format, which is used less often than Iceberg or Delta Lake. It addresses some of the same problems discussed above.

Open table format architecture

Metadata captures changes in state

Architecturally, modern table formats are composed of a set of hierarchically structured metadata files. These files capture changes in the state of the data in the data lake. You can think of a table format as a kind of database transaction log, outlining all of the changes over the life of the data lake. This metadata  is stored in a structured, readily accessible format.

How does this work? Let’s explore how Iceberg uses enhanced metadata collection to deliver additional functionality.

The image below shows how metadata tracks the changes to the dataset. The files held in the Data layer are captured by the metadata files held in the Metadata layer. As the files change, the metadata files attached to them track these changes.

Record snapshots

To achieve this, modern table formats create records pointing to individual metadata file locations. This file is known as a manifest file, and includes metadata containing information about the table at a given point in time. This acts as a kind of snapshot of the table, detailing the points at which a change is made. Multiple manifest files are stored in Manifest lists.

In the image below, changes in the Data layer have been detected in the Metadata layer. A new Manifest file and corresponding Manifest list has been created to capture these changes.

Create up-to-date record of changes

Manifest files and Manifest lists provide the ability to record an accurate, up-to-date account of the changes that have occurred over time. This includes inserts, deletions, updates, schema migrations, and partition changes. The changes themselves are stored in Metadata files known as Snapshots. Each snapshot is like a slice in time, allowing the dataset to be queried as it was multiple instances or rolled back to a previous state.

How does this work?

The image below shows how the changes in the Data layer have created a new Snapshot file, Snapshot 1. The original Snapshot file, Snapshot 0, is also retained. This creates a series of snapshots, each mapping changes to the data and recording those changes in the Metadata layer.

Open table formats vs Open file formats

Table and file formats are different open-source elements of an open data lakehouse. Columnar open file formats like Parquet and ORC ensure data within an object gets written in ways that optimize query performance; while open table formats like Iceberg sit above the files and objects, providing a layer of rich metadata to enable analytics on the underlying data lake.

Open table format feature Apache Iceberg Delta Lake Apache Hudi Apache Hive
Transaction support (ACID) Yes Yes Yes Limited 
File format Parquet, ORC, Avro Parquet Parquet, ORC, Avro Parquet, ORC, Avro, and more
Schema evolution Full Partial Full Partial 
Partition evolution Yes No No No
Data versioning Yes Yes Yes No
Time travel queries Yes Yes Yes No
Concurrency control Optimistic locking Optimistic locking Optimistic locking Pessimistic locking
Object store cost optimization Yes Yes Yes No
Community and ecosystem Growing  Growing Growing Established

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.