
What’s the difference between Iceberg and Delta Lake?

Iceberg is interoperable with a broader range of other open-source technologies. For example, Iceberg supports Avro, ORC, and Parquet data formats, while Delta Lake only supports Parquet.

Last Updated: May 1, 2024

Apache Iceberg and Delta Lake are both members of a new generation of data lakehouse technologies built to drive data warehouse functionality and performance out of inexpensive cloud object storage – think AWS S3 or Azure Blob Storage.

However, with openness baked into its DNA, Apache Iceberg goes much further in delivering the true promise of the cloud-based, open data lakehouse than Delta Lake, which takes an older, more proprietary approach. 

This yields a situation where Delta Lake is primarily beneficial for users already within the Databricks ecosystem. Iceberg, on the other hand, offers advantages for a broader range of users due to its open-source nature. It’s a foundational difference, and one with many implications and consequences for data stacks, not just today but in the future. 

Next generation open data lakehouse technology

In a sense, this is the tale of two data lakehouses. One began as a fresh technological category, charting new territory as a proprietary format before repositioning itself as open source. The other took the next step in functionality, embraced openness from the start, and has a growing chorus of support from across the industry, representing one of the most powerful and impactful Apache projects in recent memory.

If Delta Lake began the data lakehouse movement, Apache Iceberg has truly realized the potential of creating an open data warehouse using the technology of a data lake. You can think of Iceberg as perfecting the data lakehouse format, adding functionality and openness to Delta Lake’s initial offering and backing it up with a dynamic, energized open source community that drives it forward rapidly. This combination of ease of use and openness is one of Apache Iceberg’s defining characteristics.

So while both Delta Lake and Apache Iceberg offer the promise of the data lakehouse, only Iceberg takes a truly open approach. In the end, that makes all the difference.

Towards the data Icehouse

The open data lakehouse built on Apache Iceberg is a real revolution. In fact, at Starburst, we’re so excited by this combination of technologies that we think it constitutes a new era in data analytics, something we call the data Icehouse. So while Apache Iceberg and Delta Lake are both data lakehouse technologies, only Iceberg helps build the Icehouse, with all the accompanying advantages of a data warehouse on a data lake.

Modern, open table formats like Apache Iceberg store more metadata and make better use of that metadata compared to older table formats like Apache Hive. This builds on the traditional data lake’s strength of cost-effective, flexible object storage and adds the functionality of a data lakehouse, with all the performance advantages of a data warehouse. In just a few years, Apache Iceberg adoption has outpaced Delta Lake, the former front-runner in enterprise data architectures.

How? Let’s take a look. 

Apache Iceberg tables and Delta Lake

Apache Hive to Iceberg

Let’s start by addressing what Apache Iceberg and Delta Lake have in common. They are both designed to replace the much older technologies of Hadoop and Hive. If this is the history of data lakes, the first chapter is Hadoop and the second is Hive. But like any early innovation, the first entries made a leap without perfecting the technology. Hive, revolutionary in its day, now struggles with the scale and changeable nature of the large, modern datasets used in today’s enterprises.

This is a problem because Hive workloads still represent the majority of data workloads today. Migrating from Hive to Iceberg is a large but necessary task, and in a sense, you can see the data lakehouse in general as an attempt to develop technologies to replace Hive. 

Data lakehouse metadata collection 

How do they do this? In a word, metadata. All modern table formats – including Delta Lake and Iceberg – aim to address Hive’s shortcomings by collecting, storing, and using metadata in a more concerted and rigorous way. Metadata is the proverbial foundation upon which the data lakehouse is built, allowing for the implementation of features important to modern analytics, like ACID transactions and time travel.

Origins of Apache Iceberg

Apache Iceberg emerged from Netflix, where data teams had grown increasingly frustrated with the Hadoop ecosystem’s limitations. Changing datasets required an enormous amount of coordination to prevent data corruption. They created Iceberg to provide the scale and functionality that Netflix needed to make big data queries reliable, scalable, and useful. Netflix moved Iceberg to the Apache Software Foundation, leading to its adoption by LinkedIn and other data-intensive enterprises.

Iceberg architecture

Iceberg tables have a three-tier structure: 

  • Manifest files
  • Manifest list files
  • Metadata files

Manifest files point to the data files used in the table, along with the partition data and metrics that queries need to retrieve data efficiently. The data itself may be saved as Avro, ORC, or Parquet files.

Manifest list files contain the metadata of a snapshot’s manifests along with other statistics.

A metadata file records an Iceberg table’s schema, partitioning configuration, and other aspects of its state. Iceberg creates a new metadata file when that state changes, replacing the old file atomically. The metadata file also tracks the table’s snapshots, each of which points to a manifest list.

In addition to this, Iceberg uses the Avro file format for manifest and manifest list files and JSON for metadata files.
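
To make these metadata layers concrete, here is a minimal sketch of inspecting them from Spark, assuming a Spark session already configured with an Iceberg catalog named demo and an existing table demo.db.events (both names are hypothetical). Iceberg exposes snapshots, manifests, and files metadata tables for exactly this purpose.

```python
# Minimal sketch: inspecting Iceberg's metadata layers from Spark.
# Assumes the session is already configured with an Iceberg catalog named
# "demo" and that a table demo.db.events exists (hypothetical names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snapshots recorded in the table's current metadata file
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots").show()

# Manifest list entries for the current snapshot
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()

# Data files tracked by the manifests, including per-file partition values
spark.sql("SELECT file_path, record_count, partition FROM demo.db.events.files").show()
```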

Importantly, the open-source project does not specify a particular catalog for managing Iceberg tables, giving organizations flexibility in building their data architectures. Ready-to-use implementations include REST catalogs, the Hive Metastore, JDBC databases, and Nessie.
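
As an illustration of that catalog flexibility, the following sketch connects to a REST catalog with the PyIceberg library; the catalog name, endpoint URI, warehouse path, and table identifier are all assumptions made for the example.

```python
# Minimal sketch: loading an Iceberg table through a REST catalog with PyIceberg.
# The endpoint, warehouse location, and table identifier below are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",           # assumed REST catalog endpoint
        "warehouse": "s3://my-bucket/warehouse",   # assumed object storage location
    },
)

table = catalog.load_table("db.events")  # hypothetical namespace.table
print(table.schema())
print(table.current_snapshot())
```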

What is Delta Lake?

Delta Lake began as the format for Databricks’ proprietary cloud analytics platform but is now an open-source project governed by the Linux Foundation. Databricks’ founders helped create the open-source query engine Apache Spark.

Despite differences in transaction strategies, both Iceberg and Delta Lake emphasize the importance of metadata for efficient data management. Specifically, Delta Lake logs changes to a table’s data and metadata as JSON-formatted delta logs. By maintaining a record of every change, these delta logs power Delta Lake’s functionality. 

To improve data lakehouse performance, Delta Lake creates Parquet-formatted checkpoint files that group and summarize older delta logs as permanent records of a delta table’s change history.
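
For comparison, here is a minimal sketch of reading that change history with the delta-spark Python package, assuming a Spark session already configured for Delta Lake and a hypothetical metastore table named events.

```python
# Minimal sketch: viewing a Delta table's change history.
# Assumes the session is already configured with delta-spark and that a
# table named "events" is registered in the metastore (hypothetical name).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

dt = DeltaTable.forName(spark, "events")

# Each row summarizes one JSON commit recorded in the _delta_log directory
dt.history().select("version", "timestamp", "operation").show()
```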

Differences between Iceberg and Delta Lake

The different ways each table format supports features like ACID transactions or schema evolution can impact how you implement a data lakehouse. However, the biggest difference comes from the context of each framework’s development.

Simply put, getting the most out of Delta Lake requires locking yourself deeply into the Databricks ecosystem. The table format’s legacy as a proprietary technology, though now formally ended, still tethers Delta Lake more tightly to a single vendor, a single company, and a single outlook. Though users are often able to achieve powerful results with Delta Lake, those results come at the cost of openness and risk setting up problems further down the line. This is the old centralization problem yet again, and it points once more to the value of openness when constructing a data stack.

In contrast, Iceberg is all-in on its commitment to open source methodology, which helps to guide the project in exciting and dynamic ways. What’s more, Iceberg is interoperable with a broader range of other open-source technologies. For example, Iceberg supports Avro, ORC, and Parquet data formats, while Delta Lake only supports Parquet.

Although created at Netflix, Iceberg’s cultivation within the Apache Software Foundation has given it a robust and diverse developer community. Over time, this advantage will only become more pronounced. Delta Lake will push further and further into a single ecosystem, even as Iceberg becomes the modern default for new technologies across the board. 

There is an interesting problem at work here regarding future-proofing. Is it better to stick with the technology optimized for a single company or platform, or to opt for an open system that evolves and changes with the times? We believe that the latter is best, and that’s why we’ve singled out Iceberg-based data lakehouses as their own category – the ‘Icehouse’.

How does Iceberg compare to Delta Lake in terms of ACID transactions and data versioning?

In general, Iceberg and Delta Lake differ in their approach to ACID transactions and data versioning. Iceberg employs a merge-on-read strategy, while Delta Lake uses merge-on-write, each with its own implications for performance and data management.

Have a look through the following table to learn more. 

Atomicity
  • Iceberg: State changes result in the atomic replacement of the metadata file
  • Delta Lake: Transaction logs only record data files once a transaction completes successfully

Consistency
  • Iceberg: Optimistic concurrency locking
  • Delta Lake: Optimistic concurrency controls

Isolation
  • Iceberg: Serializable isolation levels; snapshots
  • Delta Lake: Serializable isolation levels; snapshots

Durability
  • Iceberg: All transactions are permanent; snapshots allow roll-backs; durability is inherited from the cloud service provider
  • Delta Lake: All transactions are permanent; snapshots allow roll-backs; durability is inherited from the cloud service provider
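
To show what the snapshot entries above mean in practice, here is a hedged sketch of time travel and roll-back on an Iceberg table from Spark SQL; the catalog, table name, and snapshot id are made up for illustration.

```python
# Minimal sketch: Iceberg time travel and snapshot roll-back from Spark SQL.
# Assumes an Iceberg-enabled Spark session with a catalog named "demo";
# the table name and snapshot id are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query the table as it existed at an earlier snapshot (time travel)
spark.sql("SELECT count(*) FROM demo.db.events VERSION AS OF 4357281937913173738").show()

# Roll the table back to that snapshot using Iceberg's Spark procedure
spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 4357281937913173738)")
```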

How do Iceberg and Delta Lake handle schema evolution, metadata management, and partition evolution?

Schema evolution
  • Iceberg: Saves schema changes as metadata
  • Delta Lake: Schemas are stored in the delta log

Metadata management
  • Iceberg: Table metadata is stored in manifest files, manifest list files, and metadata files
  • Delta Lake: Table metadata is stored in the delta log

Partition evolution
  • Iceberg: Has included this feature from the outset (see the sketch after this table). Although not always needed, it is an important tool in the data engineering arsenal.
  • Delta Lake: Was late to the partition evolution game and includes less support for the feature at its core.
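
As a concrete illustration of partition evolution, the sketch below switches an Iceberg table from monthly to daily partitioning without rewriting existing data; it assumes an Iceberg-enabled Spark session with the Iceberg SQL extensions and a hypothetical table demo.db.events that was originally partitioned by months(ts).

```python
# Minimal sketch: evolving an Iceberg table's partition spec in place.
# Assumes an Iceberg-enabled Spark session with Iceberg's SQL extensions and
# a hypothetical table demo.db.events originally partitioned by months(ts).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# New data will be written with daily partitions; existing files keep their
# old layout, and Iceberg plans queries across both partition specs.
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD months(ts)")
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD days(ts)")
```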

Start building your open data lakehouse

Traditional data warehouses, even those like Snowflake running in the cloud, struggle to keep pace with the scale and dynamism of big data processing. A lot of data, including data used for machine learning and artificial intelligence (AI) applications, is either semi-structured or unstructured. This is a problem for data warehouses because such data must undergo costly ETL before it even enters the warehouse. Once there, analytics are performant, but expensive.

Data lakes were once considered the low-cost solution to this problem, but they lacked the analytics performance of proprietary warehouse systems. An open data lakehouse combines the best aspects of each to create a data platform that is performant, scalable, accessible, and cost-effective.

Amazon S3, Azure Blob Storage, Google Cloud Storage, and Iceberg

Cloud object storage forms the foundation of a data lakehouse. Services like Microsoft Azure Blob Storage, AWS S3, and GCP Cloud Storage offer low-cost storage, enterprise-class infrastructure, and global footprints that few organizations could develop on their own.

Performance of a data warehouse

Layering open file and table formats on top of a data lake’s object storage provides the rich metadata needed to query petabyte-scale datasets quickly and efficiently.

Scale of a data lake

Commodity object storage gives data lakes the scalability lacking in traditional data warehouses. By decoupling storage and compute, data management teams can optimize the former without compromising the latter. Storage infrastructure can grow in line with data volume trends. Meanwhile, compute spending can scale up and down with demand.

Open source Spark and Trino SQL query engines

An open query engine is the final element of an open data lakehouse. Options like Spark and Trino (formerly Presto) use parallel processing on enormous scales to maximize performance while leveraging the lakehouse’s rich metadata to access the most relevant data, reducing compute costs and system latency.

Impact of a query engine on ETL and data pipelines

These query engines also provide optionality to mix and match the best technologies for your data stack. Data engineering teams can choose the most appropriate engine to power their ETL data pipelines, and data consumers can keep using the analytics tools they already know since Spark and Trino both support SQL.
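
As a small example of that mix-and-match approach, the sketch below runs a single ETL step in Spark and lands the result in an Iceberg table that Spark or Trino could then query; the source path, catalog, and table names are assumptions.

```python
# Minimal sketch: one ETL step on a lakehouse query engine, writing to Iceberg.
# Assumes an Iceberg-enabled Spark session; the source path and table names
# are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Extract: read raw Parquet files from object storage (assumed location)
raw = spark.read.parquet("s3://my-bucket/raw/orders/")

# Transform and load: keep completed orders and write them to an Iceberg
# table that any engine connected to the same catalog can query
raw.filter("order_status = 'COMPLETE'") \
   .writeTo("demo.db.orders_clean") \
   .using("iceberg") \
   .createOrReplace()
```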

What table format should I choose for my data lake?

Your choice will depend on the nature of your organization’s data analytics use cases. Historically, Delta Lake has had a better handle on ingestion from real-time sources like Kafka. On the other hand, Iceberg supports more file formats and is truer to the principles of open source.

Performance, openness, and compatibility have driven Iceberg’s rapid adoption. That’s one of the reasons why Starburst Galaxy, our fully managed, Trino-based open data lakehouse analytics platform, makes Iceberg its default table format.

