The difference between Hudi and Iceberg

Hudi and Iceberg are Apache Software Foundation open-source projects that improve the performance of big data architectures.

Last Updated: June 13, 2024

These open table formats address the issues companies increasingly experience with their legacy platforms running Hadoop and Hive. This article will discuss the differences between Hudi and Iceberg and explain how Iceberg is becoming the cornerstone for modern data lakehouse analytics.

Apache Hudi

The Apache Hudi project got its start in 2016 at Uber. The ridesharing company had built a data lake on Hadoop and Hive, but the batch processing pipelines took hours to complete. Traditional streaming stacks excelled at processing row-based data but could not handle the lake’s columnar data. Uber’s solution became Hudi, an incremental processing stack that reduces ingestion latency from hours to minutes.

Apache Iceberg

Around the same time Uber was struggling with Hive, Netflix faced a different set of issues. Hive did not handle changes well, so the streaming service needed a new table format that supported ACID (atomicity, consistency, isolation, durability) transactions. Since Iceberg became an open-source project, its tables have increasingly become the preferred choice for data lakes thanks to benefits like:

  • Scalability and performance
  • ACID transactions
  • Schema evolution
  • Time travel

Iceberg also provides significant optionality. It supports open file formats like Avro, ORC, and Parquet. The table format also lets users simultaneously use different query engines, such as Flink, Apache Spark, and Trino.
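To illustrate, in Trino's SQL dialect the underlying file format is just a table property at creation time. This is a minimal sketch; the `iceberg` catalog name, `analytics` schema, and column names are hypothetical, not from the article:

```sql
-- Create an Iceberg table through Trino, choosing the underlying file format.
-- 'iceberg' is an assumed catalog name; 'analytics' is an assumed schema.
CREATE TABLE iceberg.analytics.orders (
    order_id bigint,
    cust_id  bigint,
    order_ts timestamp(6) with time zone,
    total    decimal(12, 2)
)
WITH (
    format = 'PARQUET'  -- ORC and Avro are also supported
);
```

Because the table's state lives in Iceberg metadata rather than in any one engine, the same table remains readable from Spark or Flink through a shared catalog.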


Open table formats: Iceberg, Hudi, Delta Lake

Modern open table formats are essential to maximizing the potential of data lakes by supporting a data warehouse’s processing and analytics capabilities on commodity-priced cloud object storage. Organizations can use Iceberg and Hudi with any Hadoop or other distributed file systems. Another open table format, Delta Lake, is also an option but tends to be used within Databricks platforms.

How does Apache Iceberg handle schema evolution compared to Apache Hudi?

Shifting business priorities and a dynamic data environment frequently require changes to table schema. Older formats like Apache Hive impose significant performance penalties by rewriting entire files in ways that impact existing queries. Schema evolution is one of the key features enabled by modern table formats.

Hudi tables, depending on their configuration, use one of two approaches to schema evolution. Copy On Write (COW) stores data in columnar file formats like Parquet and performs updates by rewriting files to a new version. COW is the default approach, proven at scale with high-performance query engines like Trino.

Hudi’s experimental Merge on Read (MOR) approach combines columnar data files with row-based files like Avro to log changes for later compaction. MOR provides greater flexibility, especially for changes to nested columns.
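In Hudi's Spark SQL syntax, the table type is chosen at creation time. A minimal sketch, with hypothetical table and column names:

```sql
-- A Merge on Read Hudi table created via Spark SQL;
-- type = 'cow' would select Copy On Write instead.
CREATE TABLE hudi_orders (
    order_id   bigint,
    total      decimal(12, 2),
    updated_at timestamp
)
USING hudi
TBLPROPERTIES (
    type = 'mor',                   -- table type: 'cow' (default) or 'mor'
    primaryKey = 'order_id',        -- record key used for upserts
    preCombineField = 'updated_at'  -- field used to pick the latest version of a record
);
```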

Iceberg uses in-place schema evolution to add, remove, rename, update, and reorder columns without table rewrites. Data files don’t have to be touched because changes are recorded within the table’s metadata. In effect, Iceberg provides a transaction log for each table that includes snapshots of the included data files, statistics to improve query performance, and any changes from previous versions.
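In Trino, for instance, these schema changes are each a single metadata operation. The table and column names below are hypothetical:

```sql
-- Each statement updates Iceberg metadata only; no data files are rewritten.
ALTER TABLE orders ADD COLUMN discount_code varchar;
ALTER TABLE orders RENAME COLUMN cust_id TO customer_id;
ALTER TABLE orders DROP COLUMN discount_code;
```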

Iceberg tables and time travel

Iceberg’s metadata-based approach to documenting change enables time travel: the ability for queries to access historical data. Every change results in a new snapshot that captures the table’s current state, and Iceberg tables keep their old snapshots. Queries can access the table’s list of snapshots to return results from older versions. Rollback is a common use case for Iceberg’s time travel functionality, allowing a table to be restored to a previous state after a mistaken change.
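In Trino's SQL dialect, time travel might look like the following. The table name and snapshot ID are made up, and the rollback procedure syntax varies by engine and version:

```sql
-- List a table's snapshots from Iceberg's metadata tables.
SELECT snapshot_id, committed_at
FROM "orders$snapshots";

-- Query the table as of an older snapshot or point in time.
SELECT * FROM orders FOR VERSION AS OF 8954597067493422955;
SELECT * FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';

-- Restore the table to that snapshot after a mistaken change.
ALTER TABLE orders EXECUTE rollback_to_snapshot(8954597067493422955);
```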

Apache Iceberg vs Hudi

| Feature | Apache Iceberg | Apache Hudi |
| --- | --- | --- |
| Transaction support (ACID) | Yes | Yes |
| File format | Parquet, ORC, Avro | Parquet, ORC, Avro |
| Schema evolution | Full | Full |
| Partition evolution | Yes | No |
| Data versioning | Yes | Yes |
| Time travel queries | Yes | Yes |
| Concurrency control | Optimistic locking | Optimistic locking |
| Object store cost optimization | Yes | Yes |
| Community and ecosystem | Growing | Growing |

What table format should data engineers choose for their data lake?

Each table format brings its own advantages and disadvantages, which data engineering teams need to factor into designing their data architectures. Hudi’s origins as a solution to Uber’s data ingestion challenges make it a good choice when you need to optimize data processing pipelines. In contrast, Netflix developed Iceberg to simplify the big data management issues of the Hadoop and Hive ecosystem. As such, migrating to Iceberg tables is ideal for storing large datasets in a data lake.

Iceberg with the Trino MPP SQL query engine and Apache Spark

As mentioned earlier, Iceberg lets different query engines access tables concurrently, allowing data teams to use the most appropriate engine for the job. Trino, a fork of Presto, is a massively parallel processing SQL query engine that uses connectors to query large datasets distributed across different sources.

Trino’s Iceberg connector provides full access to Iceberg tables by simply configuring access to a catalog like the Hive Metastore, AWS Glue, a JDBC catalog, or a REST catalog. Trino will connect to Azure Storage, Google Cloud Storage, Amazon S3, or legacy Hadoop platforms.
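A minimal Trino catalog configuration for Iceberg might look like this sketch, which assumes a Hive Metastore as the catalog; the metastore URI is a placeholder:

```
# etc/catalog/iceberg.properties — assumes a Hive Metastore as the Iceberg catalog
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://metastore.example.com:9083
```

Swapping `iceberg.catalog.type` to `glue`, `jdbc`, or `rest` (with the corresponding connection properties) selects one of the other supported catalogs.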

Amazon S3, AWS and Iceberg

AWS services like Athena, EMR, and Glue support Iceberg tables to varying degrees. Athena, for example, requires Iceberg tables to store data in Parquet files and will only work with Glue catalogs.

What is Iceberg in Snowflake?

Snowflake is a proprietary cloud-based data warehouse solution. Recently, the company began developing support for Iceberg, now in public preview. Snowflake’s users can configure the system to be Iceberg’s metadata catalog or use Snowflake to pull snapshots from either a Glue catalog or directly from an object store.

Getting the most out of Snowflake’s implementation, however, requires integrating Iceberg’s metadata into Snowflake at the risk of greater vendor lock-in.

Start building your open data lakehouse powered by Iceberg table formats

Starburst Galaxy is a modern data lakehouse analytics platform founded by the creators of Trino. With features like federation, near-real-time ingestion, accelerated SQL analytics, and more than fifty connectors, Galaxy unifies enterprise data within a single point of access. Big data becomes easier to manage across a globally distributed architecture, improving compliance with GDPR and other data regulations. At the same time, Galaxy makes data more accessible since data consumers can use ANSI-standard SQL or business intelligence tools to query data anywhere in the organization.

Performance of a data warehouse

Starburst builds upon Trino’s massively parallel processing query engine to give data lakehouses the analytics performance of proprietary data warehouses.

A cost-based optimizer takes SQL queries and evaluates the performance and cost implications of different execution plans, choosing the ideal option to meet data teams’ business criteria.

Starburst’s Cached Views create snapshots of frequently requested query results to reduce costs and apparent latency. From the user’s perspective, the materialized views are indistinguishable from a fresh query run. And with incremental updates, the cached data remains current.

Additional performance features like pushdown queries and dynamic filtering complete queries faster while also reducing network traffic.

Scale of a data lake

Starburst Galaxy fully enables the scalability enterprises need from their data architectures. A data lake’s object storage provides a central repository for ad hoc, interactive, and advanced analytics. However, it can never hold all the data relevant to insight generation.

By federating data lakes and other data sources within a unified access layer, Starburst Galaxy turns the organization’s entire architecture into a distributed data lakehouse.

Starburst Gravity is the platform’s universal discovery, governance, and sharing layer. Gravity’s automatic cataloging system consolidates metadata from every source, turning Starburst into a central access hub across clouds, regions, and sources.

Gravity provides role-based and attribute-based access controls to streamline governance and support fine-grained access policies down to the row and column levels.

The advantages of combining the strengths of Starburst’s analytics platform with the benefits of the Iceberg table format are so strong that Iceberg is the default table format when creating tables in Starburst Galaxy.

What are some next steps you can take?

Below are three ways you can continue your journey to accelerate data access at your company:

  1. Schedule a demo with us to see Starburst Galaxy in action.

  2. Automate the Icehouse: our fully managed open lakehouse platform.

  3. Follow us on YouTube, LinkedIn, and X (Twitter).
