Why Apache Iceberg will accelerate competition for compute engines

Strategy

June 13, 2024

Evan Smith

Technical Content Manager

Starburst Data

Why choose Apache Iceberg over Databricks’ Delta Lake

Apache Iceberg emerged last week triumphant, having won the race to become king of the data lakehouse.

In many ways, this was a long time coming.

No longer the heir presumptive of the table format world, no longer the interesting new technology worth exploring, Iceberg is now the reigning monarch of the open data lakehouse following simultaneous adoption from both Snowflake and Databricks.

It’s a shift whose time has come at last, and one that we’ve written about and anticipated before, especially given Iceberg’s close connection to Starburst and Trino.

But no matter what data platform you use, the move to Iceberg will have big implications, and its impact will push beyond table formats alone. In fact, the openness that Iceberg facilitates promises to kickstart an arms race for compute engines capable of operating at scale on open data stacks.

May the best compute engine win?

Open compute engines for the open data stack

In this sense, the more things change, the more they stay the same.

Just as the industry converges on an open table format, a new battle begins around open compute. The king is dead; long live the king.

In an open architecture where everyone is using Iceberg, the next logical question to ask is: which compute engine performs best with Iceberg?

Here, Iceberg’s openness promises to put compute engines on an equal footing for the first time, setting up a more direct comparison between them. This raises one fundamental question.

Have open table formats given rise to open compute?

Read on.

How Apache Iceberg won the table format race

Let’s answer the question by looking back at how Iceberg won the table format race in the first place.

Last week was a watershed moment, but Apache Iceberg isn’t new. In fact, it had already been talked about across the industry for some time, including by Starburst.

What Iceberg really did last week was win over the holdouts, namely Snowflake and Databricks.

Let’s look at how each embraced openness one by one.

Snowflake Polaris

The first win for openness was Snowflake’s Polaris announcement. It outlined the creation of their own REST catalog centered on Iceberg. This opens the Snowflake platform, famous for its proprietary, closed approach, to direct compute competition for the first time, including from Trino. This is a seismic event and one that promises to have many Snowflake users asking where they can get the best compute costs.

Databricks and Tabular

The second win for openness was Databricks’ acquisition of Tabular, a move primarily designed to secure a foothold in the Iceberg ecosystem and integrate a REST implementation of Iceberg into their Unity Catalog. The move promises to create a single catalog access level across Delta Lake and Iceberg. This will certainly change things for Databricks, which until now has always championed Delta Lake.

Apache Iceberg > Delta Lake

One way of interpreting this news is that Delta Lake has lost, or at least been relegated to second place.

In many ways, this was Delta Lake’s fight to lose.

Not long ago, it seemed like Delta Lake would win this battle and be the default table format for the data lakehouse era, the one that would unseat the data warehouse from its traditional place of dominance. But even though Delta Lake had many of the same features as Iceberg, its technological DNA was too bound up with a single ecosystem, Databricks.

It lacked the backing of a dynamic open-source community and didn’t fit well within an open data architecture. Starburst Galaxy could access it alongside Iceberg, but not every engine could.

The Apache Iceberg era

In this sense, Iceberg won because it provided a combination of attributes that other table formats couldn’t: functionality approximating a data warehouse, delivered on an open data stack that champions the separation of storage and compute.

Both of these attributes will define the Apache Iceberg era.

Let’s look at them one by one.

Iceberg lets the data lakehouse beat the data warehouse

Iceberg won because it has fewer drawbacks compared to traditional data warehouses. In fact, its enhanced metadata collection allows for features that are usually only associated with data warehouses, including ACID compliance, support for transactional data, time travel, and schema evolution.

This removes what had been a tradeoff between the high cost and superior features of a data warehouse and the low cost but less feature-rich data lake. With Iceberg, you don’t have to choose, and that’s a big deal. It allows businesses to embrace cloud object storage, which is so much cheaper than other storage methods, and swap in whatever compute engine works for them.
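To make the metadata point concrete, here is a minimal Python sketch of how an append-only log of snapshots can support time travel and schema evolution. It is a toy model under stated assumptions: the names ToyTable, Snapshot, commit, and as_of are hypothetical, and the structure is not Iceberg’s actual metadata layout or any library’s API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: a schema plus the data files it references."""
    snapshot_id: int
    schema: Tuple[str, ...]
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table's metadata log (not Iceberg's real layout)."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, schema, new_files=()):
        """Append a new snapshot; earlier snapshots are never modified."""
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            schema=tuple(schema),
            data_files=previous + tuple(new_files),
        )
        self.snapshots.append(snap)
        return snap

    def current(self):
        """Return the latest snapshot."""
        return self.snapshots[-1]

    def as_of(self, snapshot_id):
        """Time travel: look up an older table version by its snapshot id."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise KeyError(snapshot_id)


table = ToyTable()
table.commit(["id", "name"], ["file-a.parquet"])
table.commit(["id", "name"], ["file-b.parquet"])
# Schema evolution in this toy model is a metadata-only commit: no files are rewritten.
table.commit(["id", "name", "email"])

old = table.as_of(1)   # the table as it looked after the first commit
new = table.current()  # the table as it looks now
```

Because every commit only appends a new snapshot, any engine that can read the log sees the same table history, which is the property that lets storage stay fixed while compute is swapped.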

Iceberg is open

Although every table format makes claims to being open in one way or another, Apache Iceberg embraced an open data architecture at its core.

This is an important distinction. Iceberg isn’t just open-sourced. It represents a paradigm shift from a monolithic data platform that provides an end-to-end experience to a model where data pipelines are composed of interoperable components that can be swapped out as needed. You can see this in the wide range of contributors to the Apache project, including companies such as AWS, Starburst, and Apple.

Now, even the two most monolithic data platforms in the industry–Snowflake and Databricks–have also embraced an open approach. This move opens up the data stack in a way that few other moves could. It repositions the whole industry for a new, reinvigorated competition between compute engines, each vying to process the workloads of Iceberg tables.

The new battleground over open compute engines

Iceberg levels the playing field and will encourage many organizations to adopt an open data stack. Its openness can’t be worked around and can’t be avoided. The big players have had to embrace it.

Why consensus on table formats means more competition on compute

This opens up a new front going forward. If everyone is using Iceberg, then everyone is using some version of an open data stack, and that means the ability to swap compute engines in or out.

The question then becomes, which compute engine will come to dominate? This is the question that many businesses will ask themselves in the coming months, and it signals the next big fault line in the data industry.

Which compute engine works best on Iceberg?

To answer the question, you have to look at Iceberg’s architecture and ask which technologies work best in conjunction with it. Trying to get a leg up in this space is almost certainly why Databricks acquired Tabular, but this is likely to be a far more open competition than a battle between a few platforms.

Why Starburst Galaxy performs best on Iceberg

All of this is good news for Starburst Galaxy. It was designed to compete in exactly these conditions: an open data architecture built on Apache Iceberg that also supports Delta Lake, Hudi, and Hive.

Trino

One key reason for this is Trino, the engine powering Starburst Galaxy. Apache Iceberg was developed at Netflix, where it was originally built to run on Trino, and the two technologies have been closely linked ever since.

As the world increasingly moves to embrace Iceberg, Trino is well-positioned to move into this evolving space. Compared to other compute engines, Trino works particularly well with Iceberg, and the connection between the two technologies runs deep.

Icehouse Architecture

Starburst has been so excited about the combination of Trino and Iceberg that we’ve even given it a special name, the Icehouse.

An Icehouse architecture consists of Trino and Iceberg at its core, but also includes four key components using Starburst Galaxy:

Data ingestion
Data governance
Data management
Automatic capacity management

This perfectly fits the needs of the moment, matching an industry-wide shift towards an open architecture with technology designed to work best on Iceberg.

Open data architecture is the modern data stack

With the whole industry shifting towards a more open data stack using Iceberg, a more direct competition between compute technologies will be the industry’s next battleground. This is a significant shift, with far-reaching implications that won’t be immediately apparent.

But as data engineers survey the landscape and review which technologies offer the best performance on Apache Iceberg, the Starburst Icehouse architecture is perfectly positioned to fill the need created for a compute engine that can scale, handle multiple data sources using data federation, and perform well on Iceberg.

In the end, the Icehouse may win the compute wars in the same way that Iceberg did with the table format wars.
