The Icehouse Manifesto: Building an Open Lakehouse

Last Updated: April 15, 2024

There is a revolution brewing. Dave Vellante and George Gilbert sounded the alarm in their October 15th, 2023 article, identifying a new data platform gaining momentum. We believe this new platform is the foundation for an open lakehouse.

At the heart of this revolution is a powerful idea that we’ve been writing and speaking about for years: optionality. Optionality is possible only with an open architecture and open components. Specifically, two components form the kernel of this data platform: an open query engine, Trino, and an open table format, Iceberg. We call this platform the Icehouse.

For over 40 years, data warehouse vendors (predecessors to the modern lakehouse) have locked customers into proprietary data formats that could only be accessed through their software. Then they turned the screws and took your money.

Even with the advent of “storage compute separation” touted by cloud data warehouse (CDW) vendors, the story has always been limited to separating their compute from their storage. This allowed elastic scalability but did nothing to change the fact that every bit and byte ingested into a CDW was accessible only by that CDW. Add vendor-specific, non-ANSI SQL syntax to the equation, and customers are locked into these proprietary platforms, with switching time and costs measured in years and millions of dollars. We have seen this movie over and over again. For the first time, that is changing.

The Icehouse: An open lakehouse with two key ingredients

Ingredient #1: Trino as the Open Query Engine

When my Starburst co-founders Martin, Dain, and David first created Trino (originally named Presto; the PrestoSQL fork was renamed Trino in 2020), their goal was to develop an open-source SQL query engine able to handle Facebook’s enormous petabyte-scale data warehouse. With unmatched query performance at scale, it quickly replaced Hive and allowed Facebook’s employees to run fast interactive queries on 300+ petabytes of data.

In 2019, Martin, Dain, and David left Facebook to focus their efforts on running Trino as an open and independent community project, and Starburst was born to support it. In just 4 years, the data engineering community has rallied behind Trino, making it the most popular open-source SQL query engine for data lakes in existence. Today, Trino is used by tens of thousands of the most sophisticated companies on the planet, including well-known organizations like Electronic Arts, Goldman Sachs, LinkedIn, Lyft, Netflix, Nielsen, Salesforce, and Stripe.

However, the legacy Hive table format made it hard to give Trino users a full data warehouse experience on the lake. Specifically, inserting, updating, and deleting rows (as well as making schema changes) in Hive tables is costly, typically requiring a full rewrite of the table. It is also error-prone, provides no guarantees around data consistency, and requires users to have technical knowledge of topics like partitioning schemes and file formats.

Ingredient #2: Iceberg as the Open Table Format

Table formats like Iceberg, Hudi, and Databricks Delta Lake were created to provide a warehouse-like data management experience (insert, update, delete, and so on) for data lakes. A table format is a metadata layer over file formats like Parquet or ORC, meaning a single table (e.g., Customers) consists of metadata plus multiple data files. While Trino supports all of the major table formats, Iceberg, originally created by Netflix, is the winner here because of its unmatched vendor support.
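To make "metadata plus multiple files" concrete, here is a minimal Python sketch of a toy table format. It is not Iceberg's actual metadata layout; the class names (Snapshot, TableMetadata), the append_files method, and the file paths are hypothetical and exist only for illustration. The idea it shows is that a table is a list of immutable snapshots, each pointing at a set of data files, and that adding data means adding a snapshot rather than rewriting existing files.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table: an ID plus the data files it references."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class TableMetadata:
    """Hypothetical, simplified metadata for one table (e.g., Customers)."""
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Optional[Snapshot]:
        """Return the latest snapshot, or None for an empty table."""
        return self.snapshots[-1] if self.snapshots else None

    def append_files(self, new_files: List[str]) -> Snapshot:
        """Record new data files in a new snapshot; existing files are never rewritten."""
        prev = self.current()
        existing = prev.data_files if prev else ()
        snap = Snapshot(len(self.snapshots) + 1, existing + tuple(new_files))
        self.snapshots.append(snap)
        return snap


# Illustrative file names only; real data files would live in object storage.
customers = TableMetadata("customers")
customers.append_files(["data/part-0001.parquet"])
customers.append_files(["data/part-0002.parquet", "data/part-0003.parquet"])

latest = customers.current()
print(latest.snapshot_id, len(latest.data_files))
```

Because older snapshots stay intact in this model, a reader that started on snapshot 1 keeps seeing a consistent set of files while a writer publishes snapshot 2.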

By providing ACID (atomicity, consistency, isolation, durability) properties, Iceberg enables engines like Trino to run standard SQL statements like INSERT, UPDATE, DELETE, and MERGE safely and efficiently against tables in a data lake. Iceberg also removes the need for users to know how the data is physically stored, such as its partitioning scheme or file format.
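One way to picture how a table format can commit changes atomically is optimistic concurrency: a writer prepares a new table state on the side, then publishes it with a single compare-and-swap that succeeds only if no one else committed in the meantime. The Python sketch below illustrates that general technique under simplifying assumptions. The Catalog class, its methods, and commit_insert are hypothetical names, and an in-memory lock stands in for whatever atomic operation a real catalog provides.

```python
import threading
from typing import Tuple


class Catalog:
    """Hypothetical catalog holding one pointer to the current table version."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0
        self._rows: Tuple[str, ...] = ()

    def read(self) -> Tuple[int, Tuple[str, ...]]:
        """Return a consistent (version, rows) pair."""
        with self._lock:
            return self._version, self._rows

    def compare_and_swap(self, expected_version: int, new_rows: Tuple[str, ...]) -> bool:
        """Publish new_rows only if nobody committed since expected_version."""
        with self._lock:
            if self._version != expected_version:
                return False  # another writer won; the caller must retry
            self._rows = new_rows
            self._version += 1
            return True


def commit_insert(catalog: Catalog, row: str, max_retries: int = 5) -> int:
    """Optimistic insert: read the base state, build the new state, swap atomically."""
    for _ in range(max_retries):
        base_version, base_rows = catalog.read()
        if catalog.compare_and_swap(base_version, base_rows + (row,)):
            return base_version + 1
    raise RuntimeError("too many conflicting commits")


catalog = Catalog()
v1 = commit_insert(catalog, "alice")
# A writer still working from outdated version 0 is rejected instead of clobbering data.
stale_ok = catalog.compare_and_swap(0, ("mallory",))
v2 = commit_insert(catalog, "bob")
print(v1, stale_ok, v2, catalog.read())
```

Readers never observe a half-applied change in this model: they see either the state before the swap or the state after it.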

Trino + Iceberg = Foundation for an Icehouse

Now, neither Trino nor Iceberg can provide a data warehouse experience alone. But these two technologies complement each other and, when combined, create a truly open data warehouse on the lake. This open lakehouse architecture is what we call the Icehouse. The concept of the Icehouse has been over a decade in the making. In fact, Netflix developed Iceberg to pair with Trino, which allowed Netflix to migrate off its proprietary data warehouse to a Trino + Iceberg lakehouse.

The foundation for an Icehouse architecture rests on three key tenets:

  1. Trino is used to query the data
  2. Iceberg is the table format stored in the lake
  3. SQL is the language for querying, modifying, and managing tables (i.e. Data Definition Language, Data Manipulation Language, Data Control Language)

Operating an Icehouse

For an end-to-end data warehouse experience, an Icehouse implementation must also provide four core capabilities:

  1. Data ingestion (e.g., streaming and batch)
  2. Data governance (e.g., access control, data lineage, auditing)
  3. Iceberg data management (e.g., compaction, retention, snapshot expiration)
  4. Automatic capacity management (e.g., increase or decrease Trino cluster size)
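As an illustration of the Iceberg data management item above, the following Python sketch shows the kind of decisions a maintenance job might make: which snapshots to expire and which small files to compact together. It is a toy model under stated assumptions, not how any particular product implements maintenance. The names SnapshotInfo, snapshots_to_expire, and plan_compaction, as well as the thresholds used, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SnapshotInfo:
    snapshot_id: int
    age_days: int


def snapshots_to_expire(snapshots: List[SnapshotInfo], max_age_days: int, keep_last: int) -> List[int]:
    """Return IDs older than max_age_days, always protecting the keep_last newest snapshots."""
    newest_first = sorted(snapshots, key=lambda s: s.age_days)
    protected = {s.snapshot_id for s in newest_first[:keep_last]}
    return sorted(
        s.snapshot_id
        for s in snapshots
        if s.age_days > max_age_days and s.snapshot_id not in protected
    )


def plan_compaction(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Greedily group files smaller than target_mb into batches of at most target_mb."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(s for s in file_sizes_mb if s < target_mb):
        if current and current_size + size > target_mb:
            if len(current) > 1:  # a lone small file gains nothing from a rewrite
                batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches


history = [SnapshotInfo(1, 30), SnapshotInfo(2, 10), SnapshotInfo(3, 8), SnapshotInfo(4, 1)]
expired = snapshots_to_expire(history, max_age_days=7, keep_last=2)
batches = plan_compaction([8, 16, 32, 64, 256, 100], target_mb=128)
print(expired, batches)
```

Snapshot 3 is older than the cutoff but survives because it is one of the two newest, which is the usual reason to combine an age limit with a minimum retained count.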

Icehouse use cases

An Icehouse can perform all of the same workloads that you would use a traditional data warehouse for, including:

  • Business Intelligence (e.g., dashboarding)
  • Data transformations and data preparation (e.g., load and transform)
  • Data-driven applications (e.g., in-app analytics)
  • Artificial Intelligence (e.g., quickly provide data for training or scoring)

In summary

We believe that a lakehouse architecture sits at the center of the emerging platform that Vellante and Gilbert wrote about, and further that the Icehouse is that open lakehouse. The lakehouse is the central nervous system of the data platform: it is how you query and store your data, and it is where other key tools like dbt, Dagster, or your favorite BI or development tool go for data.

You can build an Icehouse yourself based on open-source Trino and Iceberg. Or, if you’re looking for an easy button, read more about Starburst Galaxy. In a future blog post, we’ll take you through how we built Galaxy – our Icehouse implementation.

“The move to Starburst and Iceberg has resulted in a 12x reduction in compute costs versus our previous data warehouse. This efficiency allows us to focus our attention on using analytics for revenue-generating opportunities.”

Peter Lim, Sr. Data Engineer, Yello
