What is the Starburst Icehouse Architecture

An Icehouse is a data lakehouse built on top of Iceberg and Trino


Let’s start with the most pressing question the title presents: what’s an Icehouse? An Australian rock band, a brand of beer, a building where you store ice (usually an “ice house,” with a space), or a New Zealand business development center? Well, yes, yes, yes, and yes, but for the purposes of this blog post, we’re talking about a term coined by Starburst CEO Justin Borgman to describe a data lakehouse built on top of Iceberg and Trino. For those who are new to the data space, this raises a few other questions: what’s a data lakehouse, is Starburst the same as the candy company, and what’s Iceberg? Let’s take a step back, and build our way to an understanding of what these things are. Then we can discuss why the Icehouse may be a good solution for your data problems.

The data lakehouse

The data lakehouse is the amalgamation of the best parts of a data lake and a data warehouse. …and we’re going to need to break this down further.

The data warehouse

One of the oldest concepts in big data storage and analytics, the data warehouse should be familiar to most readers who’ve stumbled upon this blog. A data warehouse is a home for structured, analytics-ready data optimized for queries and business intelligence. A well-maintained, organized, centralized data warehouse that stores most of an organization’s data has long been the north star for a large organization’s data engineering team. The struggle is that structuring all of your data to fit within a warehouse is a massive and messy task. And because it generally requires your data to go through an ETL process, it can lead to data duplication, delay when new data becomes accessible, and limit flexibility. Maintaining a data warehouse is a never-ending, expensive, and time-intensive challenge; failing to maintain it adequately can reduce access to data or render it entirely useless. There is still a time and place for data warehouses, but the flaws in cost, scalability, and maintenance have been a pain point since data warehouses have existed.

The data lake

A reaction to the headache of maintaining a rigorous, structured data warehouse, the data lake takes the opposite approach: throw all the data into a lake. It’s what it sounds like. By storing data in its native format, you avoid the headaches and costs of massive ETL workloads and greatly simplify your data stack. The downside is that your data becomes a bit of a mess. When querying a data lake without structure, your queries become more sophisticated and complex, requiring advanced data science skills and tools to transform unstructured data from storage to meaningful analytics and insights. You’re not getting rid of the task of reshaping the data – you’re pushing it downstream. If the shape and format of your unstructured data meanders or drifts over time, supporting and handling all the edge and legacy cases can become a headache or borderline impossible, leaving you with more of a data bog, swamp, or quagmire. If you grew up on the Chesapeake Bay, you might say it gives you a giant data algae bloom. You don’t want that.

Enter the data lakehouse

What if we took the benefits of both the data warehouse and the data lake? Maintain the flexibility of being able to store unstructured data when it makes sense, but be equally willing to apply some structure and rigor to the data that needs some extra attention? Like a data lake, a data lakehouse is intended to capture all of your data in a single, low-cost cloud object store, while the “house” part enables transactions, transformations, and restructuring of data with ACID (atomicity, consistency, isolation, and durability) properties to glean many of the benefits of a traditional data warehouse. Because the data lives in one place rather than being copied into a separate system, the risk of duplication is greatly reduced, and with some active maintenance, old data shouldn’t become unintelligible or require overly complex queries to understand. Data lakehouses store much more metadata, enabling record-keeping, transaction tracking, and the ability to roll back to or view snapshots of past data. This introduces complexity, especially if you’re trying to build a lakehouse from scratch, which is why many companies sell lakehouse solutions to save data teams the headache.

For now, we can hopefully say you understand the key concepts of a data lakehouse. There’s more to be said on exactly what a data lakehouse is if you’re looking for more details, but for now, you can consider yourself briefed.

Iceberg

So what’s Iceberg? A floating chunk of ice known for sinking ocean liners, a luxury fashion house, or a Dutch video game publisher? Yes, yes, and yes, but we’re talking about Apache Iceberg, a data lakehouse table format. Iceberg is one of the three main lakehouse table formats (the other two are Apache Hudi and Delta Lake), and its story is built on top of the progression from data warehouses to lakes to lakehouses outlined above. Originally built at Netflix and designed from the ground up to pair with Trino (known as Presto at the time, but we’ll get back to that) as its compute engine, it was an answer to a Hive data lake where transactions were not atomic, correctness was not guaranteed, and users were afraid to change data for risk of breaking something. Even when they did change data, because Hive necessitated rewriting entire folders, writes were inefficient and painfully slow. When you can’t modify your data, change your schemas, or write over existing data, you quickly begin to realize all those downsides, and the data algae bloom rears its ugly head. So… enter manifest files, more metadata, and badabing badaboom – problem solved. Yes, that’s a gross oversimplification, but the reality is that Iceberg’s introduction proved that transactions in a lakehouse could be safe, atomicity could be guaranteed, and snapshots and table history were bonuses that came along for the ride.
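
The key idea behind that safety is that a table’s state lives in a single current metadata file, and a commit succeeds only by atomically swapping the pointer to it. Here’s a toy sketch of that optimistic, compare-and-swap style commit; the class and file names are illustrative, not Iceberg’s actual API:

```python
# Toy illustration of Iceberg-style optimistic commits: the table is whatever
# metadata file the catalog pointer currently references, and a commit only
# succeeds if no other writer moved the pointer first. Names are illustrative.

class Catalog:
    def __init__(self):
        self.current = "metadata-v1.json"  # pointer to the live metadata file

    def commit(self, expected, new):
        # Atomic compare-and-swap: the write is rejected if another writer
        # already advanced the pointer, so readers never see a half-applied state.
        if self.current != expected:
            return False  # conflict: caller must retry on top of the new state
        self.current = new
        return True

catalog = Catalog()

# Writer 1 commits on top of v1 and wins.
assert catalog.commit("metadata-v1.json", "metadata-v2.json")

# Writer 2 also started from v1; its stale commit is rejected, not half-applied.
assert not catalog.commit("metadata-v1.json", "metadata-v2b.json")
assert catalog.current == "metadata-v2.json"
```

Because the table only ever advances through whole, committed metadata files, a failed or conflicting write leaves nothing broken behind.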

Why Iceberg?

On the features front, partition evolution is a big upside, because as your data evolves, your partitions may need to, too. If you don’t know what partitions are, they’re a way to group similar chunks of data so they can be read faster down the line, and they’ve been around for a while. Being able to change how data is partitioned on the fly is newer, though, and it lets you adjust and improve your partitioning as your data evolves or changes. Iceberg also hides partitioning and doesn’t require users to maintain it, helping eliminate some of the complexity that would traditionally come from a data lake. You can check out the Iceberg docs for more information on that.
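
The trick that makes this work is that old files keep the layout they were written with, new files use the new spec, and the planner prunes each file according to its own spec. A toy sketch of that idea (the structures are illustrative, not Iceberg’s real metadata):

```python
# Toy sketch of partition evolution: files written under an old spec keep that
# layout, new files use the new spec, and pruning consults each file's own spec.
# This mimics the idea, not Iceberg's actual data structures.

files = [
    {"path": "f1.parquet", "spec": "by_year",  "partition": {"year": 2023}},
    {"path": "f2.parquet", "spec": "by_month", "partition": {"year": 2024, "month": 6}},
    {"path": "f3.parquet", "spec": "by_month", "partition": {"year": 2024, "month": 7}},
]

def prune(files, year, month):
    """Keep only files whose partition values could contain the requested month."""
    kept = []
    for f in files:
        p = f["partition"]
        if p.get("year") != year:
            continue
        # Year-partitioned files can't be narrowed by month, so keep them all.
        if f["spec"] == "by_year" or p.get("month") == month:
            kept.append(f["path"])
    return kept

# A query for June 2024 reads only the one matching monthly file...
assert prune(files, 2024, 6) == ["f2.parquet"]
# ...while a 2023 query falls back to the coarser yearly file.
assert prune(files, 2023, 1) == ["f1.parquet"]
```

No rewrite of the old files is needed when the spec changes, which is exactly why evolving partitions is cheap.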

On top of all of that, Iceberg has a lot of momentum behind it. As an open source project with diverse vendor support, many major companies are deploying and using it, it has an extremely active community, and it seems likely to last and continue to receive updates and maintenance into the distant future.

How does Iceberg work?

Metadata and manifest files. A lot of metadata and manifest files.

Metadata files keep track of the table state. Data files are tracked individually as part of the table’s metadata rather than inferred from directory listings, and manifest files are themselves tracked in a manifest list that stores metadata about them. As mentioned earlier, Iceberg supports “time travel” via snapshots of the table from the past, which can be accessed via a manifest list that points to the manifest files representing an older version of the table. On top of that, the format is smart and reuses manifest files when it can for files that remain constant across multiple snapshots. Otherwise, every single transaction is stored, tracked, and able to be accessed as part of a given snapshot.
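
That hierarchy (snapshot, then manifest list, then manifests, then data files) is easier to see as a toy model; the file names and dictionaries below are illustrative stand-ins for the real Avro and Parquet files:

```python
# Toy model of Iceberg's metadata tree: each snapshot points at a manifest
# list, the manifest list points at manifest files, and manifests point at
# data files. Unchanged manifests are shared between snapshots rather than
# rewritten. Structures are illustrative, not the real file formats.

manifests = {
    "m1.avro": ["data-001.parquet", "data-002.parquet"],
    "m2.avro": ["data-003.parquet"],
}

snapshots = {
    1: {"manifest_list": ["m1.avro"]},             # original table state
    2: {"manifest_list": ["m1.avro", "m2.avro"]},  # an append reuses m1 untouched
}

def data_files(snapshot_id):
    """Resolve a snapshot to its data files by walking the metadata tree."""
    listed = snapshots[snapshot_id]["manifest_list"]
    return [f for m in listed for f in manifests[m]]

# "Time travel" is just reading an older snapshot's manifest list.
assert data_files(1) == ["data-001.parquet", "data-002.parquet"]
assert data_files(2) == ["data-001.parquet", "data-002.parquet", "data-003.parquet"]
```

Note how snapshot 2 reuses `m1.avro` wholesale: the append only had to write one new manifest, not rewrite the table’s bookkeeping.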

There’s a ton of complexity to Iceberg working as great as it does. You can read the Iceberg spec or our blog explaining Iceberg architecture for more detailed information.

Frequently asked questions (FAQ)

What exactly is an Icehouse?

An Icehouse is a data lakehouse built on Apache Iceberg as the table format and Trino as the query engine. The term was coined by Starburst CEO Justin Borgman to describe this specific combination of technologies. While the concept of a data lakehouse has been around for several years, the Icehouse represents a specific architectural choice that pairs two open source technologies with deep integration, strong community support, and a growing ecosystem of tools and vendors built around them.

What is the difference between a data lake, a data warehouse, and a data lakehouse?

A data warehouse stores structured, analytics-ready data optimized for queries and business intelligence, but requires significant ETL effort and can be expensive to maintain at scale. A data lake takes the opposite approach, storing data in its native format without enforcing structure, which reduces ETL overhead but makes querying more complex and can lead to data quality problems over time. A data lakehouse combines the best of both, storing data in low-cost cloud object storage while adding transactional guarantees, schema enforcement, and metadata management that make it behave more like a warehouse without the cost and rigidity.

Why is Apache Iceberg the table format of choice for the Icehouse?

Iceberg was originally built at Netflix and designed from the ground up to work with Trino as its compute engine. It solved fundamental problems with older Hive-based data lakes, including non-atomic transactions, unsafe schema changes, and inefficient data rewrites. Iceberg introduced manifest files and rich metadata management that made transactions safe, guaranteed atomicity, and enabled features like time travel and snapshot rollback. It also supports partition evolution, meaning you can change how data is partitioned as your needs change without rewriting the entire table. Among the three main lakehouse table formats, Iceberg has the most momentum, the broadest vendor support, and the most active community.

Why is Trino the query engine of choice for the Icehouse?

Trino was built specifically to solve the problem of querying large data lakes at interactive speeds. Where earlier approaches like MapReduce required submitting a job and waiting hours for results, Trino returns results in seconds to minutes, even at a massive scale. Its connector-based architecture allows it to query data across many different sources without requiring migration or ingestion into a proprietary system. For AI and machine learning workloads, this matters because feature engineering and model training pipelines often need to query large volumes of historical data across multiple sources quickly and concurrently, which is exactly what Trino is designed to do.

What makes the Icehouse particularly well-suited to AI workloads?

AI and machine learning pipelines place demands on a data platform that traditional analytics workloads do not. They require access to large volumes of historical data, often across multiple tables and data sources simultaneously. They generate high-concurrency query patterns as multiple pipelines run in parallel. And they need the data they access to be trustworthy, well-governed, and consistent. The Icehouse addresses all of these requirements. Iceberg’s metadata layer enables efficient file skipping and partition pruning, reducing the compute required to serve large AI queries. Trino’s massively parallel processing architecture handles high-concurrency workloads efficiently. And the open, interoperable nature of the stack means AI frameworks and tools can plug into it without proprietary lock-in.
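
The file-skipping piece of that is worth a concrete sketch: Iceberg’s manifests record per-file column statistics, so a planner can discard files whose min/max range cannot match the predicate without ever opening them. The stats and values below are made up for illustration:

```python
# Toy sketch of stat-based file skipping: manifests carry per-file min/max
# column values, so a planner can discard files whose range can't match the
# predicate without opening them. The values here are illustrative.

file_stats = [
    {"path": "a.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "b.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "c.parquet", "min_ts": 300, "max_ts": 399},
]

def files_for_range(stats, lo, hi):
    """Keep only files whose [min, max] range overlaps the query's [lo, hi]."""
    return [s["path"] for s in stats if s["max_ts"] >= lo and s["min_ts"] <= hi]

# A query over ts in [250, 320] opens two of the three files.
assert files_for_range(file_stats, 250, 320) == ["b.parquet", "c.parquet"]
```

On a real table with thousands of files, this kind of pruning is where much of the compute savings for large scans comes from.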

What does optionality mean in the context of the Icehouse, and why does it matter?

Optionality means choice. Because Iceberg and Trino are both open source and independent technologies, you are never locked into a specific vendor or proprietary system. Historically, data vendors have used proprietary formats and models that make it costly or impossible to move your data elsewhere, giving them leverage to raise prices over time. With the Icehouse, you can switch vendors, deploy on your own hardware, or swap out components if something better comes along, all without being held hostage by a contract or a format you cannot escape. This is especially important as the AI landscape evolves rapidly and the tools and frameworks your organization depends on today may look very different in two years.

How does Starburst make the Icehouse easier to adopt and manage?

While Iceberg and Trino are both well-documented open source projects that a skilled data engineering team can deploy independently, the operational complexity of setting up, configuring, and maintaining the stack over time is significant. Starburst Galaxy abstracts away that complexity by managing the infrastructure for you, providing autoscaling, auto-suspend, and auto-shutdown out of the box, along with features like Warp Speed caching and built-in support for fault-tolerant execution. For organizations moving toward AI-ready data infrastructure, this means the team can focus on building pipelines and delivering value rather than managing cluster configurations and table maintenance schedules.

Is the Icehouse a good fit for organizations that are just starting to build their data infrastructure?

Yes, and arguably more so now than when the concept was first introduced. The rapid growth of AI workloads has made the architectural decisions you make today more consequential than ever. Building on open, interoperable technologies like Iceberg and Trino from the start means you avoid the proprietary lock-in that makes it expensive and disruptive to change course later. It also means your data infrastructure is ready to serve the high-concurrency, large-scale query patterns that AI and machine learning pipelines require, without needing a major rebuild when those workloads arrive.

Trino

Remember how I mentioned that Iceberg was built to pair with Trino as its compute engine, and said we’d get back to that? We’re getting back to that.

Trino history

Trino was originally created under the name Presto inside Facebook. Facebook’s problem was that it had a massive data lake built on top of Hive, but querying and analyzing that data lake with MapReduce jobs was not performant, especially at scale. Trino was built as a query engine that could handle the scale of the data lake and let analysts and data scientists write complicated data lake queries in SQL and get results back at interactive, seconds-to-minutes speeds, a vast improvement over submitting a job and waiting to see the results the next day. It was open sourced on launch, and it saw major uptake in the data community for its ability to process and power analytics at rapid speeds. The connector-based architecture meant that other companies and vendors who wanted to deploy Trino could hook it up to data sources beyond what was in use at Facebook, and companies like Netflix, Apple, Uber, LinkedIn, and Amazon did so, contributing to the project as well as using it for their own data needs. Starburst, a data startup, entered the picture as a company built on selling a managed version of Presto, and it became one of the main contributors to the project.

Presto eventually forked into two versions, originally named Presto and PrestoSQL, with PrestoSQL renamed to Trino a couple of years later. The Trino website has a great blog post detailing why this happened if you’re curious. Trino has amassed a myriad of features and performance improvements not in Presto that make it the engine of choice these days, though because rebrands are hard and renames are confusing (shoutout to everyone still using the term “Tweets”), you’ll still see it referred to as Presto in some places. The Amazon EMR docs are trying their best to clear up the confusion.

What is Trino?

We can start with the big, lengthy definition: Trino is a ludicrously fast, open source, distributed, massively parallel processing, SQL query engine designed to query large data sets from many disparate data sources. The important thing in the center there is that it’s a SQL query engine. You have data, you want to query it with SQL, and Trino allows you to do that. It can do this at a massive scale, making it useful for organizations large and small who are hoping to glean insights from their data.

Trino’s architecture involves deploying a cluster with a single coordinator node and many worker nodes. The coordinator parses a SQL statement into a query, breaks that query down into stages, breaks the stages down into tasks, and assigns those tasks to worker nodes, where the work is further divided into splits that run in parallel. A small Trino cluster may involve a single coordinator and worker node running on the same machine, while a large Trino cluster may involve hundreds of servers each operating as a worker node. Large organizations can run many Trino clusters, using the Trino Gateway as a load balancer and proxy to make the many clusters behave like one massive cluster.
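
To make that decomposition concrete, here’s a toy sketch of the query-to-splits idea: input files act as splits, splits are assigned to per-worker tasks, and the tasks run in parallel before a final combine step. The real planner and scheduler are far richer than this; all names here are illustrative:

```python
# Toy sketch of Trino-style work decomposition: a coordinator breaks a query
# into stages, stages into tasks (one per worker here), and tasks process
# splits in parallel. Names and the "work" done are purely illustrative.

from concurrent.futures import ThreadPoolExecutor

def plan_splits(files, workers):
    """Assign each input file (a 'split') to a worker task, round-robin."""
    tasks = {w: [] for w in range(workers)}
    for i, f in enumerate(files):
        tasks[i % workers].append(f)
    return tasks

def run_task(splits):
    # Each worker processes its splits independently; here a split's "work"
    # is just summing the row count encoded in its name.
    return sum(int(s.split("-")[1]) for s in splits)

files = ["split-10", "split-20", "split-30", "split-40"]
tasks = plan_splits(files, workers=2)

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(run_task, tasks.values()))

# The coordinator combines partial results, as in a distributed aggregation.
assert sorted(partials) == [40, 60]
assert sum(partials) == 100
```

Scaling out in this model is just adding workers so more splits run at once, which is essentially how a Trino cluster grows.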

Why Trino?

While the data lakehouse table format war has three horses in the race (and Iceberg is winning), Trino has been around long enough, and is so good at what it does, that it more or less stands on its own. If you have already solved the problem of where and how to store your data and now need a query engine to power your analytics, Trino is the answer. Apache Spark is a powerful compute tool for batch processing, and many organizations still use it for that; Trino’s fault-tolerant execution mode lets it compete on that front, while Spark can’t match Trino’s performance for interactive analytics. Other established, performant analytics solutions require you to migrate or ingest all of your data into a specific system before you can run analytics on it, which is, to put it mildly, an expensive headache.

Back to the Icehouse

Believe it or not, “Back to the Icehouse” was the working title for Back to the Future (you shouldn’t believe it, that’s a lie). But with all of this context, you’re ready to understand what the Icehouse is: Iceberg as your storage format, paired with Trino as your compute engine. Yep, that’s it. But in addition to the technologies that compose it being the best at what they do, there are a few additional reasons why it’s a great stack.

Close integration

It’s worth explicitly pointing out that Iceberg was built for Trino, and Trino’s ongoing development has included many specific features and improvements for Iceberg. A lot of organizations are using the Icehouse stack already, and this means that there’s a lot of demand for improvements to it. While there are other table formats and other analytics engines, it’s hard for them to compete with the traction that this specific stack has already picked up. Trino and Iceberg go together like peanut butter and jelly.

Open source

Trino and Iceberg are both open source, and both have thriving user and developer communities that are constantly improving them and sharing knowledge. They’re both supported by several vendors and deployed at a number of massive tech companies, meaning that contributions come in from all over the world and from wildly different use cases, and will continue to do so for the foreseeable future. The benefits of open source are numerous and varied, but it’s hard to overstate the value that comes from widespread adoption and continued development. Other tools (such as those in the data visualization space) see the adoption, recognize the value of adding integrations, and do so, growing the ecosystem. Bugs are encountered, figured out, and fixed before you even bump into them.

Optionality

With separate storage and compute solutions that are standalone, independent technologies, it’s nearly impossible to get locked into an increasingly expensive cage where you’re dependent on a specific vendor to keep your data stack afloat. Historically, vendors in the data space have used proprietary systems and proprietary data models that make it difficult, costly, or even impossible to get your data out of their system. When there’s no alternative to bail you out, you lose optionality and prices go up. With a stack built on free, open, and available technologies, you can’t get stuck, and you’ll always have alternatives. If something new comes along and supplants Trino, you can swap to it without hassle. If Hudi ends up being a better solution for your needs down the line, you’ll need to migrate your tables from Iceberg, but your compute and analytics can stay exactly the same.

In addition to that, because everything is open and interchangeable, vendor prices are forced to remain competitive, which benefits the buyer, and vendors are encouraged to innovate as they try to differentiate themselves. If you don’t want to pay a vendor to manage your Icehouse for you, you can deploy the stack on your own hardware yourself. Optionality is the opposite of lock-in, and it means that you don’t get stuck with a bad contract or an expensive crutch that you can’t escape.

Starburst makes the Icehouse easy

The expertise and institutional knowledge built up from years of experience and development on Presto and then Trino have made Starburst the best-equipped vendor to help manage and deploy this stack for you. If you have a professional, grizzled data engineering team and the means of spinning up your own servers, point them at the Iceberg docs and the Trino docs and they’ll likely be happy to get cracking. But as noted when discussing the downsides of a lakehouse, there’s a lot of complexity involved. Getting set up isn’t easy, and managing and maintaining the stack over time stays difficult and complex, which is why Starburst Galaxy manages it all for you.
