
Let’s start with the most pressing question the title presents: what’s an Icehouse? An Australian rock band, a brand of beer, a building where you store ice (usually an “ice house,” with a space), or a New Zealand business development center? Well, yes, yes, yes, and yes, but for the purposes of this blog post, we’re talking about a term coined by Starburst CEO Justin Borgman to describe a data lakehouse built on top of Iceberg and Trino. For those who are new to the data space, this raises a few other questions: what’s a data lakehouse, is Starburst the same as the candy company, and what’s Iceberg? Let’s take a step back, and build our way to an understanding of what these things are. Then we can discuss why the Icehouse may be a good solution for your data problems.
The data lakehouse
The data lakehouse is the amalgamation of the best parts of a data lake and a data warehouse. …and we’re going to need to break this down further.
The data warehouse
One of the oldest concepts in big data storage and analytics, the data warehouse should be familiar to most readers who’ve stumbled upon this blog. A data warehouse is a home for structured, analytics-ready data optimized for queries and business intelligence. A well-maintained, organized, centralized warehouse that stores most of an organization’s data has long been the north star for data engineering teams at large organizations. The struggle is that structuring all of your data to fit within a warehouse is a massive and messy task. And because it generally requires your data to go through an ETL process, it can lead to data duplication, delay access to new data, and limit flexibility. Maintaining a data warehouse is a never-ending, expensive, and time-intensive challenge; failing to maintain it adequately can reduce access to data or render it entirely useless. There is still a time and place for data warehouses, but their flaws in cost, scalability, and maintenance have been a pain point for as long as they’ve existed.
The data lake
A reaction to the headache of maintaining a rigorous, structured data warehouse, the data lake takes the opposite approach: throw all the data into a lake. It’s exactly what it sounds like. By storing data in its native format, you avoid the headaches and costs of massive ETL workloads and greatly simplify your data stack. The downside is that your data becomes a bit of a mess. Querying an unstructured data lake demands more sophisticated, complex queries, along with advanced data science skills and tools, to turn raw data in storage into meaningful analytics and insights. You’re not getting rid of the task of reshaping the data – you’re pushing it downstream. If the shape and format of your unstructured data meanders or drifts over time, supporting all the edge and legacy cases can become a headache or borderline impossible, leaving you with more of a data bog, swamp, or quagmire. If you grew up on the Chesapeake Bay, you might say it gives you a giant data algae bloom. You don’t want that.
Enter the data lakehouse
What if we took the benefits of both the data warehouse and the data lake? Maintain the flexibility of storing unstructured data when it makes sense, but be equally willing to apply structure and rigor to the data that needs some extra attention? Like a data lake, a data lakehouse is intended to capture all of your data in a single, low-cost cloud object store, while the “house” part enables transactions, transformations, and restructuring of data with ACID (atomicity, consistency, isolation, and durability) properties to glean many of the benefits of a traditional data warehouse. There’s far less risk of data duplication, and with some active maintenance, old data shouldn’t become unintelligible or require overly complex queries to understand. Data lakehouses also store much more metadata than data lakes, tracking every transaction and letting you roll back to or view snapshots of past data. This introduces complexity, especially if you’re trying to build a lakehouse from scratch, which is why many companies sell lakehouse solutions to save data teams the headache.
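To make the “house” part a bit more concrete, here’s a minimal sketch of what a warehouse-style transactional upsert against an Iceberg table can look like when run through Trino’s Python client. This is illustrative only: it assumes a running Trino cluster with an Iceberg catalog already configured, and the host, schema, and table names are all hypothetical placeholders.

```python
# A minimal sketch of an ACID upsert on an Iceberg table via Trino.
# Assumes a Trino cluster with an Iceberg catalog named "iceberg";
# the host, schema, and table names are hypothetical placeholders.
from trino.dbapi import connect

cur = connect(
    host="trino.example.com", port=8080, user="analyst",
    catalog="iceberg", schema="sales",
).cursor()

# MERGE gives warehouse-style upsert semantics directly on lake storage:
# the whole statement commits atomically as a single Iceberg snapshot.
cur.execute("""
    MERGE INTO orders AS t
    USING staged_orders AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET status = s.status
    WHEN NOT MATCHED THEN INSERT (order_id, status)
        VALUES (s.order_id, s.status)
""")
cur.fetchall()  # drain the result so the statement runs to completion
```

Either the whole MERGE lands as one new snapshot or none of it does, which is exactly the atomicity guarantee a plain data lake can’t give you.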
Hopefully you now understand the key concepts of a data lakehouse. There’s more to be said on exactly what a data lakehouse is if you’re looking for details, but for now, you can consider yourself briefed.
Iceberg
So what’s Iceberg? A floating chunk of ice known for sinking ocean liners, a luxury fashion house, or a Dutch video game publisher? Yes, yes, and yes, but we’re talking about Apache Iceberg, a data lakehouse table format. Iceberg is one of the three main lakehouse table formats (the other two are Apache Hudi and Delta Lake), and its story builds on the progression from data warehouses to lakes to lakehouses outlined above. Originally built at Netflix and designed from the ground up to pair with Trino (known as Presto at the time, but we’ll get back to that) as its compute engine, it was an answer to a Hive data lake where transactions were not atomic, correctness was not guaranteed, and users were afraid to change data for fear of breaking something. And when they did change data, writes were inefficient and painfully slow, because Hive necessitated rewriting entire folders. When you can’t modify your data, change your schemas, or write over existing data, you quickly run into all those downsides, and the data algae bloom rears its ugly head. So… enter manifest files, more metadata, and badabing badaboom – problem solved. Yes, that’s a gross oversimplification, but the reality is that Iceberg’s introduction proved that transactions in a lakehouse could be safe, atomicity could be guaranteed, and snapshots and table history were bonuses that came along for the ride.
Why Iceberg?
On the features front, partition evolution is a big upside, because as your data evolves, your partitions may need to, too. If you don’t know what partitions are, they’re a way to group similar chunks of data so they can be read faster down the line, and they’ve been around for a while. Being able to change how data is partitioned on the fly is new, though, and it allows you to adjust and improve your partitioning as your data evolves or changes. Iceberg also hides partitioning and doesn’t require users to maintain it, helping eliminate some of the complexity that would traditionally come with a data lake. You can check out the Iceberg docs for more information on that.
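As a rough sketch of what partition evolution looks like in practice, here’s the flow through Trino’s Iceberg connector, which exposes partitioning as a table property. The table and column names are hypothetical, as are the connection details.

```python
# Sketch of partition evolution through Trino's Iceberg connector.
# Table and column names are hypothetical placeholders.
from trino.dbapi import connect

cur = connect(
    host="trino.example.com", port=8080, user="analyst",
    catalog="iceberg", schema="sales",
).cursor()

# The table starts out partitioned by day...
cur.execute("""
    CREATE TABLE events (event_time TIMESTAMP(6), payload VARCHAR)
    WITH (partitioning = ARRAY['day(event_time)'])
""")
cur.fetchall()

# ...and later, as volume grows, new writes switch to hourly partitions.
# Existing data files keep their old layout; nothing gets rewritten.
cur.execute("""
    ALTER TABLE events SET PROPERTIES partitioning = ARRAY['hour(event_time)']
""")
cur.fetchall()
```

The key point is the second statement: changing the partition spec is a metadata operation, not a rewrite of the table.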
On top of all of that, Iceberg has a lot of momentum behind it. As an open source project with diverse vendor support, many major companies are deploying and using it, it has an extremely active community, and it seems likely to last and continue to receive updates and maintenance into the distant future.
How does Iceberg work?
Metadata and manifest files. A lot of metadata and manifest files.

Metadata files keep track of the table state. Iceberg tracks individual data files in a table rather than tracking directories; manifest files list those data files, and a manifest list stores metadata about the manifests that make up a snapshot. This blog previously mentioned that Iceberg supports snapshots of the table from the past (often called “time travel”), which are accessed via a manifest list that points to manifest files representing older versions of the table. On top of that, the format is smart and reuses manifest files when it can for files that remain constant across multiple snapshots. Otherwise, every single transaction is stored, tracked, and able to be accessed as part of a given snapshot.
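To make the snapshot machinery a little more tangible, here’s a hedged sketch of poking at it through Trino’s Iceberg connector, which exposes a table’s snapshots as a metadata table and lets you query older versions. Table names and connection details are hypothetical placeholders.

```python
# Sketch of listing snapshots and time traveling with Trino's Iceberg
# connector; table and connection details are hypothetical placeholders.
from trino.dbapi import connect

cur = connect(
    host="trino.example.com", port=8080, user="analyst",
    catalog="iceberg", schema="sales",
).cursor()

# Every committed transaction appears here, each pointing at a manifest list.
cur.execute('SELECT snapshot_id, committed_at, operation FROM "events$snapshots"')
for snapshot_id, committed_at, operation in cur.fetchall():
    print(snapshot_id, committed_at, operation)

# Query the table as it looked at a past snapshot ("time travel").
past_snapshot_id = 1234567890123456789  # substitute a real id from above
cur.execute(f"SELECT count(*) FROM events FOR VERSION AS OF {past_snapshot_id}")
print(cur.fetchall())
```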
There’s a ton of complexity behind making Iceberg work as well as it does. You can read the Iceberg spec or our blog explaining Iceberg architecture for more detailed information.
Frequently asked questions (FAQ)
What exactly is an Icehouse?
An Icehouse is a data lakehouse built on Apache Iceberg as the table format and Trino as the query engine. The term was coined by Starburst CEO Justin Borgman to describe this specific combination of technologies. While the concept of a data lakehouse has been around for several years, the Icehouse represents a specific architectural choice that pairs two open source technologies with deep integration, strong community support, and a growing ecosystem of tools and vendors built around them.
What is the difference between a data lake, a data warehouse, and a data lakehouse?
A data warehouse stores structured, analytics-ready data optimized for queries and business intelligence, but requires significant ETL effort and can be expensive to maintain at scale. A data lake takes the opposite approach, storing data in its native format without enforcing structure, which reduces ETL overhead but makes querying more complex and can lead to data quality problems over time. A data lakehouse combines the best of both, storing data in low-cost cloud object storage while adding transactional guarantees, schema enforcement, and metadata management that make it behave more like a warehouse without the cost and rigidity.
Why is Apache Iceberg the table format of choice for the Icehouse?
Iceberg was originally built at Netflix and designed from the ground up to work with Trino as its compute engine. It solved fundamental problems with older Hive-based data lakes, including non-atomic transactions, unsafe schema changes, and inefficient data rewrites. Iceberg introduced manifest files and rich metadata management that made transactions safe, guaranteed atomicity, and enabled features like time travel and snapshot rollback. It also supports partition evolution, meaning you can change how data is partitioned as your needs change without rewriting the entire table. Among the three main lakehouse table formats, Iceberg has the most momentum, the broadest vendor support, and the most active community.
Why is Trino the query engine of choice for the Icehouse?
Trino was built specifically to solve the problem of querying large data lakes at interactive speeds. Where earlier approaches like MapReduce required submitting a job and waiting hours for results, Trino returns results in seconds to minutes, even at a massive scale. Its connector-based architecture allows it to query data across many different sources without requiring migration or ingestion into a proprietary system. For AI and machine learning workloads, this matters because feature engineering and model training pipelines often need to query large volumes of historical data across multiple sources quickly and concurrently, which is exactly what Trino is designed to do.
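For a feel of what that connector-based architecture looks like from the client side, here’s a minimal sketch of a federated query through Trino’s Python client, joining an Iceberg table against a hypothetical PostgreSQL catalog. All catalog, schema, and table names are placeholders.

```python
# Sketch of a federated Trino query joining an Iceberg table with a
# hypothetical PostgreSQL catalog; all names are placeholders.
from trino.dbapi import connect

cur = connect(host="trino.example.com", port=8080, user="analyst").cursor()

# One SQL statement spans two catalogs; no ingestion or migration required.
cur.execute("""
    SELECT c.region, count(*) AS order_count
    FROM iceberg.sales.orders AS o
    JOIN postgresql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")
for region, order_count in cur.fetchall():
    print(region, order_count)
```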
What makes the Icehouse particularly well-suited to AI workloads?
AI and machine learning pipelines place demands on a data platform that traditional analytics workloads do not. They require access to large volumes of historical data, often across multiple tables and data sources simultaneously. They generate high-concurrency query patterns as multiple pipelines run in parallel. And they need the data they access to be trustworthy, well-governed, and consistent. The Icehouse addresses all of these requirements. Iceberg’s metadata layer enables efficient file skipping and partition pruning, reducing the compute required to serve large AI queries. Trino’s massively parallel processing architecture handles high-concurrency workloads efficiently. And the open, interoperable nature of the stack means AI frameworks and tools can plug into it without proprietary lock-in.
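As an illustration of that file skipping from a training pipeline’s point of view, here’s a hedged sketch using pyiceberg, the Python client for Iceberg. The catalog configuration and table name are hypothetical; the point is that the filter is evaluated against Iceberg’s metadata, so partitions and data files that can’t match are never read.

```python
# Sketch of pulling feature data from an Iceberg table with pyiceberg.
# Catalog config and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # resolved from local pyiceberg config
table = catalog.load_table("sales.orders")

# The filter is pushed into Iceberg's metadata: partition pruning and
# column statistics in the manifests skip files before any data is read.
features = table.scan(
    row_filter="order_date >= '2024-01-01'",
    selected_fields=("customer_id", "order_total", "order_date"),
).to_arrow()

print(features.num_rows)
```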
What does optionality mean in the context of the Icehouse, and why does it matter?
Optionality means choice. Because Iceberg and Trino are both open source and independent technologies, you are never locked into a specific vendor or proprietary system. Historically, data vendors have used proprietary formats and models that make it costly or impossible to move your data elsewhere, giving them leverage to raise prices over time. With the Icehouse, you can switch vendors, deploy on your own hardware, or swap out components if something better comes along, all without being held hostage by a contract or a format you cannot escape. This is especially important as the AI landscape evolves rapidly and the tools and frameworks your organization depends on today may look very different in two years.
How does Starburst make the Icehouse easier to adopt and manage?
While Iceberg and Trino are both well-documented open source projects that a skilled data engineering team can deploy independently, the operational complexity of setting up, configuring, and maintaining the stack over time is significant. Starburst Galaxy abstracts away that complexity by managing the infrastructure for you, providing autoscaling, auto-suspend, and auto-shutdown out of the box, along with features like Warp Speed caching and built-in support for fault-tolerant execution. For organizations moving toward AI-ready data infrastructure, this means the team can focus on building pipelines and delivering value rather than managing cluster configurations and table maintenance schedules.
Is the Icehouse a good fit for organizations that are just starting to build their data infrastructure?
Yes, and arguably more so now than when the concept was first introduced. The rapid growth of AI workloads has made the architectural decisions you make today more consequential than ever. Building on open, interoperable technologies like Iceberg and Trino from the start means you avoid the proprietary lock-in that makes it expensive and disruptive to change course later. It also means your data infrastructure is ready to serve the high-concurrency, large-scale query patterns that AI and machine learning pipelines require, without needing a major rebuild when those workloads arrive.