Data Lakehouse Resource Center

Our Data Lakehouse platform combines the best of data lakes, data warehouses, and data virtualization

Operationalize your data lake whitepaper

What is a Data Lakehouse?

A data lakehouse is a new data management paradigm that combines the benefits of a data lake and a data warehouse. Object storage, also known as a data lake, is used for its flexibility, low cost, and capacity for large volumes of data. On top of the data lake, open table formats such as Apache Iceberg are used in conjunction with open file formats such as Parquet to improve reliability and provide data warehouse-like functionality and performance. Combining a data lake with recent developments in open table formats allows for low-cost, high-performance analytics at scale.

Learn more: Building a modern data lakehouse webinar

A data lakehouse has four key features:

Low cost, highly scalable storage

Modern open table formats

Distributed MPP query engine

Fine-grained access control

“It was not that hard at all, and what resulted was real-time analysis powered by Starburst.”

— Mitchell Posluns, Senior Data Scientist, Assurance

“The ability to access the data in Google Cloud Storage is where Starburst really stands out and stands apart.”

— Sachin Gopalakrishna Menon, Senior Director of Data, Priceline

“We can make decisions faster based on the analytics. Instead of waiting for days, this happens in near real-time.”

— Sachin Gopalakrishna Menon, Senior Director of Data, Priceline

Unlock the full potential of your lakehouse with Apache Iceberg

Introduction to Apache Iceberg In Trino

Apache Iceberg is an open source table format that brings database functionality to object storage such as S3, Azure’s ADLS, Google Cloud Storage, and MinIO. This allows an organization to take advantage of low-cost, high-performing cloud storage while providing data warehouse features and experience to its end users without being locked into a single vendor.

Iceberg Partitioning and Performance Optimizations in Trino

Partitioning narrows the scope of the data that needs to be read for a query. When dealing with big data, this can be crucial for performance: it can bring a query that takes minutes or even hours down to seconds!
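The pruning idea behind partitioning can be illustrated with a small Python sketch. This is a conceptual toy, not Iceberg's or Trino's implementation: the rows, the day-level grouping, and the function names are all hypothetical. Rows are grouped by day, and a query filtered on one day touches only that group.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (order_id, order_date, amount)
ROWS = [
    (1, date(2023, 1, 1), 10.0),
    (2, date(2023, 1, 1), 25.0),
    (3, date(2023, 1, 2), 40.0),
    (4, date(2023, 1, 3), 5.0),
]


def partition_by_day(rows):
    """Group rows into one bucket per order_date, like day-level partitions."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)
    return dict(partitions)


def scan(partitions, wanted_day):
    """Read only the partition that matches the filter.

    Returns the matching rows and the number of partitions touched.
    """
    matching = {day: rows for day, rows in partitions.items() if day == wanted_day}
    rows = [row for bucket in matching.values() for row in bucket]
    return rows, len(matching)


PARTITIONS = partition_by_day(ROWS)
```

Calling scan(PARTITIONS, date(2023, 1, 1)) reads one of the three partitions instead of every row, which is the effect a partition filter aims for at much larger scale.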

Apache Iceberg DML (update/delete/merge) & Maintenance in Trino

One key feature of the Apache Iceberg connector is Trino’s ability to modify data that resides on object storage. As we all know, objects in storage like AWS S3 are immutable, which means they cannot be modified in place. This was a challenge in the Hadoop era, when data needed to be modified or removed at the individual row level. Using the Iceberg connector, Trino allows full DML (data manipulation language), which means full support for update, delete, and merge.
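One general way to support row-level changes on immutable storage is to write new files and leave the old ones untouched. The Python sketch below shows a simplified copy-on-write delete under that assumption; it is not how Trino or Iceberg implement DML, and the data and function name are hypothetical.

```python
def delete_rows(files, predicate):
    """Return a new file list with matching rows removed.

    Each "file" is an immutable tuple of rows. A file containing a matching
    row is rewritten as a new tuple; untouched files are reused as-is.
    """
    new_files = []
    for f in files:
        if any(predicate(row) for row in f):
            kept = tuple(row for row in f if not predicate(row))
            if kept:  # drop files that become empty
                new_files.append(kept)
        else:
            new_files.append(f)
    return tuple(new_files)


# Hypothetical table made of two immutable data files.
FILES_V1 = (
    (("alice", 10), ("bob", 20)),
    (("carol", 30),),
)

# Deleting one row yields a new version; FILES_V1 is never modified.
FILES_V2 = delete_rows(FILES_V1, lambda row: row[0] == "bob")
```

The original file list remains readable after the delete, which is also what makes table history possible.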

Apache Iceberg Schema Evolution in Trino

Schema evolution simply means the modification of tables as business rules and source systems change over time. Trino’s Iceberg connector supports several kinds of table modifications, including renaming the table itself and changing columns and partitioning.
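Iceberg tracks columns by stable IDs rather than by name, which is why a rename can be a metadata-only change. The Python sketch below illustrates that idea in miniature; the data layout, schema dictionaries, and function names are hypothetical simplifications, not Iceberg's actual file or metadata format.

```python
# Stored rows are keyed by stable column ID rather than by column name.
DATA_FILE = [
    {1: 101, 2: "alice"},
    {1: 102, 2: "bob"},
]


def rename_column(schema, old, new):
    """Return a new schema with one column renamed; column IDs are unchanged."""
    return {cid: (new if name == old else name) for cid, name in schema.items()}


def read(schema, data_file):
    """Resolve column IDs to their current names without touching stored data."""
    return [
        {schema[cid]: value for cid, value in row.items() if cid in schema}
        for row in data_file
    ]


SCHEMA_V1 = {1: "id", 2: "name"}
SCHEMA_V2 = rename_column(SCHEMA_V1, "name", "customer_name")
```

Reading DATA_FILE through SCHEMA_V2 returns the same values under the new column name, and the stored rows are never rewritten.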

Apache Iceberg Time Travel & Rollbacks in Trino

Time travel in Trino using Iceberg is a handy feature to “look back in time” at a table’s history. As we covered in this blog, each change to an Iceberg table creates a new “snapshot” which can be referred to by using standard SQL.
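The snapshot idea can be sketched as a toy Python class in which each commit records an immutable copy of the table and a pointer marks the current one. This is an illustration only: the class, its methods, and the sequential snapshot IDs are hypothetical and do not mirror Iceberg's metadata layout or Trino's SQL syntax.

```python
class SnapshotTable:
    """Toy table where every commit creates a new immutable snapshot."""

    def __init__(self):
        self._snapshots = {}   # snapshot_id -> tuple of rows
        self._current = None   # ID of the snapshot that reads see by default

    def commit(self, rows):
        """Store a new snapshot and make it current; return its ID."""
        snapshot_id = len(self._snapshots) + 1
        self._snapshots[snapshot_id] = tuple(rows)
        self._current = snapshot_id
        return snapshot_id

    def read(self, snapshot_id=None):
        """Read the current snapshot, or an older one when an ID is given."""
        sid = self._current if snapshot_id is None else snapshot_id
        if sid is None:
            return ()
        return self._snapshots[sid]

    def rollback_to(self, snapshot_id):
        """Point the table back at an earlier snapshot without deleting any."""
        if snapshot_id not in self._snapshots:
            raise KeyError(snapshot_id)
        self._current = snapshot_id


table = SnapshotTable()
first = table.commit(["a"])
second = table.commit(["a", "b"])
```

Here table.read(first) "looks back in time", while table.rollback_to(first) makes the earlier state current again and leaves the later snapshot readable by its ID.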

Building Reporting Structures on S3 using Starburst Galaxy and Apache Iceberg

AWS S3 has become one of the most widely used storage platforms in the world. Companies store a variety of data on S3, from application data to event-based and IoT data. Oftentimes, this data is used for analytics in the form of regular BI reporting in addition to ad hoc reporting.

Trino On Ice I: A Gentle Introduction To Iceberg

Back in the Gentle introduction to the Hive connector blog, I discussed the commonly misunderstood architecture and uses of our Hive connector. In short, while some may think the name indicates Trino makes a call to a running Hive instance, the Hive connector does not use the Hive runtime to answer queries.

Trino on Ice II: In-Place Table Evolution and Cloud Compatibility with Iceberg

Welcome back to this blog post series discussing the awesome features of Apache Iceberg. The first post covered how Iceberg is a table format and not a file format. It demonstrated the benefits of hidden partitioning in Iceberg in contrast to exposed partitioning in Hive. There really is no such thing as “exposed partitioning.”

Trino on Ice III: Iceberg Concurrency Model, Snapshots, and the Iceberg Spec

Welcome back to this blog series discussing the amazing features of Apache Iceberg. In the last two blog posts, we’ve covered a lot of cool feature improvements of Iceberg over the Hive model. I recommend you take a look at those if you haven’t yet.

Trino on Ice IV: Deep Dive Into Iceberg Internals

Welcome back to the Trino on Ice blog series, which has so far covered some very interesting high-level concepts of the Iceberg model and how you can take advantage of them using the Trino query engine. This blog post dives into some of the implementation details of Iceberg by dissecting some of the files that result from various operations carried out using Trino.

Starburst Galaxy and the Data Lakehouse with Dain Sundstrom, Trino Co-Creator & Starburst CTO

Building a Data Lake Strategy with Starburst Galaxy

Starburst Data Products

Starburst Enterprise Security Guide

Starburst Delta Lake Connector

Building a federated data lakehouse with Starburst Galaxy

Starburst and Trino really enabled us to accelerate time to insight, improve our conversion rates, and enable robust modeling.

— Mitchell Posluns, Senior Data Scientist

Starburst was very data-lake friendly. It was as if it was built for that model. That was a key differentiator for us. We were very invested in the data lake.

— Ivan Black, Director

Our data lake backbone was on a traditional Hadoop infrastructure. While that approach had its day, it’s not flexible. We needed to scale out and separate our compute from our storage without moving the data.

— Mike Prior, Principal IO Engineer

The data warehouse and data lake space is constantly evolving, and our enterprise focus means we have to support customer requirements across different platforms. Starburst gives us the ability to move quickly to support ever-changing use cases within complex enterprise environments.

— David Schulman, Head of Partner Marketing

Starburst powers the serving layer of Zalando’s data lake on S3. We have more than a thousand internal users and 100s of applications using Starburst daily, running 30k+ queries that are processing half a petabyte per day.

— Onur Y., Engineering Lead
