What is an open data lakehouse?

An open data lakehouse is a data analytics architecture that combines a data lake’s cost-effective storage with a data warehouse’s robust analytics.

April 8, 2024

Evan Smith
Technical Content Manager
Starburst Data

How do I build an open data lakehouse?

Open data lakehouses combine open-source table formats, file formats, and query engines on commodity cloud services like AWS and Azure to make big data analytics scalable and accessible.

Open data lakehouse architectures address the growing need for warehouse-like analytics that work at scale across the varied data formats and disparate data sources that AI and similar workloads require. Open source lets companies build an affordable yet performant analytics infrastructure that keeps pace with data’s rapid growth.

This article will introduce the open data lakehouse concept and its use cases, as well as explain how Apache Iceberg and Trino combine to form a powerful analytics resource.

What are the key components of an open data lakehouse platform?

An open data lakehouse platform builds upon commercial cloud object storage but otherwise draws from the open-source ecosystem to construct a scalable, performant analytics solution. On top of that storage layer, it comprises three open components: file formats, table formats, and compute engines.

Commodity object storage

Amazon S3, Azure Blob Storage, and other cloud platforms provide commodity data storage at petabyte scales. By decoupling storage from compute, data teams can optimize the costs and performance of each independently. These cloud platforms are also more flexible, scalable, and affordable than on-premises storage infrastructure.

Open file format

Open file formats define how a lakehouse writes and reads data. Columnar file formats structure data in ways that enhance query performance, providing detailed metadata and indexes that queries use to skip irrelevant data. Examples of open file formats include Apache Parquet and ORC.
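The data-skipping idea can be illustrated with a minimal sketch. It does not use a real Parquet or ORC library; the RowGroup class, its field names, and the sample values are all hypothetical stand-ins for the per-chunk minimum and maximum statistics that columnar formats typically record.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one chunk of a columnar file."""
    min_order_date: str  # smallest order_date stored in this chunk
    max_order_date: str  # largest order_date stored in this chunk
    rows: list           # (order_date, amount) pairs held in the chunk


def scan(row_groups, wanted_date):
    """Return the matching rows and the number of chunks actually read."""
    matches, chunks_read = [], 0
    for group in row_groups:
        # The statistics show this chunk cannot contain the value, so skip it.
        if not (group.min_order_date <= wanted_date <= group.max_order_date):
            continue
        chunks_read += 1
        matches.extend(row for row in group.rows if row[0] == wanted_date)
    return matches, chunks_read


row_groups = [
    RowGroup("2024-01-01", "2024-01-31", [("2024-01-15", 40.0)]),
    RowGroup("2024-02-01", "2024-02-29", [("2024-02-10", 25.0), ("2024-02-11", 12.5)]),
    RowGroup("2024-03-01", "2024-03-31", [("2024-03-05", 99.0)]),
]

matches, chunks_read = scan(row_groups, "2024-02-10")
print(matches, chunks_read)
```

Only one of the three chunks is opened here; the other two are ruled out from their statistics alone.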

Open table format

Open table formats add an abstraction layer to a data lake’s sparse metadata, creating warehouse-like storage with structured and unstructured data. Table formats define the schema and partitions of every table and describe the files they contain. By providing a separate reference for this information, tables let queries avoid opening every file’s header and instead go to the most relevant files. Delta Lake and Apache Iceberg are commonly used open table formats.
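As a rough illustration of how table-level metadata narrows a query to the relevant files, the sketch below keeps a file listing with partition values in one place. The dictionary layout, paths, and the plan_files helper are hypothetical and far simpler than the metadata that Iceberg or Delta Lake actually maintain.

```python
# Hypothetical table metadata: schema, partition column, and a file listing.
table_metadata = {
    "schema": ["event_date", "user_id", "action"],
    "partition_by": "event_date",
    "files": [
        {"path": "events/event_date=2024-04-01/part-0.parquet", "partition": "2024-04-01"},
        {"path": "events/event_date=2024-04-01/part-1.parquet", "partition": "2024-04-01"},
        {"path": "events/event_date=2024-04-02/part-0.parquet", "partition": "2024-04-02"},
    ],
}


def plan_files(metadata, partition_value):
    """Pick the files to read using metadata only; no data file is opened."""
    return [
        entry["path"]
        for entry in metadata["files"]
        if entry["partition"] == partition_value
    ]


to_read = plan_files(table_metadata, "2024-04-02")
print(to_read)
```

The planning step touches only the metadata, which is the separate reference the paragraph above describes.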

Open compute engine

Open compute engines are the lakehouse components that elevate big data analytics far beyond a conventional warehouse’s capabilities. Designed for massive parallelization in cloud environments, these compute engines can process large datasets quickly while balancing compute costs. As a result, lakehouses can support streaming ingestion and near real-time analytics while giving data consumers access to database-like functionality. Frequently used open compute engines include Trino and Apache Spark.

Open Source Trino

Many organizations use Trino with their data lakehouses because its massively parallel, distributed SQL query engine provides a unique combination of performance, cost-effectiveness, and accessibility. Facebook originally developed Trino (first released under the name Presto) to improve query performance in the Hadoop ecosystem, but it now works on multiple data platforms.

Trino use cases

Common Trino use cases include:

Interactive data analytics

Unlike the proprietary SQL dialects of many data warehouses, Trino uses ANSI-standard SQL to maximize analytics accessibility. SQL-compatible business intelligence applications like Tableau can easily return data from the lakehouse. Exploration and data extraction become easier when data scientists can write standard SQL statements in Python or other programming languages. Engineers can quickly develop dashboards and other data products for the least technical users. Up and down the organization, Trino lets data consumers conduct interactive analysis of large datasets.

Centralized data access and federated analytics

Trino connectors eliminate data silos by federating data sources across the company so a single Trino query can access data lakes, relational databases, and streaming sources. Besides streamlining query development, this federation lets data teams optimize lakehouse storage for frequently accessed data without isolating potentially valuable data in other systems.
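The shape of a federated query can be sketched without a Trino cluster. The example below is only an analogy: it uses Python's built-in sqlite3 module, and two attached in-memory databases named lake and crm (hypothetical names, as are the tables and rows) play the role of two separately connected sources.

```python
import sqlite3

# One connection stands in for the query engine.
conn = sqlite3.connect(":memory:")

# Two attached databases stand in for two independent data sources.
conn.execute("ATTACH DATABASE ':memory:' AS lake")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?)", [(1, 20.0), (1, 5.0), (2, 7.5)]
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)

# A single SQL statement joins data that lives in both sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM lake.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

In Trino, the source-qualified names would instead refer to catalogs configured through connectors, but the query author's experience is similar: one statement, several systems.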

High-performance analytics of object storage

Parallelization at scale and query optimizations make Trino an ideal solution for performing big data analytics on object storage. Trino can push queries down to source systems to reduce compute costs and leverage the source’s indexes. Dynamic filtering lets Trino skip data that the query would otherwise read and then filter out. A cost-based optimizer distributes compute loads to balance cost with performance.
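The general idea behind dynamic filtering can be shown with a simplified sketch. This is not Trino's implementation; the function, the store and sales data, and the per-file key ranges are hypothetical. The small side of a join is scanned first, and the keys it yields are used to skip files on the large side before they are read.

```python
def dynamic_filter_join(dim_rows, fact_files, dim_predicate):
    """Join filtered dimension rows to fact files, skipping files by key range."""
    # 1. Scan the small side first and collect the join keys that survive.
    keys = {row["store_id"] for row in dim_rows if dim_predicate(row)}
    if not keys:
        return [], 0
    lowest, highest = min(keys), max(keys)

    joined, files_read = [], 0
    for data_file in fact_files:
        # 2. Skip any fact file whose key range cannot overlap the collected keys.
        if data_file["max_store_id"] < lowest or data_file["min_store_id"] > highest:
            continue
        files_read += 1
        joined.extend(row for row in data_file["rows"] if row["store_id"] in keys)
    return joined, files_read


stores = [
    {"store_id": 1, "region": "EU"},
    {"store_id": 2, "region": "EU"},
    {"store_id": 50, "region": "US"},
]
sales_files = [
    {"min_store_id": 1, "max_store_id": 10,
     "rows": [{"store_id": 1, "amount": 10}, {"store_id": 7, "amount": 3}]},
    {"min_store_id": 40, "max_store_id": 60,
     "rows": [{"store_id": 50, "amount": 8}]},
]

joined, files_read = dynamic_filter_join(
    stores, sales_files, lambda row: row["region"] == "EU"
)
print(joined, files_read)
```

The second sales file is never read, because the filter derived at run time from the stores side rules it out.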

Batch ETL processing across disparate systems

Data ingestion and other workflows that require complex ETL pipelines typically run in overnight batches because they take so long and consume significant resources. Trino accelerates and simplifies these pipelines by letting engineers use standard SQL statements within a single system to query multiple data sources. Besides streamlining pipelines, faster batch processing speeds an ad hoc research project’s time to insight.

Why Apache Iceberg? Why open table formats?

Apache Iceberg is a highly performant open table format that brings data warehousing functionality to a data lake’s object storage and integrates with modern query engines like Trino.

Developers at Netflix created Iceberg’s table format to address the challenges of working with Hive. Basic data management, like deletes and updates, had become increasingly difficult, as had meeting data governance demands. A single change required overwriting huge datasets.

4 Benefits of open table formats and Apache Iceberg

Open table formats like Iceberg can better meet modern analytics needs than older technologies like Hive. Some benefits include:

1. Central table storage

An Iceberg table’s catalog provides a central starting place for queries to find metadata without accessing files individually. The catalog points to table metadata files where the query can find schema, file metadata, and other information.

2. Access control

Open table formats like Iceberg leave access control to the open compute engines, which leverage table metadata to secure access, protect data, and apply privacy rules.

3. Enables portable compute

Open table formats help eliminate vendor lock-in and make data more portable. They work with multiple compute engines, so companies are not tied to one vendor’s product.

4. Schema evolution

Open table formats are also less sensitive to how data changes over time. Schema evolution lets these tables evolve without requiring massive — and expensive — rewrites.
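A minimal sketch of why schema changes can be metadata-only: Iceberg tracks columns by ID rather than by name, so a rename or an added column changes the schema without touching existing data files. The structures below are hypothetical simplifications, not Iceberg's actual file or metadata layout.

```python
# Data files store values keyed by column ID; nothing below rewrites them.
data_files = [
    [{1: "u1", 2: "click"}],
    [{1: "u2", 2: "view"}],
]

# Original schema: column ID -> column name.
schema_v1 = {1: "user_id", 2: "action"}

# Evolved schema: column 2 is renamed and column 3 is added.
schema_v2 = {1: "user_id", 2: "event_type", 3: "country"}


def read(files, schema):
    """Project stored rows through a schema; columns absent from a file read as None."""
    return [
        {name: row.get(column_id) for column_id, name in schema.items()}
        for data_file in files
        for row in data_file
    ]


old_view = read(data_files, schema_v1)
new_view = read(data_files, schema_v2)
print(new_view)
```

The same untouched files serve both schema versions; older rows simply report no value for the new column.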

What are the key benefits of implementing an open data lakehouse architecture?

An open data lakehouse architecture combines the benefits of data warehouses and data lakes into a single, performant analytics platform:

Data warehouse benefits	Data lake benefits
ACID transactions	Separation of storage and compute
Fine-grained access control	Large scale
Data quality	Cost-efficient
High performance and concurrency	Open formats
Highly curated data	Structured and unstructured data
Typically proprietary systems	Open-source options
Best for business intelligence use cases	Best for data science and data engineering use cases

Lakehouse: a new generation of open platforms

Starburst Galaxy uses Trino to improve modern data lakehouse architectures, becoming a single point of access to all data formats stored in any data source. Galaxy’s enhancements, like smart indexing and caching, accelerate query performance. Granular access controls provide the governance enforcement needed to deliver a self-service model that makes data more accessible and secure.

Starburst Galaxy lets companies implement Trino on their data lakehouses without worrying about the open-source software’s operational aspects.