What is a data lakehouse?

A data lakehouse combines a data lake and a data warehouse, creating a centralized data repository that uses cost-effective data storage, usually in the cloud, and a metadata layer to optimize compute resources for big data queries.

April 8, 2024

Evan Smith
Technical Content Manager
Starburst Data

What is a data lakehouse vs database?

To understand a data lakehouse, you first need to understand what a database is. Relational databases are optimized for the highly structured data of traditional business apps like employee records or point-of-sale systems. Lakehouses impose no such constraint: they store raw data in any structure, including unstructured and semi-structured data. The object storage of a lakehouse also scales more readily than a traditional database.

Related reading: Data Lakehouse vs Data Warehouse

Why do organizations need a data lakehouse?

Enterprise data management becomes increasingly resource-intensive as data grows in volume and complexity. Data engineers need more time to maintain data systems that become more expensive every year. These pressures were what first drove enterprise adoption of data warehousing. Creating a centralized repository lets data engineering teams manage storage, compute, and data governance within a single system.

Companies soon ran into the limits of their data warehouses. Monolithic, proprietary systems weren’t flexible or scalable enough to meet growing data demands. And vendor lock-in made the cost of proprietary data warehouses unpredictable.

Data lakes seemed to offer a path to lower costs by decoupling storage from compute and switching to more cost-effective open source software. However, the original data lakes only replaced the storage layer of a data warehouse solution. These systems could not replace the analytics and governance capabilities, which led to increasingly complex workarounds.

Enter the lakehouse. By combining the analytics and governance capabilities of a warehouse with the efficient storage of a data lake, this modern architecture delivers multiple benefits.

Eliminate data silos

The purpose of warehouses and lakes is to centralize enterprise data by consolidating datasets from multiple data sources in a single location. By eliminating data silos, these approaches should have reduced duplication and redundancy while making data more accessible.

However, these approaches sometimes do the opposite. Warehouses become cluttered with data formatted for particular workloads. Without robust analytics and management features, lakes often require multiple warehouses to make data usable.

Data lakehouses make data easier to manage and more accessible, which lets organizations eliminate redundant warehouses and break down silos.

Supports transactional systems

Transactional systems are among the most significant enterprise data silos. To ensure data integrity, their processing systems must comply with ACID (Atomicity, Consistency, Isolation, and Durability) standards. Lakehouses tear down these last silos by supporting ACID transactions. As a result, this data can live in the lakehouse’s centralized data stores and allow the business to draw insights based on the most current data.
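
One way to picture how a table format can provide atomicity and isolation on top of plain object storage is a snapshot commit: a writer prepares a new table version, then publishes it with a single pointer swap. The sketch below is a toy illustration under that assumption, not the implementation of any particular table format. The names Snapshot, Table, and commit are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable list of data files that make up one table version."""
    snapshot_id: int
    files: tuple


@dataclass
class Table:
    """Toy table: readers only ever see the snapshot that `current` points at."""
    current: Snapshot = field(default_factory=lambda: Snapshot(0, ()))

    def commit(self, base: Snapshot, new_files: tuple) -> Snapshot:
        # Optimistic concurrency: reject the commit if another writer has
        # already replaced the snapshot this writer started from.
        if base.snapshot_id != self.current.snapshot_id:
            raise RuntimeError("conflict: table changed since it was read")
        new = Snapshot(base.snapshot_id + 1, base.files + new_files)
        # Publishing is one pointer swap, so the change is all-or-nothing.
        self.current = new
        return new


table = Table()
base = table.current
table.commit(base, ("orders-0001.parquet",))
```

A second writer that still holds the old base snapshot gets a RuntimeError instead of silently overwriting the first commit, and readers never observe a half-written version.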

Keeps data secure and private

Iceberg and other open table formats allow lakehouses to collect more varied metadata than lakes. Governance and access control systems can draw on this rich metadata to create granular rules that ensure appropriate access to data and compliance with data regulations.

For example, human resource analysts in Europe can query detailed employee records, while business analysts on another floor would only see aggregated data. Governance rules would prevent analysts in an American office from moving employee data out of European data storage locations.
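
The access pattern in this example can be sketched as a role check in front of the data. This is a minimal illustration only: the role names, the sample records, and the query_employees function are hypothetical, and a real lakehouse would enforce such rules in its governance layer rather than in application code.

```python
from statistics import mean

# Hypothetical sample records; not taken from any real system.
EMPLOYEES = [
    {"name": "Ana", "region": "EU", "salary": 50000},
    {"name": "Ben", "region": "EU", "salary": 70000},
]


def query_employees(role, records):
    """Return row-level or aggregated data depending on the caller's role."""
    if role == "hr_analyst":
        # HR analysts may see the detailed employee records.
        return [dict(r) for r in records]
    if role == "business_analyst":
        # Business analysts only see aggregates, never individual rows.
        return [{
            "headcount": len(records),
            "avg_salary": mean(r["salary"] for r in records),
        }]
    # Any other role gets no access at all.
    raise PermissionError(f"role {role!r} may not query employee data")


detailed = query_employees("hr_analyst", EMPLOYEES)
summary = query_employees("business_analyst", EMPLOYEES)
```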

Manages data more efficiently

Replacing proprietary data warehouse solutions with cloud object storage lets companies manage their data more efficiently. They no longer need separate storage systems to handle different data structures. Lakehouses store structured and unstructured data with equal ease.

Lakehouses also simplify the maintenance of data pipelines. Since the lakehouse stores raw data, the ETL pipelines at ingestion can be less complex without compromising data quality. Dedicated ELT pipelines for each data product handle the final transformation without altering the lakehouse’s repository.

Improves query performance

Data lakes promised to decouple storage from compute, letting data teams optimize their investments in each. Lakehouses are more performant thanks to their columnar and read-optimized open table formats, which support performance-boosting features like data skipping and partition handling.
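
Data skipping and partition pruning both rely on metadata kept about each data file. The sketch below assumes a table that records a partition value and min/max column statistics per file. The DataFile fields, the file paths, and the plan_scan function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    partition: str    # partition value, here an order date
    min_amount: int   # smallest `amount` stored in the file
    max_amount: int   # largest `amount` stored in the file


def plan_scan(files, partition, min_wanted):
    """Pick the files that could hold rows with amount >= min_wanted."""
    selected = []
    for f in files:
        if f.partition != partition:
            continue  # partition pruning: wrong partition, never opened
        if f.max_amount < min_wanted:
            continue  # data skipping: statistics rule out every row
        selected.append(f)
    return selected


FILES = [
    DataFile("orders-a.parquet", "2024-04-01", 1, 50),
    DataFile("orders-b.parquet", "2024-04-01", 60, 900),
    DataFile("orders-c.parquet", "2024-04-02", 500, 800),
]

to_read = plan_scan(FILES, partition="2024-04-01", min_wanted=100)
```

Here only orders-b.parquet has to be read. The other two files are eliminated from metadata alone, which is where the savings in I/O and compute come from.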

Pairing lakehouse storage layers with efficient, high-performance query engines accelerates analysis, making this architecture as performant as a data warehouse, if not more so. Query engines that support features like in-memory execution, predicate pushdown, and columnar reads can achieve fast results without excessive compute costs.
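
To make predicate pushdown and columnar reads concrete, the toy sketch below stores a table column by column and applies the filter inside the scan, so only the filter column and the requested columns are touched. The TABLE data and the scan function are hypothetical and do not reflect how any specific query engine is implemented.

```python
# Column-oriented layout: one list per column, aligned by row position.
TABLE = {
    "order_id": [1, 2, 3, 4],
    "country": ["DE", "US", "DE", "FR"],
    "amount": [10, 25, 40, 5],
}


def scan(table, columns, predicate_column, predicate):
    """Read only `columns`, keeping rows whose predicate column matches."""
    # Predicate pushdown: the filter runs during the scan, on one column.
    matches = [
        i for i, value in enumerate(table[predicate_column])
        if predicate(value)
    ]
    # Columnar read: unrequested columns (order_id here) are never accessed.
    return {c: [table[c][i] for i in matches] for c in columns}


german_amounts = scan(TABLE, ["amount"], "country", lambda c: c == "DE")
```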

Supports advanced analytics

Eliminating data warehouses and other silos turns lakehouses into that long-promised central source of truth. Business intelligence teams can use tools like Tableau to analyze current, historical, and real-time data to produce timely insights for decision-makers. Data scientists can leverage data lakehouses to develop machine learning, artificial intelligence, and other big data analytics projects.

Democratizes data access

Since data lakehouses provide a robust metadata layer, governance teams can develop the controls needed to democratize data access without compromising security or privacy. Analytics is no longer limited to data scientists and engineers. With the right analytics layer, non-technical users can bring more data into their decision-making processes.

Example of data lakehouse and analytics architecture

A data lakehouse analytics architecture consists of several elements. Commodity storage and compute infrastructure from data platforms like Microsoft’s Azure and Amazon’s AWS offers affordability and scalability.

Unlike data lakes, however, lakehouses use open table formats like Iceberg and Delta Lake on top of open file formats like Parquet and ORC to make enterprise data more portable and performant.

In addition, lakehouses leverage high-performance query engines like Spark or Trino to handle data processing at scale.

Starburst’s modern data lake analytics solution expands upon the general data lakehouse architecture to give enterprises optionality and a more robust data storage infrastructure.

Optionality and Starburst

Starburst abstracts data sources, including data lakehouses, to create a virtualized access layer that unifies an enterprise’s data architecture behind a single point of access. As a result, enterprises have the optionality to build their data lakes on whatever combination of Amazon AWS, Microsoft Azure, or Google Cloud they use in their hybrid or multi-cloud architectures.

Starburst’s open table format, open file format, and multi-engine support let companies balance compute costs and performance while reducing data movement and associated costs.

Starburst Galaxy and the Great Lakes Connector

Starburst Galaxy’s Great Lakes feature is a single connector for multiple storage systems, table formats, and file formats. Engineers can quickly configure file and table formats from Galaxy’s interface. Everything is transparent to end-users, allowing them to run queries without knowing anything about the data source’s design.

Related reading: Data lakehouse architecture | When to adopt it

Data lakehouse case study | use cases

Starburst enables many data lakehouse use cases. Consider the case of 7bridges, an AI-powered supply chain management platform that replaced its relational databases with a data lakehouse and Starburst Galaxy to access data faster and streamline decision-making.

The company’s growth ran into the limits of its database architecture as queries took longer to execute and non-technical users struggled to access data. Although 7bridges’ data platform handled current workloads, it would not scale with larger data volumes and greater complexity.

At first, 7bridges based its lakehouse implementation on Delta Lake and the Trino query engine. It became apparent that this approach would consume too much time and too many resources.

“We chose Galaxy because of the flexibility it offers to connect to so many different types of tools and data sources,” 7bridges Lead Data Engineer Simon Thelin said. “Galaxy allows us to use Lakehouse tables for both transformations and reporting, and on top of that, Galaxy provides access to multiple data formats. This ensures that we can stay flexible and iterate quickly as the Lakehouse technology evolves.”

With Starburst, the 7bridges data lakehouse returned significant results, including:

98% faster query speeds
Shortened development cycles
More accessible data
Optimized infrastructure costs

In addition to streamlining data lakehouse management, 7bridges has enhanced its customer experience. Clients can access their supply chain data faster. They can also better integrate historical and new data to analyze trends and develop better insights for agile decision-making. As a result, clients are more satisfied with their 7bridges platform.