
What is the value of a data lakehouse?

Lakehouses offer a massive performance benefit when compared to traditional data lakes, while also offering many additional features.

In this post, we take a look at the advantages that data lakehouses hold regarding: (1) performance, (2) cost, (3) flexibility, and (4) compliance.


Increased performance

Benefits without sacrifice

Importantly, data lakehouse benefits are achieved without sacrificing anything in return. Organizations that adopt a lakehouse, or modern data lake, use the same cloud-based object storage or HDFS as before, but gain significant performance enhancements and additional features on top of it.

Open table formats

How does this happen? Lakehouses employ a more modern architecture, which enables organizations to rebuild their workflows more efficiently. Often, this lets them move away from older technologies altogether, particularly Hive. Although Hive was advanced for its time, more modern open table formats like Iceberg and Delta Lake offer better performance and a host of features that Hive does not.

More features, better performance

The two differences, added features and better performance, work together. Often, a lakehouse table format performs better because its newer features allow workloads to be processed in novel and efficient ways.

For instance, Iceberg and Delta Lake allow users to modify data at the row level, on a record-by-record basis. This ensures that only the necessary changes are made, which makes workflows more efficient. To achieve the same results with Hive, changes would often have to be made at the partition level.

This is a good example of Hive’s architecture creating performance drawbacks that a more modern lakehouse architecture solves.
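The difference between partition-level and record-level changes can be illustrated with a small toy model. This is only a conceptual sketch in plain Python: the function names, the `day` partition key, and the in-memory rows are all hypothetical, and real table formats work with data files and metadata rather than Python dictionaries. It simply counts how many rows have to be written to change one record under each approach.

```python
from collections import defaultdict


def build_table(rows):
    """Group rows by a hypothetical partition key ("day")."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["day"]].append(dict(row))
    return partitions


def partition_level_update(partitions, day, record_id, new_value):
    """Partition-level sketch: rebuild the whole partition to change one record."""
    rewritten = []
    for row in partitions[day]:
        row = dict(row)
        if row["id"] == record_id:
            row["value"] = new_value
        rewritten.append(row)
    partitions[day] = rewritten
    return len(rewritten)  # number of rows written


def record_level_update(partitions, day, record_id, new_value):
    """Record-level sketch: write only the matching record."""
    written = 0
    for row in partitions[day]:
        if row["id"] == record_id:
            row["value"] = new_value
            written += 1
    return written  # number of rows written


# Hypothetical data: 1,000 rows that all fall in one partition.
rows = [{"id": i, "day": "2024-01-01", "value": 0} for i in range(1000)]
partitioned_table = build_table(rows)
record_level_table = build_table(rows)

partition_rows_written = partition_level_update(partitioned_table, "2024-01-01", 42, 99)
record_rows_written = record_level_update(record_level_table, "2024-01-01", 42, 99)

print(partition_rows_written, record_rows_written)
```

Both tables end up in the same state, but the partition-level path writes every row in the partition while the record-level path writes only the one that changed.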

Reduced costs

Lakehouses allow businesses to reduce costs and improve query performance simultaneously. This is achieved in a number of ways.

Moving past Hive

Typically, users migrating to Iceberg or Delta Lake will be moving from Hive. Hive’s architecture is outdated and lacks features such as record-level updates. This slows performance, which hurts productivity and also increases spending on cloud resources.

Slow queries increase costs

Longer query times mean more money spent on compute resources. In this way, modern lakehouse architecture not only increases efficiency but also decreases costs.
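As a rough illustration of that relationship, the sketch below assumes compute is billed per node-hour. The function name and every figure in it (node count, hourly price, query durations) are made up for the example and are not taken from any vendor's pricing.

```python
def compute_cost(query_seconds, nodes, price_per_node_hour):
    """Cost of one query, assuming simple per-node-hour billing (an assumption)."""
    return query_seconds / 3600 * nodes * price_per_node_hour


# Hypothetical figures: the same query on the same cluster, before and after
# a change that halves its runtime.
slow_cost = compute_cost(query_seconds=1800, nodes=10, price_per_node_hour=2.0)
fast_cost = compute_cost(query_seconds=900, nodes=10, price_per_node_hour=2.0)

print(slow_cost, fast_cost)  # 10.0 5.0
```

Under this simplified billing model, cost scales linearly with runtime, so halving the query time halves the compute spend for that query.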

Greater flexibility

Lakehouses offer better flexibility. Whereas traditional data lakes based on cloud object storage offer only limited abilities to update or delete records, lakehouses offer full create, read, update, and delete (CRUD) capabilities. This provides a more database-like experience built on top of the same cloud object storage infrastructure, enabling scenarios that are either impossible or impractical in a traditional data lake.

Advantages include: 

  • Improved row-level updates 
  • ACID compliance
  • Enhanced support for transactional systems

Meeting compliance

Data lakehouses offer better governance and compliance when compared to traditional data lakes. There are a number of reasons for this.

Overcoming immutability problems

Traditional data lakes are built on cloud object storage, in which stored objects are typically immutable. This means that individual records cannot easily be updated or deleted. This can create a governance issue, as many jurisdictions require the ability to delete data on request.

Complying with data protection legislation

This can put data lakes in a difficult position when attempting to comply with certain legal requirements. These include the General Data Protection Regulation (GDPR) in the European Union (EU) and the California Consumer Privacy Act (CCPA) in California.

Leveraging record-level details

Data lakehouses solve this issue by maintaining metadata transaction logs and snapshot files that detail all of the changes made to a table. With this log in place, record-level deletions become possible, along with the ability to roll back a table to a previous state or query it as of a particular point in time. All of these features help data lakehouses maintain governance control over the data inside them, helping the organizations involved remain both GDPR and CCPA compliant.
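To make the idea of a snapshot log more concrete, here is a minimal toy model in plain Python. It is not how Iceberg or Delta Lake are implemented, and the `SnapshotTable` class and its method names are hypothetical. It only shows how an append-only list of table states can support record-level deletes, queries as of an earlier snapshot, and rollback.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotTable:
    """Toy table that keeps every committed state in an append-only log."""

    # Snapshot 0 is the empty table.
    snapshots: list = field(default_factory=lambda: [{}])

    def current(self):
        return self.snapshots[-1]

    def _commit(self, new_state):
        self.snapshots.append(new_state)
        return len(self.snapshots) - 1  # id of the new snapshot

    def upsert(self, key, record):
        state = dict(self.current())
        state[key] = record
        return self._commit(state)

    def delete(self, key):
        """Record-level delete: commit a new state without the given key."""
        state = dict(self.current())
        state.pop(key, None)
        return self._commit(state)

    def as_of(self, snapshot_id):
        """Read the table as it was at an earlier snapshot."""
        return self.snapshots[snapshot_id]

    def rollback(self, snapshot_id):
        """Make an earlier state current again by committing a copy of it."""
        return self._commit(dict(self.snapshots[snapshot_id]))


table = SnapshotTable()
s1 = table.upsert("u1", {"email": "a@example.com"})
s2 = table.upsert("u2", {"email": "b@example.com"})
s3 = table.delete("u1")  # e.g. a deletion request for one user

after_delete = sorted(table.current())   # only "u2" remains
before_delete = sorted(table.as_of(s2))  # "u1" and "u2" are both visible

s4 = table.rollback(s2)
after_rollback = sorted(table.current())
```

In this toy model, older snapshots still hold the deleted record, which is exactly what makes time travel and rollback work. How long old snapshots are retained in a real system is a separate policy question that this sketch does not address.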

 
