In this post, we look at the advantages that data lakehouses hold in four areas: (1) performance, (2) cost, (3) flexibility, and (4) compliance.
Importantly, these benefits come without sacrificing anything in return. Organizations that adopt a lakehouse, or modern data lake, use the same cloud-based object storage or HDFS as before, but gain significant performance enhancements and additional features on top of it.
How does this happen? Lakehouses employ a more modern architecture, which enables organizations to remake their workflows more efficiently. Often, this lets them move away from older technologies altogether, particularly Hive. Although Hive was advanced for its time, modern open table formats like Iceberg and Delta Lake offer better performance and a host of features Hive does not.
These two differences, added features and performance, work together. Often, a lakehouse table format performs better precisely because its newer features allow workloads to be processed in novel and efficient ways.
For instance, Iceberg and Delta Lake let users insert, update, or delete data on a record-by-record basis. This ensures that only the necessary changes are made, which enables better workflows and more efficient choices. To achieve the same results with Hive, changes would often have to be made at the partition level.
This is a good example of Hive’s architecture creating performance drawbacks which are solved by more modern lakehouse architecture.
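As an illustrative sketch, the contrast looks roughly like this in SQL: a record-level change on an Iceberg or Delta Lake table is a single statement, while a Hive-style table typically requires overwriting the whole partition that contains the record. (The table, column, and partition names below are hypothetical, and exact syntax varies by engine.)

```sql
-- Modern table format (Iceberg / Delta Lake): change one record in place.
UPDATE customers
SET email = 'new@example.com'
WHERE customer_id = 42;

-- Hive-style workflow: rewrite the entire partition containing that record.
INSERT OVERWRITE TABLE customers PARTITION (region = 'us-east')
SELECT customer_id,
       CASE WHEN customer_id = 42 THEN 'new@example.com' ELSE email END AS email
FROM customers
WHERE region = 'us-east';
```

The second statement reads and rewrites every row in the partition just to change one of them, which is where much of Hive's performance and cost overhead comes from.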
Lakehouses allow businesses to reduce costs and improve query performance simultaneously. This is achieved in a number of ways.
Typically, users migrating to Iceberg or Delta Lake are moving from Hive. Hive's architecture is outdated and lacks capabilities like record-level updates. This slows performance, which hurts productivity, and it also increases spending on cloud resources.
Longer query times equal more money spent on compute resources. In this way, modern lakehouse architecture not only increases efficiencies but also decreases costs.
Lakehouses offer better flexibility. Whereas traditional data lakes based on cloud object storage offer only limited abilities to update or delete records, lakehouses offer full CRUD capabilities. This offers a more database-like experience built on top of the same cloud object storage infrastructure, allowing scenarios either impossible or impractical in a data lake.
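As a sketch of this database-like experience, a lakehouse table format supports the full set of CRUD statements directly over object storage (the table and column names here are hypothetical):

```sql
-- Create: inserts work much as they do in a traditional data lake.
INSERT INTO orders (order_id, status) VALUES (1001, 'pending');

-- Read: standard SQL queries.
SELECT status FROM orders WHERE order_id = 1001;

-- Update and delete: impractical on a plain data lake, single statements here.
UPDATE orders SET status = 'shipped' WHERE order_id = 1001;
DELETE FROM orders WHERE order_id = 1001;
```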
Traditional data lakes are built on cloud object storage, where stored objects are typically immutable. This means that records cannot easily be updated or deleted, which can become a governance issue: many jurisdictions require the ability to delete data on request.
This can put data lakes in a difficult position when complying with legal requirements such as the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act (CCPA).
Data lakehouses solve this issue with metadata transaction logs and snapshot files that record every change made to a table. With this log in place, record-level deletions become possible, along with the ability to roll a table back to a previous state or query it as of a particular point in time. These features give data lakehouses governance control over the data inside them, helping organizations remain both GDPR and CCPA compliant.
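To illustrate, in an engine like Trino with an Iceberg table (exact syntax varies by engine and version; the table name, schema name, and snapshot ID below are hypothetical), these capabilities look roughly like this:

```sql
-- Record-level deletion, e.g. for a GDPR or CCPA erasure request.
DELETE FROM users WHERE user_id = 42;

-- Time travel: query the table as of an earlier point in time.
SELECT * FROM users FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';

-- Roll the table back to a previous snapshot recorded in its metadata.
CALL iceberg.system.rollback_to_snapshot('schema_name', 'users', 12345);
```

Each statement is only possible because the table format's metadata tracks every change as a snapshot, rather than treating the underlying object storage as an opaque pile of files.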