An evolutionary approach to managing data at enterprise scales, data lakehouses are more performant, efficient, and compliant than their predecessors. Increasingly, enterprises are adopting these modern data lake architectures to drive advanced analytics within data-driven cultures. This guide will explain data lakehouses, their benefits and structures, and how companies use data lakehouses today.
Last updated: December 1, 2023
A data lakehouse is a centralized data repository that uses cost-effective data storage, usually in the cloud, and a robust metadata layer to optimize compute resources for big data queries. Unlike data warehouses, data lakehouses can store any type of data. Unlike data lakes, data lakehouses have strong governance layers and better support for advanced analytics.
Relational databases are optimized for the highly structured data of traditional business apps like employee records or point-of-sale systems. Data structures are irrelevant to data lakehouses, which store raw data, including unstructured and semi-structured data. The object storage of a data lakehouse offers superior scalability over traditional databases.
Enterprise data management becomes increasingly resource-intensive as data grows in volume and complexity. Data engineers need more time to maintain data systems that become more expensive every year. These pressures were what first drove enterprise adoption of data warehousing. Creating a centralized repository lets data engineering teams manage storage, compute, and data governance within a single system.
Companies soon ran into the limits of their data warehouses. Monolithic, proprietary systems weren’t flexible or scalable enough to meet growing data demands. And vendor lock-in made the cost of proprietary data warehouses unpredictable.
Data lakes seemed to offer a path to lower costs by decoupling storage from compute and switching to more cost-effective open source software. However, the original data lakes only replaced the storage layer of a data warehouse solution. These systems could not replace the analytics and governance capabilities, which led to increasingly complex workarounds.
Enter the data lakehouse. By combining the analytics and governance capabilities of a warehouse with the efficient storage of a lake, this modern architecture delivers multiple benefits.
The purpose of warehouses and lakes is to centralize enterprise data by consolidating datasets from multiple data sources in a single location. By eliminating data silos, these approaches should have reduced duplication and redundancy while making data more accessible.
However, these approaches sometimes do the opposite. Warehouses become cluttered with data formatted for particular workloads. Without robust analytics and management features, lakes often require multiple warehouses to make data usable.
Data lakehouses make data easier to manage and more accessible, which lets organizations eliminate redundant warehouses and break down silos.
Transactional systems are among the most significant enterprise data silos. To ensure data integrity, their processing systems must comply with ACID (Atomicity, Consistency, Isolation, and Durability) standards. Data lakehouses tear down these last silos by supporting ACID transactions. As a result, this data can live in the lakehouse’s centralized data stores and allow the business to draw insights based on the most current data.
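The atomicity described above typically comes from snapshot-based commits: writers stage immutable data files, then publish them by swapping a single current-snapshot pointer in the table metadata, so readers always see a complete snapshot and never a half-written one. The sketch below is a toy illustration of that mechanism, not a real Iceberg or Delta Lake client; all class and file names are invented for the example.

```python
# Toy sketch of snapshot-based atomic commits (not a real table-format API).
# Writers stage new data files, then "commit" by replacing one pointer, so
# readers observe either all of a write or none of it.

class LakehouseTable:
    def __init__(self):
        self._snapshots = {0: []}   # snapshot_id -> list of data files
        self._current = 0           # the single pointer readers follow

    def read(self):
        # Readers resolve the pointer once, then read an immutable snapshot.
        return list(self._snapshots[self._current])

    def commit(self, new_files):
        # Stage a new snapshot: previous files plus the newly written ones.
        base = self._snapshots[self._current]
        new_id = self._current + 1
        self._snapshots[new_id] = base + new_files
        # The only reader-visible mutation is this single pointer swap,
        # which is what makes the append atomic.
        self._current = new_id

table = LakehouseTable()
table.commit(["orders-00001.parquet"])
table.commit(["orders-00002.parquet", "orders-00003.parquet"])
print(table.read())  # all three files are visible together
```

A failed writer that never reaches the pointer swap leaves behind only unreferenced files, which readers never see; this is the property that lets transactional workloads live safely in the lakehouse.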
Iceberg and other open table formats allow lakehouses to collect more varied metadata than lakes. Governance and access control systems can draw on this rich metadata to create granular rules that ensure appropriate access to data and compliance with data regulations.
For example, human resource analysts in Europe can query detailed employee records, while business analysts on another floor would only see aggregated data. Governance rules would prevent analysts in an American office from moving employee data out of European data storage locations.
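The access rules in that example can be sketched as policy functions driven by metadata tags (analyst role and region, data region). This is an illustrative stand-in, not a real governance API; the roles, field names, and rules are hypothetical.

```python
# Hypothetical governance sketch: metadata-driven rules decide whether an
# analyst sees detailed rows, aggregates only, or is blocked from exporting.

RECORDS = [
    {"employee": "A", "region": "EU", "salary": 60000},
    {"employee": "B", "region": "EU", "salary": 80000},
]

def query_employees(role, analyst_region):
    if role == "hr_analyst" and analyst_region == "EU":
        # EU HR analysts may read detailed rows stored in the EU.
        return RECORDS
    # Everyone else only sees aggregated, de-identified data.
    avg = sum(r["salary"] for r in RECORDS) / len(RECORDS)
    return {"row_count": len(RECORDS), "avg_salary": avg}

def export_allowed(analyst_region, data_region):
    # Residency rule: EU employee data may not leave EU storage locations.
    return not (data_region == "EU" and analyst_region != "EU")

print(query_employees("hr_analyst", "EU"))        # detailed rows
print(query_employees("business_analyst", "US"))  # aggregate only
print(export_allowed("US", "EU"))                 # False
```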
Replacing proprietary data warehouse solutions with cloud object storage lets companies manage their data more efficiently. They no longer need separate storage systems to handle different data structures. Lakehouses can store structured and unstructured data just as easily.
Lakehouses also simplify the maintenance of data pipelines. Since the lakehouse stores raw data, the ETL pipelines at ingestion can be less complex without compromising data quality. Dedicated ELT pipelines for each data product handle the final transformation without altering the lakehouse’s repository.
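The ingest/transform split described above can be sketched as follows: ingestion lands raw events with minimal shaping, while each data product applies its own transformation over the raw store without modifying it. The store and function names are illustrative, not part of any real pipeline framework.

```python
# Sketch of the lakehouse EL/T split: land raw data cheaply, transform per
# data product on read, and never mutate the raw repository.

RAW_STORE = []  # stands in for cheap object storage of raw records

def ingest(event):
    # "EL" step: land the event as-is, only stamping ingestion metadata.
    RAW_STORE.append({**event, "_ingested": True})

def revenue_by_country(raw):
    # "T" step for one data product; other products define their own.
    totals = {}
    for e in raw:
        totals[e["country"]] = totals.get(e["country"], 0) + e["amount"]
    return totals

ingest({"country": "DE", "amount": 10})
ingest({"country": "DE", "amount": 5})
ingest({"country": "FR", "amount": 7})

print(revenue_by_country(RAW_STORE))  # {'DE': 15, 'FR': 7}
```

Because the raw records are never altered, a new data product can be added later with its own transform, without re-ingesting anything.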
Data lakes promised to decouple storage from compute, letting data teams optimize their investments in each. Lakehouses are more performant thanks to their columnar, read-optimized open table formats, which support performance-boosting features like data skipping and partition handling.
Pairing data lakehouse storage layers with efficient, high-performance query engines accelerates analysis, making this architecture as performant as, if not better than, a data warehouse. Query engines that support features like in-memory execution, predicate pushdown, and columnar reads can achieve incredibly fast results without excessive compute costs.
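Partition handling and data skipping both work by consulting table metadata before reading any data. The toy planner below mimics the per-file statistics that open table formats keep; the file paths and statistics are invented for the illustration.

```python
# Toy scan planner: partition pruning and min/max data skipping let a query
# engine discard files using metadata alone, before touching any data.

FILES = [
    {"path": "sales/day=2023-11-01/f1", "day": "2023-11-01",
     "amount_min": 1, "amount_max": 50},
    {"path": "sales/day=2023-11-02/f2", "day": "2023-11-02",
     "amount_min": 5, "amount_max": 99},
    {"path": "sales/day=2023-11-02/f3", "day": "2023-11-02",
     "amount_min": 200, "amount_max": 900},
]

def plan_scan(day, min_amount):
    # Partition pruning: drop files belonging to other partitions.
    candidates = [f for f in FILES if f["day"] == day]
    # Data skipping: drop files whose max value cannot satisfy the predicate.
    return [f["path"] for f in candidates if f["amount_max"] >= min_amount]

# A query for day = 2023-11-02 AND amount >= 100 touches one file of three.
print(plan_scan("2023-11-02", 100))
```

Predicate pushdown extends the same idea across layers: the engine sends the filter down to the storage format so unneeded rows are never materialized.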
Eliminating data warehouses and other silos turns lakehouses into that long-promised central source of truth. Business intelligence teams can use tools like Tableau to analyze current, historical, and real-time data to produce timely insights for decision-makers. Data scientists can leverage data lakehouses to develop machine learning, artificial intelligence, and other big data analytics projects.
Since data lakehouses provide a robust metadata layer, governance teams can develop the controls needed to democratize data access without compromising security or privacy. Analytics is no longer limited to data scientists and engineers. With the right analytics layer, non-technical users can bring more data into their decision-making processes.
A data lakehouse analytics architecture consists of several elements. Commodity storage and compute infrastructure from data platforms like Microsoft’s Azure and Amazon’s AWS offer affordability and scalability.
Unlike data lakes, however, lakehouses use advanced open table and file formats like Iceberg, Delta Lake, Parquet, and ORC to make enterprise data more portable and performant.
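A large part of what makes Parquet and ORC performant is their columnar layout: a row store must touch every field of every record, while a column store reads only the columns a query needs. The pure-Python sketch below illustrates the layout difference; it is not a real Parquet or ORC reader.

```python
# Row layout vs. columnar layout, in miniature. An aggregate query over one
# column needs only that column's contiguous array in a columnar format.

rows = [
    {"id": 1, "country": "DE", "amount": 10, "note": "long free text"},
    {"id": 2, "country": "FR", "amount": 7, "note": "long free text"},
]

# Columnar layout: one contiguous array per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# SELECT sum(amount): a columnar engine fetches a single array and ignores
# the wide "note" column entirely.
print(sum(columns["amount"]))  # 17
```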
In addition, data lakehouses leverage high-performance query engines like Spark or Trino to handle data processing at scale.
Starburst’s modern data lake analytics solution expands upon the general data lakehouse architecture to give enterprises optionality and a more robust data storage infrastructure.
Starburst abstracts data sources, including data lakehouses, to create a virtualized access layer that unifies an enterprise’s data architecture behind a single point of access. As a result, enterprises have the optionality to build their data lakes on whatever combination of Amazon AWS, Microsoft Azure, or Google Cloud they use in their hybrid or multi-cloud architectures.
Starburst’s open table format, open file format, and multi-engine support let companies balance compute costs and performance while reducing data movement and associated costs.
Starburst Galaxy’s Great Lakes feature is a single connector for multiple storage systems, table formats, and file formats. Engineers can quickly configure file and table formats from Galaxy’s interface. Everything is transparent to end-users, allowing them to run queries without knowing anything about the data source’s design.
Starburst enables many data lakehouse use cases. Consider 7bridges, an AI-powered supply chain management platform that replaced its relational databases with a data lakehouse and Starburst Galaxy to access data faster and streamline decision-making.
The company’s growth ran into the limits of its database architecture as queries took longer to execute and non-technical users struggled to access data. Although 7bridges’ data platform handled current workloads, it would not scale with large data volumes and complexity.
At first, 7bridges based its lakehouse implementation on Delta Lake and the Trino query engine. It became apparent that this approach would consume too much time and resources.
“We chose Galaxy because of the flexibility it offers to connect to so many different types of tools and data sources,” 7bridges Lead Data Engineer Simon Thelin said. “Galaxy allows us to use Lakehouse tables for both transformations and reporting, and on top of that, Galaxy provides access to multiple data formats. This ensures that we can stay flexible and iterate quickly as the Lakehouse technology evolves.”
With Starburst, the 7bridges data lakehouse delivered significant results.
In addition to streamlining data lakehouse management, 7bridges has enhanced its customer experience. Clients can access their supply chain data faster. They can also better integrate historical and new data to analyze trends and develop better insights for agile decision-making. As a result, clients are more satisfied with their 7bridges platform.