Data lakehouses have emerged in the past five years as a middle ground between data lakes and data warehouses. For decades, data warehouses were the primary way to bring data from across an organization together into a central, unified location for subsequent analysis. Data are extracted from source systems via an “extract, transform, and load” (ETL) process, integrated, stored inside dedicated data warehouse software such as Oracle Exadata, Teradata, Vertica, or Netezza products, and made available for data scientists and data analysts to perform detailed analyses.
These data warehouses stored data in highly optimized storage formats in order to analyze it at high performance and throughput, so that data analysts could experience near-interactive latency even when analyzing very large datasets. However, because the software was so complex, these data warehouse solutions were expensive and priced by the size of the data stored. Storing truly large datasets in them was therefore often prohibitively expensive, especially when it was unclear in advance whether those datasets would provide significant value to data analysis tasks. Furthermore, data warehouses typically required upfront data cleaning and schema declaration, which involved non-trivial human effort that is wasted if the dataset ends up not being used for analysis.
Data lakes therefore emerged as much cheaper alternatives to storing large amounts of data in data warehouses. They were typically built from relatively simple free and open source software, so the only costs of storing data in a home-built data lake were the hardware for the cluster of servers running the data lake software and the labor of the employees overseeing the deployment. Furthermore, data could be dumped into data lakes without upfront cleaning, integration, semantic modeling, and schema generation, which made data lakes an attractive place to store datasets whose value for analysis tasks had yet to be determined. Data lakes allowed for a “store first, organize later” approach.
Nonetheless, over time, some subset of the data in a data lake proves to be highly valuable for data analysis tasks, and the human effort to clean, integrate, and define schemas for it becomes justified. Historically, this data was then moved from the data lake to the data warehouse, despite the increased cost of storing it there.
More recently, data lakehouses have emerged as an alternative to moving this data to a data warehouse. Instead, the data can remain in the data lake, stored using read-optimized open data formats, and a lakehouse (specialized software running on the data lake) handles the management of the schema, metadata, and other administrative functions that have historically been handled by the data warehouse.
In addition to managing the schema and metadata of data stored in the data lake, a data lakehouse typically also provides a query interface through which queries over the data it manages can run. These queries run at high performance, in parallel across the servers in the data lake, using scalable query processing techniques similar to those used in high-end data warehouses.
Unfortunately, data lakehouses today are often limited to querying the data that they manage within the data lake. They implement extremely powerful query engines, yet fundamentally provide a narrow view of the data within an organization, since they can only query data in a given data lake. Most organizations, however, keep their most valuable and heavily used datasets in traditional data warehouses or other high-end database management software. Queries run by the lakehouse must therefore ignore the most valuable data owned by the organization.
Data virtualization and data lakehouse
Some of the newer approaches to implementing data lakehouse software include data virtualization technology that solves this problem by allowing lakehouse users to include data stored in external systems within queries over the lakehouse. At a high level, all data within the organization is virtualized: the lakehouse provides a unified interface through which all data within that organization can be queried and joined together, both data in the data lake and data in external systems such as traditional data warehouses. The lakehouse user can query this virtualized data as if it were all physically stored together, and the lakehouse software takes care of the complexities of combining data stored in physically different locations and in different types of software.
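As a rough illustration of the unified interface described above, the following Python sketch uses only the standard library. The names `VirtualCatalog`, `register_csv`, and `register_sqlite`, along with the sample tables, are hypothetical and are not drawn from any particular lakehouse product. An in-memory SQLite database stands in for an external data warehouse, and a CSV string stands in for a file in the data lake. For simplicity the sketch copies data into a scratch engine at registration time, whereas a real system would be expected to fetch or push down work at query time.

```python
import csv
import io
import sqlite3


class VirtualCatalog:
    """Toy virtualization layer: one SQL interface over several sources."""

    def __init__(self):
        # Scratch engine in which the combined query is executed.
        self._engine = sqlite3.connect(":memory:")

    def _load(self, table, columns, rows):
        cols = ", ".join(columns)
        marks = ", ".join("?" for _ in columns)
        self._engine.execute(f"CREATE TABLE {table} ({cols})")
        self._engine.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)

    def register_csv(self, table, text):
        """Expose CSV text (stand-in for a data lake file) as a table."""
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        self._load(table, header, list(reader))

    def register_sqlite(self, table, conn, source_table):
        """Expose a table from an external database as a virtual table."""
        cur = conn.execute(f"SELECT * FROM {source_table}")
        columns = [d[0] for d in cur.description]
        self._load(table, columns, cur.fetchall())

    def query(self, sql):
        return self._engine.execute(sql).fetchall()


# Hypothetical external warehouse holding curated customer data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (customer_id TEXT, segment TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c1", "enterprise"), ("c2", "startup")],
)

# Hypothetical raw clickstream file sitting in the data lake.
lake_file = "customer_id,page\nc1,/home\nc1,/pricing\nc2,/home\n"

catalog = VirtualCatalog()
catalog.register_csv("clicks", lake_file)
catalog.register_sqlite("customers", warehouse, "customers")

# The user joins both sources as if they were stored together.
result = catalog.query(
    "SELECT c.segment, COUNT(*) FROM clicks AS k "
    "JOIN customers AS c ON k.customer_id = c.customer_id "
    "GROUP BY c.segment ORDER BY c.segment"
)
print(result)
```

The user-facing query names only logical tables. Where each table physically lives is a detail of the catalog, which is the property the paragraph above describes.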
In 2024 we predict that data virtualization technology will become a core component of data lakehouse solutions. Data lakehouse query processing software has become extremely powerful, providing scalable query performance over petabytes of data stored in a data lake. It is a waste if that software cannot query an organization’s most valuable datasets stored outside of the data lake! Data virtualization enables the lakehouse to realize much more of its potential by deploying its high performance query processing software on a much broader set of data.
O’Reilly Data Virtualization in the Cloud Era
We are currently writing a book on data virtualization in the cloud era. In the book, we discuss some of the technical challenges behind data virtualization and how advances in networking hardware and machine learning technology have enabled data virtualization to work for modern applications in areas where it did not work in the past.
We specifically focus on data lake use cases and on the differences between pull-based systems (in which query processing is performed by the data virtualization software itself) and push-based systems (in which query processing is pushed down to the underlying systems that store the data being virtualized).
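The distinction between pull-based and push-based processing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any specific product works: the function names `pull_based_total` and `push_based_total` and the `orders` table are hypothetical, and an in-memory SQLite database stands in for the underlying system that stores the data.

```python
import sqlite3


def make_source():
    """Stand-in for an external system holding the data being virtualized."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 10), ("west", 25), ("east", 5), ("west", 40), ("east", 20)],
    )
    return conn


def pull_based_total(source, region):
    """Pull: fetch every row, then filter and aggregate in the virtualization layer."""
    rows = source.execute("SELECT region, amount FROM orders").fetchall()
    total = sum(amount for r, amount in rows if r == region)
    return total, len(rows)  # (answer, number of rows transferred)


def push_based_total(source, region):
    """Push: the source runs the filter and aggregate; only the answer comes back."""
    rows = source.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = ?",
        (region,),
    ).fetchall()
    return rows[0][0], len(rows)  # (answer, number of rows transferred)


source = make_source()
pull_total, pull_rows = pull_based_total(source, "east")
push_total, push_rows = push_based_total(source, "east")
print(pull_total, pull_rows)
print(push_total, push_rows)
```

Both functions compute the same answer, but the pull-based version transfers every row of the table while the push-based version transfers a single aggregated row. In this toy setting the push-based approach moves less data, although it relies on the underlying system being able to execute the pushed-down work.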
Readers of the book will gain a better understanding of how data virtualization software works, some practical pitfalls that data virtualization users may run into, and how to tune these systems to achieve better performance. We also discuss how data virtualization fits into the modern data mesh and data fabric paradigms.
Published via VMblog