A data lakehouse is a relatively new form of data repository. It combines the benefits of two tested data storage systems: a data warehouse and a data lake. Read on to learn more about a data lakehouse.
How did we get to a data lakehouse?
The data warehouse came first. Developed in the late 1980s to let organizations house data from various sources in one place, it became popular because it gives data the most structure and supports reporting and business intelligence (BI). It remains a tried-and-true data repository. However, populating a warehouse mostly means copying and moving data out of the source systems, and that data has to be cleaned and processed before it is loaded. This is a very time-consuming process, so data users can end up spending most of their time on extract, transform, load (ETL) work and less on analytics. A warehouse also doesn’t accommodate unstructured and varied types of data, like photos and videos, which limits the analytics that can be performed.
So in 2010, along came the data lake. A data lake allows users to hold a wide variety of unstructured data and perform data science operations on it. However, performing BI and analytics on data in a data lake proves to be a struggle: without enforced structure, data is unreliable much of the time and can get lost, which is why the data lake is now often referred to as a “data swamp.” In the past, many organizations tried to use both a warehouse and a lake, but this created an endless cycle of copying and moving data, along with inefficiencies and data silos.
Finally, we arrived at the data lakehouse, a combination of a warehouse and a lake. The data lakehouse is still relatively new in the big data world and has had difficulty displacing the tried-and-true data warehouse in the market. However, in a previous discussion, Justin Borgman, Kamil Bajda-Pawlikowski, and Dr. Daniel Abadi discussed the importance of sharing data both within and outside of an organization, and how a lakehouse architecture is the key to doing this.
Can you have the best of both worlds?
The data lakehouse seeks to provide a happy medium between a data warehouse and a data lake. Users can store varied types of data while still being able to perform important reporting and business intelligence analytics. Data lakes provide an inexpensive storage option that allows organizations to store all kinds of data. By combining the benefits of a data warehouse with the benefits of a data lake, organizations can reduce costs while achieving a higher level of efficiency and performance. Open table formats in a lakehouse also allow organizations to perform data warehouse-like queries in a data lake. A data lakehouse is applicable to any organization. It is particularly useful to organizations with a more robust data science team, data-native companies, and those that use data as a differentiator to create a better experience for their customers.
The debate over data infrastructures is never-ending. But ultimately, each organization has different needs, different data, and different teams, and will therefore need a different architecture. There is no right or wrong answer. The world of big data has come to realize that achieving a single point of truth is merely a pipe dream. However, it is crucial to be able to access all your data in order to make the fastest and most accurate decisions. A data lakehouse is as close as possible to a single point of access. It eliminates the need to maintain both a data warehouse and a data lake.
Starburst as the analytics engine for the data lakehouse (or for the data lake)
Starburst allows organizations to bring all forms of data, whether raw, structured, or unstructured, into their data lakehouse and still run best-in-class, high-performance queries on that data. Data can be ingested into the data lakehouse through Starburst, and ETL is then performed on the data to transform it into a structured form. A second transformation within Starburst applies business logic so that the data can be consumed in a business intelligence tool. By using Starburst as your data lake engine, you can empower your organization to quickly deploy and access your data lakehouse and to apply BI analytics to it.
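The two transformation stages described above can be illustrated with a minimal Python sketch. This is not Starburst code or its API: the sample records, the field names, and the functions `to_structured` and `apply_business_logic` are all hypothetical, chosen only to show raw data first being given structure and then being shaped by a business rule for BI consumption.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a lake: inconsistent casing,
# stray whitespace, amounts stored as strings, and one malformed record.
RAW_EVENTS = [
    {"customer": " Acme ", "amount": "120.50", "country": "us"},
    {"customer": "Globex", "amount": "80", "country": "DE"},
    {"customer": "acme", "amount": "not-a-number", "country": "US"},
    {"customer": "ACME", "amount": "30.00", "country": "US"},
]


def to_structured(raw_events):
    """Stage 1: clean and type raw records, skipping rows that cannot be parsed."""
    rows = []
    for event in raw_events:
        try:
            amount = float(event["amount"])
        except (KeyError, ValueError):
            continue  # malformed amount: leave the record out of the structured set
        rows.append(
            {
                "customer": event["customer"].strip().lower(),
                "amount": amount,
                "country": event["country"].upper(),
            }
        )
    return rows


def apply_business_logic(rows):
    """Stage 2: apply an assumed business rule, here total revenue per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


structured = to_structured(RAW_EVENTS)
revenue_by_customer = apply_business_logic(structured)
print(revenue_by_customer)
```

In a real deployment both stages would more likely be expressed as SQL run by the query engine; plain Python is used here only to keep the sketch self-contained.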
Starburst gives organizations the tools they need to construct a lakehouse by providing a query abstraction layer on top of an organization’s data. All data can be queried together, whether it is stored in a data warehouse, data lake, or data lakehouse, and whether it is structured or unstructured.
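To make the idea of a query abstraction layer concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in. It does not use Starburst or its API. Two separate in-memory databases, given the assumed names `warehouse` and `lake`, are attached to one connection so that a single SQL statement can join across them; the table names and rows are invented for illustration.

```python
import sqlite3

# One connection plays the role of the query layer. Each ATTACH of ':memory:'
# creates a separate in-memory database, standing in for a separate system.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Curated, structured data of the kind a warehouse would hold (hypothetical).
conn.execute("CREATE TABLE warehouse.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Event data of the kind that would sit in a lake (hypothetical).
conn.execute("CREATE TABLE lake.click_events (customer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO lake.click_events VALUES (?, ?)",
    [(1, "/pricing"), (1, "/docs"), (2, "/pricing")],
)


def clicks_per_customer(connection):
    """Join across both stores in one query, without copying data between them."""
    query = """
        SELECT c.name, COUNT(*) AS clicks
        FROM warehouse.customers AS c
        JOIN lake.click_events AS e ON e.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


result = clicks_per_customer(conn)
print(result)
```

The point of the sketch is the shape of the query: the caller names tables by where they live and lets the engine resolve them, rather than first moving everything into one repository.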