This article defines what a data lake is, explains why data lakes are important, and compares them with other big data storage technologies, including:
- Databases
- Data warehouses
- Data lakehouses
It also discusses the common pros and cons that you might expect when operating a data lake. Finally, it outlines how data lakehouses represent the future of data lake technology, including the Starburst Icehouse architecture using Apache Iceberg and Trino.
What is a data lake?
A data lake is a type of data architecture used to store large amounts of data flexibly and cost-effectively, using either cloud object storage or on-premises servers running the Hadoop Distributed File System (HDFS) or local object storage.
In recent years, cloud computing has grown quickly. Cloud-based data lakes running on AWS, Microsoft Azure, or Google Cloud Platform have become the default cloud storage solution for many organizations.
To store this data, lakes traditionally use the Apache Hive table format. Hive works either on object storage or on HDFS and supports various file formats, including Apache ORC and Apache Parquet. Lakes also store data in its raw format, leaving transformation to a later stage. This approach makes data lakes particularly suitable for storing large amounts of data. This data can later be used for data analysis, machine learning (ML), or artificial intelligence (AI) modeling.
Data Lake vs Database
Databases handle the day-to-day transactional data acquired by an organization and, as such, represent the foundation of any data stack. Databases keep a record of each transaction, whether that involves creating, reading, updating, or deleting data (CRUD).
Data lakes operate as a storage layer for raw data as it is prepared for analysis, machine learning, or other purposes. To do this, lakes separate the analytic workload from the transactional one. This helps eliminate performance bottlenecks that arise from performing transactional and analytic workloads on the same system. This is one example of the division between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) systems. Most databases are OLTP, whereas most lakes are OLAP.
Data Lake vs Data Warehouse
Unlike OLTP databases, both data lakes and data warehouses traditionally handle OLAP workloads. In this sense, you can think of data lakes and data warehouses as two different solutions to the OLAP problem.
One of the key differences between data lakes and data warehouses involves the handling of data structure. Data warehouses require all data entering the warehouse to be structured into tables made up of columns and rows. Any data that does not conform to a preset schema must be transformed into the correct structure using a process known as ETL. Using this method, data is extracted, transformed, and loaded into place. This process is known as schema on write because the schema is applied when the data is written to the warehouse.
Although the ETL process is costly and time-consuming, once complete, the resulting data warehouse is highly performant. For traditional data workloads, which are often pre-structured, the data warehouse is a natural fit. For more modern workloads, where the data entering the warehouse is often semi-structured (for example, JSON), the cost of ETL becomes a significant drawback.
In contrast, data lakes store data of any type in its raw format. This includes structured, semi-structured, and unstructured data. Transformation is performed only when it is needed, using an approach known as schema on read, in which the schema is applied when the data is queried rather than when it is stored.
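The difference between schema on write and schema on read can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular warehouse or lake engine works: the sample events, the SCHEMA mapping, and every function name here are hypothetical.

```python
import json

# Hypothetical raw events as they might arrive from an application.
RAW_EVENTS = [
    '{"user_id": 1, "amount": "19.99", "country": "DE"}',
    '{"user_id": 2, "amount": "5.00"}',
    '{"user_id": "n/a", "amount": "oops"}',
]

# Hypothetical target schema: field name -> type to cast to.
SCHEMA = {"user_id": int, "amount": float}


def apply_schema(record: dict) -> dict:
    """Cast each schema field to its declared type; raise on bad data."""
    return {name: cast(record[name]) for name, cast in SCHEMA.items()}


def load_schema_on_write(lines):
    """Warehouse style: validate and transform before anything is stored."""
    table, rejected = [], []
    for line in lines:
        try:
            table.append(apply_schema(json.loads(line)))
        except (KeyError, ValueError, TypeError):
            rejected.append(line)
    return table, rejected


def load_schema_on_read(lines):
    """Lake style: store the raw lines untouched."""
    return list(lines)


def query_total_amount(raw_store):
    """Apply the schema only at query time, skipping rows that do not fit."""
    total = 0.0
    for line in raw_store:
        try:
            total += apply_schema(json.loads(line))["amount"]
        except (KeyError, ValueError, TypeError):
            continue
    return total


warehouse_table, rejected_rows = load_schema_on_write(RAW_EVENTS)
lake_store = load_schema_on_read(RAW_EVENTS)
```

In this sketch the warehouse path pays the transformation cost up front and rejects the malformed row at load time, while the lake path keeps all three raw rows and defers the same work to query_total_amount.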
Data Lake vs Data Lakehouse
Data lakehouses are best understood as the next generation of data lakes. They typically make use of cloud object storage, but they do not use Apache Hive. Instead, they use one of the three data lakehouse table formats:
- Apache Iceberg
- Delta Lake
- Apache Hudi
All three of these table formats work by collecting enhanced metadata and using that metadata to track minute changes in the state of the dataset. This allows for the implementation of features typically associated with data warehouses. The approach is particularly useful for datasets whose records are updated or deleted frequently, and it makes ACID-compliant transactions possible on cloud object storage.
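As a rough illustration of how metadata can track the state of a dataset, the sketch below models a table as a list of snapshots, each naming the data files that were live at that point. It is a toy model under stated assumptions, not the actual metadata layout of Iceberg, Delta Lake, or Hudi; the ToyTable class and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table: each snapshot is an immutable tuple of data file names."""

    snapshots: list = field(default_factory=lambda: [()])

    def current_files(self):
        return self.snapshots[-1]

    def commit(self, add=(), remove=()):
        """Record a new snapshot; existing data files are never modified."""
        kept = tuple(f for f in self.current_files() if f not in remove)
        self.snapshots.append(kept + tuple(add))
        return len(self.snapshots) - 1  # id of the new snapshot

    def files_as_of(self, snapshot_id):
        """Time travel: return the file list recorded by an older snapshot."""
        return self.snapshots[snapshot_id]

    def rollback(self, snapshot_id):
        """Roll back by re-committing an older file list as the newest state."""
        self.snapshots.append(self.snapshots[snapshot_id])


table = ToyTable()
s1 = table.commit(add=["part-001.parquet"])
s2 = table.commit(add=["part-002.parquet"])
# An update rewrites one file: drop the old version and add the new one.
s3 = table.commit(add=["part-001-v2.parquet"], remove=["part-001.parquet"])
```

Because every change is a new snapshot rather than an in-place edit, readers can keep using an older snapshot while a writer commits, and table.files_as_of(s2) still returns the state before the update.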
Although Starburst Galaxy works well with Hive, it works particularly well with data lakehouses, especially those running on Apache Iceberg. In fact, this architecture is so powerful that we’ve given it a new name: the Icehouse architecture.
What is required to implement a data lake?
A data lake is not a single entity. In fact, creating one requires multiple components working in tandem. Let’s look at each of these components one by one.
Storage
The most essential aspect of any data lake implementation is storage, which is typically cloud object storage. For some organizations, lakes are also implemented using on-premises object storage or HDFS file storage.
Compute
Any data lake requires compute resources to run analytic workloads on it. There are many compute engines, including Starburst Galaxy, Snowflake, Databricks, and others. Separating compute and storage allows you to scale each independently as needed, increasing efficiency.
Metastore
Lakes require metadata to help locate the data inside them and keep track of any changes. Metadata is held in a metastore. There are several types of metastores, including:
- Starburst Galaxy Metastore
- AWS Glue
- Hive metastore
Data Governance
Data lakes require strict access controls to manage who can access the data inside them. Typically, this is handled using both role-based and attribute-based access controls. For example, Starburst Galaxy handles data governance using Starburst Gravity. Although other self-managed technologies and platforms can be used, achieving reliable results with them can be difficult and involves risks. These risks are greatly reduced when using Starburst Galaxy to manage data governance.
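To make the combination of role-based and attribute-based controls concrete, here is a minimal sketch of an access check that requires both to pass. It is illustrative only and does not represent Starburst Gravity or any other product's API; the roles, table names, and the contains_pii and pii_cleared attributes are all hypothetical.

```python
# Hypothetical role grants: which tables each role may read.
ROLE_GRANTS = {
    "analyst": {"sales.orders"},
    "engineer": {"sales.orders", "raw.events"},
}

# Hypothetical table attributes used by the attribute-based rule.
TABLE_ATTRIBUTES = {
    "sales.orders": {"contains_pii": False},
    "raw.events": {"contains_pii": True},
}


def can_read(user: dict, table: str) -> bool:
    """Allow access only if both the role check and the attribute check pass."""
    # Role-based check: the user's role must be granted the table.
    role_ok = table in ROLE_GRANTS.get(user["role"], set())
    # Attribute-based check: PII tables also require a cleared user.
    needs_clearance = TABLE_ATTRIBUTES.get(table, {}).get("contains_pii", False)
    attribute_ok = (not needs_clearance) or user.get("pii_cleared", False)
    return role_ok and attribute_ok


analyst = {"name": "ana", "role": "analyst", "pii_cleared": True}
engineer = {"name": "eli", "role": "engineer", "pii_cleared": False}
cleared_engineer = {"name": "eve", "role": "engineer", "pii_cleared": True}
```

Here the analyst is blocked from raw.events by role, the uncleared engineer is blocked by attribute, and only the cleared engineer passes both checks.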
Data Management
Data inside the lake must be regularly cleaned, compacted, and managed. This maintenance is essential to ensure the efficient operation of the lake and the analytic workloads operating on it. Starburst Galaxy handles data management tasks, and other technologies can be added for certain data management use cases.
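Compaction, one of the maintenance tasks mentioned above, usually means rewriting many small files into fewer large ones. The sketch below shows one simple way to plan that grouping. It is a greedy illustration under assumed numbers, not the algorithm used by any specific engine; the 128 MB target and the file sizes are hypothetical.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group files, in order, into batches of at most target_mb.

    A single file larger than target_mb ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical sizes, in MB, of small files produced by frequent writes.
small_files = [10, 20, 100, 5, 60, 70, 3]
plan = plan_compaction(small_files)
```

Each batch in plan would be rewritten as a single output file, so the seven input files in this example become four, which reduces the number of objects a query has to open.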
3 benefits of using a Data Lake
Data warehouses represent the most traditional approach to data analytics. In comparison, lakes offer greater flexibility and lower costs.
Separation of storage and compute
Cloud object storage allows you to scale your compute and storage independently. Data lakes benefit from this separation, which allows you to match your resources to your needs in an agile, dynamic way.
Lower storage costs
Storage in data lakes that use cloud object storage is very inexpensive compared to data warehouses. This allows organizations to scale their data needs while staying within budget.
Structured and semi-structured data
Modern data comes in many forms, both structured and semi-structured. Storing data in a lake allows users to keep data without requiring costly ETL. This makes data lakes particularly suitable for large amounts of semi-structured data.
5 Issues that you may be experiencing with your Data Lake
As good as data lakes are, using them can create problems. Many of these problems are the very reason that data lakehouses were created. These issues include:
Slow query speed
Hive was originally designed to work with HDFS, but most modern data lakes use cloud object storage. As a result, Hive queries object storage using an emulated folder structure, and data lakes using Hive need to query all objects in a given folder to return a result. This is much slower than the approaches used by Iceberg or Delta Lake.
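The contrast between scanning a folder and consulting file-level metadata can be sketched as follows. This is a simplified model, not the real Hive, Iceberg, or Delta Lake read path; the object paths, the date ranges, and the function names are hypothetical.

```python
# Hypothetical object store: path -> (earliest, latest) order date in the file.
OBJECT_STORE = {
    "sales/part-0.parquet": ("2024-01-01", "2024-01-31"),
    "sales/part-1.parquet": ("2024-02-01", "2024-02-29"),
    "sales/part-2.parquet": ("2024-03-01", "2024-03-31"),
    "returns/part-0.parquet": ("2024-01-01", "2024-03-31"),
}


def folder_scan(prefix: str) -> list:
    """Folder style: every object under the prefix is a candidate to read."""
    return sorted(path for path in OBJECT_STORE if path.startswith(prefix))


def build_manifest(prefix: str) -> dict:
    """Metadata style: record each file's date range once, at write time."""
    return {p: r for p, r in OBJECT_STORE.items() if p.startswith(prefix)}


def manifest_scan(manifest: dict, wanted_date: str) -> list:
    """Keep only the files whose recorded range can contain the wanted date."""
    return sorted(
        path
        for path, (low, high) in manifest.items()
        if low <= wanted_date <= high  # ISO dates compare correctly as strings
    )


sales_manifest = build_manifest("sales/")
files_by_folder = folder_scan("sales/")
files_by_manifest = manifest_scan(sales_manifest, "2024-02-14")
```

For a query on a single date, the folder scan returns all three sales files, while the manifest lookup narrows the read to the one file whose recorded range covers that date.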
Ungoverned data
Data lakes have a more permissive data structure, allowing unstructured and semi-structured data to be held. While this offers flexibility, the data must also be governed using strict access controls to ensure that it can be located correctly and is accessed appropriately.
Version control
Traditionally, Hive does not include a version system. This means there is no possibility of rolling back to a previous state unless you copy the entire dataset multiple times. This problem is addressed in lakehouses, all of which offer some form of Git-like version control. The problem has also been addressed in later versions of Hive, including Hive ACID.
Exploding storage costs
Because Hive cannot roll back to previous states, organizations often have to resort to copying the entire dataset multiple times. This drives up data storage costs. Although lakes have some of the least expensive storage around, the extra copies can still add significantly to the cost of your data lake. Lakehouses improve on this through better metadata management, eliminating the need to copy the dataset multiple times in most scenarios.
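The storage effect can be estimated with simple arithmetic. The sketch below compares keeping full copies against keeping one base dataset plus only the files rewritten in each later version. It is a back-of-the-envelope model, and the dataset size, version count, and changed fraction are hypothetical inputs rather than measured figures.

```python
def full_copy_storage_tb(dataset_tb: float, versions: int) -> float:
    """Every retained version is a complete copy of the dataset."""
    return dataset_tb * versions


def snapshot_storage_tb(
    dataset_tb: float, versions: int, changed_fraction: float
) -> float:
    """The first version stores everything; each later version stores only
    the fraction of the data that was rewritten."""
    return dataset_tb + dataset_tb * changed_fraction * (versions - 1)


# Hypothetical inputs: a 10 TB dataset, 5 retained versions,
# and 5% of the data rewritten between versions.
copies_tb = full_copy_storage_tb(10, 5)
snapshots_tb = snapshot_storage_tb(10, 5, 0.05)
```

Under these assumptions, full copies need five times the base storage, while the snapshot approach grows only with the amount of data that actually changes.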
ACID compliance
Although some recent iterations of Hive support ACID compliance, most Hive systems in operation do not. This can quickly become a problem for organizations whose datasets contain records that are frequently updated or deleted. Data lakehouses are designed to solve this problem, providing full ACID compliance whether you use Apache Iceberg, Delta Lake, or Apache Hudi.
Want to know more about data lakes?
The video below unpacks everything you need to know, guiding your understanding using animations.