Last Updated: May 22, 2023 | Published: August 1, 2022
Author: Evan Smith, Technical Curriculum Developer, Starburst
Technical reviewer: Tom Nats, Modern Data Lakes For Dummies author
Data lakes enable data managers to combine and transform data from multiple sources to get insights that were once impossible. With its vast storage capacity, a data lake can accommodate a company’s data needs without breaking a sweat and is typically less expensive than a traditional data warehouse appliance. Plus, its intelligent search and retrieval features make it easy to find the information you need when you need it.
Let’s take a closer look at what data lakes are and how organizations can move from data silos and raw data to data lake analytics.
A data lake is a centralized repository that holds large amounts of raw data in its native format, regardless of data type or data structure, including structured data, semi-structured data, and unstructured data. All of these data types can be imported from multiple sources, such as social media, IoT devices, apps, and clickstream data. This “big data” approach to data storage enables businesses to store large volumes (think terabytes and petabytes) of enterprise data more quickly and cost-effectively than traditional methods.
Unlike a dashboard, data mart, or data warehouse, data lakes are increasingly popular for storing and processing big data because of their low cost and their ability to save data in its native format. They also allow analysts to promptly extract, load, and transform (ELT) the data into the desired format for rapid analysis.
However, data lakes are more complex to manage than traditional systems, and can quickly become data swamps. By fully understanding the business value and specific needs of data lakes, companies can fully utilize and optimize data for advanced analytics.
Like any technology, data lakes must be used correctly. The sections below outline some pitfalls to avoid when constructing a data lake.
Although data lakes are very versatile, without significant planning they can become difficult to manage and govern effectively. Without the right tools and processes in place, data lakes can devolve into data swamps, making it hard to find and utilize the data inside them. Starburst can help immensely in this regard by ensuring that data is both queryable and navigable regardless of how large the data lake becomes.
Most data lakes are designed to give users self-service access without involving a central IT department, but managing access to the data in a data lake is a concern. While self-service access improves efficiency and enables more people to work with data, it can also expose sensitive data to major security risks. For this reason, strict security measures are essential to prevent unauthorized access and ensure that users are trained appropriately in the safest way to use data lakes.
Data lakes are highly performant under many conditions, but their efficiency varies considerably depending on several factors: the storage size of the data lake, the amount of compute power applied to it, and the underlying data structures involved. Advanced query engine technologies, such as Starburst, can be used to improve performance.
One of the key challenges facing the use of data lakes is the need to ensure that the data inside them adheres to minimum compliance specifications. Data lakes can be subject to regulatory requirements, making it crucial to have a data governance framework in place. In certain circumstances, this increased need for compliance can limit some of the versatility and adaptability benefits that users seek when establishing a data lake. For this reason, a careful balance between compliance and versatility is often required.
ACID compliance is another potential area of concern for data lakes. ACID stands for Atomicity, Consistency, Isolation, and Durability, and it is a set of design properties that guarantee the reliable processing of transactions. ACID is not usually a critical need in analytical systems. As most data lakes are used for analysis, traditional implementations have not focused on ACID compliance. Nonetheless, it is desirable in some circumstances and remains a drawback of traditional deployments.
In recent years, data lakes have adopted modern table formats which better support ACID compliance. Storage layer technologies such as Hudi, Delta Lake, and Iceberg have been developed to enhance ACID compliance and provide other enhancements to data lakes, bringing their performance closer to that of a data warehouse.
To help you understand the features that make data lakes attractive to businesses, let’s take a look at what data lakes can do. Data lakes are designed to both store and analyze all types of data, often using machine learning or artificial intelligence algorithms. This type of repository has various benefits over traditional data storage techniques.
Data lakes offer many benefits that help organizations utilize their data assets better and improve their decision-making process. The way in which an organization uses a data lake also depends on the types of business insights it hopes to gain. Bottom line: as organizations increasingly seek to gain insights from all their data, data lakes will become essential to their overall big data strategy.
In the past, compute and storage resources were combined on the same machines. This was due to the prevalence of on-premises systems and the nature of the Hadoop Distributed File System (HDFS). In contrast, data lakes enable the separation of compute and storage, so that each resource can be scaled independently as needed. This is often one of the main ways that data lakes reduce cost.
Data lakes are designed to be queryable, meaning that they can be easily analyzed using a variety of tools such as Hadoop, Trino, and Spark. This makes them ideal for extracting insights from large data sets. In addition, data lakes can be used for a variety of purposes, such as predictive analytics, machine learning, and data visualization.
Data lakes are a cost-effective way to store large amounts of data from various sources. Data lakes typically accept data of any structure, which reduces cost and increases flexibility because data does not need to be transformed to fit a specific schema before it is stored.
Data lakes are typically both large and inexpensive. Because of this, they are well suited to the rapid increase in data volumes seen in recent years. In fact, they are often the most affordable data storage option, typically costing far less than data warehouses.
Data lakes are designed to store data from multiple sources and multiple data structures in the same repository. This includes structured, semi-structured, and unstructured data. Such a diverse approach would not be possible in a data warehouse.
To navigate different data structures, data lakes typically deploy intelligent search and retrieval systems like Starburst. This helps ensure that you can find the information you need, regardless of the original structure of the data involved.
Data warehouses require all data entering the system to be structured according to a predefined schema. All new data is schematized using an Extract, Transform, and Load (ETL) process before it arrives in the system.
In contrast, data lakes do not need to apply a single structure to the data inside them until that data is read. Although data pipelines are still often needed, there is no overarching requirement for a predefined schema at the outset.
In many data lakes, this allows data to be stored in a raw state until specific datasets are needed. Using this approach, schemas are prepared ad hoc in response to changing conditions and business needs. Transformation still occurs, but it is applied after the data has entered the lake, following the ELT pattern.
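As a toy illustration of that ordering, here is a minimal Python sketch of ELT: raw records are loaded untouched, and the transform step is applied only later, at read time. The record contents and field names are hypothetical.

```python
import json

# Hypothetical raw events, landed in the lake exactly as produced.
raw_records = [
    '{"user": "a", "amount": "12.50", "ts": "2023-05-01"}',
    '{"user": "b", "amount": "3.99"}',  # missing field is fine at load time
]

def load_raw(records):
    """The 'L' step: store records as-is, with no schema enforced."""
    return [json.loads(r) for r in records]

def transform_for_analysis(rows):
    """The 'T' step, applied later: coerce types and fill defaults on read."""
    return [
        {"user": r["user"],
         "amount": float(r.get("amount", 0)),
         "ts": r.get("ts", "unknown")}
        for r in rows
    ]

lake = load_raw(raw_records)            # load first...
report = transform_for_analysis(lake)   # ...transform when a dataset is needed
```

In a warehouse-style ETL pipeline, `transform_for_analysis` would have to run before the write, and the second record would be rejected or repaired up front.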
Data lakes can handle large volumes of data without compromising performance. This is particularly important as organizations build large, expanding data repositories and need a reliable system capable of matching the growing size of their data. A data lake creates more options for expansion and helps ensure that a solution put in place today is still suitable in the future.
Data lakes are designed to be platform independent. This means that all data types can easily be analyzed together in the same data lake. This critical distinction makes data lakes ideal for business analysts as they extract insights from large, varied data sets.
New sources can be added and new data types incorporated at a later date, allowing organizations to harness all of their data for real-world insights. This versatility has traditionally driven the data lake’s adoption in cases where multiple sources and data structures are either required or an unavoidable by-product of the systems in question. This contrasts with data warehouse solutions which are much less flexible.
Data lakes store and process data differently from other database technologies. While data warehouses require all of the data to be structured according to a predefined schema before it is added to the system, data lakes store data in a raw format.
Let’s explore the ways in which data lakes store data, and the impact that this approach has on their operation and use. Special attention will be paid to the variety of data types that can be added to a data lake, including examples of the types of systems that produce this data. You will also explore the mechanisms by which schemas are applied to this data and how this approach operates differently in data lakes.
Structured data makes up a significant amount of the data stored in any database, including a data lake. Although data lakes are known for accepting raw data of any type, they can and do store structured data alongside other formats.
Semi-structured data has some organizational structure, but does not fit neatly into a relational database or other traditional data storage systems. Examples include JSON, XML, and CSV files.
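To make the distinction concrete, here is a small Python sketch (with made-up data) showing the same logical record in two semi-structured encodings: a JSON document with a nested list, which has no single flat-table shape, and a flattened CSV equivalent.

```python
import csv
import io
import json

# One logical record, two semi-structured encodings (illustrative data).
json_doc = '{"order": 7, "tags": ["new", "gift"]}'  # nested list, no flat shape
csv_doc = "order,tag\n7,new\n7,gift\n"              # the same data, flattened

parsed = json.loads(json_doc)                  # keeps the nested structure
rows = list(csv.DictReader(io.StringIO(csv_doc)))  # one row per tag
```

A relational table would force a choice of flattening up front; a data lake can store either file as-is and defer that decision.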
Data lakes are helpful here because they can store semi-structured data alongside structured data in the same repository.
Unlike structured and semi-structured data, unstructured data does not conform to any preset schema or format. As such, unstructured data is poorly suited to storage in a relational database and is a natural fit for a data lake.
Schemas define the structure of a dataset.
There are two main methods of organizing schemas that impact the storage of data in a data lake: schema-on-write and schema-on-read.
Schema-on-write is a data management construct where data schemas are created before the data is written to the database. When data later enters the system, it must be compliant with this schema from the outset. Data that does not fit the schema will be disallowed by the system.
This construct is closely tied to relational database management systems, and is useful in cases where the data in question already fits a specific, predictable format known in advance. Although used in some data lakes, this approach is more often associated with data warehouses.
Schema-on-read is a data management construct where a schema is validated when the data is read. Unlike schema-on-write, a schema validation is not completed when data is written to the data lake. Instead, the data is validated only when it is read. Schema-on-read is often associated with data lakes.
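The two constructs differ only in when validation happens. The following Python sketch (with a hypothetical two-field schema) contrasts them: the schema-on-write store rejects bad records at write time, while the schema-on-read store accepts everything and filters at read time.

```python
# Hypothetical schema: every record should have an int id and a str name.
SCHEMA = {"id": int, "name": str}

def write_schema_on_write(store, record):
    # Schema-on-write: validate before storing; reject non-conforming data.
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"record violates schema on field {field!r}")
    store.append(record)

def read_schema_on_read(store):
    # Schema-on-read: anything was accepted at write time; validate here.
    return [r for r in store
            if all(isinstance(r.get(f), t) for f, t in SCHEMA.items())]

warehouse, lake = [], []
write_schema_on_write(warehouse, {"id": 1, "name": "ok"})

lake.append({"id": 1, "name": "ok"})
lake.append({"id": "oops", "name": 2})  # the lake happily stores this raw record
good = read_schema_on_read(lake)        # the bad record is filtered on read
```

Note the trade-off: the schema-on-read store never loses data, but each reader must handle records that do not conform.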
Data lakes do not traditionally make use of built-in indexing capabilities in the same way as relational databases. Some vendors, including Starburst, are solving that problem with software enhancements.
By their nature, data lakes store large volumes of data. Without additional structure, a traditional SQL query may require the system to read every row in every table, leading to long query run times and hampering overall system efficiency.
Partitioning and bucketing address this by dividing large datasets into smaller groupings. You can think of each of these as sub-directories in a larger directory system. Their usage helps enhance the speed and efficiency of the system.
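A brief sketch of that sub-directory idea, using hypothetical table and column names: partition keys become path segments (the Hive-style `key=value` convention), and a hash of a high-cardinality column assigns each row to a bucket, so a query filtered on `year` and `month` can skip every other directory.

```python
from hashlib import md5

def storage_path(table, record, partition_keys, n_buckets=4):
    """Build an illustrative Hive-style path: table/key=value/.../bucket_N."""
    parts = [f"{k}={record[k]}" for k in partition_keys]
    # Bucketing: hash a high-cardinality column into a fixed number of files.
    bucket = int(md5(str(record["user_id"]).encode()).hexdigest(), 16) % n_buckets
    return "/".join([table, *parts, f"bucket_{bucket}"])

row = {"user_id": 42, "year": 2023, "month": 5}
path = storage_path("sales", row, ["year", "month"])
# path looks like "sales/year=2023/month=5/bucket_N" for some N in 0..3
```

A query such as `WHERE year = 2023 AND month = 5` then only needs to scan the files under that one partition directory.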
The data inside a data lake can be stored in a number of different formats, and making sense of it requires keeping track of those differences. The data that describes this organization is called metadata, and storing metadata is an integral aspect of any data lake.
Here, we’ll learn how metadata is stored, and why it is critical in data lakes.
Metadata describes how all of the data files in the data lake are organized. For this reason, even though the data in the lake itself may be unstructured, the metadata about that data is always structured and held in a separate repository.
In order to query data in a data lake, we need to understand how the data is structured. Since structure wasn’t imposed on the data when it was brought into the data lake, we must supply this metadata via a metastore before the data can be effectively queried.
A metastore is a special, dedicated repository used to store metadata relating to the data held in the data lake. It operates as a separate datastore that keeps track of the metadata for the system and fields requests about the structure of a given dataset. Typical metadata includes information relating to the storage system such as the file format, directory structure, and location of data within its files.
Two popular metastores are the Hive metastore and AWS Glue. These services act as an intermediary between the user’s request and the datasets held in the data lake. When a request is made, information about the structure of the dataset in question is retrieved from the metastore. This information is then used to retrieve the source data from the data lake.
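As a toy illustration of that lookup flow, here is an in-memory “metastore” in Python. The table name, location, and fields are all hypothetical and do not reflect the Hive metastore’s actual schema; the point is simply that structured metadata answers a query engine’s question about unstructured files.

```python
# A toy in-memory "metastore": structured metadata about files in the lake.
# Names and fields are illustrative only.
metastore = {
    "web_logs": {
        "format": "parquet",
        "location": "s3://example-lake/web_logs/",
        "columns": [("ts", "timestamp"), ("url", "varchar")],
    }
}

def describe(table):
    """Answer a query engine's request for a table's structure."""
    meta = metastore[table]
    return meta["location"], meta["format"], [name for name, _ in meta["columns"]]

# A query engine would call this first, then fetch the files it names.
location, fmt, cols = describe("web_logs")
```

Only after this lookup does the engine know which files to read, how to decode them, and what columns to expect.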
Early data lakes relied on Hadoop, which was able to process large amounts of raw data using distributed systems. At an architectural level, these early systems used the Hadoop Distributed File System (HDFS) to store their data in large, on-premises installations.
Over time, the rise of cloud computing disrupted Hadoop’s dominance, replacing HDFS with object storage. Cloud object storage allowed compute and storage to be separated at a scale that was previously impossible, and at a significantly lower cost.
This began a shift in data lakes from exclusive use of HDFS towards the predominant use of distributed object storage, which sparked further developments in adjoining technologies. This is particularly true of query engines: because cloud object storage provides no compute of its own, a separate query engine is required to run queries against it. Starburst is designed to use both object storage and HDFS as needed.
Currently, the three largest providers of cloud data lake storage services include: Amazon S3 (AWS), Microsoft Azure Blob Storage/Azure Data Lake, and Google Cloud Storage.
Related reading: The difference between cloud object storage and HDFS
The Hadoop framework brought the ability to distribute large computing jobs using parallel processing, and a technological revolution was under way.
But there was a problem. Hadoop was complex, especially for analytical tasks. Creating MapReduce jobs required an intricate knowledge of Java that many users lacked. This gap would give birth to a new technology, Hive, which enabled users to interact with Hadoop by controlling MapReduce using SQL syntax. This was a game changing step as it opened up data lake analytics to a new audience and helped drive its adoption.
Many data lakes are built on Hadoop, a distributed framework whose file system (HDFS) can store vast amounts of data. Hadoop is designed to be scalable and fault-tolerant, meaning it can keep working even if some of the system’s servers fail, making it a proven platform for data lakes.
When you build a data lake on Hadoop, you can use any number of technologies to access the data. You can use SQL-based tools like open source Trino, Hive, or Impala to run queries against the data. Or you can use Hadoop’s MapReduce framework to process and analyze the data.
Hive was built on top of HDFS to provide SQL-like query functionality. This approach had many limitations owing to the compilation process needed to turn HiveQL into MapReduce. Starburst presents an alternative approach to HiveQL.
The Starburst query engine conforms to the ANSI SQL standard. It allows for a platform-independent, single point of access to data from any data source. Data can be housed in data lakes, data warehouses, or databases, and queries can be federated across multiple sources, providing a best-of-all-worlds approach.
For example, transactional data is often best served in a database, as those systems are designed to act as systems of record. At the same time, structured analytical data may still be processed in a data warehouse. Data lakes excel at semi-structured and unstructured data analytics. With Starburst, all of these systems can work together in a single query engine.
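To make the idea of federation concrete, here is a toy Python sketch of what a federated join does conceptually: combining rows from two hypothetical sources, one standing in for a transactional database and one for a data lake. The source names, keys, and values are invented for illustration; a real engine does this at scale across live connectors.

```python
# Two hypothetical sources: a transactional database and a data lake.
database_orders = [{"order_id": 1, "customer": "acme"}]
lake_clicks = [{"order_id": 1, "page": "/checkout"},
               {"order_id": 2, "page": "/home"}]

def federated_join(left, right, key):
    """Join rows across two sources, as a single query engine would."""
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right if row[key] in index]

joined = federated_join(database_orders, lake_clicks, "order_id")
# joined pairs each matching click with its order's customer
```

The equivalent in a federated SQL engine would be a single `JOIN` across two catalogs, with no data copied between systems beforehand.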
Starburst also offers superior performance when compared to other technologies. This is achieved by deploying a Massively Parallel Processing (MPP) architecture that is able to leverage the combined processing power of large clusters to achieve superior processing speeds.
Finally, by facilitating the storage options most suitable to a given use case, costs can be reduced when compared to other techniques. Highly-structured data can be retained in data warehouses, while unstructured data can be held in a less expensive data lake without sacrificing access. At the same time, the ability to scale compute resources to meet a number of different needs helps save costs in another way.
Don’t just take our word for it. Here’s what a Starburst customer, Comcast, had to say: “When end users are going into on-prem or cloud environments, they will be presented with all the data sets they have access to, irrespective of where the data is located. This offered huge value to our end users.”
Our SQL query engine can securely access data stored anywhere, across cloud and hybrid environments.
Apache Iceberg is a table format, originally created by Netflix, that provides database-style functionality on top of object stores such as Amazon S3.