A data lake is a flexible, scalable, centralized repository that can store a wide variety of data types to generate insights that drive data-driven decision-making. This guide introduces the concept of data lakes, covers their challenges and benefits, and explains how data lakes paired with modern data analytics are the ideal solution for managing big data.
Last updated: January 11, 2024
A data lake is a single store of data that can include structured data from relational databases, semi-structured data, and unstructured data. Unlike databases or data warehouses, data lakes can store different types of data and let enterprises optimize compute and storage costs. A data lake can include raw copies of data from source systems, sensor data, social data, and more. The structure of the data is typically not defined when the data is captured. Data is typically dumped into a data lake without much thought about how it will be accessed.
Related webinar: Building a Data Lake Strategy
Cloud data lakes are cloud-based implementations of on-premises data lakes. They leverage the benefits of cloud storage providers to provide cost-effective, on-demand scalability.
A data-driven organization needs a single source of truth to build its business intelligence, advanced analytics, and data science resources. For example, reports prepared by sales departments must use the same revenue number as reports prepared by the finance department. In addition, data scientists can develop machine learning projects faster when they can draw upon a single repository for their data.
Turning a data lake into the data center of gravity also creates a single point of access that streamlines data governance practices. Data quality standards applied during ingestion ensure users can access clean and consistent data. Access controls ensure compliance with data privacy and security policies.
As with any aspect of information architecture, the benefits of a data lake do not come without challenges. Without a well-planned design and adequate resources, a data lake could become as difficult to support as the warehouses it replaces.
Data warehouse vendors promote a total storage, compute, and analytics solution. Data lakes, on the other hand, only address the storage component. As such, companies that do not plan their solution carefully can face several challenges: (1) data swamps, (2) security and privacy, (3) accessibility, and (4) performance.
Data lakes demand constant attention and maintenance. Otherwise, they risk becoming data swamps — vast pools of inaccessible, poor-quality data that require time, effort, and resources to support useful analysis.
The self-service capabilities of a data lake can over-provision access, resulting in violations of security and privacy policies — and potential regulatory violations.
At the same time, excessive compliance and governance enforcement can undermine the data lake’s versatility, preventing users from accessing data essential to business progress.
In an effort to optimize costs, the data lake may lean too heavily on slower, cheaper storage options that result in high-latency queries. The wrong combination of file format, table format, and query engine can also cause performance to lag.
Related reading: Data lake challenges and why modern data lakes are better
It’s time for a new approach.
Data lakes promised a cost-effective, scalable storage solution but lacked critical features around data reliability, governance, and performance. And legacy lakes required data to be landed in their proprietary systems before you could extract value.
Enter the modern data lake.
The modern data lake offers three main benefits: (1) decoupling storage from compute, (2) data flexibility, and (3) data accessibility.
Enterprises can choose the optimal storage and compute architectures for their data lakes. The storage architecture can scale independently of the cloud computing architecture, resulting in a more efficient, performant, and scalable solution.
Data lakes are not tied to a single, immutable schema or limited to structured data. Companies can use any type of data to inform their business analyses, and their data lake can evolve to store new types of data.
Data lakes within a federated architecture make more data accessible to users throughout the organization. Business intelligence analysts can produce more complete reports. Data scientists get access to a deeper and broader pool of data for generating novel insights.
The terms modern data lake, data lakehouse, and modern data architecture are mostly synonymous. They describe a data management architecture that combines advanced data warehouse-like capabilities for scaling data analytics with the flexibility and cost-effectiveness of a data lake. Modern data lakes go one step further than data lakehouses by providing federated access to data around the lake, so you can explore your data in real time before centralizing it.
Modern data lakes include a high-performance query engine, open table formats (e.g., Iceberg), open file formats, and a governance layer, as well as commodity object storage and elastic compute, typically on cloud infrastructure (e.g., Amazon S3 and EC2, Google Cloud Storage, Azure Data Lake Storage).
As the organization’s data center of gravity, a data lake supports multiple use cases, from everyday analysis to data management to the most advanced data science.
A data lake’s flexible schema-on-read approach allows users to explore and discover new insights from raw, unstructured, and semi-structured data. Data scientists, analysts, and business users can perform ad-hoc queries, data exploration, and interactive data visualizations.
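As a minimal sketch of schema-on-read, the Python below applies a schema only when the data is read, so two consumers can interpret the same raw records differently. The sample events, field names, and the `read_with_schema` helper are hypothetical and exist only for illustration; a real lake would read files from object storage through a query engine.

```python
import json

# Raw events as they might land in the lake: no schema was enforced on write.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "channel": "web"}',
    '{"user": "raj", "amount": 5}',
    '{"user": "li", "amount": "n/a", "device": {"os": "ios"}}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast them, tolerating gaps."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field_name, cast in schema.items():
            try:
                row[field_name] = cast(record[field_name])
            except (KeyError, TypeError, ValueError):
                row[field_name] = None  # missing or malformed values become nulls
        yield row


# Two consumers read the same raw data through different schemas.
sales_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
channel_view = list(read_with_schema(RAW_EVENTS, {"user": str, "channel": str}))
```

The raw records are never rewritten. Each view decides which fields matter and how to type them, which is what keeps exploration cheap when the data's shape is not known in advance.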
Data lakes support data transformation and integration tasks, allowing data engineers to clean, enrich, and transform data within the lake. Data engineers can build ETL (extract, transform, load) or ELT (extract, load, transform) pipelines to perform data transformations, join multiple datasets, aggregate data, and apply business rules or data validation logic.
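A small pipeline of this kind might look like the following Python sketch. It cleans raw order records, validates them against a business rule, joins them to a customer dataset, and aggregates revenue by region. All records, field names, and function names here are made up for illustration; production pipelines would typically run in SQL or a distributed engine rather than in plain Python.

```python
from collections import defaultdict

# Hypothetical raw inputs as they might sit in the lake.
raw_orders = [
    {"order_id": 1, "customer_id": "c1", "amount": " 120.50 "},
    {"order_id": 2, "customer_id": "c2", "amount": "80"},
    {"order_id": 3, "customer_id": "c1", "amount": "-5"},  # fails validation
    {"order_id": 4, "customer_id": "c9", "amount": "40"},  # unknown customer
]
customers = {"c1": {"region": "EMEA"}, "c2": {"region": "APAC"}}


def clean(order):
    """Transform: strip whitespace and cast the amount to a number."""
    return {**order, "amount": float(order["amount"].strip())}


def is_valid(order):
    """Business rule: positive amounts from known customers only."""
    return order["amount"] > 0 and order["customer_id"] in customers


def enrich(order):
    """Join: attach the customer's region to the order."""
    return {**order, "region": customers[order["customer_id"]]["region"]}


def revenue_by_region(orders):
    """Aggregate: total revenue per region."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["region"]] += order["amount"]
    return dict(totals)


curated = [enrich(order) for order in map(clean, raw_orders) if is_valid(order)]
report = revenue_by_region(curated)
```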
Data lakes provide a rich environment for data scientists to access and analyze large volumes of data for building and training machine learning models. Data scientists can perform tasks such as predictive modeling, anomaly detection, customer segmentation, recommendation systems, and natural language processing using the diverse data available in the lake.
Data lakes can serve as a scalable and cost-effective alternative to traditional data warehouses for storing and analyzing structured data. By integrating data lake technologies with business intelligence tools like Tableau, Power BI, or Looker, organizations can perform interactive reporting, dashboarding, and self-service analytics, enabling business users to access and analyze data with ease.
Related reading: Data-driven innovation
Four core elements make up a modern data lake analytics architecture: commodity storage and compute, open file and table formats, a high-performance query engine, and federated access to data around the lake.
Modern data lakes are often built on cloud platforms that provide commodity storage and compute resources. Examples include Amazon S3 (Simple Storage Service) and Elastic Compute Cloud (EC2) in AWS, and Azure Blob Storage and Azure Virtual Machines in Azure. These resources offer the scalable, cost-effective, and durable storage and compute capabilities required for storing and processing large volumes of data (that is, big data).
To maximize data accessibility and compatibility, modern data lakes support open file formats and modern table formats that provide additional capabilities beyond traditional file formats like CSV or JSON. Examples of modern table formats include Iceberg, Delta Lake, and Hudi. These formats offer features such as schema evolution, transactional consistency, ACID (Atomicity, Consistency, Isolation, Durability) properties, and time travel capabilities. They enhance data management, improve data quality, and enable efficient data processing and analytics workflows.
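To make schema evolution and time travel more concrete, here is a toy Python model of the general idea behind snapshot-based table formats: every commit publishes a new immutable snapshot, and readers can pick which snapshot to see. This is not the Iceberg, Delta Lake, or Hudi API. The `ToyTable` and `Snapshot` classes and the file names are hypothetical, and real formats track far more metadata.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    columns: tuple     # schema as of this snapshot
    data_files: tuple  # immutable files that make up the table


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def commit(self, columns, data_files):
        """Publish a new snapshot; older snapshots stay readable."""
        snapshot = Snapshot(len(self.snapshots) + 1, tuple(columns), tuple(data_files))
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or time travel to an earlier one."""
        if not self.snapshots:
            raise LookupError("table has no snapshots")
        if snapshot_id is None:
            return self.snapshots[-1]
        if not 1 <= snapshot_id <= len(self.snapshots):
            raise LookupError(f"unknown snapshot {snapshot_id}")
        return self.snapshots[snapshot_id - 1]


table = ToyTable()
v1 = table.commit(["id", "amount"], ["part-0001.parquet"])
# Schema evolution: a later commit adds a column and a new data file.
v2 = table.commit(["id", "amount", "currency"], ["part-0001.parquet", "part-0002.parquet"])

latest = table.read()
as_of_v1 = table.read(v1)  # time travel to the first snapshot
```

Because snapshots are never modified in place, a reader always sees one consistent version of the table, even while new commits arrive.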
A crucial component of a modern data lake is a high-performance and scalable query engine. It allows users to run fast and complex queries on large datasets stored in the lake. Popular query engines for data lakes include Apache Spark and Trino. These engines leverage distributed processing techniques to perform parallel and optimized query execution across a cluster of machines, enabling efficient data retrieval and analysis.
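The core pattern behind parallel query execution can be sketched in a few lines of Python: each worker aggregates its own partition of the data, and a final step merges the partial results. This is only an illustration of the pattern, not how Spark or Trino are implemented. The partitions, the `partial_sum` and `run_query` names, and the use of local threads in place of cluster nodes are all assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table split into partitions, as files in a lake might be.
PARTITIONS = [
    [("web", 10), ("store", 5)],
    [("web", 7)],
    [("store", 3), ("app", 4)],
]


def partial_sum(partition):
    """Stage 1: each worker aggregates only the rows in its own partition."""
    totals = Counter()
    for channel, amount in partition:
        totals[channel] += amount
    return totals


def run_query(partitions, workers=3):
    """Roughly: SELECT channel, SUM(amount) ... GROUP BY channel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    # Stage 2: merge the small partial results into the final answer.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return dict(final)


result = run_query(PARTITIONS)
```

Only the compact partial totals move between stages, which is why this style of execution scales to datasets far larger than any single machine could scan alone.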
In addition to the data residing in the data lake, organizations often have data distributed across various sources and systems. To provide comprehensive insights, a modern data lake should offer federated access to these external data sources. This involves integrating connectors and interfaces that allow querying and accessing data from databases, data warehouses, APIs, or other external systems. By providing a unified view of all data, regardless of its location, users can analyze and derive insights from diverse datasets within the data lake environment.
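A federated query can be imitated on a very small scale with Python's built-in `sqlite3` module, which can attach several independent databases to one connection and join across them in a single SQL statement. SQLite is only a stand-in here: a federated engine would reach real external systems through connectors. The `lake` and `crm` names, the tables, and the rows are hypothetical.

```python
import sqlite3

# One connection acts as the single point of access.
conn = sqlite3.connect(":memory:")

# Two separate in-memory databases stand in for two independent systems.
conn.execute("ATTACH DATABASE ':memory:' AS lake")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE lake.orders (customer_id TEXT, amount REAL)")
conn.execute("CREATE TABLE crm.customers (customer_id TEXT, segment TEXT)")
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?)",
    [("c1", 120.0), ("c2", 80.0), ("c1", 30.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [("c1", "enterprise"), ("c2", "startup")],
)

# One query joins data that lives in two different sources.
rows = conn.execute(
    """
    SELECT c.segment, SUM(o.amount) AS revenue
    FROM lake.orders AS o
    JOIN crm.customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()
```

The user writes one query and never copies the CRM data into the lake first, which is the essence of the unified view described above.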
By incorporating these components, organizations can create a powerful and flexible data lake architecture that enables efficient data management, analytics, and decision-making.
Related reading: Data lake architecture strategy
Starburst abstracts enterprise storage to create a virtualized, federated architecture. By providing a single point of access to any enterprise data source, Starburst activates all the data in and around your data lake.
Starburst extends the Trino query engine to let you find the right balance between performance and cost. Pushdown queries, dynamic filtering, and other features improve query performance. At the same time, Starburst’s query optimizers let you maximize your compute investment.
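As a rough illustration of the idea behind dynamic filtering, the Python sketch below evaluates a selective predicate on the small side of a join first, collects the surviving join keys at run time, and uses them to skip rows while scanning the large side. It is a conceptual sketch only and does not reflect how Trino or Starburst implement the feature. The tables, column names, and `join_with_dynamic_filter` function are hypothetical.

```python
# Hypothetical large fact table and small dimension table.
fact_sales = [{"store_id": i % 100, "amount": 1.0} for i in range(10_000)]
dim_stores = [
    {"store_id": s, "country": "PT" if s < 3 else "other"} for s in range(100)
]


def join_with_dynamic_filter(fact, dim, dim_predicate):
    """Join fact to dim, filtering the fact scan by keys found at run time."""
    # Build side: apply the selective predicate to the small table first.
    build = {d["store_id"]: d for d in dim if dim_predicate(d)}
    # Dynamic filter: the join keys that survived, known only at run time.
    allowed_keys = set(build)
    # Probe side: drop non-matching rows during the scan, not after the join.
    probed = [f for f in fact if f["store_id"] in allowed_keys]
    joined = [{**f, **build[f["store_id"]]} for f in probed]
    return joined, len(probed)


joined, rows_probed = join_with_dynamic_filter(
    fact_sales, dim_stores, lambda d: d["country"] == "PT"
)
```

In this toy dataset only 300 of the 10,000 fact rows survive the scan, so the join itself touches a small fraction of the data.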
From a single pane of glass, Starburst’s fine-grained role-based and attribute-based access controls let you create rules governing who may see what kinds of data. These granular controls let you dramatically expand data access without compromising data security or privacy.
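The difference between role-based and attribute-based rules can be sketched in Python as follows. Roles decide which tables a user may read, and attributes on both the user and the column decide which columns are visible. This is a generic illustration, not Starburst's policy model or API. The role names, the `pii` tag, the `pii_cleared` attribute, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset
    attributes: frozenset = frozenset()


# Role-based rule: which tables each role may read.
ROLE_GRANTS = {"analyst": {"orders"}, "finance": {"orders", "payroll"}}

# Attribute-based rule: columns carry tags that are checked against the user.
COLUMN_TAGS = {
    "orders": {"order_id": set(), "amount": set(), "email": {"pii"}},
    "payroll": {"employee_id": {"pii"}, "salary": set()},
}


def can_read_table(user, table):
    """True if any of the user's roles has been granted the table."""
    return any(table in ROLE_GRANTS.get(role, set()) for role in user.roles)


def visible_columns(user, table):
    """Columns the user may see: PII columns need the 'pii_cleared' attribute."""
    if not can_read_table(user, table):
        return []
    return [
        column
        for column, tags in COLUMN_TAGS.get(table, {}).items()
        if "pii" not in tags or "pii_cleared" in user.attributes
    ]


analyst = User("ana", frozenset({"analyst"}))
auditor = User("raj", frozenset({"finance"}), frozenset({"pii_cleared"}))
```

With these rules the analyst can query `orders` but never sees the `email` column, while the cleared finance user sees every column in both tables.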
To see a more concrete example of how modern data lakes unlock business value, consider the case of a large food and beverage company. This global conglomerate generated large volumes of data across its portfolio of billion-dollar brands. Data warehouses isolated much of this data within data silos, limiting the company’s ability to discover insights.
With the creation of a modern data lake analytics solution, the company built a more efficient, future-proof data architecture. The ability to independently optimize storage and compute resulted in a more performant and cost-effective architecture. Strong governance capabilities let the company democratize data access, giving analysts access to data stored across its global storage infrastructure and a more holistic view across its consumer brand portfolio.
AI Data Analytics: How an open data lakehouse architecture supports AI & ML
While Hadoop itself is not a data lake, it is the foundational technology upon which the concept of data lakes was built. Hadoop provided the necessary infrastructure, principles, and capabilities that have shaped the way data lakes are designed and used in the industry. Moreover, Hadoop played a pivotal role in enabling the storage and processing of extremely large datasets, which traditional platforms and tools were unable to handle effectively.
Related reading: Cloud object storage vs HDFS
Object storage is a storage technology, while a data lake is an architectural approach or system for data storage and analysis. In practice, a data lake may use object storage as its underlying storage technology because of its scalability and cost-effectiveness. When a data lake uses object storage, it stores its data as a collection of discrete objects, which affects how data is managed and accessed. However, the concept of a data lake is broader and includes aspects of data governance, processing, and analytics, which are not inherently part of object storage itself.
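To show what storing data as objects means in practice, the toy Python class below models an object store as a flat mapping from keys to whole objects, with prefix listing standing in for directories. It is an in-memory illustration only and not the API of Amazon S3 or any other service. The `ToyObjectStore` class and the keys are hypothetical.

```python
class ToyObjectStore:
    """A flat key-to-bytes namespace, loosely modeled on cloud object storage."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        """Write a whole object; an existing key is replaced, never edited in place."""
        self._objects[key] = bytes(data)

    def get(self, key):
        """Read a whole object by its exact key."""
        return self._objects[key]

    def list(self, prefix=""):
        """List keys by prefix; 'folders' are only a naming convention."""
        return sorted(key for key in self._objects if key.startswith(prefix))


store = ToyObjectStore()
store.put("sales/year=2023/part-0.parquet", b"...")
store.put("sales/year=2024/part-0.parquet", b"...")
store.put("sales/year=2024/part-1.parquet", b"...")

# A query engine can prune work by listing only the prefix it needs.
files_2024 = store.list("sales/year=2024/")
```

Everything else a data lake offers, such as table metadata, governance, and query processing, has to be layered on top of this simple put, get, and list interface.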
Starburst Galaxy supports access to the following object storage systems:
Snowflake is not a data lake. At its core, Snowflake is a cloud data warehouse that runs on public cloud infrastructure such as AWS. As part of its data ingestion process, it recommends using a cloud data lake (for example, Amazon S3) as the staging environment before moving the data into Snowflake.
Databricks is not a data lake. At its core, Databricks is an analytics and AI platform that enables data teams to transform and analyze data on cloud data lakes using the open-source Delta Lake table format.
Azure Data Lake Storage (ADLS) is a hyperscale cloud data lake offered by Microsoft as part of its Azure platform. Databricks, on the other hand, is an analytics and AI platform that can run on top of Azure.
Data mesh is a data management approach, not a technology like a data lake. Data lakes can be a component of the data management strategy that helps activate a data mesh approach, but they cannot replace data mesh.
Trino (formerly known as PrestoSQL) was built by four Facebook engineers to address performance, scalability, and extensibility needs for analytics at Facebook. Trino is a distributed SQL query engine designed for efficient, low-latency analytics at scale. It emerged from Facebook as a faster and more powerful way to query a very large Hadoop data warehouse than Hive and other tools could provide. Modern data lakes often use cloud object storage instead of HDFS, including Amazon Simple Storage Service (S3), Google Cloud Storage, and Microsoft’s Azure Blob Storage. Through connectors to these cloud object stores, Trino can query these systems and enable high-performance SQL analytics on your data lake, no matter where the data is located or how it is stored.
Trino has become a popular choice for querying the data lake due to its high performance at scale. Trino’s concurrency is limited only by the size of your cluster, which can be scaled up and down as required. Trino also has connectors to the most popular data sources, allowing for data federation across multiple data sources and giving the user a holistic view of their entire data ecosystem. These connectors allow you to query the data where it resides, shortening the data pipeline for your organization.