The difference between a data mesh and a data lake

Data mesh is a decentralized approach to managing and distributing data, whereas data lakes serve as centralized repositories.

In this post, we discuss how data mesh shifts the focus from a centralized data lake to a distributed data architecture, emphasizing data products.

Last updated: December 8, 2023

Centralized architecture vs. decentralization

Data lakes are centralized repositories designed to store vast amounts of data in a scalable and cost-effective manner.

Data mesh decentralizes data ownership, distributing responsibilities to individual domains or business units, fostering a more collaborative and scalable approach.

Data governance as it relates to a data lake vs. data mesh

Governance in a data lake may face challenges such as excessive security restrictions and limited data accessibility. For example, accessibility often suffers from over-provisioned access and potential restrictions on data ownership.

Data mesh integrates federated computational governance, enabling each domain to have autonomy over its data while ensuring overall compliance. Data mesh ensures that each domain or business unit owns and manages its data, promoting self-serve data infrastructure and accessibility.
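As a rough illustration of federated computational governance, the sketch below expresses a small set of global policies as plain functions and applies them automatically to data products that each domain owns. The `DataProduct` fields, the two policies, and the sample products are all hypothetical and exist only to show the idea; this is not how any particular platform implements governance.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class DataProduct:
    """A data set published by one domain (all fields are hypothetical)."""
    name: str
    domain: str
    owner: str                   # team accountable for the product
    columns: Dict[str, str]      # column name -> "public" or "pii"
    masked_columns: Set[str] = field(default_factory=set)


# A global policy inspects one product and returns a list of violations.
Policy = Callable[[DataProduct], List[str]]


def has_owner(product: DataProduct) -> List[str]:
    return [] if product.owner else [f"{product.name}: no owner assigned"]


def pii_is_masked(product: DataProduct) -> List[str]:
    unmasked = sorted(
        col for col, label in product.columns.items()
        if label == "pii" and col not in product.masked_columns
    )
    return [f"{product.name}: PII column '{col}' is not masked" for col in unmasked]


# Agreed once across the federation, then applied to every domain's products.
GLOBAL_POLICIES: List[Policy] = [has_owner, pii_is_masked]


def check(product: DataProduct) -> List[str]:
    violations: List[str] = []
    for policy in GLOBAL_POLICIES:
        violations.extend(policy(product))
    return violations


# Each domain defines and manages its own product.
orders = DataProduct(
    name="orders", domain="sales", owner="sales-data-team",
    columns={"order_id": "public", "email": "pii"},
    masked_columns={"email"},
)
patients = DataProduct(
    name="patients", domain="clinical", owner="",
    columns={"patient_id": "public", "dob": "pii"},
)

orders_violations = check(orders)
patients_violations = check(patients)
print(orders_violations)
print(patients_violations)
```

The domains stay free to model their data however they like; only the shared checks are centralized, and they run as code rather than as a manual review.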

Data lake vs. data mesh: scalability

While data lakes can scale, they require constant attention and maintenance, and they may run into scalability issues as data volumes increase. A modern data lake brings quality to your data by adding key data warehousing capabilities such as transactions, schemas and governance.

Data mesh addresses scalability by distributing data ownership, allowing each domain to independently optimize storage and compute, resulting in a more scalable solution.
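One way to picture distributed ownership is a shared catalog that only records where each domain has chosen to keep its data, leaving storage and format decisions to the domain. The sketch below is a minimal, assumed design: the `MeshCatalog` class, the `domain.product` addressing scheme, and the storage locations are hypothetical and do not correspond to a real product API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProductEntry:
    domain: str
    name: str
    location: str   # storage chosen by the owning domain (hypothetical paths)
    fmt: str        # file or table format chosen by the owning domain


class MeshCatalog:
    """Shared lookup of data products; it stores no data itself."""

    def __init__(self) -> None:
        self._entries: Dict[str, ProductEntry] = {}

    def register(self, entry: ProductEntry) -> None:
        address = f"{entry.domain}.{entry.name}"
        if address in self._entries:
            raise ValueError(f"{address} is already registered")
        self._entries[address] = entry

    def resolve(self, address: str) -> ProductEntry:
        # Raises KeyError if no domain has published this product.
        return self._entries[address]

    def by_domain(self, domain: str) -> List[str]:
        return sorted(a for a, e in self._entries.items() if e.domain == domain)


catalog = MeshCatalog()
# Each domain registers its own product and picks its own storage and format.
catalog.register(ProductEntry("sales", "orders", "s3://sales-bucket/orders/", "parquet"))
catalog.register(ProductEntry("clinical", "patients", "postgres://clinical-db/patients", "table"))

entry = catalog.resolve("sales.orders")
print(entry.location, entry.fmt)
print(catalog.by_domain("clinical"))
```

Because consumers go through the address rather than a physical path, a domain can change its storage or compute without coordinating with a central team.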

Data lake architecture

In response to the challenges of data warehouses, the data lake architecture emerged. Many were thrilled with this new option because it supported data access for data science and machine learning model training workflows, as well as parallelized access to data.

The data lake architecture is similar to a data warehouse in that the data gets extracted from the operational systems and is loaded into a central repository.

However, unlike data warehousing, a data lake holds a vast amount (terabytes and petabytes) of structured, semi-structured, and unstructured data in its native format until it’s needed. Once the data becomes available in the lake, the architecture gets extended with elaborate transformation pipelines to model the higher value data and store it in lakeshore marts. Essentially, we moved from ETL to ELT processing.
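The shift from ETL to ELT described above can be sketched in a few lines. In this toy example the "lake" is just a Python list holding raw JSON strings, and the sample order records are made up; the point is only the ordering of the steps: data is loaded unchanged first and modeled later, when it is needed.

```python
import json
from typing import Dict, List


def extract() -> List[str]:
    # Hypothetical records from an operational system, in their native format.
    return [
        '{"order_id": 1, "amount": "20.00", "country": "us"}',
        '{"order_id": 2, "amount": "5.00", "country": "DE"}',
        '{"order_id": 3, "amount": "12.50", "country": "us"}',
    ]


def load(lake: List[str], raw_records: List[str]) -> None:
    # ELT: store the records as-is, with no cleanup before loading.
    lake.extend(raw_records)


def transform(lake: List[str]) -> Dict[str, float]:
    # Transformation runs later, against the raw data already in the lake,
    # and produces a modeled, higher-value view (revenue per country).
    totals: Dict[str, float] = {}
    for line in lake:
        record = json.loads(line)
        country = record["country"].upper()
        totals[country] = totals.get(country, 0.0) + float(record["amount"])
    return totals


lake: List[str] = []
load(lake, extract())
revenue_by_country = transform(lake)
print(revenue_by_country)
```

In an ETL flow the normalization inside `transform` would run before `load`, so the lake would never see the original records. Keeping the raw form makes it possible to derive new models later without going back to the operational system.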

The data lake architecture is often described in the following way:

  • Data is extracted from operational databases
  • Data is raw and minimally formatted
  • Data is accessed through the object storage interface
  • Data lakes are designed to handle enterprise-grade analytics
  • Data lakes also answer big questions such as: “How is your business doing?” and “What investments and opportunities should you be making?”

As the visual below shows, a data lake architecture generates complex, unwieldy data pipelines, resulting in unmanaged, untrustworthy and inaccessible data sets. Also, as data lakes grow in size and usage, they become expensive to scale while meeting the performance demands of the business. Unfortunately, organizations still rely on a centralized team to perform the ELT, so when business users request a change, they have to wait for the central team to respond. As with data warehouses, this approach limits the value of data to data analysts, which ultimately restricts the business in making informed data-driven decisions.

[Figure: Data lake architecture]

Related reading: Data mesh architecture

How Starburst helps with your data lake and data mesh strategy

“Data Mesh is certainly the future for our business, and probably for many others, particularly ones which have a legacy of acquisitions, and the need for merging of different data sets to form a new larger entity. Having the ability to query data where it resides using Starburst is enormously powerful and makes a huge impact on the ability for data to provide answers.” Richard Jarvis, CTO, EMIS Group

“The implementation of Starburst on the data lake allows analysts and data scientists quick and simple access to data that exists in the organization for business value and insights. ETL processes that took many months and at high costs have become extremely fast and accessible to analysts at negligible costs.” — Shlomi Cohen, EVP, Head of Business Data and Analytics, Bank Hapoalim
