Despite the investments and effort poured into next-generation data storage systems, data warehouses and data lakes have failed to provide data engineers, data analysts, and data leaders with the trustworthy, agile business insights needed to make intelligent business decisions. The answer is Data Mesh: a decentralized, distributed approach to enterprise data management.
Zhamak Dehghani, the founder of Data Mesh, defines it as “a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments – within or across organizations.” She’s authoring an O’Reilly book, Data Mesh: Delivering Data-Driven Value at Scale, and Starburst, the ‘Analytics Engine for Data Mesh,’ happens to be the sole sponsor. In addition to providing a complimentary copy of the book, we’re also sharing chapter summaries so we can read along and educate our readers about this (r)evolutionary paradigm. Enjoy Chapter Three: Before the Inflection Point!
“The definition of insanity is doing the same thing over and over again, but expecting different results.” (popularly attributed to Albert Einstein)
That remark may just be the right capsule description of data architecture for the last few decades. To appreciate why we need a new approach to data management, let’s go back in time and trace the evolution of analytical data architecture. Why? Because while the number of analytical data technologies has grown, the architecture itself has seen very limited advancement, and we’ve created systems that do not scale or live up to the promise, expectation, and investment of becoming a valuable data-driven organization.
Data Warehouse Architecture
The data warehouse was traditionally designed to enable and support an organization with business intelligence through analytics, reports and dashboards. Often associated with an organization’s “single source of truth,” a data warehouse’s analytical capabilities fortified an organization to make valuable operational, tactical, and strategic business decisions.
The characteristics of a data warehouse architecture have largely remained the same and can be described in the following way:
- Data is extracted from operational databases
- Data is transformed into a universal schema
- Data is loaded into the warehouse tables
- Data is accessed through SQL-like querying operations
- Data primarily serves data analysts to produce reports and visualizations
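The extract-transform-load flow described above can be sketched in a few lines. This is a minimal illustration, not anything from the book: the table names, schemas, and the use of in-memory SQLite as stand-ins for an operational database and a warehouse are all assumptions for demonstration purposes.

```python
import sqlite3

# Stand-ins for an operational database and a warehouse (illustrative only)
operational = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Hypothetical operational table with amounts stored in cents
operational.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
operational.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 1000)])

# Hypothetical warehouse table conforming to a "universal schema" (dollars)
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount_usd REAL)")

# Extract: pull rows out of the operational store
rows = operational.execute("SELECT id, amount_cents FROM orders").fetchall()

# Transform: conform the data to the warehouse schema
transformed = [(order_id, cents / 100.0) for order_id, cents in rows]

# Load: write into the warehouse table
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", transformed)

# Analysts then access the data through SQL queries to build reports
total = warehouse.execute("SELECT SUM(amount_usd) FROM fact_orders").fetchone()[0]
print(total)  # 22.5
```

Note that every step in this pipeline is owned by the central team; the sections below explain why that ownership model becomes the bottleneck.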
Essentially, organizations attempted to build the “enterprise data warehouse.” However, coming to a consensus on the definition of terms across a wide portfolio of use cases, along with relying on a centralized team responsible for the creation, management, and retirement of thousands of ETL jobs, tables, and reports meant that over time, organizations moved from a single enterprise data warehouse target to many data warehouses, each focused on supporting a specific part of the business.
Unfortunately, this resulted in a scenario in which there were now multiple definitions of data entities, yet still a centralized team responsible for the creation, management, and retirement of thousands of ETL jobs, tables, and reports across a number of sometimes differing data warehouse technologies. One of the biggest issues with this approach is that a business function would request a change to a table, job, or report and then wait weeks or even months for the central team to respond. Inevitably, this resulted in missed revenue opportunities, increased costs, or poorer risk control for the business.
Data Lake Architecture
In response to the challenges of data warehouses, the data lake architecture emerged. Many were thrilled with this new option because of its “access to data based on data science, machine learning model training workflows, and support of parallelized access to data.”
The data lake architecture is similar to a data warehouse in that the “data gets extracted from the operational systems and is loaded into a central repository.”
However, unlike a data warehouse, a data lake holds a vast amount (terabytes and petabytes) of structured, semi-structured, and unstructured data in its native format until it’s needed. “Once the data becomes available in the lake, the architecture gets extended with elaborate transformation pipelines to model the higher value data and store it in lakeshore marts.” Essentially, we moved from ETL to ELT processing.
The data lake architecture is often described in the following way:
- Data is extracted from operational databases
- Data is raw and minimally formatted
- Data is accessed through the object storage interface
- Data lakes are designed to handle enterprise-grade analytics
- Data lakes also answer big questions such as: “How is your business doing?” and “What investments and opportunities should you be making?”
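The shift from ETL to ELT that the list above describes can be sketched as follows: raw data is loaded into the lake in its native format first, and transformation into a higher-value “lakeshore mart” happens later, downstream. This is a toy illustration under stated assumptions: the lake is modeled as a plain dictionary standing in for an object store, and all object keys, record fields, and the JSON-lines format are hypothetical.

```python
import json

# The "lake": object key -> raw bytes, standing in for an object storage
# interface (all names here are illustrative assumptions)
lake = {}

# Extract + Load: raw, minimally formatted events land in the lake as-is
raw_events = b"\n".join([
    json.dumps({"user": "a", "clicks": 3}).encode(),
    json.dumps({"user": "b", "clicks": 5}).encode(),
])
lake["events/2022-01-01.jsonl"] = raw_events

# Transform (later, on demand): a downstream pipeline models the
# higher-value data and stores it in a "lakeshore mart" object
records = [json.loads(line) for line in lake["events/2022-01-01.jsonl"].splitlines()]
mart = {"total_clicks": sum(r["clicks"] for r in records)}
lake["marts/daily_clicks.json"] = json.dumps(mart).encode()

print(json.loads(lake["marts/daily_clicks.json"]))  # {'total_clicks': 8}
```

Deferring the transform is what lets a lake ingest anything in its native format, but it also means someone must still build and maintain that transform pipeline before the raw data becomes useful, which is exactly the centralization problem discussed next.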
You can see from the visual below that a data lake architecture generates complex, unwieldy data pipelines, resulting in unmanaged, untrustworthy, and inaccessible data sets. As data lakes grow in size and usage, they also become expensive to scale and struggle to meet the performance demands of the business. Unfortunately, we still relied on a centralized team to perform the ELT, so once again, when business users request a change, they have to wait for the central team to respond. As with data warehouses, this approach limits the value data analysts can deliver, which ultimately restricts the business from making informed, data-driven decisions.
Multi-Modal Cloud Architecture
The current generation of data architecture resembles previous generations but adds redeeming qualities, such as real-time data analytics and a reduced cost of managing big data infrastructure. It is often described in the following way:
- Streaming for real-time data availability
- Attempting to unify batch and stream processing for data transformation
- Embracing cloud-based managed services with modern cloud-native implementations
- Converging data warehouses and data lakes to extend the existing framework
But alas, while cloud technologies resolve many infrastructure issues, this type of architecture retains many of the limitations of previous generations: data still needs to be moved and transformed before it’s useful, and the organizational issue of relying on a centralized team to perform, manage, and maintain much of the data processing life cycle remains unresolved.
Monolithic Architecture Does Not Work In the Real World
Here’s where we’re at: organizations have lots of data and believe that, when combined, the data would yield tremendous business insights. A common problem at many organizations starts like this: a VP of Sales needs to make a decision. However, in order to make that decision, she needs better data. So she asks an analyst, but the analyst only has part of the data. The rest of the data may belong to someone else (e.g., a data scientist who may also be suffering from slow performance). Then they go to the data engineer, and the engineer responds that by the time the data gets extracted, transformed, and loaded, it’ll take six months!
Why does it take so long? An ETL process needs to be created, tested, and deployed by an already overworked team that has to support many different technologies and data requests. And once the ETL process is finally in place, any business insight that might have been useful six months ago may no longer be relevant at all. All of this brings us back to a desire for a new approach.
Interestingly, this aligns very strongly with what our customers have been telling us.
Unpacking Data Mesh
I hope all this background pays off, because next time we’ll enter chapter four, Principle of Domain Ownership, where we’ll unpack a core principle of Data Mesh.
Read along with us!
Get your complimentary access to pre-release chapters from the O’Reilly book, Data Mesh: Delivering Data-Driven Value at Scale, authored by Zhamak Dehghani, now.