Despite the investments and effort poured into next-generation data storage systems, monolithic, centralized data warehouses and data lakes have failed to provide the line of business, data scientists and data analysts the rapid and trustworthy business insights needed to make intelligent business decisions. Many feel that the answer might lie in a concept called “Data Mesh”—a decentralized, distributed approach to enterprise data management.
Zhamak Dehghani, the founder of Data Mesh, defines the concept as “a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments—within or across organizations.” She’s authoring an O’Reilly book, Data Mesh: Delivering Data-Driven Value at Scale, and Starburst, the ‘Analytics Engine for Data Mesh,’ happens to be the sole sponsor. In addition to providing a complimentary copy of the book, we’re also sharing chapter summaries so we can read along and educate our readers about this revolutionary paradigm. Enjoy Chapter Four: Principle of Domain Ownership!
Over the past few weeks, we may have stirred up a few Starburst blog readers by recognizing that the single-source-of-truth model, which focuses on centralizing data in one place, isn’t helping data leaders make fast and accurate data-driven decisions. As a result, businesses aren’t realizing the full value of their big data analytics investments. If you’re a data engineer, data analyst or data leader, you’re almost certainly aware of these challenges and might be looking for a practical methodology to lead your organization down a new path. Today is your lucky day, because we’re getting into the thick of Data Mesh, which aims to tackle these challenges head-on, by completely rethinking how we make useful, accurate data available to the entire organization.
As you saw in our Chapter Two Book Bulletin, Zhamak’s vision for Data Mesh seeks to reorient how we break down the problem of frictionless access to accurate analytical data at scale. The first step she takes is to accept that business complexities are here to stay. By doing so, she acknowledges that the people best positioned to manage and respond to that business complexity are the people who live in that business and therefore understand it. Central IT and data engineering departments, which have traditionally been tasked with sourcing, transforming, and presenting data, lack that understanding of the business, and data quality and responsiveness suffer as a result.
Decomposing the Business into Domains
Zhamak brings in an approach from software engineering called Domain-Driven Design (DDD) to help organizations produce the high-quality analytical data that everyone needs. DDD played a big part in the microservices revolution that decomposed monolithic application stacks into adaptable, flexible, and resilient microservices built around the needs of the business. Zhamak extends what DDD accomplished in the operational plane to the analytical plane. She invites us to decompose the business into individual domains. These are the domains that have traditionally owned responsibility for the operational systems that power the business. They may very well be the same ones identified for the launch of microservices. It is these domains that will now also be given direct responsibility over the data in the analytical plane.
Autonomy and Responsibilities in Domain Ownership
There are many reasons why each part of the business must be aligned with the data, but it largely comes down to a domain-level understanding of the business and the data. The people who truly understand that part of the business are best positioned to manage the associated data, and to ensure it is accurate. This is the principle of Domain-Oriented Ownership. Zhamak entrusts these domain teams to manage not just the associated operational data, but the analytical data as well. She also provides specific guidance: enlist these teams to provide useful, usable, high-quality data to the rest of the organization (i.e. other domains that consume data, and data consumers in general).
Not only do the domain teams understand the business better, they can also design the data types and file formats to conform to what the business needs, and can respond and pivot faster. As large-scale organizations conduct planning exercises around new products or features, the domains can also plan how to represent data to take account of those changes. Even though we have assigned additional responsibilities to the domains, we have also empowered them with autonomy to represent their data in the way they see fit.
One key part of domain ownership is that domain teams are responsible for producing high-quality data products. This means that the pipelines that create high-quality analytical data, including the extraction from operational systems and the cleaning and conforming, must run within the domain. It doesn’t mean data pipelines are going away, but they are shrinking. There will be a much smaller set of pipelines running within the domain data products. The good news is that the self-service data platform makes it easier for non-technical data staff to support the creation and management of necessary pipelines for data products.
Zhamak also counsels against trying to create overly complex data models. DDD has the notion of bounded context, meaning that the context must be limited to the part of the business associated with the domain. A domain needs to support only the data it creates. Other domains may model the same concept in different ways, and there may even be overlap between them. Expect a reasonable amount of copying data between domains. In the world of Data Mesh, all of this is perfectly fine, and seen as a reasonable trade-off for the agility gained by decomposing the business into domains. Moreover, it is the separation of concerns at the domain level, and the avoidance of overly planned global concepts and models, that enables more nimble thinking. Yes, it’s acceptable to relate different notions of the same concept or model across domains, but you should not attempt to create a single über-model that spans domains.
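To make the idea of bounded context a little more concrete, here is a minimal sketch in Python. The "sales" and "support" domains, the class names and the fields are all hypothetical and do not come from the book; the point is only that each domain keeps its own model of a customer, and the two are related through a shared identifier rather than merged into one global model.

```python
from dataclasses import dataclass


# Hypothetical "sales" domain: a customer is someone who places orders.
@dataclass(frozen=True)
class SalesCustomer:
    customer_id: str
    lifetime_order_value: float


# Hypothetical "support" domain: a customer is someone who opens tickets.
@dataclass(frozen=True)
class SupportCustomer:
    customer_id: str
    open_tickets: int


def related(sales: SalesCustomer, support: SupportCustomer) -> bool:
    """Relate the two domain views through a shared identifier only.

    Neither domain has to adopt the other's model, and no global
    Customer model spanning both domains is introduced.
    """
    return sales.customer_id == support.customer_id


sales_view = SalesCustomer(customer_id="c-42", lifetime_order_value=1250.0)
support_view = SupportCustomer(customer_id="c-42", open_tickets=2)
print(related(sales_view, support_view))
```

Each domain can change its own fields without coordinating with the other, as long as the shared identifier stays stable. That is one way to read the trade-off described above, not the only one.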
Categories of Domain Data
Data Mesh also defines three categories of domain data (the author calls them “archetypes”), which relate to how close the data is to the source in its original form. The first category is source-aligned domain data, or data that reflects the business facts generated by the operational systems held within the domains. The second is aggregate domain data, or analytical data that joins data from multiple upstream domains. Finally, we have fit-for-purpose domain data, or data that has been customized for specific end-user use cases. The first of these is the most foundational, since the other two can be synthesized from it. Source-aligned data must be permanently, immutably captured for the benefit of the rest of the business.
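The relationship between the three archetypes can be sketched in a few lines of Python. This is an illustration under our own assumptions, not an implementation from the book: the event types, the function names and the ticket threshold are hypothetical. Source-aligned facts are modeled as immutable records, the aggregate combines facts from two upstream domains, and the fit-for-purpose view shapes the aggregate for a single consumer.

```python
from dataclasses import dataclass


# Source-aligned domain data: business facts as the operational systems
# produced them. frozen=True mirrors the idea of immutable capture.
@dataclass(frozen=True)
class OrderPlaced:  # hypothetical fact from a "sales" domain
    order_id: str
    customer_id: str
    amount: float


@dataclass(frozen=True)
class TicketOpened:  # hypothetical fact from a "support" domain
    ticket_id: str
    customer_id: str


def aggregate_customer_activity(orders, tickets):
    """Aggregate domain data: combine facts from two upstream domains per customer."""
    activity = {}
    for order in orders:
        entry = activity.setdefault(order.customer_id, {"spend": 0.0, "tickets": 0})
        entry["spend"] += order.amount
    for ticket in tickets:
        entry = activity.setdefault(ticket.customer_id, {"spend": 0.0, "tickets": 0})
        entry["tickets"] += 1
    return activity


def at_risk_customers(activity, ticket_threshold=2):
    """Fit-for-purpose domain data: a view shaped for one specific use case."""
    return sorted(
        customer_id
        for customer_id, stats in activity.items()
        if stats["tickets"] >= ticket_threshold
    )


orders = [
    OrderPlaced("o-1", "c-1", 40.0),
    OrderPlaced("o-2", "c-1", 60.0),
    OrderPlaced("o-3", "c-2", 15.0),
]
tickets = [TicketOpened("t-1", "c-2"), TicketOpened("t-2", "c-2")]

activity = aggregate_customer_activity(orders, tickets)
report = at_risk_customers(activity)
print(activity)
print(report)
```

Both derived views are computed purely from the source-aligned facts, which is why that category is described as the most foundational: the aggregate and the fit-for-purpose view can be rebuilt at any time as long as the original facts are retained.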
Augmenting the Principle of Domain-Oriented Ownership
Decentralized control over data can improve data quality, enable a large organization to respond rapidly to constant changes in an evolving business climate, and cultivate resilience. It also fortifies the scalability of the entire data organization. However, without a central team to organize the data, new risks may emerge, and so Zhamak adds three other principles to support and reinforce the main idea of domain-oriented ownership. The four principles listed together are:
- Domain-oriented ownership
- Data as a product
- Self-service infrastructure as a platform
- Federated computational governance
The various principles address the issues surrounding domains. For instance, increased autonomy granted to domains is balanced with the responsibility to provide usable, high-quality data products. In order to support non-technical data experts within the domains, we provide domains with a self-service platform to create and manage data products. Also, provisioning this self-service platform reduces the cost of building those data products. Since absolute autonomy could yield inconsistencies and violate corporate governance rules, we enforce consistency through a federated governance model. Zhamak says that the four principles are “collectively necessary and sufficient; they complement each other and each addresses new challenges that may arise from the others.” The diagram below shows how these principles complement each other, and their main dependencies.
We have spent decades orienting data creation around technologies, and lines of communication, rather than around the components of the business. Data Mesh aims to end that misalignment, which is cracking at the seams under the increasing pressure to add more sources, and simultaneously deliver more insights and greater innovation. Reorienting our data practices much more closely to the business offers larger organizations a chance to respond more rapidly to changes in the business, to innovate faster with data and deliver greater value.
This chapter has focused mostly on the benefits of decentralization and the notion of domain-oriented ownership. The next chapters will focus on the other principles, which ensure that decentralization and autonomy don’t lead to inconsistency and a lack of governance.
Read along with us!
Get your complimentary access now to pre-release chapters from the O’Reilly book, Data Mesh: Delivering Data-Driven Value at Scale, authored by Zhamak Dehghani.