As we’ve gone from Data Mesh theory to practice, organizations have been shifting their focus towards the central tenet of Data Mesh: building and managing valuable data products. Data products have become strategic because they enable organizations to make informed, data-driven decisions that reduce costs, drive innovation, or cultivate new business opportunities.
After two years of working with organizations that have adopted Data Mesh and data products, I’ve distilled my thoughts and observations at the most recent Big Data London event and now, in a two-part blog series. Part one focuses on why data products are necessary, the three kinds of data products, and how to govern and manage them. Part two will focus on who will build data products as well as the timeline of creating data products.
Overall, I outline what we have learned about how organizations should combine skills and technology to iterate, fail fast, and create adaptable data products.
Why do we need data products? To close the gap between operational and analytical planes
In the original Data Mesh post by Zhamak Dehghani and in her book (you can get your free copy here), a key aspect of Data Mesh is to close the gap between the operational data and analytical data planes. The operational data plane is the combination of the technology and people supporting the operational data platforms. Meanwhile, the analytical plane is the combination of technology and people supporting the analytical data platforms.
In Zhamak’s book, she goes so far as to state that organizations transitioning from a data warehouse approach to a Data Mesh approach will involve removing the data warehouse layer and having domains responsible for data from the operational plane and the analytical plane.
Another way to think about the two planes, much to the chagrin of everyone I tell, is one I have long favored: the operational and analytical planes as a slice of Victoria sponge cake.
The slice of cake comprises two layers: the analytical plane on top and the operational plane at the bottom. A lovely layer of sticky strawberry jam between the two planes represents the data pipeline responsible for getting data from the operational plane to the analytical plane.
The data warehouse, data lake, or data lakehouse sits in the analytical plane, so if we intend to build data products based only on this layer, we are consuming only the top half of the cake. This inevitably results in us getting sticky, jammy fingers. In the data world, it means we cannot achieve the promised agility of decentralized data ownership.
The reason is: to be truly agile, domains need to be responsible for ingesting data from the operational system, transforming the data, and then serving it. When we introduce a data warehouse, we rely on a centralized data team to perform the ingestion and at least some transformation, which is a Data Mesh anti-pattern. This inevitably results in slow data product development and management.
What we have learned from successful Data Mesh adoption is that the domains need to build and manage data products whose data spans the operational and analytical data plane. They need to consume an entire slice of cake, from top to bottom.
To motivate domains to build data products and achieve agility, there are various approaches that we have seen in terms of skills, responsibilities and incentivization. In all scenarios, we need to ensure that each domain has the technology and data skills required to build data products.
This can significantly increase spend at the enterprise level, with costly data engineering skills duplicated across the domains. An alternative approach is to provide simplified access that abstracts away much of the technology knowledge and skill needed to reach the data in the operational and analytical planes.
This approach greatly reduces the level of technology skill, and thus the expense of specialized resources, required within each domain, while ensuring that data remains a first-class concern. This is one reason why organizations have adopted Starburst as a key component in their Data Mesh implementation.
Data pipelines in a Data Mesh
Over the last year, I’ve heard data pros say that Data Mesh removes the need for data pipelines; however, that is not what I have observed. Pipelines are alive and well. In a Data Mesh, though, a pipeline is essentially a ‘chain’ of data products.
For example, in the image above, we have a data product, which sources data from the CRM system. Its output data is then consumed by another data product that transforms it in a particular way. And then, we have another data product that joins that data with another data product based on the ERP system.
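The chain described above can be sketched in code. This is a minimal, hypothetical model (all product and domain names are invented) in which each data product records its owner and its upstream sources, so responsibility along the pipeline can be traced mechanically:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner: str                                   # domain team accountable for this product
    sources: list = field(default_factory=list)  # upstream data products it consumes

# Hypothetical chain: CRM source product -> cleaned product -> joined with an ERP product
crm_source = DataProduct("crm_customers", owner="sales-domain")
crm_clean = DataProduct("customers_cleaned", owner="sales-domain", sources=[crm_source])
erp_source = DataProduct("erp_orders", owner="finance-domain")
customer_orders = DataProduct("customer_orders", owner="analytics-domain",
                              sources=[crm_clean, erp_source])

def upstream_owners(product):
    """Walk the chain to find who is responsible for each upstream product."""
    owners = {}
    for src in product.sources:
        owners[src.name] = src.owner
        owners.update(upstream_owners(src))
    return owners
```

With this structure, an issue anywhere in the pipeline maps directly to an accountable owner: `upstream_owners(customer_orders)` returns the owning domain for every product the chain depends on.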
The reason this is interesting and different from what has been done before is that we now have clear ownership of each of those data products throughout the pipeline. If there is an issue with the data pipeline, we know immediately who is responsible.
Furthermore, data product owners know whose data they are consuming and who is consuming their data product. This means that the data product owners can notify and collaborate with their upstream data providers and downstream data consumers around changes that they need to make. This area of collaboration and notification is currently experiencing significant debate in the Data Mesh community, especially around the concept of data contracts.
From my observation, these changes are now being integrated into version control systems, so that individual data product owners can make versioned changes as they need without being constrained by consumers of their data products.
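To make the data contract idea concrete, here is a deliberately simplified sketch: a contract as a versioned schema that a data product promises its consumers, with a check applied before records are published. The contract shape, field names, and validation rule are all illustrative assumptions, not a standard format:

```python
# A minimal, illustrative data contract: the schema a data product promises
# to its consumers, plus a semantic version. All names here are hypothetical.
contract = {
    "product": "customer_orders",
    "version": "1.2.0",
    "schema": {"customer_id": str, "order_total": float, "country": str},
}

def conforms(record, contract):
    """Check a record against the contract's schema before publishing."""
    schema = contract["schema"]
    missing = [col for col in schema if col not in record]
    wrong_type = [col for col in schema
                  if col in record and not isinstance(record[col], schema[col])]
    return not missing and not wrong_type

ok = conforms({"customer_id": "C42", "order_total": 99.5, "country": "UK"}, contract)
bad = conforms({"customer_id": "C42", "order_total": "99.5"}, contract)
```

Because the contract carries a version, a producer can evolve the schema under a new version while consumers keep validating against the one they depend on.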
Three types of data products
Next, when we think about the data product types identified in Zhamak’s book, three are clearly defined.
#1 Source-aligned data products
The first one is the source-aligned data product. This represents the data as it is in the operational system with minimal transformation. I am seeing organizations use these as a first step to creating more valuable data products.
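A quick sketch of what “minimal transformation” tends to mean in practice, using invented CRM field names: the source-aligned product keeps the shape of the operational record and applies only mechanical fixes such as key normalization and null handling:

```python
# Hypothetical sketch: a source-aligned data product applies only minimal,
# mechanical transformation to an operational record, preserving its shape.
def to_source_aligned(crm_row):
    return {
        "customer_id": str(crm_row["CUST_ID"]),         # normalize the key to a string
        "signup_date": crm_row["CREATED_DT"],           # pass through unchanged
        "segment": crm_row.get("SEGMENT") or "unknown", # fill operational nulls
    }

row = {"CUST_ID": 1001, "CREATED_DT": "2022-06-01", "SEGMENT": None}
aligned = to_source_aligned(row)
```

No business logic is applied at this stage; that is deliberately deferred to the consumer-aligned products built on top.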
The interesting observation I would make here is that data fabric technologies are beginning to be used to semi-autonomously create these first-level data products. I think this puts to bed the recurring debate about which is the right route forward for an organization, Data Mesh or data fabric; I would suggest that the answer might be both.
This might be a topic that I revisit in a later blog; however, in the diagram below we can see the use of a data fabric to automate the creation of source-aligned data products, which can act as a source for consumer-aligned data products.
#2 Consumer-aligned data products
The next data product type is the consumer-aligned data product. When people refer to ‘data products’ generically, these are the ones they think about and discuss most.
These data products are produced by business experts within the domain and generate value through the codification of business knowledge and expertise. To create them, we need as little ‘technology friction’ as possible: domain experts should be able to build these data products with minimal additional help from inside or outside the domain.
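As a sketch of what “codifying business knowledge” can look like, here is a hypothetical consumer-aligned rule layered on top of source-aligned data. The segment definitions and thresholds are invented for illustration, not a real methodology:

```python
# Hypothetical consumer-aligned product: domain experts codify a business rule
# (an invented customer-segmentation definition) on top of source-aligned data.
def classify_customer(total_spend, orders_last_year):
    """Assign a customer segment using domain-defined thresholds (illustrative)."""
    if total_spend >= 10_000 and orders_last_year >= 12:
        return "high_value"
    if orders_last_year == 0:
        return "dormant"
    return "standard"
```

The point is that the rule itself, the part only the domain experts know, is the whole product logic; everything else should be handled by the platform with as little friction as possible.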
#3 Aggregate data products
Lastly, the TL;DR definition of aggregate data products is that they’re built at a corporate level to drive global KPIs.
There have been many discussions on what these are and how they differ from consumer-aligned data products; that is perhaps a discussion for another day. But we have seen that organizations define aggregate data products in their own way. Below is an image that shows how data products align with the enterprise-level KPIs and corporate objectives of the business.
Further, we can see a top-down approach: defined corporate KPIs are composed of cross-business-unit KPIs, and these lower-level KPIs are enabled by source-aligned or consumer-aligned data products created within the domains. In this picture, the aggregate data products are those that bring together data from the cross-business-unit KPI data products to support the corporate-level KPIs.
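The roll-up described above can be sketched as a tiny aggregation. The business units, KPI names, and figures are all invented; the point is only that the aggregate product combines KPIs published by business-unit data products into a corporate-level view:

```python
# Hypothetical sketch: an aggregate data product rolls up KPIs published by
# business-unit data products into corporate-level KPIs. All numbers invented.
unit_kpis = {
    "retail":    {"revenue": 1.2e6, "churn_rate": 0.04},
    "wholesale": {"revenue": 3.4e6, "churn_rate": 0.02},
}

# Corporate revenue is a straight sum across units.
corporate_revenue = sum(kpi["revenue"] for kpi in unit_kpis.values())

# Corporate churn weights each unit's churn by its revenue share.
corporate_churn = sum(
    kpi["churn_rate"] * kpi["revenue"] for kpi in unit_kpis.values()
) / corporate_revenue
```

Even this toy example shows why aggregate products need their own owner: the weighting choice (revenue-weighted churn here) is itself a corporate-level decision, not something any single business unit can decide.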
Leverage usage metrics to build valuable data products
Data governance is often top of mind when it comes to new initiatives such as building data products. And when we think about governance, historically, we think about access controls, security, ownership, lineage, and usage metrics. Usage metrics are a way of documenting, reporting, and categorizing how data consumers leverage data in their analytics.
I would propose that through usage metrics, we can start to drive behaviors that are important to an organization.
From the perspective of the data product developer, usage metrics are critically important because they are a simple (perhaps too simple) way to measure the value of a data product: broadly, the higher the usage, the higher the value of that data product to the organization. This means that data product developers know which data products to focus on and which to retire. From a senior management perspective, usage metrics can serve as a vehicle for employee incentivization and motivation.
From an end user perspective, the usage metrics of data products provide us with insight into the trustworthiness of a data product. The higher the usage, the higher the trust that we can have in a data product.
Initially, we need to perform business analysis to decipher which data products we think will be valuable. Then, based on the data usage reports, data producers can take proactive action to make the data products easier to use, easier to find and more useful.
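A minimal sketch of this feedback loop, assuming an invented query log and an arbitrary retirement threshold: count usage and distinct consumers per data product, then flag low-usage products for review rather than waiting for complaints:

```python
# Hypothetical sketch: derive simple usage metrics from a query log, then flag
# low-usage data products as candidates for retirement. Log and threshold invented.
from collections import Counter

query_log = [
    {"product": "customer_orders", "user": "ana"},
    {"product": "customer_orders", "user": "ben"},
    {"product": "customer_orders", "user": "ana"},
    {"product": "legacy_extract",  "user": "ana"},
]

usage = Counter(q["product"] for q in query_log)
distinct_users = {p: len({q["user"] for q in query_log if q["product"] == p})
                  for p in usage}

# Arbitrary illustrative rule: fewer than two queries marks a retirement candidate.
retire_candidates = [p for p, n in usage.items() if n < 2]
```

In practice the signal would come from the query engine’s own audit logs, but even this toy version shows how usage data turns product decisions (invest, improve, retire) into something measurable.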
In short, with data products, we want to know exactly who is using them and how data consumers are using them, so that we can measure their value. As a result, we can move from reactive data management to proactive data management.
From reactive to proactive data management
Historically, data ownership has been an afterthought; because data was not treated as a product, changes to data consumed for strategic purposes were handled reactively. With data treated as a product in a Data Mesh, however, its lifecycle becomes proactive, akin to any other product’s. This is the main difference between a ‘data asset’ and a ‘data product’, and it’s a very simple way to define any data product.