Proponents of Data Mesh understand its many game-changing benefits for large-scale organizations. For those who are new to this reimagined framework, Data Mesh promises to help organizations respond gracefully to an uncertain and volatile business climate; to sustain agility in the face of growth; and to accelerate business value in proportion to already-made data analytics investments. These benefits are very encouraging, and some organizations are far enough along in their Data Mesh journey to have practices and ideas that we can learn from. One such area is data governance.
A move from centralized to decentralized data governance
Data governance is one area that is traditionally thought of as part of data management, with responsibilities and activities performed as a centralized function. Immediately, we can see that the decentralized nature of Data Mesh poses some interesting questions. In this post, we’ll explore securing data within a Data Mesh as well as who is responsible for providing this service.
Domain Owners: data products and decentralized policy management
Once on the Data Mesh journey, domain ownership of data means decentralizing ownership, and as such, the central data team is no longer responsible for traditional data management functions. Instead, domains are responsible for data products, and it’s an important responsibility. A data product can contain data from both operational and analytical systems; that data is transformed and then served to the rest of the business.
Data Mesh also positions domains to focus on shaping and managing the data, rather than managing the data infrastructure. As such, data product developers, data product owners, and the domain owners are focused on molding the data, making it useful and valuable to others. As part of the data product definition, the domain will also be partly responsible for who can access the data that forms part of the data product.
Within a Data Mesh, a group of domain owners would form a governance function and define the security policies that will apply to enterprise data. For example, the federated governance team might define a policy that states, “fields containing personal address information should be protected.”
Then, it is the responsibility of the domain to implement controls that meet the requirement of that policy, and the responsibility of the self-service infrastructure team to provide capabilities that allow compliance with the policy to be verified autonomously.
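To make the split of responsibilities concrete, here is a minimal sketch of what an automated check for the address policy might look like. Everything in it is an assumption for illustration: the `Column` class, the `personal_address` tag name, and the `masked` flag are hypothetical and do not come from any particular platform.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Column:
    """Hypothetical column description published by a domain."""
    name: str
    tags: Set[str] = field(default_factory=set)
    masked: bool = False  # True when the domain has applied protection


def unprotected_columns(columns: List[Column],
                        sensitive_tag: str = "personal_address") -> List[str]:
    """Return names of columns that carry the sensitive tag but are not masked."""
    return [c.name for c in columns if sensitive_tag in c.tags and not c.masked]


# The domain tags its columns; the platform runs the check automatically.
columns = [
    Column("customer_id"),
    Column("street", tags={"personal_address"}, masked=True),
    Column("postcode", tags={"personal_address"}),
]
violations = unprotected_columns(columns)
print(violations)
```

In this sketch the governance function owns the rule ("tagged address fields must be protected"), the domain owns the tags and the masking, and the platform owns the check that reports any gaps.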
This is a very different approach from how we previously managed data architectures because of its decentralized nature where individual domains (or individual business units) are tasked with policy management and implementation. However, it’s simple to make the case for this responsibility to lie with the domains as they contain the data subject matter experts for their associated business function.
As we shift from centralized to decentralized policy management, the work will be owned by domain owners and delegated to data engineers and subject matter experts in their respective domains.
Data products contain policy as code
There are numerous definitions out ‘in the wild’ as to what a data product actually is. A data product is more than simply data, so below we have listed a few of the definitions that we’ve seen proposed.
Data Product: Table(s)
Is a data product simply a collection of tables or views? This arrangement suggests that a data product is similar to the well-known Kimball data mart and does not allow for metadata to be part of the data product. The truth is that a data product is more than just a set of tables.
Data Product: Metadata and Table(s)
If we move one step to the right, we introduce metadata to the data product. We use the term metadata in its broadest sense, so it includes aspects such as schema, column and row metadata, catalog tags and definition, and policies that define who can see what data within the data product.
Data Product: Table(s), Metadata, Access Pattern
Another step to the right. Now, the definition of the data product includes the different access patterns (or output ports) by which we can exploit the data product. An example could be that the data products expose a SQL interface. The benefit of this arrangement is that you can use BI tools and run SQL against the underlying data that forms part of the data product.
Data Product: Table(s), Metadata, Access Pattern, Code
The next one on the right is where we start thinking about including code. And this is the code that we would use to generate the data within the data product. This is important because this essentially makes a data product something that we can execute on a regular basis.
So now we might want to schedule our data product to execute the code, and when that code gets executed, we create the relevant data sets, which are accessible through the appropriate access patterns and have that metadata, including the catalog information and the security policies.
Data Product: Table(s), Metadata, Access Pattern, Code, Infrastructure
And then the last one on the right is where we start choosing infrastructure, and it’s infrastructure as code.
For example, we can start up some servers, deploy some services on which we run the SQL code to generate the data, then we push the data product metadata into the catalog and push the associated policies into a policy management system. So, when someone accesses the data defined by the data product through their access pattern, they see the relevant information restricted by the policies that are part of the data product.
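The fullest definition above can be sketched as a single descriptor that carries all five layers. This is only an illustration under stated assumptions: the `DataProduct` class, its field names, and the `deployment_plan` helper are hypothetical, and the step names simply mirror the sequence described in the paragraph above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    """Hypothetical descriptor bundling the five layers of a data product."""
    name: str
    tables: List[str]
    metadata: Dict[str, str] = field(default_factory=dict)    # catalog info
    policies: List[str] = field(default_factory=list)         # policy as code
    access_patterns: List[str] = field(default_factory=list)  # output ports
    transform_sql: str = ""                                   # code that builds the data
    infrastructure: Dict[str, str] = field(default_factory=dict)


def deployment_plan(product: DataProduct) -> List[str]:
    """List the deployment steps implied by the layers that are populated."""
    steps = []
    if product.infrastructure:
        steps.append("provision infrastructure")
    if product.transform_sql:
        steps.append("run transformation code")
    if product.metadata:
        steps.append("publish metadata to catalog")
    if product.policies:
        steps.append("push policies to policy manager")
    return steps


best_sellers = DataProduct(
    name="best_sellers",
    tables=["best_sellers_daily"],
    metadata={"owner": "merchants", "description": "Top-selling items per day"},
    policies=["mask fields tagged personal_address"],
    access_patterns=["sql"],
    transform_sql="SELECT item_id, COUNT(*) AS orders FROM orders GROUP BY item_id",
    infrastructure={"cluster_size": "small"},
)
plan = deployment_plan(best_sellers)
print(plan)
```

A product defined with tables only would produce an empty plan, which is one way to see why the "Table(s)" definition falls short: there is nothing to execute, catalog, or secure.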
Self-Service Platform Interoperability for Data Mesh
One of the most important technical capabilities of any technical product deployed within a self-service infrastructure for Data Mesh is interoperability. In fact, we would suggest that interoperability trumps many functional capabilities.
As an example, at the top, you might have these four domains: merchants, marketing, best sellers, and users. We may want to access data in the data products using a SQL interface like Starburst and/or a Spark-based interface.
In the middle, we’ve got a Hive Metastore. If we’re able to share the Hive Metastore between, say, Starburst and Spark, we get a single definition of the data. Sharing metadata between Starburst and Spark is a common architectural pattern that we see at Starburst.
The benefits of a single metadata definition are obvious, but in a Data Mesh they are perhaps even more relevant. Metadata is inherently a data-centric concern, so if we need to have multiple stores of metadata to support different computation engines, then the responsibility for synchronization of that metadata lies within the domain. This then means that we need technology expertise in our domains: folks who understand both the data and the technology concerns well enough to support this synchronization. This could be a significant overhead.
A similar approach is important for policy management. In the diagram above, we see that Immuta is in the same layer as the Hive Metastore, and for good reason. This forms the central policy manager for the different computational engines. When we execute a data product using Starburst — with the policies defined as code — those policies will be deployed into Immuta. Now, when that data is exploited using either Starburst or Spark, those same policies will be seamlessly enforced across both platforms.
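The idea of one policy manager serving several engines can be illustrated with a toy model. This is a sketch only, and it does not use or represent the Immuta, Starburst, or Spark APIs: `PolicyStore` and `Engine` are hypothetical classes, and masking a value with `"***"` is an assumed enforcement behavior.

```python
from typing import Dict, Set


class PolicyStore:
    """Hypothetical central policy manager shared by all engines."""

    def __init__(self) -> None:
        self._masked: Dict[str, Set[str]] = {}

    def deploy(self, table: str, masked_columns: Set[str]) -> None:
        """Register the columns that must be masked for a table."""
        self._masked[table] = set(masked_columns)

    def masked_columns(self, table: str) -> Set[str]:
        return self._masked.get(table, set())


class Engine:
    """Hypothetical compute engine that consults the shared store on every read."""

    def __init__(self, name: str, store: PolicyStore) -> None:
        self.name = name
        self.store = store

    def read(self, table: str, row: Dict[str, str]) -> Dict[str, str]:
        masked = self.store.masked_columns(table)
        return {k: ("***" if k in masked else v) for k, v in row.items()}


store = PolicyStore()
store.deploy("users", {"street"})  # policy deployed once, as part of the data product

sql_engine = Engine("sql", store)
spark_engine = Engine("spark", store)

row = {"user_id": "42", "street": "1 High Street"}
sql_view = sql_engine.read("users", row)
spark_view = spark_engine.read("users", row)
print(sql_view)
```

Because both engines read from the same store, the domain defines the policy once and neither engine needs its own synchronized copy.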
In the example above and as we think about planes of access, it underscores the significance of providing interoperability and governance to different consumers of the data products.
The difference between role-based access control (RBAC) and attribute-based access control (ABAC)
Above, we described data policies as code and how they form a part of the data product definition, but we haven’t unpacked what those policies should include or what the definitions should look like.
Access permissions based upon role or group
Traditionally, authorization policies have been defined using Role Based Access Controls (RBAC). In this approach, an employee’s role in an organization determines the data permissions that an individual is granted. As an example, we could state:
- Andy can read UK data
- AndyGroup (the group that Andy belongs to) can read UK data
- AndyRole (one of the roles assigned to Andy) can read UK data
The key here is that we know the role, group or user that we wish to apply data permissions to when we create these policies.
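The three grants above can be sketched as a simple lookup. This is an illustrative model, not any product's grant syntax: the `GRANTS` table, the `uk_data` dataset name, and the `can_read` helper are all assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

# Each grant names a specific principal (user, group, or role) up front.
GRANTS: Dict[Tuple[str, str], Set[str]] = {
    ("user", "Andy"): {"uk_data"},
    ("group", "AndyGroup"): {"uk_data"},
    ("role", "AndyRole"): {"uk_data"},
}


def can_read(dataset: str, user: str,
             groups: Iterable[str] = (), roles: Iterable[str] = ()) -> bool:
    """Allow access if the user, or any of their groups or roles, holds a grant."""
    principals = [("user", user)]
    principals += [("group", g) for g in groups]
    principals += [("role", r) for r in roles]
    return any(dataset in GRANTS.get(p, set()) for p in principals)


print(can_read("uk_data", "Andy"))
print(can_read("uk_data", "Beth", roles=["AndyRole"]))
print(can_read("uk_data", "Beth"))
```

Note that every entry in `GRANTS` had to be written by someone who already knew the principal's name, which is the constraint discussed next.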
Limits of Role-Based Access Controls (RBAC) with Data Mesh
We might say that all employees with the role of senior manager should get access to specific data, which might be initially correct. Later, there might be a specific subset of data that only some senior managers can see. We might also not know who all the employees are who hold the senior manager title and function.
This latter example is especially true when we consider Data Mesh. In a large organization, for example, the domain owner of the marketing domain won’t know all of the roles of employees in the operations domain. This is where attribute-based access control (ABAC) comes into play.
Attribute-Based Access Control (ABAC) with Data Mesh
Attribute-based access control (ABAC) controls access based on a combination of attributes, i.e., user attributes, resource attributes, attributes associated with the system or application to be accessed and environmental attributes.
We might tag a dataset as being “UK data.” Now, all users that have been tagged with access to UK data will be able to access this data. This means that the domain defining the policies doesn’t need to know about roles, groups and users in other domains, but instead relies on those other domains to maintain the correct mapping from roles, groups and users to the correct tags.
A decentralized approach to data access control, such as ABAC in a Data Mesh, provides significant flexibility and reduces the friction of creating and maintaining security in data products.
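A tag-matching check of this kind can be sketched in a few lines. The dataset names, user names, and tag values below are hypothetical, and denying access to untagged datasets is an assumed design choice rather than a rule from any specific product.

```python
from typing import Dict, Set

# Maintained by the domain that owns the data product.
DATASET_TAGS: Dict[str, Set[str]] = {
    "uk_sales": {"uk_data"},
}

# Maintained by each user's own domain; the data owner never sees roles or groups.
USER_ATTRIBUTES: Dict[str, Set[str]] = {
    "andy": {"uk_data"},
    "beth": {"us_data"},
}


def can_access(user: str, dataset: str) -> bool:
    """Allow access only if the user holds every tag required by the dataset."""
    required = DATASET_TAGS.get(dataset, set())
    if not required:
        return False  # assumption: untagged or unknown datasets are denied by default
    return required <= USER_ATTRIBUTES.get(user, set())


print(can_access("andy", "uk_sales"))
print(can_access("beth", "uk_sales"))
```

The two dictionaries are deliberately separate: the policy author only edits `DATASET_TAGS`, while other domains keep `USER_ATTRIBUTES` current for their own people.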
Data Mesh policies for governance
There are a number of security aspects that need consideration when we move toward a data mesh. The table below attempts to summarize these for easy consumption.
|Centralized governance|Data Mesh governance|
|---|---|
|Managed using SQL Grants or IT tools|Managed in plain language with intuitive UI or “as code” for data products|
|Data policies managed centrally by IT|‘Global data policies’ are created and enforced as part of the data infrastructure provisioning. ‘Domain data policies’ are created and managed in the domain that owns the data product|
|Policy drivers in InfoSec and Privacy are centralized|Policy drivers in InfoSec and Privacy are defined and agreed by the federated governance function|
|Role-Based Access Control|Attribute-Based Access Control where “role” is one of many possible attributes|
|Sensitive data classification and tagging is centrally defined without enforcement|Responsible for defining the regulation requirements for the platform to build in and monitor automatically|
This blog post centered on policy-based data access and discussed ideas for data governance in a data mesh. However, it’s important to be mindful of the data governance processes you already have in place, and of the process you will need to undertake to move from your current state to your target state. To learn more, check out Starburst’s 90-Day Data Mesh pathfinder series.