How data products help AI data governance
Evan Smith
Technical Content Manager
Starburst Data


As Artificial Intelligence (AI) and Machine Learning (ML) use cases continue to expand, businesses are facing an increasing demand for raw data. This was already a key trend in data analytics and Business Intelligence (BI) dashboards, and AI workloads have only intensified it.
AI is only as good as the data that fuels it, but without data governance, AI is missing a critical pillar of success.
AI workloads drive demand for data governance
The insatiable demand for more data is pushing existing data strategies to their limits, particularly when it comes to scaling data governance processes to become AI-ready.
Many data teams are finding that the strict, top-down data centralization approaches they’ve taken to data projects in the past don’t scale to the data volume and data quality needed for AI use cases. Worse still, the data governance approaches of the past are not always up to the needs of AI. To be successful, data and the governance that surrounds it need to scale together.
The data governance problem is the hidden piece of the AI puzzle.
Enter data products
Data products solve the AI data governance problem. Although data products themselves are not new, their use in AI is, and they are well suited to this moment. This is a case of a technology built for analytics finding its ideal use case in the age of AI.
Data products excel at making data governance easy, whether you’re talking about analytics or AI. They streamline data pipeline automation for AI across multiple data sources, helping ensure that your generative AI (GenAI) models and AI agents receive the correct data. As such, data products, and the data governance they facilitate, should be viewed as the missing piece of the puzzle for AI adoption.
In this article, we’ll examine what data governance entails, the role of data products, and why data products are crucial to effectively governing data at an AI scale.
What is data governance?
Let’s start by understanding the problem. What is data governance, and what does having proper data governance require in the age of AI?
Data governance: Both a technical and a business problem
Data governance is a broad term that encompasses data quality, data management, data security, and data privacy. Importantly, it isn’t just a technical problem. It’s also a business problem. Strong data governance ensures that the right data is accessed by the right people at the right time in the right way. To get data governance right, you need both technological and business processes to support it.
Data governance requires forethought and planning
Most importantly, strong data governance isn’t an add-on or afterthought. In modern business, it’s the necessary foundation of every data-related project. Without a firm data governance plan in place, a new data project will never get off the ground.
Data governance from a business perspective
Let’s examine some of the key ingredients that help ensure effective data governance. From a business standpoint, there are multiple aspects to data governance.
Data quality
Data quality involves ensuring that the data entering systems is accurate, relevant, and aligned with business needs. To achieve this, organizations validate that data meets defined criteria for consistency, accuracy, and trustworthiness. Data quality is both a technical and a business challenge. It can be compromised by issues in the underlying technology delivering the data pipeline or by misalignment with real-world conditions, business processes, or stakeholder expectations.
Data security
Data security is about ensuring that only authorized users with the correct permissions can access data. Achieving this involves instituting processes and supporting technologies that govern data access and prevent unauthorized use, including data leaks.
Data compliance and regulation
Data compliance involves aligning internal, organizational data practices, policies, and realities to the rules, laws, and guidelines under which you operate. Strong compliance involves an accurate and level-headed approach to risk assessment across the enterprise. Its scope is wide, but it might include the handling of customer data, intellectual property, and other sensitive information. There are often severe penalties for organizations that fail to adhere to data compliance standards, which can involve multiple jurisdictions and legal frameworks.
Data sovereignty
Data can seem to exist outside of the physical world. But in reality, it’s stored in servers either in cloud data centers or on-premises data centers.
In both cases, geography matters. Data stored in servers is subject to the laws in place within that jurisdiction. This means that handling data in accordance with the laws of each jurisdiction where it resides is an important part of data governance.
Data sovereignty is a growing concern, as more jurisdictions require data to be handled in specific ways in certain scenarios. There are also legal frameworks governing the transfer of data across international borders, making international data transfer as much a legal matter as a technological one.
Data governance from a technical perspective
Next, let’s examine the types of technological objectives required to achieve data governance. Each of these connects with the business needs listed above.
Data architecture that supports comprehensive governance
A data governance architecture needs to be part of your overall technology stack. It needs to enable teams to manage permissions and access to their data, using mechanisms such as role-based access control (RBAC). This should also include tools for bringing data governance to data pipelines to manage and monitor data quality.
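To make the idea of role-based access control concrete, here is a minimal Python sketch. It is an illustration only, not how any particular platform implements RBAC; the role names, the `sales.orders` resource, and the `is_allowed` helper are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping; names are illustrative only.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "analyst": {"sales.orders:read"},
    "data_engineer": {"sales.orders:read", "sales.orders:write"},
}


@dataclass
class User:
    name: str
    roles: set[str] = field(default_factory=set)


def is_allowed(user: User, resource: str, action: str) -> bool:
    """Return True if any of the user's roles grants the action on the resource."""
    needed = f"{resource}:{action}"
    return any(needed in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)
```

The key design point is that permissions attach to roles rather than to individuals, so changing what an analyst can do is a single edit to the mapping instead of one edit per user.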
Tools to help enhance data context
Context makes data easier to understand and trust. A key piece of context is data lineage, which tracks the journey that data takes as it moves through the data pipeline. Context is even more important for AI workflows, where factors such as RAG architecture significantly impact the value generated by AI-driven solutions.
Data producers and consumers can use data lineage, along with other metadata such as ownership information, to understand and verify the origin of data. This fosters greater transparency, which in turn builds trust and leads to increased data usage.
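One way to picture data lineage is as a graph in which each dataset points to the datasets it was derived from. The Python sketch below walks such a graph to find every upstream source. The dataset names and the `upstream_sources` helper are hypothetical, and real lineage tools capture far more detail, such as column-level mappings and timestamps.

```python
# Hypothetical lineage graph: each dataset maps to the datasets it was derived from.
LINEAGE: dict[str, list[str]] = {
    "churn_features": ["customers_clean", "orders_clean"],
    "customers_clean": ["crm_raw"],
    "orders_clean": ["orders_raw"],
}


def upstream_sources(dataset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            # Keep walking upstream from this parent.
            stack.extend(lineage.get(parent, []))
    return seen
```

A consumer evaluating `churn_features` for an AI workload could use a traversal like this to confirm that it traces back only to sources they trust.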
Control sensitive data access
Sensitive data is often subject to strict governance requirements from both regulatory and legal perspectives. As a result, access to personally identifiable information (PII)—including details such as age, birth date, home address, and credit card numbers—requires technology that can meet these specialized demands. A robust data platform needs to be able to classify data by sensitivity level and enforce access controls based on organizational roles and responsibilities.
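As a rough sketch of classification-based access control, the Python example below tags columns with a sensitivity level and masks any value that a role is not cleared to see. The levels, column names, roles, and the `mask_row` helper are assumptions made for illustration; they do not represent a specific product's feature.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    PII = 2


# Hypothetical column classifications and role clearances.
COLUMN_SENSITIVITY = {
    "order_total": Sensitivity.INTERNAL,
    "birth_date": Sensitivity.PII,
    "home_address": Sensitivity.PII,
}
ROLE_CLEARANCE = {
    "analyst": Sensitivity.INTERNAL,
    "privacy_officer": Sensitivity.PII,
}


def mask_row(row: dict, role: str) -> dict:
    """Mask every value whose sensitivity exceeds the role's clearance."""
    # Unknown roles get the lowest clearance.
    clearance = ROLE_CLEARANCE.get(role, Sensitivity.PUBLIC)
    return {
        # Unclassified columns are treated as PII, so the check fails closed.
        col: val if COLUMN_SENSITIVITY.get(col, Sensitivity.PII) <= clearance else "***"
        for col, val in row.items()
    }
```

Treating unknown roles and unclassified columns as the most restrictive case is a deliberate choice in this sketch: a missing classification should hide data, not expose it.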
Monitor data access
Data is constantly being accessed, whether by individual users or AI agents. The data architecture we use must help regulate and control access to ensure compliance. This can take many forms.
For example, audit logs identify who accessed data, when, and what changes they made. This means that a company can always account for how its data was used. Monitoring processes can also track attributes such as data costs and data access frequency, which enables the optimization of data pipelines based on usage.
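Here is a minimal Python sketch of the audit logging and access-frequency ideas described above. The record fields and the `audit_record` and `access_frequency` helpers are hypothetical; a production system would write to durable, tamper-resistant storage rather than keep records in memory.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def audit_record(principal: str, dataset: str, action: str) -> str:
    """Build one JSON audit entry: who did what, to which dataset, and when."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "principal": principal,  # a human user or an AI agent
        "dataset": dataset,
        "action": action,
    }
    return json.dumps(entry, sort_keys=True)


def access_frequency(records: list[str]) -> Counter:
    """Count how often each dataset appears in a list of audit entries."""
    return Counter(json.loads(record)["dataset"] for record in records)
```

Counting accesses per dataset from the same log that supports compliance is one simple way to spot heavily used data that may be worth optimizing.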
What are data products?
Data products help deliver data governance, both at the business level and the technological level. How do they do this?
How data products work
At their core, data products are curated, packaged datasets that are easily accessible and usable by downstream data consumers. Each data product consists of three components:
- Data
- Metadata
- Data access patterns
Collectively, these components provide a single package for your data. Data products aim to make it easy to use and share data. And that’s as true of data analytics workloads, like data visualization or BI tools, as it is of AI workloads. Data products can also be used for agentic AI, as in the case of the Starburst AI agent.
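To illustrate how the three components fit together, here is a small Python sketch of a data product as a single object. This is a conceptual model only and is not Starburst's actual data product schema; the class, its field names, and the example values are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    # Data: here, a SQL definition of the curated dataset.
    query: str
    # Metadata: who owns the product and what it means.
    owner: str
    description: str
    tags: list[str] = field(default_factory=list)
    # Data access patterns: which roles may read the product.
    allowed_roles: set[str] = field(default_factory=set)

    def can_read(self, role: str) -> bool:
        """Return True if the given role may read this data product."""
        return role in self.allowed_roles


# Hypothetical example values.
churn = DataProduct(
    name="customer_churn",
    query="SELECT customer_id, churn_score FROM analytics.churn_scores",
    owner="growth-team",
    description="Weekly churn scores per customer.",
    tags=["ai-ready"],
    allowed_roles={"analyst", "ml_engineer"},
)
```

Because the access rules and ownership details live alongside the data definition, a consumer or an AI agent that finds the product also finds what it needs to know about using it.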
Supporting a single point of data access
Data products turn data silos into purpose-built, curated datasets that anyone with the appropriate permissions can use. Teams can securely share their data products, enabling the secure cluster-to-cluster sharing of data without physically relocating it.
How data products assist AI data governance
The demand for high-quality data was already exploding before the advent of GenAI use cases. The arrival of AI, however, has accelerated that demand even further. With that rise comes the need for data governance.
Why? GenAI solutions are probabilistic systems that use large datasets to make good predictions. The higher the quality of the input, the better the output. Achieving that quality requires data governance.
The role of context in data governance and AI
To produce domain-relevant answers, GenAI applications built on probabilistic models, such as Large Language Models (LLMs), must be supplied with an organization’s own data as context. These apps use processes such as retrieval-augmented generation (RAG) and fine-tuning to incorporate large quantities of that data, which leads to more accurate results. This data must be well-governed to ensure high quality, security, and compliance with all applicable laws and regulations.
The stakes get even higher when it comes to agentic AI, an evolving form of GenAI in which autonomous agents interact with datasets and, increasingly, one another without direct human oversight.
Data products have several unique attributes that make it easier to provide the strict data governance needed for AI solutions. They drive:
- Data efficiency
- Data quality
- Data collaboration
- Data transparency
Let’s look at each of these one by one.
Data efficiency
A traditional method for handling data governance is to centralize a company’s data. However, mandatory centralization doesn’t work. It’s an impossible lift that leaves companies stuck with never-ending migrations and an ungovernable Shadow IT infrastructure.
A single point of access offers a better solution. This strategy deploys data federation to access data sources across your organization. Centralization still occurs when it makes sense, but this time, it’s a choice, not a prerequisite.
Data products help with this by enabling choice around centralization vs. decentralization decisions without sacrificing governance:
- Data producers can publish their data products to a data catalog, where consumers can find and experiment with incorporating multiple curated datasets into unique new AI solutions.
- The company can monitor overall data access, quality, and governance through the data catalog or other monitoring solutions.
Data quality
Typically, locating data is only half the battle. Often, there is no way to determine whether it’s the correct data. It may also be hard to tell who owns the data or how reliable it is.
Again, data products offer a solution. Using data products, users can locate high-quality, curated, and purpose-built data across the organization. Data lineage and other metadata, including documentation, data quality metrics, and ownership information, enable the verification of the data’s origin and accuracy.
All of this is designed to work with data analytics, data applications, and AI workloads.
Data collaboration
Sharing data across teams can be frustrating in many organizations. Data collaboration is often hindered by a lack of uniform access or differences in data formats, resulting in a loss of a comprehensive organizational perspective. Often, the data governance that’s needed for compliance works against the ability of teams to collaborate.
What’s worse, the problem propagates. Since AI relies on this data, a lack of data also results in a lack of insights.
Data products are designed to facilitate data collaboration by making it easy to share data across teams securely. And since data governance is built into data products, collaboration and governance no longer conflict with each other. Both are achievable within the same technological and business framework.
Data transparency
Evaluating the quality, origin, and restrictions surrounding a dataset is critical, whether considering analytics or AI. This can be hard, if not impossible, to accomplish if the data used isn’t well-documented.
With data products, data transparency is built into the system: documentation, data lineage, and ownership information travel with the dataset.
The importance of AI data governance for regulated industries
Data governance is a hard prerequisite for any AI project. For highly regulated industries, it’s even more critical. These industries operate in high-compliance environments where data governance is both an industry expectation and a legal obligation. As a result, a rigorous data governance implementation must be incorporated into every feature and phase of the project.
Data products are indispensable for achieving better AI governance in industries such as:
- Healthcare
- Finance
- Insurance
- Public sector
Healthcare
Health data is some of the most sensitive data in the world. As such, numerous laws worldwide govern its usage, access, and analysis, whether considering analytics or AI.
Full data compliance is both the starting point and the expectation for all healthcare organizations. In this environment, data products are an ideal fit. By building data governance into the technology from the ground up, organizations can ensure that their analytics and AI projects grow within a framework of compliance and regulatory alignment. Knowing this can make the difference between projects going forward and being delayed.
Finance
Financial services organizations handle some of the most sensitive data in the world. They operate in tightly regulated environments for a reason, under regulations that often span multiple jurisdictions. Within this context, data compliance is an expectation supported by regulations, industry practices, corporate governance, and other frameworks.
Data products are a perfect fit for the finance industry. By restricting data access at the data architectural level, financial organizations can move forward with analytics and AI projects, knowing that data governance is already in place.
Insurance
The insurance industry operates on trust. Like other regulated industries, this trust requires governance at all levels, including data governance.
Within this context, data products provide the trust necessary to advance analytics and AI projects.
Public sector
All public sector organizations, regardless of their location, operate in high-compliance environments. Citizen data is highly sensitive. It contains information such as national ID numbers that are central to a person’s identity, as well as highly sensitive financial and health information.
The public sector also faces a high degree of scrutiny regarding its procurement processes. All procurement must be documented to prove there was no undue influence or favor involved.
The movement and handling of government data is strictly regulated and subject to numerous legal frameworks and requirements. For example, the use of air-gapped servers is common in the public sector.
Within this context, data products are a strong fit. By incorporating data compliance from the ground up, public sector organizations can build a solid foundation for data governance.
Managing data products at scale with Starburst
Starburst is built for data governance, and our data products provide the perfect pathway for organizations of any size.
Whether building scalable analytics as a data engineer, data applications as a developer, or AI workflows as part of the next generation of innovation, Starburst data products help you achieve these objectives with data governance from the start. We help some of the largest organizations worldwide achieve business value and drive actionable insights by drawing together data from across their data ecosystem.
We’re also built for the AI era. Our Icehouse architecture, powered by Trino and Apache Iceberg, makes all of your data projects possible using a single foundation. Using Starburst Galaxy, your teams can build, govern, access, share, and explore data products regardless of where the data lives.