3 characteristics of a data product: Structural, process, and functional
Here at Starburst, within the context of our Enterprise and Galaxy data platforms, we define a data product as a dataset that meets structural, process, and functional characteristics. Let’s break this down.
1. Structural characteristics of a data product
First, a data product dataset is a package that consists of tables or views, a defined schema and ownership, and business and operational metadata.
2. Process characteristics — a set of actions taken in developing the dataset to derive quality and value
3. Functional characteristics — criteria to be met to foster trust and encourage utilization
- Discoverability. Publication in an easy-to-find, accessible, and searchable registry that enables consumers to discover and utilize high-quality / high-business-value data sets
- Understandability. Once discovered, a data product should provide robust business and technical documentation, as well as information about the underlying representation of the data. The goal is rapid comprehension of the data set so that consumers can quickly decide whether to use it
- Trustworthiness. The previous characteristic helps answer the question “Is this the correct data product?” This characteristic answers the question “Should I use this data product right now?” Information such as profile, lineage, and other trust-validating metrics gives users what they need to make a decision on consumption
- Standardization for accessibility. Uniform specific standards that enable downstream consumers and data teams to consistently and quickly access any data product in the same manner
- Interoperability. Data products should be interoperable with the tools, skill sets, and languages of the consumer’s choice. For example, a business analyst will prefer SQL, whereas a data scientist may prefer Python for their day-to-day activities in tools of their choice
- Security. Data products should be governed to be secure, ensuring adherence to regulatory and organizational data security policies while also enabling correct access to meet the needs of specific teams.
When a data set goes through product processes, is placed into a container that meets the functional characteristics, and has business and operational metadata applied, the output is a data product that exists as a technical manifestation in a searchable catalog or registry.
The result is an easy-to-find, accessible body of data that is far easier to interpret from both business and technical perspectives, distinguishing it from much of the noise a data catalog may introduce. This lowers the barrier to consumption and, with the right feedback loop between the data consumer and data product owner, helps drive further iterative value from the various registered data products.
Anatomy of data product metadata
Metadata is what elevates a dataset from raw information to a true data product. It provides the context, accountability, and operational guarantees that consumers need to trust and use the data effectively. Here are the essential components that transform a standard dataset into a viable data product:
- Domain Owner and Contact Point: Every data product must have a clearly assigned owner and a contact for support or escalation. This ensures accountability and provides consumers with a direct line for questions or issues.
- Service Level Objectives (SLOs) and Freshness Metrics: SLOs define expectations for availability, latency, and quality, while freshness metrics indicate how up-to-date the data is. Together, they establish trust and reliability for consumers.
- Upstream Lineage Dependencies: Understanding where data originates and how it flows through upstream systems is critical for transparency. Lineage metadata helps consumers assess risk and quickly troubleshoot issues.
Sample Data Product Metadata (JSON)
{
  "name": "customer_orders",
  "domain": "sales",
  "owner": {
    "name": "Jane Doe",
    "email": "jane.doe@company.com"
  },
  "service_level_objectives": {
    "availability": "99.9%",
    "freshness": "updated every 15 minutes"
  },
  "lineage": {
    "upstream": ["customer_profiles", "order_transactions"]
  },
  "schema": {
    "fields": [
      {"name": "order_id", "type": "string"},
      {"name": "customer_id", "type": "string"},
      {"name": "order_date", "type": "timestamp"}
    ]
  }
}
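A metadata document like the sample is only useful if the essential components are actually filled in. The following is a minimal Python sketch, not a Starburst API, that checks a metadata dictionary for an owner, SLOs, lineage, and schema. The function name `missing_metadata` and the list of required keys are assumptions made for illustration, based on the fields shown in the sample.

```python
from __future__ import annotations

import json

# Required top-level keys, taken from the sample metadata above.
# This list is an illustrative assumption, not a Starburst-defined contract.
REQUIRED_KEYS = ("name", "domain", "owner", "service_level_objectives", "lineage", "schema")

SAMPLE = json.loads("""
{
  "name": "customer_orders",
  "domain": "sales",
  "owner": {"name": "Jane Doe", "email": "jane.doe@company.com"},
  "service_level_objectives": {"availability": "99.9%", "freshness": "updated every 15 minutes"},
  "lineage": {"upstream": ["customer_profiles", "order_transactions"]},
  "schema": {"fields": [{"name": "order_id", "type": "string"}]}
}
""")


def missing_metadata(metadata: dict) -> list[str]:
    """Return a sorted list of dotted paths that are absent or empty."""
    problems = [key for key in REQUIRED_KEYS if not metadata.get(key)]

    # Accountability: an owner needs both a name and a contact point.
    owner = metadata.get("owner") or {}
    problems += [f"owner.{field}" for field in ("name", "email") if not owner.get(field)]

    # Operational guarantees: availability and freshness expectations.
    slos = metadata.get("service_level_objectives") or {}
    problems += [
        f"service_level_objectives.{field}"
        for field in ("availability", "freshness")
        if not slos.get(field)
    ]

    # Transparency: upstream dependencies and a declared schema.
    if not (metadata.get("lineage") or {}).get("upstream"):
        problems.append("lineage.upstream")
    if not (metadata.get("schema") or {}).get("fields"):
        problems.append("schema.fields")

    return sorted(set(problems))
```

A check like this could run before a data product is registered, so that an entry without an owner or SLOs is flagged before consumers find it.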
Enter Starburst Galaxy and Gravity
With Starburst Gravity, you can have your cake and eat it too.
Gravity is a universal discovery, governance, and sharing layer in Starburst Galaxy that enables the management of all data assets connected to Galaxy.
Gravity provides a holistic platform made up of the following capabilities, which work together:
- Data Source & Product Cataloging
- Universal Search
- Centralized Data Governance
- Data Product Creation & Management
- Federated Queries
Data cataloging with metadata management in Gravity enables you to increase data literacy and accessibility across your data sources and data products. Data product creation and management enable data teams to register data products in a centralized registry for data consumers to view and use. The major differentiator of the Gravity data products feature is that it builds on Trino, the query engine underlying Starburst Galaxy, and so can draw on federated queries.
Most data product workflows require centralization of data, resulting in significant process and technical overhead from data movement. Gravity data products can be created from data federated across multiple sources, and by leveraging logical views, this can be done without any data movement. Teams are empowered to use and manage the infrastructure of their choice while providing curated data for general, repeat use without relying on a centralized data team. This gives them the agility to deliver data of business value quickly and iteratively.
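To make the idea of a logical view over federated sources concrete, here is a small Python sketch that builds a Trino-style `CREATE OR REPLACE VIEW` statement joining tables from two different catalogs. The catalog, schema, and table names are hypothetical, and the helper `federated_view_sql` is an illustration rather than part of Galaxy or Gravity. Whether a view can be created in a given catalog depends on the connector.

```python
from __future__ import annotations


def federated_view_sql(view: str, left: str, right: str, join_key: str, columns: list[str]) -> str:
    """Build a CREATE VIEW statement that joins two fully qualified tables.

    Names are interpolated as-is, so callers are assumed to pass trusted,
    already-validated identifiers in catalog.schema.table form.
    """
    select_list = ",\n  ".join(columns)
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r\n"
        f"  ON l.{join_key} = r.{join_key}"
    )


# Hypothetical catalogs: one backed by an operational database, one by a data lake.
sql = federated_view_sql(
    view="lake.sales.customer_orders",
    left="postgres_crm.public.customer_profiles",
    right="lake.raw.order_transactions",
    join_key="customer_id",
    columns=["r.order_id", "l.customer_id", "r.order_date"],
)
print(sql)
```

Because the view only stores the query, the underlying rows stay in their source systems and are read at query time.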
When data cataloging, data product creation, and data product management are combined with Gravity’s centralized data governance, regulatory and PII policies can be applied uniformly and consistently to all data across all clouds and regions, whether it is a data catalog entry or a data product entry. This simplifies the governance configuration surface area, which reduces both the overhead of compliance and the risk of errors.
The ever-expanding galaxy
In the coming months, Starburst Galaxy will be introducing two exciting new Gravity features that can be applied across both the data catalog and data products:
- Data Quality
- Data Lineage
Data lineage will provide visibility into the data flow from upstream data sources of data products, enabling data consumers to more confidently determine the truthfulness of data flowing into a data product. This helps answer the question: “Does the provenance of the data product make sense?” This will also serve as an impact analysis tool, allowing data producers to quickly address data issues when they occur or to risk-assess and mitigate issues before any schema changes are executed.
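The impact analysis idea can be sketched in a few lines of Python. This is not how Gravity implements lineage; it is a minimal illustration that assumes lineage is available as a mapping from each data product to its direct upstream dependencies. The product names and the function `impacted_by` are hypothetical.

```python
from __future__ import annotations

from collections import deque

# Hypothetical lineage: each data product maps to its direct upstream dependencies.
UPSTREAM = {
    "customer_orders": ["customer_profiles", "order_transactions"],
    "revenue_dashboard": ["customer_orders"],
    "churn_features": ["customer_profiles"],
}


def impacted_by(changed: str, upstream: dict[str, list[str]]) -> list[str]:
    """Return every product that directly or transitively depends on `changed`."""
    # Invert the upstream edges so we can walk from a source to its dependents.
    downstream: dict[str, list[str]] = {}
    for product, sources in upstream.items():
        for source in sources:
            downstream.setdefault(source, []).append(product)

    # Breadth-first traversal; `seen` also guards against cycles.
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in downstream.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)
```

A producer planning a schema change to `customer_profiles` could call `impacted_by("customer_profiles", UPSTREAM)` to list the products that need review before the change ships.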
As a step in our data SLA strategy, introducing data quality will allow data producers and consumers to collaborate to establish and monitor data metrics indicating fitness for use for both the catalog and data products. The visibility into the data product’s quality will help consumers answer the question “Can I use this data right now?” More importantly, this feature will enable data producers to monitor and respond to data issues more promptly, often before consumers see them, helping foster trust and confidence between data producers and consumers and ultimately encouraging greater data consumption.
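As a rough illustration of what "fitness for use" metrics might look like, the Python sketch below combines a freshness check with a completeness check. The thresholds, the column name, and the functions `null_rate` and `fitness_report` are assumptions chosen for the example and do not describe the Gravity data quality feature.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where `column` is missing or None (0.0 for no rows)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def fitness_report(
    rows: list[dict],
    last_updated: datetime,
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),  # assumed freshness objective
    max_null_rate: float = 0.01,                 # assumed completeness threshold
    column: str = "order_id",                    # hypothetical key column
) -> dict[str, bool]:
    """Summarize whether the data is fresh and complete enough to use now."""
    fresh = (now - last_updated) <= max_age
    complete = null_rate(rows, column) <= max_null_rate
    return {"fresh": fresh, "complete": complete, "fit_for_use": fresh and complete}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [{"order_id": "a-1"}, {"order_id": "a-2"}]
report = fitness_report(rows, last_updated=now - timedelta(minutes=5), now=now)
```

Publishing a report like this alongside a data product gives consumers a direct answer to whether the data is usable at that moment, and gives producers a signal to act on before consumers notice a problem.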
The synergy of data catalogs and data products
Data catalogs and data products overlap in the value that they provide, namely increasing data literacy and interpretability. But in the grand scheme of things, they occupy different spaces in your data strategy.
Data catalogs serve as the secure backbone for an organization’s full data ecosystem. They give all users democratized access to all data and assist in decision-making around consumption. This is also a double-edged sword, as data consumers are exposed to all data and data types within an organization, regardless of their value.
Data products manifest as part of a data catalog and are intended to deliver value quickly to consumers by providing curated, high-quality data with a high degree of accessibility and interpretability, in a secure and consistent manner. Gravity data products take this a step further by unshackling data teams from centralized data processes, minimizing or eliminating data movement, and enabling the utmost in agility by leveraging the power of federated queries in Starburst Galaxy.
With soon-to-be-introduced data quality and lineage in Gravity, we are excited to let you know that these features will also leverage the power of Galaxy to provide observability into your data, regardless of source, cloud provider, or region, when connected to Galaxy. Exciting times lie ahead as we foster data value generation and consumption, and we hope you join our journey as we evolve our data platform offering.
FAQs about data products and catalogs
What is the difference between a data catalog and a data product?
Data catalogs serve as the comprehensive, secure inventory of an organization’s entire data ecosystem. In contrast, a data product is a curated, value-driven package of data designed to solve specific business problems using product management principles. While the catalog maps where all data is located, data products serve as refined assets that meet strict structural and functional criteria, ensuring they are ready and valuable for consumers.
What role does metadata play in a data product?
Metadata is the foundation of a data product’s structural and functional integrity, transforming raw data into a usable asset. It includes rigorous business and operational details, such as the schema, data ownership, and definitions, that allow consumers to understand the dataset’s context and value. Effective metadata management ensures the product is not only discoverable but also trusted, providing the information users need to determine its suitability for their specific needs.
What are the key characteristics of a data product?
A robust data product is defined by its structural, process, and functional characteristics, which ensure it delivers business value. Structurally, it must include a defined schema and ownership, tables or views, and business or operational metadata. Process characteristics involve applying product and software lifecycle best practices to ensure quality and agility. Functionally, the product must be discoverable, understandable, trustworthy, accessible, interoperable, and secure, enabling easy access and use by data consumers across tools and languages.
How does data lineage support data product trustworthiness?
Data lineage provides critical visibility into the provenance and flow of information, allowing consumers to verify the quality of data entering the product. By mapping the journey from upstream sources, lineage helps users answer whether the data is reliable and makes sense for their current analysis. This transparency also serves as an impact analysis tool for producers, enabling them to address data issues proactively before they impact downstream decision-making.
How do data catalogs and data products work together?
These two concepts work synergistically, where the data catalog acts as the hosting registry and governance layer for data products. The catalog leverages metadata to make data products searchable and secure, ensuring that consumers can easily find high-quality, curated datasets amidst the noise of the larger ecosystem. This combination allows organizations to maintain a secure data backbone while simultaneously empowering teams to deliver agile, high-value data products without excessive centralization.