Demystifying Data Catalogs, Data Products, and Metadata Management

Data catalogs and data products overlap in the value that they provide, but they occupy different spaces in your data strategy


In the data space, two terms often make their way into conversations, sometimes independently and sometimes together — data catalogs and data products. You’ve probably heard the terms used in presentations at conferences, read about them in data strategy articles, or perhaps you’ve used the terms to find products that try to solve your data governance problems. 

If you’re nodding in agreement but not quite grasping the distinction, you’re in the right place. The relationship between data catalogs and data products is crucial, yet nuanced. Below, we’ll unravel these terms together and explore how Starburst Gravity’s features work together to bring value to customers struggling with data discoverability, consumption, and trust.

TL;DR

  • Data catalogs inventory an organization’s entire data ecosystem.
  • Data products turn datasets into valuable consumer-ready assets.
  • Metadata drives discoverability, trust, and usability across teams.
  • Combining both data catalogs and data products optimizes data strategy and access.

Data catalogs increase data literacy

Both data catalogs and data products support your organization’s metadata management strategy and foster data literacy. However, they also serve distinct purposes.

Data catalogs serve as an organization’s comprehensive, secure inventory and backbone of all data assets, mapping out what data is available and where it’s located. While the specific details differ based on data catalog product offerings, the key features of a data catalog are typically:

  • Automated & manual technical, operational, and business metadata management
  • Data governance features
  • Search capabilities
  • Collaboration tools
  • Data lineage functions
  • APIs and connectors to various data sources

Data catalogs augment data management by establishing dynamic directories built on top of all existing data sources — they provide clarity and coherence while promoting security. 

Consider the complexities of our modern data ecosystems: a large number of data sources, varying data formats, and, by extension, the risk of establishing new silos or fortifying existing ones. Data catalog metadata management, connectors, and search capabilities make your data ecosystem easily navigable, interpretable, and accessible, so that a data consumer does not need to rely on a data producer to determine how to use each and every source system of interest.

Further, with the increase in accessibility comes the problem of trustworthiness: “How do I know whether I should use this data?” Integration between data catalog metadata management and data quality tools enables data consumers to assess the fitness of the immediate data set for use, while data lineage enables them to view the provenance of data and quickly understand the quality of data flowing into the data set. This provides the critical information needed to build trust in data for consumption and to quickly answer the question of fitness for use.

Lastly, as regulatory compliance and data privacy become mandatory components of business operations, the role of data catalog governance becomes paramount. Data governance enables organizations to comply with regulatory and data privacy rules while ensuring transparency to data consumers. By leveraging data catalog governance features, organizations can ensure that data consumers have access to the wide array of data in their data ecosystem, but only to the extent allowed for each consumer. To the data consumer, this manifests in many forms, including, but not limited to, search results constrained to data sets they have access to, automated column masking to prevent leakage of PII, and the inability to inadvertently modify data sets.
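To make the consumer-facing effects described above concrete, here is a minimal sketch of access-constrained search and column masking. It is not how any particular catalog implements governance; the `Dataset` class, the role names, and the `***` mask value are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical catalog entry with a simple role-based policy."""
    name: str
    allowed_roles: set[str]
    pii_columns: set[str] = field(default_factory=set)


def search(catalog: list[Dataset], term: str, role: str) -> list[Dataset]:
    """Return only matching data sets that the role is allowed to see."""
    return [d for d in catalog if term in d.name and role in d.allowed_roles]


def mask_row(dataset: Dataset, row: dict, role: str,
             unmasked_roles: frozenset = frozenset({"privacy_officer"})) -> dict:
    """Replace PII column values with a mask unless the role is exempt."""
    if role in unmasked_roles:
        return dict(row)
    return {col: ("***" if col in dataset.pii_columns else val)
            for col, val in row.items()}


catalog = [
    Dataset("customer_profiles", {"analyst", "privacy_officer"}, {"email"}),
    Dataset("customer_payroll", {"privacy_officer"}, {"salary"}),
]

# An analyst searching for "customer" never sees the payroll data set.
results = search(catalog, "customer", "analyst")

# The analyst can read customer_profiles, but the email column is masked.
masked = mask_row(catalog[0],
                  {"customer_id": "c-1", "email": "a@example.com"},
                  "analyst")
```

In this sketch the policy lives on the catalog entry, so the same rule drives both what appears in search results and what is visible in query output.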

Data products increase data value

Having established the critical role data catalogs play in data democratization, we turn our attention to the other side of the coin – data products. A comprehensive inventory of data provided by a data catalog is essential, but it is just one piece of the puzzle. 

For data to be useful, first and foremost, it must provide value to the business. A data catalog’s strength lies in its ability to inventory everything, but the sheer volume of data it contains means that it becomes an amalgamation of data, reports, spreadsheets, and more that data consumers are challenged to sift through. One may go so far as to consider a data product to be data within a data catalog that meets certain criteria. However, it should be emphasized that a data product is more than just a technical manifestation. So what is a data product?

To clarify the distinction between raw data found in a catalog and a curated data product, consider the role of metadata and context:

Feature       | Raw Data (Catalog Entry)            | Data Product
Primary Goal  | Inventory and discovery             | Value delivery and consumption
Context       | Technical details (schema, format)  | Business context (use case, meaning)
Quality       | Variable (“as-is”)                  | Guaranteed (SLAs, monitored)
Ownership     | IT or System Owner                  | Domain/Product Owner
Access        | Permissive or Ad-hoc                | Standardized output ports

Before getting into any specific entity, let us be very clear — a data product must provide value to its intended consumers.


3 characteristics of a data product: Structural, process, and functional

Here at Starburst, within the context of our Enterprise and Galaxy data platforms, we define a data product as a dataset that meets structural, process, and functional characteristics. Let’s break this down. 

1. Structural characteristics of a data product

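First, a data product dataset is a package that consists of a defined schema and ownership, tables or views, and business or operational metadata.

2. Process characteristics of a data product

Second, a set of actions is taken in developing the dataset to derive quality and value, applying product and software lifecycle best practices.

3. Functional characteristics of a data product

Third, a data product meets the following criteria to foster trust and encourage utilization: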

  • Discoverability. Publication in an easy-to-find, accessible, and searchable registry that enables consumers to discover and utilize high-quality, high-business-value data sets.
  • Understandability. Upon discovery, data products should provide robust business and technical documentation, as well as information about the underlying representation of the data, so that consumers can rapidly comprehend the nature of the data set and decide quickly whether to consume it.
  • Trustworthiness. The previous characteristic helps answer the question “Is this the correct data product?” This characteristic answers the question “Should I use this data product right now?” Information such as profile, lineage, and other trust-validating metrics gives users what they need to make a decision on consumption.
  • Standardization for accessibility. Uniform standards that enable downstream consumers and data teams to access any data product consistently, quickly, and in the same manner.
  • Interoperability. Data products should be interoperable with the tools, skill sets, and languages of the consumer’s choice. For example, a business analyst may prefer SQL, whereas a data scientist may prefer Python for their day-to-day activities in tools of their choice.
  • Security. Data products should be governed to be secure, ensuring adherence to regulatory and organizational data security policies while also enabling correct access to meet the needs of specific teams.

When a data set goes through product processes, is placed into a container that meets the functional characteristics, and business and operational metadata is applied, the output is a data product that exists as a technical manifestation in a searchable catalog or registry.

The result is an easy-to-find, accessible body of data that is far easier to interpret from both business and technical perspectives, distinguishing it from much of the noise a data catalog may introduce.  This lowers the barrier to consumption and, with the right feedback loop between the data consumer and data product owner, helps drive further iterative value from the various registered data products.

Anatomy of data product metadata

Metadata is what elevates a dataset from raw information to a true data product. It provides the context, accountability, and operational guarantees that consumers need to trust and use the data effectively. Here are the essential components that transform a standard dataset into a viable data product:

  • Domain Owner and Contact Point: Every data product must have a clearly assigned owner and a contact for support or escalation. This ensures accountability and provides consumers with a direct line for questions or issues.
  • Service Level Objectives (SLOs) and Freshness Metrics: SLOs define expectations for availability, latency, and quality, while freshness metrics indicate how up-to-date the data is. Together, they establish trust and reliability for consumers.
  • Upstream Lineage Dependencies: Understanding where data originates and how it flows through upstream systems is critical for transparency. Lineage metadata helps consumers assess risk and quickly troubleshoot issues.

Sample Data Product Metadata (JSON)

{
  "name": "customer_orders",
  "domain": "sales",
  "owner": {
    "name": "Jane Doe",
    "email": "jane.doe@company.com"
  },
  "service_level_objectives": {
    "availability": "99.9%",
    "freshness": "updated every 15 minutes"
  },
  "lineage": {
    "upstream": ["customer_profiles", "order_transactions"]
  },
  "schema": {
    "fields": [
      {"name": "order_id", "type": "string"},
      {"name": "customer_id", "type": "string"},
      {"name": "order_date", "type": "timestamp"}
    ]
  }
}
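As a rough illustration of how a registry might enforce these essentials, the sketch below checks that a metadata document carries the ownership, SLO, lineage, and schema fields shown in the sample above. The list of required paths and the `missing_fields` helper are assumptions for this example, not part of any Starburst API.

```python
# Field paths treated as mandatory in this sketch (an assumption based on
# the sample metadata above, not an official specification).
REQUIRED_PATHS = [
    ("name",),
    ("domain",),
    ("owner", "name"),
    ("owner", "email"),
    ("service_level_objectives", "availability"),
    ("service_level_objectives", "freshness"),
    ("lineage", "upstream"),
    ("schema", "fields"),
]


def missing_fields(metadata: dict) -> list[str]:
    """Return dotted paths that are absent or empty in the metadata."""
    missing = []
    for path in REQUIRED_PATHS:
        node = metadata
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
        else:
            # The path exists; still reject empty values.
            if node in ("", [], {}, None):
                missing.append(".".join(path))
    return missing


complete = {
    "name": "customer_orders",
    "domain": "sales",
    "owner": {"name": "Jane Doe", "email": "jane.doe@company.com"},
    "service_level_objectives": {
        "availability": "99.9%",
        "freshness": "updated every 15 minutes",
    },
    "lineage": {"upstream": ["customer_profiles", "order_transactions"]},
    "schema": {"fields": [{"name": "order_id", "type": "string"}]},
}
incomplete = {"name": "customer_orders", "owner": {"name": "Jane Doe"}}

complete_gaps = missing_fields(complete)
incomplete_gaps = missing_fields(incomplete)
```

A check like this could run when a data product is registered, so that a dataset without an owner or SLOs is flagged before consumers ever discover it.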

Enter Starburst Galaxy and Gravity

With Starburst Gravity, you can have your cake and eat it too. 

Gravity is a universal discovery, governance, and sharing layer in Starburst Galaxy that enables the management of all data assets connected to Galaxy.

Gravity provides a holistic platform made up of the following capabilities, which work together:

  • Data Source & Product Cataloging
  • Universal Search
  • Centralized Data Governance
  • Data Product Creation & Management
  • Federated Queries

Data cataloging with metadata management in Gravity enables you to increase data literacy and accessibility across your data sources and data products. Data product creation and management enable data teams to register data products in a centralized registry for data consumers to view and use. The major differentiator of the Gravity data products feature is its ability to leverage Trino’s power.

Most data product workflows require centralization of data, resulting in significant process and technical overhead from data movement. Gravity data products can be created from data federated across multiple sources, and by leveraging logical views, this can be done without any data movement. Teams are empowered to use and manage the infrastructure of their choice while providing curated data for general, repeat use without relying on a centralized data team. This gives them the agility to deliver data of business value quickly and iteratively.
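The idea of a logical view over separate sources can be sketched with Python’s built-in `sqlite3` module. This is only a stand-in: two attached in-memory SQLite databases play the role of independent source systems, and a temporary view joins them. It does not use Trino or Gravity, and the database, table, and column names are hypothetical.

```python
import sqlite3

# Autocommit mode, so ATTACH is never issued inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Two separate in-memory databases stand in for two source systems.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

conn.execute("CREATE TABLE crm.customer_profiles (customer_id TEXT, name TEXT)")
conn.execute("CREATE TABLE sales.order_transactions (order_id TEXT, customer_id TEXT)")
conn.execute("INSERT INTO crm.customer_profiles VALUES ('c-1', 'Ada')")
conn.execute("INSERT INTO sales.order_transactions VALUES ('o-1', 'c-1')")

# The "data product" is a logical view: no rows are copied out of either
# source. In SQLite, only TEMP views may span attached databases.
conn.execute("""
    CREATE TEMP VIEW customer_orders AS
    SELECT o.order_id, c.customer_id, c.name
    FROM sales.order_transactions AS o
    JOIN crm.customer_profiles AS c ON c.customer_id = o.customer_id
""")

query = "SELECT order_id, customer_id, name FROM customer_orders ORDER BY order_id"
rows_before = conn.execute(query).fetchall()

# A new order written to the source shows up in the view immediately.
conn.execute("INSERT INTO sales.order_transactions VALUES ('o-2', 'c-1')")
rows_after = conn.execute(query).fetchall()
```

Because the view is evaluated at query time, consumers always read the current state of each source, which is the property that removes the need for a separate copy of the data.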

When data cataloging and data product creation and management are combined with Gravity’s centralized data governance, regulatory and PII policies can be applied uniformly and consistently to all data across all clouds and regions, whether it is a data catalog entry or a data product entry. This reduces not only the overhead of compliance but also the risk of errors, because the governance configuration surface area is smaller.

The ever-expanding galaxy

In the coming months, Starburst Galaxy will be introducing two exciting new Gravity features that can be applied across both the data catalog and data products:

  • Data Quality
  • Data Lineage

Data lineage will provide visibility into the data flow from upstream data sources of data products, enabling data consumers to more confidently determine the truthfulness of data flowing into a data product. This helps answer the question: “Does the provenance of the data product make sense?” This will also serve as an impact analysis tool, allowing data producers to quickly address data issues when they occur or to risk-assess and mitigate issues before any schema changes are executed.
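Both uses of lineage described above, provenance for consumers and impact analysis for producers, come down to walking a dependency graph in one direction or the other. The following is a minimal sketch over a hypothetical lineage graph; the dataset names and the `DOWNSTREAM` mapping are invented for illustration and say nothing about how Gravity stores lineage.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the data sets in its list.
DOWNSTREAM = {
    "customer_profiles": ["customer_orders"],
    "order_transactions": ["customer_orders"],
    "customer_orders": ["revenue_dashboard", "churn_model"],
}


def reachable(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every node reachable from start."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


def impacted_by(source: str) -> list[str]:
    """Impact analysis: everything downstream of a changed source."""
    return reachable(source, DOWNSTREAM)


def provenance_of(product: str) -> list[str]:
    """Provenance: everything upstream that feeds a data product."""
    upstream: dict[str, list[str]] = {}
    for parent, children in DOWNSTREAM.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    return reachable(product, upstream)


impact = impacted_by("order_transactions")
provenance = provenance_of("churn_model")
```

A producer planning a schema change to `order_transactions` would review `impact`, while a consumer evaluating `churn_model` would review `provenance`.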

As a step in our data SLA strategy, introducing data quality will allow data producers and consumers to collaborate to establish and monitor data metrics indicating fitness for use for both the catalog and data products. The visibility into the data product’s quality will help consumers answer the question “Can I use this data right now?” More importantly, this feature will enable data producers to monitor and respond to data issues more promptly, often before consumers see them, helping foster trust and confidence between data producers and consumers and ultimately encouraging greater data consumption.
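What a fitness-for-use metric might look like can be sketched with two simple checks: freshness and the null rate of a key column. The thresholds, column name, and function names below are assumptions chosen for illustration; they do not describe the upcoming Gravity feature.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the data was refreshed within the allowed window."""
    return now - last_updated <= max_age


def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows in which the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def fit_for_use(rows: list[dict], last_updated: datetime, now: datetime,
                max_age: timedelta = timedelta(minutes=15),
                key_column: str = "customer_id",
                max_null_rate: float = 0.05) -> bool:
    """Combine both checks into one yes/no answer for the consumer."""
    return (is_fresh(last_updated, now, max_age)
            and null_rate(rows, key_column) <= max_null_rate)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
clean_rows = [{"customer_id": "c-1"}, {"customer_id": "c-2"}]
sparse_rows = [{"customer_id": "c-1"}, {"customer_id": None}]

healthy = fit_for_use(clean_rows, now - timedelta(minutes=5), now)
stale = fit_for_use(clean_rows, now - timedelta(hours=1), now)
sparse = fit_for_use(sparse_rows, now - timedelta(minutes=5), now)
```

Producers could run checks like these on a schedule and alert on failures, while consumers see only the resulting pass or fail status next to the data product.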

The synergy of data catalogs and data products

Data catalogs and data products overlap in the value that they provide, namely increasing data literacy and interpretability. But in the grand scheme of things, they occupy different spaces in your data strategy.

Data catalogs serve as the secure backbone for an organization’s full data ecosystem. They give all users democratized access to all data and assist in decision-making around consumption. This is also a double-edged sword, as data consumers are presented with every data set and data type in the organization, regardless of its value.

Data products manifest as part of a data catalog and are intended to deliver value quickly to consumers by providing curated, high-quality data with a high degree of accessibility and interpretability, in a secure and consistent manner. Gravity data products take this a step further by unshackling data teams from centralized data processes, minimizing or eliminating data movement, and enabling the utmost in agility by leveraging the power of federated queries in Starburst Galaxy.

With soon-to-be-introduced data quality and lineage in Gravity, we are excited to let you know that these features will also leverage the power of Galaxy to provide observability into your data, regardless of source, cloud provider, or region, when connected to Galaxy. Exciting times lie ahead as we foster data value generation and consumption, and we hope you join our journey as we evolve our data platform offering.

FAQs about data products and catalogs

What is the difference between a data catalog and a data product?

Data catalogs serve as the comprehensive, secure inventory of an organization’s entire data ecosystem. In contrast, a data product is a curated, value-driven package of data designed to solve specific business problems using product management principles. While the catalog maps where all data is located, data products serve as refined assets that meet strict structural and functional criteria, ensuring they are ready and valuable for consumers.

What role does metadata play in a data product?

Metadata is the foundation of a data product’s structural and functional integrity, transforming raw data into a usable asset. It includes rigorous business and operational details, such as the schema, data ownership, and definitions, that allow consumers to understand the dataset’s context and value. Effective metadata management ensures the product is not only discoverable but also trusted, providing the information users need to determine its suitability for their specific needs.

What are the key characteristics of a data product?

A robust data product is defined by its structural, process, and functional characteristics, which ensure it delivers business value. Structurally, it must include a defined schema and ownership, tables or views, and business or operational metadata. Process characteristics involve applying product and software lifecycle best practices to ensure quality and agility. Functionally, the product must be discoverable, understandable, trustworthy, accessible, interoperable, and secure, enabling easy access and use by data consumers across tools and languages.

How does data lineage support data product trustworthiness?

Data lineage provides critical visibility into the provenance and flow of information, allowing consumers to verify the quality of data entering the product. By mapping the journey from upstream sources, lineage helps users answer whether the data is reliable and makes sense for their current analysis. This transparency also serves as an impact analysis tool for producers, enabling them to address data issues proactively before they impact downstream decision-making.

How do data catalogs and data products work together?

These two concepts work synergistically, where the data catalog acts as the hosting registry and governance layer for data products. The catalog leverages metadata to make data products searchable and secure, ensuring that consumers can easily find high-quality, curated datasets amidst the noise of the larger ecosystem. This combination allows organizations to maintain a secure data backbone while simultaneously empowering teams to deliver agile, high-value data products without excessive centralization.
