Data products are curated collections of datasets and business-approved metadata designed to solve specific, targeted questions.
The goal of data products is to make data accessible, consumable, insightful, and actionable for the increasing number of stakeholders who rely on data to inform their decision making.
In this context, the terms curated and value have specific meanings for data products. Let’s learn more.
This allows for a high degree of repeatability across a large number of use cases. At the same time, as the needs of the business evolve, data products evolve with them.
Their benefit lies in the way that they widen and democratize access to data, enhancing the efficiency of all teams.
Each data product contains the components needed to do its job as a discrete object. This differs from traditional data pipelines, which often involve more complexity.
Their ease of access is one of their defining characteristics, and access to the data product should give you all the information you need to gain insights.
Data products contain several different components. Importantly, not every data product uses each of these pieces in the same way or includes every item on the list. Let’s learn more.
Data is the most central part of any data product. This data can come from any source, but ideally, it should be of high quality and reliability.
The list below outlines the typical forms of abstracted data. Just like other data sources used by Starburst, this data can be federated from multiple data sources, providing true flexibility and convenience at the same time.
Typically, the best source for such data is the Consume Layer of a data lake or data lakehouse, though other architectures also exist.
Example: Customer table (1234, ‘Name’, ‘111-111-1111’)
Metadata is also a core component of any data product. It helps control how the data is accessed and how the data product curates the experience for the user.
The list below outlines the types of metadata typically included in a data product.
Example: (ID (integer), Name (varchar), Phone (varchar))
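The pairing of data and metadata in the two examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataProduct` and `Column` classes and the `SQL_TO_PYTHON` mapping are hypothetical names invented here, not part of any Starburst API. The sketch bundles the example customer row with its schema and checks that the row matches the declared types.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical mapping from the SQL-style type names in the example
# to Python types. This is an assumption made for illustration only.
SQL_TO_PYTHON = {"integer": int, "varchar": str}


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str


@dataclass
class DataProduct:
    name: str
    columns: List[Column]  # metadata: what each field is called and its type
    rows: List[Tuple]      # data: the records themselves

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means every row matches the schema."""
        problems = []
        for i, row in enumerate(self.rows):
            if len(row) != len(self.columns):
                problems.append(
                    f"row {i}: expected {len(self.columns)} values, got {len(row)}"
                )
                continue
            for col, value in zip(self.columns, row):
                if not isinstance(value, SQL_TO_PYTHON[col.sql_type]):
                    problems.append(f"row {i}: {col.name} should be {col.sql_type}")
        return problems


# The customer example from the text, packaged as one discrete object.
customer = DataProduct(
    name="customer",
    columns=[
        Column("ID", "integer"),
        Column("Name", "varchar"),
        Column("Phone", "varchar"),
    ],
    rows=[(1234, "Name", "111-111-1111")],
)

print(customer.validate())
```

Keeping the schema next to the rows is what lets a consumer check the data without consulting a separate system, which is the point of treating the data product as a discrete object.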
Data products include curated access to both datasets and related metadata.
The list below outlines the type of access patterns found in data products.
Example: Log in to Galaxy and view within the UI
Packaging these access patterns in an automated way is one of the ways that data products achieve efficiency gains compared to traditional methods.
In Zhamak Dehghani’s book Data Mesh: Delivering Data-Driven Value at Scale, she defines eight core characteristics of any data product. This list is a good starting point for determining the attributes that make up a data product.
Discoverability: Data products should be searchable and easy to find.
Addressability: Users must be able to interact with data products in a reproducible, standardized way, typically a permanent and unique point of access.
Understandability: Data products should be easy to understand and their use case clear.
Trustworthiness: The data inside a data product should come from a trustworthy data source. Data quality must also remain reliable over time. Providing greater visibility of this process is one of the primary goals of all data products.
Accessibility: Data products are designed to be created, saved, and shared across teams as needed. For this reason, they must always be accessible.
Interoperability: Data products should not be proprietary or include anything that negatively impacts interoperability.
Value: Data products should add value for the data teams (especially data science teams) and departments that use them.
Security: Data products must be highly secure. Sharing them across teams must fit with an organization’s wider security and data governance policies.
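Two of these characteristics, discoverability and addressability, lend themselves to a small sketch. The `Catalog` and `ProductEntry` classes below are hypothetical and exist only for illustration, as do the addresses, team names, and tags. The sketch assumes that each data product is published under one unique address and can be found by searching its description or tags.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProductEntry:
    address: str      # permanent, unique point of access (addressability)
    description: str  # plain-language summary (understandability)
    owner: str        # accountable domain team
    tags: List[str] = field(default_factory=list)


class Catalog:
    """A toy in-memory catalog of data products."""

    def __init__(self) -> None:
        self._entries: Dict[str, ProductEntry] = {}

    def publish(self, entry: ProductEntry) -> None:
        # Reject duplicates so that an address always means one product.
        if entry.address in self._entries:
            raise ValueError(f"address already in use: {entry.address}")
        self._entries[entry.address] = entry

    def resolve(self, address: str) -> ProductEntry:
        # Addressability: the same address always returns the same product.
        return self._entries[address]

    def search(self, term: str) -> List[ProductEntry]:
        # Discoverability: match the term against descriptions and tags.
        term = term.lower()
        return [
            entry
            for entry in self._entries.values()
            if term in entry.description.lower()
            or any(term == tag.lower() for tag in entry.tags)
        ]


catalog = Catalog()
catalog.publish(
    ProductEntry(
        address="sales/customer_churn",
        description="Monthly churn by customer segment",
        owner="sales-analytics",
        tags=["churn", "customers"],
    )
)
catalog.publish(
    ProductEntry(
        address="finance/invoice_totals",
        description="Invoice totals per customer",
        owner="finance",
        tags=["billing"],
    )
)

print([entry.address for entry in catalog.search("churn")])
print(catalog.resolve("finance/invoice_totals").owner)
```

A real catalog would also have to cover the other characteristics, such as access control for security, which this sketch deliberately leaves out.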
Users can access data products in several different ways, including:
Let’s unpack the concept of data-as-a-product, and understand how this practice relates to data products.
Data products are best built on top of a data lakehouse, and they change the way that users access the data stored on those technologies in several ways. Greater emphasis on decentralization is key to this. Unlike traditional data warehouses and data lakes, access to a data product does not need to be controlled through a central IT team. At the same time, data products do not typically comprise the entire data source on a data lake or data warehouse. Instead, data products contain data specific to particular use cases. Sometimes these follow organizational divisions and domains, and other times, they speak to interdisciplinary concerns across different domains and departments.
In this sense, data products treat data as more than just an IT resource. In doing so, they help to rewire the way that data is accessed and leveraged on a fundamental level. Data has immense value, but only if it can be used by the right people in the right way. Curated data products make accessing and using data easier for the teams that use that data themselves. Data products take raw data and translate it into something relevant and useful within specific domains and individual business contexts. In fact, data products can even be used to gather data from other data products. The possibilities for unique combinations and collaborations are endless. The people who build data products are also responsible for security, provenance, and ownership so that the final product better reflects the technical requirements of the data within the domain.
At its heart, data-as-a-product is a generalized methodology that applies product thinking to data. To do this, data-as-a-product treats data in a way that maximizes its usefulness and accessibility for both data producers and data consumers by seeing data as a product in and of itself.
Importantly, data-as-a-product is not a thing. It is a practice or way of viewing the world. Such an approach is revolutionary and can be implemented in many different ways using many different technologies.
Data-as-a-product applies certain principles from product management and agile development methodologies. These include:
As a feature, data products make use of this methodology. In this sense, you can think of a data product as one possible instance of data-as-a-product thinking.
However, data products are only one way in which data-as-a-product is realized. Other features also apply product thinking to data. Understanding how data products participate in the practice of data-as-a-product helps situate them among the other features that share this way of thinking.
Data-as-a-product shares certain similarities with DevOps, which addresses infrastructure problems by packaging applications and their environments in ways that help facilitate their deployment.
In the same way, data-as-a-product combines the tools, practices, and cultural philosophy underpinning data into packaged units to help improve their deployment and usability.
Data-as-a-product is a new business model. It applies the principles of strong, user-centric design alongside a clear emphasis on product thinking to approach data in a new way. Like any new business model, it comes with both risks and rewards. Businesses that incorporate this new thinking will be able to overcome previous bottlenecks and realize new goals.
Data-as-a-product resonates with the larger organizational change principle known as data mesh. Although using data mesh is not a necessity when using data products, it is one possibility. Applying data-as-a-product thinking enables decentralization of data operations, moving from central IT teams to the owners of individual business functions.
Because data products are easier to use than traditional alternatives, less technical users can take a more direct approach with data for the first time, leading to a positive breakdown of the bottlenecks holding back change and growth. In this view, data-as-a-product is a precondition for data mesh.
Related reading: Best practices for developing and scaling data products
Related reading: Data products vs. data catalog
Data products are considered one of the four attributes of data mesh. In this sense, data mesh describes a new business paradigm that emphasizes data decentralization over traditional ETL centralization.
In place of the traditional, highly specialized central IT teams, this new approach suggests that organizations should empower individual business domains to create and share data-as-a-product solutions. Using data mesh, domain-specific data sources are linked together but managed independently, rather than consolidated into a single repository. Doing so yields many organizational benefits. For this reason, data mesh is considered as much an organizational change model as a technical model.
Data mesh is a reaction against traditional ETL models. But how is data typically handled? Most organizations deploy a data architecture that consists of three parts:
a) The operational data plane, where data is generated and used to run day-to-day business operations.
b) The ETL process, where data is extracted, transformed, and loaded with the appropriate business context.
c) The analytical data plane, where data is gathered and leveraged to provide intelligence and solve business problems.
Often, this process has been in place for many years and is highly centralized, with its results made available to the wider business.
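The three parts can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the order records are made-up values, and `extract`, `transform`, `load`, and `revenue_by_region` are hypothetical helper names rather than functions from any real ETL tool.

```python
from collections import defaultdict

# a) Operational data plane: records as an application might write them.
#    The values are invented for this example.
operational_orders = [
    {"order_id": 1, "region": "emea", "amount_cents": 1250},
    {"order_id": 2, "region": "EMEA", "amount_cents": 800},
    {"order_id": 3, "region": "apac", "amount_cents": 4300},
]


# b) The ETL process: extract, transform, and load.
def extract(source):
    """Copy records out of the operational system."""
    return list(source)


def transform(records):
    """Apply business context: normalize region names, convert cents to currency units."""
    return [
        {"region": r["region"].upper(), "amount": r["amount_cents"] / 100}
        for r in records
    ]


def load(records, store):
    """Append the transformed records to the analytical store."""
    store.extend(records)


# c) Analytical data plane: the store that analysts query.
analytical_store = []
load(transform(extract(operational_orders)), analytical_store)


def revenue_by_region(store):
    """A typical analytical question asked of the loaded data."""
    totals = defaultdict(float)
    for row in store:
        totals[row["region"]] += row["amount"]
    return dict(totals)


print(revenue_by_region(analytical_store))
```

Note that the business rules live in `transform`. In a centralized model, whoever owns that step needs to know which region spellings are equivalent, which is the kind of domain context the following paragraphs discuss.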
The traditional, centralized approach to data management presents several challenges. First, central IT teams are experts in data, but they are not experts in the context of that data. For this reason, it is difficult for them to determine what is valuable and what is not.
To overcome this, constant communication is needed, which is often slow and involves the communication of complex, domain-specific information to non-domain-specific IT specialists. This creates a huge bottleneck and places a burden on the IT teams to be experts in both data and the business questions surrounding that data. Because of the complexity involved, solutions often arrive too late, as the problems they were meant to solve have changed in the intervening time. All of this inhibits the agility of the data team and leads to a situation where the insights from that data are not being maximized.
Data mesh addresses the problems associated with traditional ETL and centralization using domain decentralization. By empowering the domains themselves to become directly involved in data, the people who understand the context of the data are brought into closer contact with it.
This helps replace a fragile, narrow data pipeline with a more robust approach that involves the whole organization. It also helps overcome some of the bottlenecks associated with a monolithic, centralized IT department and uses technology to overcome shortages in expertise. The whole process of organizational change surrounding this solution is known as data mesh.
Related tutorial: Data mesh and data products tutorial
The benefits of data products impact different types of users of data differently.
In general, we can divide the types of people who work with enterprise data into two groups: data producers and data consumers.
In many organizations, central IT teams own data pipelines and operate close to the data. This business structure was traditionally necessary owing to the technical complexity of ETL pipelines, but it creates a significant amount of work for data producers (i.e. the data engineering team).
Additionally, the data platform engineer is responsible for building and maintaining the infrastructure for the overall data ecosystem, including the data product platform, ensuring that data storage and compute capabilities meet the needs of data management and consumption.
Data products simplify the job of data producers by allowing data consumers to solve many problems themselves. Data producers are free to deal with more complicated cases or exceptions.
Additionally, data products enable data producers and consumers to work cross-functionally, solving problems together in closer alignment and meeting important organizational metrics.
Data products allow data consumers to gain insights more autonomously. As domain users, they operate close to the business problems and understand the impact of datasets as they relate to business insights.
Data products abstract the technical complexity of an ETL pipeline, making the underlying data more accessible to data consumers.
Learn more about data product roles: Data Products For Dummies
Data products can take many forms across multiple domains. Learn about the most common types of data products: What are the different types of data products?
Starburst’s approach to data products uses data-as-a-product thinking at its core. Using data should be as easy as using any other product. Intuitive accessibility informs everything we do, empowering businesses to apply product thinking to solve problems.
Optionality is a core belief of our company. It is a generalized design principle woven through each of our products. It allows organizations to choose the storage systems, table formats, and architectures that make sense, flipping the conventional data paradigm on its head.
Just like everything else at Starburst, data products take advantage of optionality to deliver decentralized, curated access to datasets using the data sources you want. This enables you to achieve:
Data products even allow you to federate and curate at the same time, creating limitless options. This lets you discover, publish, manage, and share business insights from multiple datasets and sources in a simple and user-friendly manner. This expands the possibilities exponentially and ensures that usability and functionality go hand in hand.
Traditionally, data warehouses have used data products extensively. Because of this, these early data products inherited the data warehouse’s belief in a single source of truth. This creates a monolithic approach to data, which is often expensive and resistant to change.
At Starburst, we don’t believe in a single source of truth. Instead, we believe in a single point of access. Our data products allow users to access datasets from disparate sources.
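To make "federate and curate at the same time" concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in. It is not Starburst code: the two attached in-memory databases (`crm` and `billing`), their tables, the sample values, and the `customer_spend` function are all assumptions made for illustration. The federated part is a single query that joins across both sources. The curated part is that only the fields relevant to the use case are exposed.

```python
import sqlite3

# One connection acts as the single point of access.
conn = sqlite3.connect(":memory:")

# Two separate in-memory databases stand in for two independent sources.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute(
    "CREATE TABLE crm.customer (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)"
)
conn.execute("CREATE TABLE billing.invoice (customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO crm.customer VALUES (1234, 'Name', '111-111-1111')")
conn.executemany(
    "INSERT INTO billing.invoice VALUES (?, ?)",
    [(1234, 40.0), (1234, 2.5)],
)


def customer_spend(connection):
    """Federate: join across both sources. Curate: expose only id, name, and spend."""
    query = """
        SELECT c.id, c.name, SUM(i.total) AS total_spend
        FROM crm.customer AS c
        JOIN billing.invoice AS i ON i.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
    """
    return connection.execute(query).fetchall()


print(customer_spend(conn))
```

The phone number stays in the source and never appears in the curated result, while the consumer sees one query surface instead of two systems.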