We’ve all experienced the dreaded analyst ping: “Hey, can you help me find [XYZ] data for a meeting later today?” You stop what you’re doing. If you’re lucky, you go to a Confluence page that summarizes all of the available data and try to determine whether such a dataset exists. If you win the lottery, what they’re looking for is available. If not, you need to dig through multiple data sources to find the data and ensure it’s fit for use. Worst case, you need to create a new high-quality dataset that is intended to be reused but, if left to chance, won’t see the light of day again.
But wait. A data scientist just pinged you trying to understand the transformation logic of a different dataset. The cycle starts again.
Your day goes by peppered with questions from your stakeholders about the data, and before you know it, you haven’t made any progress on new data initiatives. Instead, you and your teammates have become a living, breathing data catalog. Without you or your data team, the business slows down.
It’s a frustrating cycle. That’s why we’re so excited to announce the public preview of data products in Starburst Galaxy. In Galaxy, you now have the ability to create and manage single- and multi-source data products, increasing the discoverability of your datasets with limited-to-no movement of data.
What are data products in Starburst Galaxy?
Before diving into building data products in Galaxy, let’s cover some key fundamentals. At its core, a data product in Starburst Galaxy is a package of business and technical metadata that includes a pointer to the underlying dataset. It may be helpful to think of a data product as a container with one or more datasets in the Gravity layer.
This lets data teams attach useful context to their datasets, which supports more self-service analytics across the organization, reduces the overhead on data teams, and helps consumers build reports, dashboards, and insights more quickly.
Under the hood, a data product is represented as a schema. Schemas can contain one or more tables, views, and/or materialized views joined from one or more data sources, meaning you can create single-source or multi-source data products. And, depending on the dataset, you do not need to move data to create a data product.
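To make the "schema plus metadata" idea concrete, here is a minimal sketch in Python. It is only a mental model, not Galaxy's API or internal representation: the class names, field names, and catalog names (`sales_lake`, `crm_postgres`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """One table, view, or materialized view inside the data product schema."""
    name: str
    kind: str                    # "table", "view", or "materialized view"
    source_catalogs: List[str]   # catalogs the dataset reads from (hypothetical names)


@dataclass
class DataProduct:
    """Business and technical metadata wrapped around a schema (illustrative only)."""
    catalog: str
    schema: str
    summary: str
    contacts: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)

    def is_multi_source(self) -> bool:
        # A data product is multi-source when its datasets read from more than one catalog.
        sources = {c for d in self.datasets for c in d.source_catalogs}
        return len(sources) > 1


# Hypothetical example: one view that joins a lake catalog with a CRM catalog.
customer_360 = DataProduct(
    catalog="sales_lake",
    schema="customer_360",
    summary="Customer lifetime value by region",
    contacts=["data-team@example.com"],
    datasets=[Dataset("customer_value", "view", ["sales_lake", "crm_postgres"])],
)
```

In this model, `customer_360.is_multi_source()` is true because the single view reads from two catalogs, while the product itself still lives in one schema.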
The Data Products view (shown in the screenshot below) organizes all data products across your organization. This view is useful for two reasons: you can use it to find pre-approved datasets yourself, or you can give your internal stakeholders access to it to enable self-service analytics. Remember, you can grant your stakeholders as much or as little access as you’d like with Galaxy’s built-in access controls.
When used in conjunction with other Gravity-level features, data products enable:
- Easy discoverability with universal search
- Fine-grained security when coupled with built-in access control policies
- Immutable history via audit trails and query history
- Domain ownership by allowing teams to use their own data infrastructure (more on this later)
How to create data products in Starburst Galaxy
By now, you’re wondering, “how can I get started with this?”
Starburst Galaxy gives you two ways to create data products from your data. You can either promote an existing schema to a data product or create a net new data product from a query.
Let’s start by walking you through the generally available way to create data products – promoting from existing schemas. The process is relatively straightforward:
- Navigate to a schema in your data catalog that your role owns
- Click on the “Promote to data product” button
- Provide business context, sample code, helpful links, and contacts that will help your consumers understand the data product more quickly
- Save the metadata
- Voila! You have your first data product
Rinse and repeat the above to transform your targeted datasets into high-quality, highly discoverable curated data products.
You might wonder, “does the data product get exposed to everyone?” Using built-in access controls, you can configure who has access to the schema – and by extension the data product. When users log in, they will see only the data products their role has been granted access to.
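The visibility rule described above can be pictured with a small sketch. Galaxy enforces this itself through its access controls; the function, role names, and grant mapping below are hypothetical and exist only to illustrate the idea that a role sees a data product only if it has access to the underlying schema.

```python
from typing import Dict, List, Set


def visible_data_products(role: str, grants: Dict[str, Set[str]]) -> List[str]:
    """Return the data product schemas that `role` has been granted access to.

    `grants` maps a data product schema name to the set of roles allowed to see it.
    """
    return sorted(name for name, roles in grants.items() if role in roles)


# Hypothetical grants: schema name -> roles with access.
grants = {
    "customer_360": {"analyst", "data_engineer"},
    "finance_close": {"finance"},
}

analyst_view = visible_data_products("analyst", grants)
```

Here `analyst_view` contains only `customer_360`; a role with no grants would get an empty list.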
The other way to create data products, from a query, is now officially in public preview in Starburst Galaxy. If you’re not familiar with Starburst Galaxy, it is built on Trino, the open-source query engine that can federate queries across multiple data sources. You can now use the power of Trino to create multi-source data products without needing to land and transform your data first.
To create a data product from your query results, follow the steps below:
- Navigate to the query editor and write a SQL SELECT statement that joins data tables in different catalogs configured to use the same cluster
- A small data product icon will appear with the results. Click on this to create a new data product
- Define a new schema in a catalog of your choice, provide data product metadata, verify that you’re happy with the query, and create a new dataset inside the data product schema. This can be a table, view (object store type only), or materialized view (object store type only), depending on your performance and freshness needs
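As a rough illustration of the steps above, the sketch below pairs a federated SELECT with the kind of statement that stores its result as a table, view, or materialized view. The catalog, schema, and table names are hypothetical, and this is not necessarily what the Galaxy UI generates behind the scenes; it only shows the shape of the SQL in Trino's three-part `catalog.schema.table` naming.

```python
# A federated query joining tables from two catalogs on the same cluster.
# All catalog, schema, and table names here are made up for illustration.
SELECT_SQL = """\
SELECT c.customer_id, c.region, SUM(o.total) AS lifetime_value
FROM crm_postgres.public.customers AS c
JOIN sales_lake.raw.orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region"""

# The three dataset types named in the steps, mapped to their SQL keywords.
CREATE_KEYWORDS = {
    "table": "CREATE TABLE",
    "view": "CREATE VIEW",
    "materialized view": "CREATE MATERIALIZED VIEW",
}


def build_create_statement(kind: str, target: str, select_sql: str) -> str:
    """Wrap a SELECT in a CREATE ... AS statement for the chosen dataset type."""
    if kind not in CREATE_KEYWORDS:
        raise ValueError(f"unsupported dataset type: {kind!r}")
    return f"{CREATE_KEYWORDS[kind]} {target} AS\n{select_sql}"


# Target lives in the new data product schema (hypothetical name).
statement = build_create_statement(
    "view", "sales_lake.customer_360.customer_value", SELECT_SQL
)
print(statement)
```

Choosing between the three types is the performance and freshness trade-off mentioned in the last step: a view is re-evaluated against the sources on each query, while a table or materialized view stores results.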
And with that, you can now quickly create data products that span multiple data sources. Visit the docs to learn more.
The goal of data products is to foster a secure, self-service exchange of high-quality datasets between data producers (you) and data consumers (your stakeholders) to enable businesses to move faster.
The features announced today enable the production, publication, discovery, and consumption of data products. However, our work isn’t ending there. Stay tuned for more updates on data products in Starburst Galaxy!