Last Updated: 2024-01-23
Data products represent a new way of consuming, transforming, and managing data that leverages product thinking to create a curated, user-friendly, and accessible experience. This helps open data to new uses and new users, promising to usher in a new era for data analysis and data engineering alike.
To achieve this, datasets, metadata, and access controls are bundled into a single package, which can then be accessed, stored, or shared in accordance with role-based and attribute-based access controls. Data products often have a pronounced effect on organizations, affecting both data engineers and data consumers.
Data products are a fundamental part of Starburst Galaxy, and creating them inside the system is both easy and intuitive.
This tutorial will get you started using Starburst Galaxy data products. You will practice creating and editing data products in your own Starburst Galaxy account.
You need a Starburst Galaxy account to complete this tutorial.
Upon successful completion of this tutorial, you will be able to:
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
At Burst Bank, data consumers face challenges when searching for datasets. This issue prompts them to submit numerous, often redundant requests to the bank's data engineers. Consequently, waiting times for these datasets are prolonged. Additionally, this situation contributes to the emergence of multiple datasets with nearly identical names but minor differences.
This scenario leads to divergent results for data consumers, even when they believe they are working with the same datasets. Ultimately, the difficulties with data discovery and duplication have eroded the confidence that data consumers have in the reliability of the provided datasets.
Burst Bank took an initial step to address these problems by establishing a repository of approved and standardized datasets for data consumers. This initiative has achieved some success, but a persistent issue remains. Not every data consumer or team is aware of the repository, so requests for the same datasets keep coming in. Data discoverability remains an issue.
Burst Bank has decided to further enhance data discoverability by implementing data products within Starburst Galaxy. Help Burst Bank create its first two data products by completing this tutorial.
In Starburst Galaxy, data products are built using schemas as the primary building blocks, but a data product is much more than just a schema. It includes the entire ecosystem surrounding that schema and everything needed to access and share it.
In this sense, you can think of a data product as a package consisting of a schema, the data inside it, all corresponding metadata, and the access controls needed to view and query that dataset.
The process of wrapping up this package occurs when a schema is promoted to a data product. Once promoted, the data product can be accessed or shared in accordance with the access controls placed on it.
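The packaging idea described above can be sketched in a few lines of Python. This is only an illustrative model, not Starburst Galaxy's implementation or API: the `DataProduct` class, the `promote` function, the field names, and the `analyst` role are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical model of a data product: a schema reference plus its context."""
    name: str
    catalog: str
    schema: str
    summary: str = ""
    description: str = ""
    # Role names allowed to view and query the product (illustrative only).
    allowed_roles: set = field(default_factory=set)

    def can_query(self, role: str) -> bool:
        # Access is decided by the controls attached to the product.
        return role in self.allowed_roles


def promote(catalog: str, schema: str, name: str, allowed_roles: set) -> DataProduct:
    """Wrap an existing schema in a data product; only a reference is stored."""
    return DataProduct(name=name, catalog=catalog, schema=schema,
                       allowed_roles=set(allowed_roles))


burst_bank = promote("postgresql_burst_bank", "burst_bank_with_stats",
                     "Burst Bank", {"accountadmin"})
print(burst_bank.can_query("accountadmin"))  # True
print(burst_bank.can_query("analyst"))       # False
```

The point of the sketch is that the product stores the schema's name and its surrounding context, and that access checks run against the product's own controls.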
In this section, you'll promote a schema with no existing metadata to a data product.
Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Starburst Galaxy separates users by role. Creating a new data product will require access to a role with appropriate privileges. Today, you'll be using the accountadmin role.
Your current role is listed in the top right-hand corner of the screen.
You're going to be using the same cluster, aws-us-east-1-free, that you set up in the Starburst Galaxy: Getting started tutorial.
Enabling a default cluster is an important step when creating a data product because it determines the cluster that will be used whenever that data product is accessed in the future. For this reason, the cluster you select should be one that works for current and future users of the data product.
It is always best practice to confirm that a new cluster has been enabled successfully.
For this tutorial, you're going to use the Burst Bank PostgreSQL catalog that you configured in the Starburst Galaxy: Getting started tutorial. That catalog contains a number of schemas, one of which will be used to create a data product.
Scenario: The first schema that Burst Bank would like to promote to a data product is the burst_bank_with_stats schema located in their postgresql_burst_bank catalog.
Data products are meant to be used and shared across teams. This means that naming them, providing context, and writing clear descriptions of the schema inside are critical.
Including a thorough description along with relevant background information in a data product increases its value to data consumers.
In Starburst Galaxy, you can use Markdown to format data product descriptions. This enables you to incorporate various elements such as images, links, code snippets, and more. These formatting elements ensure that data consumers have access to all the necessary background information in a single location.
# Burst Bank
Burst Bank is a fictional bank with fictional employees, customers, and accounts. It contains nine tables, including the following:
* account
* auto_loan_payment
* credit_card_payment.
## Products
Burst Bank customers can obtain credit cards, mortgage loans, and auto loans.
## Common uses
Most often, this database is used to understand which customers to target for up- and cross-sell campaigns.
Use the following query as a starting point:
```
SELECT c.first_name, c.last_name, c.estimated_income, a.products, a.cc_number, a.mortgage_id
FROM burst_bank_with_stats.customer c
JOIN burst_bank_with_stats.account a on a.custkey = c.custkey;
```
## Image
Burst Bank's logo
![burst bank logo](https://everpath-course-content.s3-accelerate.amazonaws.com/instructor%2Fejxo7n54y6ft3b0yyj7o4es6j%2Fpublic%2F1685043892%2Fburst_bank_logo.1685043892254.png)
After entering the markdown text, you can test how it renders before completing the process.
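If you maintain several data products, you may prefer to generate descriptions from a template so they all follow the same layout. The following is a minimal sketch of that idea in Python. The `build_description` helper is hypothetical and not part of Starburst Galaxy, and the shortened sample query is for illustration only. You would paste the resulting text into the description field by hand.

```python
def build_description(title, intro, tables, sample_query):
    """Assemble a Markdown description: heading, intro, table list, sample query."""
    fence = "`" * 3  # Markdown code fence, built here to keep this example readable
    lines = [f"# {title}", intro]
    lines += [f"* {table}" for table in tables]
    lines += ["## Common uses",
              "Use the following query as a starting point:",
              fence, sample_query, fence]
    return "\n".join(lines)


description = build_description(
    title="Burst Bank",
    intro="Burst Bank is a fictional bank with fictional employees, customers, and accounts.",
    tables=["account", "auto_loan_payment", "credit_card_payment"],
    sample_query="SELECT c.first_name, c.last_name FROM burst_bank_with_stats.customer c;",
)
print(description)
```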
When a user navigates from the data product to the query editor, the default cluster will be pre-populated in the query editor.
You can add other supporting information to help users understand your data product. This is important and helps ensure that your data product's use and value can be determined by other team members.
In this step, you'll add a link and a contact so that others in the organization can contact you if they encounter difficulties.
In this section, you will promote a schema with existing metadata to a data product. Watch the provided video to see a demo, then complete the steps on your own.
The second schema that Burst Bank would like to promote is the employees schema from the postgresql_burst_bank catalog.
Now it's time to show the metadata. This will be used to create the data product.
Now it's time to add in the schema description. Again, this should clearly indicate the context of the schema in question.
Now you need to add links to the schema metadata. This differs from the previous data product you promoted, which had no metadata.
It's always a good idea to add contact information when creating a new data product. This establishes clear ownership in case questions arise later.
Now you're ready to begin creating the data product by promoting the schema.
Next, it's time to import any metadata from the schema into the new data product during the promotion process.
In this case, you'll want to select all three of the fields suggested.
Now it's time to review the details to see if they are correct. Notice that the description from the schema's metadata is now the data product summary field and that the link and contact were also imported and populated into the correct fields.
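The mapping described here, where the schema description becomes the data product summary and links and contacts carry over, can be sketched as follows. This is an illustrative model only: the `import_metadata` function, the dictionary keys, and the sample values are hypothetical and do not reflect how Starburst Galaxy stores metadata internally.

```python
def import_metadata(schema_metadata, selected_fields):
    """Copy the selected schema metadata fields into data product fields.

    The schema description becomes the data product summary; links and
    contacts keep their names. All field names here are hypothetical.
    """
    target_names = {"description": "summary", "links": "links", "contacts": "contacts"}
    return {target_names[name]: schema_metadata[name]
            for name in selected_fields
            if name in schema_metadata and name in target_names}


employees_metadata = {
    "description": "Fictional org chart and employee records",  # placeholder text
    "links": ["https://example.com/employees-wiki"],             # placeholder URL
    "contacts": ["data-owner@example.com"],                      # placeholder contact
}

# Selecting all three suggested fields, as in this step of the tutorial.
product_fields = import_metadata(employees_metadata,
                                 ["description", "links", "contacts"])
print(product_fields["summary"])
```

Leaving a field out of the selection simply means it is not carried over, which is why reviewing the populated fields afterwards is worthwhile.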
Almost there! It's time to finish adding the last few fields to promote the employees schema to a data product.
# Employees
This database details a fictional org and employees
## Executive team:
* Jane Burst
* John Star
_______________________
In the last sections, you saw the process for creating data products. In this section, you're going to test out what it's like to actually use a data product to solve real data problems. You'll become comfortable finding, exploring, and querying data products in Starburst Galaxy, mirroring real-world scenarios.
Watch the video below to guide you through the process. When complete, follow the step-by-step instructions on your own using your own Starburst Galaxy cluster.
In the Starburst Galaxy left-hand navigation menu, under the Data heading, you'll find the Data products section. This is your first port of call when using data products. It will list all of the data products available for your role, and is your general jumping-off point for daily data product workflows.
You'll become familiar with this section as you work through the following steps.
Data products are also searchable, and this is one of the most popular ways that users incorporate them into their daily workflows.
The search box is also in the data products section, in the top right of the screen. You're going to test it out with a quick search.
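Conceptually, a search over data products combines two filters: what your current role is allowed to see, and whether the search term matches. The sketch below models that with a simple case-insensitive substring match. How Starburst Galaxy actually matches and ranks results is not described in this tutorial, so treat the matching logic, the `search_data_products` function, and the sample records as assumptions.

```python
def search_data_products(products, term, role):
    """Return names of products visible to `role` whose name or summary contains `term`."""
    needle = term.lower()
    return [p["name"] for p in products
            if role in p["roles"]
            and (needle in p["name"].lower() or needle in p["summary"].lower())]


# Hypothetical records standing in for the two data products from this tutorial.
products = [
    {"name": "Burst Bank", "summary": "Customer and account data",
     "roles": {"accountadmin"}},
    {"name": "Employees", "summary": "Fictional org and employees",
     "roles": {"accountadmin"}},
]

print(search_data_products(products, "bank", "accountadmin"))  # ['Burst Bank']
```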
Now it's time to explore inside a data product. You're going to use the Burst Bank data product as an example.
Data products are promoted from schemas. Starburst Galaxy allows you to view the corresponding schema associated with a given data product.
Starburst Galaxy takes you to the Catalog page showing the details of the burst_bank_with_stats schema. This is the schema that was used to create the Burst Bank data product.
Moving between a data product and its schema lets you confirm exactly which schema the data product is built on, which helps with data lineage and other workflows.
When you promote a schema to a data product, the datasets in that schema are not copied into the data product.
Instead, when you query data from a data product, you're querying data from the underlying schema.
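The difference between referencing and copying can be illustrated with a small Python model. The `catalog` dictionary and the `DataProductRef` class are hypothetical stand-ins, and the rows are made up. The sketch only shows the behavior described above: a change to the underlying schema is visible through the data product.

```python
# Hypothetical stand-in for a catalog: schema name -> table name -> rows.
catalog = {"burst_bank_with_stats": {"customer": [{"custkey": 1, "first_name": "Ada"}]}}


class DataProductRef:
    """Holds only the schema name, never a copy of the tables."""

    def __init__(self, name, schema_name):
        self.name = name
        self.schema_name = schema_name

    def rows(self, table):
        # Every read goes back to the underlying schema.
        return catalog[self.schema_name][table]


product = DataProductRef("Burst Bank", "burst_bank_with_stats")
before = len(product.rows("customer"))

# A row is added to the schema after promotion...
catalog["burst_bank_with_stats"]["customer"].append({"custkey": 2, "first_name": "Grace"})

# ...and it is visible through the data product, because nothing was copied.
after = len(product.rows("customer"))
print(before, after)  # 1 2
```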
Starburst Galaxy includes features that let you flip from data product to query editor to help incorporate data products into your workflow.
Your data product includes the sample query that you input earlier in the tutorial.
SELECT c.first_name, c.last_name, c.estimated_income, a.products, a.cc_number, a.mortgage_id
FROM burst_bank_with_stats.customer c
JOIN burst_bank_with_stats.account a on a.custkey = c.custkey;
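To see what this join returns without touching a cluster, you can run the same query against a local SQLite stand-in. This is a sketch under stated assumptions: the column types and the sample rows are made up for illustration, and SQLite is not Starburst Galaxy, so it only demonstrates the shape of the result. Attaching an in-memory database named `burst_bank_with_stats` lets the schema-qualified table names resolve unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite stand-in: attach a database named like the schema so the
# schema-qualified table names in the tutorial query resolve unchanged.
conn.execute("ATTACH DATABASE ':memory:' AS burst_bank_with_stats")
conn.execute("""CREATE TABLE burst_bank_with_stats.customer (
    custkey INTEGER, first_name TEXT, last_name TEXT, estimated_income INTEGER)""")
conn.execute("""CREATE TABLE burst_bank_with_stats.account (
    custkey INTEGER, products TEXT, cc_number TEXT, mortgage_id TEXT)""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO burst_bank_with_stats.customer VALUES (?, ?, ?, ?)",
                 [(1, "Ada", "Example", 90000), (2, "Grace", "Sample", 120000)])
conn.executemany("INSERT INTO burst_bank_with_stats.account VALUES (?, ?, ?, ?)",
                 [(1, "credit_card", "0000-0000", None)])

query = """
SELECT c.first_name, c.last_name, c.estimated_income, a.products, a.cc_number, a.mortgage_id
FROM burst_bank_with_stats.customer c
JOIN burst_bank_with_stats.account a on a.custkey = c.custkey;
"""
rows = conn.execute(query).fetchall()
print(rows)  # only customers with a matching account row are returned
```

Because this is an inner join, the second made-up customer, who has no account row, does not appear in the result.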
You could navigate to the query editor manually, but the Query data button offers an easier way: it moves you directly from the data products section to the query editor.
Notice that you are now in the Query editor and that the cluster, catalog, and schema have been pre-populated.
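Pre-populating the catalog and schema matters because it lets partially qualified table names resolve without extra typing. The sketch below models that resolution in a simplified way, in the spirit of how Trino-style engines fill in missing name parts from the session defaults. The `resolve_table` function is hypothetical and ignores details such as quoted identifiers.

```python
def resolve_table(name, session_catalog, session_schema):
    """Expand a table name to catalog.schema.table using session defaults."""
    parts = name.split(".")
    if len(parts) == 1:
        # Bare table name: fill in both catalog and schema.
        return f"{session_catalog}.{session_schema}.{parts[0]}"
    if len(parts) == 2:
        # schema.table: fill in the catalog only.
        return f"{session_catalog}.{parts[0]}.{parts[1]}"
    if len(parts) == 3:
        # Already fully qualified.
        return name
    raise ValueError(f"unexpected table name: {name}")


# Values the query editor would have pre-populated for the Burst Bank data product.
session_catalog, session_schema = "postgresql_burst_bank", "burst_bank_with_stats"

print(resolve_table("customer", session_catalog, session_schema))
print(resolve_table("burst_bank_with_stats.account", session_catalog, session_schema))
```

This is why the sample query, which names only the schema and table, can run as written once the catalog is set.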
Data product owners can delete data products from their details page. Doing so does not delete the underlying schema.
Deleting data products is easy and uses the options menu that you explored in previous steps.
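The fact that deletion leaves the schema untouched can be pictured with a small model in which data products are entries that point at schemas. The `schemas` set, the `data_products` dictionary, and the `delete_data_product` function are hypothetical stand-ins, not Starburst Galaxy internals.

```python
# Hypothetical in-memory stand-ins for the schema list and the data product list.
schemas = {"burst_bank_with_stats", "employees"}
data_products = {"Burst Bank": "burst_bank_with_stats", "Employees": "employees"}


def delete_data_product(name):
    """Remove the data product entry and return the schema it pointed to."""
    return data_products.pop(name)


freed_schema = delete_data_product("Employees")
print("Employees" in data_products)  # False
print(freed_schema in schemas)       # True
```

In this model the schema could later be promoted again, since only the entry pointing at it was removed.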
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use data products in Starburst Galaxy.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!