Last Updated: 2024-01-22

Background

Starburst Galaxy allows you to easily query a variety of data sources. It can also be used to join data from multiple data sources through a single point of access.

It is especially useful when querying data lakes or joining data lakes to data warehouses as a way of enhancing efficiency and reducing cost by shifting data from traditional data warehouses towards inexpensive cloud object storage.

Scope of tutorial

This tutorial will get you started using Starburst Galaxy and help demonstrate the federation of data between a data lake and data warehouse using sample data.

After creating a free Starburst Galaxy account, you can jump right in and start using Starburst Galaxy.

Prerequisites

Learning outcomes

Once you've completed this tutorial, you will be able to:

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Background

It's time to set up your Starburst Galaxy account. This is designed to be quick and easy - just a couple of minutes.

Once you're set up, you can query sample data or add your own data sources, including data lakes, or data warehouses.

Excited? Let's get started.

Step 1: Navigate to Starburst Galaxy

Starburst Galaxy is accessed through a web browser, although it can also be used in other ways, including a command line interface. This tutorial will focus on using the web UI, as this is the way that most users interact with the system.

To get started, click the URL below to begin signing up.

Step 2: Provide your information

Starburst Galaxy accounts are registered to a particular name and email address. To set up your account, you will need to provide the following information.

Step 3: Retrieve confirmation code

A confirmation code will be sent to your email address. You will need to copy it from your email into your Starburst Galaxy browser window.

Step 4: Paste verification code into Starburst Galaxy page

Now it's time to paste the confirmation code into Starburst Galaxy.

Step 5: Choose a domain name

Starburst Galaxy accounts each include a domain name. Typically this is either the name of the user, or the department or business function controlling the domain. In either case, it should be meaningful to you and something that you will not forget.

Once selected, the domain will become part of the URL used when you sign in to Starburst Galaxy.

Step 6: Create a password

Now it's time to create a password. This will help protect access to your account. You should never share this password with anyone.

If you need to share access to your account with others, there are more secure ways to do this from within Starburst Galaxy. Future tutorials will demonstrate this process.

Step 7: Tell us how you'll use Starburst Galaxy

Starburst Galaxy is used by many different types of people in different ways. Knowing the ways that you'll be using it will help us meet your needs better.

Before proceeding, tell us a bit about how you are going to use Starburst Galaxy.

Background

Starburst Galaxy's main task is to make running queries easy and intuitive, no matter where your data resides. To help accomplish this, it hits the ground running by providing a simple dataset and a few sample queries to get you started.

You're going to begin by using these sample datasets to execute your first queries.

Step 1: Query sample data

Starburst Galaxy lets you use either sample datasets or your own datasets.

In future tutorials, you will connect your own data sources, but for now you can use sample data.

Step 2: Query the sample catalog

Starburst Galaxy includes several test queries when you first launch it. This is designed to help you get up and running quickly.

The queries have been designed to demonstrate the system's capabilities and hit the needs of most users.

Starburst Galaxy will also display a prompt designed to help you understand the user interface.

Step 3: Select Cluster, Catalog, and Schema

It's time to ensure that your cluster is configured properly in the Query editor before you run your first query.

Step 4: Query customer table from burstbank sample data

The SQL query that you need is already in the Query editor. This helps you hit the ground running and test your cluster quickly.

SELECT custkey, last_name, country FROM sample.burstbank.customer LIMIT 10;

Step 4: View results of first query

Now it's time to view the results of your query. Starburst Galaxy displays the results of your query at the bottom of the page.

Step 5: Execute join using sample data

Now you're going to run another query using sample data. The second query included with Starburst Galaxy joins data within a single catalog of sample data.

Step 6: View results of second query

The results of the second query are displayed below the query editor. This query joins data from tables within the same catalog, but Starburst Galaxy makes it easy to join tables from different data sources using data federation.

The next part of this tutorial will delve into that federation use case, unlocking the power of the open data stack.

Background

Now it's time to shift gears. To showcase the true power of Starburst Galaxy, you're going to use data federation to join two different data sources.

Many organizations store data in multiple locations for a variety of reasons. It's not uncommon for data to be scattered across data lakes, data warehouses, and databases, sometimes both in the cloud and on-prem.

For this scenario, imagine that you work as an imaginary bank that uses Starburst Galaxy, Burst Bank. This bank has data in many different locations, and it's going to be your job to join two of them.

The scenario below outlines the scenario in more detail.

Take action

Now it's time to use Starburst Galaxy to assist Burst Bank by joining all their datasets together to produce the report they need.

In this exercise, the data warehouse and data lake already exist and the necessary connection information is provided for you in this tutorial.

To do this, you will:

  1. Configure catalogs to access the data lake and data warehouse.
  2. Configure a cluster for those catalogs.
  3. Run a federated query to get the results that Burst Bank needs.

Background

Burst Bank holds payment history information in their data lake. In this section you will use the Starburst Galaxy Data Lakehouse connector to access this data.

Step 1: Create new Catalog

Adding a new data source to Starburst Galaxy requires creating a new catalog for that data source.

Step 2: Choose the data source

Data lake object storage is the most popular option for Starburst Galaxy data sources. For this tutorial, we're going to be using Amazon S3.

Step 3: Provide credentials for catalog

Each data source has its own credentials that limit access. You're going to add the credentials needed to access the data lake via Starburst Galaxy. This will create a catalog for that datasource which will keep track of those credentials in the future.

Step 4: Choose the metastore configuration

Now it's time to select the metastore configuration. Starburst Galaxy lets you choose from several options for metastore. You can mix and match metastores and data sources in whatever way makes sense for you. For this tutorial, we're going to select AWS Glue, using the US East North Virginia region.

Step 5: Add final catalog details and test the connection

Table formats control the way that data is stored when it enters the data lake. Today you're going to be using a modern, open table format known as Iceberg. Starburst Galaxy is also able to read other table formats like Hive and Delta Lake, but Iceberg is by far the best table format for most users.

Step 6: Set permissions

Starburst Galaxy allows you to set permissions for every datasource. In this tutorial, we're going to set our access to the data lake to read-only and leave all other default settings in place.

Step 7: Skip adding to a cluster

Starburst Galaxy allows you to add the catalog to a cluster at this point. For this tutorial, you're going to skip this stage and continue adding data sources then add them all to a cluster later.

Scenario

Burst Bank has customer information and customer account information stored in their data warehouse. In this section, you will use the PostgreSQL connector to access this data warehouse.

Step 1: Navigate to the Catalogs section of Starburst Galaxy

Connecting a data warehouse follows much the same process as connecting a data lake, both involve creating new catalogs to access the datasource.

Step 2: Select PostgreSQL datasource

Starburst Galaxy allows you to connect to a PostgreSQL data warehouse just like any other data source. PostgreSQL is listed in the Additional data sources section.

Step 3: Configure the PostgreSQL catalog

Starburst Galaxy allows you to access PostgreSQL data sources on all three major cloud providers: AWS, Google Cloud Platform, and Azure. For this tutorial, you're going to use AWS.


Step 4: Set the connection details

Now it's time to input the credentials that will allow Starburst Galaxy to connect to the PostgreSQL data warehouse.

Step 5: Set permissions to read-only

Just like your data lake, for this tutorial you will only need read-only access. Starburst Galaxy lets you configure this as part of the setup.

Step 6: Skip adding to a cluster

You don't need to add the PostgreSQL data warehouse to a cluster yet. You'll connect both the data lake and data warehouse to the same cluster in the next section of this tutorial.

Scenario

All of Burst Bank's data sources are in AWS in the US East-1 (North Virginia) region. With Starburst Galaxy, the data sources and Starburst Galaxy cluster should be in the same cloud region to reduce cloud costs.

In this step, you will create a new cluster in US East-1 (North Virginia).

Step 1: Create new cluster

Begin by creating a new cluster

Step 2: Name the cluster

Now it's time to name the cluster. This name should be something useful and indicative of its purpose and location.

Step 3: Select catalogs in your cluster

Now you need to add all of the catalogs that you want associated with the new cluster.

In the case of the tutorial, you'll want to select all 5 catalogs.

Step 4: Add remaining cluster details

When you create a new cluster, you need to define certain key details.

Step 5: Notice the cluster startup time

The cluster will now begin starting automatically. This takes a few minutes and you can watch the process update live.

Step 6: Use a shortcut to the Query editor

Now it's time to return to the query editor. You can do this quickly using a shortcut.

Step 7: Confirm that your catalogs are connected

It's always a good idea to confirm that new clusters have been created successfully, and that the correct number of catalogs has been added.

Background

With Starburst Galaxy configured, you are now ready to produce the report for Burst Bank's risk department. Remember, they need a list of customers that have more than 3 delinquent payments in their history and a FICO score of less than 500.

Use SQL to join data warehouse and data lake

In this section, you will run a federated SQL statement across a data warehouse and a data lake to create the critical report the risk department needs.

Run the following SQL command to get the required information. Let's break it down to help understand what it's doing.

Breaking it down

The query returns several fields from different data sources. On the customer side, you're looking for:

On the auto sales side, you're also returning some data, including:

Some of this data is in your lakehouse_burst_bank data lake and some of it is in your postgresql_burst_bank data warehouse. Returning results from two datasets requires a join, linking the customer ID and auto loan ID.

Finally, the delinquency of the account requires some conditional, boolean logic. You need to test whether the conditions for delinquency are true, and then if they are true, return the result if the delinquency is greater than 3.

These results are then aggregated using the GROUP BY command.

SELECT
    c.first_name,
    c.last_name,
    c.fico,
    a.auto_loan_id,
    a.auto_loan_open_date,
    count(al.delinquent_payment) as num_delinquent_payments
FROM
    postgresql_burst_bank.burst_bank_with_stats.customer c
    JOIN postgresql_burst_bank.burst_bank_with_stats.account 
          a on c.custkey = a.custkey
    JOIN lakehouse_burst_bank.burst_bank_with_stats.auto_loan_payment 
          al on al.auto_loan_id = a.auto_loan_id
WHERE
    al.delinquent_payment = 'Y'
    and c.fico < 500
GROUP BY
    a.auto_loan_id,
    c.first_name,
    c.last_name,
    a.auto_loan_open_date,
    c.fico
HAVING
    count(al.delinquent_payment) > 3;

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use Starburst Galaxy to connect different sources of data. Do you have data sources that you would like to connect to Starburst Galaxy? Check out our other tutorial offerings to see how easy it is to connect!

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.