Join Starburst on March 18th for the next iteration of our Live Demo Series

Datanova 2022

Starburst Galaxy Lab

Welcome to the Starburst Galaxy hands-on Lab! This lab is designed to showcase features in Starburst Galaxy like cluster deployment and management, querying, security administration, auditing, and much more.

Signing up with Starburst Galaxy

To start off the lab today, you will need to set up your free account with Starburst Galaxy. Click here to sign up, provide the appropriate information and you should receive an email with next steps. Follow the instructions in the email to sign up for Starburst Galaxy while you wait for the lab to begin! Don’t wait for us! If you feel so inclined, go ahead and start playing with Starburst Galaxy and connecting your own data sources.

Also, later in this lab, we have a section where you can optionally follow along by using an RDS instance on your own AWS account. So either make sure you have the information ready for an existing RDS instance or go ahead and create that now as well while you wait.

Querying data in Starburst Galaxy

After some introductory prompts, you’ll start off in a worksheet that has a sample Trino cluster already set up and a sample catalog already connected with sample astronaut and space mission data tables.

Let’s query something right away. Let’s look at the schema we’re dealing with.

  1. Select the sample cluster dropdown in the left hand navigation to view catalogs.
  2. Select the sample catalog dropdown.
  3. Select the demo schema dropdown.
  4. Select the tables dropdown.
  5. Click on the three dots next to astronauts and choose the option to Enter a SELECT statement to see what the
    different columns consist of:This following query will be generated.

    SELECT * FROM "sample"."demo"."astronauts" LIMIT 10;
  6. From the output we notice that there are duplicates of the same astronaut, this is because each row represents not only the astronaut, but all the missions they’ve been on.
  7. It may be interesting to view the astronauts by nationality, let’s run a quick aggregation by running the following query that first condenses the rows by astronaut number to avoid counting astronauts multiple times for each mission:
    SELECT nationality, COUNT(*)
    FROM (SELECT number, nationality FROM sample.demo.astronauts GROUP BY number, nationality)
    GROUP BY nationality;
  8. Let’s run a slightly more complex query where we break down the length of missions into categories to see which astronauts from different countries spend longer times on missions.
    SELECT nationality, mission_category, COUNT(*) as mission_cnt
    FROM (
     SELECT
     nationality,
     CASE
       WHEN hours_mission <= 24 THEN 'Day Mission(s)' WHEN hours_mission > 24 AND hours_mission <= 168 THEN 'Week Mission(s)' WHEN hours_mission > 168 AND hours_mission <= 720 THEN 'Month Mission(s)' ELSE 'Year Mission(s)' END AS mission_category FROM sample.demo.astronauts WHERE hours_mission > 0 AND nationality IS NOT NULL
    )
    GROUP BY nationality, mission_category
    ORDER BY nationality, mission_category;
            

    Now that we’ve got some interesting data, we can check on all our queries by clicking on the Query History and
    Audit Log in the familiar Trino UI

Exploring Query History and Audit Log

We can also take a look at the query history in Starburst Galaxy that not only tracks the history of the queries, but
now this query history is persisted for the duration that your Starburst Galaxy account exists. For those who run Trino, you know that unless you set up your own query listener, this functionality doesn’t exist
out-of-the-box.

What’s more is that you can filter queries by query text, email, role, status, and date, and you can even click into them long after the internal Trino cluster has suspended. This is such an invaluable tool to debug and troubleshoot issues on the client side, and makes bug reporting and deploying a fix a much faster turnaround time from our end. You can also click into the query for more details.

This shows a visual representation of the query stages and how long each stage took. We can dig into more advanced representations like the analyze query output for the query that was run. This also includes Execution details.

Aside from debugging and troubleshooting, Starburst Galaxy’s query history also serves as an incredibly valuable tool for auditing. This pairs up nicely with the audit log.

If you need to track where changes to the cluster come from this will contain what the change was, when, and by whom.

Connecting additional data sources

For this next part of the lab, I’ll set up another connector to Postgres with a pre-existing RDS instance. If you were able to set up an RDS instance, you can use that instance for this portion of the lab. If you don’t have one set up that’s fine you can just follow along during the lab and we’ll publish this video for you to follow the steps to later.

  1. Set up a PostgreSQL database: Click on the Catalogs page in the navigation pane then choose Postgres as the data source.
  2. Enter in the name of the catalog, database url, name, and password.
  3. Click Test connection. Once the connection is established, you can save the catalog.

Copying data and federated queries in Starburst Galaxy

Trino is also well known for handling ad hoc queries across multiple databases, however you can also use Starburst Galaxy to create new datasets from existing datasets. We have a dataset sitting in S3 that we’d like to load into the PostgreSQL instance. We can do this pretty simply with a CREATE TABLE AS also called a CTAS query.

CREATE SCHEMA postgres.demo;
CREATE TABLE postgres.demo.astronauts AS
SELECT * FROM sample.demo.astronauts;

Now, we just copied one dataset from our data lake into our PostgreSQL database. Let’s run a quick federated query across our data lake and PostgreSQL.

SELECT
   m.company_name,
   a.nationality as astronauts_nationality,
   COUNT() as nb_astronauts
FROM
postgres.demo.astronauts a
JOIN sample.demo.missions m on m.detail LIKE format('%%%s%%', a.mission_title)
GROUP BY a.nationality, m.company_name
ORDER BY m.company_name, a.nationality;