
Build an open data lake architecture with dbt Cloud and Starburst Galaxy

Last Updated: November 8, 2023

Three or so moons ago, I had the pleasure of demonstrating the notorious pair that is Starburst and dbt. We posted the dbt Core and Starburst Galaxy tutorial, and it was awesome to see people get excited about using the two technologies together to make life even better for the golden trio of data engineers, analytics engineers, and data analysts. Starburst gives you the ability to join data from different data stores, such as data warehouses, data lakes, and object storage, which means you can leverage all the goodness of dbt within one single repository, making it more efficient to build, test, and document data pipelines.

Last week, that duo became even more iconic. The wait is over:

Starburst and dbt Labs announced that the dbt Cloud integration is now available for Starburst Galaxy, Starburst Enterprise, and Trino. So, there’s no better time to revisit that tutorial and publish another one using both dbt Cloud and Starburst Galaxy.

Why does this matter? 

Both dbt and Starburst strongly believe in the sanctity of open source software, as demonstrated by their commitment and investment in their respective open source communities. If your business is vibing well with open source technologies, that’s awesome. However, open source may not be the best choice for every organization. While some businesses may not have the resources to operate open source software themselves, that doesn’t mean they shouldn’t be able to implement top-tier technologies in their own data architecture. This new integration empowers data practitioners to make the architectural choices that are best for their ecosystem, and gives them the flexibility to decide where they want to fall on the build-versus-buy spectrum.

Open data lake architecture

Many use Starburst to federate multiple data sources together, so you can work with all of your data in a single dbt repository without doing any migration work first. The other popular use case is to use Starburst and dbt together to build out an open data lake architecture, which is what this tutorial showcases. In this scenario, Starburst acts as the data lake analytics platform, combining data from multiple sources within the data lake, while dbt transforms that data to make it easier to consume for analytics, reporting, and any other business need.
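
Federated queries in Starburst address data as catalog.schema.table, so a single statement can join tables that live in entirely different systems. Here’s a minimal sketch of the idea; the catalog, schema, and table names are hypothetical placeholders:

-- Hypothetical catalogs: aws_src (S3 object storage) and snowflake_src (Snowflake).
-- One query joins data from two different systems, no migration required.
SELECT c.report_date, c.cases, v.people_vaccinated
FROM aws_src.covid.case_counts AS c
JOIN snowflake_src.covid.vaccinations AS v
  ON c.report_date = v.report_date;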

In this example, we’re going to build a reporting structure in our data lake by using Starburst Galaxy to read the data in the land layer, then clean and optimize that data in the structure layer. The last step is to join and aggregate the data from all three sources together into a single table that is cleaned and ready to be utilized by a data consumer. 

Some dbt projects like to use Staging, Intermediate, and Aggregate as their reporting structure layers, so that’s what’s utilized in the tutorial. The mapping, followed by a sketch of the pattern, is below.

  • Land = Staging
  • Structure = Intermediate
  • Consume = Aggregate
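
In dbt terms, each layer is simply a model that selects from the layer below it via ref(). Here’s a minimal sketch of the pattern, with illustrative model and column names rather than the tutorial’s actual code:

-- models/staging/stg_case_counts.sql (illustrative)
-- Land/Staging: read the raw source as-is.
SELECT * FROM {{ source('aws_covid', 'case_counts') }}

-- models/intermediate/int_case_counts.sql (illustrative)
-- Structure/Intermediate: clean and type the staged data.
SELECT CAST(report_date AS date) AS report_date, nation, cases
FROM {{ ref('stg_case_counts') }}

-- models/aggregate/agg_nation.sql (illustrative)
-- Consume/Aggregate: roll the cleaned data up for consumers.
SELECT nation, SUM(cases) AS total_cases
FROM {{ ref('int_case_counts') }}
GROUP BY nation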

Where in the world is the data?

We have our three data sources, each of which can be sanity-checked with the queries shown after this list:

  • The AWS COVID-19 data lake – utilizing low-cost cloud object storage, this is where our confirmed case count for each day is stored.
  • The Snowflake public COVID-19 database – we have vaccination data in Snowflake that is used as lookup information to compare against our case counts.
  • The TPC-H standard datasets – our business has predetermined the regions and nations it wants to see through these tables, so this holds our location information.
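
Once the corresponding catalogs are configured in Starburst Galaxy, each source can be sanity-checked straight from the query editor before any dbt runs. A rough sketch; every name below is a placeholder except the built-in TPC-H tables:

-- Placeholder catalog/schema/table names; substitute your own.
SELECT * FROM aws_covid_src.public.case_counts LIMIT 10;   -- AWS COVID-19 data lake
SELECT * FROM snowflake_src.covid19.vaccinations LIMIT 10; -- Snowflake vaccination data
SELECT * FROM tpch.tiny.nation LIMIT 10;                   -- TPC-H location tables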

Here’s the lineage graph for the models you will create. Each source, highlighted in green and prefixed with the data location it originates from, will be built upon to create our three data lakehouse layers within S3 using Iceberg. You should notice an almost identical structure between the ideal architecture diagram displayed above and the resulting data lineage graph, which displays the models throughout each layer of the reporting structure. The final output results in aggregations at both the nation and region granularity levels. You can then get strict within Starburst Galaxy and only let your data consumers see the aggregations, using role-based and attribute-based access control.

What you’ll need

  • A Starburst Galaxy account. This is the easiest way to get up and running with Trino and see the power of Trino + dbt.
  • AWS account credentials. AWS acts as both a source and a target catalog in this example.
  • A dbt Cloud account. This is the easiest way to get up and running with dbt.
  • A Snowflake login. Snowflake offers a 30-day free account. Note: You can still run this tutorial without a Snowflake account; the Snowflake-related jobs will fail, so just skip over anything Snowflake related and, in the final tables, remove any logic for the vaccinated population. If you want that code, reach out to monicamiller@starburstdata.com.

The GitHub repository has all the models pre-built for you, so it should be super quick to play around once you get connected to dbt Cloud.

Try it for yourself

I’ve logged into Starburst Galaxy and completed all the necessary infrastructure setup steps, so I am now ready to connect to dbt Cloud. First things first, you need to set up a project.

Project name: dbt-galaxy

Choose a connection: Starburst

At step 3, Configure your environment, navigate back to the cluster you created in Starburst Galaxy. Click the Connection info button, and then select dbt as the client of choice. You should see connection information that looks something like this.

Download the information and bring it back to dbt Cloud. Enter the appropriate host and port from the connection info. Then enter your username for Starburst Galaxy with the role you want to use attached (shown as User in the connection info), and enter your Galaxy password.

Catalog: dbt_aws_tgt – name of the target AWS catalog in Galaxy

Schema: dbt_mmiller – default schema for your models; add an identifier for yourself

Target Name: default – not used in this scenario

Threads: 6 – default value

Next, test the connection to make sure everything is configured properly.

Create a fork of the tutorial repository, then select that fork as your repository in dbt Cloud. I’m connected to dbt Cloud through GitHub, so my repositories automatically popped up. Search for and select your fork.

Now that your project is set up, you should see the repository imported into dbt. You should be able to run everything without making any edits, but if you want to edit any of the models, create a new branch.

Run the following commands on your main branch in the command line below (note that dbt build runs and tests your models in a single step):

dbt deps
dbt run
dbt test
dbt build

Now all your models have completed, and the data pipelines have successfully transformed your data. Navigate back to Starburst Galaxy and secure access so that your consumers are only able to view the final aggregate tables, and only have access to select from them.

First, select from each table to validate that your output is indeed what you expect.

Run the following queries in the Starburst Galaxy query editor:

SELECT * FROM agg_nation;
SELECT * FROM agg_region;
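
Beyond eyeballing the rows, a couple of quick checks confirm that the models landed where dbt was configured to write them. This assumes the catalog and schema names entered during project setup:

-- Confirm the schema and tables exist in the target catalog.
SHOW TABLES FROM dbt_aws_tgt.dbt_mmiller;

-- Confirm the aggregate tables are non-empty.
SELECT count(*) FROM dbt_aws_tgt.dbt_mmiller.agg_nation;
SELECT count(*) FROM dbt_aws_tgt.dbt_mmiller.agg_region;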
    

Navigate to the Roles and privileges tab, then select Add role. Enter the information and create the new role.

Click within the newly created role, and navigate to the Privileges tab. Select Add privilege.

Add the table privilege information for both tables; an equivalent SQL sketch follows the two lists.

agg_nation

  • Modify privileges: Table
  • Catalog contains the tables: dbt_aws_tgt
  • Which schemas can this role access: dbt_mmiller (same one created in dbt)
  • Which tables can this role access: agg_nation
  • What can they do: Allow select from table

agg_region

  • Modify privileges: Table
  • Catalog contains the tables: dbt_aws_tgt
  • Which schemas can this role access: dbt_mmiller (same one created in dbt)
  • Which tables can this role access: agg_region
  • What can they do: Allow select from table
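
If you prefer SQL over clicking through the UI, Galaxy also supports granting privileges as statements. A sketch of the equivalent grants, assuming the role is named consume_layer_access as used below:

-- Run as a role that is allowed to manage privileges on these tables.
GRANT SELECT ON TABLE dbt_aws_tgt.dbt_mmiller.agg_nation TO ROLE consume_layer_access;
GRANT SELECT ON TABLE dbt_aws_tgt.dbt_mmiller.agg_region TO ROLE consume_layer_access;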

Now, switch to the consume_layer_access role in the top right corner.

Try running select statements and deletes to see what permissions are now available to your data consumers.

Run the following statements in the query editor:

SELECT * FROM agg_nation;

SELECT * FROM agg_region;

DELETE FROM agg_region WHERE region = 'ASIA';

DELETE FROM agg_nation WHERE nation = 'JAPAN';
      

The select statements should succeed, while the deletes should fail with an access denied error, confirming that your data consumers can only read the final aggregate tables.

Congrats! You’ve created an open data lake architecture, built a data pipeline, and secured proper access to the final tables using dbt Cloud and Starburst Galaxy. If you are interested in completing another tutorial, follow the Starburst Galaxy quickstart on dbt Cloud’s website.
