Build and run scalable transformation pipelines using dbt Cloud and Starburst

April 27, 2023

Emma Tippet

Head of Product

Starburst

It’s undeniable that dbt has increasingly become an industry standard for managing data transformations. That’s why we are excited to announce that we’ve partnered with the dbt Labs team to make transforming your disparate data as simple as possible by adding Starburst as a native, out-of-the-box connection option in dbt Cloud.

Why dbt Cloud and Starburst?

Starburst is a data lake analytics platform that lets you discover, govern, and query your data, whether or not it resides in the lake. With the new dbt Cloud and Starburst integration, data teams can more efficiently build, test, and document data pipelines against multiple data sources in one central place.

In the rest of this article, we will discuss common use cases we see for dbt Cloud and Starburst, as well as the easiest way to get started with both tools.

Build a data lake reporting structure

As a data lake analytics platform provider, we spend a lot of time talking to data engineers who are responsible for extracting value from their company’s data lake. Unsurprisingly, the success of their data lake solution comes down to how well the data within the lake is organized.

Making the desired data available in the appropriate format is often time-consuming because of poor documentation or data discoverability issues. This is frequently compounded by data pipelines that quickly become slow or outdated and fall out of step with analytical needs, given the speed of business and the ever-growing backlog of requests.

At Starburst, we’ve been actively delivering features that will help teams avoid these pitfalls. Some examples include:

Global search for quick visibility into all connected data sets
Tags for adding business context to existing data sets
Query plans and cluster utilization charts so you can see the health of your workflows in real time

But the combination of Starburst and dbt Cloud unlocks a new level of efficiency for data teams performing data lake analytics. Now anyone on the data team can safely contribute to production-grade data pipelines with features like version control and generated documentation.

Below is what a full pipeline with Starburst and dbt Cloud might look like:

Use Starburst and dbt Cloud to extract data from a PostgreSQL database that holds customer information, and combine it with the JSON sales data landed directly in S3.
Integrate data from the landing zone into the structure zone using a combination of drop/create views and incremental loading.
Populate the consume layer with rollup tables to be queried by a variety of SQL client tools.
Consume the data via traditional BI tools, the Galaxy UI, or your other favorite tools.
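
The layering above implies a build order: landing tables first, then the structure zone, then the consume layer. dbt works this order out from the references between models. The sketch below imitates that idea with Python's standard graphlib module. The model names are hypothetical and chosen only for illustration, and this is not a description of how dbt is implemented internally.

```python
from graphlib import TopologicalSorter

# Hypothetical model names. Each key maps to the set of models it depends on.
PIPELINE = {
    "landing_customers": set(),   # customer data extracted from PostgreSQL
    "landing_sales": set(),       # JSON sales data landed in S3
    "structure_customer_sales": {"landing_customers", "landing_sales"},
    "consume_sales_rollup": {"structure_customer_sales"},
}


def build_order(graph):
    """Return model names so that every model comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(PIPELINE)
print(order)
```

The printed list places both landing models before structure_customer_sales, with consume_sales_rollup last, which mirrors the landing, structure, and consume zones described in the steps.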

Build transformations from federated data

While many organizations have goals to centralize as much data as possible in a single data lake or warehouse, the reality is that establishing that single source of data could take years. For others, centralization is an impossible goal, as new application data sources are added on an ongoing basis.

This raises an issue for data teams that want to leverage dbt’s workflow. Because dbt specializes in transforming data within a data warehouse, each project is typically connected to a single data store. Starburst Galaxy’s superpower, however, is the ability to federate queries orchestrated by dbt Cloud across multiple data sources from a single dbt repository.

This means that you can leverage Starburst as your data platform for dbt Cloud, giving you the ability to easily build data pipelines for multiple data sources from one central plane.
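
To make the federation idea concrete, here is a minimal sketch of the kind of SQL a single model could contain. Trino, the engine Starburst is built on, addresses tables as catalog.schema.table, so one statement can join tables that live in different sources. The catalog, schema, table, and column names below are assumptions invented for this example, and Python is used only to assemble and print the query text.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified table name in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql(customers: str, sales: str) -> str:
    """Build a SELECT that joins two tables living in different catalogs."""
    return (
        "SELECT c.customer_id, c.name, SUM(s.amount) AS total_sales\n"
        f"FROM {customers} AS c\n"
        f"JOIN {sales} AS s ON s.customer_id = c.customer_id\n"
        "GROUP BY c.customer_id, c.name"
    )


# Hypothetical catalogs: one backed by PostgreSQL, one by the S3 data lake.
customers = qualified("postgres_crm", "public", "customers")
sales = qualified("s3_lake", "landing", "sales")

sql = federated_join_sql(customers, sales)
print(sql)
```

In a dbt project, a statement like this would typically sit in a model's .sql file. The point of the sketch is only that both sides of the JOIN are reached through one connection, with the catalog prefix selecting the underlying source.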

Follow along with this series to learn how to build data pipelines using dbt and Starburst with data directly from your operational systems. The pipelines use a variety of sources including relational databases, NoSQL databases, and other systems. The resulting data is stored in the data lake using the open source Iceberg table format.

Get started with dbt and Starburst

Now that you’ve seen how to use dbt Cloud and Starburst to build data transformation pipelines, we encourage you to load your own sample or production data and dive into some of the more advanced functionality of dbt Cloud and Starburst.

The integration between dbt Cloud and Starburst Galaxy, Starburst’s fully managed cloud platform, means you can now get up and running within minutes by following three easy steps:

Create a new dbt Cloud project.
Select “Starburst” as the data platform; enter credentials and connect.
Write queries as normal, using SQL JOINs between data from multiple sources while Starburst intelligently determines where to send requests.

The above steps assume you have a Starburst Galaxy account. For a complete walkthrough of dbt Cloud and Starburst, follow the quickstart guide.
