Building a SQL-based data pipeline with Trino & Starburst

September 26, 2023

Starburst Team

Building lakehouse with dbt and Trino

We recently posted the YouTube video series below. This is a part of the FREE, on-demand Starburst Academy course Exploring Data Pipelines.

If you’re a data engineer tasked with building and managing data pipelines, Starburst Galaxy enables you to build a data pipeline workflow using modern data lakes and SQL. This approach offers both simplicity and power. Work that might require a complex user-defined function (UDF) in Python on other systems can be accomplished with the accessibility and universality of SQL, alongside the ease and cost-effectiveness of the data lake.

Modern data lake architecture

In this video tutorial series, we walk you through the steps needed to set up a modern data lake. To do this, we will construct a three-part modern data lake (aka data lakehouse) architecture comprising the Land, Structure, and Consume layers using Starburst Galaxy and SQL. This architecture is rapidly becoming the standard for modern data lakes and lakehouses built on open table formats like Iceberg, Delta Lake, and Hudi.

Get your own Starburst Galaxy account

For this tutorial, we will be using the BlueBikes dataset, which is freely available and public. With the Starburst Galaxy free trial, you can follow along with each step in this tutorial and create your own Land, Structure, and Consume layers in your very own modern data lake.

Let’s get going!

1. Assessing the requirements

Let’s get started with the BlueBikes dataset. This first video will show you how to download the dataset and access it using your own Starburst Galaxy cluster.

2. Creating the land layer

Now that you’re up and running in Starburst Galaxy, it’s time to create the first layer of the three-part modern data lake architecture, the Land layer. This layer receives raw data from the source and serves as the basis for future transformations as the data moves through the next two layers.

3. Creating the structure layer

With the Land layer complete, it’s time to set up the second layer in the three-part modern data lake architecture, the Structure layer. This layer is built by transforming data from the Land layer, and all of that work can be accomplished using Starburst Galaxy and SQL. When complete, the Structure layer will become the new source of truth for the dataset.

4. Creating the consume layer

Now that you’ve constructed the Structure layer, you only have one step left: the Consume layer. This final layer makes the data available to queries and BI tools, and constructing it completes the three-part modern data lake architecture.
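To make the flow through the three layers concrete, here is a minimal sketch rather than the tutorial's actual SQL. It uses Python's built-in sqlite3 module as a stand-in for a Starburst Galaxy cluster so it can run anywhere. The table names, column names, and sample rows are all hypothetical and are not taken from the BlueBikes dataset or the videos.

```python
import sqlite3

# Assumption: sqlite3 stands in for a Trino / Starburst Galaxy cluster.
# All table names, column names, and rows below are hypothetical.
conn = sqlite3.connect(":memory:")

# Land layer: raw rows stored as received, with every column kept as text.
conn.execute(
    "CREATE TABLE land_trips (started_at TEXT, duration_sec TEXT, start_station TEXT)"
)
conn.executemany(
    "INSERT INTO land_trips VALUES (?, ?, ?)",
    [
        ("2023-09-01 08:00:00", "600", "South Station"),
        ("2023-09-01 09:30:00", "1200", "South Station"),
        ("2023-09-02 10:15:00", "", "Kendall Square"),  # missing duration
        ("2023-09-02 11:00:00", "300", "Kendall Square"),
    ],
)

# Structure layer: a typed, cleaned copy that becomes the source of truth.
conn.execute(
    """
    CREATE TABLE structure_trips AS
    SELECT started_at,
           CAST(duration_sec AS INTEGER) AS duration_sec,
           start_station
    FROM land_trips
    WHERE duration_sec <> ''
    """
)

# Consume layer: an aggregated view shaped for queries and BI tools.
conn.execute(
    """
    CREATE VIEW consume_station_summary AS
    SELECT start_station,
           COUNT(*) AS trip_count,
           AVG(duration_sec) AS avg_duration_sec
    FROM structure_trips
    GROUP BY start_station
    """
)

summary = conn.execute(
    "SELECT start_station, trip_count, avg_duration_sec "
    "FROM consume_station_summary ORDER BY start_station"
).fetchall()
print(summary)
```

On a real cluster, the same three statements would target catalogs and schemas in the data lake, and the table definitions would likely specify an open table format. The shape of the pipeline stays the same: raw in, cleaned and typed in the middle, aggregated for consumption at the end.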

5. Automation with Starburst and dbt

We’re not done yet! You’ve constructed all three layers needed for a modern data lake using Starburst Galaxy, but there’s another trick up our sleeves: automation. Starburst Galaxy lets you execute powerful data engineering workloads using SQL, and its integration with dbt Cloud lets you wrap up all of that work and run it automatically on a schedule. In real-world workflows, this is a powerful strategy for achieving greater efficiency in data engineering.

Excited? Learn how to automate with Starburst Galaxy and dbt Cloud.
