We recently posted the YouTube video series below. It is part of the free, on-demand Starburst Academy course Exploring Data Pipelines.
If you’re a data engineer tasked with building and managing data pipelines, Starburst Galaxy enables you to build a data pipeline workflow using modern data lakes and SQL. This approach offers both simplicity and power. What might have required a complex user-defined function (UDF) in Python on other systems can be accomplished with the accessibility and universality of SQL, alongside the ease and cost-effectiveness of the data lake.
Modern data lake architecture
In this video tutorial series, Starburst Academy’s Lester Martin walks you through the steps needed to set up a modern data lake. To do this, we will construct a three-part modern data lake (aka data lakehouse) architecture comprising the Land, Structure, and Consume layers using Starburst Galaxy and SQL. This architecture is rapidly emerging as the new standard for modern data lakes and lakehouses built around open table formats like Iceberg, Delta Lake, and Hudi.
Get your own Starburst Galaxy account
For this tutorial we will be using the BlueBikes dataset, which is public and freely available. With the Starburst Galaxy free trial, you can follow along with each step in this tutorial and create your own Land, Structure, and Consume layers in your very own modern data lake.
Let’s get going!
1. Assessing the requirements
Let’s get started with the BlueBikes dataset. This first video will show you how to download the dataset and access it using your own Starburst Galaxy cluster.
2. Creating the land layer
Now that you’re up and running in Starburst Galaxy, it’s time to create the first layer of the three-part modern data lake architecture, the Land layer. This layer receives raw data from the source and serves as the basis for future transformations as the data moves through the next two layers.
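To make the idea of a Land layer concrete, here is a minimal sketch in Python. It is not the method shown in the video: an in-memory SQLite database stands in for Starburst Galaxy, and the table name land_trips, the column names, and the sample rows are all hypothetical. The point it illustrates is that the Land layer stores source rows unchanged, including messy ones.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; the real BlueBikes files have different, richer columns.
RAW_CSV = """trip_duration,start_time,start_station
350,2023-01-01 08:00:00,Station A
,2023-01-01 09:30:00,Station B
1200,2023-01-02 10:15:00,Station A
"""


def load_land(conn, raw_csv):
    """Land layer: store every source row as-is, with all columns kept as text."""
    conn.execute(
        "CREATE TABLE land_trips ("
        "trip_duration TEXT, start_time TEXT, start_station TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    conn.executemany(
        "INSERT INTO land_trips "
        "VALUES (:trip_duration, :start_time, :start_station)",
        rows,
    )
    return len(rows)


# SQLite is only a stand-in here so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
loaded = load_land(conn, RAW_CSV)
print(f"Landed {loaded} raw rows")
```

Nothing is cast, trimmed, or filtered at this stage. The row with a missing duration is kept so that later layers can decide how to handle it.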
3. Creating the structure layer
With the Land layer complete, it’s time to set up the second layer in the three-part modern data lake architecture, the Structure layer. This layer is built by transforming the data in the Land layer, and all of that work can be accomplished using Starburst Galaxy and SQL. When complete, the Structure layer will become the new source of truth for the dataset.
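The kind of transformation a Structure layer performs can be sketched as follows. This is an illustration under assumptions, not the SQL used in the video: SQLite again stands in for Starburst Galaxy, and the tables land_trips and structure_trips, their columns, and the cleaning rules (drop rows with no duration, cast durations to integers, trim station names) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical Land table: raw text columns, including one incomplete row.
conn.execute(
    "CREATE TABLE land_trips ("
    "trip_duration TEXT, start_time TEXT, start_station TEXT)"
)
conn.executemany(
    "INSERT INTO land_trips VALUES (?, ?, ?)",
    [
        ("350", "2023-01-01 08:00:00", " Station A "),
        ("", "2023-01-01 09:30:00", "Station B"),
        ("1200", "2023-01-02 10:15:00", "Station A"),
    ],
)

# Structure layer: typed, cleaned data derived from the Land layer in plain SQL.
conn.execute(
    """
    CREATE TABLE structure_trips AS
    SELECT
        CAST(trip_duration AS INTEGER) AS trip_duration_seconds,
        start_time,
        TRIM(start_station) AS start_station
    FROM land_trips
    WHERE trip_duration <> ''  -- assumed rule: skip rows with no duration
    """
)

structured = conn.execute(
    "SELECT trip_duration_seconds, start_station "
    "FROM structure_trips ORDER BY start_time"
).fetchall()
print(structured)
```

The Land table is left untouched, so the transformation can be rerun or revised later without going back to the source.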
4. Creating the consume layer
Now that you’ve constructed the Structure layer, you only have one step left: the Consume layer. This final layer makes the data available to queries and BI tools, and constructing it completes the three-part modern data lake architecture.
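As a rough illustration of a Consume layer, the sketch below exposes a query-ready summary on top of structured data. The assumptions are the same as before and are not taken from the video: SQLite stands in for Starburst Galaxy, and the table structure_trips, the view consume_station_summary, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical Structure table: already typed and cleaned.
conn.execute(
    "CREATE TABLE structure_trips ("
    "trip_duration_seconds INTEGER, start_station TEXT)"
)
conn.executemany(
    "INSERT INTO structure_trips VALUES (?, ?)",
    [(350, "Station A"), (1200, "Station A"), (600, "Station B")],
)

# Consume layer: a view shaped for the questions analysts and BI tools ask.
conn.execute(
    """
    CREATE VIEW consume_station_summary AS
    SELECT
        start_station,
        COUNT(*) AS trips,
        AVG(trip_duration_seconds) AS avg_duration_seconds
    FROM structure_trips
    GROUP BY start_station
    """
)

summary = conn.execute(
    "SELECT start_station, trips, avg_duration_seconds "
    "FROM consume_station_summary ORDER BY start_station"
).fetchall()
for station, trips, avg_seconds in summary:
    print(f"{station}: {trips} trips, {avg_seconds:.0f}s average")
```

A view is used here so the summary always reflects the current Structure data. A materialized table would be an equally reasonable choice when query speed matters more than freshness.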
5. Automation with Starburst and dbt
We’re not done yet! You’ve constructed all three layers needed for a modern data lake using Starburst Galaxy, but there’s another trick up our sleeves: automation. Starburst Galaxy lets you execute powerful data engineering workloads using SQL, and its integration with dbt Cloud lets you wrap up all of that work and run it automatically on a schedule. In real-world workflows, this is a powerful strategy for achieving greater efficiency in data engineering.
Excited? Learn how to automate with Starburst Galaxy and dbt Cloud.