In a previous life, I had the privilege of being part of a team that brought a data warehousing solution to my company. It’s a tale as old as time. Data was stashed and stored everywhere. We needed to build a bridge between where that data was and push it to where it was not, so we adopted a data warehouse.
And it worked well – until it didn’t. I’ll go into more details on my cloud data warehouse learnings in a future series, but for my first Starburst post, I wanted to start simple – how we are using Starburst at Starburst.
As the data engineering team, our goal is to provide a data analytics platform that enables self-service, provides clarity and insights, is trustworthy, and that quality and testing are built into the data for the end users (or systems). That’s a fancy way of saying we are obsessed with being data-driven and giving more people access to quality data.
As with any good data engineering platform, our first step was to build out a data curation lifecycle. For this initiative, we chose to use Starburst Galaxy paired with dbt. The rest of this blog will talk about our experience with Starburst Galaxy and dbt compared to the traditional warehousing model.
Designing the curation lifecycle
When architecturing the data curation lifecycle, my team wanted to be able to organize the data and the curation layers by Topic. Topic is the category or bucket of information that the data models represent. In this blog, Telemetry is our Topic, which means the the data curation lifecycle stages for Telemetry would become:
- Telemetry_Source – where data is landed
- Telemetry_Stage – where data is structured
- Telemetry_Curation – where data is served
Sometimes this is referred to as the medallion architecture where you have bronze (raw data), silver (cleaned data), and gold (curated data) tables. At Starburst, we like to use land, structure, and consume to communicate the same paradigms which perfectly pairs with the our Data Lifecycle source, stage, and Data Product (AKA curation)
Now that we have a methodology for categorizing data, the next step is building out the curations. But we want to create an efficient and scalable process that will allow us to iterate quickly, deploy, document, test, as well as provide the best possible data to the curation layer.
Ideally, we want to be able to deploy from our repositories on GIT, with a CI/CD (continuous integration, continuous deployment) strategy, and provide lineage, test results, data dictionaries and catalogs as part of this process. Our team goal was to build as much self-service as possible into the curation layer as possible, from a single source of truth (GIT).
The traditional warehousing approach
In my previous life, I would’ve built these pipelines by pulling the data from a source of information and then landing that data (either raw or after some in-stream transformations) into a large warehouse. Pretty straightforward. Then, I would’ve implemented a service or orchestration layer, to help manage those connections. Only at that point would I be able to start asking questions of data that would normally span domains of information.
The problem with this approach is that warehousing pipelines are fragile and prone to human error. There are a couple common breakpoints and inefficiencies I ran into over and over again.:
- Incorrect data structure assumptions: As the data engineer, I would need to make determinations on the structure of the data. Oftentimes, I chose to land this data as blobs or varchar and chose to figure out the rest downstream. Most of the time this worked and the assumptions I made were right but other times not, leading to a poor quality pipeline.
- Time spent on pipeline maintenance: Data pipelines require maintaining, updating, validating, and managing connectors. If anything breaks, you have to take the time to understand what the pipeline does, what transformations were performed, and what downstream systems it feeds. I often referred to this pipeline work as “data archeology”. The pipeline needed to be deciphered, the origins discovered, the transformations vetted and understood. For me, this was hard enough to do with my own pipelines, let alone deconstructing another data engineer’s work.
- Short-sighted naming conventions: In a traditional warehouse, the reality is data has to live within the same schema which means table and view names take on extra importance. Is a table always prefaced with “tbl_” or a view with “vx_”? We are constrained by our own ability to name things that are informative and forward compatible. This is a lot to manage, even without tracking lineage, writing documentation, creating data dictionaries. The organization of data to build our analytic models alone can be a quagmire.
This isn’t meant to be a list of complaints against warehousing pipelines. I really thought these problems were just part of the job. Not only do we work with data, we work with the entire ecosystem (and all its good or bad) as well.
Starburst Galaxy and dbt
After being introduced to Starburst, I realized that a lot of the day-to-day pipeline pains I had classified as inevitable could be fixed because of how Starburst operates.
Starburst Galaxy, Starburst’s fully managed data lake analytic platform, acts as a non-destructive, abstraction layer on top of your data sources. We can develop a strategy, hate it, and develop a new strategy for organization without a lengthy migration process. While optimized for data lake analytics, Galaxy works across any data source from data warehouses to operational databases to storage solutions as simple as Google sheets.
In our deployment of Starburst Galaxy, we arranged our telemetry data in our new data catalog “telemetry-catalog” which is connected to our AWS S3 data lake that contains all the curation stages either as S3 objects or external tables.
In order to move our lake data through the curation stages, we embraced dbt. With dbt, we are able to build the models, build the dependencies, create tests and document our models. Said differently, we are building our telemetry pipelines with dbt and using Galaxy to manage our connections. This means that instead of switching between many systems, we can manage both source and destination data through a single system – Galaxy.
Specifically, we use Galaxy’s federation capabilities (aka cross-schema joins) to create staged views – or materialized if necessary – from source information, meaning we are able to build the curation layer without creating any pipelines.
With dbt, we are able to connect to Galaxy and run all of our operations through Galaxy and define them with our dbt_project YAML. Managing the different objects, the transformations, and curation models through dbt and Galaxy means that we can also build out a complete lineage of the model, with comments and tests, and track the dependencies that each of our curation models are built from.
Here’s an example of what this looks like for our telemetry data:
As you might have figured out, I’m a big fan of dbt and Galaxy. After building our data models with dbt and Galaxy, we’ve seen the following benefits:
- Increased data quality: Testing is easy to define and build into the process with dbt. Tests are able to be run as part of our CI/CD pipelines, with results. And the artifacts that are generated by dbt allow data engineers to have a copy of the tests as complete SQL code. Quality is a feature of good curations.
- Domain-oriented approach: We are able to work within domains (or topics) instead of spending time trying to decipher the tools and processes. This allows us to bring a higher level of quality to the data, be more efficient with our workflows, and always be improving self-service with team members that are domain experts. We’ve found Galaxy’s data products and comment capabilities especially useful here.
- More time spent building vs. maintaining: Galaxy handles the integration for the models, so my team can spend more time building new models. With Galaxy, if the catalog exists and is assigned to the right permissions and clusters, the model is integrated within the data ecosystem. While dbt defines the model, Galaxy serves up the model.
- More agility and autonomy: Ultimately, data engineers want to move quickly, without being discouraged by many thousands of dependencies. Galaxy and dbt, with a good and clear strategy, allows the DE team to work as autonomously as possible and asynchronously. We are a global company after all.
All of the above makes it easier for my team to scale. Scalability is something I have struggled with in previous work, because data engineers can become specialists quickly. As mentioned previously, pipelines can become black boxes, we then spend our time understanding where and what needs to be done. This type of workload does not scale, it is impossible to scale this work … and I like to take vacations.
Galaxy manages the connections, but dbt really delivers a method to build a workflow that we can scale. We are able to treat many of our curation projects as a normal development lifecycle, and leverage the rigor from that workflow to build scalable solutions.
I am an imperfect engineer, and often fail at tasks. I like to tell my team that “I really try and do nothing that my future self will want to kill me for”.
The reality is that, like my previous data warehouse implementation, not every strategy will work. Instead, my goal is to create an environment where analytical models are constantly being improved upon and the work that drives those models can change when change is (inevitably) necessary.
Starburst Galaxy allows us to create, explore, understand and build pipelines, but more importantly it allows us to change and not destroy when we realize our approaches are not quite working.
With dbt and Galaxy, my role has evolved from maintaining many different systems, configurations, and SQL into managing a single dbt project and spending more time building new projects. Often I find myself having eureka moments with Starburst Galaxy and how breaking some of our old paradigms allows us, the data engineers, to build and grow a data driven culture within Starburst and everywhere else.
Try Starburst Galaxy today
The analytics platform for your data lake