If you read my previous blog post, you’ll know that I’ve had a long career in data engineering – 20 years, maybe more. The vast majority of work that I have done can be summed up as moving data from source A to target B, in order to ready it for analytics. I know that this might be a vast oversimplification, but the root of everything I have done in the past can be reduced to that simple process.
I find myself thinking about this more often because working at Starburst has changed how I think about my work and my role. Starburst offers a shift – a shift left toward a leaner model of delivering data for analytics: bringing the compute to the data, instead of the data to the compute.
The lean model is well known in the field of manufacturing, and I find it an interesting parallel to Starburst's approach to data engineering. The lean principle defines waste as anything that doesn't add value the customer is willing to pay for.
Lean Manufacturing vs. Lean Data Engineering
Before lean manufacturing, manufacturers would:
- Build a warehouse – hoping that it is the right size and shape to allow them to scale their production
- Move raw materials into the warehouse – requiring lots of money up front in order to build up a reasonable amount of stock
- Build the “product” from the raw materials based off of forecasting – fulfilling quotas and sales projections instead of actual orders
All of this is done, BEFORE the first order. (We did understand what the customer wanted, ahead of time, correct?)
Lean manufacturing changed the world. The new process was driven by demand to minimize waste, meaning manufacturers would:
- Find a scalable warehouse solution – one that’s small but can grow with you
- “Pull” raw materials based on demand – keeping upfront costs down instead of stockpiling
- Build products only when an order is placed – responding to demand instead of predicting demand
This new approach made manufacturers quicker and more efficient. They could now change their product and their supply based on shifting consumer preferences or fluctuating demand.
The world of data engineering is very similar to traditional manufacturing. Data engineers typically:
- Buy a costly data warehousing solution
- Pipeline as much data as possible into the warehouse, based off of assumptions about the data and how it will be used (good and bad)
- Serve the warehouse data to end users – and hope it was what they needed
You’ll notice that in traditional data engineering, much like traditional manufacturing, customers are at the end of the process – after we transform those raw data materials into a product or service.
Starburst Galaxy takes the lean approach:
- Connect to data sources
- Land data ONLY when you need to transform or process that data
- Serve up prepared data to end users via data products
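The three steps above can be sketched in Trino SQL. As an illustration only – the catalog and schema names here (`postgres`, `datalake`) are hypothetical, not from the demo – the flow might look like this:

```sql
-- 1. Connect: query the operational source in place, no pipeline required.
SELECT * FROM postgres.public.orders LIMIT 10;

-- 2. Land data ONLY when it needs heavier transformation,
--    e.g. into a lakehouse table.
CREATE TABLE datalake.analytics.orders_clean AS
SELECT
    order_id,
    customer_id,
    CAST(ordered_at AS timestamp) AS ordered_at
FROM postgres.public.orders;

-- 3. Serve: expose the curated table to end users as a data product.
SELECT customer_id, count(*) AS order_count
FROM datalake.analytics.orders_clean
GROUP BY customer_id;
```

The point is the ordering: the source is queried first, and landing happens only for the slice of data that actually needs it.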
Putting Concept to Practice
Let’s take a moment to put this into practice with an example: let’s revisit how to set up Trino for dbt.
In the demo provided by Michiel De Smet and Przemek Denkiewicz, they build a 360 view of the Jaffle Shop customer using dbt and Starburst Galaxy. If we think about how we would have managed this in a more traditional manner, these are the steps we as data engineers would have had to perform:
- Land Your Relational Data: We would need to move the relational data into the warehouse, making sure it lands in the shape analytics requires. Sometimes we need to cast fields as a timestamp, boolean, or integer, because we have to understand the data as we move it to the warehouse. Sometimes we miss the mark; more often (as data engineers) we are exactly right, but it is that “sometimes” that scares me a little.
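As a sketch of the kind of casting decision made in flight – with hypothetical table and column names, since the real pipeline would depend on the source – a traditional staging load might look like:

```sql
-- Hypothetical staging step: types must be decided BEFORE the data lands.
INSERT INTO warehouse.staging.customers
SELECT
    CAST(id AS integer)           AS customer_id,
    CAST(created_at AS timestamp) AS created_at,  -- is it really a timestamp?
    CAST(is_active AS boolean)    AS is_active    -- '1'/'0'? 't'/'f'? 'yes'/'no'?
FROM source.raw.customers;
```

Every one of those casts is a bet on how the data will be used downstream, made before any user has queried it.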
- Land Your Semi-Structured Data: We landed the data, and it was relational – a blessing when moving data to the warehouse. But now we have to manage JSON, the enemy of structured data. We have this click data, and we need to manage it too. In this specific case the JSON data is not too unwieldy, but that is not true of all semi-structured data. When data engineers get to semi-structured data, we have to make a lot of decisions about it. In a normal world, we share the grief of these decisions with a group of peers – so at least we have that going for us. But we still have to make a decision, and we still have to rationalize how we relate the data to the customer 360 story.
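For illustration, flattening click-event JSON with Trino's JSON functions might look like this – the field names and table are invented, not those of the actual demo:

```sql
-- Hypothetical click events stored as raw JSON strings in a "payload" column.
SELECT
    json_extract_scalar(payload, '$.user_id') AS user_id,
    json_extract_scalar(payload, '$.page')    AS page,
    CAST(json_extract_scalar(payload, '$.ts') AS timestamp) AS clicked_at
FROM datalake.raw.click_events;
```

Which paths to extract, and how deeply to flatten nested structures, are exactly the judgment calls described above.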
- Quality Check: We missed something; some assumption was wrong (you will hear a lot of this from me), and now we have to rebuild the models we assumed were right from that terrible JSON data and rectify the assumptions we (as a community) have made.
The above steps were my life for 20 years. Repeat and repeat.
However, with Starburst Galaxy, there’s a new path. Instead of warehousing the data, fixing the data, warehousing the new data, and then fixing the data again, Michiel De Smet and Przemek Denkiewicz show us how to connect directly to the original source of information, pull what we need to understand, and deal with misconceptions at that point of understanding. This allows us to respond to changes faster and connect the data to the business quickly and effectively. Responding to problems becomes a story not of reconfiguring the warehouse, but of reconfiguring a connector.
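In dbt terms, a model simply selects from a source that Trino resolves against the live system. A minimal sketch – the source and column names here are hypothetical, loosely following the Jaffle Shop convention:

```sql
-- models/stg_customers.sql (hypothetical dbt model)
-- dbt compiles this into a query that Trino runs directly against the
-- connected source catalog -- no copy of the data has to be landed first.
select
    id as customer_id,
    first_name,
    last_name
from {{ source('jaffle_shop', 'customers') }}
```

If an assumption about the source turns out to be wrong, the fix is a change to this model (or to the connector configuration), not a rebuild of warehouse loads.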
Because we are pulling data on demand, instead of pushing it to a warehouse, we are able to respond to the changing topology of requests from our data customers. As a data engineer, I am focused on building the right analytic models instead of maintaining pipelines. I do enjoy maintaining a positive relationship with pipelines, but these relationships can become complicated quickly. And focusing on pipelines and warehousing data takes time and energy away from providing those curated models.
So should you adopt the lean approach to data engineering?
On my team here at Starburst, I’ve seen that adopting the lean approach to data engineering can offer a range of significant benefits. Embracing the lean principles has led to streamlined workflows, reduced costs, quicker response times, and enhanced overall data quality.
It has required me to change how I think about data engineering and how we deliver value to our customers: away from warehousing data and toward pulling data on demand.