In my first blog post, I explained how we use Starburst at Starburst to create resilient data pipelines and how I have been able to shift away from traditional enterprise data warehouses and their often fragile ETL pipelines.
In this post, I want to share lessons learned working with cloud data warehouses — where they hold organizations back and how a fresh approach to enterprise data can fulfill the promise of modern data analytics.
What is an EDW?
An enterprise data warehouse (EDW) is a centralized repository that integrates and stores data from various sources within an organization. It is designed to support business intelligence (BI) and analytics activities by providing a unified and consistent view of data across the enterprise.
What are the benefits of an enterprise data warehouse?
Data centralization has been the bread and butter of data engineering. We have been collectively building pipelines to centralize data for most of our careers. Businesses thrive on data, good and bad.
One of the biggest challenges is that data tends to evolve organically as a business grows. Data can exist in many different systems within the organization, in different regions, with different domain experts, and is often managed and maintained by different groups. The data can be anything (e.g., business data, historical data, real-time data, transactional data, operational data, unstructured data, etc.) and anywhere (e.g., apps, IoT devices, on-premises, cloud, in the US or France, etc.).
Providing business users with a single source of truth is idealistic. Sure, there is a shared point of reference, and this tends to be pipelined into a central repository. The goal is to enable analysts to build insights and to support a data-driven organization and decision-making process. Certainly, data warehousing architectures have historically been the right solution, first with on-premises providers like Teradata and then with cloud-based data warehouses like Snowflake.
Warehousing has worked, has been the standard, and has enabled businesses to become more data-driven. But the data that is used and available tends to hide the chaos, often called dark data, that exists below the surface.
Challenges of building and maintaining an enterprise data warehouse
Like most data engineers, I have seen first-hand how challenging it is to build, let alone maintain, an enterprise data warehouse. One post is not enough to cover all the challenges, so for now, I’ll focus on these three: unpredictable costs, data complexity, and access control.
1. Unpredictable enterprise data warehouse costs
Data warehouses are generally proprietary systems that bundle functionality, storage, and compute. With an on-premises solution, you are making a fixed investment in your infrastructure and then either accepting the constraints or spending money on idle capacity just in case you need it.
In my experience, building an on-premises data warehouse is a struggle between planning for scale and staying within budget. It usually means planning a year or two ahead, and I have never been prescient enough: we fell short on capacity whenever our on-premises warehouse took off.
Cloud data warehouse solutions offer more scalable compute as well as “unlimited” storage, whereas in an on-premises model the scale and storage you will need can be hard to know in advance. But vendor pricing is notoriously unpredictable, and predicting spend from month to month can be difficult. Did I spend too much on compute, on queries, or on storage? It can be a struggle to understand where to constrain costs.
If you happen to find a more affordable and predictable data warehouse platform, you still have the arduous and expensive task of justifying the migration, decoupling from your existing systems, and moving to the new platform. Migration costs alone can make vendor lock-in a very real possibility.
Many times, decoupling is not realistic.
“We evaluated Snowflake, but given the incredibly ad-hoc nature of our business it wouldn’t be cost effective. We would have to increase our cost by 10X to achieve the performance that Starburst offers us at a fraction of the cost.” — Richard Teachout, CTO, El Toro
2. Data complexity overwhelms centralized data warehouses
Even with the promise of cloud warehousing, data scalability can be limited. Storage capacity cannot always keep pace with data volumes or the velocities required; many times it takes more work at ingestion just to keep data and pipelines manageable and affordable.
Data warehouses also tend to struggle with scale when business demands change. Workloads are becoming more complex, especially as businesses continue to adopt artificial intelligence and machine learning. Large, complex transformation jobs are testing the performance limits of warehouses, both cloud and on-premises.
Data itself is also becoming more complicated: more diverse, more variant, more informative and detailed. Business decisions are no longer based only on structured, well-formed, schema-ready data. Warehouses were not really designed for semi-structured data such as logs, clickstreams, IoT, and other non-“normal” data streams.
Maintaining data quality in the face of this complexity can contribute to rising data warehousing costs. New data requires new pipelines, pipelines require testing, and the data itself requires testing, quality checks, and gates. Data projects are rarely straightforward and easy. Every pipeline requires a great amount of care and feeding to make sure the warehouse has quality data.
In a BCG survey, more than 50% of data leaders said architectural complexity is a significant pain point. As a result, many companies find themselves at a tipping point, at risk of drowning in a deluge of data, overburdened with complexity and costs.
3. Managing enterprise data warehouse access and control at scale
Access control is the third challenge. A main driver for centralizing data in a warehouse is to make data more accessible to business users. But how do you democratize access at scale and give business users at different levels the right amount of access to build the analytics they need?
And more importantly, how do you scale access while also complying with data privacy and sovereignty regulations? Access policies for each data set may change depending on who the user is, where they are, and whether the data access needs to comply with regulations in Texas, California, France, or anywhere else.
Access becomes a logistics nightmare. Warehousing enables some integration with existing infrastructure, and data engineers will often work with other teams, like a governance team, to build out access control that works with the warehouse model. But as we know with pipelines, they require a lot of maintenance. Securing data, controlling access, and providing data to end users can be a balance of give and take. And often the real access controls are left behind on the original systems as we pipeline the data to the warehouse, which can obscure the original intent of those controls.
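To make the "who, where, and which regulation" question concrete, here is a minimal sketch of an attribute-based access check. Everything in it is an assumption for illustration: the `User` and `Dataset` classes, the role names, and the `ALLOWED_LOCATIONS` table are hypothetical, and the rules do not represent any real regulation or any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str       # hypothetical roles, e.g. "analyst", "engineer", "intern"
    location: str   # where the user is, e.g. "US-TX", "US-CA", "FR"


@dataclass(frozen=True)
class Dataset:
    name: str
    jurisdiction: str             # where the data is regulated
    contains_personal_data: bool


# Hypothetical policy table (NOT real legal rules): for each jurisdiction,
# the user locations allowed to read personal data governed there.
ALLOWED_LOCATIONS = {
    "FR": {"FR"},
    "US-CA": {"US-CA", "US-TX"},
    "US-TX": {"US-CA", "US-TX"},
}

# Hypothetical set of roles trusted with personal data.
TRUSTED_ROLES = {"analyst", "engineer"}


def can_read(user: User, dataset: Dataset) -> bool:
    """Decide access from attributes of both the user and the data set."""
    if not dataset.contains_personal_data:
        return True  # non-personal data is open to everyone in this sketch
    if user.role not in TRUSTED_ROLES:
        return False
    # Unknown jurisdictions default to "nobody", i.e. deny by default.
    return user.location in ALLOWED_LOCATIONS.get(dataset.jurisdiction, set())


if __name__ == "__main__":
    eu_customers = Dataset("eu_customers", "FR", contains_personal_data=True)
    print(can_read(User("ana", "analyst", "FR"), eu_customers))
    print(can_read(User("bob", "analyst", "US-TX"), eu_customers))
```

Even this toy version shows why the problem grows quickly: every new region, role, or data set adds rows to the policy table, and the table has to stay in sync with wherever the data actually lives.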
These are by no means the only data warehouse challenges we face, but they are consistent across companies and industries.
Limitations of a data warehouse architecture
As data engineers, we spend a lot of time, effort, and resources resolving data and data warehouse challenges. It is easy to be lulled into believing that the data system just works, and that the challenges are negligible or can be resolved easily. My journey with Starburst has been an awakening to some of the challenges that are inherent in a centralized warehousing model for data.
Centralizing enterprise data is impossible because enterprise data is fundamentally decentralized.
Disparate data is everywhere and it is generated by everything. From manufacturing to HR, from domestic to overseas offices, data’s exponential growth is a constant force. Companies need to store data on servers, personal devices, cloud platforms, and SaaS applications. The list goes on, and we are often building systems to catch up with some data source that is already pushing data.
Often, large amounts of data never enter a data warehouse, or sit in a warehouse without being used or useful. This can happen for many reasons; in my experience, the data was either never asked for, or it was asked for and then forgotten. So, we’re constantly integrating new sources into the warehouse.
I’ve expressed the pain of adding a new source to a warehouse as: Was the data even that useful? How many different layers of systems and pipelines do we need to build, manage, or alter in order for us as data engineers to answer a new question that requires a different source of information?
Some data you cannot or should not centralize at all. There are always legacy systems, there may be legal or ethical constraints, and there are many other reasons why data cannot be warehoused.
Data warehouses cannot cope with the endless flow of raw data from real-time sources. You have to hope you make the right assumptions at the beginning of the process of building a pipeline, or when you sample and aggregate data. In many cases, compliance with data sovereignty regulations prevents you from moving or copying data at all.
When you’re running a centralized data architecture on top of an inherently decentralized information ecosystem, you get all the challenges that make data engineering so frustrating. Data warehousing requires planning, pipeline development, and constant maintenance. It is no surprise that many are seeking alternatives to the traditional data warehouse and pipeline models.
“We moved from a monolithic Snowflake approach to a decentralized approach with Starburst and Iceberg. Now we can skip the data warehouse step completely, and complete analytics on the data right where it sits.” – Lutz Künneke, Director of Engineering, BestSecret
Decentralized data with Starburst makes data more accessible
What I’ve learned with Starburst is that there is a better model for data than a central warehouse. Starburst enables you to build a data analytics infrastructure that reflects the nature and topology of your existing data infrastructure. It is an abstraction layer that allows you to bring together data sources that might exist across internal organizations, regions, business units, or even different on-premises or cloud storage platforms, as well as relational databases and data lakes.
Instead of moving data into a warehouse, the data remains in its original location with its original attributes and characteristics. With Starburst, we are able to have a single interface with a single source of truth and access for engineers, analysts, and data products. As a result of this federated architecture, the challenges we discussed earlier disappear.
Decentralization is more cost-efficient because it allows us to separate storage from compute. We tend to invest in storage, where growth is more predictable and easier to optimize. With storage and compute decoupled, rather than bundled in a single proprietary platform, Starburst lets you query, transform, and process big data more affordably on Amazon, Microsoft, or Google’s scalable cloud platforms. More than fifty enterprise-class connectors seamlessly unify data from different sources and eliminate the data warehouse’s pipeline development and maintenance costs.
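As a rough analogy for querying data where it sits, the sketch below uses Python's built-in `sqlite3` module as a stand-in. This is not Starburst and not how its connectors work; it only assumes two separate SQLite databases (a hypothetical orders system and a hypothetical CRM, with made-up tables and rows) and shows one SQL statement joining across both without copying either into a central store.

```python
import sqlite3


def build_federated_connection() -> sqlite3.Connection:
    """Open one connection that can see two separate databases."""
    # "main" plays the role of the (hypothetical) orders system.
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent in-memory database as the (hypothetical) CRM.
    conn.execute("ATTACH DATABASE ':memory:' AS crm")

    conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, region TEXT)")

    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)",
        [(1, 120.0), (2, 80.0), (1, 30.0)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "FR"), (2, "US")],
    )
    return conn


def revenue_by_region(conn: sqlite3.Connection) -> list:
    """Join across both databases in a single query; nothing is copied."""
    return conn.execute(
        """
        SELECT c.region, SUM(o.amount)
        FROM main.orders AS o
        JOIN crm.customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_region(build_federated_connection()))
```

The point of the analogy is the shape of the query: the analyst names each source by its own catalog (`main`, `crm`) and the engine resolves the join, rather than a pipeline first landing both tables in one warehouse.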
Using Starburst to build a federated architecture simplifies your data management system. Data volume and velocity are less of a challenge when historical and real-time data sets are just as accessible. Data products seamlessly integrate structured, unstructured, and semi-structured sources. Complex workloads take less effort to prepare and run.
My biggest revelation has been seeing how decentralization and abstraction transform data access. Starburst’s connectors automatically deal with each source’s unique take on SQL. Instead of learning a new dialect for different systems, analysts can use their existing tools and SQL skills to access any data source directly, requiring less reliance on a data engineering team.
Don’t get me wrong: as a data engineer, I take great pride in working with people and unlocking the data that decision-makers and business users need. I am not inclined to be a gatekeeper, and if being a data superhero means I also have to be a data gatekeeper for routine requests, I am not doing my job well. My goal is to build for data users and enable data-driven organizations. There is plenty of work to be done, and unblocking people efficiently helps me in the long run.
Democratizing access in this federated model isn’t the compliance nightmare you’d think it’d be. Starburst’s single point of access makes it a single point of access control where you can turn role-based and attribute-based rules into fine-grained controls at scale. Data access can be modeled to meet any legal compliance requirement: data is protected and access follows a least-privilege model.
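To illustrate what fine-grained control at a single access point can mean, here is a minimal sketch of row filtering plus column masking applied to query results. The `apply_policy` function, the role names, and the row fields are hypothetical and chosen only for this example; a real deployment would configure this in the platform's access control system rather than in application code.

```python
def apply_policy(rows, role, region):
    """Filter rows and mask columns according to the caller's attributes.

    Assumed rules for this sketch (not any product's defaults):
    - "admin" sees every row, unmasked.
    - every other role sees only rows from its own region,
      with the "email" column masked.
    """
    visible = []
    for row in rows:
        if role != "admin" and row["region"] != region:
            continue  # row-level filter: hide other regions
        out = dict(row)  # copy so the source rows are never modified
        if role != "admin":
            out["email"] = "***"  # column-level mask
        visible.append(out)
    return visible


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        {"region": "FR", "email": "a@example.fr", "total": 10},
        {"region": "US", "email": "b@example.com", "total": 20},
    ]
    print(apply_policy(rows, role="analyst", region="FR"))
```

Because every query passes through the same function, the rule lives in one place instead of being re-implemented in each pipeline, which is the property that makes least privilege practical at scale.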
Don’t rely on the data warehouse for everything. Decentralization is inevitable as you grow.
Working on data warehouses taught me lessons that are still with me today. But it wasn’t until working with my team at Starburst that I could put those lessons in perspective.
All the hard work it took to keep pipelines working and get projects across the finish line shared a root cause that I couldn’t see at the time. We were trying to force a centralized model on a fundamentally decentralized, and uncentralizable, system. Often my job would be maintenance, troubleshooting, and “hot-fixing”. I could be pulled into problem after problem and never get to be more strategic or build better data systems, because we (as data engineers) are busy keeping all our systems afloat.
Perhaps there is an overall emerging pattern I’ve observed that is even bigger than centralized data warehousing.
The way I see it, there is a parallel between software development, which decentralized its code and processes, and data engineering. Software went from a highly centralized process to a more node-based (peer-to-peer) one, with the internet and networked systems leading the way.
Bottom line: Decentralization is inevitable, and that is how I would approach data management today.
Now that I have a different perspective, I see decentralization as the better way to reach the goal data warehouses were supposed to achieve. Starburst’s federated data analytics lets you embrace the complexity of modern data, making data easier to manage, access, protect, and use to generate business insights.