With every budgetary statement you get for running your cloud data warehouse, the natural reaction is to shake your head at how broken your total cost of ownership (TCO) calculations are. You budget for more storage. You budget for more compute. You budget for more — and more expensive — data engineers. But that budget never quite covers the growing demands analysts and data scientists place on your warehouse.
Certainly, cloud data warehouse costs are a problem, but they aren’t the full picture. The real problem is that you’ve fallen into the cloud data warehouse vendor’s trap, and getting out of its jaws isn’t meant to be easy. On the one hand, you’re caught by classic vendor lock-in: you’ve built your data architecture on a single vendor’s proprietary technologies, data formats, and APIs, so escaping to a different vendor requires an expensive, arduous migration to another ecosystem. On the other hand, you’re in the grip of the cloud data warehouse architecture’s inherent disadvantages, not least of which are limited capabilities and high prices.
An exit strategy that fails to account for these architectural weaknesses will not eliminate the risk of vendor lock-in — or its consequences. Onboarding new use cases into a cloud data warehouse is too slow to deliver the speed of change modern data-driven decision-making requires.
In this post, we’ll look at how cloud data warehouse vendors trap their customers in their walled gardens. We’ll also look at the limitations of both the on-premises and cloud warehouse models before discussing how federated data architectures can help you avoid cloud vendor lock-in.
What is vendor lock-in?
Vendor lock-in is how vendor policies and practices create enough friction to keep customers from switching to a competitor. By making it easy to build an analytics architecture around the vendor’s proprietary features and interfaces, vendors raise switching costs and the risk of disruption to their customers’ business processes.
TCO in analytics and cloud data warehouses
Equally important is defining the cost side of the equation. Total Cost of Ownership (TCO) in data analytics, and in the context of cloud data warehouses specifically, refers to the costs associated with creating, operating, and maintaining a warehouse over its entire lifecycle. This concept encompasses more than the initial expenses or ongoing operational costs. Factors that affect the overall expenditure include:
- Continuously increasing budgets for storage and compute resources as data volumes and processing grow.
- Costs associated with hiring and retaining data engineers, analysts, and data scientists.
- Over-reliance on a single cloud data warehouse, which makes escaping to a different vendor costly and complex.
- A limited data architecture that not only drives up costs but also constrains business agility and the ability to adapt to new use cases or technological advances.
“We evaluated Snowflake, but given the incredibly ad-hoc nature of our business it wouldn’t be cost effective. We would have to increase our cost by 10X to achieve the performance that Starburst offers us at a fraction of the cost.”
— Richard Teachout, CTO, El Toro
How vendors create data lock-in and increase costs
Imposing contract termination fees is one way to keep customers from leaving. Another is to charge high prices for exporting large data sets. While brute-force pricing policies are a factor, cloud data warehouse customers often build their own prisons.
Vendor pricing can have subtler effects that encourage lock-in. While the cost of exporting data may be prohibitive, customers pay much less to import data or move it within the warehouse. This incentive increases the volume of captured data, including metadata, catalogs, and backups.
Choosing a cloud data warehouse vendor commits a company to proprietary data formats, where the only way to access the data is through the warehouse engine itself. With few interoperability guarantees, data migration projects must ensure the company’s data survives the journey from one proprietary format to another.
As companies lock themselves into the cloud data warehouse and try to make it the central source of truth for the rest of the organization, the vendor’s APIs become entrenched in data products, applications, and workloads. Moving to another vendor requires updating all of these systems to the new vendor’s APIs. Not only that, engineers must open up every pipeline to redesign and test it against the new cloud data warehouse to preserve the integrity of extract, transform, and load (ETL) processes.
Finally, the vendor creates lock-in by becoming the way its customers work. People get familiar with the interfaces and the specific functions of the query language. Data teams develop expertise in that vendor’s technologies. People learn to work with what works and work around what doesn’t. This tacit knowledge becomes a barrier: changing to another vendor is not transparent and imposes an expensive learning curve that goes far beyond training. For example, vendors implement “standards” like SQL differently, so business intelligence analysts and data scientists must rewrite their queries for the new vendor’s implementation. In addition, data team staffing must be reassessed, since vendor-specific skills may not carry over to the new system.
Cloud data warehouse challenges
The many ways vendors create lock-in contribute to data analytics’ worsening TCO, but the core problem comes from the inherent challenges of data warehouse architectures.
Cloud data warehouses weren’t designed for the scale of change and diversity of modern enterprise data. To compensate for these limits, data teams create complex structures that connect cloud data warehouses with operational servers, data lakes, and other sources. Designed around the warehouse vendor’s platform, these pipelines become another kind of lock-in.
The cloud data warehouse model also encourages data duplication, which drives up storage costs and, more importantly, operational costs: people are needed to build and manage the transformation and duplication pipelines. Source data gets copied into the warehouse or into separate data lakes. Within the warehouse, different applications get their own versions of processed data. To get around performance and cost limits, large projects may need dedicated compute with still more data copies. Lineage and observability become ever more important as data is duplicated, and may require investment in third-party tools to manage.
Vendor lock-in prevents data-driven innovation
The most important cost of cloud data warehouse ownership never shows up in TCO calculations: the opportunity cost of the work you can’t do because your teams are busy getting data into the system or managing it once it’s there. The lock-in trap results in a loss of agility and discovery that prevents data-driven innovation.
Cloud data warehouses are great for answering routine questions about everyday business. Does an executive want a list of high-revenue customers? That’s a known question you can answer with known data. It’s all modeled and sitting in the warehouse, ready to be queried.
But questions like that don’t drive innovation. For that, you need to delve into the unknowns: questions nobody has asked before that need data nobody has used before. Revealing innovative insights requires an iterative process of exploration, discovery, and experimentation; waiting weeks or months for another team to onboard and shape data is a sure way to prevent innovation.
A workaround for vendor lock-in and rising costs
Federated data architectures streamline the integration of enterprise data sources. Whether you’ve adopted a hybrid cloud or multi-cloud strategy, federation turns on-premises, private cloud, and public cloud assets into a unified platform for discovering, accessing, and using data to generate innovative insights.
Rather than creating yet another centralized data repository, federation abstracts storage architectures to create a virtual access layer. Users can search for data throughout the enterprise in a single interface and access data from multiple sources with a single query.
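The single-query idea can be sketched in miniature with Python’s built-in sqlite3 module, using ATTACH to join two physically separate databases in one statement. This is only a toy analogy for what a federated engine like Starburst does across heterogeneous systems; the database files, tables, and names below are all hypothetical.

```python
import os
import sqlite3
import tempfile

# Two separate "sources" on disk: a sales database and a CRM database.
workdir = tempfile.mkdtemp()
sales_path = os.path.join(workdir, "sales.db")
crm_path = os.path.join(workdir, "crm.db")

sales = sqlite3.connect(sales_path)
sales.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
sales.executemany("INSERT INTO orders VALUES (?, ?)",
                  [(1, 120.0), (2, 80.0), (1, 50.0)])
sales.commit()
sales.close()

crm = sqlite3.connect(crm_path)
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
crm.commit()
crm.close()

# The "federated" part: attach both sources to one session and join
# across them with a single SQL statement, without copying the data
# into a central store first.
conn = sqlite3.connect(":memory:")
conn.execute(f"ATTACH DATABASE '{sales_path}' AS sales")
conn.execute(f"ATTACH DATABASE '{crm_path}' AS crm")
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS revenue
    FROM sales.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY revenue DESC
""").fetchall()
conn.close()

print(rows)  # [('Acme', 170.0), ('Globex', 80.0)]
```

In a real federated architecture, the engine plays the role of that in-memory session: sources stay where they are, and the virtual access layer resolves names and pushes work down to each system.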
This vendor-agnostic emphasis on open standards simplifies data management, minimizes vendor dependency, enhances data portability, and lets end users exploit the data processing engine of their choice.
“One of my favorite Starburst features is federated data. The ability to tap into multiple data sources from one point of access is huge.”
– Sachin Menon, Sr Director of Data, Priceline
How Starburst helps with rising costs and vendor lock-in
Starburst was designed to process exabytes of data at the world’s largest internet companies. At the same time, Starburst gives customers complete control over how and where their data is stored, managed, and consumed, allowing them to optimize their analytics environment for the perfect balance of performance and cost as they grow.
Cost Benefits for data & analytics leaders
For data leaders who struggle to support an increasing number of analytics use cases amidst existing architectural complexity and rising data costs, Starburst reduces complexity & cost with a new approach to data access.
Our Analytics Platform connects to data where it lives, reducing the time & cost of data movement, and streamlines the end-to-end journey of data from source to pipeline to consumer. This approach reduces the dependency on data centralization, and is a better approach for managing & analyzing distributed data at the speed the business needs it.
For analytics leaders who struggle to support more use cases while dealing with time delays and cost constraints, Starburst opens up data accessibility at a fraction of the cost of cloud data warehouses.
“The decision to deploy Starburst Enterprise was made simpler because it has proven to be a reliable, fast, and stable query engine for S3 data lakes.”
— Alberto Miorin, Engineering Lead, Zalando
Starburst federates data sources to end vendor lock-in
Starburst’s modern data lake analytics platform creates a single point of access to your data no matter where it lives. By decoupling storage from compute, Starburst breaks the architectural limitations of cloud data warehouses and delivers the scalability, efficiency, and performance demanded by today’s data analytics requirements.
With Starburst, escaping vendor lock-in becomes easier.
You can adopt a multi-cloud approach that balances cloud service providers. Starburst runs on any combination of Amazon AWS, Microsoft Azure, and Google Cloud. Moreover, Starburst offers connectors to more than fifty enterprise-class data sources, including:
- Data lakes/object storage: Cloudera, Delta Lake, Iceberg, MinIO
- Data warehouses and relational databases: Amazon Redshift, ClickHouse, JDBC, MySQL, Oracle, PostgreSQL, Snowflake, Teradata, and more.
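In Trino, the open-source engine at the core of Starburst, each such source is typically exposed as a catalog defined by a small properties file. The sketch below follows Trino’s documented convention; the hostname, database, and credentials are placeholders, not a working configuration.

```properties
# etc/catalog/crm.properties -- illustrative only; values are placeholders
connector.name=postgresql
connection-url=jdbc:postgresql://pg.example.com:5432/crm
connection-user=analytics
connection-password=********
```

Once the catalog is registered, its tables become addressable alongside every other source (for example as `crm.public.customers`) without moving any data.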
To meet the needs of increasingly complex cloud computing environments, Starburst introduced Gravity, a universal access, discovery, and governance layer that optimizes multi-cloud approaches to analytics.
Gravity’s access controls let you use granular role-based and attribute-based rules to apply consistent policies to all data sources while addressing demands for security, confidentiality, and privacy.
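As a rough illustration of what rule-based policies look like, open-source Trino supports file-based access control with rules such as the following. This is modeled on Trino’s documented `rules.json` format, not on Gravity’s own policy interface, and the user pattern and schema name are hypothetical.

```json
{
  "tables": [
    {
      "user": "analyst_.*",
      "schema": "finance",
      "privileges": ["SELECT"]
    }
  ]
}
```

The same idea scales up in a governance layer: one consistently enforced rule set, applied across every connected source rather than configured per system.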
Universal search lets users explore data sources across all of their domains to speed discovery and shorten time to insight. Subject to access rules, users can search data products, catalogs, schemas, tables, views, and columns in any source, no matter which cloud services host the data.
The fact that your data remains distributed among different providers is completely transparent to your users, so switching data storage is not disruptive. When giving access to data during a migration project, your admins simply change the connector location. Users won’t notice that anything has changed.
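Concretely, repointing a catalog during a migration can be as simple as editing its connection settings while keeping the catalog name stable, so queries that reference it keep working. The Trino-style properties below are illustrative, and the hostnames are placeholders.

```properties
# etc/catalog/warehouse.properties -- before migration
connector.name=postgresql
connection-url=jdbc:postgresql://old-warehouse.example.com:5432/analytics

# etc/catalog/warehouse.properties -- after migration (same catalog name)
connector.name=postgresql
connection-url=jdbc:postgresql://new-warehouse.example.com:5432/analytics
```

Because users query `warehouse.schema.table` rather than a vendor-specific endpoint, the swap happens behind the access layer.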
Gravity also breaks data catalogs free from vendors’ lock-in practices. Automatically cataloging metadata from over twenty sources, Gravity provides a universal access point for search and discovery.
Using Starburst to federate your company’s disparate data sources helps you avoid cloud vendor lock-in and removes the limits cloud data warehouses place on your analytics efforts. You can optimize storage and compute independently, handle data management at scale, and create a more cost-effective data architecture.
“In terms of technologies, we have SQL Server, AWS React Components, Apache Superset, and a whole load of Azure. We have a wide range from quite legacy thick design all the way up to modern responsive apps on mobile. And Starburst integrates well with all of them. Some of our customers have had some real success using Tableau with Starburst.”
– Richard Jarvis, CTO, EMIS Health