I recently started a new role at Starburst. I have spent most of my adult life in the data and analytics industry, so I thought I would share why I made this move to Starburst in particular.
My frustration with the data and analytics industry is that while most organizations have strived to be data-driven for a very long time, most are still failing to reach that goal. In May of this year, IDC reported that 83% of CEOs want their organizations to be more data-driven. A 2021 report by NewVantage Partners highlighted that only 27% of firms had forged a data-driven culture, despite 65% of those same companies investing more than $50M in their big data and analytics efforts. That is a massive investment for such poor results. Clearly, we have a problem, and despite the many new and innovative products designed over the past several years to address pieces of it, the problem remains; I would argue it is getting worse.
A quote I like to use a lot is: “The definition of insanity is doing the same thing over and over again and expecting a different result.” That line describes the Big Data and Analytics space perfectly. Despite numerous innovations that help us scale to larger data volumes, query data faster, and analyze data in simpler and more advanced ways, companies are still struggling to be data-driven.
Why? The Data Pipeline is Broken
As Zhamak Dehghani writes in her new eBook, Data Mesh: Delivering Data-Driven Value at Scale, we are at an inflection point. We have to face the facts: our decades-old approach to data management just doesn’t work anymore. Businesses are too complex, and data volumes and sources are growing too fast, too large, and too varied, clogging the arteries and overloading the decades-old data pipeline process. Imagine a world with no pipelines standing between you and the data you need to share and gain insights from.
In today’s world, operational data is fed through a data pipeline, via fragile scripts and/or ETL jobs, to create analytical data. For the analytical process to work, we have to centralize that analytical data in monolithic data warehouses, marts, lakes, or operational data stores (ODSs). The problem is that 70% of a data engineer’s time is spent moving, copying, and ETL/ELTing the data.
As Zhamak says so well in her book, “The complexity debt of the sprawling data pipelines, duct-taped scripts implementing the ingestion and transformation logics, the large number of datasets – tables or files – with no clear architectural and organization modularity, and thousands of reports built on top of those datasets, keeps the team busy paying the interest of the debt instead of creating value.”
The Effects of a Broken Data Pipeline: Organizations cannot be data-driven and analytics vendors are stifled from assisting them
Businesses cannot quickly react to new challenges or opportunities. Analytics teams cannot quickly respond to new analytical requests. BI and analytics vendors cannot quickly provide value to their new customers. Business and data science teams cannot experiment with ideas and hypotheses without a funded business case to justify the time investment. The total cost of ownership remains high. And CEOs don’t believe their companies are data-driven.
The answer requires us to look at the problem differently. What if we could avoid the clogged data pipeline, much the way the Waze traffic app helps us steer clear of traffic and arrive at our destination quicker? That’s what we are doing at Starburst: no more copying, moving, or ETLing your data. Like Waze, we help you route around the congestion of the clogged data pipeline, so you arrive at your destination faster and more affordably.
Imagine leaving your data where it currently resides and avoiding most of the pipeline process altogether. Start analyzing information from the source, and from multiple sources, across multiple environments, across multiple platforms, on-prem, and in the cloud. Imagine doing all of this without sacrificing query performance, security, or governance. The benefits of an approach like this are numerous and address not only the business issues highlighted above but many more.
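Starburst’s engine operates at an entirely different scale, but the core idea of federation, joining data across separate sources in place rather than piping everything into one central store first, can be sketched with a toy example. The sketch below uses Python’s built-in sqlite3 and two hypothetical databases (an orders system and a CRM) as stand-ins for distinct data sources; all names and data are illustrative, not Starburst’s actual API.

```python
import os
import sqlite3
import tempfile

# Two independent "sources" (hypothetical stand-ins for, say, an
# operational RDBMS and a CRM system living on different platforms).
tmp = tempfile.mkdtemp()
orders_path = os.path.join(tmp, "orders.db")
customers_path = os.path.join(tmp, "customers.db")

con = sqlite3.connect(orders_path)
con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 99.0), (2, 11, 25.0), (3, 10, 14.0)])
con.commit()
con.close()

con = sqlite3.connect(customers_path)
con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "Acme"), (11, "Globex")])
con.commit()
con.close()

# A single point of access: attach both sources and join them where
# they live -- no ETL job copies either table into a central store.
con = sqlite3.connect(orders_path)
con.execute(f"ATTACH DATABASE '{customers_path}' AS crm")
rows = con.execute("""
    SELECT crm.customers.name, SUM(orders.total)
    FROM orders
    JOIN crm.customers ON orders.customer_id = crm.customers.id
    GROUP BY crm.customers.name
    ORDER BY crm.customers.name
""").fetchall()
print(rows)  # [('Acme', 113.0), ('Globex', 25.0)]
con.close()
```

The point of the analogy is the query shape: one engine resolves a join across catalogs that remain in their original locations, which is what a federated query layer does across warehouses, lakes, and operational systems.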
How? The cloud makes the seemingly impossible, possible
In recent years, we have seen the rise of modern cloud data warehousing systems that separate storage from compute. These architectures, made popular by companies like Snowflake, Google (BigQuery), and Amazon (Redshift), use cheap long-term storage and temporary, high-performance processing nodes to run queries on demand. Query processing no longer happens at the data layer; cloud networks are now fast and cheap enough that the data for each query can be brought across the network to separate processing nodes for execution.
While the enterprise data warehousing systems highlighted above are still monolithic and require complex data pipelines to centralize data before performing any analytical processing, they proved that you can quickly and efficiently bring large amounts of data across the network to a separate processing tier for query execution. This is a big deal! If you can physically separate the data storage tier from the query processing tier, then you no longer need to centralize all of your data in a high-cost, monolithic data warehousing system or a data lake.
Today you can have a single point of access, a Data Mesh, through which all the data across your enterprise can be shared, accessed, and managed, within or across organizations. This is what Starburst is doing at companies like FINRA, Comcast, VMware, and hundreds more.
This approach also benefits broader technical initiatives: it accelerates cloud and digital transformation efforts; dramatically improves data lake query response times; delivers unlimited scalability and unlimited concurrency; joins data across data sources; dramatically reduces data movement and the number of data copies; enables a distributed Data Mesh architecture, hybrid and multi-cloud deployments, and migration without disruption; supports global data federation while meeting data residency, sovereignty, and governance requirements; and enables cross-cloud analytics while minimizing egress costs and avoiding vendor lock-in.
Stop doing things the same way and expecting different results. To truly become data-driven, organizations must solve the data pipeline problem without the need to move or copy data in order to do analytics or share data. To do so, we must break free from our dependence on the primary bottleneck to scaling analytics, the centralized monolithic data warehouse, and utilize data where it lies.
The time for a different approach has come and I am happy to be a part of this (r)evolution!