Today marks a professional milestone for me as I join Starburst to help drive the data lake analytics revolution. I don’t post often, but this is one of those moments I can’t help marking. I’ve been watching Starburst closely, practically since it was founded, and it’s so rare that you get to join a story in progress that you happen to believe is a genuine revolution, and also the very revolution that you’d most like to join! I’m not kidding myself that it will be easy. The data warehousing space is one with massive incumbents and entrenched practices. But I’m a true believer that history is on our side.
You might ask, what makes you so confident? Why do you even care so much? The answer lies in my own history, working in and around data warehousing and analytics for many years, and most recently having the experience of putting a modern data lake in place at a data-driven organization.
First of all, looking back, my history in analytics and data warehousing taught me two big lessons: (1) the real action was being driven by data discovery, uncovering insights from ALL of the organization’s data, not just the highly curated data in the warehouse, and (2) the underlying architecture to power this will be massively parallel and highly scalable in the cloud.
On point (1), I look back to my time at Endeca when we were actively driving innovation in agile BI around the same time that Qlik and Tableau were first exploding. It was an amazing moment of BI moving out of the purview of IT in the walled garden, and into a world of tools for everyone from data engineers to citizen data scientists to create interesting and novel interactive data products, and for practically everyone to interact with those views to discover novel insights. And what made it most compelling was that some of the most interesting discovery was being powered not only by core data sets like CRM and ERP, but also by the full spectrum of data, including unstructured data, machine data, logs, and more. Data discovery was interesting not only because of who could participate, but perhaps even more so because of what they could analyze.
On point (2), it was at this same time that data warehouse and analytical databases like Netezza and Vertica were taking performance and scalability to a new level with technologies like columnar storage and more sophisticated MPP architectures. Suddenly the old MOLAP versus ROLAP debates seemed very quaint, and you could see a world without many hard limits on how much data could be the subject of discovery and analysis. And this all coincided with the migration of so many business workloads into the cloud, and the broad adoption of cloud managed infrastructure, with the scalability and flexibility that it brought.
In more recent years, I have been leading Engineering at Salsify, a pioneer in cloud-based Product Information Management (PIM), and a truly exceptional company. One of the many wonderful things about Salsify is how truly data-driven it is. Along our growth journey, as we evolved from a startup to a growth-stage company, we needed to develop more robust data infrastructure, and made the wise choice to pursue a data lake architecture. In many ways, this choice was just very pragmatic for a startup – you can start very lightweight, integrate data as needed for incremental use cases, and improve the tooling and governance as needed over time.
From humble beginnings supporting a couple of use cases, we eventually developed an amazingly comprehensive data lake spanning practically every data set in the company, including internal data, data from our production application, data crawled from the web and more, all totaling north of 1.7PB (an impressive stat for a mid-sized company). The lake contains thousands of tables, and has hundreds of active users querying it per month (perhaps a third of the company!). It supports use cases ranging from fully ad-hoc discovery and analysis, to a range of internal reporting, as well as customer-facing capabilities such as our Salsify Impact Assessments and in-app features. I believe by any practical measure of adoption or impact, the Salsify data lake has to be viewed as an amazing success story!
In a very real way I have been living the promise of the modern data lake architecture. And that promise was meaningfully driven by the lake architecture. Sure, we had a relatively blank slate as a startup, but we also had commensurate resource limitations. We couldn’t implement a big bang architectural change. It had to be an architecture where you could show incremental progress and invest as value was delivered.
The data lake architecture fully unlocks this by minimizing the cost to try out new data. With the right controls and governance, you can make it quite friction-free to try out new data sets. Some of these become mission critical and become highly managed for strong governance, reliability and performance. Some have moderate value and can live in a less managed state for more ad hoc use. And most importantly, some data sets may end up showing little or no value. The amazing thing about the lake is that it minimizes the cost of these failed experiments, which actually turns out to maximize experimentation, which then in turn maximizes the velocity of discovering the data sets that are valuable. And that of course drives adoption of the lake – which Salsify proved beyond a shadow of a doubt! Check in on the Salsify data-lake Slack channel on any day and you’ll find traffic about people from across the org putting in more and different data, or querying unforeseen combinations of data in innovative ways.
I believe that practically any organization can get to this state by adopting a data lake architecture as part of their overall data platform. And the great thing about the Starburst architecture is how it can get you there incrementally – no need to forklift move what you already have in place. Existing warehouses, databases, metastores, etc. can federate right in, all while new experimental data sets are availing of the lake model.
Today I’m excited to embark on the journey of helping Starburst to make this future a reality. Again, I don’t think it will be easy. Data warehousing and even the data lake part of the space is highly competitive with big entrenched competitors. But I believe Starburst has the perfect scope of focus to create the winning architecture in the data lake space. And focus matters a lot. This space is so big and so important, I believe it will be driven by a company that sets making the data lake architecture a reality as its sole focus, not a side mission. Starburst is that company, and I hope we help bring a data lake to life in your world soon!