This is a crazy and slightly confusing time in the data architecture space. More and more companies are shifting toward data lakes, yet the traditional data warehouse continues to provide value, as it has for decades. Now, to add to that, we have the increasingly popular lakehouse concept, which could potentially combine the best of both worlds. In early February, I had the chance to host a fun debate on the data lake, the data warehouse, and the data lakehouse with proponents of each architecture at Datanova 2021, Starburst’s annual conference. Ultimately, we were trying to determine what the best architecture will be going forward.
Will one of these three concepts prevail?
Will each one carve out its own niche?
Or will the winner be something we haven’t yet imagined?
At the end of the discussion, we asked our audience to weigh in, and the results were surprisingly clear. Before I get to that, though, I’d like to pass on a few key insights from the discussion about what companies ultimately value from these solutions:
Openness
Our advocate for data lake architectures was Aaron Colcord, Senior Director, Data Analytics Engineering at Northwestern Mutual. He argued that one of the reasons data lakes were so appealing at the start is that they championed openness. We could tolerate some of the early technical shortcomings because of the openness and cost control we gained. The added advantage now, ten years later, is that the tools used in conjunction with data lakes have matured and expanded, unlocking all kinds of capabilities – without sacrificing openness.
Ease of Use
Traditional data warehouses are nothing if not mature. Greg Taylor, Managing Director at Slalom Consulting, noted that these tried-and-tested platforms are also valuable because you don’t need a whole new set of technical skills or training to work with them. They rely on common, familiar technology, and standard tool sets connect to them easily.
I wouldn’t argue with that, and Richard Jarvis, the Chief Analytics Officer at EMIS Health, seconded that idea later in our debate. But Richard, who argued for lakehouse architectures, also talked about how you can achieve this familiarity and simplicity via other means. His group has deployed tools like Starburst Enterprise to standardize on SQL, which allows them to empower a wider talent pool across their business, granting more users easy access to data stored in more platforms.
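Richard’s point about standardizing on SQL can be sketched with a toy example. This is not Starburst or Trino itself – plain SQLite stands in for a federated query engine, and every file, table, and column name here is hypothetical – but it shows the core idea: one SQL session spans two separate data stores and joins across them, the way a federated engine lets a wider talent pool query lake and warehouse data through a single familiar dialect.

```python
import os
import sqlite3
import tempfile

# Hypothetical stand-ins for two separate data platforms.
tmp = tempfile.mkdtemp()
lake_path = os.path.join(tmp, "lake.db")
warehouse_path = os.path.join(tmp, "warehouse.db")

# "Data lake": raw event records.
lake = sqlite3.connect(lake_path)
lake.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
lake.executemany("INSERT INTO events VALUES (?, ?)", [(1, "login"), (2, "click")])
lake.commit()
lake.close()

# "Data warehouse": a curated user dimension table.
wh = sqlite3.connect(warehouse_path)
wh.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
wh.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
wh.commit()
wh.close()

# One SQL session spans both stores -- the essence of the federation idea.
conn = sqlite3.connect(lake_path)
conn.execute("ATTACH DATABASE ? AS wh", (warehouse_path,))
rows = conn.execute(
    "SELECT u.name, e.action FROM events e "
    "JOIN wh.users u ON e.user_id = u.user_id ORDER BY u.name"
).fetchall()
conn.close()
print(rows)  # [('Ada', 'login'), ('Grace', 'click')]
```

The analyst writing that final `JOIN` never needs to know that the two tables live in different files – which is exactly the familiarity-and-simplicity argument Richard made for SQL as the common interface.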
Flexibility & Scalability
We also talked about the importance of scale – an essential component given the explosion of data. Richard explained that his team built their cloud analytics platform on a lakehouse architecture. During the pandemic, EMIS has done some incredible work using data to help researchers understand the spread of COVID, and how to positively impact health outcomes and improve vaccine rollout strategies. EMIS needed to grant secure access to this data to a wide range of users with very different access patterns. They required something very flexible and scalable – and the lakehouse architecture delivered.
Ownership
Vendor lock-in is a common sticking point for critics of the traditional data warehouse. In the era of ORC, Parquet, and other open data formats, companies want to own their data, not have it locked into a proprietary format. Aaron pointed out that this is a downside of the lakehouse as well, since it can trap you in the same vendor lock-in scenario, which ultimately limits your ability to explore different tools. The counterpoint, he noted, is that data lakes offer so many competing solutions that it can be very difficult to find the right one.
Data Virtualization
Another thread we kept coming back to was the role of data virtualization, and how the line between some of these technologies is getting blurry. Greg talked about how data virtualization solutions allow you to utilize warehouses and data lakes together – giving you the power to curate data without having to move or transform it. And Richard at EMIS described how data virtualization helped him create a best-of-all-worlds scenario in which data scientists who could run with raw data could get to work immediately, while those who needed curated data could wait a few hours to analyze it.
The Results: Blurred Lines, Clear Choices
My final question, after all the back and forth, was what this landscape will look like in three to five years. A convergence is already happening in the form of the lakehouse, but that doesn’t mean the days of traditional warehouses or even standard data lakes are numbered. You don’t just throw away the technology and skills we’ve built up and advanced over so many years. I don’t see any of these technologies becoming obsolete in the near term, but there’s no question the pace of innovation is accelerating, with more choices for the right design pattern for the right use case and cost. Data fabrics (or data meshes, as some refer to them) are also gaining ground in these architecture choices.
Now to the poll results! Did I make you wait too long? I hope not. There was an interesting upswing between the polls we took before and after the panel. When we asked our audience which architecture they expected to be the most popular in three years, the breakdown was clear: only 14% voted for the traditional data warehouse, and just 18% opted for the data lake. Despite the concerns over vendor lock-in, a surprising 68% believed in the lakehouse. For a relatively new technology paradigm, that’s impressive. None of us can say with certainty how architectures will evolve, but I imagine we can all agree that these next few years will be interesting.
ThoughtSpot has seen the rise of the data lakehouse in the market, and based on demand from some of our top customers, I’m excited to see our integration with Starburst Enterprise, based on open source Trino (formerly PrestoSQL), go live this month. Yep, that means Search and AI-driven insights on your Starburst data lakehouse – without the need to move any data!