Back in 2008, when I first started working with Hadoop as a graduate student in Yale's Computer Science Department, it was already being touted as the future of big data analytics. Today, the situation has changed dramatically. The Hadoop ecosystem underwent tremendous growth, but the big data storage and analytics market has mercilessly moved on.
Now the question for many large enterprises is what to do with their current Hadoop infrastructure, and how to prepare for the future. A recent Gartner report, “Choosing the Right Path When Exploring Hadoop’s Future”, offers a detailed analysis of the options, and I would encourage anyone struggling with these questions to download and read it. In this post, I’ll review how the Hadoop ecosystem has evolved, why it ultimately lost its long-term value, and how companies can prepare for this uncertain future.
The Evolving Hadoop Ecosystem
In the early days of Hadoop, the big Silicon Valley firms relied on open-source implementations, but in 2008 a handful of engineers from Google, Facebook, and Yahoo! launched Cloudera as a commercial version. MapR entered the market a year later with a more compelling story around enterprise security, and ended up signing a number of large financial services firms as clients. In 2011, a developer-focused, purely open-source variation led by Hadoop engineers from Yahoo! sprang up in the form of Hortonworks.
As a co-creator of the HadoopDB research project in 2009 and then a co-founder of Hadapt, the first SQL-on-Hadoop company, I got to see the Hadoop hype firsthand. The battle between Hadoop vendors was brutal at times. Besides price wars, they hosted separate Hadoop-centric conferences and built incompatible add-ons. A customer that chose Cloudera had little choice but to stay with Cloudera, even if a feature offered by another vendor seemed compelling. The same was true for customers of Hortonworks and MapR. In many cases, the technical and financial costs of migrating to a different platform would have been far too high. This situation was the very definition of vendor lock-in, despite the open-source foundation.
Then the cloud took over. Ironically, the Hadoop pioneers introduced the idea of the data lake – a place where you could store your data inexpensively and run analytics whenever you wanted. They simply picked the wrong lake. When Cloudera and its competitors launched and began taking money from investors, the cloud was more hype than substance. At that point, cloud vendors did not focus much on Hadoop, and Hadoop vendors couldn't yet build a business around cloud storage. But they had to start showing results and earnings, so they opted for on-prem infrastructure as the data lake. This worked really well at first. Once the cloud evolved, however, data lakes evolved with it.
The Rise of the Cloud
The cloud – or cloud-compatible, on-prem implementations – offered the cost-effective, flexible scale that Hadoop implementations couldn't deliver. With the cloud, you don't have to own physical machines or maintain a massive data center. The cloud separates storage and compute, allowing enterprises to pay only for the compute resources they're actually using. When you're not running analytics on your data, you're paying only for inexpensive object storage. Overall, it's a much cleaner architecture.
This brings us to today. HPE has acquired the assets of MapR, and Cloudera and Hortonworks underwent one of the stranger mergers in recent history, keeping half the leadership from one company and half from the other. After years of lofty private valuations, they had little choice but to join forces to survive in the public market. The cloud vendors also started offering Hadoop as a service, allowing you to run your MapReduce, Hive, and Spark jobs in their compute environment. This way, you can enjoy the familiar Hadoop experience without the infrastructure overhead – that is, if you still care for the Hadoop legacy at all.
Enterprises, meanwhile, want to continue extracting value from their data while preparing for the future. This is where Starburst has begun to play a valuable role. In some cases, our customers use Starburst to simultaneously query data in Hadoop and another data warehouse, such as Oracle or Teradata. Others simply want fast SQL over HDFS. Either way, this is what Starburst Presto was built for. By operating as an abstraction layer between end users and the data they wish to query, Starburst allows companies to continue accessing and extracting insights from their Hadoop data.
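As a sketch of what that federation looks like in practice: Presto addresses tables with three-part names (catalog.schema.table), so a single query can join data in HDFS with data in a relational warehouse. The catalog, schema, table, and column names below are hypothetical, purely for illustration:

```sql
-- Join clickstream data stored in HDFS (exposed via a Hive catalog)
-- with order records living in Oracle, all in one Presto query.
-- All names here are illustrative, not from a real deployment.
SELECT o.region,
       COUNT(DISTINCT c.session_id) AS sessions,
       SUM(o.total) AS revenue
FROM hive.web.clickstream AS c
JOIN oracle.sales.orders AS o
  ON c.user_id = o.user_id
GROUP BY o.region
ORDER BY revenue DESC;
```

Because the engine, not the end user, resolves where each table physically lives, the same query keeps working if the Hive-catalog data later moves from HDFS to cloud object storage.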
We’re also helping companies in all three phases of their Hadoop journey:
Companies that aren’t ready to embrace a complete shift to the cloud are moving toward an on-prem, hybrid cloud deployment. As these companies shift away from HDFS to object storage, Starburst ensures that end users enjoy the same fast performance.
We’re also seeing companies deploy Starburst as a way to maintain high-performance access to data while a complete cloud migration is underway. Starburst operates as a federated query engine, so it delivers the same performance and results regardless of where that data lives – in a Hadoop cluster, the cloud, or a mix of both.
Other Starburst customers are maintaining a significant on-prem footprint while growing their public or private cloud deployment. Again, though, this does not change Starburst’s impact or performance. No matter where your data lies, Starburst is built to perform.
The Future of the Hadoop Ecosystem
Our latest release includes several features that can prolong the lifespan of your Hadoop implementation, including fast access to data in Cloudera CDP 7; more secure access, with SQL standard authorization statements for our Apache Ranger integration; MapR support; and overall improved data lake performance.
Most of the companies we work with intend to shift away from Hadoop to the cloud. COVID-19 has certainly complicated those plans for some, as budget cuts have forced data and infrastructure teams to get more out of what they have today. In that case, too, Starburst has proven to be a valuable query engine, as it’s built to provide secure, cost-effective, high-performance access to your data no matter what.
These are interesting and exciting times in the data analytics industry, and I imagine this can be confusing, too, given all the tools on the market – proprietary and open source. I encourage you to read the Gartner report and please reach out if you have any questions about your own Hadoop options… or Starburst in general.