When Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang created Presto at Facebook in 2012, they were tasked with building a system to solve the analytics problem Facebook was facing at the time. These engineers stepped up to the challenge, but they had much bigger plans for the system. They had seen too many projects that focused on immediate problems go to waste once a corporation decided they were no longer worth funding. They wanted to build a solution that would stand the test of time, and the only way they could achieve this was to make it open source. They recognized the existing gaps in big data analytics, and all of them had the backgrounds to solve this problem for companies outside of Facebook. Hear the story from the engineers who created Presto and learn the key features that made Presto one of the most popular analytics engines.
“At the time, Facebook was using Hive for most of its data analytics. Hive actually came out of Facebook. They created it in 2008. They open sourced it. They made it an Apache project, and it was heavily used for all the data transformation and data analytics. People were using it for interactive analytics too, or they tried to use it for interactive analytics.
Basically, they would run a query, and then maybe wait an hour or two hours for the results to come back, which seemed ridiculous. We thought that could be done much, much faster and we set out to do that. We said, ‘We can do something to run this.’ There was a system that came out of a hackathon at Facebook that attempted to do something like that, but the system wasn’t being maintained. It wasn’t scaling beyond the limits it had, and the architecture wasn’t amenable to making it scale as far as it needed to or to adding the features that needed to be added. So, we said, ‘Let’s look at it with fresh eyes,’ and we started doing something from the ground up, so that’s how Presto was born basically.” – Martin Traverso
“Back in the mid 2000s, there weren’t a lot of options. Hadoop didn’t exist. There were commercial systems like Netezza, [which] was actually pretty awesome. It was like a distributed database built on top of Postgres that came as an appliance, so it was really like a single rack. They would just drop it into your data center, plug it in, and it would just work. And that always kind of set the bar for how easy a product should be to use and how quickly you can get started with it.” – David
“I think also we used Hadoop, MapReduce, and custom stuff in the early days of Hadoop. And in comparison to something like Netezza or the other commercial products, it was really frustrating, really hard to work with, slow, and slow for no reason. You play with it and you’re like, ‘This thing could be orders of magnitude faster if someone just paid attention to it.’” – Dain
“There were a couple things that we wanted to do. One was to make [Presto] open source, but we also had to make it work with internal Facebook infrastructure.
At that point, Facebook was running a custom version of Hive. Even though Hive came from Facebook and was open source, eventually Facebook forked it back in. So, they had customizations. They had their own version of HDFS. And there were a bunch of other systems that we needed to be able to integrate with for all the monitoring, and collecting metrics and all that stuff.
So, we said, ‘We need to make sure that Presto works for Facebook, but we also want to make it open source, so how do we do that?’ And we kind of realized at some point that we could separate the engine, the core query search engine, from the storage layer, and we put it behind a plugin interface.
And that was kind of out of necessity. It was like, well, we need to be able to have Presto run on top of Facebook Hive and HDFS, but also work with open source Hive and HDFS. So, we did that by having plugins that could be swapped out.
So, that was kind of the motivation for that. But, very quickly after that, especially after we open sourced it, we started seeing people using that for integrating with other backends, like with databases and other systems. That was something that we didn’t really plan ahead of time.
But it became one of the pillars of Presto. One of the things that people look to when they think about Presto, and when they are using Presto, is the ability to connect to different data sources, bring all the data from the sources together, and run queries across all data sources at the same time.” – Martin
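To make the idea concrete, here is a minimal sketch of an engine that only talks to storage through a plugin interface. It is an illustration, not Presto's actual connector API: the `Connector`, `InMemoryConnector`, and `Engine` names, the `catalog.table` naming, and the sample data are all assumptions made up for this example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List

Row = Dict[str, object]


class Connector(ABC):
    """Hypothetical storage plugin: the only thing the engine knows about a backend."""

    @abstractmethod
    def scan(self, table: str) -> Iterator[Row]:
        """Yield the rows of `table` as dictionaries."""


class InMemoryConnector(Connector):
    """Stand-in for a real backend such as a Hive warehouse or a database."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        return iter(self._tables[table])


class Engine:
    """Toy engine: resolves catalog names to connectors, never touches storage itself."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        # Swapping a backend means registering a different connector here.
        self._catalogs[catalog] = connector

    def join(self, left: str, right: str, key: str) -> List[Row]:
        """Hash-join two 'catalog.table' names on a shared column."""
        left_catalog, left_table = left.split(".", 1)
        right_catalog, right_table = right.split(".", 1)
        # Build a lookup table from the right side, then probe it with the left side.
        index: Dict[object, List[Row]] = {}
        for row in self._catalogs[right_catalog].scan(right_table):
            index.setdefault(row[key], []).append(row)
        return [
            {**l, **r}
            for l in self._catalogs[left_catalog].scan(left_table)
            for r in index.get(l[key], [])
        ]


engine = Engine()
engine.register("warehouse", InMemoryConnector({
    "clicks": [{"user_id": 1, "page": "/home"}, {"user_id": 2, "page": "/docs"}],
}))
engine.register("crm", InMemoryConnector({
    "users": [{"user_id": 1, "name": "Ada"}],
}))

# One query spanning two independent data sources.
result = engine.join("warehouse.clicks", "crm.users", "user_id")
print(result)
```

Because the engine depends only on the `Connector` interface, the same `join` works whether a catalog is backed by an internal fork or an open source system, which is the property Martin describes.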
“It was clear to us that [Presto] would be open source. We started the project, then when we were talking to Jay Parikh, we said, ‘Hey, we want to make this open source.’ That was around the time when Facebook was working on Open Compute, and he was seeing that Open Compute ended up disrupting the hardware industry, and we wanted to do the same thing for the analytics industry. So, he was on board with that. It’s something that we wanted to do from the beginning, make it open source, because we had worked with open source projects and we believed that the most successful projects are those that are open source.
Getting other people and other companies involved makes for a healthier project. You end up not just building something that satisfies the needs of one company, but of everyone else, and in turn, you end up benefiting from that.
If you go look at the history of the project, the first commit was on GitHub. So, we used GitHub. We used all the tools we would eventually use when we open sourced it. It took us a year to open source it, but that was kind of the idea from the beginning.” – Martin
“We went and personally recruited companies like Airbnb, Netflix, LinkedIn, and kind of all these companies, to get them involved in the early days of the project because we wanted to bootstrap a real community. So, it didn’t just turn out to be five people at Facebook hacking away.” – Dain
“And we actually had these companies beta test the software, so that when we did launch, the problems that they had found had been fixed. And so, the first experience of people wasn’t the first time anyone had ever used it externally.” – David
“The fact it’s open source is not an accident. We looked at this project and were like, building a database takes, I don’t know, five to ten years and none of us… well, especially, I’ll speak for myself. I don’t want to work on something for five years, and then have some corporate effort change, and then your five years of code just goes in the trashcan. I’ve seen that way too many times. And in addition to wanting to get input from outside people and wanting to get more help, we wanted to make something that was going to have longevity. Our initial model was we want to build Postgres, but for analytics, and have it be open and free, and have lots of people involved in it and go in that sort of direction of a really big project. From day one, we very carefully designed the project. We did everything on GitHub, every issue on GitHub, the pull requests are on GitHub. All the reviews are public, which is pretty different from how a lot of companies do open source. We did everything publicly and we insisted everyone on the team do everything publicly, which is a pretty big change. But, then it makes the project more open and brings in people, and they don’t feel like you have a special place because this group of people at one company founded the project.
They’re not treated special. Everyone’s code goes through the same process and you can see it because it’s all in public. So, we designed it so that it was this big open thing, and that everyone could see it and feel like they’re an equal member.” – Dain