Open-Source Presto®

Distributed SQL query engine for big data

Last updated: June 1, 2023, Published March 8, 2021

Presto: Definition, features, use cases

Presto is an open source distributed SQL engine for running fast analytic queries against various data sources ranging in size from gigabytes to petabytes. Presto was designed and built from scratch for interactive analytics. It approaches the speed of commercial data warehouses while scaling to the size of very large organizations. Presto was originally developed by Facebook to scale to the data size and performance they needed.

In fall 2012, a small team of four engineers at Facebook started working on Presto. By spring 2013, the first version was successfully rolled out within Facebook. Later that year, Facebook open sourced Presto under the Apache License. In 2018, Martin Traverso, Dain Sundstrom, and David Phillips left Facebook to pursue building an open source community full-time, under the new name PrestoSQL, which in 2020 became Trino. Facebook and others continued with Presto.

Presto inception and features

When Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang created Presto at Facebook in 2012, they were tasked with building a system that solved the analytics problem Facebook was facing at the time. These engineers stepped up to the challenge, but they had much bigger plans for this system. They had seen too many projects that focused on immediate problems go to waste once a corporation decided they were no longer worth funding. They wanted to build a solution that would stand the test of time, and the only way they could achieve this was to make it open source. They recognized the existing gaps in big data analytics, and all had the backgrounds to solve this problem for companies outside of Facebook. Hear the story from the engineers who created Presto and learn the key features that made Presto the most popular analytics engine.

What Presto Is

Presto provides a quick and easy way to access data from a variety of sources using the industry-standard ANSI SQL query language. End users don’t have to learn a new language or tool; they can simply use the existing tools with which they are comfortable.

It handles Online Analytical Processing (OLAP) workloads. Although Presto has more recently added features to handle insertions more efficiently, it shines at reading and federating data in your data warehouse or data lake.

Presto can offer a more agile approach to data access by data consumers.

The original use cases of Presto fall under interactive analytics, smaller Batch ETL jobs, and A/B Testing.

What Presto Is Not

Since Presto is being called a database by many members of the community, it makes sense to define what Presto is not. Do not mistake the fact that Presto understands SQL for it providing the features of a standard database. Presto is not a general-purpose relational database and is not a replacement for databases like MySQL, PostgreSQL, or Oracle. Moreover, like other systems designed and optimized for data warehousing or analytics, Presto was not designed to handle Online Transaction Processing (OLTP).

Presto Connectors

Presto, a distributed query engine, comes with a number of built-in connectors for a variety of data sources. Presto’s architecture fully abstracts the data sources it can connect to, which facilitates the separation of compute and storage. The Connector SPI allows building plugins for file systems and object stores, NoSQL stores, relational database systems, and custom services. As long as one can map the data into relational concepts such as tables, columns, and rows, it is possible to create a Presto connector. What is more, inside a single installation of Presto, users can register multiple catalogs and run queries that access data from multiple connectors at once. There is no need to perform a lengthy ETL process, since Presto can simply query data where it lives.
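To make the connector idea concrete, here is a minimal conceptual sketch in Python. It is not the actual Connector SPI (which is a Java interface); the names Connector, InMemoryConnector, Engine, register_catalog, and join, as well as the sample catalogs and rows, are hypothetical and exist only for illustration. It assumes that each source can expose named tables as rows of column/value pairs, and shows an engine joining data from two registered catalogs without copying it anywhere first.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol

Row = Dict[str, object]


class Connector(Protocol):
    """Hypothetical connector contract: expose named tables as rows."""

    def list_tables(self) -> List[str]: ...

    def scan(self, table: str) -> Iterator[Row]: ...


@dataclass
class InMemoryConnector:
    """Toy connector backed by Python lists (stands in for any real source)."""

    tables: Dict[str, List[Row]]

    def list_tables(self) -> List[str]:
        return sorted(self.tables)

    def scan(self, table: str) -> Iterator[Row]:
        return iter(self.tables[table])


class Engine:
    """Toy engine: it knows catalogs by name, never the storage behind them."""

    def __init__(self) -> None:
        self.catalogs: Dict[str, Connector] = {}

    def register_catalog(self, name: str, connector: Connector) -> None:
        self.catalogs[name] = connector

    def scan(self, qualified_name: str) -> Iterator[Row]:
        # Resolve "catalog.table" to the connector that owns the table.
        catalog, table = qualified_name.split(".", 1)
        return self.catalogs[catalog].scan(table)

    def join(self, left: str, right: str, key: str) -> List[Row]:
        # Simple hash join: index the right side, then probe with the left.
        index: Dict[object, List[Row]] = {}
        for row in self.scan(right):
            index.setdefault(row[key], []).append(row)
        return [
            {**l_row, **r_row}
            for l_row in self.scan(left)
            for r_row in index.get(l_row[key], [])
        ]


# Two independent "sources" registered as catalogs in one engine.
engine = Engine()
engine.register_catalog("lake", InMemoryConnector({
    "clicks": [
        {"user_id": 1, "url": "/home"},
        {"user_id": 2, "url": "/docs"},
        {"user_id": 3, "url": "/pricing"},
    ],
}))
engine.register_catalog("crm", InMemoryConnector({
    "users": [
        {"user_id": 1, "name": "Ada"},
        {"user_id": 2, "name": "Lin"},
    ],
}))

# One query spanning both catalogs; no ETL step in between.
result = engine.join("lake.clicks", "crm.users", "user_id")
for row in result:
    print(row)
```

The engine here only ever calls list_tables and scan, so swapping one backing store for another would not change the query side. That is the separation of compute and storage in miniature; a real engine would add a planner, predicate pushdown, and distributed execution on top.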

Starburst Data Reference Architecture Diagram – On Premises

Presto’s history with Martin Traverso, Dain Sundstrom, and David Phillips

Beyond Hive

Martin Traverso: At the time, Facebook was using Hive for most of the data analytics. Hive came out of Facebook actually. They created it in 2008. They open sourced it. They made it an Apache project and it was heavily used for all the data transformation and data analytics. People were using it for interactive analytics too, or they tried to use it for interactive analytics.

Basically, they would run a query, and then maybe wait an hour or two hours for the results to come back. Which seemed ridiculous. We thought that could be done much, much faster and we set out to do that.

We said, ‘We can do something to run this.’

There was a system that came out of a hackathon at Facebook that attempted to do something like that, but the system wasn’t being maintained. It wasn’t scaling beyond limits it had, and the architecture wasn’t amenable to making it scale beyond what it needed to scale and to be able to add the features that need to be added.

So, we said, ‘Let’s look at it with fresh eyes,’ and we started doing something from the ground up, so that’s how Presto was born basically.

It Just Works

David Phillips: Back in the mid 2000s, there weren’t a lot of options. Hadoop didn’t exist. There were commercial systems like Netezza [which] was actually pretty awesome. It was like a distributed database built on top of Postgres, that came as an appliance, so it was really like a single rack they would just drop it into your data center, plug it in and it would just work. And that always kind of set the bar for how easy a product should be to use and how quickly you can get started with it.

Dain Sundstrom: I think also we used Hadoop, and MapReduce and custom stuff in the early days of Hadoop. And in comparison to something like Netezza or the other commercial products, it was really frustrating, really hard to work with, slow and slow for no reason. You play with it and you’re like, ‘This thing could be orders of magnitude faster if someone just paid attention to it.’

Presto query federation and the SPI

Martin Traverso: There were a couple things that we wanted to do. One was to make [Presto] open source, but we also had to make it work with internal Facebook infrastructure.

At that point, Facebook was running a custom version of Hive. Even though Hive came from Facebook and was open source, eventually Facebook forked it back in. So, they had customizations. They had their own version of HDFS. And there were a bunch of other systems that we need to be able to integrate for all the monitoring, and collecting metrics and all that stuff.

So, we said, “We need to make sure that Presto works for Facebook, but we also want to make it open source, so how do we do that?” And we kind of realized at some point that we could separate the engine, the core query search engine from the storage layer, and we put it behind a plugin interface.

And that was kind of out of necessity. It was like, well, we need to be able to have Presto run on top of Facebook Hive and HDFS, but also work with open source Hive and HDFS. So, we did that by having plugins that could be swapped out.

So, that was kind of the motivation for that. But, very quickly after that, especially after we open sourced it, we started seeing people using that for integrating with other backends, like with databases and other systems. That was something that we didn’t really plan ahead of time.

But it became one of the pillars of Presto. One of the things that people look to when they think about Presto, and when they are using Presto, is the ability to connect to different data sources, bring all the data from the sources together and run queries across all data sources at the same time.

Why Presto would be open source

Martin Traverso: It was clear to us that [Presto] would be open source. We started the project, then when we were talking to Jay Parikh, we said, ‘Hey, we want to make this open source.’ That was around the time when Facebook was working on Open Compute and he was seeing that Open Compute ended up disrupting the hardware industry and we want to do the same thing for the analytics industry.

So, he was on board with that. It’s something that we wanted to do from the beginning, make it open source because we had worked with open source projects, we believed that the most successful projects are those that are open source.

Getting other people and other companies involved in the project makes for a healthier project. You end up not just building something that satisfies the needs of one company, but of everyone else, and in turn, you end up benefiting from that.

If you go look at the history of the project, the first commit was on GitHub. So, we used GitHub. We used all the tools we would eventually use when we open sourced it. It took us a year to open source it, but that was kind of the idea from the beginning.

Dain Sundstrom: We went and personally recruited companies like Airbnb, and Netflix, and LinkedIn and kind of all these companies, to get them involved in the early days of the project because we wanted to bootstrap the actual having a real community. So, it didn’t just turn out to just be five people at Facebook hacking away.

David Phillips: And we actually had these companies beta test the software, so that when we did launch, the problems that they had found had been fixed. And so, the first experience of people wasn’t the first time anyone had ever used it externally.

Dain Sundstrom: The fact it’s open source is not an accident. We looked at this project and were like, building a database takes, I don’t know, five to ten years and none of us… well, especially, I’ll speak for myself.

I don’t want to work on something for five years, and then have some corporate effort change, and then your five years of code just goes in the trash can. I’ve seen that way too many times.

And in addition to wanting to get input from outside people and wanting to get more help, we wanted to make something that was going to have longevity. Our initial model was we want to build Postgres, but for analytics, and have it be open and free, and have lots of people involved in it and go in that sort of direction of a really big project. From day one, we very carefully designed the project.

We did everything on GitHub, every issue on GitHub, the pull requests are on GitHub. All the reviews are public, which is pretty different from how a lot of companies do open source. We did everything publicly and we insisted everyone on the team do everything publicly, which is a pretty big change.

But, then it makes the project more open and brings in people, and they don’t feel like you have a special place because this group of people at one company founded the project. They’re not treated special. Everyone’s code goes through the same process and you can see it because it’s all in public. So, we designed it so that it was this big open thing, and that everyone could see it and feel like they’re an equal member.
