Distributed SQL query engine for big data

Presto is an open-source distributed SQL query engine for running high-performance queries against various data sources ranging in size from gigabytes to petabytes.

Presto was designed and built from scratch in Java for interactive analytics as a replacement for Apache Hadoop/HDFS MapReduce jobs. It approaches the speed of commercial data warehouses while scaling up to the size of the largest organizations in the world.

Presto was originally developed at Facebook to query its Hive-based data warehouse at the scale and speed the company needed, and it was later expanded to connect to many other data sources.

What is Presto?

The Presto query engine provides a quick and easy way to allow access to data from a variety of sources using industry standard ANSI SQL syntax. End users don’t have to learn any new complex language or new tool; they can simply utilize existing tools for analytics with which they are comfortable.

Presto is best at handling analytics workloads, and though Presto has added some features to handle insertions more efficiently, it shines when reading and federating data in a data warehouse or data lake.

What is the difference between PrestoDB and PrestoSQL?

PrestoDB is the former name of the original version of Presto. It was developed by Eric Hwang, Dain Sundstrom, David Phillips, and Martin Traverso at Facebook. In 2018, they left Facebook and founded the Presto Software Foundation to ensure that the project would remain collaborative and independent. They named their fork PrestoSQL, which was renamed Trino at the end of 2020. PrestoDB was renamed Presto shortly after, so today PrestoDB is simply called Presto and PrestoSQL is called Trino.

Presto and Trino share similar features and the same core code. However, ongoing development on Presto has been driven by Facebook, while development on Trino has been driven by companies like Starburst and AWS trying to serve a wide audience. This has made Trino more generally useful, and as explained below, it has benefitted from higher velocity development.

Is Trino better than Presto?

Since the fork in 2018, development on Trino has gone at roughly three times the velocity of development on Presto. Trino offers additional connectors that aren’t in Presto, better performance across the vast majority of connectors, and expanded SQL support, and it is much better at handling batch ETL/ELT workloads.

Whether you’re considering Presto or Trino, the easiest way to start querying your data is with Starburst Galaxy, the simplest and quickest way to get running with SQL. If you don’t want to use Starburst, the Trino website provides tutorials on using it locally on Linux, via a Docker image, or with Kubernetes.

Is Starburst the same as Presto?

Starburst Enterprise and Starburst Galaxy both run Trino, the fork of Presto that is developed by the co-founders of the project. If you are already familiar with Presto, you could seamlessly use Starburst or Trino in its place without any issues. We believe Trino is the better choice between the two similar engines, which is why we use Trino.

What is the difference between Presto and SQL?

Presto understands and can run ANSI SQL queries. What it does not do is provide the features of a standard database: Presto is not a general-purpose relational database and does not store data. Instead, you use Presto’s connectors to query your database, data warehouse, or data lake, whether that data is stored with Amazon, on-premises, or in another cloud, and run ad-hoc analytics against it.

What data sources can you use with the Presto query engine?

Presto comes with a number of built-in connectors for a variety of data sources. Presto’s architecture fully abstracts the data sources it can connect to, which facilitates the separation of compute and storage. The Connector API allows building plugins for file systems and object stores, NoSQL stores, relational database systems, and custom services. As long as one can map the data into relational concepts such as tables, columns, and rows, it is possible to create a Presto connector. And with Presto, users can register and connect to multiple catalogs, running queries that access data from multiple connectors at once. This allows you to run a single SQL query to access all your databases, no matter what storage paradigm they use. There is no need to perform a lengthy ETL process to prepare data for analysis, because Presto can query data where it lives.
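As a rough sketch of what querying multiple catalogs at once can look like, the snippet below assembles a single SQL statement that joins a table from a relational database with a table from a data lake. Presto addresses tables as catalog.schema.table; the catalog names (`mysql`, `hive`), schemas, tables, and columns used here are all hypothetical and would depend on how your own catalogs are registered.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified catalog.schema.table name."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql() -> str:
    """Build one query that reads from two different catalogs."""
    # Hypothetical catalogs: "mysql" for an operational database,
    # "hive" for tables in a data lake.
    customers = qualified("mysql", "crm", "customers")
    page_views = qualified("hive", "web", "page_views")
    return (
        "SELECT c.customer_id, c.name, COUNT(*) AS views\n"
        f"FROM {customers} AS c\n"
        f"JOIN {page_views} AS e ON e.customer_id = c.customer_id\n"
        "GROUP BY c.customer_id, c.name"
    )


print(federated_join_sql())
```

The resulting statement is ordinary SQL; only the catalog prefix on each table name tells the engine which connector to read from.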

Presto has connectors for traditional SQL databases like MySQL, PostgreSQL, Oracle, and SQL Server; for non-SQL databases like MongoDB, Cassandra, and Elasticsearch; and for modern data lakes like Hive, Iceberg, Delta Lake, and Hudi.

What does a Presto query look like?

Presto queries look like standard SQL queries. Presto runs low-latency queries against a wide variety of data sources and schemas, using the same familiar SQL statements and clauses you already know. It can then be hooked up to a wide variety of visualization or BI tools for viewing and data access. Presto also provides a CLI for lightweight queries or simple testing.
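To give a feel for this, here is a minimal sketch of a typical analytics query with a filter, an aggregation, and a sort. The `orders` table and its columns are hypothetical. Python’s built-in SQLite is used purely as a local stand-in so the example runs without a cluster; against Presto, the same statement would be submitted through the CLI or a client, with the table addressed through a catalog and schema.

```python
import sqlite3

# Hypothetical analytics query written in plain, standard SQL.
QUERY = """
SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM orders
WHERE status = 'shipped'
GROUP BY region
ORDER BY revenue DESC
"""


def run_demo():
    """Run QUERY against a tiny in-memory table (a stand-in for a real source)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, status TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("emea", "shipped", 120),
            ("emea", "shipped", 80),
            ("apac", "shipped", 300),
            ("apac", "cancelled", 50),
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


for region, order_count, revenue in run_demo():
    print(region, order_count, revenue)
```

Nothing in the statement itself is engine-specific, which is the point: analysts can reuse the SQL they already know.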

Is Presto still used?

Yes! Presto is used at a wide variety of companies across the globe and in many different industries. It has been open source for over a decade, giving it a lot of time to grow, gain adoption, and become a core part of handling analytics and metrics workloads. You can find the latest development on Presto on GitHub.

What is the difference between Spark and Presto?

Spark and Presto are similar in that they are both query engines. Spark emphasizes reliability and consistency for writing data and handling ETL workloads, making it dependable even when you have tens or hundreds of terabytes in memory. Presto, on the other hand, is built for high-performance, ad hoc analytics. Presto can run on top of Spark to leverage the benefits of both engines when Spark’s reliability is needed more than Presto’s speed.

Is Presto better than Spark?

Presto is designed to pair with Spark and use it as an execution engine. Presto is primarily used for ad-hoc analytics at high speeds; Spark is used for ETL and batch processing of massive amounts of data. Comparing Presto with Spark doesn’t make much sense, because they do different jobs. If you are using Presto, it is likely that you will also want to use Spark.

Is Trino better than Spark?

Unlike Presto, Trino has a fault-tolerant execution mode that can be used for ETL workloads similar to what Spark excels at. Rather than using two query engines for two different jobs, you can simplify the process with just Trino. It performs better than Presto for ad-hoc queries, and it’s just as reliable as Spark for massive operations that involve writing or modifying data. We would say yes, Trino is better than Spark.
