Last Updated: May 30, 2023, Published September 7, 2021
Trino was created by Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang in 2012 to replace Hive as the query engine for the 300PB data warehouse at Facebook. The project aims to run fast ad hoc analytics queries over big data file systems and object stores such as Hadoop Distributed File System (HDFS), Azure Blob File System (ABFS), AWS Simple Storage Service (S3), Google Cloud Storage, Azure Blob Storage, and MinIO.
An initially unintended but now characteristic feature of Trino is its ability to execute federated queries over various distributed data sources. These include, but are not limited to: Accumulo, BigQuery, Apache Cassandra, ClickHouse, Druid, Elasticsearch, Google Sheets, Apache Iceberg, Apache Hive, JMX, Apache Kafka, Kinesis, Kudu, MongoDB, MySQL, Oracle, Apache Phoenix, Apache Pinot, PostgreSQL, Prometheus, Redis, Redshift, SingleStore (MemSQL), and Microsoft SQL Server.
What’s incredible is that you are able to perform pushdown queries and joins across these different data sets.
Until late 2020, Trino went under the name of Presto. The two projects share six years of history: take a look at the Presto project and you’ll notice the same original history and the same four original creators, Martin, Dain, David, and Eric. From 2012 to 2018, Martin, Dain, and David remained at Facebook and focused on making Presto a highly successful and healthy open-source project.
However, in 2018, Facebook management unilaterally changed the rules around Presto’s governance and granted automatic committer rights to Facebook employees. As this didn’t align with their values for a healthy open source project, Martin, Dain, and David left Facebook in 2018, forking the Presto repo to create a truly community-owned branch of the project.
From early 2019 to late 2020, both the Facebook-controlled and the community-owned projects went under the “Presto” name. Their respective domain names were used to distinguish the two: PrestoDB for the original repo and PrestoSQL for the community-driven branch.
In late 2019, Facebook established the Presto Foundation under the Linux Foundation and moved to enforce the Presto trademark against the community branch. As a result, the community project was renamed Trino in late 2020.
Related reading: A history of Trino and Presto
Several design decisions make Trino fast, especially in comparison to its predecessor, Hive.
First, the creators of Trino made a very intentional decision not to rely on the checkpointing and fault tolerance methods that were popular in big data systems at the time. Fault tolerance requires expensive and extremely slow writes to disk that, while adding resiliency, add a tremendous amount of latency.
The theory here is that, if you run queries to interactively probe the data, the turnaround needs to be within seconds to minutes. When a query already returns that quickly, checkpointing the work is not worth the time it takes. This approach also requires the system to run with few failures. In practice, Trino is known to handle even larger ETL jobs that take hours to complete with a very low rate of failure.
Other elements that make Trino fast are its ability to push queries down to the source systems, where custom indexes already exist on the data, and its Cost-Based Optimizer.
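The effect of pushing a filter down to the source can be illustrated with a minimal sketch. It does not use Trino or reflect how Trino implements pushdown; an in-memory SQLite database stands in for a hypothetical remote source system, and the table name, column names, and helper functions are all assumptions made for illustration. The point is only that filtering at the source returns far fewer rows to the engine than fetching everything and filtering afterwards.

```python
import sqlite3


def make_source(n_rows=10_000):
    """Build a hypothetical 'remote' source: an indexed orders table in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, region, total) VALUES (?, ?, ?)",
        ((i, "emea" if i % 100 == 0 else "amer", float(i)) for i in range(n_rows)),
    )
    # The source system already maintains an index on the filter column.
    conn.execute("CREATE INDEX idx_orders_region ON orders (region)")
    return conn


def without_pushdown(conn, region):
    """Fetch every row, then filter inside the 'engine'."""
    rows = conn.execute("SELECT id, region, total FROM orders").fetchall()
    matching = [row for row in rows if row[1] == region]
    return matching, len(rows)  # len(rows) = rows transferred from the source


def with_pushdown(conn, region):
    """Send the predicate to the source so only matching rows come back."""
    rows = conn.execute(
        "SELECT id, region, total FROM orders WHERE region = ?", (region,)
    ).fetchall()
    return rows, len(rows)


source = make_source()
slow_rows, slow_transferred = without_pushdown(source, "emea")
fast_rows, fast_transferred = with_pushdown(source, "emea")
print(f"without pushdown: {slow_transferred} rows transferred")
print(f"with pushdown:    {fast_transferred} rows transferred")
```

Both approaches return the same result set; the difference is how much data leaves the source system before the filter is applied.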
Trino implements an MPP (massively parallel processing) architecture.
This means that it has traits such as internode parallelism over nodes connected using a shared-nothing architecture. Data is partitioned into smaller chunks and distributed across these nodes. Once the chunks arrive at a particular node, they are processed in parallel over multiple threads within that node. This further divides the work to be done over large amounts of data.
To find out more, read about the Trino architecture.
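The two levels of parallelism described above can be sketched in a few lines. This is a toy model under stated assumptions, not Trino's scheduler: the "nodes" are simulated with threads in a single process (real worker nodes are separate machines that share nothing), and the function names `partition`, `worker_partial_sum`, and `distributed_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_chunks):
    """Split rows into n_chunks roughly equal chunks (round-robin)."""
    return [rows[i::n_chunks] for i in range(n_chunks)]


def worker_partial_sum(chunk, threads_per_node=2):
    """One hypothetical node: split its chunk again and sum the pieces on threads."""
    splits = partition(chunk, threads_per_node)
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        return sum(pool.map(sum, splits))


def distributed_sum(rows, n_nodes=4):
    """Coordinator: distribute chunks to nodes, then combine their partial results."""
    chunks = partition(rows, n_nodes)
    # Threads stand in for separate machines here; each one only sees its own chunk.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(worker_partial_sum, chunks))
    return sum(partials)


values = list(range(1, 1001))
total = distributed_sum(values)
print(total)
```

Each node works only on its own chunk and returns a partial result, so the coordinator combines a handful of numbers rather than the full data set.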
Yes, Trino is an online analytical processing (OLAP) system. Trino is intended to run as a query engine for a data lake or data mesh. These two paradigms extend the original OLAP solution known as the data warehouse.
In a data warehouse, you relied heavily on moving data around using ETL. This was incredibly slow and often moved data unnecessarily. Trino instead allows you to interactively run queries across various data sources without moving the data ahead of time, querying it where it lives.
For more information, visit the Trino Software Foundation page on the Trino website.
The Trino Software Foundation (formerly Presto Software Foundation) is an independent, non-profit organization with the mission of supporting a community of passionate users and developers, devoted to the advancement of the Trino distributed SQL query engine for big data. It is dedicated to preserving the vision of high quality, performant, and dependable software.
A more practical way to view the foundation is as a legal entity that holds assets such as the trino.io website, Trino Slack, Trino GitHub Account, Trino Twitter Page, and Trino YouTube Page, and that manages CLAs. The Trino project welcomes everyone to join it as a contributor or in the more active governing role of maintainer. The project is governed by individuals, and technical decisions are made by the people who are actively involved in the project.
The Trino Software Foundation supports a diverse, open, and collaborative community of developers and users throughout the world. Everyone is welcome to participate, whether it be via code contributions, suggestions for improvements, or even bug reports.
Trino uses ANSI SQL. Which ANSI SQL? All of them!
The SQL versions are additive, meaning everything in SQL ’92 is in SQL ’99, which in turn is in SQL 2003, then in SQL 2011, then in SQL 2016, and so on.
Trino has features from all revisions of the ANSI spec, but only where it makes sense. For instance, there are a lot of features that are relevant only to Online Transaction Processing (OLTP) systems. Trino is OLAP and therefore Trino only implements portions of the language that apply to analytics operations. The features that only exist in OLTP systems are therefore not implemented.
As mentioned previously, Trino is a query engine. Informally, it is common to see Trino deployed on a data lake. In this use case, the data lake combines Trino as the query engine, a table format (such as Hive or Iceberg) that models the data and relies on a metastore to manage schema, and a storage layer (a file system like HDFS or cloud object storage like S3). In this scenario, the Trino data lake could be called a database, as it has all the elements that traditional databases have. However, it should be made clear that Trino itself is a query engine, and not a database.
Trino is the community-driven project that was forked from the Presto project. Anyone who has sufficient experience to answer this question is biased in one way or another. In the interest of transparency, it should be stated that Starburst is a company built on open source Trino. We prefer to lay out the facts and let the data speak for itself, and we invite you to draw your own conclusions and reach out with questions if any of this is confusing.
Following the split of Presto in 2018, the community largely moved to Trino. This is visible in the data and progress of both projects in the years since.
Trino has many recent optimizations since the project split. To name a few:
For a full list of features, read the yearly report blogs published by the Trino community since the project split:
Another important aspect to consider is the supporting community. When comparing the data:
Related reading: The Difference Between PrestoSQL, PrestoDB, and Trino