What is Trino?

Why Trino might just be the future of data lakehouse compute

September 25, 2024

Cole Bowden

Trino Release Engineer

Starburst

Evan Smith

Technical Content Manager

Starburst Data

Trino is more popular than ever, but what is it? Let’s start with a definition. Trino is a massively parallel processing, distributed SQL query engine. It helps users perform data engineering and data analytics tasks on very large data sets. There’s a lot to unpack in that definition. This blog will explore Trino, from definition to use case. You’ll see how this SQL engine is shaking up the big data industry and disrupting other processing engines along the way.

TL;DR

Trino separates compute from data storage
Trino excels at lakehouse and federated analytics
Spark suits ETL, Trino suits analytics
Managed Trino reduces operational overhead

What is a query engine?

Let’s start with the query engine itself. The premise of a query engine is relatively simple. First, you begin with a data source. This could be either a Relational Database Management System (RDBMS) like PostgreSQL or a NoSQL database like MongoDB. It could even be another data warehouse or data lake accessed through data federation. Next, in order to run analytics on the data, you need something to run and process those queries. This is true of both ad hoc queries and dashboards using real-time analytics. In traditional relational databases, such as MySQL or PostgreSQL, the query engine is built into the database. This means that you can run SQL queries without needing any additional software.

Trino changes all of this. It opens up the idea of a query engine and makes it available for all kinds of different workloads. In this sense, it is part of the open data stack. In the case of data lakes and data lakehouses, whether they run Hive, Hadoop, Delta Lake, Hudi, or Iceberg, the data storage only stores your data; it doesn’t process it. For many people, running a data lakehouse based on cloud object storage is best. This is true whether using Amazon S3, Azure Blob Storage, or Google Cloud Storage. Either way, you need something separate to query the data. That’s where Trino comes in, and it does its job quickly, efficiently, and with more support and integrations than any other pure query engine out there.

The history of Trino

Facebook originally created Trino in 2012 under the name of Presto. Facebook used it to query very large datasets, specifically legacy Hive data warehouses based on Hadoop HDFS. The goal was to cut data analytics query times down from days to hours, and from hours to minutes. Over time, the project gained adoption by many large tech companies. These include Netflix, Uber, Airbnb, and LinkedIn.

In 2019, the co-founders of the project left Facebook and created a fork, later renaming it Trino. Since then, it has taken over as the de facto branch of the Presto/Trino project, with significantly more development, faster advancement of features, and more widespread adoption in the data community.

What makes a query engine a query engine

It’s worth diving deeper into exactly what makes a query engine a query engine. Without a doubt, one of the biggest misconceptions about a query engine is thinking of it as a database. Trino, like other query engines, does not store data. Instead, in order to use Trino, you need an underlying data source. Once you have that, Trino connects to the data source, and uses it to run queries. Importantly, it does this using a connector-based architecture. The architecture consists of a core query engine along with the ability to connect that engine to a wide variety of data sources.

Trino works best with data lakes and lakehouses based on Apache Iceberg, Delta Lake, Hudi, or Hive. Because it includes dozens of other connectors, Trino can also be used for query federation. Federated queries use data stored in multiple systems and databases. Trino can connect to and query all of them in unison. This approach uses joins to combine the disparate data with a single SQL query.
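To make the idea of a federated join concrete, here is a toy sketch in Python. It is not how Trino is implemented. Two in-memory SQLite databases stand in for two separate systems, a small `scan` function plays the role of a connector, and the "engine" joins the rows itself. All table and column names are hypothetical. In Trino, tables are addressed as catalog.schema.table, so the equivalent query would name both sources in one statement, as the comment shows with made-up catalog names.

```python
import sqlite3

# Two independent in-memory SQLite databases stand in for two separate systems
# (say, an operational RDBMS and a data lake). All names here are hypothetical.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Edsger")],
)

lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
lake.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)

# In Trino, a single statement could express the same join, for example:
#   SELECT c.name, sum(o.amount)
#   FROM postgresql.public.customers c
#   JOIN hive.sales.orders o ON c.id = o.customer_id
#   GROUP BY c.name
# (the catalog and schema names above are assumptions for illustration)


def scan(source, sql):
    """Play the role of a connector: fetch rows from one source."""
    return source.execute(sql).fetchall()


def federated_totals():
    """Join rows from both sources in the 'engine' with a simple hash join."""
    # Build side: customer id -> name, read from the first source.
    names = dict(scan(crm, "SELECT id, name FROM customers"))
    totals = {}
    # Probe side: read orders from the second source and match on customer id.
    for customer_id, amount in scan(lake, "SELECT customer_id, amount FROM orders"):
        if customer_id in names:
            name = names[customer_id]
            totals[name] = totals.get(name, 0.0) + amount
    return totals


print(federated_totals())
```

Neither source knows about the other. The join happens entirely in the layer above them, which is the role the query engine plays.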

Understanding Trino architecture

In addition to having a connector-based architecture used to access different data sources, Trino also has a massively parallel processing (MPP), distributed architecture. This design allows it to scale up and down according to need, enabling it to handle large-scale datasets with petabyte or exabyte workloads. And because it can read from various data sources, Trino allows data engineers to build complex data pipelines that bring together everything data analysts need for dashboards and data science projects.

Using a single coordinator node and as many worker nodes as you need, a cluster can distribute a Trino query in the most efficient way possible. This might involve a handful of workers, or dozens, or even hundreds, each working in parallel. This approach ensures that no matter how large your dataset is, you can always use it for analytics. Trino also employs a number of optimizations including join reordering, predicate pushdown, and partial aggregations. Using these techniques, Trino intelligently avoids doing unnecessary work. It limits compute costs, and processes your query as fast as possible with very little latency.
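The sketch below illustrates two of these ideas, parallel work over splits and partial aggregation, in plain Python. It is a simplified model under stated assumptions, not Trino's actual execution code: threads stand in for worker nodes, the `SPLITS` data is made up, and a single function performs the final merge.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical splits: each list is the slice of (region, amount) rows
# that one worker would read.
SPLITS = [
    [("eu", 10), ("us", 5), ("eu", 7)],
    [("us", 3), ("apac", 8)],
    [("eu", 1), ("apac", 2), ("us", 4)],
]


def partial_sum(split, min_amount=0):
    """Worker step: filter rows as they are read, then pre-aggregate this split."""
    partial = Counter()
    for region, amount in split:
        if amount >= min_amount:  # rows failing the filter are dropped early
            partial[region] += amount
    return partial


def run_query(splits, min_amount=0):
    """Fan out one task per split, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(splits)) as pool:
        partials = pool.map(lambda s: partial_sum(s, min_amount), splits)
        final = Counter()
        for partial in partials:
            final.update(partial)  # Counter.update adds values per key
    return dict(final)


print(run_query(SPLITS))
print(run_query(SPLITS, min_amount=5))
```

Each worker returns at most one row per region instead of every input row, so far less data has to be exchanged before the final result is produced. Filtering as rows are read has a similar effect on the amount of data carried forward.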

The benefits of using Trino for SQL queries

Trino almost exclusively uses ANSI SQL syntax. SQL is the main language used by data scientists, and most data engineers know it well too. This ensures that your queries are interoperable with other data analytics systems. It also makes it easier for clients, visualization tools, and other integrations to be compatible with Trino. The Trino ecosystem is vast.

This means that no matter what your data stack looks like now or in the future, it should be painless to integrate with Trino. This unparalleled combination of federation, integrations, and high performance is what makes Trino shine. Its use of basic SQL syntax, which most data scientists and analysts should already be well-versed in, ensures that you don’t need to learn specific tips or tricks regarding Trino usage. Once it has been configured and deployed, end-users should find it easy to begin working to deliver insights regarding your data.

Managed Trino: more control, less friction

Running Trino yourself delivers powerful analytics, but it also introduces operational overhead. To do so, teams would need to provision clusters, tune performance, manage upgrades, scale for peak workloads, and maintain security and governance. That effort requires time and deep expertise, which can slow down analytics initiatives.

Managed Trino removes that friction. Teams can focus on querying data and delivering insights instead of managing infrastructure. Deployment is simpler, best practices are built in, and performance and reliability are handled by experts who work with Trino every day.

Managed Trino also helps control costs. Autoscaling and intelligent resource management ensure you only pay for the compute you use. Consistent performance tuning prevents inefficient queries from consuming unnecessary resources. The result is faster time to insight, lower operational risk, and an easier path from experimentation to production.

Managed Trino with Starburst Galaxy

Starburst Galaxy is a fully managed Trino service built for speed, scale, and simplicity. Clusters deploy in minutes and scale automatically with demand. They shut down when not in use, which removes the need for manual capacity planning.

Galaxy is built and maintained by the creators and core contributors of Trino. It stays aligned with the latest Trino innovations while Starburst handles upgrades, patches, and optimizations. This ensures strong performance and reliability without downtime.

Starburst Galaxy also extends open source Trino with proprietary capabilities. These include data ingestion, data maintenance, performance enhancements using Warp Speed, advanced governance, and enhanced observability. Together, they help teams query faster, reduce costs, and maintain control.

Is Trino right for you?

Trino has two core use cases:

Handling data lakes and lakehouses at scale
Handling data federation for organizations with data in several different places

If either of these scenarios applies to you, then you will gain the maximum value as a Trino user. Beyond this, Trino supports users who deploy it on a small scale. In these settings, even though performance isn’t a major concern and data federation isn’t typically necessary, Trino still provides a lot of value as an industry-standard tool that’s easy to use and easy to integrate with other parts of the data ecosystem. As a general rule, any time that performance, scale, and cost are primary concerns, you’ll want to consider Trino.

Here’s how to make that decision properly

Of course, comparisons are important, and no less so with Trino. You can find many benchmarks and opinions insisting that tool X is faster, engine Y is superior, or database Z is better. Because Trino is a pure query engine, it is relatively easy to test, and this is the approach we recommend.

To do this, use the following approach. First, connect Trino and any other systems that you’re considering to your data stack. Second, run an analytics workload typical of what you expect to run on a daily or hourly basis. Ideally, this will involve real queries that you’ve already run, and will use the hardware that you would actually use in real-world scenarios. Third, review the results and compare them to your expectations. Repeat the process as many times as you need. As you conduct your experiment, remember that every system, workload, network connection, and data set is different. The only way to truly understand what will work best for you is to roll up your sleeves and try it out for yourself.
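The approach above can be sketched as a small timing harness. This is a minimal example under stated assumptions: the `benchmark` helper and the query names are hypothetical, and an in-memory SQLite database stands in for the engine so that the script runs anywhere. To compare real systems, you would pass in a function that submits SQL through each system's own client.

```python
import sqlite3
import statistics
import time


def benchmark(run_query, queries, repeats=5):
    """Time each query `repeats` times and return the median seconds per query."""
    results = {}
    for name, sql in queries.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results


# Stand-in engine: an in-memory SQLite database (an assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, i * 0.5) for i in range(10_000)],
)

# Replace these with real queries from your own workload.
QUERIES = {
    "daily_rollup": "SELECT user_id, sum(value) FROM events GROUP BY user_id",
    "top_users": (
        "SELECT user_id, count(*) AS n FROM events "
        "GROUP BY user_id ORDER BY n DESC LIMIT 10"
    ),
}

medians = benchmark(lambda sql: conn.execute(sql).fetchall(), QUERIES, repeats=3)
for name, seconds in medians.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Using the median over several runs reduces the influence of one-off outliers such as a cold cache. Running the same `QUERIES` through each candidate system keeps the comparison like for like.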

And of course, while high performance, scalability, and cost are very important, they aren’t everything. As you deliberate, make sure that you consider which features you need and which features you don’t need. Choosing a system that goes 10% faster doesn’t achieve much if, by doing so, you’re missing critical features that limit your ability to access or visualize your data in the way that you need or want.

Trino vs Presto

As discussed earlier, Trino was forked from Presto in 2019. Since then, both projects have remained under development and have diverged considerably over time. At the time of writing, Trino has seen more commits than Presto, and it now includes more features. Today, Presto’s main selling points are vector acceleration for Hive and Presto on Spark. These two improvements primarily impact the Hive-Spark data stack that Facebook uses, and to this end, Facebook itself has been a major contributor to the project.

In contrast, Trino has undergone more robust development on its core engine than Presto. This has allowed it to compete with Presto’s performance on Hive even without vector acceleration. Meanwhile, Trino includes additional features that set it apart from Presto. These include SQL MERGE, local filesystem caching, fault-tolerant execution, polymorphic table functions, support for modern Java versions, and a number of new connectors. In light of this, the data community has largely shifted to supporting Trino instead of Presto, and the Trino community is a vibrant and dynamic open source community. As a result, Trino integrations are better maintained and more likely to remain that way into the future.

There’s no mincing words here: Trino is the better choice for virtually all scenarios compared to Presto.

Trino vs Spark

Spark and Trino are two different tools, with two different use cases. Because of this, they are not in direct competition with each other in the way that other query engines might be. Instead, comparisons between Trino and Spark are best assessed by reviewing your workload and choosing the best tool for the job. For example, Spark is best used for ETL/ELT and data transformation workloads. In this arena, it still performs better than nearly any other tool available. Although Spark is not the fastest compute engine, it is reliable for ETL/ELT workloads. It also includes fault-tolerance. For this reason, Spark has extremely widespread adoption for handling ETL tasks.

On the other hand, Trino is primarily built for analytics. Unlike Spark, it is designed to access and understand your data as quickly as possible. It performs this type of workload much better and faster than Spark. For this reason, if you have serious analytics workloads, you should consider using Trino instead of Spark.

There are also areas of convergence between Spark and Trino. Trino’s fault-tolerant execution mode is comparable to Spark, though its adoption in this area is less entrenched as the feature is newer. If you’re currently using Spark and are not currently using Trino, we wouldn’t recommend jumping from Spark to Trino with FTE enabled as a replacement for transformation workloads. However, if you are already using Trino, using a separate FTE cluster can simplify your stack by eliminating Spark from the picture. In this scenario, Trino’s fault tolerance would allow you to use Trino for your entire workload, simplifying your data stack considerably.

Trino vs Starburst

Starburst is the open-core company behind most of Trino’s ongoing development. We offer both on-prem and cloud versions of Trino, called Enterprise and Galaxy, respectively. Why use Starburst if Trino is so powerful? Although Trino is open source, and you can deploy and manage it yourself, it is also highly manual and complex. Because of this, some organizations lack the internal resources to properly adopt Trino in its open source form, despite benefitting from the architecture itself. Starburst is designed to solve this problem, making Trino easy and accessible to everyone. Perhaps you’re unsure how best to use Trino. Maybe you don’t want to deal with the headache of provisioning and managing your own servers and clusters. In these cases, Starburst can simplify the use of Trino for you.

Starburst has also made a number of proprietary improvements to Trino, both for Starburst Enterprise and Starburst Galaxy. These include Warp Speed, a feature that allows you to index your dataset, achieving query speed improvements of up to 700% and reducing compute costs by up to 40%. Starburst’s version of Trino also includes several additional connectors. In addition, Starburst offers enhanced data telemetry and data governance, and provides more flexibility and versatility with access control and data security.

Why Starburst is the best way to use Trino

Overall, why should you choose Starburst if you’re considering Trino? There are a number of reasons. First, we are the Trino experts, home to the Trino/Presto co-founders and engineers who have been working on the project for over a decade. Second, we make the job of using Trino easy. With Starburst Galaxy, Trino is fully managed in the cloud for you. This allows you to worry less about configuration and tuning, and think more about what you want to do with your data.

Finally, Starburst offers unique features that augment and extend Trino. With autoscaling and auto-shutdown, you don’t need to worry about capacity management, and because Starburst is maintaining it, there’s no need to worry about updates to Trino; we do all the hard work for you without any downtime.

Ready to explore Starburst Galaxy and Trino together? Sign up for a free trial.

FAQs about Trino

Is Trino considered a database?

Trino is not a database but rather a distributed SQL query engine that processes data without storing it. Instead of holding information, it connects to various external data sources. In doing so, Trino can execute queries directly where the data lives. This separation of compute and storage allows for high-performance analytics across disparate systems without the need for complex data migration.

What data sources can Trino connect to?

Trino uses a connector-based architecture to query a vast array of data sources, including modern data lakes, relational database management systems, and NoSQL databases. Users can perform federated queries that access and join data from multiple storage systems within a single SQL statement. A query might, for example, join data from cloud object storage and traditional warehouses. This flexibility eliminates the need to build extensive ETL pipelines to centralize data before performing analysis.

How does Trino differ from batch processing engines?

Trino is engineered specifically for speed and interactive analytics, prioritizing low latency for ad-hoc queries. Traditional batch processing engines are designed for long-running ETL jobs.

While batch engines often focus on heavy data transformations and fault tolerance over extended periods, Trino executes queries in memory to deliver immediate insights for business intelligence. This distinction makes Trino ideal for data exploration, where analysts require rapid results rather than overnight processing.

Does Trino support standard SQL syntax?

Trino employs ANSI-standard SQL syntax, making it highly accessible for data engineers, analysts, and data scientists who are already familiar with the language. The adherence to standard SQL facilitates easier adoption and interoperability within existing data ecosystems. It integrates seamlessly with popular business intelligence tools and allows users to run complex queries without learning a proprietary language.

How does Trino’s MPP architecture improve performance?

Trino’s massively parallel processing (MPP) architecture enables it to distribute query execution across a cluster of worker nodes. It can scale efficiently to handle petabyte-scale datasets. The engine minimizes latency and accelerates query response times significantly. It does so by processing data in memory and utilizing optimizations like join reordering and predicate pushdown. This distributed approach ensures that resources can be scaled up or down based on workload demands to maintain high performance.
