
Introducing the Trino spooling protocol

And how it compares with the direct protocol
  • Lester Martin

    Developer Advocate

    Starburst


Trino’s origin story is all about making interactive queries faster than the original schema-on-read solution, Apache Hive. One of the fundamentals behind that performance is that, by default and at every stage of a query’s execution, data is streamed from process to process, including delivery to the client itself, without persisting any data to disk.

In addition to this in-memory processing model, another architectural tenet is that Trino’s coordinator is the only node of the cluster that handles client application connections. The centrality of the coordinator is visualized by the red box in the Trino architectural diagram below.

Image depicting the Trino data architecture, emphasizing the connection between the client and coordinator nodes.

Client protocol basics

To understand this further, it’s important to unpack how the client protocol works. Most Trino client applications connect through JDBC/ODBC drivers or the Python client. However, under the covers, those drivers use a REST API to communicate with the query engine. Once a client submits a query to the coordinator and execution begins, the Trino client protocol dictates that the coordinator returns whatever result data is ready at that initial point in time. If that data is the comprehensive result set, the client is notified and the query is complete.

If the query is not complete, the protocol’s response includes a “next” URI for the client to use to get more of the comprehensive result set. The client and coordinator continue this repeated handshake until all of the data has been received; a minimal sketch of the loop follows the list below. The cycle only stops when one of two things is true:

  • Either the coordinator indicates to the client that the query is finished;
  • Or the client cancels the query.
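
For the curious, here is a minimal sketch of that handshake against the raw REST API, written with the Python requests library. The host, user, and query are placeholders, and real clients such as the JDBC driver or trino-python-client implement this loop for you.

import requests

COORDINATOR = "http://localhost:8080"   # placeholder coordinator address
HEADERS = {"X-Trino-User": "demo"}      # minimal required header

def run_query(sql: str):
    rows = []
    # 1. Submit the query text to the coordinator.
    response = requests.post(f"{COORDINATOR}/v1/statement", data=sql, headers=HEADERS).json()
    # 2. Follow the "next" URI until the coordinator stops returning one.
    while True:
        rows.extend(response.get("data", []))   # inline rows, if any are ready yet
        next_uri = response.get("nextUri")
        if next_uri is None:                    # query finished (or failed; check "error")
            break
        response = requests.get(next_uri, headers=HEADERS).json()
    return rows

print(len(run_query("SELECT nationkey, name FROM tpch.tiny.nation")))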

Image depicting the Trino direct protocol.

Direct protocol

For over a decade, the client API has been implemented solely with the direct protocol, which is now referred to as the v1 implementation. In addition to returning the next URI, the payload includes zero-to-many rows of the comprehensive result set. Given the original goal of producing a high-performance SQL engine for interactive queries, this has proven to be an extremely successful solution.

Again, the direct protocol returns actual result set rows in its data element, as represented in the next visual.

Depiction of the Trino direct payload, highlighted for emphasis.
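
To make that payload shape concrete, here is a trimmed, illustrative example of a direct (v1) response expressed as a Python dictionary. The identifiers, URIs, and values are made up, and real responses carry additional fields (full type signatures, detailed stats, and so on).

direct_response = {
    "id": "20250101_000000_00000_abcde",
    "nextUri": "http://localhost:8080/v1/statement/executing/20250101_000000_00000_abcde/xyz/1",
    "columns": [
        {"name": "nationkey", "type": "bigint"},
        {"name": "name", "type": "varchar(25)"},
    ],
    # Result rows are inlined directly in the response payload.
    "data": [
        [0, "ALGERIA"],
        [1, "ARGENTINA"],
    ],
    "stats": {"state": "RUNNING"},
}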

Problems with the direct protocol approach

Unfortunately, there are some scenarios where this approach is not optimal. Specifically, when the comprehensive result set is extremely large, this low-latency solution introduces throughput consequences and creates undesirable pressure on the coordinator.

Enter the spooling protocol

The spooling protocol is designed to solve this problem. The Trino community released a spooling protocol extension in Trino 466 to address the following new requirements:

  • The ability to retrieve the result set out of order and in parallel
  • Support for more space- and CPU-efficient result set formats (row- and column-oriented)
  • Reduced load on the coordinator during result set generation, by moving that work to the workers

Spooling protocol and REST API

How does this impact the REST API? The sequence of the REST API is fundamentally the same in most regards, but there are some changes. Specifically, the spooling protocol sends segments within the data element. Each segment can carry classic inline data, as in the v1 implementation, but can also be a reference to “spooled” data that was persisted by the workers directly.

This approach is pictured in the image below.

Depiction of the Trino spooling payload.

These references are essentially URLs pointing to an object store that contains the binary data for that given part of the comprehensive result set. This allows the client to operate independently, retrieving the encrypted data from the spooling storage directly, without needing to pull the result set rows from the coordinator. In fact, there are multiple options for the retrieval mode configuration.

Depiction of the Trino Spooling protocol, highlighted for emphasis.
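
As a rough illustration, a spooling protocol response might look something like the Python dictionary below. The structure is an approximation and the values are placeholders; consult the Trino client protocol documentation for the authoritative schema.

spooled_response = {
    "id": "20250101_000000_00001_abcde",
    "nextUri": "http://localhost:8080/v1/statement/executing/20250101_000000_00001_abcde/xyz/2",
    "data": {
        "encoding": "json+zstd",    # the negotiated result set encoding
        "segments": [
            {   # small, early results can still arrive inline
                "type": "inline",
                "data": "eyJ...base64...",
                "metadata": {"rowOffset": 0, "rowsCount": 100, "segmentSize": 4096},
            },
            {   # larger chunks are spooled by the workers and referenced by URI
                "type": "spooled",
                "uri": "https://spooling-storage.example.com/segments/abc123",
                "ackUri": "http://localhost:8080/v1/spooled/ack/abc123",
                "metadata": {"rowOffset": 100, "rowsCount": 500000, "segmentSize": 33554432},
            },
        ],
    },
    "stats": {"state": "RUNNING"},
}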

 

Implications

This model changes a few things. First, it allows for larger data chunks to be returned, providing much higher throughput. At the same time, this approach comes at the expense of some latency. The spooling protocol was intentionally designed to be both backward-compatible and forward-compatible for existing clients. This means that if the client cannot leverage the spooling protocol, the coordinator will only use the direct protocol.

Extensible encoding and compression schemes

Furthermore, this model allows for extensible encoding and compression schemes that are not possible with the direct protocol. For large query results, the client can retrieve the comprehensive result set even faster because it can parallelize the retrieval of the files located on the backing object store. The initial launch of the spooling protocol includes support for all major cloud object stores: S3, ABFS, and GCS.
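
As a sketch of what that parallelism can look like on the client side (the CLI, JDBC driver, and other spooling-enabled clients handle this internally), assuming each spooled segment exposes a download URI and optional headers as in the illustration above:

from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_segment(segment: dict) -> bytes:
    # Each spooled segment is just an HTTP(S) download from the spooling storage
    # (or from a coordinator/worker proxy, depending on the retrieval mode).
    return requests.get(segment["uri"], headers=segment.get("headers", {})).content

def fetch_all(segments: list[dict]) -> list[bytes]:
    # Segments carry a row offset in their metadata, so they can be downloaded
    # out of order and reassembled afterwards.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch_segment, segments))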

The benefits of spooling are automatic

The best news for spooling-enabled clients is that they do not have to decide whether to receive information as direct or spooled data. The real decision is based on the size of the comprehensive result set. The coordinator can start out sending segments composed of inline data, and then when it determines the data would make more sense to be spooled, it can transition to that approach. It would then send segments that reference the URIs for the spooled data sets.

Freeing resources in the coordinator

Another interesting advantage of spooling results to external storage is that clients can consume all remaining “next” URIs before actually fetching the results. This frees up resources on the coordinator, which can be very beneficial in busy clusters.
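
Here is a minimal sketch of that pattern, again assuming the illustrative response shape from above: it drains every “next” URI first, collecting spooled segment references, and only downloads the segments once the coordinator is done with the query.

import requests

HEADERS = {"X-Trino-User": "demo"}   # placeholder, as in the earlier sketch

def drain_then_download(first_response: dict) -> list[bytes]:
    segments = []
    response = first_response
    while True:
        data = response.get("data") or {}
        if isinstance(data, dict):                 # spooling protocol: data holds segments
            segments.extend(data.get("segments", []))
        next_uri = response.get("nextUri")
        if next_uri is None:                       # the coordinator is finished with the query
            break
        response = requests.get(next_uri, headers=HEADERS).json()
    # Only now hit the spooling storage; the coordinator is no longer involved.
    return [requests.get(s["uri"], headers=s.get("headers", {})).content
            for s in segments if s.get("type") == "spooled"]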

Spooling protocol configuration

To utilize the spooling protocol, modifications need to be made to your cluster’s client protocol properties. Setting protocol.spooling.enabled=true is the flag for this feature, but additional spooling file system properties are needed to configure the object storage system you will be using, as shown in the following example that uses S3-compatible object storage (a MinIO endpoint standing in for AWS S3) as the spooling storage.

# register the file system spooling manager
spooling-manager.name=filesystem
# enable the S3 file system support and point it at the spooling location
fs.s3.enabled=true
fs.location=s3://spooling/
# spooled segments are cleaned up after this time-to-live
fs.segment.ttl=12h
# S3-compatible endpoint and credentials (a local MinIO instance in this example)
s3.endpoint=http://minio:9080/
s3.region=fake-value
s3.aws-access-key=minio-access-key
s3.aws-secret-key=minio-secret-key
s3.path-style-access=true

Additionally, the clients themselves need to be updated to take advantage of the spooling protocol. Backward compatibility ensures that nothing is broken if the Trino administrator makes the config changes above, but forward compatibility allows the various clients to be upgraded before the query engine itself is modified. 
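
As a hedged example, recent versions of the trino Python client can request a spooled encoding when connecting. The exact parameter and the supported encoding values depend on your client version, so treat this as a sketch and check the client documentation.

import trino

conn = trino.dbapi.connect(
    host="localhost",        # placeholder coordinator
    port=8080,
    user="demo",
    encoding="json+zstd",    # request a spooled, compressed result set encoding
)
cur = conn.cursor()
cur.execute("SELECT * FROM tpch.sf10.lineitem LIMIT 1_000_000")
rows = cur.fetchall()
print(len(rows))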

Results

A recent Trino Community Broadcast demonstrated the performance improvements by executing the following query against the TPC-H connector via the Trino CLI.

SELECT * FROM tpch.sf10.lineitem LIMIT 1_000_000

Using the direct protocol, the client took approximately 35 seconds to complete. Using the spooling protocol, the client in this demonstration finished in approximately 9 seconds.

As a reminder, these significant improvements will be most noticeable when the comprehensive result set is composed of a large number of rows.

 

Conclusion

Overall, now is the perfect time to take advantage of the new spooling protocol implementation in your Trino clusters. Importantly, the protocol changes are both forward and backward-compatible, and the coordinator determines when to use direct or spooling approaches.

Here is a side-by-side comparison of the protocol implementations.

                    Direct protocol                   Spooling protocol
Encoding            json only                         Extensible: json, json+lz4, json+zstd
Optimized for       Latency for small result sets     Large dataset retrieval
Data retrieval      Streaming in inlined chunks       Streaming in spooled segments
Data chunk size     1 MB                              2-128 MB, pre-compression
Bottleneck          Coordinator (CPU, I/O)            Workers (CPU, I/O), storage (I/O)
Requirements        None                              Spooling storage configuration
Client support      All Trino and Starburst clients   CLI, JDBC, Starburst ODBC, Python