Faster data compliance and data sovereignty with Stargate Parallel
Lester Martin
Developer Advocate
Starburst
Jan Was
Software Engineer
Starburst
Marius Grama
Engineering Manager
Starburst


More deployment options
Improving data access is a cornerstone of Starburst’s mission, whether that means data analytics or AI workflows. To facilitate this, Starburst offers access to over 50 data sources using a comprehensive set of connectors. Connectors link organizations to cloud object stores, databases of all types, and streaming and search engines, with the ability to mix and match as needed in a hybrid environment.
Understanding the Starburst Stargate connector
One of the most powerful Starburst connectors is the Starburst Stargate connector.
The Stargate connector enables a local Starburst cluster to establish a secure connection with a catalog in a separate, remote cluster. When a user submits a query using Stargate, it is executed in the remote cluster, and the results are streamed through the local cluster to the user.
The image below shows how this process works architecturally. Notice that both the local and remote clusters include coordinators and workers, meaning that Stargate workloads scale in a similar fashion to any other Starburst cluster.
How the Stargate connector helps solve data compliance and sovereignty issues
The Stargate connector is designed to address issues related to data compliance, data regulation, and data sovereignty. It achieves this by allowing control over where an organization’s data is stored and processed. This is an increasingly important area of concern, particularly for certain industries, including financial services, insurance, healthcare, and the public sector.
Why data compliance matters more than ever
Data compliance is a legal requirement in many countries, and its importance is growing as organizations operate across more jurisdictions. For example, the General Data Protection Regulation (GDPR) in the EU and the California Consumer Privacy Act (CCPA) in California both outline restrictions on how data can be moved and processed. To meet data compliance and data sovereignty regulations, many multinational companies have struggled to find a way to analyze data that resides across borders.
Starburst Stargate is designed to solve these challenges. It provides a compliant pathway for reading data held across national borders, which is a massive game-changer for many organizations, particularly those operating in regulated environments where compliance is required.
Limitations of the original Stargate connector and JDBC connections
While the Starburst Stargate connector opens the door to addressing data compliance, regulation, and sovereignty requirements, its original architecture presented limitations in some scenarios. Specifically, it fell short when retrieving large amounts of data through a JDBC connection to the remote cluster, which can cause performance problems.
Stargate Parallel: The ideal solution for compliance and performance
Enter Stargate Parallel, the next-generation Stargate solution that addresses both compliance and performance.
And the best part? It’s already live.
Starburst Enterprise and the Stargate parallel connector
Beginning in version 467-e of Starburst Enterprise, the Starburst Stargate parallel connector enhances the existing Starburst Stargate connector by providing the ability to retrieve the data relevant to a query result with a much higher throughput. This enhancement becomes particularly apparent when trying to retrieve large amounts of data when executing a query.
The image below shows how this process works architecturally. Using spooling, the coordinator does not have to receive actual data from the workers to send to the client. Instead, workers write result segments to object storage and inform the coordinator about the location of each spooled segment. The coordinator passes those locations to the client, which then fetches the data itself, reducing the coordinator’s workload.
A closer look at the Stargate parallel and the Trino spooling protocol
Stargate parallel directly iterates on the Stargate connector, using Trino spooling to achieve its results. While the original Starburst Stargate connector used a direct protocol to communicate with the remote cluster, the updated Stargate parallel connector uses the Trino spooling protocol. This protocol allows the Stargate connector to retrieve data from object storage faster, using a parallelization technique. Importantly, this approach uses the remote Starburst Enterprise cluster coordinator as a proxy, allowing both customization and configuration as needed.
Want to know more about the spooling protocol that underpins the Stargate parallel connector? Read the recent Introducing the Trino spooling protocol blog post for all the technical details.
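To make the flow concrete, here is a minimal toy sketch of the idea in Python. It is not the Trino spooling protocol itself: the in-memory `object_store` dictionary, the `mem://` URIs, and all function names are hypothetical stand-ins chosen for illustration. It only shows the shape of the exchange, where workers write segments, the coordinator handles locations rather than rows, and the client downloads segments in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for object storage: segment URI -> encoded rows (hypothetical).
object_store: dict[str, bytes] = {}


def worker_spool(worker_id: int, rows: list[str]) -> str:
    """A worker writes its result segment to storage and returns only the URI."""
    uri = f"mem://spooling/segment-{worker_id}"
    object_store[uri] = "\n".join(rows).encode()
    return uri


def coordinator_collect(partitions: list[list[str]]) -> list[str]:
    """The coordinator gathers segment locations, never the row data itself."""
    return [worker_spool(i, rows) for i, rows in enumerate(partitions)]


def client_fetch(uris: list[str]) -> list[str]:
    """The client downloads all segments in parallel and reassembles the rows."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map keeps results in the same order as the input URIs.
        segments = list(pool.map(object_store.__getitem__, uris))
    return [row for seg in segments for row in seg.decode().split("\n")]


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
uris = coordinator_collect(partitions)
rows = client_fetch(uris)
print(uris)
print(rows)
```

In this sketch the coordinator's work ends once the URIs are handed over, which mirrors why a slow client does not hold coordinator resources.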
Benefits of the Stargate parallel connector
What other advantages does the updated Stargate parallel connector provide? Besides the obvious increase in query performance, Stargate parallel provides several other benefits as well:
- Less load on the remote coordinator: Stargate parallel processes queries in a distributed way, even if there is only a single split, placing less load on the remote coordinator.
- Slow clients don’t create congestion on the remote coordinator: Using Stargate parallel, the coordinator finishes a query and releases all resources once the client receives all segment URIs, without having to wait for it to actually download all data. This reduces congestion on the remote coordinator.
Understanding Stargate Parallel local or remote cluster configuration
When using the Starburst Stargate parallel connector on the local Starburst Enterprise cluster, it is mandatory to enable the spooling protocol on the remote Starburst Enterprise cluster.
Some queries explicitly return large amounts of data. Others transfer large amounts of data because they contain aggregates or joins that are not pushed down to the data source. While Stargate Parallel would help in both cases, note that this connector’s purpose is not to replace pushdowns, and Starburst continues to improve pushdown support in the Stargate connector. Check out all the performance features of the Stargate connector.
Example: Processing data remotely with and without spooling
In the example below, two clusters operate on 1M rows located in the EU. The first cluster operates on data from within the EU, and the second operates on the same data from a remote location in the US. At first, both clusters operate without spooling, using the original Stargate connector (light purple). Unsurprisingly, the EU cluster shows much less latency, with about 28 seconds compared to about 42 seconds in the US cluster.
Next, both clusters operate using spooling. In both scenarios, spooling decreases processing time to about 10 seconds in the EU cluster and 14 seconds in the US cluster. The result highlights how spooling is more beneficial when there’s additional latency between the client and the cluster. This scenario often occurs when processing data in accordance with data compliance and data sovereignty rules, making spooling especially useful in high-compliance environments.
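The approximate figures above can be turned into a quick comparison. The sketch below uses only the rounded timings quoted in this example; the dictionary layout and function name are illustrative.

```python
# Approximate timings (seconds) quoted in the example above.
timings = {
    "EU": {"direct": 28, "spooling": 10},
    "US": {"direct": 42, "spooling": 14},
}


def summarize(direct: float, spooling: float) -> tuple[float, float]:
    """Return (seconds saved, speedup factor) for one cluster."""
    return direct - spooling, direct / spooling


results = {region: summarize(**t) for region, t in timings.items()}
for region, (saved, speedup) in results.items():
    print(f"{region}: saved {saved:.0f}s, {speedup:.1f}x faster")
```

With these rounded numbers, the EU cluster saves about 18 seconds (roughly 2.8x faster) and the US cluster saves about 28 seconds (roughly 3.0x faster), so the higher-latency path gains more in both absolute and relative terms.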
How to set up Stargate Parallel on your local machine
Switching to Stargate Parallel is easy. To do this, update your connector configuration by changing the connector name from “stargate” to “stargate_parallel”. Because the Stargate Parallel connector builds on the existing functionality of the standard Stargate connector, no other changes are needed.
connector.name=stargate_parallel
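If you manage many catalog files, the change can be scripted. The following is a small sketch, assuming a plain key=value catalog properties file; the function name and the sample connection values are hypothetical placeholders, not values from a real deployment.

```python
def switch_to_parallel(properties_text: str) -> str:
    """Rewrite connector.name=stargate to stargate_parallel; leave other lines alone."""
    out = []
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "connector.name" and value.strip() == "stargate":
            line = "connector.name=stargate_parallel"
        out.append(line)
    return "\n".join(out)


# Hypothetical catalog file contents; the connection values are placeholders.
original = "\n".join([
    "connector.name=stargate",
    "connection-url=jdbc:trino://remote.example.com:8443/hive",
    "connection-user=example_user",
])
updated = switch_to_parallel(original)
print(updated)
```

The function only touches an exact `connector.name=stargate` entry, so running it on a file that already uses `stargate_parallel` leaves the file unchanged.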
Stargate Parallel Remote cluster configuration
Now it’s time to set up the Stargate Parallel Remote cluster configuration. To do this, the spooling protocol properties need to be applied on the Starburst Enterprise remote cluster to enable and customize the cluster’s spooling abilities.
Let’s get started.
Step 1 – Set spooling manager properties
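To set up spooling, the remote cluster requires that the object storage used for spooling purposes be configured in etc/spooling-manager.properties. The example below shows how to do this with the S3 file system. Note that the sample endpoint and keys reference a MinIO instance and are placeholders to replace with your own values.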
Note: This example is designed for AWS S3. Other cloud providers may require a different setup.
spooling-manager.name=filesystem
fs.s3.enabled=true
fs.location=s3://spooling/
s3.endpoint=http://minio:9080/
s3.region=us-east-1
s3.aws-access-key=minio-access-key
s3.aws-secret-key=minio-secret-key
s3.path-style-access=true
fs.segment.ttl=5m
fs.segment.pruning.interval=15s
fs.segment.pruning.batch-size=250
# Disable as we don't support SSE-C while writing/reading from S3
fs.segment.encryption=false
Step 2 – Set coordinator properties
Now, you need to set the coordinator properties. The following is an example of the additional properties that need to be added to the remote cluster coordinator’s etc/config.properties file.
Note: This example is designed for AWS S3. Other cloud providers may require a different setup.
# Enable spooling protocol
protocol.spooling.enabled=true
protocol.spooling.shared-secret-key=abc-appropriateKeyValue-xyz
# Enable direct storage access to test fetching against storage
protocol.spooling.retrieval-mode=storage
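The key value shown above is a placeholder. As a sketch, and assuming the shared secret is expected to be a 256-bit, base64-encoded value as described for the open source Trino spooling protocol, a suitable random key could be generated like this (the function name is illustrative):

```python
import base64
import secrets


def generate_shared_secret(num_bytes: int = 32) -> str:
    """Return a random key, base64-encoded (32 bytes = 256 bits)."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


key = generate_shared_secret()
# Prints a line that can be pasted into config.properties.
print(f"protocol.spooling.shared-secret-key={key}")
```

Treat the generated value like any other credential and check the key requirements in the documentation for your Starburst Enterprise version.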
Step 3 – Set worker properties
Now it’s time to set the worker properties. The following is an example of the additional properties that must be added to the remote cluster workers’ etc/config.properties files.
Note: This example is designed for AWS S3. Other cloud providers may require a different setup.
# Enable spooling protocol
protocol.spooling.enabled=true
protocol.spooling.shared-secret-key=abc-appropriateKeyValue-xyz
# Enable direct storage access to test fetching against storage
protocol.spooling.retrieval-mode=storage
# Do not use inline segments so we can test against storage
protocol.spooling.inlining.enabled=false
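Because the coordinator and worker files repeat the same spooling settings, a quick consistency check can catch copy and paste mistakes. The sketch below assumes, as in the examples above, that both node types enable the protocol and use the same shared secret key; the helper names and sample file contents are hypothetical.

```python
KEY = "protocol.spooling.shared-secret-key"
ENABLED = "protocol.spooling.enabled"


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_spooling(coordinator: dict[str, str], worker: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means the files agree."""
    problems = []
    for name, props in (("coordinator", coordinator), ("worker", worker)):
        if props.get(ENABLED) != "true":
            problems.append(f"{name}: spooling protocol is not enabled")
    if coordinator.get(KEY) != worker.get(KEY):
        problems.append("shared secret key differs between coordinator and worker")
    return problems


coordinator_text = "# Enable spooling protocol\nprotocol.spooling.enabled=true\nprotocol.spooling.shared-secret-key=example-key"
worker_text = "protocol.spooling.enabled=true\nprotocol.spooling.shared-secret-key=other-key"
problems = check_spooling(parse_properties(coordinator_text), parse_properties(worker_text))
print(problems)
```

In the sample above the two keys differ on purpose, so the check reports a single problem.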
Stargate Parallel and security
Stargate parallel is designed with security in mind. Query results are retrieved through pre-signed URIs, either directly from the object storage or with the remote Starburst Enterprise coordinator acting as a proxy. This approach follows industry best practices for sharing data between two parties.
Encryption
Additionally, data is encrypted at rest using ephemeral keys, which are unique for each query. These keys are only known to both clusters for the duration of the query and are not stored. Once the query is done, even if segments are not deleted, they cannot be decrypted.
Abandoned segments can be cleaned up by a pruning job configured in Starburst Enterprise or a life-cycle policy configured in the object storage.
Server-side encryption is used, which does not add overhead to either the remote or local cluster.
Adapting to different security requirements
Stargate parallel is also designed to adapt to different security environments and requirements. Specifically, it employs multiple retrieval modes, allowing a variety of options for different organizations. This approach allows local clusters to read segments:
- Directly from the object store
- Through all nodes of the remote cluster
- Only through the remote coordinator
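These options are selected with the protocol.spooling.retrieval-mode property shown earlier. The sketch below maps each read path in the list to a mode name. The mode names are those used by the open source Trino spooling protocol, and both the names and the mapping are assumptions for Starburst Enterprise, so verify them against the documentation for your version. The dictionary and function names are illustrative.

```python
# Assumption: mode names as used by open source Trino's
# protocol.spooling.retrieval-mode property; verify for your version.
READ_PATH_TO_MODE = {
    "directly from the object store": "STORAGE",
    "through all nodes of the remote cluster": "WORKER_PROXY",
    "only through the remote coordinator": "COORDINATOR_PROXY",
}


def retrieval_mode_property(read_path: str) -> str:
    """Return the config line for a read path described in the list above."""
    mode = READ_PATH_TO_MODE[read_path.strip().lower()]
    return f"protocol.spooling.retrieval-mode={mode}"


line = retrieval_mode_property("Only through the remote coordinator")
print(line)
```

Stricter environments would typically choose a proxy mode so that the local cluster never needs direct network access to the remote object storage.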
Starburst and Stargate Parallel
We’re incredibly proud of the work behind Stargate Parallel and invite you to explore the feature during its public preview. Your feedback is invaluable, so let us know about any issues you encounter or features you’d like to see. This is your opportunity to help shape what comes next!
We look forward to your feedback and invite you to begin your Stargate Parallel journey today.