In our organization, I am operating a Starburst Enterprise Platform (SEP) cluster on Azure Kubernetes Service (AKS), with one coordinator and several workers. I realize SEP can only run with a single active coordinator. This single coordinator architecture is being questioned by one of our architects from another business unit who’s used to designing platforms for high availability (HA), especially when deploying in Kubernetes environments where HA is expected by default.
Could the Starburst team clarify:
Why does SEP currently require a single coordinator per cluster, even in cloud-native, HA-ready infrastructures like AKS?
Is there a technical or architectural limitation that prevents running multiple active coordinators?
Is there any plan to enable coordinator-level HA (with active-active or active-standby modes), or what approaches are recommended for coordinator fault tolerance today?
I have reviewed the Starburst documentation and understand that the coordinator manages state and query orchestration across workers, which seems like a potential bottleneck and single point of failure. However, I would appreciate a more official and detailed explanation from the Starburst team, especially so I can explain this stance with supporting reasoning to colleagues who expect HA and seamless failover just because we are using AKS.
Thanks for your insights and any references to official design rationales!
As someone who came to Trino from the Hadoop world, this was my first question, too, given that Hadoop had an active/passive HA model for the NameNode and the ResourceManager that worked pretty well. Neither of us is unique in asking this, and there are several similar threads on this forum, such as "Can you set up Trino in HA mode?".
Fundamentally, Trino itself does NOT have an HA solution for the coordinator. Various teams and customers have implemented their own solutions over the years, such as the one described in "Trino | 33: Trino becomes highly available for high demand". At the end of it all, many of these solutions lean on a model of routing to 2+ full clusters to provide the HA that a single cluster lacks. Not a bad answer, considering you get additional benefits along with it, such as query routing and blue/green deployments for zero-downtime upgrades.
I should pause for a moment and state that Starburst Professional Services does have a coordinator HA solution on K8s that has been accepted by many customers.
But, let’s get back to the gateway/routing model, as it is the long-term “official” answer from Starburst for HA/DR. While the names are similar, the Starburst Gateway (see the “Starburst Gateway” page in the Starburst Enterprise documentation) is different from the Trino Gateway, and, again, it is the “official” go-forward answer to address this need.
Note: as of Sept 2025, the Starburst Gateway is in ‘private preview’, and the product team is always looking for teams who are interested in getting involved.
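To make the routing idea concrete, here is a minimal Python sketch of the "2+ full clusters" model: probe each cluster's coordinator and send traffic to the first one that is ready. This is only an illustration of the concept, not how the Starburst Gateway or the Trino Gateway is implemented. The hostnames and the function names `coordinator_is_ready` and `pick_cluster` are hypothetical, and the probe assumes the coordinator answers `GET /v1/info` with a JSON body containing a boolean `starting` field, as open-source Trino does; check this against your SEP version.

```python
import json
import urllib.request
from typing import Callable, Iterable, Optional


def coordinator_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the coordinator answers and reports it has finished starting.

    Assumption: GET {base_url}/v1/info returns JSON with a boolean
    "starting" field (true while the coordinator is still initializing).
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/info", timeout=timeout) as resp:
            info = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed JSON.
        return False
    return isinstance(info, dict) and info.get("starting") is False


def pick_cluster(
    clusters: Iterable[str],
    is_ready: Callable[[str], bool] = coordinator_is_ready,
) -> Optional[str]:
    """Return the first cluster URL whose coordinator is ready, or None."""
    for url in clusters:
        if is_ready(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical coordinator URLs for two independent SEP clusters.
    candidates = [
        "https://sep-a.example.internal:8443",
        "https://sep-b.example.internal:8443",
    ]
    target = pick_cluster(candidates)
    print(target or "no cluster available")
```

Note that a check like this only steers new queries. Queries already running on a cluster whose coordinator goes down will fail and need to be resubmitted.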
@lester Thanks for your detailed answer. Yes, we are looking into Starburst Gateway for routing traffic across multiple clusters so that we don't have a single point of failure.