Downtime failover estimates

We are using Starburst High Availability (HA) via CloudFormation. Do we have an estimate of the downtime when a failover happens on the Coordinator/Worker during events such as an update or a node failure?

It’s really two things:

  1. Recognize that one of the nodes is down. This may take a minute or two.
  2. AWS spins up a new EC2 instance. This can take 1 to 4 minutes, followed by the time for the node to boot, start the software, and register with the coordinator. I’ve seen roughly 5 to 14 minutes of total downtime.
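
As a rough sketch of how those phases add up, here is a small Python calculation. The detection and provisioning ranges are the ones from the steps above; the boot/registration range is an assumption, picked only so the totals line up with the 5 to 14 minutes I’ve observed. The names (`Phase`, `estimate_downtime`) are made up for illustration, so plug in your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One sequential step of a failover, with a best/worst case in minutes."""
    name: str
    min_minutes: int
    max_minutes: int


def estimate_downtime(phases):
    """Return (best_case, worst_case) total minutes for sequential phases."""
    low = sum(p.min_minutes for p in phases)
    high = sum(p.max_minutes for p in phases)
    return low, high


PHASES = [
    Phase("detect failed node", 1, 2),       # from step 1
    Phase("provision EC2 instance", 1, 4),   # from step 2
    # ASSUMPTION: not a measured value, chosen so the totals match
    # the observed 5 to 14 minutes.
    Phase("boot software and register", 3, 8),
]

low, high = estimate_downtime(PHASES)
print(f"Estimated downtime: {low} to {high} minutes")
# prints "Estimated downtime: 5 to 14 minutes"
```

Because the phases run one after another, shrinking any single range (for example, keeping a passive coordinator running so there is no provisioning step) shrinks the total directly.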

Also worth noting: when multiple coordinators are started up front, failover is much faster (but you pay for the EC2 instance running as a passive Coordinator). Failure of a Worker does not cause an outage, since the cluster keeps accepting queries. Queries run somewhat slower because the cluster has less capacity until the replacement Worker is provisioned.

Some of the fault-tolerant execution work should help with task restarts and reduce the downtime that end users see.

So, what about AKS? I’ve noticed that if the coordinator pod goes down, scaling worker pods does not help much. Running Starburst on AKS for HA doesn’t seem fully effective when everything depends on the coordinator pod. I understand why this dependency exists (though I would like to hear your full opinion), and I’ve seen similar behavior in other tools, such as Tableau Server, where the Primary (or Controller) node is critical. Even with fault-tolerant or HA setups, these dependencies can still cause downtime.

Recently, in our organization (running SEP on AKS), we had an incident where the Starburst coordinator went down, causing downtime. We spent 2–3 days adjusting nodes, memory limits, CPU, and other pod settings to troubleshoot the problem. The root cause turned out to be an issue with the pod itself, which was recreated, but this shows that such failures can happen even with replicas and autoscaling.

I’m interested in hearing how Starburst runs Starburst Galaxy on AKS without these issues. Specifically:

  • How do you handle coordinator pod failures in production?
  • What best practices do you follow to reduce downtime for customers?

Apologies for bringing up Tableau clusters as an example. I’ve managed a 14-node Tableau cluster on-premises and on cloud VMs (Windows and Linux), and even with HA and blue/green deployments, dependencies caused headaches, with lots of restarts and quick fixes. I want to learn best practices for managing Starburst more reliably and making my team’s life easier.

Thanks,
Thara