Portability Was Always the Point

What the cloud era got half-right, and why the unfinished half matters now



I’ve spent enough years inside the modern data stack to notice a pattern. Every wave of infrastructure sells itself as freedom, and every wave eventually reveals the new shape of the cage. The cloud era was no exception. The pitch was elasticity, managed operations, and the end of capacity planning as a personality trait. What a lot of us actually ended up with was elasticity inside one vendor’s topology, managed operations inside one vendor’s opinions, and a new generation of capacity planning that just happened to live in a billing console.

I don’t say that cynically. The cloud genuinely solved compute scalability. It made operational efficiency cheap enough that “let the vendor run it” became the default answer for most workloads. Those are real wins, and I don’t want to give them back. The piece that never got finished is the one I keep running into in customer conversations: portability. Not portability of data formats, which the open lakehouse movement has largely delivered, but portability of the operational layer itself: the ability for a managed service to come to where your data already lives, instead of asking your data to move to where the service prefers to run.

That’s the gap Galaxy Bring Your Own Cloud (BYOC) is built around.

Format portability was the easy half

The lakehouse won the format argument. Iceberg, Delta Lake, Hudi — pick your flavor; the underlying point is the same. Your tables aren’t trapped in someone’s proprietary file layout, you can read them from more than one engine, and you’re not held hostage to a single vendor’s pace of table evolution. That’s a real shift, and it’s why open table formats have become the default substrate for serious data engineering.

Unpacking the traditional cloud tradeoffs around data centralization

But format portability is only one layer of the cake, and arguably the easy one. The harder layer is operational portability. What does it mean for the managed service — the upgrade cadence, the telemetry pipeline, the support model, the governance layer — to deploy into whatever environment your data happens to need to stay in? For most of the history of managed analytics, the implicit answer was “it doesn’t.” You got the operational benefits of a managed service because you ran inside the vendor’s cloud. If you needed to run inside your own, you took the self-managed path and accepted what came with it, including the patching, the on-call, the upgrade weekends. That’s a legitimate tradeoff, and plenty of teams choose it deliberately because they want every dial in their own hands.

So the field divided itself into two worlds — managed SaaS and self-managed — and quietly assumed the operational model couldn’t cross between them. That assumption was always more about engineering effort than physics.

Scalability that scales with the enterprise, not just the cluster

When people in our space talk about scalability, they almost always mean compute. Can the cluster handle ten times the queries? Can the storage tier keep up? Fair questions, mostly solved. The kind of scalability I find more interesting now, and the kind that actually shows up in enterprise buying conversations, is organizational scalability. 

What is needed for organizational scalability?

It comes with a host of questions. Can your deployment model stretch to fit the org chart and the regulatory map of an actual large company? Can you stand up a cluster for a business unit with a tighter security posture than the rest of the company? Can you accommodate a jurisdiction with data residency rules that your current footprint doesn’t satisfy? Can you bring on an acquisition without asking them to re-architect their environment to match your platform’s assumptions? 

This is where the managed-vs-self-managed binary stops being useful, not because either side is wrong, but because a large enterprise rarely has only one shape of workload. The same regulated bank might run self-managed Trino for the workloads where it wants every dial in its own hands, lean on a fully managed SaaS path for the workloads where it wants speed and has no residency constraints, and still have a third bucket where neither answer fits: workloads that need managed operations, but whose underlying data legally can’t leave the corporate network.

What Starburst BYOC offers

BYOC is the third answer. The conversation I keep having goes something like this. A platform lead at a regulated enterprise already runs the analytics workloads that fit a managed SaaS pattern on one, keeps a second tier of workloads self-managed for control reasons, and has a third bucket they can’t place anywhere. Those workloads need managed operations, sit in the company’s own AWS account, and carry InfoSec sign-offs that won’t let data leave the VPC. The historical answer has been “self-manage it or push it to SaaS,” and for that third bucket, both options fail the test. The deal stalls, the project stalls, and the workload ends up running on whatever was already there, usually something nobody loves but nobody has time to replace.

Breaking it down

In this setup, the data plane is deployed in the customer’s AWS account: their EKS cluster, their VPC, and their cross-account IAM model, with encryption keys that live inside the account and never leave it. Production and sensitive workloads that legally or contractually can’t leave the corporate network stay where they’re already governed, and the compute runs alongside them rather than reaching in from outside.

The operational spine (upgrade orchestration, telemetry, billing, and identity) remains unified under the same Galaxy control plane that every other Galaxy cluster reports to. The result is one operational model and many deployment topologies, without forking the product into a second product that drifts away from the first one over the next four quarters.

And inside a single enterprise, all three answers (self-managed, SaaS, and BYOC) can sit side by side, each carrying the workloads it’s actually fit for. The point isn’t picking one. It’s that the deployment model should follow the workload, not the other way around.
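The three-bucket reasoning above can be sketched as a small placement rule. This is only an illustration: the `Workload` flags and the `place` function are hypothetical names invented here, not part of Galaxy, and a real placement decision would weigh far more than two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Deployment(Enum):
    SELF_MANAGED = "self-managed"
    SAAS = "saas"
    BYOC = "byoc"


@dataclass(frozen=True)
class Workload:
    # Hypothetical attributes, chosen for illustration only.
    name: str
    wants_full_control: bool          # team wants every dial in its own hands
    data_must_stay_in_account: bool   # residency or InfoSec constraint


def place(workload: Workload) -> Deployment:
    """Pick the deployment model that fits the workload, not the reverse."""
    if workload.wants_full_control:
        return Deployment.SELF_MANAGED
    if workload.data_must_stay_in_account:
        # Managed operations are wanted, but the data cannot leave the account.
        return Deployment.BYOC
    return Deployment.SAAS
```

The order of the checks encodes the argument in the text: self-managed is a deliberate choice for control, and BYOC covers the workloads that want managed operations but cannot move.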


Figure 1: Galaxy BYOC keeps the operational spine centralized while the data plane, compute, and encryption keys live entirely in the customer’s account.

Same product, different cloud, always your account

What does it all mean? The practical version of “the deployment model should follow the workload” is that nothing about how your team uses the product should change because of where it runs. A BYOC cluster shows up in Galaxy as just another cloud region, with the same console, catalogs, connectors, and access control as every other Galaxy cluster. Your data engineers don’t learn a second product. Your security team doesn’t review a second policy model. Your platform team doesn’t operate a second support relationship.

Security underpins the connection

What holds it together at the seam is an outbound-only secure tunnel between your account and the Starburst control plane. That means no inbound firewall holes and no PrivateLink overhead, just the same network pattern Galaxy already uses today. Push-based deployment from the control plane stands the whole thing up in roughly 30 minutes; you don’t run a local agent in your VPC, don’t own the upgrade cadence, and don’t lose the telemetry visibility that makes managed support actually feel managed. From the control plane’s perspective, a BYOC cluster is just another cloud region. From your perspective, it’s your infrastructure.
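One way to picture the outbound-only pattern is as a check over a set of network rules: the tunnel exists as an outbound rule, and no inbound rule exists at all. The sketch below is a toy model, not Galaxy's actual networking. The `NetworkRule` type, the placeholder host name, and the choice of port 443 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable

# Placeholder values; the real endpoint and port are not stated in this article.
CONTROL_PLANE_HOST = "control-plane.example.com"
TUNNEL_PORT = 443


@dataclass(frozen=True)
class NetworkRule:
    direction: str  # "inbound" or "outbound"
    peer: str       # host or CIDR on the other side of the rule
    port: int


def tunnel_is_outbound_only(rules: Iterable[NetworkRule]) -> bool:
    """True if the tunnel rule is present and nothing is allowed inbound."""
    rules = list(rules)
    has_tunnel = any(
        r.direction == "outbound"
        and r.peer == CONTROL_PLANE_HOST
        and r.port == TUNNEL_PORT
        for r in rules
    )
    has_inbound = any(r.direction == "inbound" for r in rules)
    return has_tunnel and not has_inbound
```

A security team could run a check of this shape against its own rule export to confirm that adding the deployment opened no inbound path.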

Centralized access, supporting decentralized data architecture 

The architectural reason this works at scale comes down to the components that stay centralized, including query routing, identity, billing, and the event bus. Distributing the operational spine would mean a separate billing pipeline per deployment, a separate identity provider per deployment, and a separate observability stack per deployment. A product that fractures into a slightly different shape every time someone installs it isn’t really a product anymore; it’s a family of bespoke deployments wearing the same name. Centralization is what lets the data plane go anywhere. Ironically, it is also what supports a decentralized data architecture.

Where we are today, and where we’re heading

It’s worth being honest about where the boundary currently sits. The control plane runs in a single AWS region today. For most regulated buyers, the binding requirement is that data and compute stay in their environment, and that’s fully delivered. For accounts with harder in-region requirements for the control plane itself, that work is on the roadmap, and we’re scoping it now with design partners. I’d rather name the gap than paper over it; a roadmap that acknowledges what it doesn’t yet do is the only kind worth trusting.

The AI flow that most enterprises can’t approve

All of this comes down to understanding the flow of data needed by AI in production. Most enterprise AI architectures, when you actually trace the data flows, move the data to the model. The model lives somewhere, for example, a hyperscaler inference endpoint or a vendor-managed API. Getting any value out of it requires shipping context to this model. For regulated enterprises, that flow hits the same wall as every other “send data outside the network” pattern. It doesn’t get approved, or it gets approved so narrowly that the use case stops being interesting.

A better way, made possible by universal data access

The more durable posture, and the one I think the industry is converging on, is the inverse. Bring the query interface to the data. Galaxy BYOC clusters are reachable via the Starburst MCP server, which means any MCP-compatible client — whatever model your security team has signed off on — can query governed data products directly against the BYOC cluster, inside the customer’s network boundary, under the customer’s IAM. 

The access-control model isn’t a second policy surface bolted on for AI. Instead, it’s the same Galaxy access control that governs every other query.

Starburst AIDA and BYOC

That’s also why AIDA, our agent layer, ships off by default on BYOC today. AIDA currently routes context through Starburst-managed infrastructure outside the customer’s account, and enabling it by default would quietly contradict the residency premise the customer bought BYOC to satisfy. The egress policy is a per-deployment toggle that the customer controls, and customers who’ve completed their AI risk review can turn it on deliberately. 
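The "off by default, on deliberately" behavior can be sketched as a per-deployment flag plus a guard on the egress path. This is a hypothetical model, not the product's configuration surface: `ByocDeployment`, `ai_egress_enabled`, and `route_agent_context` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ByocDeployment:
    # Hypothetical per-deployment settings, for illustration only.
    name: str
    ai_egress_enabled: bool = False  # off by default, as described above


def route_agent_context(deployment: ByocDeployment, context: str) -> str:
    """Refuse to send context outside the account unless the customer opted in."""
    if not deployment.ai_egress_enabled:
        raise PermissionError(
            f"AI egress is disabled for deployment {deployment.name!r}"
        )
    # In this sketch we only report where the context would go.
    return f"sent {len(context)} chars to vendor-managed infrastructure"
```

The design point is that the default protects the residency premise, and the opt-in is an explicit customer action rather than a side effect of enabling a feature.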

The roadmap closes the remaining gap with in-account model routing, so AI inference itself never leaves the customer boundary. It follows the same architectural principle as the data plane: keep what belongs to the customer in the customer’s account, and earn trust by not insisting on the parts of the stack they’ve already governed.

Follow the data

There’s a famous line from the 1976 film All the President’s Men: when you don’t know what’s actually driving a decision, follow the money. The infrastructure equivalent is one our industry has been working around for the better part of a decade: follow the data. Where it has to live, what governs it, what won’t let it move. The deployment model is supposed to bend around those answers, not the other way around.

Why portability matters more for AI than ever before

The reason portability matters more now than it did five years ago is that the data landscape has fragmented in ways that aren’t going to consolidate back. Enterprises have data held in multiple cloud platforms and on-premises systems. That data isn’t moving. Similarly, regulated environments place hard constraints on where compute can run, and that isn’t changing either. Mergers and acquisitions bring legacy stacks along with them. This, too, will not change. The bet that a single cloud’s native services would eventually subsume the rest, and therefore that infrastructure diversity was a transitional state on the way to consolidation, hasn’t aged well. The opposite, diversity, has won the day, not in spite of AI but because of it.

The emerging AI data architecture is diverse and powerful

The architectures that hold up under that reality have a few things in common. Open formats at the storage layer. A federated query engine that reaches data where it sits instead of demanding it move first — Trino, in our case, and the federation isn’t incidental; it’s the thing that makes “compute where the data is” actually viable across sources that don’t share a vendor. A deployment model that puts compute next to data, not next to the vendor. A managed operations layer that travels with the deployment rather than with the cloud provider. None of these are new ideas in isolation. What’s new is the willingness to combine them and ship the result as a single, supportable product.
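To make "reaches data where it sits" concrete: Trino addresses tables as `catalog.schema.table`, so a single query can join tables that live in different systems. The helper below only builds such a query as a string. The catalog, schema, and table names in the usage note are hypothetical, and the helper itself is an illustrative sketch rather than anything Starburst ships.

```python
import re

# Conservative identifier pattern so untrusted text is never spliced into SQL.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified(table: str) -> str:
    """Validate a fully qualified Trino table name: catalog.schema.table."""
    parts = table.split(".")
    if len(parts) != 3 or not all(_IDENT.fullmatch(p) for p in parts):
        raise ValueError(f"expected catalog.schema.table, got {table!r}")
    return table


def federated_join_sql(left: str, right: str, key: str) -> str:
    """Build a join across two tables that may sit in different catalogs."""
    if not _IDENT.fullmatch(key):
        raise ValueError(f"invalid join key: {key!r}")
    return (
        f"SELECT l.*, r.* "
        f"FROM {qualified(left)} AS l "
        f"JOIN {qualified(right)} AS r ON l.{key} = r.{key}"
    )
```

For example, `federated_join_sql("lake.sales.orders", "crm.public.customers", "customer_id")` yields one statement spanning a lakehouse catalog and an operational database catalog, with neither dataset copied first.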

BYOC is one concrete expression of that. The cage every cloud era eventually reveals is the one built out of the vendor’s assumptions about where your data is allowed to live. The way out of it isn’t a better cage. It’s an architecture that follows the data instead of asking the data to follow it.

We’ve spent a decade asking data to move toward our platforms. The next decade belongs to platforms that move toward the data.

Galaxy BYOC launched at AI + Datanova 2026. To learn more or explore a design-partner conversation, contact us.

 
