Starburst Galaxy: Engineering the Data Foundation Agentic AI Demands

Make your data foundation reproducible, self-maintaining, and ready for the agentic era

June 23, 2026

Zachary Hanson

Senior Product Manager

Starburst

The conversation around enterprise AI has changed. We have moved past the phase of experimentation and demos, and into the harder, more consequential phase of production. Because of this, the question is no longer whether AI can impress in a sandbox. The question is whether AI can be trusted to act within the business, at scale, every day.

That shift raises an uncomfortable question for most organizations. What does it actually take for AI to succeed in production? After watching enterprises deploy real agentic workloads, the answer has become clear, and it is not the one the market spent the last two years obsessing over. As we have explored in depth, what enterprise agentic AI needs to succeed in production comes down to three things.

Is the AI accurate, returning the right answer?
Is it consistent, returning that right answer more than once?
And is it auditable, so others can understand how it reached the answer at all?

All three of those depend on the same thing, and it is not the model. An agent does not hallucinate because the model is weak. It hallucinates because the data beneath it is stale, the definitions beneath it are inconsistent, or the path it took cannot be traced. That is a problem of agent grounding rather than model quality. The model was rarely the bottleneck. It was the data all along.

Let’s dive in.

Why the data foundation is the real bottleneck for enterprise AI

Our last Galaxy release focused on the agentic interface itself. But an interface is only as good as what sits beneath it. AI relies on data, especially contextual data, to return accurate results. This release focuses squarely on the data foundation itself.

The problem is that most foundations were never built to carry AI. Enterprises are pouring budget into frontend AI tools while the foundation underneath them stays fragmented, manually managed, and impossible to reproduce. A foundation like that does not just slow AI down. It is where AI initiatives quietly break.

So what does a foundation built to carry AI actually look like? It comes down to four things.

It can be reproduced exactly, so an agent behaves the same way today and tomorrow.
It keeps itself healthy, so the tables an agent reads stay clean and query-ready.
It is fed quickly and affordably, so the data is fresh enough to trust.
It stays stable under unpredictable load, so answers arrive consistently.

Reproducible, self-maintaining, well-fed, and stable. Those are the four properties this release delivers, and each maps back to whether your AI is accurate, consistent, and auditable.

None of that builds itself.

Every single part of this architecture has to be engineered, a point our CEO Justin Borgman has made about engineering the AI leap across the enterprise. As a fully managed, cloud-native platform, Galaxy delivers this foundation without handing your team a pile of infrastructure to run.

Let’s take a look at what’s new.

A data foundation you can manage as code

For years, the most consequential parts of a data platform have been the least repeatable. Environments get configured by hand through UI workflows, governed data assets get assembled field by field, and there is no clean record of what changed, who changed it, or how to stand the whole thing up again somewhere else. That is tolerable when a few analysts run reports against it, but it becomes a serious liability when AI agents are making decisions on top of it, and a regulator, or your own team, needs to understand why.

To fix this problem, this release brings the discipline of software engineering to the data foundation itself in the following two areas.

Infrastructure as Code with the Terraform Provider for Galaxy

The Terraform Provider for Galaxy is now generally available, letting teams provision and manage their Galaxy environment entirely as Infrastructure as Code. Catalogs, clusters, roles, and other resources can be defined declaratively, version-controlled, and rolled out through the same CI/CD pipelines teams already use for application code.

Instead of clicking through setup and hoping staging matches production, you describe the platform you want and apply it, with state management, dependency resolution, and a full history of every change. The provider is available through the Terraform Registry and included with Galaxy at no additional cost.
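To make the declarative model concrete, here is a minimal Python sketch of the "plan" step that Infrastructure as Code tools perform: compare the state you want with the state that exists and derive the changes. It is an illustration of the idea only, not the Terraform Provider for Galaxy. The resource names and attributes below are hypothetical.

```python
def plan(desired: dict, current: dict) -> dict:
    """Diff desired vs. current state into create/update/delete actions."""
    create = sorted(name for name in desired if name not in current)
    delete = sorted(name for name in current if name not in desired)
    update = sorted(
        name for name in desired
        if name in current and desired[name] != current[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical resources, for illustration only.
desired_state = {
    "cluster.analytics": {"size": "medium", "min_workers": 2},
    "role.analyst": {"privileges": ["select"]},
}
current_state = {
    "cluster.analytics": {"size": "small", "min_workers": 2},
    "role.legacy": {"privileges": ["select", "insert"]},
}

changes = plan(desired_state, current_state)
print(changes)
```

Because the desired state lives in a file, the same diff can be reviewed in a pull request before anything is applied, and applying it a second time produces no changes.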

Data Products as Code, the contract your AI reasons from

Data Products as Code extends that same principle to the governed data layer, where it arguably matters most. A data product is the execution-time contract between your business and your AI. It defines which tables are authoritative, how metrics are calculated, and which business rules apply, so an agent inherits real meaning rather than guessing.

With Data Products as Code, that contract becomes software. The capability will be available near the end of July. When it launches, Galaxy data products will be represented as self-contained, human-readable YAML files, exported from the UI or API, committed to Git, reviewed in a pull request, and imported into any environment. A new CLI tool lets teams scaffold, lint, and validate data products offline, catching errors before they ever touch a cluster, and imports land in a draft state for review before publishing. Code-first engineers get full CLI and API control, UI-first teams keep working exactly as they do today, and data stewards finally get the version history, rollbacks, and audit trail they have always needed. For a refresher on how a data product evolves from creation to retirement, see the data product lifecycle.
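The kind of offline check described above can be sketched in a few lines of Python. This is not the Galaxy CLI, and the schema is an assumption: the field names (name, owner, datasets, query) are hypothetical stand-ins for whatever the YAML format defines. The definition is shown as an already-parsed dictionary to keep the example free of third-party dependencies.

```python
REQUIRED_FIELDS = ("name", "owner", "datasets")  # hypothetical schema


def lint_data_product(product: dict) -> list:
    """Return a list of readable problems; an empty list means the definition passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not product.get(field):
            problems.append(f"missing required field: {field}")
    seen = set()
    for dataset in product.get("datasets") or []:
        ds_name = dataset.get("name", "")
        if not ds_name:
            problems.append("dataset without a name")
        elif ds_name in seen:
            problems.append(f"duplicate dataset name: {ds_name}")
        seen.add(ds_name)
        if not dataset.get("query"):
            problems.append(f"dataset {ds_name or '<unnamed>'} has no query")
    return problems


# A draft with three deliberate mistakes.
draft = {
    "name": "customer_revenue",
    "owner": "",
    "datasets": [
        {"name": "monthly_revenue", "query": "SELECT 1"},
        {"name": "monthly_revenue", "query": ""},
    ],
}
issues = lint_data_product(draft)
for issue in issues:
    print(issue)
```

Running a check like this in CI means a broken definition fails the pull request rather than the import.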

Together, these two capabilities mean the foundation an agent reasons over is no longer a hand-built artifact you hope you can reproduce. It is defined, reviewable, and reproducible, the same way every other piece of serious infrastructure already is. That reproducibility is what makes an agent’s behavior consistent across environments, and that audit trail is what makes its decisions defensible after the fact.

A data foundation that stays healthy on its own

A data foundation is not built once and left alone. Apache Iceberg tables accumulate small files, orphaned snapshots, and bloated metadata over time, and left unmanaged, that decay quietly degrades query performance and inflates storage costs. If you are new to how Iceberg and Trino combine into a managed data lakehouse, our explainer on the Icehouse architecture is a good primer. As we have also written about keeping Iceberg easy on Galaxy, regular maintenance is what keeps a data lakehouse running at its best. The traditional answer has been to ask data engineers to write and babysit their own maintenance scripts, turning skilled people into janitorial staff for the data lakehouse.

Serverless Icehouse Table Maintenance

Icehouse Table Maintenance (LakeOps), now in Public Preview for Galaxy, removes that burden. Using a serverless architecture, Galaxy automatically runs the critical upkeep that keeps Iceberg tables in peak condition, including file compaction, manifest rewrites, and snapshot expiration, with no external orchestration and no clusters to provision for the job.
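For readers who have not written compaction logic by hand, the sketch below shows the core idea behind file compaction: pack many small files into groups that each rewrite into one file near a target size. It uses a simple first-fit-decreasing bin-packing heuristic and is an assumption-laden illustration, not Galaxy's implementation. The 512 MB target and the file sizes are made up.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Group small files into rewrite batches of at most target_mb each."""
    bins = []  # each bin lists the file sizes that would be rewritten into one file
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough, leave it alone
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            bins.append([size])  # no existing group had room
    # A group of one file gains nothing from being rewritten.
    return [group for group in bins if len(group) > 1]


sizes = [600, 300, 200, 100, 64, 32, 8]  # hypothetical data file sizes in MB
groups = plan_compaction(sizes)
print(groups)
```

In this example seven data files become three: the 600 MB file is untouched and the six smaller files are rewritten into two. Fewer, larger files mean fewer objects to open and less metadata to scan per query.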

Observability you can verify

Just as important, this release pairs that automation with visibility. A new Maintenance Observability Dashboard shows the real-time impact of every maintenance run. Teams can see the reduction in file counts and metadata overhead, the storage space reclaimed, the resulting cost savings, and detailed logs for every background task. The result gives teams automated maintenance without sacrificing control or insight, enabling them to move from reactive troubleshooting to proactive, verifiable data health. For the mechanics of why this upkeep matters, see our deeper dive on automated table maintenance for Apache Iceberg.

This is foundation work in the most literal sense. An agent is only as accurate as the tables it reads. Tables that are continuously kept clean, compact, and query-ready are tables an agent can actually trust.

A data foundation fed by all your data

Before data can be useful to anything, analytics or AI, it has to land. Getting data into Iceberg has been one of the most stubborn bottlenecks data engineers face, as the manual work of discovering, formatting, and loading files produces fragile pipelines and constant maintenance overhead. Galaxy’s Managed Ingest already handles JSON across both streaming and batch, but real enterprises run on far more than JSON.

New CSV and Avro support in Managed Ingest

This release expands Managed Ingest with Public Preview support for CSV and Avro. For batch file ingestion, Galaxy now automatically discovers CSV files in cloud storage, infers their structure, and loads them into Iceberg with no custom code. For streaming, new Avro support integrates with the Confluent Schema Registry to handle deserialization automatically and, crucially, to absorb schema changes as they happen rather than breaking the pipeline when a field shifts. Throughout, Galaxy’s serverless architecture manages exactly-once delivery and sub-minute latency, and the table maintenance described above keeps the resulting tables healthy on the way in.
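To show what inferring structure from a CSV file involves, here is a minimal Python sketch that picks the narrowest column type fitting every value. It is not how Galaxy performs inference. The function names and sample data are hypothetical, and the type labels simply borrow Trino's bigint, double, and varchar names.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty value in a column."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "varchar"  # nothing to go on, fall back to text

    def fits(cast):
        for v in present:
            try:
                cast(v)
            except (TypeError, ValueError):
                return False
        return True

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "varchar"


def infer_schema(csv_text):
    """Map each header column to an inferred type."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}


sample = "order_id,amount,region\n1,19.99,emea\n2,5,apac\n3,,amer\n"
schema = infer_schema(sample)
print(schema)
```

Even this toy version has to decide how to treat empty values and mixed integer and decimal columns, which hints at why hand-rolled loaders tend to become fragile.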

What the ingestion benchmarks show

The performance behind this is not incremental. In independent benchmarking from Concurrency Labs, Galaxy delivered roughly 7x the record ingestion rate of AWS Data Firehose and Confluent Tableflow. Further, ingestion costs came in 87% lower than Firehose and 81% lower than Tableflow at the tested throughput, because Galaxy produces smaller, more compressed Iceberg files rather than simply copying data into place. We have covered why Galaxy delivers superior data ingestion in more detail separately. A foundation for AI has to be fed continuously and affordably, and this is what that looks like.

A data foundation that stays stable and predictable

The final property of a production-grade foundation is one teams only notice when it is missing, and that is stability under unpredictable load. AI and analytics workloads are bursty and varied by nature, a mix of interactive, batch, and high-concurrency demands hitting the platform at once. Traditional load balancing, whether static or round-robin, ignores what is actually happening inside the clusters. That leads to queuing, uneven performance, and the temptation to over-provision just to be safe.

Smart Load Balancing routes based on real-time queue depth

Smart Load Balancing, reaching general availability for Galaxy, replaces that guesswork with intelligence. It continuously evaluates real-time system load and runtime telemetry, including how deep each cluster’s queue already is, then routes each query to the cluster best able to handle it at that moment. Routing on live queue depth is the difference between sending work to a cluster that looks available and one that actually is.
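The difference between blind rotation and load-aware routing is easy to see in a small simulation. The Python sketch below compares round-robin with a router that picks the shallowest queue. It is a simplification under stated assumptions: Smart Load Balancing evaluates more telemetry than queue depth alone, and the cluster names and starting depths here are hypothetical.

```python
from itertools import cycle


def route_round_robin(clusters):
    """Return a picker that rotates through clusters regardless of their load."""
    rotation = cycle(sorted(clusters))
    return lambda queue_depths: next(rotation)


def route_least_queued(queue_depths):
    """Pick the cluster with the shallowest queue right now (name breaks ties)."""
    return min(queue_depths, key=lambda name: (queue_depths[name], name))


def simulate(pick, start_depths, queries):
    """Send queries one by one, each adding one entry to the chosen queue."""
    depths = dict(start_depths)
    for _ in range(queries):
        depths[pick(depths)] += 1
    return depths


# Hypothetical starting point: cluster-a is already backed up.
start = {"cluster-a": 12, "cluster-b": 0, "cluster-c": 3}

blind = simulate(route_round_robin(start), start, 6)
aware = simulate(route_least_queued, start, 6)
print("round-robin :", blind)
print("least-queued:", aware)
```

In this run, round-robin keeps sending work to the cluster that is already backed up, pushing its queue from 12 to 14, while the queue-aware router leaves it alone and spreads the six new queries across the two clusters with spare capacity.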

The result is lower latency, better utilization across multi-cluster deployments, minimal queuing during autoscaling events, and lower infrastructure costs, because you are no longer over-provisioning to cover for blunt routing. It also retires the manual workload-isolation strategies teams have leaned on for years.

Predictable, consistent performance is not a luxury when agents are the ones issuing the queries. A platform that gives an agent a fast, reliable answer one moment and a queued, degraded one the next is not a foundation you can build trusted automation on. Stability under load is what makes consistency possible at scale.

Engineering the leap, not waiting for it

Step back from the individual features, and a single picture comes into focus.

A data foundation you can define and reproduce as code.
A data lakehouse that keeps itself healthy.
Ingestion that is fast, affordable, and continuous.
Performance that holds steady no matter what is thrown at it.

These are not four unrelated improvements. They are four properties of a single data foundation, and together they answer what production AI actually requires. This means data that’s clean and current enough to be accurate, a platform reproducible and stable enough to be consistent, and a foundation governed and version-controlled enough to be auditable.

This is the part of the AI story the market keeps skipping. The breakthrough that unlocks enterprise AI was never going to be a better model. It was always going to be a foundation solid enough to trust, and that foundation has to be deliberately engineered. The same logic underpins the broader shift from BI dashboards to AI decisions, where trusted answers depend on trusted data underneath. With this release, Galaxy gives teams a managed, cloud-native way to do exactly that, without standing up infrastructure or hiring a platform team to keep it running.

The organizations that win the next phase of AI will not be the ones waiting on the perfect model. They will be the ones who engineered the foundation underneath it.

Documentation and further reading

For teams ready to put these capabilities to work, the Starburst Galaxy documentation covers each one in depth.

The Terraform Provider for Galaxy documentation walks through provider setup, authentication, and the resources you can manage as code.
The Data Products documentation covers how to create, manage, and govern data products in Galaxy.
The data maintenance documentation explains table maintenance and the observability metrics that track it.
The data ingest documentation details how to configure file and streaming ingestion into managed Iceberg tables.

Frequently asked questions

What is a data foundation for AI?

A data foundation is the layer of governed, accessible, well-maintained data that analytics and AI workloads depend on. For enterprise AI, it determines whether an agent’s answers are accurate, consistent, and auditable. The model sits on top, but the foundation is where trust is won or lost.

How does a data foundation relate to the context layer?

The context layer gives AI the business meaning behind the data, including definitions, metrics, relationships, and governance rules. The data foundation is what that context is applied to and served from. Governed data products are where the two meet, carrying both the data and the business context that an agent needs to reason reliably.

Why does the data foundation matter more than the AI model?

In production, most agent failures trace back to data that is stale, inconsistently defined, or impossible to audit, not to the model itself. A reproducible, self-maintaining, well-governed data foundation is what makes enterprise AI trustworthy enough to act on.

Start for free with Starburst Galaxy. Try our free trial today and see how you can build the data foundation your AI strategy depends on.
