
Before I worked in data infrastructure, I was a customer using data infrastructure.
I spent years at one of the largest financial institutions in the world. The kind of place where the corporate data platform isn’t just a product, but an ecosystem that has evolved over many decades, spanning multiple data technologies.
What does legacy data infrastructure look like in practice?
In practice, this means many things, not all of them good. Hundreds of clusters. Teradata instances that have been running since before the cloud existed. Oracle databases underpinning core operations. PostgreSQL clusters spun up by teams who needed something fast ten years ago. Object storage in multiple clouds. On-premises systems that will never move because regulators said so.
I tell people this because when I hear conversations about making data AI-ready, I think about what that actually looks like inside a place like that. The honest answer is that most of the infrastructure that AI needs to work well doesn’t exist yet, and the gaps are not where people think they are.
This post unpacks the implications of that and explores why context and data access are more important than ever.
AI success is about far more than models
The models are fine, of course. But I’m not worried about the models. What I’m worried about is everything underneath them.
When an AI system queries your data, whether it’s generating a report, answering a natural-language question, or feeding a pipeline, it’s making a set of assumptions. It assumes the table it’s reading is the right one. It assumes the data is current. It assumes the schema means what it looks like it means. It assumes that if two tables share a column name, they’re describing the same thing.
In a single, well-curated data warehouse, those assumptions mostly hold. In a large, federated enterprise, the kind where data lives in fifteen different systems across three continents, they don’t. And when they don’t, the AI doesn’t fail loudly. It returns something that looks right. Formatted well. Confident. Wrong.
I’ve watched this happen. An analyst asks a question, gets an answer that’s technically correct for one business unit but completely misleading for another, because the system has no idea that the same table name means different things in different domains. There’s no error. There’s no warning. Just a plausible answer that quietly sends someone in the wrong direction.
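To make that concrete, here’s one minimal guardrail: a collision check over catalog entries that flags table names registered in more than one domain, so a query system can refuse or ask for clarification instead of guessing. Everything here (the catalog shape, the field names, the domains) is invented for illustration, not any particular product’s API.

```python
# Hypothetical catalog entries: the same table name registered by two
# business units with different meanings.
catalog = [
    {"table": "revenue_daily", "domain": "retail_banking",
     "description": "Net revenue after fees, EUR"},
    {"table": "revenue_daily", "domain": "wealth_management",
     "description": "Gross revenue before fees, USD"},
    {"table": "customers", "domain": "retail_banking",
     "description": "Active retail customers"},
]

def ambiguous_tables(entries):
    """Return table names that appear in more than one domain.

    A system answering "what was revenue yesterday?" should stop and ask
    which domain the user means when the resolved name is ambiguous.
    """
    domains_by_table = {}
    for entry in entries:
        domains_by_table.setdefault(entry["table"], set()).add(entry["domain"])
    return {t: sorted(d) for t, d in domains_by_table.items() if len(d) > 1}

print(ambiguous_tables(catalog))
# {'revenue_daily': ['retail_banking', 'wealth_management']}
```

The check is trivial once the entries are in one place. The hard part, as the rest of this post argues, is that they usually aren’t.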
Mapping what’s actually missing
How do we approach the problem differently? Here’s the main thing I’ve learned, first as someone consuming data in a regulated enterprise, and now as someone building the integrations that connect these systems.
The metadata you’d need to prevent those failures is either scattered, incomplete, or doesn’t exist. Let’s unpack this, because it typically takes a few different forms.
The technical catalog
This is the first place where context is lost. The technical catalog records which tables exist, what columns they have, and what types those columns are. That information is usually available, but not through any single interface. You’re stitching together SQL queries against information schemas, REST API calls, and metastore lookups.
No one call gives you the full picture. It works, but it’s a patchwork, and patchworks have seams.
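Here’s what the stitching looks like in miniature, with both sources mocked as plain Python structures standing in for an information_schema query result and a metastore response. All names and fields are invented:

```python
# Mocked result of: SELECT column_name, data_type FROM information_schema.columns ...
sql_columns = [
    {"column_name": "account_id", "data_type": "bigint"},
    {"column_name": "balance", "data_type": "decimal(18,2)"},
]

# Mocked response from a metastore REST endpoint for the same table.
metastore_columns = {
    "account_id": {"comment": "Primary account key"},
    "balance": {"comment": ""},               # comment never filled in
    "region": {"comment": "Booking region"},  # known here, missing from the schema dump
}

def stitch(sql_cols, meta_cols):
    """Merge the two views into one column list, and surface the seams."""
    merged, seen = [], set()
    for col in sql_cols:
        name = col["column_name"]
        seen.add(name)
        merged.append({
            "name": name,
            "type": col["data_type"],
            # Treat empty comments as missing context.
            "comment": meta_cols.get(name, {}).get("comment") or None,
        })
    # Columns one system knows about and the other doesn't: a seam.
    orphans = sorted(set(meta_cols) - seen)
    return merged, orphans

columns, orphans = stitch(sql_columns, metastore_columns)
print(orphans)  # ['region'] -- the two sources already disagree
```

Even in this toy version, the two sources disagree about which columns exist, and half the comments are empty. Scale that to hundreds of clusters and the seams multiply.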
Business context
Business context (who owns a dataset, what it’s actually used for, who to call when it breaks) almost never exists as structured data in the query engine. It lives in wikis that haven’t been updated in two years. In Slack threads. In the heads of people who’ve been at the company long enough to remember why a table was created. AI can’t read any of that.
Lineage
Data lineage describes how data flows from sources to destinations. Capturing it is technically possible if you’ve configured the right event listeners and your queries happen to be the kind that those listeners capture. But column-level lineage across federated sources? That’s not a service you can turn on. It’s a deployment pattern that requires configuration, has limitations, and covers a subset of your operations. Most teams I’ve worked with have partial lineage at best.
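For reference, table-level lineage events in the shape popularized by the OpenLineage standard look roughly like this. This is a simplified sketch, not the full specification: real deployments attach many more facets, and the namespaces and job names here are invented.

```python
import json
import uuid
from datetime import datetime, timezone

def lineage_event(job_name, inputs, outputs):
    """Build a minimal OpenLineage-style run event (table-level only)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example_pipelines", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }

event = lineage_event(
    "daily_balance_rollup",
    inputs=[("oracle://core", "accounts"), ("postgres://risk", "limits")],
    outputs=[("s3://lake", "balances_daily")],
)
print(json.dumps(event, indent=2))
```

Note what the event doesn’t contain: which output columns came from which input columns. Answering that requires parsing the query itself, which is exactly why column-level coverage stays partial.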
Access control
Access control (who’s allowed to see what) is enforced. But the enforcement is distributed across multiple systems. There’s no unified view that says, “here are all the permissions on all the objects.” If someone asks which AI pipelines touch sensitive data, the honest answer at most enterprises is: we’re not entirely sure.
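Building that unified view starts with pulling grants out of each enforcement point and inverting them into a single index from principal to sensitive objects. A toy sketch, with the systems, service accounts, and sensitivity tags all invented:

```python
# Per-system grants, as each engine reports them (mocked).
grants = {
    "teradata":  [("svc_reporting", "finance.ledger")],
    "postgres":  [("svc_ml_features", "crm.customers")],
    "lakehouse": [("svc_ml_features", "lake.transactions")],
}

# Sensitivity classifications, which live in yet another system (mocked).
sensitive = {"crm.customers", "finance.ledger"}

def pipelines_touching_sensitive_data(grants, sensitive):
    """Invert per-system grants into: principal -> [(system, sensitive object)]."""
    exposure = {}
    for system, pairs in grants.items():
        for principal, obj in pairs:
            if obj in sensitive:
                exposure.setdefault(principal, []).append((system, obj))
    return exposure

print(pipelines_touching_sensitive_data(grants, sensitive))
# {'svc_reporting': [('teradata', 'finance.ledger')],
#  'svc_ml_features': [('postgres', 'crm.customers')]}
```

The inversion is a few lines. The real work is the part this sketch mocks away: extracting grants in a consistent shape from systems that each expose them differently.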
None of this is a failure of any one product. It’s the natural state of a large organization that’s been building data infrastructure for twenty years. But AI doesn’t know that. AI treats your entire landscape like it’s a clean, well-modeled warehouse. Nobody tells it otherwise.
Why accessing context is harder than it looks
When I was on the customer side, I assumed the fix was straightforward. Get a good catalog, connect everything, and done.
Now that I’m on the other side, building those connections, I understand why it takes so long. The challenge isn’t writing the integration. It’s that the systems you’re connecting were never designed to share context with each other. They share data; bytes move across JDBC connections and REST endpoints just fine. But meaning? Ownership? Lineage? Trust? Those were never part of the protocol.
Why data federation alone isn’t enough
A query engine can federate a join across five data sources. It can do that in seconds. What it can’t inherently tell you is that one side of that join is a curated data product, published by a platform team from a cloud environment, consumed by on-premises clusters, and that the governance tags on that product should follow the data wherever it goes. That semantic layer, the meaning behind the data, is exactly what AI needs to be trustworthy. And it’s exactly what’s missing from the wiring.
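Mechanically, “the tags should follow the data” means something simple: anything derived from governed sources inherits the union of their tags. A sketch with invented table names and tags:

```python
# Governance tags on the two sides of a federated join (invented).
source_tags = {
    "products.customer_360": {"pii", "eu_resident"},
    "staging.web_events": set(),
}

def result_tags(join_inputs, source_tags):
    """A derived result inherits the union of its inputs' governance tags."""
    tags = set()
    for table in join_inputs:
        tags |= source_tags.get(table, set())
    return tags

tags = result_tags(["products.customer_360", "staging.web_events"], source_tags)
print(sorted(tags))  # ['eu_resident', 'pii']
```

The propagation rule is the easy part. The missing piece is the wiring that carries those tags across system boundaries in the first place, so the rule has something to operate on.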
The hardest part of this work isn’t technical complexity. It’s the discovery that you’re building connective tissue between systems that have never had a reason to communicate about meaning, only about bytes.
Why I think we should still be optimistic about context
I realize I’ve painted a bleak picture, so let me tell you why I’m actually hopeful.
The problems I described (fragmented metadata, missing lineage, disconnected business context) aren’t unsolvable. They’re just underinvested in. For a long time, metadata was an afterthought. You built the pipeline, you shipped the dashboard, and you moved on. Nobody cared about the semantic layer because nobody was building systems that depended on it.
AI changed that. When you have a system that needs to understand your data, not just move it, the gaps become obvious. And obvious gaps are the ones that get funded.
I’m seeing this now from the inside. Catalog teams are building richer context layers. Integration partners are investing in open standards like OpenLineage to make lineage portable across systems. Customers are starting to treat data product definitions and domain ownership as real infrastructure, not just governance theater. The conversations I’m having today are about how to model meaning across federated systems. These weren’t happening two years ago.
The enterprises that will get the most out of AI aren’t the ones with the fanciest models. They’re the ones that do the quiet, foundational work of connecting their metadata. Mapping ownership. Closing lineage gaps. Making sure their AI systems know not just where the data is, but what it means and whether it should be trusted.
That work is slow. It’s unglamorous. But it compounds.
I’ve seen both sides of this. The frustration of being a customer who needs these answers and can’t get them, and the satisfaction of being on the team building the connective tissue. We’re not done, not even close. But we’re moving in the right direction, and I think the next few years will be remarkably productive for anyone working at this layer.
The metadata problem is real. But it’s also solvable. And the people working on it across catalog companies, query engines, and data teams within enterprises are among the most thoughtful engineers I’ve ever worked with. That gives me a lot of confidence.



