
What does it take for artificial intelligence (AI) initiatives to succeed in production?
It’s a question worth asking, and the answer increasingly points to one particular blocker: context. Let’s take a look.
Where AI is failing in production due to a lack of context
Let’s start with what isn’t the problem but feels like it would be.
Many teams assume the main bottleneck is the Large Language Model (LLM) they’re using. That’s reasonable, and reflects the central role that models play in AI generally.
More often than not, however, the model isn’t the issue. The real issue is context.
In AI, context is the hidden king that separates success from failure. Without it, AI struggles to return results that fit the expectations of the business.
The race for a context layer
What’s being done about this? A lot, actually.
AI’s ravenous need for context isn’t a secret. In fact, it’s creating something of a gold rush around contextual data, as the industry comes to recognize context as the key bottleneck in achieving AI value.
To address this, the race is on to provide a context layer capable of bringing the semantic meaning that AI models need to supplement their basic training. The resulting semantic layer is fast becoming the missing link between models, hardware, and success in AI production. Without a strong context layer, your AI and agentic solutions will flounder.
In this article, we’ll examine why your data strategy needs to evolve, and what changes you can adopt to build a strong data foundation for a successful AI strategy and agentic workloads.
Why AI fails in production without context
To understand the problem fully, it’s worth asking why context is so important for LLMs. The answer is interesting.
LLMs, by their nature, do a great job of parsing existing artifacts, such as natural human language, audio, and video. But when it comes to creating outputs specific to your business’s unique use cases, from customer experience personalization to demand forecasting, they quickly fall prey to hallucination or return outdated information.
Hallucination is part of the reality of LLM technology, and context is the solution
Importantly, this isn’t a design flaw. LLMs and the algorithms behind them are trained on massive, generalized sets of years-old data. They lack the real-time, domain-specific context that enterprise AI systems need to deliver accurate, real-world results.
Think of this as a twist on the old adage “Garbage In, Garbage Out” (GIGO). In the case of AI, this might be phrased better as “Generic In, Generic Out”. Generalized inputs result in generalized responses. Poor data quality, in particular, leads to AI outputs that are unreliable or misleading.
And where do you find the best context? Much of it is spread across your entire data estate. To be useful, that context needs to be exposed and made available to AI applications along with rich semantic metadata.
How do you do that? Read on.
What existing data architectures need to support AI
Building this context layer requires a data architecture that can surface data from anywhere in the enterprise and ground AI responses in it. Fortunately, this doesn’t mean building a separate stack from your analytics data architecture. Rather, it means enhancing your existing data infrastructure so it can handle both.
This raises one important question. What’s missing from your existing stack?
Is your existing data architecture built for exploration?
One important consideration surrounds data exploration. The normal situation looks something like this.
- Most analytics workloads are created for a concrete purpose.
- An executive or a team needs a report or a dashboard.
- You then source the data and create a gold-standard dataset that supports that request.
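In SQL terms, that request-driven workflow typically ends in a one-off curated table. A minimal sketch, with hypothetical catalog, schema, and table names:

```sql
-- Hypothetical example: materialize a purpose-built "gold" dataset
-- that answers a single dashboard request (names are illustrative).
CREATE TABLE lake.gold.daily_sales AS
SELECT
  order_date,
  region,
  SUM(total) AS revenue
FROM lake.silver.orders
GROUP BY order_date, region;
```

The resulting table answers the original question well, but nothing outside its scope, which is exactly the limitation described below.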
The problem is that, with AI-powered agents, you don’t know what data you need before you need it. This can expose gaps in your context layer when the time comes to delve deeper into one particular area than expected.
To fix this, you need two things:
- A broad ground layer of context to cover all bases
- The ability to explore that context easily
AI development is an iterative process. You need to incorporate iterative cycles to address data quality issues, respond to model updates, and pivot quickly in response to changing business conditions.
How Starburst helps find the context AI needs
Luckily, Starburst can help in both cases.
The first part is what Starburst was founded to do: provide universal data access across data sources. Providing fast access to enterprise data across the organization gives data engineering teams the freedom to experiment. Using this approach, they can find relevant data and decide which datasets work best and for which approaches, e.g., using Retrieval-Augmented Generation (RAG) vs. fine-tuning.
The second part is really a function of how Starburst works, namely, using data federation. Engineering teams can leverage distributed access to experiment with data before they make any major architectural commitments about where they’ll ultimately store that data.
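In Starburst (and open-source Trino), that federation surfaces as ordinary SQL spanning multiple catalogs. A hedged sketch, with hypothetical catalog, schema, and table names standing in for a PostgreSQL operational database and a Hive/S3 data lake:

```sql
-- Join live data across two systems without moving or copying it first.
SELECT
  c.name,
  SUM(o.total) AS lifetime_value
FROM postgresql.crm.customers AS c
JOIN hive.sales.orders AS o
  ON o.customer_id = c.id
GROUP BY c.name;
```

Because nothing is copied up front, teams can experiment with joins like this before committing to where the data should ultimately live.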
Data silos starve AI agents of context
That raises a larger problem: how do you find that data in the first place? The answer comes back to data silos, another problem Starburst specializes in fixing.
Data silos, where data is stored in disparate storage services and in multiple (often incompatible) formats, have always stymied analytics and other data workloads. They’re just as lethal for AI because they prevent teams from utilizing critical context.
Importantly, this isn’t anyone’s fault. It’s often the result of businesses moving fast and giving teams autonomy to make business decisions. It’s also a scalable way to grow, as it enables teams to own and manage their own data rather than depend on a centralized data engineering team to do it for them.
These are all good things. But the advent of AI makes it more important than ever to build bridges between these islands of data and connect them to the corporate mainland. Without data integration across siloed sources, AI systems can’t assemble the full picture they need for decision-making.
The solution, again, is data federation. Using federation, you can create a data model where all data sources are accessed through a centralized point of access. This balances local data ownership with centralized discovery and data governance, leading to a situation where:
- Teams can own and control their own data, and make decisions quickly.
- Everyone can discover what data other teams own and leverage it themselves.
- The organization can govern all data centrally through governance policies, ensuring it conforms to all security and compliance guidelines.
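Depending on the connector and governance backend, some of that central control can be expressed directly in SQL. A sketch with hypothetical role and table names:

```sql
-- Grant a central analyst role read access to one team's dataset,
-- without transferring ownership of the underlying source.
GRANT SELECT ON hive.sales.orders TO ROLE analyst;
```

In practice, enterprise deployments typically layer richer, attribute-based policies on top of statements like this.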
Why data centralization is still not the answer to the context problem in the AI era
It’s worth reflecting on a classic cautionary tale in the data world. In the old days, the sage advice was to centralize all the data you needed for analytics projects. The modern data warehouse was born of these assumptions, with its focus on Extract, Load, and Transform (ELT) operations and centralized data pipelines to enable late-bound decisions and to utilize data for a variety of use cases.
There are two problems with this approach, and both of them cause as much of a problem for AI as they did for analytics. The first is that it never really worked. Most projects that promote complete centralization either take too long to complete or fail entirely.
The second is that it’s impossible in the modern data ecosystem. Data volumes are growing exponentially. By the time you’ve centralized the data you need, you no longer need it.
This is a basic physics problem. You don’t have the people, capacity, and time to have everything you need in one place. Building scalable data management practices through federated data access is the only way forward.
The solution is choice between access and centralization
The solution is choice. It’s not that you should never centralize data. Rather, you should adopt selective centralization that can adapt to your AI data strategy over time. Make distributed data the default, then move data into a modern, open table format in a high-speed data lake or lakehouse when performance demands it.
How do you do this in production? Here, the landscape is clearer. The emerging standard for this is Apache Iceberg. With its support for high-performance, high-volume tables, rich metadata, and advanced features like time travel, Iceberg is becoming the data format of choice for the context layer for AI. Its support for data lineage and validation also helps teams build trust in the datasets that feed AI models and machine learning workflows.
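Time travel, for example, is exposed directly in query syntax. A sketch against a hypothetical Iceberg table, using Trino-style syntax:

```sql
-- Query the table as it existed at a specific point in time,
-- useful for reproducing the exact inputs an AI model saw.
SELECT *
FROM iceberg.analytics.orders
FOR TIMESTAMP AS OF TIMESTAMP '2025-01-01 00:00:00 UTC';
```

Snapshot-based features like this are what make Iceberg datasets auditable and reproducible enough to trust as AI context.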
Data architectures lack a governed context layer
The overall takeaway here is that you need a data architecture that fits your actual business, not the other way around. Nowhere is this more true than in the area of data governance. Having access to raw data doesn’t necessarily make it easy to manage or govern. Raw data still needs to be:
- Transformed, cleaned, and maintained over time
- Made usable via defined accessibility patterns (SQL, JSON exports, API calls, etc.)
- Secured via access controls and compliance rules to ensure appropriate usage and safeguard customer data
In other words, there’s still a missing piece, a mechanism for bundling gold-standard datasets and exposing them via the context layer. Without this, even high-quality data can’t be optimized for the AI use cases and automation workflows that drive business outcomes.
The importance of data products for AI data architecture
This is where data products come in. Data products are curated, accessible wrappers built around high-quality datasets. Think of them as packages of data that include everything you need to use and maintain the underlying data over time.
Haven’t we been here before? Yes, data products have been around for a while. But with the advent of AI, they’ve gained a newfound importance. The rich metadata that data products provide cuts down on hallucinations and inaccurate outputs, making them the ideal building block for your AI context layer. They provide a strong data foundation by delivering the data quality, semantic context, and scalability that AI initiatives demand.
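In its simplest SQL form, a data product can be approximated as a documented, governed view over a gold dataset. A hedged sketch with illustrative names (Starburst’s data products feature adds richer metadata, ownership, and discovery on top of this basic idea):

```sql
-- Wrap a curated dataset in a stable, documented interface.
CREATE OR REPLACE VIEW lake.products.customer_lifetime_value AS
SELECT
  customer_id,
  SUM(total) AS lifetime_value
FROM lake.gold.orders
GROUP BY customer_id;

-- Attach semantics for consumers, including AI agents
-- discovering the dataset through its metadata.
COMMENT ON VIEW lake.products.customer_lifetime_value
IS 'Lifetime revenue per customer, refreshed daily; owner: sales data team';
```

The comment is the key detail here: it’s exactly the kind of semantic metadata that grounds AI outputs in what the data actually means.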
Build a context-centric AI data architecture with Starburst
Overall, the landscape for AI is startling, revolutionary, and changeable. At the same time, the foundations for AI success include many of the same ingredients that make analytics architectures successful.
Most AI projects don’t fail because of bad prompts or models, but because they lack context. A federated data architecture, built on the open lakehouse and data products, gives your AI projects this much-needed context layer. It connects them to the data they need today, with the data quality and governance required for long-term success and competitive advantage.
Starburst is purpose-built to provide the context layer for tomorrow’s AI-powered solutions. Offering federated access to 50+ data sources, Managed Iceberg for demanding workloads, and data products for building your AI context layer, Starburst provides a single foundation for all your AI and analytics workloads, giving your data strategy the scalability it needs to grow.
Talk to us today to learn more about how Starburst can accelerate your AI deployment and help you move your AI ideas from prototype to production.



