
GenAI Requires an Open Data Architecture

Here's why that matters for your data
  • Starburst Team


AI tools, especially GenAI, depend on large, diverse, and high-quality datasets. Predictive power, contextual accuracy, and even operational reliability all improve as models are trained on richer, more representative data. 

For example, a retail business seeking to deploy an AI agent to forecast inventory needs must leverage not just sales and logistics data, but also marketing, supply chain, weather, and competitor data. In practice, this data is spread across multiple systems and formats, each with its own sensitivity and access requirements.

Traditional data systems, including monolithic data warehouses, standalone SaaS applications, and legacy analytics stacks, are restrictive by nature:

  • Siloed Systems: Data is partitioned by function, typically finance, sales, HR, and operations. Each domain has its own interfaces, access rules, and data models, so data produced by one team is often inaccessible to another.
  • Vendor Lock-in: Proprietary formats, query languages, and APIs make switching to or integrating new, best-in-class tools costly and complex.
  • Slow Centralization: The old solution to data sprawl was copying everything to a central repository, an approach that is time- and resource-intensive, produces copies that quickly go stale, and runs counter to modern privacy and sovereignty requirements.
  • Poor Data Discoverability and Governance: Without a unified view that accounts for data quality, teams can neither find nor consistently secure the datasets that matter. This is dangerous in regulated industries and disastrous for data-driven AI models.

In the context of AI, this results in underpowered models, hallucinated responses (due to insufficient business context), and monumental compliance challenges.


What do you mean by data architecture?

Data architecture refers to the framework that defines how data is collected, stored, transformed, distributed, and consumed within an organization. It encompasses the models, policies, rules, and standards that govern which data is collected and how it’s stored, arranged, integrated, and used in data systems. In the context of AI, data architecture is critical because it determines whether data is accessible, interoperable, and usable across systems. Each of these factors directly impacts AI performance. An effective data architecture enables organizations to treat data as a strategic asset that can be leveraged for business intelligence, analytics, and AI initiatives.

What are the three types of data architecture?

The three primary types of data architecture are:

  1. Centralized Architecture: All data is stored in a single, central repository (like a data warehouse). This approach provides a single source of truth but may create bottlenecks and challenges in accessing real-time data.
  2. Distributed Architecture: Data is stored across multiple locations or systems, allowing for greater scalability and fault tolerance. This includes data lakes, federated databases, and multi-cloud deployments.
  3. Hybrid Architecture: Combines elements of both centralized and distributed approaches. For example, an organization might maintain a central data warehouse for structured data while using distributed data lakes for unstructured data. This approach is becoming increasingly common as organizations strive to strike a balance between governance and flexibility.

In modern contexts, we’re also seeing the emergence of open architectures, including the Icehouse architectural model, that emphasize interoperability, standard formats, and the separation of compute and storage layers.

Is ETL part of data architecture?

Yes, Extract, Transform, Load (ETL) processes are an essential component of data architecture. ETL refers to the procedures used to extract data from various sources, transform it to fit operational needs, and load it into target systems for analysis, reporting, or AI training.

While ETL is part of data architecture, it’s not the architecture itself. Instead, it’s one of the processes that operates within the architectural framework. In traditional approaches, ETL often involves copying data to centralized repositories, which can be time-consuming and create outdated snapshots. Modern data architectures may supplement or replace traditional ETL with approaches such as ELT (Extract, Load, Transform) or data virtualization, which allows for querying data in place, thereby reducing the need for extensive data movement.
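To make the distinction concrete, here is a minimal sketch of a traditional ETL step in Python. The source endpoint, field names, and target schema are hypothetical; the point is that data is copied and reshaped before anyone can query it.

```python
# A minimal, hypothetical ETL job: extract from a source API, transform
# the records, and load them into a target store. All names are invented.
import json
import sqlite3
import urllib.request

# Extract: pull raw records from a (hypothetical) source system.
with urllib.request.urlopen("https://api.example.com/orders") as resp:
    raw = json.load(resp)

# Transform: keep only the fields the target schema needs, normalize units.
rows = [(r["id"], r["total_cents"] / 100.0) for r in raw]

# Load: write the reshaped copy into the central repository.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, total_usd REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
conn.commit()
```

Data virtualization inverts this pattern: rather than materializing a copy ahead of time, a federated engine queries the source systems in place when the question is asked, so results reflect the current state of the sources.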

What does a data architect do?

A data architect is responsible for designing, creating, deploying, and managing an organization’s data architecture. Their key responsibilities include:

  1. Strategic Planning: Developing data strategies aligned with business objectives and identifying how data can support organizational goals.
  2. Architecture Design: Creating blueprints for data management systems that balance performance, scalability, security, and compliance requirements.
  3. Standards Development: Establishing data standards, policies, and governance frameworks to ensure data quality, consistency, and regulatory compliance.
  4. Technology Selection: Evaluating and recommending appropriate data technologies, tools, and platforms based on business needs.
  5. Integration Planning: Determining how various data sources and systems will work together, often designing APIs and interfaces between systems.
  6. Security Implementation: Ensuring appropriate data security controls are in place to protect sensitive information.
  7. Team Leadership: Guiding data engineers, analysts, and scientists on how to work within the established architecture.

In the context of AI initiatives, data architects increasingly need to design systems that enable AI models to access comprehensive, up-to-date data while maintaining security and governance controls.


Open data architecture defined

The solution is an open data architecture, which rests on a few critical properties:

  • Interoperability: Data is accessible across systems via universal interfaces, not tied to one vendor’s technology stack.
  • Standard Formats: Data tables use open structures, such as Apache Iceberg, enabling multiple compute engines and workflows to read/write data concurrently (see the sketch after this list).
  • Federation: The architecture enables querying across distributed sources (cloud, on-premises, SaaS, and various databases) in a unified manner.
  • Optionality: Organizations aren’t trapped by artificial boundaries on storage, compute, or governance. They adjust infrastructure as needs evolve.
  • Separation of Compute and Storage: Compute engines (like Trino) are decoupled from storage format/location. You can swap, scale, or update one without disrupting the other.
  • Self-Describing, Governable Datasets: Metadata, lineage, and access controls are readily available, enforceable, and extensible.
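As a rough illustration of the Standard Formats and Separation of Compute and Storage properties, the sketch below reads the same Apache Iceberg table with two different engines. The endpoints, catalog configuration, and table name are assumptions for illustration, not a reference setup.

```python
# Hypothetical sketch: one Iceberg table, two independent compute engines.
# Requires the trino and pyiceberg packages; all connection details invented.
import trino
from pyiceberg.catalog import load_catalog

# Engine A: Trino reads the table through standard SQL.
cur = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="quant", catalog="lake"
).cursor()
cur.execute("SELECT count(*) FROM risk.positions")
print("trino sees:", cur.fetchone()[0])

# Engine B: PyIceberg reads the very same files via a REST catalog,
# with no export or conversion step in between.
catalog = load_catalog("lake", uri="http://rest-catalog.example.com:8181")
table = catalog.load_table("risk.positions")
print("pyiceberg sees:", table.scan().to_arrow().num_rows)
```

Because the table format, not any single engine, owns the data, either engine can be swapped or scaled without touching the other.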


Why GenAI needs open data architecture

GenAI in particular thrives on an open data architecture, so any AI strategy should keep openness in mind. Let’s look at a few reasons why interoperability and open architecture matter.

1. Universal data access for context-rich AI

GenAI, particularly LLM-based tools or AI agents that automate complex business workflows, achieves its full potential only when given broad, context-rich access to data. If the AI only ingests CRM or sales records but lacks knowledge of supply chain disruptions, customer support incidents, or compliance rules, its inferences will be superficial and occasionally risky.

Example: A global logistics company deploys an AI-powered agent to forecast shipping delays. If the architecture only allows access to shipping manifests and past delivery times in its ERP, the forecasts will miss vital local port strike data sitting in a third-party SaaS; they’ll also ignore up-to-the-minute weather data available via a cloud API. With an open data architecture, the AI can federate queries across on-premises data, cloud object storage (such as S3), and external APIs, producing a holistic and accurate forecast.
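Expressed as code, that forecast can reduce to a single federated query. Below is a hedged sketch using the trino Python client; the catalog names (erp, lake, weather) and columns are hypothetical stand-ins for the ERP, the third-party SaaS feed, and the cloud weather API.

```python
# Hypothetical federated query joining three physically separate systems.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="forecaster")
cur = conn.cursor()
cur.execute("""
    SELECT m.shipment_id,
           m.scheduled_arrival,
           p.strike_risk,         -- third-party SaaS port alerts
           w.storm_probability    -- cloud weather API
    FROM erp.logistics.manifests   AS m
    JOIN lake.external.port_alerts AS p ON m.port_code = p.port_code
    JOIN weather.forecast.daily    AS w ON m.port_code = w.port_code
    WHERE m.scheduled_arrival < current_date + INTERVAL '7' DAY
""")
at_risk_shipments = cur.fetchall()
```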

2. Data sovereignty, compliance, and governance

AI initiatives increasingly face scrutiny regarding personally identifiable information (PII), regulatory compliance, and data sovereignty. Enterprises—especially in banking, healthcare, insurance, and government—must ensure that data never crosses borders or leaves specified cloud regions. Centralizing all data into a single hyperscaler’s cloud is often illegal or impossible.

An open data architecture lets you bring your AI to the data, rather than moving sensitive data to the AI. This approach allows on-premises and hybrid AI deployments. Data access and compute are governed and auditable, with role-based and attribute-based access controls consistently enforced.

Example: A healthcare provider wishes to deploy a GenAI-powered assistant for patient query routing, but privacy laws (such as HIPAA or GDPR) prevent the bulk export of patient data to a cloud LLM provider. With an open architecture, LLM fine-tuning or RAG (retrieval-augmented generation) workflows happen locally on the provider’s infrastructure, querying consented data on-premises and federating only anonymized insights where allowed.
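A toy sketch of that pattern follows: the retrieval step of a RAG workflow runs inside the provider’s boundary, and only scrubbed snippets are ever placed in a prompt. The records, consent flags, and redaction rule are invented for illustration; a real deployment would use vetted anonymization.

```python
# Toy sketch of on-premises RAG retrieval with consent and redaction
# enforced before any text could reach an external model.
import re

RECORDS = [  # stand-in for on-prem patient records
    {"consent": True,  "text": "Patient John Doe reported chest pain on 2025-03-02."},
    {"consent": False, "text": "Patient Jane Roe requested a refill."},
]

def retrieve(question: str) -> list[str]:
    # Only consented records are eligible (toy keyword match).
    return [r["text"] for r in RECORDS if r["consent"] and "pain" in question.lower()]

def anonymize(snippets: list[str]) -> list[str]:
    # Scrub obvious identifiers before anything leaves the boundary.
    return [re.sub(r"Patient \w+ \w+", "Patient [REDACTED]", s) for s in snippets]

def build_prompt(question: str) -> str:
    context = "\n".join(anonymize(retrieve(question)))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Why was chest pain reported?"))
```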

3. Eliminating data silos and vendor lock-in

AI models that rely solely on the data conveniently available in a SaaS application or a single data warehouse suffer from a limited perspective. Furthermore, changing business needs or new AI innovations could require pivoting to new tools or engines.

Open architectures—especially those utilizing open table formats (Apache Iceberg, Delta Lake, and Hudi) and compute engines (such as Trino)—allow teams to evolve, upgrade, or migrate their tooling without re-platforming their entire data estate.

Example: A financial institution runs risk analytics in a proprietary data warehouse. Later, they wish to experiment with a new, more efficient AI computing engine or incorporate a specialized fraud detection library. Because their underlying data uses an open format and is accessible via standard SQL interfaces, they can do so without migrating terabytes of sensitive, regulated data or getting locked into new contracts.

4. Real-time data, high-velocity AI

Modern AI doesn’t just need massive amounts of data; it requires the most current data. In fast-moving sectors (e.g., trading, supply chain logistics, cybersecurity), even a one-hour lag can render insights useless. Achieving this requires an understanding of data velocity.

Open, federated architectures support near real-time ingestion, streaming, and analysis. This is something that closed or tightly centralized repositories struggle to do efficiently.

Example: An e-commerce company utilizes GenAI to personalize its website and make product recommendations. If the AI relies on a nightly batch-processed data warehouse load, product recommendations will be stale. With an open architecture that embraces federated, real-time data streams (from in-store sensors, inventory systems, and web logs), the AI delivers timely, relevant suggestions, thereby maximizing sales.
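As a simple sketch of that streaming path, the snippet below uses the kafka-python client to keep per-user context fresh from a hypothetical clickstream topic rather than from a nightly warehouse load. The broker address, topic, and event fields are assumptions.

```python
# Hypothetical sketch: personalization context built from a live stream
# instead of last night's batch load. Requires the kafka-python package.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "web-clicks",                                  # hypothetical topic
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

recent_views: dict[str, list[str]] = {}  # user -> recently viewed products

for event in consumer:
    user, product = event.value["user_id"], event.value["product_id"]
    recent_views.setdefault(user, []).append(product)
    # The recommender is prompted with minutes-old context, not a stale
    # nightly snapshot.
    print(f"recommend for {user} given {recent_views[user][-5:]}")
```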

5. Collaborative curation and data products

AI is most powerful when organizations treat data not as a static resource, but as an evolving part of their business. Data products help you do this. These curated, governed, purpose-built assets are shareable and reusable. They help unlock data silos and are a critical aspect of both analytics and AI workloads.

Open architectures enable the packaging of cross-source, federated datasets into governed data products with traceable lineage and embedded access controls. Data products facilitate cross-team collaboration while maintaining robust data governance. This combination makes them instantly discoverable and usable for a wide range of AI and analytics projects.

Example: A multinational insurer aggregates claims data, policy details, and third-party loss event reports into a governed “Catastrophe Analysis” data product. Under an open data architecture, the AI models can be trained or prompted on these assembled products without navigating a maze of incompatible schemas or access barriers.
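One way to picture such a data product is as a governed, cross-source view published through a federated engine like Trino. This is a hedged sketch: the catalog, schema, and role names are illustrative, and GRANT behavior depends on the configured access-control system.

```python
# Hypothetical sketch: publish a federated data product as a governed view.
import trino

cur = trino.dbapi.connect(host="trino.example.com", port=8080, user="steward").cursor()

# Package cross-source data behind one curated, documented interface.
cur.execute("""
    CREATE OR REPLACE VIEW lake.products.catastrophe_analysis AS
    SELECT c.claim_id, c.amount, p.policy_type, e.event_name
    FROM claims.prod.claims        AS c
    JOIN policies.prod.policies    AS p ON c.policy_id = p.policy_id
    JOIN lake.external.loss_events AS e ON c.event_id  = e.event_id
""")

# Consumers get access to the product, not to the raw source systems.
cur.execute("GRANT SELECT ON lake.products.catastrophe_analysis TO ROLE analysts")
```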


The Icehouse: Open architecture as an AI foundation

Open data architecture finds its natural home in the data lakehouse, which provides an open foundation for enterprise data, analytics, and AI. But not all data lakehouses are created equal; some are more open than others. One particular type of lakehouse – one built on the Icehouse architecture with Apache Iceberg and Trino – stands out for its commitment to open standards. This model shifts the data stack from a proprietary, brittle monolith to an open, flexible, and future-proof platform.

Key properties of an Icehouse include:

  • Data can reside on-premises, in public or private cloud, or across multiple providers. Compute engines can be swapped or diversified as needed.
  • Data is queryable through either centralization or federation, reducing the need for costly, wholesale migrations.
  • Governance and security are standardized and enforceable across the organization.
  • New AI workflow demands (retrieval-augmented generation, vector search, AI agents) are met without re-architecting.

Organizations leveraging this approach not only accelerate AI adoption but also reduce technical debt, cut costs, and avoid painful re-platforming as technology or regulations evolve.


Real-world implications

Open data architecture has numerous advantages. Some of these are technological, and others are organizational. Businesses that embrace open data architectures:

  • Accelerate AI Launches: Time-to-production for new AI applications drops from quarters to weeks.
  • Reduce Risk: Regulatory and security posture improves due to consistent, auditable governance.
  • Cut Costs: Cloud-neutral storage and compute, along with the separation of concerns, avoid lock-in and enable best-of-breed selection at every layer.
  • Drive Innovation: New tools, languages, and models can be integrated without requiring architectural adjustments, ensuring businesses aren’t left behind as GenAI continues its rapid advancement.

Scenario: A multinational corporation acquires a smaller company in a different regulatory jurisdiction. The parent wants to deploy unified AI-driven analytics leveraging both firms’ data, but legal and technical barriers exist. Thanks to an open, federated architecture, joint AI workloads can be assembled using governed access to each legacy system, without violating residency laws or waiting months for ETL migrations.


Conclusion

GenAI transforms what organizations demand from their data: greater volume, speed, context, and responsiveness, all delivered under tight governance. Only an open data architecture can reconcile these requirements.

Such an architecture brings the AI “to the data,” not the other way around, enabling secure, fast, and contextually rich intelligence where and when it’s needed. Proprietary systems, isolated data lakes, and “one-cloud-fits-all” models will slow or stall progress.

The future belongs to organizations that embrace open standards, separation of compute and storage, federated access, and collaborative data product creation. This lays the foundation not only for successful GenAI but also for other data-driven initiatives.


Start for Free with Starburst Galaxy

Try our free trial today and see how you can improve your data performance.