
Every enterprise will build a context layer for AI agents, whether they call it that or not. The ones that build it deliberately will run agents that are materially more accurate than competitors' agents on the same foundation models. The discipline this requires is new. This piece is the architecture for it.


Every enterprise building AI agents hits the same wall at the same moment. The prototype works beautifully. The agent writes SQL, generates dashboards, and answers business questions. Then someone asks: “What’s our customer churn rate?”

The agent confidently writes a query against the wrong table. It picks mysql_support.tickets.cust_id instead of duckdb.main.subscriptions.customer_id. It doesn’t know that “churn” in your organization means canceled-within-30-days, not simply inactive. It doesn’t know that the churned column in DuckDB is the canonical source, while the support database only captures symptom data. It returns a number. The number is wrong. And no one catches it until it’s in a board deck.

This is not a model problem, and it will not be solved by the next Opus, or GPT, or Gemini model. It is the first symptom of an architectural shift the data stack has not yet absorbed: the consumer of enterprise data changed, and the infrastructure was not built for the new consumer.

For thirty years, every consumer of enterprise data has been able to tolerate ambiguity. Analysts asked follow-up questions. BI tools displayed metric definitions next to numbers so a human could judge them. Batch pipelines were built by engineers who knew which “customer” they meant. The infrastructure did not have to encode business meaning explicitly because the consumer always carried the meaning in their head.

Agents don’t. They cannot stop to browse, cannot fall back on tribal knowledge, cannot pattern-match their way past a vague metric definition. They either receive the right context before they reason — assembled, scoped, structured, with confidence weighting attached — or they produce a confident, plausible, wrong answer at the speed of inference.

The discipline that closes this gap is agent grounding: the function responsible for delivering trustworthy business context to an agent before it reasons. It is not a feature of any existing system. It is the function none of them perform, not because the vendors haven’t tried, but because the optimization target is fundamentally different from anything the modern data stack was built around.

The evidence is direct, and it shows up in three independent literatures. Snowflake’s engineering team demonstrated that the same foundation model achieves 50% accuracy on enterprise data questions without structured business context, and over 90% accuracy with it. AWS independently confirmed that adding column-level descriptions, value distributions, and join constraints improves text-to-SQL accuracy through correct literal selection and correct join conditions. Princeton’s SWE-agent authors documented at NeurIPS 2024 that agent-computer interfaces (ACI) require fundamentally different design than human-computer interfaces (HCI), favoring compact, structured, deterministic context delivery over exploratory browsing. Whichever literature you start from, the conclusion is the same: the model is not the bottleneck. The grounding layer is.

This piece is the architecture for building grounding deliberately. It introduces the Enterprise Context Layer (ECL) as the implementation pattern — the system in your stack that performs the grounding function — and a seven-layer reference architecture for what an ECL must contain. The more important argument is the discipline itself. Every enterprise will build grounding capability. The ones that build it as a first-class primitive will compound an accuracy advantage. The rest will rebuild it later, having paid for it twice.

Why the Existing Stack Doesn’t Do This

The data stack has four layers that touch metadata. Each solves a real problem. None performs the grounding function.

Data catalogs were designed for human discovery. Analysts couldn’t find data across sprawling warehouses, so the catalog gave them a browsable, searchable portal. This works well for its original consumer. It fails for agents because the consumption model is incompatible: humans can stop mid-task to browse; agents cannot stop mid-reasoning to search. Context has to arrive before reasoning starts, not during it. A search index optimized for human exploration is architecturally mismatched to inference-time context delivery.

Semantic layers (dbt MetricFlow, Cube, LookML) were designed for metric consistency across BI tools and warehouses. They solve a real problem: the same “revenue” metric calculated seventeen different ways across reports. They work well within a unified data platform. They hit their ceiling at cross-source identity. When “customer” lives in your CRM, your billing system, and your support database as three different identifiers under three different schemas, no semantic model can express the entity that unifies them — because semantic layers define how to compute metrics, not how to resolve what a concept means across systems.

Knowledge graphs were designed to represent relationships between concepts in structured, queryable form. They solve a real problem: how does “customer” relate to “subscription” relate to “revenue event”? The ceiling is economics and stasis. Traditional knowledge graphs were too expensive to build and too static to maintain as inference-time operational context. Construction required months of expert curation; keeping them current required continuous investment that few organizations could sustain. That made them organizational artifacts: valuable in theory but chronically outdated in practice, and never designed for the millisecond retrieval requirements of a live agent.

MCP (the Model Context Protocol) was designed for connectivity. It solves a genuine problem: agents need a standardized way to call tools across disparate systems. Every major catalog and platform now ships an MCP server. But MCP is a transport layer. It specifies how an agent calls a tool. It says nothing about what the agent should know before it reasons. Connectivity is not grounding. You can expose the entire data stack through MCP and still have an agent that writes confidently wrong SQL because it doesn’t know your churn definition.

These four layers do their jobs. The gap between them is not a feature gap in any one layer. It is a missing function: the function responsible for assembling structured, scoped, business-semantic context for an agent at inference time, before it reasons.

That function is grounding. The system that performs it is the Enterprise Context Layer.

What Grounding Is Not

The naming contest will be loud. Catalog vendors will argue that grounding is “catalog with an MCP endpoint.” Semantic layer vendors will argue it’s “semantic layer, extended for agents.” Knowledge graph vendors will argue they have been here all along. RAG framework vendors will argue grounding is what they already do. Each will be half-right about the substrate and wrong about the function.

It is worth being precise about what grounding is and isn’t, because the category will not stick if the boundary is fuzzy.

Grounding is not catalog search. Catalogs answer a discovery question: “What data exists?” Grounding answers an assembly question: “What does this agent need to know about this question, right now?” Adding an MCP endpoint to a catalog exposes the discovery interface to an agent. It does not assemble.

Grounding is not a semantic layer. Semantic layers compile metric and business logic definitions to SQL, typically operating within or on top of a warehouse or lakehouse. Grounding spans warehouses, encodes cross-source identity, includes business rules that aren’t expressible as metrics, and serves the result as one assembled, scoped package per question. A semantic layer is a component of grounding, not a substitute.

Grounding is not a knowledge graph. Knowledge graphs encode concept relationships in queryable form. Grounding requires a knowledge graph to exist, but adds three things the classic artifact lacked: an agentic construction pipeline that builds and maintains the graph without heavyweight governance cycles, scoped retrieval that delivers focused context per question, and continuous refresh that prevents the staleness that historically made knowledge graphs governance artifacts rather than operational primitives.

Grounding is not RAG over documentation. RAG retrieves passages of text. Grounding assembles structured, executable context: schemas come with statistics, metrics come with verified SQL, joins come with confidence scores, rules come with severity attached. An agent given retrieved documentation still has to reason its way to correct SQL. An agent given grounded context executes known-good patterns.

The pattern in all four cases is the same. Existing systems handle a substrate that grounding requires. None of them perform the function. Retrofitting is not the same as rebuilding around the new consumer.

What Grounding Requires

For an Enterprise Context Layer to perform the grounding function, it must do four things, in order: harvest live metadata from the data infrastructure; structure it into business semantics; assemble scoped context for each question; and serve that context to the agent before reasoning begins.
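
To make that contract concrete, here is a minimal sketch in Python. Every name in it (EnterpriseContextLayer, ContextPackage, the four methods) is illustrative shorthand for the four steps above, not a shipping API:

from dataclasses import dataclass

# Hypothetical shapes for the four-step grounding contract.
@dataclass
class ContextPackage:
    tables: list          # schemas with column statistics
    join_paths: list      # verified cross-source joins
    metrics: list         # canonical, executable SQL definitions
    business_rules: list  # machine-readable constraints
    glossary: list        # disambiguation filters

class EnterpriseContextLayer:
    def harvest(self, sources: list) -> dict:
        """Step 1: scan live infrastructure for schemas, stats, freshness."""
        raise NotImplementedError

    def structure(self, raw_metadata: dict) -> None:
        """Step 2: resolve entities, build the ontology, propose metrics and rules."""
        raise NotImplementedError

    def assemble(self, domains: list, entities: list) -> ContextPackage:
        """Step 3: scope the knowledge to one question's domains and entities."""
        raise NotImplementedError

    def serve(self, package: ContextPackage, max_tokens: int = 2000) -> str:
        """Step 4: render the package for injection before reasoning begins."""
        raise NotImplementedError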

Three properties distinguish a grounding-native system from a metadata system that has had an agent interface bolted on.

Built by agents, certified by humans. The ECL is constructed through an agentic discovery pipeline. LLMs scan your data infrastructure, detect entities and relationships, propose metrics and business rules, and present findings for human review. A human does not hand-build the ECL. An agent builds it. A human approves it. This inversion is the only thing that makes the economics work. Every prior generation of enterprise knowledge graph failed because the cost of construction exceeded the value of discovery. That cost has now collapsed.

Built for agents. The ECL serves context through a scoped retrieval API. An agent asks: “I need context for a question about customer revenue.” The ECL returns exactly the tables, metrics, entities, join paths, and business rules relevant to that scope, sized to fit a context window, ranked by confidence, ready for reasoning. What’s returned isn’t a list of search results to parse or a graph to traverse. It’s assembled ground truth.

Continuously maintained. The ECL monitors your data infrastructure for schema changes, re-evaluates confidence scores as usage patterns shift, detects drift between metric definitions and actual data, and proposes updates. The substrate is a living system that evolves with your business, not a static document.

The next section makes this concrete with a seven-layer reference architecture for the contents of an ECL, drawn from a working implementation built against a federated multi-source data environment.

A Reference Architecture

Each of the seven layers answers a different question an agent cannot answer correctly without explicit grounding. Together, the layers encode not just what data exists, but what it means, how it connects, and how to use it correctly.

Layer 1: Physical Layer — What Exists

The foundation. An agentic scan of your data infrastructure that captures not just schema metadata but operational context: row counts, freshness, null rates, value distributions, PII classification. (See Appendix A.1 for a full example.)

This is not a SHOW COLUMNS dump. It is enriched metadata that tells an agent: this column is a primary key (safe to join on), this column contains PII (do not expose in results), this column has a skewed distribution (aggregations may mislead without filtering). The accuracy impact is direct. AWS research showed that adding column descriptions with possible values and foreign key constraints independently improves SQL generation through correct literal values in filters and correct join conditions.
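
As a rough illustration of the enrichment, here is a hypothetical profiling helper, assuming a DB-API cursor on the source and identifiers drawn from the harvested schema; a real scan would batch columns and sample large tables:

def profile_column(cursor, table: str, column: str) -> dict:
    """Collect the operational stats Layer 1 attaches to a column."""
    # `table` and `column` come from the harvested schema, not user input,
    # so string interpolation is safe in this sketch.
    cursor.execute(f"""
        SELECT
            COUNT(*)                 AS row_count,
            COUNT({column})          AS non_null,
            COUNT(DISTINCT {column}) AS distinct_count
        FROM {table}
    """)
    row_count, non_null, distinct_count = cursor.fetchone()
    return {
        "row_count": row_count,
        "null_rate": 1 - non_null / max(row_count, 1),
        "distinct_count": distinct_count,
        # distinct == rows suggests a key column: "safe to join on".
        "unique": row_count > 0 and distinct_count == row_count,
    }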

Layer 2: Entity Resolution — What Is the Same Thing

Entity resolution answers the question that no catalog answers automatically: across all your data sources, which columns represent the same real-world concept? A single resolved entity carries a master source, a ranked list of mappings in other systems, the join expression for each, the sampled value overlap that validates it, a confidence score, and the detection method that produced it. (See Appendix A.2.)

This is what makes cross-source questions answerable. When an agent is asked, “Show me customers with high support ticket volume but low product ratings,” it needs to know that the customer identifiers in three different databases refer to the same human being, and exactly how to join them.

The discovery pipeline builds this in three stages: column name pattern matching identifies candidates; value overlap sampling validates them statistically; LLM semantic analysis resolves ambiguous cases like cust_id vs customer_id. Every resolution carries a confidence score and detection method. The agent knows how much to trust each mapping, and why.
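
A minimal sketch of the second stage, assuming both columns’ values have already been pulled into Python; a production pipeline would sample inside the database instead:

import random

def value_overlap(master_values, candidate_values, sample_size: int = 1000) -> float:
    """Validate a name-matched candidate pair by sampled value overlap."""
    candidates = list(candidate_values)
    if not candidates:
        return 0.0
    sample = random.sample(candidates, min(sample_size, len(candidates)))
    master = set(master_values)
    return sum(1 for v in sample if v in master) / len(sample)

# The 87.1% overlap for mysql_support.tickets.cust_id in Appendix A.2
# would come from a call like:
#   value_overlap(duckdb_customer_ids, (int(v) for v in mysql_cust_ids))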

Without this layer, “customer” is three unrelated identifiers. With it, “customer” is a unified entity with a canonical source and verified join paths to every system that holds customer data.

Layer 3: Ontology — How Concepts Relate

The ontology defines your business domain structure as a first-class semantic model, not tags bolted onto tables. A domain groups the entities, key tables, cross-source join paths, and usage guidance that apply to a coherent slice of the business (Commerce, CustomerExperience, Finance) so that any question can be routed to a bounded, domain-scoped context. (See Appendix A.3.)

The ontology is what makes scoped retrieval possible. When an agent receives a question about “revenue trends,” the ECL resolves “revenue” to the Commerce domain and delivers a focused context package: the relevant tables, metrics, join paths, and business rules for that domain, and nothing else. An ECL with 500 tables still delivers constant-size, focused context for any single question.
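
A toy sketch of that routing step. Keyword matching here is a stand-in for whatever resolver a real ECL would use, such as embeddings or an LLM call over the ontology’s domain descriptions:

# Hypothetical routing index derived from the ontology's domain entries.
DOMAIN_INDEX = {
    "Commerce": {"revenue", "order", "subscription", "mrr", "churn"},
    "CustomerExperience": {"ticket", "rating", "satisfaction", "support"},
}

def route_domains(question: str) -> list:
    """Resolve a question to the bounded domain scopes it should load."""
    words = set(question.lower().split())
    return [domain for domain, keys in DOMAIN_INDEX.items() if words & keys]

route_domains("Show me revenue trends by subscription plan")
# -> ["Commerce"]: the context package is bounded to that domain's
#    tables, metrics, join paths, and rules.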

Layer 4: Metrics Store — Executable Business Definitions

A metric is not a description. It is executable SQL with full provenance: the canonical query, a structured lineage of the tables, columns, and filters it depends on, a confidence score, a verification author, and a last-validated timestamp. (See Appendix A.4.)

When an agent is asked “What’s our MRR?”, it does not generate SQL from scratch. It retrieves the canonical metric definition, adapts it if needed, and executes known-good SQL. The agent becomes a metric executor, not a metric inventor. The lineage is bidirectional: from a metric, trace down to source tables and columns; from a table, trace up to every metric affected by a schema change.
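
A sketch of the executor-not-inventor behavior. ecl.find_metric and run_sql are hypothetical stand-ins for the metrics store lookup and the query engine client:

def answer_metric_question(ecl, question: str, run_sql):
    """Prefer a canonical, verified definition over freshly generated SQL."""
    metric = ecl.find_metric(question)  # e.g. "What's our MRR?"
    if metric and metric["confidence"] >= 0.9:
        # Execute the known-good SQL from the metrics store verbatim,
        # and carry the provenance forward with the answer.
        return run_sql(metric["full_query"]), metric["last_validated"]
    # Fall back to SQL generation only when no verified definition
    # exists, and surface that fact rather than hiding it.
    raise LookupError("no verified metric; route to generation with a caveat")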

Layer 5: Entity-Metric Lineage — The Provenance Graph

Pipeline lineage tracks data movement: this job read from A and wrote to B. Semantic lineage tracks meaning: which business entities and metrics connect to which physical assets, and why. Every entity resolves to a master source, the metrics derived from it, the join paths that made those metrics possible, and the domains it participates in. (See Appendix A.5.)

When an agent touches customer_id, the ECL surfaces: this entity connects to 3 metrics across 2 domains. The customer_churn_rate metric has a business rule: only count customers with more than 30 days of tenure. Semantic lineage is forward-looking; it derives from what things mean, not from what jobs have run.
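
The upward trace is a plain traversal over the lineage records. A sketch, using the lineage shape from Appendix A.4:

# Lineage records shaped like Appendix A.4: metric -> tables and columns.
METRIC_LINEAGE = {
    "monthly_recurring_revenue": {
        "tables": {"duckdb.main.subscriptions"},
        "columns": {"start_date", "monthly_amount", "status", "subscription_type"},
    },
}

def metrics_affected_by(table: str, column: str) -> list:
    """From a changed column, trace up to every metric it feeds."""
    return [
        name for name, lineage in METRIC_LINEAGE.items()
        if table in lineage["tables"] and column in lineage["columns"]
    ]

metrics_affected_by("duckdb.main.subscriptions", "status")
# -> ["monthly_recurring_revenue"]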

Layer 6: Business Glossary — Disambiguation for Machines

“Customer” in Sales means an account with a signed contract. “Customer” in Product means anyone who logged in. “Customer” in Support means anyone who filed a ticket. A human absorbs these distinctions over months of tribal knowledge. An agent needs them as machine-readable disambiguation logic, not documentation but executable filters. Each glossary term carries a canonical definition, a set of context-variant filters, and notes on common misinterpretations. (See Appendix A.6.)

When asked “How many customers do we have?”, the agent can apply the default definition or ask: “Do you mean paying customers (12,400) or all users with accounts (48,291)?” Confident disambiguation rather than confident wrong answer.
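
A sketch of the disambiguation logic, with context variants shaped like Appendix A.6; the structure is illustrative:

# Per-context meanings and executable filters for one glossary term.
CUSTOMER_VARIANTS = {
    "Sales":   ("Account with signed contract",
                "plan_type IN ('pro', 'enterprise')"),
    "Product": ("Login activity in last 90 days",
                "last_login > CURRENT_DATE - INTERVAL '90' DAY"),
}

def disambiguate(variants: dict, stated_context=None) -> dict:
    """Resolve a term to one executable filter, or surface the choices
    so the agent asks instead of guessing."""
    if stated_context in variants:
        meaning, sql_filter = variants[stated_context]
        return {"meaning": meaning, "filter": sql_filter}
    return {"ask": [f"{ctx}: {meaning}"
                    for ctx, (meaning, _) in variants.items()]}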

Layer 7: Business Rules — Tribal Knowledge Made Executable

Every organization has rules that exist nowhere in code. They live in Slack threads, onboarding docs, and the heads of senior analysts. These rules separate technically valid SQL from the SQL that matches what Finance actually reports. Each rule carries a domain, a severity, a machine-readable filter, the metrics it protects, and the business impact of violating it. (See Appendix A.7.)

Business rules are the ECL’s integrity layer. They prevent the class of errors that are technically valid SQL but semantically wrong — the errors that erode trust in AI systems fastest. An agent that writes beautiful, executable SQL that overstates revenue by 8% does more damage than one that returns an error. Silent failures are not failures the system catches. They are failures the business catches, after the number has already traveled.
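
A sketch of a pre-execution rule check. String matching here is a stand-in for the AST-level SQL analysis a production system would use:

# Rules shaped like Appendix A.7; the required fragment is illustrative.
RULES = [
    {"name": "revenue_excludes_refunds",
     "severity": "critical",
     "affected_metrics": {"monthly_recurring_revenue", "total_revenue", "arpu"},
     "required_fragment": "status != 'refunded'"},
]

def violated_rules(metric_name: str, sql: str) -> list:
    """Flag technically valid SQL that breaks a business rule before it runs."""
    return [
        rule["name"] for rule in RULES
        if metric_name in rule["affected_metrics"]
        and rule["required_fragment"] not in sql
    ]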

Built by Agents: Context Harvesting

The seven layers above describe what the ECL contains. How those layers get populated is the harder problem: the mechanism that continuously extracts new metadata, validates what changed, and proposes updates from live data infrastructure without drowning humans in review queues.

Traditional approaches to enterprise knowledge (catalog curation, ontology building, metric documentation) fail for the same reason: the cost of construction exceeds the value of discovery. Gartner puts the failure rate of data governance initiatives at 80% by 2027. Without automation, enterprise metadata completeness stalls between 30 and 40 percent. The blocker was never the concept. It was the curation cost.

The ECL inverts this. The discovery pipeline is agentic: LLMs do the heavy lifting, humans review and approve.

[Figure: the agentic discovery process]

Each phase produces proposals, not final answers. Every proposal carries a confidence score and detection method. The review interface uses confidence banding:

  • High confidence (>0.85): Bulk-approve with one click. Schema facts, obvious entity matches, verified metrics. Roughly 80% of proposals land here.
  • Medium confidence (0.50–0.85): Review individually. Cross-source entity resolutions, inferred relationships, LLM-suggested metrics needing domain expert validation.
  • Low confidence (<0.50): Full detail with editing. The system flags its uncertainty and asks for human judgment.
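
A sketch of the triage itself, using the thresholds above; the proposal shape is illustrative:

def triage(proposals: list) -> dict:
    """Band discovery proposals by confidence for human review."""
    bands = {"bulk_approve": [], "review_individually": [], "full_detail": []}
    for proposal in proposals:  # each proposal: {"confidence": float, ...}
        score = proposal["confidence"]
        if score > 0.85:
            bands["bulk_approve"].append(proposal)
        elif score >= 0.50:
            bands["review_individually"].append(proposal)
        else:
            bands["full_detail"].append(proposal)
    return bands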

The economics are the point. Hand-curating an ECL for 3 data catalogs and 9 tables took approximately 8 hours. The agentic pipeline produces equivalent output in under 2 minutes. Coverage is higher too, because LLMs surface patterns humans miss (like cust_id being a VARCHAR-encoded version of customer_id). Industry benchmarks show automation reduces data steward curation hours by 70 to 80 percent.

The blocker to high-quality enterprise semantic layers was always the curation cost. That cost has collapsed.

Built for Agents: Context Assembly and Serving

The ECL’s value is realized at the retrieval layer, where structured knowledge becomes ground truth delivered to a reasoning agent, assembled and scoped to the question at hand.

The contrast with catalog retrieval is instructive, not to criticize catalogs but to illustrate why grounding is a different function.

A catalog retrieval (GET /search?q=revenue) returns a ranked list of matching assets. The agent parses results, follows links, and assembles context. It is doing discovery inside its reasoning loop, which is exactly the wrong moment for discovery to happen.

ECL scoped retrieval works differently. The agent calls get_ecl_context(domains, entities). One call, roughly 2,000 tokens, returns a single assembled package: the relevant tables with their column statistics, the verified join paths between them, the canonical metric SQL, the business rules that constrain those metrics, and the glossary terms that disambiguate them. (See Appendix A.8 for the full response shape.)

One call. Everything the agent needs. Nothing it doesn’t. Context is pre-assembled, scoped to the question, and sized to fit a context window. Rather than searching, the agent receives ground truth.

The retrieval system operates in two modes:

  1. Index mode: A lightweight domain summary (~300 tokens) lives permanently in the agent’s system prompt, letting the agent determine which domains and entities are relevant to any question.
  2. Scoped retrieval: The agent calls get_ecl_context() with specific domains and entities. The ECL returns detailed context (table schemas, metric SQL, join paths, business rules) for exactly those scopes.

Token efficiency stays constant as the ECL scales. An ECL with 500 tables and 200 metrics delivers the same focused, correctly-sized context package for any single question as one with 20 tables.
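
From the agent’s side, the two modes might compose like this; agent.pick_scopes and the client object shapes are hypothetical, mirroring the flow described above:

# Mode 1: a ~300-token domain index lives in every system prompt.
SYSTEM_PROMPT_INDEX = """
Domains: Commerce (revenue, orders, subscriptions, customer lifecycle),
         CustomerExperience (tickets, ratings, satisfaction)
"""

def ground_then_reason(agent, ecl, question: str):
    # The agent picks scopes using only the resident index.
    scopes = agent.pick_scopes(question, SYSTEM_PROMPT_INDEX)
    # Mode 2: one scoped call returns the assembled ~2,000-token package.
    context = ecl.get_ecl_context(domains=scopes["domains"],
                                  entities=scopes["entities"])
    # Context arrives before reasoning starts; the agent never searches
    # mid-loop.
    return agent.reason(question, context)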

The Strongest Counter, Confronted

The honest objection: won’t this just go away? Foundation models keep getting better. Context windows are now millions of tokens. Agentic search frameworks like SWE-agent demonstrate that agents can do impressive in-loop discovery. Why build infrastructure for a problem that scaling will eventually absorb?

Two reasons. Neither is about model capability. Both are about what the model cannot know.

Larger context windows solve capacity, not authority. An agent with a million-token window can hold the entire schema. It still does not know which of three “customer” definitions matches what Finance reports. Authority — this is the canonical definition of revenue, verified by a domain expert, last validated April 10th, with these exclusion rules and this lineage — is not encoded in the data. It exists only where someone has explicitly written it down. No amount of context capacity creates authority that does not exist.

Better agentic search solves findability, not trust. SWE-agent works because code is self-validating: the test suite either passes or it does not, and the agent can iterate against that signal. Enterprise data has no equivalent. An agent can find a metric definition through search. It cannot independently verify that this is the metric Finance uses, that this filter accounts for refunds, that this join handles the legacy VARCHAR encoding correctly. Trust requires provenance and review, both of which have to be built into the substrate before the agent reasons over it. An agent that searches harder for the wrong metric still produces a wrong answer.

The pattern is older than AI. Every generation of enterprise data infrastructure has eventually built an authority layer because the underlying data alone never carries enough context to be trusted at the speed of decisions. Pre-AI, that authority lived in BI tool definitions, in semantic layer configs, in the heads of senior analysts: slow consumers who absorbed ambiguity, asked follow-up questions, and applied judgment to the gaps. The agent era requires the same authority delivered to a different consumer: at inference time, in machine-consumable form, with confidence weighting attached, before reasoning begins.

This is not a problem foundation models will absorb. It is a problem foundation models will expose, more sharply, with each capability gain. Faster reasoning over wrong context produces wrong answers faster.

The Stack, Complete

[Figure: the agentic stack]

The ECL occupies the gap between raw data infrastructure and reasoning agents, not replacing any existing layer but performing the function none of them performs: grounding agents in structured business reality before they reason.

Every agent in the organization (SQL agent, analytics copilot, BI chatbot, data quality monitor) calls the same retrieval API. They all receive the same ground truth. The ECL is the control plane for agent reasoning across the entire data estate.

A note on the metadata origin. Grounding quality compounds upward from the substrate that seeds the physical layer. A metadata consumer (a system that crawls sources periodically and stores ingested copies) introduces drift between refresh cycles, which propagates upward through entity resolution, metric validation, and confidence scoring. A metadata origin (a system that holds live, authoritative cross-source metadata as a structural byproduct of normal operation) does not. The most common pattern that meets this requirement is a federated query engine, because federation produces a unified live information_schema and statistically rich runtime metadata as side effects of doing its primary job. (Appendix B expands on this implementation choice.)

Why Now

Three forces are making this urgent.

The demo-to-production gap is widening. Agents work on 3-table demo databases. They fail on 300-table production environments. The BEAVER benchmark, which tests models on schemas averaging 105 tables and 4.25 joins per query, reflecting real enterprise complexity, finds that state-of-the-art models fail on both table retrieval and SQL assembly even when given full schema access. This is not a model capability problem. It is a context infrastructure problem that scales with data estate complexity. The agents enterprises actually need are exactly the ones most exposed to this gap.

The stakes are rising faster than the accuracy is. A chatbot that confabulates a fact is embarrassing. An agent that produces a revenue number 8% too high because it didn’t know about the refund exclusion rule, and that number reaches a board deck, is a materially different class of failure. Industry surveys indicate that nearly half of enterprise AI users have already made a major business decision based on hallucinated AI output. As agents move from experimental to operational, the question shifts from “is this impressive?” to “can we trust this?” Trust requires ground truth.

LLMs can now build what they could not previously be given. Two years ago, constructing a cross-source entity resolution layer required months of manual ontology work. Today, an LLM can do all of that work in minutes: scan a schema, detect entity relationships across disparate sources, propose metric definitions, generate business glossary terms, surface business rules. The 50% to 90%+ accuracy improvement from structured context injection is available now, not a future capability. The curation cost that made knowledge graphs a governance curiosity rather than an operational primitive has collapsed.

The category is here. The question is who builds it deliberately and who builds it the slow way.

The Prediction

Within twenty-four months, every enterprise running production AI agents will have built a grounding layer, whether they call it that or not.

The companies that build it deliberately, as a first-class primitive with dedicated tooling, ownership, and evaluation, will run agents that are measurably more accurate than competitors’ agents on the same foundation models. By 2028, the primary differentiator between enterprises in AI maturity benchmarks won’t be which foundation model they use, or how cleverly they prompt it, or which agent framework they’ve adopted. It will be how well they ground their agents in their own business.

The companies that don’t will discover the layer the slow way, as a patchwork of catalog exports, semantic layer YAML, and undocumented Python that someone wrote after the third revenue number didn’t match Finance. They will rebuild this primitive at three times the cost, having paid for it twice: once in the lost decisions and eroded trust along the way, and again in the eventual formalization.

Every era of enterprise computing has been defined by a control plane, the layer that manages and coordinates the era’s critical resource. The infrastructure era’s control plane managed compute and storage. The data era’s control plane managed pipelines, warehouses, and access. Every serious enterprise eventually built it, because nothing above it could operate reliably without it.

The critical resource of the agent era is neither compute nor storage. It’s trusted context: the structured, current, business-specific ground truth that decides whether an agent’s reasoning lands correct or wrong.

Agent grounding is the discipline. The Enterprise Context Layer is the control plane. Foundation models are the engine.

The engine, without the rest of the system, reasons in the dark.


Based on research and development of the Enterprise Context Layer primitive, a seven-layer architecture for agent-consumable business context, built and evaluated against federated multi-source data environments. Statistical references: Snowflake Engineering Blog (text-to-SQL accuracy improvement with structured semantic context); Gartner (data governance initiative failure prediction, February 2024); NeurIPS 2024 SWE-agent paper (Agent-Computer Interfaces); EDBT 2025 / BEAVER benchmark (enterprise text-to-SQL analysis); AWS Big Data Blog (metadata enrichment and SQL accuracy); industry benchmarks on automated metadata curation efficiency.


Appendix A: Enterprise Context Layer Examples

The prose above describes what each layer contains. This appendix shows the concrete shape of each layer as it is stored, queried, and served. Examples are drawn from a reference ECL built against a multi-catalog Trino environment (DuckDB + Postgres + MySQL).

A.1 Physical Layer

Physical Layer for duckdb.main.customers:
  columns:
    - customer_id: BIGINT (PK, unique, 0% null)
    - email: VARCHAR (unique, PII:sensitive)
    - signup_date: DATE (range: 2020-01-15 to 2025-12-01)
    - plan_type: VARCHAR (enum: free|pro|enterprise, 62% pro)
    - lifetime_value: DECIMAL(10,2) (p50: 340, p99: 12,400)
  stats:
    row_count: 48,291
    last_updated: 2025-04-14T08:00:00Z

A.2 Entity Resolution

Entity: customer
  master_source: duckdb.main.customers.customer_id
  mappings:
    - postgres_reviews.product_reviews.customer_id
      join: direct equality (BIGINT = BIGINT)
      overlap: 94.2% (sampled)
    - mysql_support.tickets.cust_id
      join: CAST(cust_id AS BIGINT) = customer_id
      overlap: 87.1% (sampled)
      note: legacy system uses VARCHAR format
  confidence: 0.91
  detection: column_pattern + value_sampling + llm_semantic

A.3 Ontology

Domain: Commerce
  description: Revenue, orders, subscriptions, and customer lifecycle
  entities: [customer, product, order, subscription]
  key_tables:
    - duckdb.main.orders (fact: transactions)
    - duckdb.main.customers (dimension: customer attributes)
    - duckdb.main.subscriptions (fact: recurring revenue)
  cross_source_joins:
    - customer_id links orders → customers → subscriptions
  usage_guidance: >
    Always filter orders by status != 'cancelled' for revenue calculations.
    Use subscription start_date, not order_date, for MRR metrics.

A.4 Metrics Store

Metric: monthly_recurring_revenue
  domain: Commerce
  full_query: |
    SELECT
      DATE_TRUNC('month', s.start_date) AS month,
      SUM(s.monthly_amount) AS mrr
    FROM duckdb.main.subscriptions s
    WHERE s.status = 'active'
      AND s.subscription_type = 'recurring'
    GROUP BY 1
  lineage:
    tables: [duckdb.main.subscriptions]
    columns: [start_date, monthly_amount, status, subscription_type]
    filters: [status = 'active', subscription_type = 'recurring']
  confidence: 0.93
  verified_by: domain_expert
  last_validated: 2025-04-10

A.5 Entity-Metric Lineage

customer (entity)
  ├── duckdb.main.customers (master source)
  ├── monthly_recurring_revenue (metric: via subscriptions.customer_id)
  ├── customer_churn_rate (metric: via subscriptions.churned)
  ├── avg_satisfaction_score (metric: via tickets.customer_id)
  └── Commerce (domain), CustomerExperience (domain)

A.6 Business Glossary

Term: customer
  canonical_definition: >
    An individual or organization with an active account,
    identified by customer_id in duckdb.main.customers.
  context_variations:
    - context: Sales
      meaning: Account with signed contract
      filter: "WHERE plan_type IN ('pro', 'enterprise')"
    - context: Product
      meaning: Any user with login activity in last 90 days
      filter: "WHERE last_login > CURRENT_DATE - INTERVAL '90' DAY"
    - context: Support
      meaning: Any user who has filed at least one ticket
      filter: "EXISTS (SELECT 1 FROM mysql_support...tickets t WHERE t.cust_id = CAST(c.customer_id AS VARCHAR))"
  common_misinterpretations:
    - "'Total customers' almost always means Sales definition unless specified"
    - "Do not count free-tier users in customer counts for board metrics"

A.7 Business Rules

Rule: revenue_excludes_refunds
  domain: Commerce
  severity: critical
  description: >
    All revenue metrics must exclude refunded orders.
    Filter: status != 'refunded' AND status != 'cancelled'
  affected_metrics: [monthly_recurring_revenue, total_revenue, arpu]
  impact: >
    Without this filter, revenue is overstated by ~8%.
    Board-reported metrics will not match Finance numbers.

A.8 Scoped Retrieval Response

get_ecl_context(domains: ["Commerce"], entities: ["customer", "order"])

Returns (one call, ~2000 tokens):
{
  tables: [
    { name: "duckdb.main.orders", columns: [...], stats: {...} },
    { name: "duckdb.main.customers", columns: [...], stats: {...} }
  ],
  join_paths: [
    "orders.customer_id = customers.customer_id"
  ],
  metrics: [
    { name: "monthly_revenue", sql: "SELECT ...", dimensions: [...] }
  ],
  business_rules: [
    { name: "revenue_excludes_refunds", filter: "status NOT IN (...)" }
  ],
  glossary: [
    { term: "customer", definition: "...", context_filter: "..." }
  ]
}

Appendix B: Implementation Note — Why a Federated Query Engine Makes a Strong Foundation

The main piece argues that a metadata origin (a system that holds live, authoritative cross-source metadata as a structural byproduct of its normal operation) produces a higher-quality ECL than a metadata consumer that ingests periodic snapshots. This appendix expands on what that means in practice and why federated query engines, and Trino in particular, fit the requirement well.

The substrate that seeds the ECL’s physical layer needs three properties.

Live cross-source schema. Entity resolution depends on simultaneous visibility into every connected source. A unified information_schema queryable in one call, spanning object storage, transactional databases, NoSQL systems, and SaaS connectors, provides this directly. Periodic crawl-and-cache architectures lose this property between refresh cycles, and the staleness propagates upward through entity resolution and confidence scoring.
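
To make “one call” concrete: a sketch using the Trino Python client against two attached catalogs (connection details and catalog names are illustrative):

import trino  # pip install trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="ecl-harvester")
cur = conn.cursor()

# One engine, one dialect, every attached source. Each catalog exposes
# a live information_schema; the harvester unions the ones it governs.
cur.execute("""
    SELECT table_catalog, table_schema, table_name, column_name, data_type
    FROM duckdb.information_schema.columns
    UNION ALL
    SELECT table_catalog, table_schema, table_name, column_name, data_type
    FROM mysql_support.information_schema.columns
""")
live_schema = cur.fetchall()  # current as of this query, not a crawl cycle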

Statistical metadata. Entity resolution quality depends on value overlap sampling. Business rule confidence depends on whether filters match actual data distributions. Both require the column-level statistics (null fractions, distinct value counts, value distributions, ranges) that a cost-based optimizer collects as part of normal query planning. Engines without a CBO produce weaker ECL seeds.

Live runtime cardinalities. Adaptive query plans expose actual cardinalities versus estimates, which validates the statistics layer continuously rather than relying on stale snapshots.

Trino meets all three requirements as a structural property of being a federation engine, not as a feature added for grounding purposes. Starburst Galaxy adds automatically generated column-level lineage from transformation workloads without requiring separate lineage pipeline configuration, which strengthens the entity-metric lineage layer (Layer 5) directly.

The argument is not that Trino is the only viable foundation. It is that the substrate matters, and the more heterogeneous the data estate, the more the substrate matters. A single-warehouse environment doesn’t need cross-source entity resolution because everything is already in one place. An enterprise with a federated data estate spanning multiple systems, where the most important business questions are inherently cross-source, is precisely where a federation-native metadata origin compounds the advantage of a grounding layer built on top of it. The complexity of the estate is the moat.

 
