Why Data Products and Lakehouses Work Better Together

Data products and data lakehouses represent two sides of the same architectural evolution. Data products apply product thinking to data assets, treating data consumers as customers and emphasizing discoverability, governance, and clear interfaces. Meanwhile, data lakehouses provide the technical foundation that enables these product capabilities to be practical at enterprise scale.

Data system complexity is the norm 

The combination addresses a fundamental challenge: enterprise data is distributed across multiple systems, cloud platforms, and regions, yet teams need unified access with consistent governance. 

Traditional approaches force a choice between moving all data to a central location and accepting fragmented access patterns. Data products built on data lakehouses offer a third path: one that preserves data locality while enabling federated consumption.

What data products solve

A data product is a domain-owned, consumable data asset with defined interfaces, service-level objectives, and governance controls. Unlike traditional data sharing, data products treat consumers as customers with specific needs and expectations.

Data products expose underlying datasets through curated interfaces that abstract away the complexity of what the data looks like and where it lives. This flexibility serves different consumption patterns, from business intelligence to machine learning feature stores to real-time decisioning systems.

The approach solves several persistent problems. Teams can discover and access data without navigating complex organizational boundaries. Domain experts maintain ownership and accountability for data quality. Consumers get predictable interfaces with documented schemas and evolution policies. Governance becomes distributed but coordinated, rather than centralized and brittle.
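
To make this concrete, here is a minimal sketch of how a team might describe a data product in code. The `DataProduct` dataclass, its fields, and the example values are illustrative assumptions, not the API of any particular platform.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative descriptor for a domain-owned data product."""
    name: str                          # discoverable identifier
    owner: str                         # accountable domain team
    output_port: str                   # where consumers read it, e.g. an Iceberg table
    schema_version: str                # documented schema with an evolution policy
    freshness_slo_minutes: int         # service-level objective for freshness
    classification: str = "internal"   # governance label driving access policy
    tags: list[str] = field(default_factory=list)

# A hypothetical product published by the orders domain
orders_summary = DataProduct(
    name="orders.daily_summary",
    owner="orders-domain-team",
    output_port="iceberg.analytics.orders_daily_summary",
    schema_version="2.1.0",
    freshness_slo_minutes=60,
    classification="confidential",
    tags=["finance", "reporting"],
)
```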

How data lakehouses enable the architecture

Data lakehouses combine the storage economics and flexibility of data lakes with the performance and governance capabilities of data warehouses. They store data in open formats on object storage while providing ACID transactions, schema enforcement, and SQL query performance.

For data product architectures, lakehouses provide the storage and metadata layer that makes distribution practical. Open table formats such as Apache Iceberg add the schema evolution, partition management, and time travel capabilities that data products need to operate reliably across multiple consumers.
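
Time travel in particular gives every consumer a reproducible view of the data. The sketch below assumes a Trino coordinator with an Iceberg catalog; the host, catalog, and table names are placeholders.

```python
import trino  # pip install trino

# Host, catalog, schema, and table names below are hypothetical
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="analytics",
)
cur = conn.cursor()

# Current state of the table
cur.execute("SELECT count(*) FROM orders")
print("rows now:", cur.fetchone()[0])

# Time travel: the same table as it existed at an earlier point in time
cur.execute("""
    SELECT count(*)
    FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'
""")
print("rows on Jan 1:", cur.fetchone()[0])
```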

The lakehouse also serves as the integration point between batch and streaming data. This matters because data products often need to combine historical analysis with real-time updates, requiring a storage layer that handles both patterns efficiently.

Technical enablers that make it work

Several technical developments enable data products and lakehouses to work together effectively. Apache Iceberg's schema and partition evolution features let data product owners modify schemas and partitioning strategies without breaking downstream consumers.
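
For example, adding an optional column to an Iceberg table is a metadata-only change, so existing readers keep working. Below is a minimal sketch with PyIceberg, assuming a configured catalog named prod and a hypothetical analytics.orders table.

```python
from pyiceberg.catalog import load_catalog  # pip install pyiceberg
from pyiceberg.types import StringType

# "prod" refers to a catalog configured for PyIceberg; names are placeholders
catalog = load_catalog("prod")
table = catalog.load_table("analytics.orders")

# Adding an optional column is a metadata-only operation in Iceberg, so
# existing snapshots, data files, and downstream readers are unaffected.
with table.update_schema() as update:
    update.add_column("discount_code", StringType(), doc="promotion code, nullable")
```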

Meanwhile, Trino, the distributed SQL query engine, enables federation across multiple data sources and formats. Teams can query data products whether they live in Iceberg tables, traditional data warehouses, operational databases, or streaming platforms. This query federation reduces the need to copy data between systems while preserving the ability to join across them.
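
In practice, a single federated query can join a lakehouse table with an operational database, as in the sketch below. It uses the Trino Python client, and the catalog, schema, and table names are hypothetical.

```python
import trino  # pip install trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One query spans two systems: an Iceberg table in the lakehouse and a
# customers table in an operational PostgreSQL database.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM iceberg.analytics.orders AS o
    JOIN postgresql.public.customers AS c
      ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY revenue DESC
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```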

Flexibility and interoperability 

Overall, this data architecture’s flexibility comes from its separation of storage, metadata, and compute. Data products can be accessed through different engines and tools while maintaining consistent metadata and governance. This prevents vendor lock-in and lets teams choose the right tool for each workload.

Modern streaming ingestion and file ingestion capabilities bridge the gap between real-time data sources and open data lakehouse storage. Platforms can now ingest from Kafka and similar systems directly into Iceberg tables with exactly-once guarantees and sub-minute latency, making fresh data available to data products quickly.
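
To illustrate only the basic pattern (production connectors supply the exactly-once guarantees, schema handling, and failure recovery mentioned above), the sketch below polls a Kafka topic and appends each micro-batch to an Iceberg table. The topic, broker, and table names are hypothetical.

```python
import json
import pyarrow as pa
from kafka import KafkaConsumer             # pip install kafka-python
from pyiceberg.catalog import load_catalog  # pip install pyiceberg

# Simplified micro-batch loop; a real pipeline manages offsets, matches the
# Arrow schema to the Iceberg table schema, and handles failures.
consumer = KafkaConsumer(
    "orders-events",                        # hypothetical topic
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda v: json.loads(v),
    max_poll_records=10_000,
)
table = load_catalog("prod").load_table("analytics.orders_events")

records = []
for partition_records in consumer.poll(timeout_ms=5_000).values():
    records.extend(msg.value for msg in partition_records)

if records:
    # Each append creates a new Iceberg snapshot visible to all consumers
    table.append(pa.Table.from_pylist(records))
```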

Addressing AI and ML requirements

AI and analytics workloads create specific demands that data products operating on lakehouses are particularly well-suited to handle. Much of this stems from AI's need for context.

AI is hungry for context because LLM training data is generic; the remedy is contextual, enterprise-specific data. Accessing that data requires an architecture that can adapt to different consumers, and data products fit this need because they can expose the same underlying, context-rich data through different interfaces.

Governance becomes particularly important for AI workloads because of regulatory and compliance requirements. Data products provide clear lineage and ownership, while the lakehouse enables fine-grained access controls and audit trails. Teams can track which data was used to train which models and ensure appropriate access controls remain in place.

Implementation patterns that work

Successful implementations follow several common patterns. Teams start by identifying natural domain boundaries and assigning clear ownership for creating and managing data products within each domain. The lakehouse provides a shared technical foundation, but governance and quality remain distributed to domain teams.

Open table formats become the standard interface for analytical data products. Iceberg tables provide the schema management and performance characteristics that consumers need, while remaining accessible to multiple query engines and processing frameworks.

In most successful architectures, data federation comes before data centralization. Teams explore and prototype with federated queries across existing systems, then selectively move data into the lakehouse when performance or governance requirements justify the cost and complexity of data movement.
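
Here is a hedged sketch of that progression, using the Trino Python client with hypothetical catalog and table names: prototype with a federated query, then materialize the product into an Iceberg table once it proves its value.

```python
import trino  # pip install trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="data-eng")
cur = conn.cursor()

# Step 1: prototype with a federated query across existing systems (no data movement)
cur.execute("""
    SELECT c.region, o.order_date, sum(o.amount) AS revenue
    FROM postgresql.public.orders AS o
    JOIN mysql.crm.accounts AS c ON o.account_id = c.id
    GROUP BY c.region, o.order_date
""")
preview = cur.fetchall()

# Step 2: once the product earns its keep, materialize it in the lakehouse
cur.execute("""
    CREATE TABLE iceberg.analytics.revenue_by_region AS
    SELECT c.region, o.order_date, sum(o.amount) AS revenue
    FROM postgresql.public.orders AS o
    JOIN mysql.crm.accounts AS c ON o.account_id = c.id
    GROUP BY c.region, o.order_date
""")
cur.fetchall()  # drive the statement to completion
```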

Governance across distributed systems

Data products and lakehouses together enable federated governance that scales across multiple systems and regions. Rather than centralizing all policy enforcement, the architecture allows consistent policy expression while distributing enforcement.

Row-level security and column masking can be applied at query time, letting the same underlying data serve different access levels based on user context. Policy engines like Apache Ranger integrate with both the lakehouse storage layer and the query federation layer to provide consistent enforcement.
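
To show the effect (a policy engine such as Ranger expresses this declaratively rather than in hand-written SQL), a query-time mask behaves roughly like the hypothetical invoker-security view below; all names are placeholders.

```python
import trino  # pip install trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="platform")
cur = conn.cursor()

# Same underlying rows; what each user sees depends on who is querying
cur.execute("""
    CREATE VIEW iceberg.analytics.customers_masked SECURITY INVOKER AS
    SELECT
        id,
        region,
        CASE
            WHEN current_user IN ('auditor', 'privacy-officer') THEN email
            ELSE regexp_replace(email, '.+@', '***@')
        END AS email
    FROM iceberg.analytics.customers
""")
cur.fetchall()  # drive the statement to completion
```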

Data sovereignty requirements fit naturally into this model. Data products can remain in their regions of origin while still being discoverable and accessible through federated queries. Cross-region access happens only when needed and with appropriate controls in place.

What this means for data strategy

Data products and lakehouses together represent a shift from centralized data platforms to federated data architectures. This doesn’t eliminate the need for platform teams, but changes their focus from data movement and storage to enabling self-service capabilities and maintaining governance standards.

The combination reduces data movement costs while improving data freshness and access flexibility. Teams can prototype and explore with federation, then optimize performance through selective data placement and caching strategies.

For organizations dealing with multiple cloud platforms, regions, or acquisition integrations, this architecture provides a path to unified data access without massive migration projects. Teams can publish data products from existing systems and access them through consistent interfaces while the underlying systems evolve independently.

The technical foundation supports both current analytics needs and emerging AI workloads through the same architectural patterns. This convergence simplifies platform strategy and reduces the need for separate AI-specific data platforms.

Organizations across various industries are seeing the benefits of this approach. Financial services organizations leverage the architecture for risk management and regulatory reporting, while healthcare providers use it for research and patient outcomes analysis. Even federal agencies are adopting these patterns for secure, compliant data sharing across departments.

The combination of data products and lakehouses represents a mature approach to building data applications that can scale from departmental use cases to enterprise-wide platforms. Understanding the complete data product lifecycle becomes crucial for organizations looking to implement these patterns successfully.

Start for Free with Starburst Galaxy

Try our free trial today and see how you can improve your data performance.