How does data federation work?

Tutorial: Easily access all of your data, no matter where it lives

August 3, 2023

Evan Smith

Technical Content Manager

Starburst Data

Erin Rosas

Curriculum Developer

Starburst

For data to be valuable, it has to be useful, and that means it needs to drive business insights. But in the modern data landscape, data often resides in multiple locations. For example, an organization might use a data warehouse for one use case and a data lake for another. Sometimes these choices follow an organizational structure, with each department creating its own data sources.

This creates a siloing problem. In such a scenario, gaining insights becomes both complicated and costly, often requiring data to be moved from one system to another to create a single source of truth. But realizing this source of truth can be an endless task, and many businesses either commit limitless resources to it or never achieve success. Think of Sisyphus rolling his rock up the hill for eternity.

What is data federation?

Data federation is an architectural approach that lets you query and combine data across multiple, heterogeneous sources without physically moving or copying all that data into a single repository.

Solving the data silo problem is really what data federation is all about. Federation unlocks the value of your data by connecting multiple data sources. This powerful approach gives businesses more options and increased flexibility, opening up a world of possibilities. Using federation, your organization no longer needs to move data unnecessarily to a central source of truth. Instead, you can focus on creating insights and driving value.

How do you connect to disparate data sources?

Federation is best when it offers lots of options. This ensures that no matter where your data lives, it can connect with data in other sources. Starburst’s connector ecosystem includes 50+ connectors, allowing connections to both cloud and on-prem data sources.

This breadth includes many enhanced proprietary connectors, which widens the available options further. Overall, federation lowers costs, increases convenience, and improves versatility.

Who uses data federation?

Any data professional who manages data, or queries it across multiple sources, can use data federation. This includes:

Data managers (e.g., data engineers, data architects) create catalogs to connect to their organization’s data sources.
Data consumers (e.g., data scientists, data analysts) write queries to federate data across data sources.

How does data federation work?

The Trino SQL query engine uses connectors to communicate with many data sources simultaneously, processing and joining data from disparate sources as needed to complete a query.

Supporting this, our connector ecosystem is broad, and we’re continuously adding and improving connectors.

We connect to a variety of types of data sources, including NoSQL stores like Elasticsearch or MongoDB and relational databases like PostgreSQL. Additionally, we simplify data lake analytics by supporting all major table formats, including Iceberg and Delta Lake, persisted on Amazon S3, Azure Blob, and Google Cloud object stores.
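To make the idea concrete, here is a minimal Python sketch of the pattern described above: each connector hides how its source stores data, and the engine joins the rows in its own memory. This is a toy illustration and not Trino's implementation. The connector functions, the sample rows, and the `federated_join` helper are all hypothetical names invented for this example.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def relational_connector() -> Iterable[Row]:
    """Hypothetical connector standing in for a relational source such as PostgreSQL."""
    yield {"customer_id": 1, "name": "Ada"}
    yield {"customer_id": 2, "name": "Grace"}


def document_connector() -> Iterable[Row]:
    """Hypothetical connector standing in for a NoSQL source such as MongoDB."""
    yield {"customer_id": 1, "total": 40.0}
    yield {"customer_id": 1, "total": 15.5}
    yield {"customer_id": 3, "total": 99.0}


def federated_join(left: Iterable[Row], right: Iterable[Row], key: str) -> List[Row]:
    """Join rows from two sources inside the engine using a simple hash join."""
    # Build phase: index the left-hand rows by the join key.
    index: Dict[object, List[Row]] = {}
    for row in left:
        index.setdefault(row[key], []).append(row)

    # Probe phase: look up each right-hand row and merge the matches.
    joined: List[Row] = []
    for row in right:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Neither source is copied into the other; rows are combined only for this query.
result = federated_join(relational_connector(), document_connector(), "customer_id")
for row in result:
    print(row)
```

Only customers present in both sources appear in the result, which matches the behavior of an inner join in SQL.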

The following image displays some of the connectors included in our connector ecosystem.

How do I federate data with Starburst platforms?

Federation is easy with Starburst Galaxy. To get started, simply create catalogs to connect to the data sources you’d like to include.

Next, join tables from different data sources in the same way you would join tables from the same data source.
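As a rough, runnable analogy, the sketch below uses Python's built-in `sqlite3` module with two attached in-memory databases standing in for two catalogs. It is not Starburst Galaxy or Trino: in Trino, tables are typically addressed as `catalog.schema.table`, while SQLite uses two-part names. The names `crm`, `lake`, `customers`, and `orders`, and the sample rows, are assumptions made up for this illustration.

```python
import sqlite3

# One connection plays the role of the query engine.
conn = sqlite3.connect(":memory:")

# Each attached database stands in for a separate data source (a "catalog").
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?)",
    [(1, 40.0), (1, 15.5), (2, 7.0)],
)

# The join reads like any single-source join; only the table prefixes
# reveal that the two tables live in different places.
query = """
SELECT c.name, SUM(o.total) AS total_spent
FROM crm.customers AS c
JOIN lake.orders AS o ON o.customer_id = c.customer_id
GROUP BY c.name
ORDER BY c.name
"""
rows = conn.execute(query).fetchall()
for name, total_spent in rows:
    print(name, total_spent)
```

The point of the analogy is the shape of the query: the join syntax does not change when the tables come from different sources.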

The following video walks through federation in more detail using a sample dataset. You can use the same dataset with Starburst Galaxy.

Want to try federating data for yourself?

Starburst Academy has you covered. We’ve got several hands-on labs and tutorials to get you up and running quickly with federation.

FAQs about data federation

What is the difference between data federation and data virtualization?

Data federation is a technology or capability that creates a unified, virtual view of data across disparate sources, enabling querying without physically moving data.

Data virtualization, on the other hand, is a broader concept that encompasses federation, along with services such as metadata management, data abstraction, and security, providing a complete layer that isolates applications from data sources.

In other words, data federation is one component of data virtualization, focusing on integrating data for querying.

What business advantages does data federation offer beyond simply accessing data?

Data federation significantly reduces costs by eliminating the need for redundant data copies and storage infrastructure. It accelerates time-to-insight by providing real-time access to current data across multiple systems, thereby improving decision-making accuracy. By enabling flexible data use without complex data centralization processes, data federation also enhances organizational agility. It enables teams to leverage diverse data sources more efficiently.

How do query engines like Trino process data in a federated environment?

Query engines like Trino use connectors to simultaneously communicate with various data sources, regardless of their native format or location. When a query is initiated, the engine processes and intelligently joins data from these disparate sources on demand. It does so without moving the underlying data.

This approach enables efficient execution and result aggregation, delivering a unified view of the requested information to the user.

What challenges should organizations consider when implementing data federation?

Two primary challenges exist. First, maintaining consistent data quality and definitions across numerous independent sources can be difficult. Second, optimizing complex queries that span multiple systems can be a significant hurdle and requires careful query tuning.

In addition, security, authentication, and authorization protocols across a federated environment require vigilant maintenance, and diverse data governance policies demand meticulous planning and ongoing management.

Tutorial: Federate multiple data sources

Practice federating data in Starburst Galaxy and using some of the other features available

Course: Federate data with a simple query

Set up Starburst Galaxy and federate data with a simple query.
