Layering a data fabric over existing infrastructure creates a richer, more accessible data ecosystem that enables advanced data analytics to power business outcomes.
Last updated: Feb 12, 2024
This article will explore data fabrics, how they work, and why companies use them to create enterprise-spanning analytics platforms.
A data fabric is a data management architecture that uses artificial intelligence and machine learning algorithms to automate data ingestion best practices. It creates a data virtualization layer that eliminates silos, and its interconnected metadata, organized into knowledge graphs, makes all enterprise data sources accessible to business users.
Data warehouses use centralized repositories of structured data to support high-performance data analytics. However, these systems require copying and moving data through complex, difficult-to-maintain ETL pipelines.
Data fabrics leave data at the source, reducing data management overhead and the risks of data duplication. In addition, data fabrics let users access any data format, including real-time data.
Data lakes are repositories of structured and unstructured data kept in object-based cloud data storage services. Although an ELT model simplifies data ingestion, data lakes cannot collect data from every enterprise source.
Data fabrics integrate entire data ecosystems, allowing companies to optimize storage strategies and enhance data sharing.
Data meshes decentralize data management by pushing responsibilities to business domains, where data producers and consumers create, publish, discover, and manage curated datasets, also known as data products.
Data fabrics provide consistent governance and compliance. Central data teams set metadata standards, implement artificial intelligence processes, and control the virtualized environment.
Data fabrics resolve the tension between centralization and decentralization that has plagued enterprises in this age of big data.
As data volume, velocity, and variety continue increasing, companies struggle to exploit their most valuable data assets.
At the same time, the number of enterprise data sources keeps multiplying across on-premises, hybrid cloud, and multi-cloud environments.
Companies alternate between two data management strategies to control these big data forces.
Centralization is a natural response to big data’s disruptions. Consolidating data management decision-making within a central data team gives the team more control over the company’s architecture by consistently applying data semantics, metadata, column names, formats, and other design policies. Centralization also brings data security and governance enforcement within a single team.
However, centralization introduces friction as the data management team becomes an organizational bottleneck through which all but the most basic requests must pass. Without an unlimited budget, data teams prioritize requests, frustrating business users with the slow data delivery process.
The other option is to push control away from the central organization by letting domains decide how to store and share data. Decentralization gives business units more flexibility to choose optimization strategies that make sense for their operations.
Yet, decentralization introduces risks. Domains may implement security and governance policies differently. Semantic drift will result in metadata variances between domains. Even worse, data silos may make entire datasets inaccessible to anyone outside a domain. As a result, data processing workflows become more complex as each domain’s data teams must cooperate to bring consistency to new datasets.
Over the years, companies have developed formal and informal architectures that blend both approaches, often in ways that combine their worst aspects. Data fabrics resolve this tension by leveraging AI-based automation to centralize data management while alleviating pressures on resource-constrained data teams.
During data integration, the fabric’s AI systems enrich each new dataset’s metadata with the company’s standardized business semantics. This rich metadata connects the new dataset into a knowledge graph, linking it with every other dataset.
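The enrichment-and-linking step can be illustrated with a small sketch. This is not any vendor's implementation; the glossary, dataset names, and column names below are invented for illustration. The idea is that mapping raw column names onto standardized business terms lets a new dataset connect automatically to every existing dataset that shares a term:

```python
# Hypothetical sketch of metadata enrichment and knowledge-graph linking.
# The glossary, dataset names, and column names are invented examples.

# Company-standard business semantics: raw column name -> business term.
GLOSSARY = {
    "cust_id": "customer_id",
    "customer": "customer_id",
    "amt": "order_amount",
    "order_total": "order_amount",
}

def enrich(columns):
    """Map raw column names onto standardized business terms."""
    return {col: GLOSSARY.get(col, col) for col in columns}

def link(catalog, name, columns):
    """Register a dataset and return the datasets it links to --
    those sharing at least one business term."""
    terms = set(enrich(columns).values())
    edges = {other for other, other_terms in catalog.items()
             if terms & other_terms}
    catalog[name] = terms
    return edges

catalog = {}
link(catalog, "crm.contacts", ["cust_id", "email"])
related = link(catalog, "sales.orders", ["customer", "amt"])
print(related)  # the two datasets share the customer_id term
```

Because both `cust_id` and `customer` resolve to the same business term, the new `sales.orders` dataset is linked to `crm.contacts` without any manual mapping work.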
A data fabric operates as a layer on top of a company’s existing storage infrastructure. Its rich metadata and knowledge graphs virtualize the on-premises and cloud architecture to provide a single interface for data exploration, discovery, and processing.
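In miniature, the virtualization layer is a registry that resolves logical dataset names to physical sources, so users query one interface regardless of where the data lives. A minimal sketch, with invented source names and rows standing in for real connectors:

```python
# Minimal sketch of a virtualization layer: one query interface, many
# physical sources. Source names and rows are illustrative; in practice
# each entry would be a connector to a warehouse, lake, or database.

SOURCES = {
    "warehouse.sales": lambda: [{"region": "EMEA", "revenue": 120}],
    "lake.events":     lambda: [{"event": "login", "count": 42}],
}

def query(table):
    """Resolve a logical table name to its physical source and fetch rows."""
    try:
        return SOURCES[table]()
    except KeyError:
        raise LookupError(f"unknown table: {table}")

print(query("warehouse.sales"))
```

The caller never learns whether `warehouse.sales` lives on-premises or in the cloud; that indirection is what lets data stay at the source.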
By automating metadata management, leaving data at the source, and providing a virtualized interface for different data sources, a data fabric delivers the benefits of centralization while empowering business units and individual data users.
Automating routine workloads makes life easier for the central data team. They can focus on the higher-level responsibilities of fabric orchestration, such as defining semantics and metadata standards. Freed from more onerous tasks, the data teams become more accessible to business users.
At the same time, the data fabric reduces demand for data engineering by making company-wide data more consistent and accessible. Machine learning tools recommend datasets to streamline discovery. Integrating different datasets requires fewer and simpler data pipelines since the fabric guarantees semantic consistency and minimum data quality levels.
Unlike the migration from databases to data warehouses and then to data lakes, implementing a data fabric does not require replacing or duplicating a company’s data infrastructure. Instead, companies weave a data fabric from three key elements: the storage infrastructure they already have, artificial intelligence and machine learning tools, and their existing data teams.
Data fabric initiatives build upon the storage infrastructure and data management tools already in place. Nothing about how existing sources store and process data needs to change.
Moreover, companies do not need to migrate data into new repositories. Data can remain in place, whether that’s a transactional database, a data warehouse, or a data lake.
Initially, this approach allows domains to retain control over their data sources. Eventually, datasets that support diverse use cases will become the central data team’s responsibility.
Data fabric solutions rely on artificial intelligence to automate ingestion workflows by extracting and analyzing metadata from new sources. Machine learning algorithms identify metadata patterns to create knowledge graphs that group related datasets for various use cases, significantly simplifying data discovery.
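One simple way such grouping can work, sketched here with invented datasets and a basic similarity measure (real systems use far richer signals), is to score datasets by the overlap of their business-term metadata and recommend the closest matches:

```python
# Illustrative sketch: recommending related datasets by metadata overlap,
# using Jaccard similarity over business terms. Names are invented.

def jaccard(a, b):
    """Similarity of two term sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

DATASETS = {
    "sales.orders":  {"customer_id", "order_amount", "order_date"},
    "crm.contacts":  {"customer_id", "email"},
    "ops.shipments": {"order_id", "carrier"},
}

def recommend(name, threshold=0.2):
    """Rank other datasets whose metadata similarity clears the threshold."""
    terms = DATASETS[name]
    scores = {other: jaccard(terms, other_terms)
              for other, other_terms in DATASETS.items() if other != name}
    return sorted((o for o, s in scores.items() if s >= threshold),
                  key=lambda o: -scores[o])

print(recommend("sales.orders"))  # crm.contacts shares customer_id
```

Here `crm.contacts` is recommended because it shares the `customer_id` term, while `ops.shipments` shares nothing and is filtered out.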
A data fabric’s AI-powered automation does not replace the central data team’s engineers. In fact, it makes their roles more important than ever. They define the standards for data quality, business semantics, and metadata that ensure AI and ML algorithms process data that will yield consistently reliable outputs. Their work lets data scientists and business users trust the output of dataset recommendation engines and accelerate data analytics.
Data virtualization creates a data fabric’s interface between architecture and analytics. Data resides in on-premises data centers and cloud data platforms but appears to users as a single enterprise data source.
Between its rich metadata, knowledge graphs, and recommendation engines, a data fabric makes it easier for users at various skill levels to access data. A self-service model lets analysts find the right data to support decision-makers using their existing business intelligence apps. Data scientists can count on the consistency between different data sources to reduce their data preparation workloads.
Starburst has become the query fabric for organizations taking control of big data's twin pressures of centralization and decentralization. Starburst's data analytics platform virtualizes enterprise data architectures to create a single point of access to every data source for any authorized user.
Connectors for over fifty enterprise data sources let companies unify their entire storage architecture within a single data analytics platform. Abstracting disparate data sources within Starburst’s virtualization layer creates a unified view of a company’s data assets without requiring complex, costly, and risky data migration projects. Freed from routine pipeline development, data teams can focus on improving semantic consistency and data quality to enhance every data asset’s value.
Gravity, Starburst Galaxy's universal discovery, governance, and sharing layer, simplifies data access. It automatically catalogs a new source's metadata to speed data discovery.
Similarly, Starburst’s query engine uses ANSI-standard SQL, letting experienced users write complex queries with little learning curve. Any SQL-compatible business intelligence app or machine learning platform can integrate with Starburst to serve the needs of data scientists and decision-makers alike.
Starburst’s robust security features include role-based and attribute-based access controls that automate fine-grained security and governance policies. These policies can control access to data catalogs and tables down to the row and column level. Filtering and masking rules can determine whether users may access individual records or only aggregated data. Thanks to Starburst’s security and governance capabilities, companies can deliver access to globally distributed data stores without compromising local privacy and data sovereignty regulations.
Starburst enhances Trino's massively parallel SQL query engine with query optimizations that balance cost and performance on large-scale data workloads. Smart indexing, caching, query pushdown, automatic query planning, and other features multiply query performance at petabyte scale for a fraction of the compute costs.