Schema Discovery

Schema discovery is a data engineering practice for finding and documenting the structure of data sources within a repository, such as a relational database, data warehouse, or data lake.

Effective schema discovery tools are essential to the maintenance and usability of big data repositories, with significant downstream impacts on data governance and insight generation.

Schema discovery is an essential factor in a data lake’s ability to support big data analytics. Weak discoverability makes data less accessible and undermines good governance practices.

This guide to schema discovery will explain how making the structure and contents of enterprise data assets easy to find addresses the core challenges of data lake analytics.

Discoverability challenges in data lakes

Building a data repository using the commodity object storage of services like Amazon S3 or Azure Blob Storage promises significant cost, performance, and scalability advantages over relational databases or data warehouses. However, these benefits require tradeoffs.

A warehouse or database schema imposes structure on data at the time of ingestion. Engineers devote considerable time to creating data models and schemas that meet the company’s analytics requirements. Users can access these schemas to understand the contents of any dataset in the warehouse.

A data lake, on the other hand, stores raw data in flat object storage, allowing the lake to hold different data types from more varied sources. This flexibility can be a strength but comes with multiple challenges, including:

Managing volume and velocity

Data flows through the lake’s ingestion pipelines weekly, daily, hourly, and minute-by-minute at ever-increasing volumes. In addition to the traditional output from transactional systems, data lakes must integrate streaming data from operational, e-commerce, and other real-time systems.

The volume and velocity of change become a management headache. Data teams must constantly update the lake’s metadata and documentation or risk turning the data lake into a data swamp full of undocumented and inaccurate data.

Fragmented enterprise data architectures

In theory, a data lake is the centralized repository for enterprise analytics. In practice, data lakes never contain all the data needed for insight generation. Data consumers rarely know enough about their company’s storage architecture to find the extra data they need. Data engineers have that expertise but are a finite resource, quickly becoming bottlenecks that inhibit analytics.

Governance inconsistency

Fragmented architectures and weakly documented datasets pose a significant challenge to data governance. Data quality standards vary across organizational boundaries, access control becomes inconsistent, and regulatory compliance becomes riskier.

Friction in pipeline development

Data pipeline development and maintenance become more burdensome with poor discoverability. Engineers must devote extra time and effort to find, understand, and manage data sources.

That effort doesn’t end once a pipeline is in production. Poor discoverability makes tracking changes in data sources more difficult. Even small changes could break the pipeline or make its output unreliable.

Data accessibility and awareness

If poor discoverability creates difficulties for engineers, it creates insurmountable challenges for data consumers. Analysts and data scientists can’t understand data’s lineage, quality, or meaning without the ability to discover schemas. As a result, data analytics cannot fully address business questions, and machine learning algorithms can’t train on the most complete datasets.

Maximizing the value of data stored in a data lake

Schema discovery features search a data lake’s object storage and retrieve metadata about the datasets it contains. This metadata becomes schemas that define table structures and allow the creation of table and column previews to aid exploration and query development.

By making schemas more discoverable, these features let companies maximize the value of their data lakes through:

Expanded data accessibility

Putting discoverability tools in data consumers’ hands has a two-fold benefit. First, it reduces their dependence on central data teams and frees engineers from having to field low-level requests. Second, discoverability empowers users to explore the company’s data assets to produce richer analyses for data-driven decision-making. Analysts can conduct ad hoc analyses or create visualizations to support decision-makers without waiting in the data team’s queue. Similarly, data scientists can interactively query data assets to build their machine learning projects’ frameworks and training data.

Data integration and data virtualization

Making it easier for engineers to discover schemas streamlines pipeline development and helps eliminate internal barriers between data assets. Engineers can develop ingestion pipelines faster when they have reliable access to information about data sources. Likewise, discoverability tools can quickly identify schema changes that could disrupt existing pipelines.
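The change-detection idea in the paragraph above can be sketched as a diff between two schema snapshots. This is an assumed, simplified representation (column-name → type dictionaries), not the mechanism any specific product uses:

```python
def diff_schemas(old, new):
    """Compare two column-name -> type mappings and report drift
    that could disrupt a downstream pipeline."""
    return {
        "added":   {c: t for c, t in new.items() if c not in old},
        "removed": {c: t for c, t in old.items() if c not in new},
        "changed": {c: (old[c], new[c])
                    for c in old if c in new and old[c] != new[c]},
    }

previous = {"order_id": "BIGINT", "amount": "DOUBLE"}
current  = {"order_id": "BIGINT", "amount": "DECIMAL(10,2)",
            "region": "VARCHAR"}
print(diff_schemas(previous, current))
# {'added': {'region': 'VARCHAR'}, 'removed': {},
#  'changed': {'amount': ('DOUBLE', 'DECIMAL(10,2)')}}
```

A pipeline run could call a check like this before ingestion and alert engineers when a source's schema has drifted, rather than failing silently mid-load.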

Schema discoverability also unifies an enterprise’s disparate data sources through data virtualization. A schema tool that can access every database, warehouse, and operational system breaks down barriers that inhibit data usage.

Data optimization

In addition to helping engineers tear down data silos, discoverability supports data optimization. The stale and inaccurate datasets of a data swamp become easier to find. Engineers can integrate schema information into workflows that improve data quality, currency, and integrity.

Consistent data governance

With silos eliminated and swamps drained, data governance systems can bring all enterprise data sources within uniform quality, integrity, and security policies. The reliable documentation of data asset schemas directly translates into consistent governance practices and makes it easier to automate governance enforcement, such as ensuring that access permissions comply with regional data privacy regulations.
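As a toy illustration of automated enforcement, a policy check can gate access by the region a dataset's schema metadata records. The role names, policy table, and function are entirely hypothetical:

```python
# Hypothetical policy: which regions each role may read data from.
POLICY = {
    "analyst_eu":     {"EMEA"},
    "analyst_global": {"EMEA", "APAC", "AMER"},
}

def can_read(role, dataset_region):
    """Return True if the role's policy permits reading data
    tagged with the given region."""
    return dataset_region in POLICY.get(role, set())

print(can_read("analyst_eu", "APAC"))      # False
print(can_read("analyst_global", "APAC"))  # True
```

The point is that enforcement like this only works when every dataset carries reliable, discoverable metadata to check against.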

Schema discovery and Starburst: How does schema discovery work in data integration?

Starburst Galaxy is a data lakehouse analytics platform based on Trino’s open-source massively parallel processing SQL query engine. Galaxy enhances Trino with enterprise-class performance and manageability features. One of those features is Starburst Gravity, a universal discovery, governance, and sharing layer that makes schema discovery accessible to any user.

Schema discovery helps users find both existing and newly added datasets across the enterprise’s data architecture. Starburst users can examine object storage metadata to discover and register tables and views. A schema discovery run will produce information about a source, including:

  • Bucket URL.
  • Timestamp of the discovery run.
  • Status of the discovery run.
  • Changes made since the previous run.
  • Logs of schema and table updates.

Gravity’s schema discovery will also return table information, including:

  • Schema name.
  • Table name.
  • File format.
  • Changes since the previous run.
  • Table columns and configuration options.
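The two lists above can be pictured as a simple record structure for a discovery run's output. The class and field names here are hypothetical, chosen only to mirror the fields listed; they are not Gravity's actual data model or API:

```python
from dataclasses import dataclass, field

@dataclass
class DiscoveredTable:
    """One table found by a discovery run (hypothetical shape)."""
    schema_name: str
    table_name: str
    file_format: str
    columns: dict           # column name -> type
    changed: bool = False   # changed since the previous run?

@dataclass
class DiscoveryRun:
    """Source-level results of a discovery run (hypothetical shape)."""
    bucket_url: str
    started_at: str                 # ISO-8601 timestamp of the run
    status: str                     # e.g. "SUCCEEDED" or "FAILED"
    tables: list = field(default_factory=list)
    log: list = field(default_factory=list)

run = DiscoveryRun(
    bucket_url="s3://example-bucket/landing/",
    started_at="2024-05-01T12:00:00Z",
    status="SUCCEEDED",
)
run.tables.append(DiscoveredTable(
    schema_name="discovered",
    table_name="orders",
    file_format="PARQUET",
    columns={"order_id": "BIGINT", "amount": "DOUBLE"},
))
run.log.append("registered table discovered.orders")
print(run.status, len(run.tables))  # SUCCEEDED 1
```

Reading the results of a run like this tells a user where the data lives, when it was last scanned, and what changed, which is exactly the context needed before writing queries against it.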

Read more about Starburst Galaxy’s discoverability features, or take this free tutorial to learn how to use schema discovery in Starburst.
