Not only do quality issues impact data product availability, but their identification, investigation, and resolution increasingly consume limited data team resources. Data observability adapts DevOps practices to automate data quality management at scale. This guide will introduce the emerging practice of data observability, explain its benefits, and describe how enterprises implement data observability initiatives.
Data observability is the set of practices that help organizations understand data health and performance across the enterprise. Patterned after application observability in DevOps, data observability provides the visibility and metrics data teams need to predict, prevent, and detect data quality issues to support more effective data-driven decision-making.
Adopting a data observability framework lets data engineering teams implement the technologies and processes needed to maximize the quality of enterprise data assets.
One such framework consists of the following five pillars of data observability:
High-quality data is essential to data-driven decision-making. Data analysts depend on access to quality data sources to create the dashboards and other products company leadership relies upon to manage the business. Without high-quality inputs, data scientists waste time preparing data for machine learning and artificial intelligence projects.
Observability’s data quality pillar describes the state of an organization’s data sets. Data quality monitoring systems generate statistics about the contents of every data table, including:
By identifying data quality issues quickly, observability tools jump-start the issue resolution process.
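As a rough illustration of table-level profiling, the sketch below computes the kinds of statistics this guide mentions elsewhere (row volume, NULL and unique counts, and value ranges) for a single column. The function name `profile_column` and the sample `orders` table are hypothetical and do not come from any particular observability product.

```python
from typing import Any


def profile_column(rows: list[dict[str, Any]], column: str) -> dict[str, Any]:
    """Compute simple profiling statistics for one column of a table."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "unique_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# Hypothetical sample table; one order is missing its amount.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 35.5},
    {"order_id": 4, "amount": 20.0},
]

amount_profile = profile_column(orders, "amount")
print(amount_profile)
```

A production system would more likely compute these statistics inside the query engine than in application memory, but the resulting numbers are the same.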
Data governance is the set of policies that aligns data management with business strategy. These policies include the criteria observability systems use to evaluate data quality statistics. Do the values within a table fall within an acceptable range? Is the table’s volume larger or smaller than expected? Observability rules based on governance standards will quickly identify data quality issues.
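The two questions above (value range and table volume) can be expressed as rules evaluated against a table's statistics. The following is a minimal sketch under assumed names: the `Rule` class, the `evaluate` function, and every threshold are hypothetical placeholders for whatever a governance policy actually specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the statistics pass


def evaluate(stats: dict, rules: list[Rule]) -> list[str]:
    """Return the names of the rules that the statistics violate."""
    return [rule.name for rule in rules if not rule.check(stats)]


# Hypothetical thresholds standing in for governance standards.
rules = [
    Rule("amount_within_range", lambda s: 0 <= s["min"] and s["max"] <= 10_000),
    Rule("volume_as_expected", lambda s: 900 <= s["row_count"] <= 1_100),
]

# Hypothetical statistics for one table: values look fine, volume is low.
stats = {"row_count": 640, "min": 5.0, "max": 820.0}

violations = evaluate(stats, rules)
print(violations)
```

Keeping the rules as data, separate from the code that evaluates them, means policy owners can adjust thresholds without touching the evaluation logic.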
Observability policies can also identify secondary effects of poor data quality. For example, a change to the organization of data tables could signal an underlying issue. Data observability solutions will monitor for schema changes, generate alerts, and identify who made the change. An investigation can then determine what underlying issues drove the change and whether the revised schema impacts other workflows.
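Schema change detection can be sketched as a comparison between two snapshots of a table's column definitions. The example below assumes each snapshot is a simple mapping of column name to type. The `diff_schema` function and the sample schemas are hypothetical, and the sketch does not cover identifying who made the change, which would depend on audit metadata from the underlying platform.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column: type} snapshots and report what changed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(col for col in set(old) & set(new) if old[col] != new[col])
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical snapshots taken before and after a pipeline change.
previous = {"order_id": "bigint", "amount": "double", "region": "varchar"}
current = {"order_id": "bigint", "amount": "varchar", "country": "varchar"}

changes = diff_schema(previous, current)
print(changes)

# Raise an alert only when something actually changed.
if any(changes.values()):
    print("ALERT: schema change detected")
```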
Data lineage visualizes the metadata generated as data flows through data architectures from its source to each downstream data product and repository. This metadata not only describes where and when data travels but also how pipelines transform the data along the way. Data lineage plays a critical role in troubleshooting data quality issues, discovering their root causes, and understanding their downstream impacts.
Data engineers use lineage tools to conduct a root cause analysis that identifies where an issue appears in the data lifecycle. In addition, they can see any dependencies further downstream that the bad data could impact. By quickly identifying the source and impact of an issue, data lineage contributes to a quick resolution and minimizes data downtime.
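Both directions of this investigation amount to graph traversal over lineage metadata. The sketch below assumes lineage is available as a mapping from each dataset to the datasets built directly from it. The dataset names and the helper functions `reachable` and `invert` are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to its direct downstream consumers.
LINEAGE = {
    "crm_export": ["raw_customers"],
    "raw_customers": ["clean_customers"],
    "raw_orders": ["clean_orders"],
    "clean_customers": ["customer_orders"],
    "clean_orders": ["customer_orders", "revenue_dashboard"],
    "customer_orders": ["churn_model"],
}


def reachable(edges: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search for every node reachable from start."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Flip edge direction so a traversal walks upstream instead."""
    inverted: dict[str, list[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


# Suppose a quality issue is detected in clean_customers.
downstream = reachable(LINEAGE, "clean_customers")        # impacted products
upstream = reachable(invert(LINEAGE), "clean_customers")  # root-cause candidates
print(sorted(downstream), sorted(upstream))
```

The downstream set tells the team which data products to flag, while the upstream set narrows where to look for the root cause.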
Data integration depends upon extract, transform, and load (ETL) and extract, load, and transform (ELT) pipelines to connect data warehouses, data lakes, and other repositories with their sources. However, the proliferation of sources, repositories, and data products has turned pipeline maintenance into a significant burden for data management teams. As data reliability declines, consumers lose trust in their data products, undermining business decision-making.
Data observability platforms let engineers manage data health across hundreds or thousands of pipelines. With enhanced visibility and scalable automation, data teams can evaluate the quality of every pipeline’s inputs and outputs as well as the quality of transformations within the pipeline. Streamlining pipeline maintenance improves data reliability and reduces data downtime.
This fifth pillar of data observability comprises the data monitoring systems that enable rapid anomaly detection and mitigation. Profiling through machine learning models allows these systems to distinguish good data from bad. Automated notifications and data health dashboards surface issues and help data teams prioritize their responses.
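To make the idea of anomaly detection concrete, the sketch below flags a table whose latest row count deviates sharply from its recent history. It deliberately swaps in a simple z-score baseline for the machine learning models described above, and the function name, threshold, and sample counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag latest if it lies more than threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]  # any change from a constant history is unusual
    return abs(latest - mean(history)) / spread > threshold


# Hypothetical daily row counts for one table over the past week.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

normal_day = is_anomalous(daily_row_counts, 1008)
bad_day = is_anomalous(daily_row_counts, 400)
print(normal_day, bad_day)
```

A fixed z-score threshold ignores seasonality and trends, which is one reason observability systems turn to learned profiles instead.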
The term “data observability” is useful for understanding the parallels between this emerging DataOps technique and the more established DevOps practice of application observability. Developers use logs, metrics, and traces to monitor the state of applications and microservices in the modern data stack and resolve issues quickly. Although analogous, data observability is a distinct practice because it focuses on the quality and reliability of an enterprise’s data rather than on the state of its applications.
In addition, some traditional data management practices, such as data testing and pipeline monitoring, have similarities to data observability. Unlike data observability, however, these practices do not provide holistic, scalable methods for managing data quality.
As unit testing lets DevOps teams evaluate code, data testing lets DataOps teams assess the impacts data issues may have downstream. By its nature, data testing can only evaluate known issues. The practice cannot predict or identify unknown causes of poor data quality. Scope is another way testing differs from observability. Engineers typically conduct tests while developing a pipeline or a step within that pipeline. Their testing only partially assesses the potential impact of data issues within the context of the entire data ecosystem. Observability provides scalable, end-to-end visibility into data quality throughout the data lifecycle.
Another traditional data management practice with similarities to data observability is pipeline monitoring. This technique evaluates whether a pipeline performs its extract, load, and transform functions correctly. However, pipeline monitoring tools do not necessarily assess data quality throughout the pipeline. Even a perfectly functioning pipeline will output poor-quality results when given poor-quality inputs. Unlike this traditional approach, observability systems monitor the data entering, flowing through, and exiting pipelines for quality issues.
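The difference can be sketched by wrapping a pipeline step with quality checks on both its inputs and its outputs. In the example below, the step itself runs without error, so a monitor that only watches for failures would report success, yet the checks surface the bad data. The names `run_observed`, `no_nulls`, and `add_tax` are hypothetical, as is the tax rate.

```python
from typing import Callable

Row = dict
Check = Callable[[list[Row]], list[str]]


def no_nulls(column: str) -> Check:
    """Build a check that reports NULL values in the given column."""
    def check(rows: list[Row]) -> list[str]:
        missing = sum(1 for row in rows if row.get(column) is None)
        return [f"{missing} NULL value(s) in '{column}'"] if missing else []
    return check


def run_observed(
    step: Callable[[list[Row]], list[Row]],
    rows: list[Row],
    input_checks: list[Check],
    output_checks: list[Check],
) -> tuple[list[Row], list[str]]:
    """Run a pipeline step and collect quality issues seen before and after it."""
    issues = [f"input: {msg}" for check in input_checks for msg in check(rows)]
    result = step(rows)
    issues += [f"output: {msg}" for check in output_checks for msg in check(result)]
    return result, issues


def add_tax(rows: list[Row]) -> list[Row]:
    """A transform that never fails but passes NULL amounts straight through."""
    return [
        {**row, "total": None if row["amount"] is None else row["amount"] * 1.25}
        for row in rows
    ]


rows = [{"amount": 10.0}, {"amount": None}]
result, issues = run_observed(add_tax, rows, [no_nulls("amount")], [no_nulls("total")])
print(issues)
```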
Data has become the engine of modern business, which places a premium on data quality. Beyond the impact of chronically poor data quality on decision-making, each quality incident further disrupts the business.
Data downtime, the period between the occurrence of an issue and its resolution, can render data products unusable. Furthermore, downtime consumes a data management team’s limited resources, reducing its ability to support business objectives.
With data volumes and complexity increasingly outpacing traditional data management tools, enterprises have adopted more scalable observability practices to improve the quality of their data systems.
Data observability is an emerging practice with varied modes of implementation. Initiatives may adopt any combination of the following approaches.
Given the emerging nature of data observability practices, existing solutions may not meet the needs of large organizations with complex data architectures. Such companies may develop their own observability systems to monitor and address data quality issues. While this approach provides control over the observability system’s capabilities, it requires a long-term commitment to development and maintenance.
Third-party data observability tools may not provide complete end-to-end solutions to a company’s unique data quality needs. Integrating tools from multiple vendors can complement the company’s existing data infrastructure. However, system integration adds its own complexities and expenses.
Some data platforms and cloud services offer built-in observability features that work seamlessly within their solutions without additional development or system integration. These features may not be as comprehensive as those offered by specialized observability solutions providers, yet they may sufficiently enhance data quality with minimal effort.
Starburst Galaxy is a modern data lake analytics solution that federates disparate data sources within a virtual access layer. Starburst Gravity adds a management layer that streamlines discovery, access control, and governance. Gravity now includes data observability features that let data teams:
Gravity’s data profiling features generate statistics about the tables in Iceberg, Delta, and Hive data lakes. Engineers get one-click insights into data volumes, NULL and unique values, and data value ranges.
Gravity’s data quality criteria automate the evaluation of data profiling statistics. Data teams can define single or multi-column rules, and Gravity will generate alerts with each profiling event.
Gravity’s data lineage capabilities provide the tools engineers need to investigate these events. Visualizations of data flows within a data ecosystem map transformations leading up to and cascading downstream from a quality issue.
Individually, these features enhance data team productivity. Taken together, Gravity’s data observability capabilities empower holistic data quality monitoring and issue resolution workflows.