Today, many organizations spend most of their time and energy building the perfect data infrastructure. Engineering and analytics teams invest their resources in the best data warehouse, the best data lake, the best BI tools, and so on, but often forget to ensure the data itself provides the accurate insights it should.
Unfortunately, data can’t always be trusted. Periods when your data is wrong, inaccurate, or otherwise erroneous have come to be known as “data downtime.” Data downtime is certainly not new, but it has become a bigger challenge as companies ingest more data, build increasingly complex pipelines, and accrue technical debt, creating a snowball effect of bad data that could have been avoided had it been caught sooner.
This unfortunate pitfall has made it clear that testing your pipelines alone is not enough to prevent data downtime. As data becomes an increasingly important part of decision making and product development, data trust and reliability should be more of a priority for engineers and analysts, especially when embarking on a journey towards a sound Data Mesh architecture.
As I shared in my talk at Datanova, to have full trust in your data, and in your Data Mesh, you must test the health of the data itself through data observability.
Data observability, an organization’s ability to fully understand the health of the data in their system, works by applying DevOps Observability best practices to eliminate data downtime. With automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues, data observability leads to healthier data pipelines, more productive data teams, and, most importantly, happier data consumers.
The pillars of data observability
Data observability can be broken down into five pillars that will help you screen the health of your data:
- Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated. Freshness is particularly important when it comes to decision-making; after all, stale data is basically synonymous with wasted time and money.
- Distribution: Distribution, a function of your data’s possible values, tells you whether your data falls within an accepted range. Data distribution gives you insight into whether your tables can be trusted based on what can be expected from your data.
- Volume: Volume refers to the completeness of your data tables and offers insights into the health of your data sources. If 200 million rows suddenly turn into 5 million, you should know.
- Schema: Changes to the organization of your data, its schema, often indicate broken data. Monitoring who makes changes to these tables, and when, is foundational to understanding the health of your data ecosystem.
- Lineage: When data breaks, the first question is always “where?” Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it. Good lineage also collects information about the data (also referred to as metadata) that speaks to governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.
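To make the first few pillars concrete, here is a minimal sketch of the kind of automated checks they imply. The table names, thresholds, and metadata layout below are illustrative assumptions, not part of any specific observability tool; in practice these signals would come from your warehouse's information schema and query logs rather than a hand-built dictionary.

```python
from datetime import datetime, timedelta

# Illustrative snapshot of table metadata (assumed structure, for the sketch only).
TABLE_METADATA = {
    "orders": {
        "last_updated": datetime(2021, 6, 1, 2, 0),
        "row_count": 5_000_000,
        "schema": {"order_id": "INT", "amount": "FLOAT"},
    },
}

def check_freshness(meta, now, max_staleness=timedelta(hours=24)):
    """Freshness: flag tables not updated within the expected cadence."""
    return now - meta["last_updated"] > max_staleness

def check_volume(meta, previous_row_count, max_drop_pct=0.5):
    """Volume: flag a sudden drop in row count (e.g. 200 million rows becoming 5 million)."""
    if previous_row_count == 0:
        return False
    drop = (previous_row_count - meta["row_count"]) / previous_row_count
    return drop > max_drop_pct

def check_distribution(values, low, high, max_outlier_pct=0.01):
    """Distribution: flag when too many values fall outside the accepted range."""
    if not values:
        return False
    outliers = sum(1 for v in values if not (low <= v <= high))
    return outliers / len(values) > max_outlier_pct

def check_schema(meta, expected_schema):
    """Schema: flag added, removed, or retyped columns."""
    return meta["schema"] != expected_schema
```

A real deployment would run checks like these on a schedule, learn the thresholds from historical behavior instead of hard-coding them, and route failures into alerting and triage, but the shape of the signals is the same.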
But that’s just the beginning of the process: data observability is more than identifying the problem; it’s a lifecycle. The next step is to take action and resolve what’s missing or incorrect, followed by implementing prevention tactics to keep your data reliable going forward.
How data observability plays a role in your data mesh initiative
Data observability should play a major role in your Data Mesh initiative, particularly when handling domains within the Data Mesh infrastructure. You might wonder who should own the process of data observability within your organization; I believe placing that responsibility on the domain owners is how you ensure the highest rate of effectiveness. Your marketing domain leaders are already laser-focused on their own data and needs, not on what’s going on in the finance or sales domains, which makes trust in their own data their top priority. Additionally, the level of trust your team has in data from one domain may look totally different from another, so it’s important that this process stays domain-specific to accommodate those differences in standards.
At the end of the day, it is all about early detection, early resolution, and early prevention of data issues, all in the context of generating trust so that both producers and consumers can adopt that data and truly become data-driven. By applying this approach, data teams can better collaborate to identify, resolve, and prevent data quality issues from occurring in the first place, and truly optimize data to its full potential.