Five years ago, Forbes and KPMG reported that 84% of CEOs were concerned about the quality of the data they used to make decisions. Today, data quality remains a top concern for executives striving to build a data-first organization. The bottom line: with unreliable data, businesses have a far more difficult time deriving business value from data. The truth is that while it’s easy to embrace the idea of being data-driven, it’s far more difficult to execute on that commitment.
Unfortunately, more often than not, when key stakeholders receive a report, they’ll spot an inconsistency and declare that the data can’t be trusted.
Let’s take a closer look at why data quality remains a top concern for many executives today.
Data quality is the state of the data, reflected in its accuracy, completeness, reliability, relevance, and timeliness. Quality data powers important analytics dashboards, data products, and AI/ML models, which is why the concept of “garbage in, garbage out” matters more than ever: data-driven insights are only as good as the data behind them.
Experienced data executives and business leaders will only adopt data-driven solutions if they trust and have confidence in the data. What does this look like in practice? They review the results of their dashboards and reports to see whether they are consistent with other sources of data. With a close eye for detail, they are, in essence, manually validating the data.
There are five characteristics that are vital to data quality: accuracy, completeness, consistency, relevance, and timeliness. These indicators help organizations understand whether the data is suitable to use:
Quality data is accurate and reliable. Accurate data won’t be misinterpreted or riddled with errors, and it is delivered comprehensively so that users won’t mishandle it.
Complete data is comprehensive, accessible, and available to employees. If data is incomplete or inaccurate, organizations cannot use the information to its fullest potential.
Consistent data doesn’t contradict other data sources. If information conflicts across sources, it creates inefficiencies or, worse, makes becoming data-driven far more difficult.
Relevant data serves a clear objective. Understand why the data is being collected: gathering relevant information decreases costs and increases profitability, while irrelevant data only adds noise.
Timely data is up to date. Maintaining current data is critical for modern applications and reporting; obsolete data makes data-driven reports untrustworthy and unreliable.
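To make these five characteristics concrete, here is a minimal sketch of how they might be checked on a tabular dataset with pandas. The file names, column names, thresholds, and the comparison against a finance extract are illustrative assumptions, not prescriptions from this article.

```python
import pandas as pd

# Illustrative dataset: column names and thresholds below are assumptions.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

checks = {
    # Accuracy: values fall within a plausible, agreed-upon range.
    "accuracy_amount_in_range": orders["amount"].between(0, 100_000).all(),
    # Completeness: required fields are populated.
    "completeness_no_missing_ids": orders["customer_id"].notna().all(),
    # Consistency: totals agree with another source (here, a finance extract).
    "consistency_matches_finance": (
        orders["amount"].sum()
        == pd.read_csv("finance_extract.csv")["amount"].sum()
    ),
    # Relevance: only rows in scope for the report (e.g., the current year).
    "relevance_current_year_only": (orders["order_date"].dt.year == 2023).all(),
    # Timeliness: the newest record is no older than one day.
    "timeliness_fresh_within_1_day": (
        pd.Timestamp.now() - orders["order_date"].max() <= pd.Timedelta(days=1)
    ),
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```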
Data analysts, scientists, and engineers responsible for producing data-driven reports play an important role in data-driven organizations. Engineers build data pipelines, and analysts run queries to produce reports for business decision makers.
As such, they manage and ensure data quality, trustworthiness, accessibility, and usability at every stage of the process, at scale: within complex data pipelines, before data ingestion, and during data analysis. Data quality powers many dependencies and even determines the health of your data lifecycle. Single-use data typically has a shorter lifecycle, as it’s marked as stale more quickly, which eventually leads to it being archived or deleted. To prolong the life of data, increase its reusability, which is itself a sign of trustworthiness, confidence, and quality.
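For example, a pipeline can fail fast when an incoming batch does not meet quality expectations, rather than ingesting bad data and letting it surface in a report. The function below is a hypothetical sketch of such a pre-ingestion gate; the file path, column names, and the downstream load step are assumptions for illustration.

```python
import pandas as pd

def validate_before_ingest(batch: pd.DataFrame) -> None:
    """Raise if the incoming batch fails basic quality expectations."""
    problems = []
    if batch.empty:
        problems.append("batch is empty")
    if batch["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if batch["amount"].isna().any():
        problems.append("missing amount values")
    if problems:
        # Failing here keeps bad data out of downstream dashboards and models.
        raise ValueError("quality gate failed: " + "; ".join(problems))

# Usage: run the gate as a pipeline step before loading into the warehouse.
batch = pd.read_parquet("incoming/orders_batch.parquet")  # hypothetical path
validate_before_ingest(batch)
# load_to_warehouse(batch)  # hypothetical downstream step
```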
Yes, there are data quality tools that can provide key functions such as data cleansing, parsing, standardizing, matching, profiling, enriching and monitoring. What’s more exciting is that there are current data management trends that can potentially shift the way we think about data quality.
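Profiling is one of the simpler of these functions to illustrate. The sketch below builds a lightweight per-column profile with pandas, assuming a hypothetical customers.csv input; dedicated tools produce far richer profiles, but the idea is the same.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# A lightweight data profile: per-column type, null rate, cardinality, range.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean().round(3),
    "distinct_values": df.nunique(),
    "min": df.min(numeric_only=True),
    "max": df.max(numeric_only=True),
})
print(profile)
```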
Because of the volume, velocity, and variety of data, data professionals moved beyond data warehouses and added data lakes to their enterprise architecture. But even though the data lake approach decoupled compute and storage, it often failed to deliver quality data (hence the “data swamp”) or the performance modern enterprises need to thrive in an uncertain economic climate.
To better integrate data warehouses and data lakes, companies are adopting the data lakehouse architecture, a modern data management option that promotes flexibility and cost-efficiency. The data lakehouse combines the benefits of a data warehouse with those of a data lake: organizations can reduce costs while achieving a higher level of efficiency and performance.
What’s more, from a data quality perspective, a data lakehouse architecture places a warehouse layer on top of the data lake that enforces schema, which protects data quality and enables faster BI and reporting.
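As a rough illustration of schema enforcement in a lakehouse table format, here is a minimal sketch using PySpark with Delta Lake. It assumes a Spark session already configured with the delta-spark package, and the table path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the delta-spark package.
spark = SparkSession.builder.appName("lakehouse-schema-enforcement").getOrCreate()

# Hypothetical table path in the lakehouse's curated layer.
orders_path = "/data/lakehouse/orders"

# The initial write defines the table schema: order_id, amount, order_date.
orders = spark.createDataFrame(
    [(1, 120.50, "2023-01-15"), (2, 89.99, "2023-01-16")],
    ["order_id", "amount", "order_date"],
)
orders.write.format("delta").mode("overwrite").save(orders_path)

# A later batch with a mismatched schema (the amount arrives as a string
# column named "amt") is rejected by Delta's schema enforcement instead of
# silently corrupting the table: the append raises an exception.
bad_batch = spark.createDataFrame(
    [(3, "not-a-number", "2023-01-17")],
    ["order_id", "amt", "order_date"],
)
try:
    bad_batch.write.format("delta").mode("append").save(orders_path)
except Exception as err:
    print(f"Write rejected by schema enforcement: {err}")
```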
Data Mesh is a decentralized, sociotechnical approach to managing and accessing data at scale. Rather than maintaining data silos, the organizational culture shifts toward data management supervised by domain owners, since they are closest to the business. Domain and data product owners oversee data quality, so data quality is no longer an isolated, independent process; it is nurtured by domain owners and upvoted by other data consumers because they have confidence in the data product.
The main aspect of Data Mesh that supports data quality is its fourth pillar: federated computational governance. When shifting from a centralized to a decentralized architecture, governance is imperative to keep its requirements in balance. Governance ensures that data is given to users on a need-to-know basis, so the data complies with required regulations while remaining high quality, reliable, and accessible to authorized end users. Data Mesh distributes this responsibility through the federated governance model by applying global policies across the domains. Computational governance is the enforcement mechanism that maintains the balance between interoperability and global standards, with each domain adhering to a shared set of practices and standards to apply governance.
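As a rough, hypothetical illustration of what “computational” governance can mean, the sketch below applies a global policy (required metadata fields, including PII tagging) to data products published by different domains. The policy fields and product records are invented for the example and are not part of any Data Mesh specification.

```python
# Hypothetical global policy applied uniformly across domains ("federated
# computational governance"): every published data product must declare an
# owner, a freshness SLA, and tag any PII columns.
GLOBAL_POLICY = {"required_fields": ["owner", "freshness_sla_hours", "pii_columns"]}

# Invented example data products from two domains.
data_products = [
    {"domain": "sales", "name": "orders_daily", "owner": "sales-team",
     "freshness_sla_hours": 24, "pii_columns": ["customer_email"]},
    {"domain": "marketing", "name": "campaign_clicks", "owner": "mkt-team"},
]

def check_product(product: dict) -> list[str]:
    """Return the list of global-policy violations for one data product."""
    return [f for f in GLOBAL_POLICY["required_fields"] if f not in product]

for product in data_products:
    violations = check_product(product)
    status = "compliant" if not violations else f"missing {violations}"
    print(f"{product['domain']}/{product['name']}: {status}")
```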