
Data Quality

Five years ago, Forbes and KPMG reported that 84% of CEOs were concerned about the quality of the data they used to make decisions. Today, data quality remains a top executive concern for those striving to build a data-first organization. The bottom line: with unreliable data, businesses will have a far more difficult time deriving business value from it. The truth is that while it’s easy to embrace the idea of being data-driven, it’s far more difficult to execute on that commitment.

Unfortunately, more often than not, when key stakeholders receive a report, they’ll find that:

  • data is missing from a critical report;
  • the data isn’t quite what they requested;
  • there’s duplicate data;
  • the numbers are inaccurate.

Let’s take a closer look at why data quality remains a top concern for many executives today.

What is data quality?

Data quality is the state of the data, reflected in its accuracy, completeness, reliability, relevance, and timeliness. Data quality underpins analytics dashboards, data products, and AI/ML models, which is why the concept of “garbage in, garbage out” is more important than ever: data-driven insights are only as good as the data behind them.

Experienced data executives and business leaders will only adopt data-driven solutions if they trust and have confidence in the data. What does this look like? They will likely review the results of their dashboards and reports to see whether they are consistent with other sources of data. With a close eye for detail, they are in essence manually validating the data.
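To make that manual cross-check repeatable, teams often automate the comparison. Below is a minimal Python sketch; the revenue figures, source names, and 1% tolerance are hypothetical stand-ins for whatever metric and systems you actually reconcile.

```python
# Minimal sketch: automate the manual cross-check described above by comparing
# the same metric from two sources. Figures and tolerance are hypothetical.
def consistent(reported_value: float, reference_value: float, tolerance: float = 0.01) -> bool:
    """Return True if two figures agree within a relative tolerance."""
    if reference_value == 0:
        return reported_value == 0
    return abs(reported_value - reference_value) / abs(reference_value) <= tolerance

# e.g. monthly revenue from the BI dashboard vs. the finance system of record
dashboard_revenue = 1_204_350.00   # hypothetical figure from the report
finance_revenue = 1_198_900.00     # hypothetical figure from the source system

if not consistent(dashboard_revenue, finance_revenue):
    print("Revenue figures diverge by more than 1%; investigate the pipeline.")
```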

What are the five characteristics of data quality?

Five characteristics are vital to data quality. These indicators help organizations determine whether the data is suitable to use or not (a short code sketch of how such checks might look follows the five characteristics below):

Accuracy: data that can be used as a reliable source of information

Quality data is reliable and accurate. Accurate data won’t be misinterpreted or riddled with errors, and it is delivered with enough context that users won’t mishandle it.

Completeness: data that is comprehensive enough to extract maximum value

Ensure that information is comprehensive, accessible, and available to employees. If data were incomplete or inaccurate, organizations would not be able to use it to its fullest potential.

Reliability: data that doesn’t contradict other sources, minimizing the margin of error

Data shouldn’t contradict other data sources. Conflicting or overlapping information creates inefficiencies or, worse, makes becoming data-driven far more difficult.

Relevance: data that is applicable to the organization’s needs and objectives

Understand the objective and the significance of collecting data. Collecting relevant information decreases costs and increases profitability for organizations, while irrelevant data only adds noise.

Timeliness: up-to-date data keeps data-driven reports and applications relevant

Maintaining updated data is critical for modern applications and reporting. Obsolete data makes data-driven reports untrustworthy and unreliable.
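As a rough illustration, here is a minimal Python sketch of how these five dimensions might be checked on a tabular dataset. The “orders” table, column names, and thresholds are hypothetical; real checks would follow your own schema and business rules.

```python
# Minimal sketch of checks for the five data quality dimensions on a small,
# hypothetical "orders" table.
from datetime import datetime, timedelta, timezone
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [120.0, -5.0, 80.0, None],
    "updated_at": pd.to_datetime(
        ["2024-05-01", "2024-05-02", "2024-05-02", "2023-01-01"], utc=True
    ),
})

issues = {}

# Completeness: required fields should not be null.
issues["incomplete_rows"] = int(orders["amount"].isna().sum())

# Accuracy: values should fall within a plausible range (amounts can't be negative).
issues["inaccurate_rows"] = int((orders["amount"] < 0).sum())

# Reliability / consistency: the same key should not appear with conflicting records.
issues["duplicate_keys"] = int(orders["order_id"].duplicated().sum())

# Timeliness: records should have been updated within the expected window.
stale_cutoff = datetime.now(timezone.utc) - timedelta(days=90)
issues["stale_rows"] = int((orders["updated_at"] < stale_cutoff).sum())

# Relevance is harder to automate; at minimum, flag columns nobody queries.

print(issues)  # e.g. {'incomplete_rows': 1, 'inaccurate_rows': 1, ...}
```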

Why is data quality important?

Data analysts, scientists, and engineers responsible for producing data-driven reports play an important role in data-driven organizations. Engineers build data pipelines, and analysts run queries to produce reports for business decision makers.

As such, they’re managing and ensuring data quality, trustworthiness, accessibility, and usability at every stage of the process, at scale: in complex data pipelines, before data ingestion, and during data analysis. Data quality powers many dependencies and even determines the health of your data lifecycle. Single-use data typically has a shorter lifecycle, as it’s marked stale more quickly and eventually archived or deleted. To prolong the life of data, increase its reusability; reuse is itself a sign of trustworthiness, confidence, and quality.

Current Data Management Trends Bring Renewed Interest To Data Quality

Yes, there are data quality tools that provide key functions such as data cleansing, parsing, standardizing, matching, profiling, enriching, and monitoring. What’s more exciting is that current data management trends can potentially shift the way we think about data quality.

Build a data lakehouse to avoid a data swamp

Because of the volume, velocity, and variety of data, data professionals shifted from data warehouses alone and added data lakes to their enterprise architecture. However, even though the data lake approach decoupled compute and storage, it often failed to deliver quality data (devolving into a data swamp) and the performance modern enterprises need to thrive in an uncertain economic climate.

To better integrate data warehouses and data lakes, companies are adopting the data lakehouse architecture, a modern data management option that promotes flexibility and cost-efficiency. The data lakehouse combines the benefits of a data warehouse with those of a data lake, allowing organizations to reduce costs while achieving higher efficiency and performance.

What’s more, from a data quality perspective, a data lakehouse architecture adds a warehouse layer on top of the data lake that enforces schema, which improves data quality and facilitates faster BI and reporting.
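As a rough illustration of schema enforcement, the sketch below validates incoming records against an expected schema before they are written. In a real lakehouse the table format itself enforces the schema at write time; the column names and the validate helper here are hypothetical.

```python
# Minimal sketch of schema-on-write enforcement. The EXPECTED_SCHEMA and the
# incoming "events" records are hypothetical.
EXPECTED_SCHEMA = {"event_id": int, "user_id": int, "event_type": str, "amount": float}

def validate(record: dict) -> list[str]:
    """Return a list of schema violations for one incoming record."""
    errors = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            errors.append(f"{column}: expected {expected_type.__name__}, "
                          f"got {type(record[column]).__name__}")
    return errors

incoming = [
    {"event_id": 1, "user_id": 42, "event_type": "purchase", "amount": 19.99},
    {"event_id": 2, "user_id": "42", "event_type": "refund"},  # bad type, missing column
]

accepted = [r for r in incoming if not validate(r)]
rejected = [(r, validate(r)) for r in incoming if validate(r)]
print(f"accepted {len(accepted)} rows, rejected {len(rejected)} rows")
```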

Data Mesh empowers owners with high quality data

Data Mesh is a decentralized sociotechnical approach to managing and accessing data at scale. Rather than relying on data silos, the organizational culture can shift toward data management supervised by domain owners, since they are closest to the business. Domain and data product owners oversee data quality, and as a result data quality is no longer an isolated, independent process; it is nurtured by domain owners and upvoted by other data consumers because they have confidence in the data product.

The aspect of Data Mesh that most directly supports data quality is the fourth pillar: federated computational governance. When shifting from a centralized to a decentralized architecture, data governance is imperative to maintain balance and meet governance requirements. Governance ensures that data is shared with users on a need-to-know basis, so that the data complies with required regulations while remaining high quality, reliable, and accessible to authorized end-users. Data Mesh distributes this responsibility through the federated governance model by applying global policies across domains. Computational governance is the automated enforcement that maintains the balance between interoperability and global standards, with each domain adhering to a shared set of practices and standards.
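As an illustration of what “computational” governance can mean in practice, the sketch below applies a few hypothetical global policies (an accountable owner, masked PII, a declared freshness SLA) to a domain’s data product before publication. The rules and metadata fields are assumptions, not a prescribed standard.

```python
# Minimal sketch of global policies applied to every domain's data product
# before publication. Policy rules, PII column names, and product metadata
# shown here are hypothetical.
GLOBAL_PII_COLUMNS = {"email", "ssn", "phone"}

def check_global_policies(data_product: dict) -> list[str]:
    """Return violations of organization-wide policies for one data product."""
    violations = []
    # Policy 1: every published product needs an accountable domain owner.
    if not data_product.get("owner"):
        violations.append("missing domain owner")
    # Policy 2: PII columns may only be exposed in masked form.
    exposed_pii = GLOBAL_PII_COLUMNS & set(data_product.get("columns", []))
    unmasked = exposed_pii - set(data_product.get("masked_columns", []))
    if unmasked:
        violations.append(f"unmasked PII columns: {sorted(unmasked)}")
    # Policy 3: a freshness SLA must be declared so consumers can judge timeliness.
    if "freshness_sla_hours" not in data_product:
        violations.append("no freshness SLA declared")
    return violations

# Example: one data product published by the hypothetical "sales" domain.
sales_orders = {
    "domain": "sales",
    "owner": "sales-data-team",
    "columns": ["order_id", "email", "amount"],
    "masked_columns": ["email"],
    "freshness_sla_hours": 24,
}
print(check_global_policies(sales_orders))  # [] means the product passes
```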
