Data Classification

Data classification is a framework for organizing data in ways that improve data management, information security, and risk management.

The policies, processes, and roles in a data classification scheme rely on a system of tags based on categories like types of data, sensitivity, and regulatory requirements.

Why classify your data? What does data classification do?

Data classification systems help you understand all the data sitting in your storage ecosystem. For example, you can make safeguarding data more straightforward by classifying it by sensitivity level:

Public data: There’s little reason to control access to public information, but you might want to control its accessibility on websites or in reports.

Internal data: Some of the data a company generates isn’t particularly sensitive. If it reaches the public, the data won’t materially impact the business. All the same, you want to know what data ought to stay behind the firewall.

Sensitive data: Tighter security requirements apply to critical data that could damage the company if released. Examples include intellectual property, financial reports, and similar sensitive information.

Confidential data: Privacy compliance regulations will impose stiff penalties for data breaches that result in the release of personal data like social security numbers or medical records. More stringent security controls and other policies must prevent unauthorized access.

Restricted data: Access to some information must be tightly controlled and limited to a small group within the company. Some examples include the classified information defense contractors handle or the merger and acquisition documents an investment bank creates.

Applying these classifications lets you implement a security policy for each type rather than wasting resources by treating the sensitivity of all data equally.
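
As a rough illustration of one policy per level, the sketch below maps the five sensitivity levels above to a small policy record. The group names, the encryption flag, and the numeric ordering of the levels are assumptions made for the example, not part of any standard or product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    # Ordering from least to most restricted is an assumption for this sketch.
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


@dataclass(frozen=True)
class SecurityPolicy:
    encrypt_at_rest: bool
    allowed_groups: frozenset


# Hypothetical policy table; the groups and controls are illustrative only.
POLICIES = {
    Sensitivity.PUBLIC: SecurityPolicy(False, frozenset({"everyone"})),
    Sensitivity.INTERNAL: SecurityPolicy(False, frozenset({"employees"})),
    Sensitivity.SENSITIVE: SecurityPolicy(True, frozenset({"finance", "legal"})),
    Sensitivity.CONFIDENTIAL: SecurityPolicy(True, frozenset({"privacy_office"})),
    Sensitivity.RESTRICTED: SecurityPolicy(True, frozenset({"deal_team"})),
}


def can_read(user_groups, level):
    """Return True if any of the user's groups is allowed at this level."""
    return bool(POLICIES[level].allowed_groups & set(user_groups))
```

Keeping the policy in a single table means a control can be tightened for one level without touching how the others are handled.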

What is the difference between data classification and data protection?

Data protection is a set of processes for preventing the loss or corruption of data. Security, including data loss prevention (DLP), is one aspect of it. Besides controlling access, data protection also includes ensuring that data remains available and does not get lost in a vast storage infrastructure.

Data classification plays a role in data protection. The previous section explained how data sensitivity classifications can protect data from unauthorized access. Another classification method would manage availability by flagging how to allocate data between solid-state, disk-based, tape-based, and archival storage management systems.

How data classification and data discovery are related

Data discovery is the foundation of analytics. Consistent classification enriches metadata for structured and unstructured data alike, providing more ways to generate queries across multiple sources. Effective classification also keeps data accessible and avoids the perils of dark data and data swamps.

Benefits of data classification

An association with government secrecy makes security an obvious example of classification’s benefits. However, security is not the only area that gains from an effective classification system. Effectively categorizing data helps organizations manage data lifecycles and improves analytics while also mitigating data risks.

Data lifecycle management

Data may be the engine of business decision-making, but that doesn’t mean companies can get away with keeping every bit they generate. Instead, they must develop data lifecycle management policies governing data retention and destruction.

Classification systems apply appropriate flags that let data teams know how to store data and when data is redundant and safe to delete within the limits of legal, financial, and compliance requirements.
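
A minimal sketch of how a classification flag could drive a deletion decision is shown below. The tag names, the retention periods, and the `legal_hold` parameter are hypothetical; real retention periods come from your legal, financial, and compliance requirements.

```python
from datetime import date, timedelta

# Hypothetical retention periods per classification tag, in days.
RETENTION_DAYS = {
    "financial_record": 7 * 365,
    "web_log": 90,
}


def is_safe_to_delete(tag, created, today, legal_hold=False):
    """True when the retention period has elapsed and no legal hold applies."""
    if legal_hold:
        return False
    if tag not in RETENTION_DAYS:
        # Unclassified data is kept until someone classifies it.
        return False
    return today - created > timedelta(days=RETENTION_DAYS[tag])
```

Defaulting to "keep" for unknown tags is a deliberately conservative choice in this sketch: it trades storage cost for a lower risk of deleting data that should have been retained.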

Data management and analytics

Classifications are part of the metadata that data management teams use to support business analytics. Proper classification is particularly important for a data lake’s pool of unstructured data. Categorizing this data makes it easier to find, retrieve, and process. With faster access to the right data, analysts produce richer results that support more agile and effective decision-making.

Risk management

Of course, data risk management is the most widespread use of classification. Understanding the nature, location, and accessibility of all the data your company stores helps you:

  • Defend information systems from cybersecurity threats.
  • Protect the privacy of regulated personal information.
  • Improve resilience through robust recovery practices.
  • Ensure compliance with industry and government frameworks.

Data security classifications guide investments in network defenses and data access control policies. They support compliance initiatives, playing critical roles in meeting the requirements of frameworks like ISO/IEC 27001.

Data privacy classifications simplify the protection of personal information in your company’s possession, especially data protected by regulations. Organizations that manage protected health information (PHI) in the United States must meet Health Insurance Portability and Accountability Act (HIPAA) requirements. Companies that collect personally identifiable information (PII) about European Union residents must comply with the EU’s General Data Protection Regulation (GDPR).

Classification systems also reinforce resiliency programs through their roles in data lifecycle management. For example, data managers use classifications to understand the scale and scope of their data assets as they develop retention, backup, and recovery policies.

Challenges of data classification

Realizing these benefits takes time, planning, resources, and commitment. The main challenges include:

Sustained stakeholder support

Deciding how to classify data is a subjective exercise. For instance, compliance regulations do not dictate specific, one-size-fits-all actions. Companies must choose the most appropriate course to follow. As a result, classification initiatives require collaboration between legal, business, and technical teams to reach a consensus on the optimal classifications. Most importantly, senior leadership must support that consensus to drive enterprise-wide adoption.

Competing for attention

Everyone in the organization must understand the classification system as well as their roles and responsibilities. That means ensuring that other priorities can’t overrule policy enforcement. Data discovery becomes more difficult without proper classification, undermining the decision-making process. Poor execution also increases organizational risk with the rising chance of privacy or security compliance violations.

Impermanent classifications

Data and its classifications may change across the lifecycle. Information that was once highly sensitive can become old news. Regulations and compliance frameworks may change. New business strategies introduce new risks that may require new classifications. Within today’s dynamic business environment, companies must constantly assess how they’ve classified their data.

Resource allocation

Data teams need the people, skills, and resources to monitor classification schemes continuously. Without that support, data teams will struggle to keep classifications consistent and valuable. Data will become less secure and more expensive. Most importantly, data won’t be available to decision-makers when they need it.

How data classification works: types of data classification

The Federal Information Security Management Act (FISMA) led to a NIST classification framework for government agencies that is relevant to any organization. Commonly known as the CIA triad, this framework addresses data risk through three types of data classification: confidentiality, integrity, and availability. Classification levels within these types describe the impact of a security breach as low, moderate, or high.

Confidentiality refers to restricting access to information, including personal and proprietary data. An unauthorized data release is a failure of confidentiality.

Integrity refers to the control over changes to or deletion of data, plus how organizations ensure quality and authenticity. A ransomware attack is an example of data integrity failure.

Availability describes whether authorized users can access data quickly and reliably. An availability failure can be something as simple as a defective VPN gateway or as impactful as the loss of communications during a natural disaster.
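
The three types and three impact levels can be combined into a single rating. The sketch below uses the "high-water mark" approach, in which the overall level is the highest of the three individual ratings. The middle level is named `MODERATE` here, following NIST's wording; the class and function names are assumptions made for this example.

```python
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


def overall_impact(confidentiality, integrity, availability):
    """High-water mark: the overall level is the highest of the three."""
    return max(confidentiality, integrity, availability)
```

For example, a dataset rated low for confidentiality, high for integrity, and moderate for availability would be treated as high overall: `overall_impact(Impact.LOW, Impact.HIGH, Impact.MODERATE)` returns `Impact.HIGH`.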

The data classification process

An effective data classification process starts with an inclusive approach to planning. Stakeholders from multiple domains must participate under the oversight of senior management.

Shared principles and goals will drive the creation of understandable classification schemes. Consistent commitment from the top will drive acceptance, investment, and consistent execution, allowing the organization to realize classification’s benefits.

However, as with any data management initiative, the heaviest burden will fall on the data team. Manual processes won’t work at enterprise scale. Automation through data classification software can alleviate this pressure by taking on time-consuming, routine activities. In the next section, we’ll see what automated classification can look like.
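
To make the idea of automation concrete, here is a deliberately simple, rule-based sketch that samples column values and recommends a tag when most of them match a pattern. It is not how any particular product works: the pattern set, the tag names, and the `threshold` parameter are all assumptions, and real detectors need validation and context beyond regular expressions.

```python
import re

# Illustrative patterns only; these are assumptions, not a complete detector.
PATTERNS = {
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def recommend_tags(sample_values, threshold=0.8):
    """Recommend a tag when at least `threshold` of non-empty samples match."""
    values = [v.strip() for v in sample_values if v and v.strip()]
    if not values:
        return []
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            tags.append(tag)
    return tags
```

The function only returns recommendations. Leaving the final decision to a data steward keeps a human in the loop for borderline columns.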

Data classification policy with Starburst

Gravity is Starburst’s unified access and governance layer that lets you optimize data pipeline availability and reliability from a single interface. Classifications are a core element of Gravity’s Attribute-Based Access Control (ABAC) system. However, managing tags at scale can overwhelm data stewards and increases the risk that mis-tagged data degrades the quality of data analysis. Gravity uses AI models to recommend tags automatically, reducing manual effort as well as the potential for tagging errors.

Data stewards may run Starburst Galaxy data classifier jobs manually or automatically against an attached cluster. Galaxy applies a sample of the underlying data to its AI models to identify possible classifications and return tag recommendations for the steward to accept or reject.

You can apply tags at the Catalog, Schema, Table, or Column level to create granular ABAC policies at enterprise scale.
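
The sketch below illustrates the general idea of tag-driven, attribute-based access checks across those four levels. It is not Starburst's API: the `TAGS` store, the dotted path format, the assumption that tags are inherited from catalog down to column, and the rule that a user needs an attribute for every tag are all hypothetical choices made for the example.

```python
# Hypothetical tag store keyed by dotted object path; not Starburst's API.
TAGS = {
    "lake": {"internal"},               # catalog
    "lake.hr": {"hr"},                  # schema
    "lake.hr.employees.ssn": {"pii"},   # column
}


def effective_tags(path):
    """Collect tags from the object and every ancestor in its path."""
    parts = path.split(".")
    tags = set()
    for i in range(1, len(parts) + 1):
        tags |= TAGS.get(".".join(parts[:i]), set())
    return tags


def is_allowed(user_attributes, path):
    """Allow access only if the user holds an attribute for every tag."""
    return effective_tags(path) <= set(user_attributes)
```

Under these assumptions, a user with the attributes `internal` and `hr` could read `lake.hr.employees.name` but not `lake.hr.employees.ssn`, because the column-level `pii` tag adds a requirement the user does not meet.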
