Data & More data classification

Data & More classification automatically identifies different types of critical data.

Data & More: Data Classification Categories

Personally Identifiable Information (PII)
At Data & More, we’ve developed a comprehensive and granular approach to classifying Personally Identifiable Information (PII). Our system is grounded in a deep understanding of GDPR and other global privacy regulations, enabling us to accurately analyze and categorize data across various contexts.

We’ve broken down PII into hundreds of distinct, generic types, each representing a unique category of personal data. These subcategories allow for detailed analysis and recognition of PII across countries and languages. Each country and language introduces its own complexity, including national IDs, official documents, certificates, and country-specific entities such as churches, political parties, and unions. To accommodate this, our classification system maps thousands of unique country- and language-specific PII categories.

Our classification models have been rigorously validated through the analysis of billions of files, images, and other data types. Importantly, we’ve developed and trained specialized language models—distinct from large language models (LLMs)—to identify and categorize PII with precision and scale.

Additionally, our system is continuously improved through feedback from hundreds of thousands of users. When users identify and report misclassifications, their input helps us refine and enhance our classification accuracy over time.

(See the list of high-level document classes below.)

Critical Security Information

At Data & More, we’ve also developed a comprehensive classification system for identifying critical security information. This system is designed to safeguard sensitive operational and technical data that, if compromised, could pose significant risks to organizational security. It includes categories such as passwords and secrets, which cover user access credentials and encryption keys for both human and machine communications; source code that may inadvertently expose secrets or vulnerabilities; log files from applications or servers; and infrastructure configuration files, including automation scripts like Ansible. Furthermore, our classification extends to vulnerability assessments, which encompass documents detailing security evaluations, CVE vulnerability analyses, and penetration testing results. This approach ensures that critical security information is identified, categorized, and protected with the same precision as PII data.

(See the list of high-level critical security document classes below.)

Here is an overview of the high-level privacy document classes:

Name

Description of Personally Identifiable Information (PII) document class

Payment card

Data containing information about a person's credit card.

The search uses algorithms that look specifically for the number logic that characterizes credit card numbers.

Misc. ID

Various data for personal identification.

Driver's license

Data from driving licenses that can be attributed to one or more people.

Algorithms search for the unique codes that appear on driving licenses.

In addition, the search looks for words that are printed on driving licenses and checks whether the data contains a picture of a person.

Ethnic orientation

Scanned data that contains information about the ethnic orientation of one or more persons.

The search covers all known ethnic orientations, as well as statements that a person comes from a certain country.

Grant application

Personally identifiable data that appears in applications to foundations for financial support.

Health card

Health cards found in the scanned data, such as the health insurance card and the blue EU health insurance card.

For a match, a social security number must appear and the file must be an image file.

Health cards are primarily found using OCR scanning.

Health info

Data that provide information about the health of one or more people, such as sick leave. General coronavirus information, safety data sheets, newsletters, internal manuals, and similar material are excluded from the search.

The search looks for phrases that clearly indicate sick leave, a specific diagnosis, a visit to a general practitioner or similar, or medication.

A match requires both a data subject and specific health information.

Union membership

Data containing information about one or more persons' membership of a trade union.
A search is made for all existing trade unions in your country.

National ID number

National ID numbers that appear in the scanned data.

Algorithms search for numbers that meet the criteria for being a real national ID number, together with keywords such as "personal identification number" and similar.

In addition to searching emails and chats, many national ID numbers are found using OCR scanning, e.g. of image and PDF files.

National ID card

Data from ID cards belonging to identifiable persons from different countries.

Passport

Passports found in the scanned data. To be placed in this category, the data must contain an image of a person and the unique country codes that appear on passports. We also search for individual passport numbers, which may, for example, appear in email correspondence between multiple people.

Political orientation

Scanned data that contains information about one or more persons' membership of a political party or their political conviction.

All existing political parties in your country are searched for.

Recruitment

Personal data that appear in solicited or unsolicited applications, as well as in CVs.

This document class also contains rejections of job applications. The search looks for phrases that are unique to job applications, whether solicited or unsolicited.

Religious orientation

Scanned data that contains information about a person's religious orientation. The search covers all known religious orientations and membership of state-recognised churches.

Salary / financial info

Data that contain information about a person's salary, for example payslips and fee statements. Information about bonus schemes is also included. To find data in this category, the search looks for combinations of words that only appear on payslips. It also looks for phrases used when stating the salary of one or more people, such as a person's monthly salary.

Sexual orientation

Scanned data that contains information about the sexual orientation of one or more persons. All existing sexual orientations are searched for.

Tax info

Data that contain a person's tax information, especially in the form of annual statements. PDFs are specifically searched, as annual statements, for example, typically appear in this file format.

Employee termination

Data with information about the termination of an employee's employment within an organization, including resignations, departures, layoffs, and more. The search includes titles that indicate resignations and words that are particularly relevant to resignations.

Employment info

Data from employment agreements between employee and employer, whether the terms are described in documents or in written communication. The search looks for a large collection of word combinations and phrases that are unique to employment agreements between an employee and an employer. Contracts that do not relate to employment, e.g. business leases, are excluded.

Travel info

Data that contain information about a person's travels at specific times, such as hotel, airline, and restaurant bookings.

Employee warning

Data concerning internal warnings to one or more individuals due to actions that violate the specific organization's guidelines.

Wills

Personal data regarding the wills of one or more persons.

Personal certificates

Each country issues a variety of official documents for purposes such as naming, marriage, birth, partnerships, and more. These documents are unique in both their type and name, serving as essential legal records for individuals within each respective nation.

These documents are found by searching for content unique to these certificates.

Education info

Educational diplomas, exam certificates, certificates, and other data that provide information about the education of one or more individuals.

Work absence

Data concerning cases where an employee fails to work on scheduled days.

Referral consent

Data related to personal consent, where a person gives consent for their personal information to be shared with an organization or similar.

Insurance info

Data from insurance documents that describe how one or more persons are insured, such as home insurance policies and accident insurance policies.

Location

Data on the place of residence of a person.

Criminal behavior

Data containing information about the criminal behavior of a person, or police reports.
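
The Payment card class above mentions algorithms that look for the number logic characteristic of credit cards. One widely used check of this kind is the Luhn checksum, which payment card numbers are generally built to satisfy. The Python sketch below is an illustration only and is not Data & More's actual implementation; the names CARD_CANDIDATE, luhn_valid, and find_card_numbers, as well as the candidate pattern itself, are assumptions made for this example.

```python
import re

# Hypothetical candidate pattern (an assumption for this sketch):
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return candidate substrings of `text` that pass the Luhn check."""
    return [
        match.group()
        for match in CARD_CANDIDATE.finditer(text)
        if luhn_valid(match.group())
    ]
```

For example, find_card_numbers("Card: 4111 1111 1111 1111") returns the commonly used test number, while an arbitrary 16-digit order number will usually fail the checksum and be ignored. A checksum alone still produces false positives, so a real classifier would combine it with context such as nearby keywords.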

Here is an overview of the high-level critical security information document classes:

Name

Description of Critical Security Information document class

Passwords & Secrets

Passwords and login information for end-user access to systems, as well as keys used for encryption of communication and for machine-to-machine communication.

Source code

Data that expose secrets and other information that can potentially help malicious actors get access to systems and data.

Log files

Log files from application systems or servers.

Infrastructure config

Various infrastructure configuration information, including infrastructure automation such as Ansible scripts.

Vulnerability Assessments

Documents assessing the security of infrastructure and applications, including assessments of CVE vulnerabilities and results from penetration testing.
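
To illustrate how content in the Passwords & Secrets class could be recognized, the Python sketch below scans text line by line with a few regular expressions. It is a simplified assumption and not Data & More's actual rule set; the SECRET_PATTERNS table and the find_secrets function are hypothetical names, and the three patterns are examples only.

```python
import re

# Hypothetical, illustrative patterns (assumptions for this sketch).
SECRET_PATTERNS = {
    # PEM header of a private key block
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    # Shape of an AWS access key ID: "AKIA" followed by 16 uppercase letters or digits
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # key = value or key: value assignments with a password-like key name
    "password_assignment": re.compile(
        r"(?i)(?:password|passwd|pwd|secret)\s*[:=]\s*\S+"
    ),
}


def find_secrets(text: str) -> list:
    """Return (line_number, label) pairs for lines that match a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings
```

Calling find_secrets on the contents of a configuration file or a source file returns the line numbers that deserve a closer look. Pattern matching of this kind is only a first pass; reducing false positives usually requires additional context.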