Data & More · NLP Engineering · github.com/dataandmore/ai-profiler

Know exactly what personal data is hiding in the files.

A multilingual NLP engine that scans documents already indexed in Elasticsearch, detects names, places and dates, then classifies sensitive language into GDPR special categories and writes every finding back onto the document. The data never leaves the customer's own search cluster.

Poll, profile, persist. Repeat.

15+ dedicated language models

3 detection algorithms

9 sensitive categories

1k task queue depth

Section 01 · Overview

One engine, two front doors

The same detection core (handle_doc) powers a high-volume background pipeline and a live request endpoint. Batch work flows left to right through the scheduler, a bounded queue and a pool of spaCy workers; the realtime endpoint feeds a single piece of text straight into the core and returns JSON.

[Diagram legend: batch data path, realtime endpoint, worker / result, detection core]

Section 02 · The job

Find the sensitive data, label it, hand it back

The Profiler reads documents already indexed in Elasticsearch. For each one it detects names, places and dates, then looks for language that reveals special categories of personal data: health, religion, politics, sexual orientation, ethnicity, criminal history, union membership, and employment actions such as terminations and warnings. Every match is tagged with a type, the matched text and its position, then saved back onto the document, so compliance work starts from facts rather than guesswork.

Section 03 · Two ways in

A pipeline and an endpoint

Mode A

Scheduled pipeline

A background thread continuously polls Elasticsearch for documents flagged DS_Status: REQUEST_AI and feeds them to a pool of worker processes. This is how bulk archives get profiled.

Mode B

REST endpoint

A POST /profile-text route profiles a single piece of text on demand, returning colour-coded matches as JSON. A /health route reports liveness.
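
The request handling behind such a route might look roughly like the sketch below. It is written as plain functions so it runs without a web framework; in the service these would presumably sit behind Flask routes. The request field text, the response keys, the colour map and the detect stand-in are all assumptions made for illustration, not the documented API.

```python
import json

# Hypothetical colour map; the real palette is not documented here.
COLOURS = {"S_PER": "#1f77b4", "S_DATE": "#2ca02c"}
DEFAULT_COLOUR = "#7f7f7f"


def detect(text):
    """Stand-in for the detection core: flags the literal word 'Alice' only."""
    findings = []
    start = text.find("Alice")
    while start != -1:
        findings.append(
            {"type": "S_PER", "text": "Alice", "start": start, "end": start + 5}
        )
        start = text.find("Alice", start + 1)
    return findings


def profile_text(body: bytes):
    """Handle a POST /profile-text body; return (status_code, json_string)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers JSONDecodeError and invalid UTF-8
        return 400, json.dumps({"error": "body must be JSON"})
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return 400, json.dumps({"error": "field 'text' is required"})
    # Attach a display colour to each finding based on its label.
    matches = [
        dict(f, colour=COLOURS.get(f["type"], DEFAULT_COLOUR)) for f in detect(text)
    ]
    return 200, json.dumps({"matches": matches})


def health():
    """Liveness probe for GET /health."""
    return 200, json.dumps({"status": "ok"})
```

Calling profile_text(b'{"text": "Alice met Bob"}') returns status 200 with one S_PER match; a body without a usable text field returns 400.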

Section 04 · The batch pipeline

From flagged document to finished profile

On startup the app waits for the Elasticsearch cluster to turn healthy, then launches the scheduler and the worker pool. The cycle below repeats forever.

01 Scheduler queries Elasticsearch. Every cycle it scans the data index for documents where DS_Status = REQUEST_AI, paging 50 at a time with search-after.
02 Each document becomes a task. Documents are wrapped as tasks and pushed onto a shared multiprocessing queue. If the queue fills, the scheduler pauses, providing natural back-pressure so memory stays bounded.
03 Workers pick up tasks. A pool of spaCy worker processes (2 by default) pulls tasks in parallel, runs the detection core and merges any pre-existing labels already on the document.
04 Detection core runs. The text is screened, language-routed and passed through up to three detection algorithms. This is handle_doc, expanded below.
05 Results written back. Findings are de-duplicated and bulk-updated onto the document, and the status is set to FINISHED. Even on error the status flips to FINISHED, so nothing is reprocessed endlessly.
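
Step 01 can be sketched as a search-after loop. The version below builds the request body and walks pages through an injected search callable, so it runs without a cluster. The sort field doc_id is a hypothetical unique keyword field, and the term query assumes DS_Status is mapped as a keyword; neither detail is confirmed by the source.

```python
PAGE_SIZE = 50  # page size named in the text


def build_query(search_after=None):
    """Build an Elasticsearch search body for documents awaiting profiling."""
    body = {
        "size": PAGE_SIZE,
        "query": {"term": {"DS_Status": "REQUEST_AI"}},
        # search_after needs a deterministic sort; "doc_id" is a hypothetical field.
        "sort": [{"doc_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iter_flagged(search):
    """Yield hits page by page.

    `search` is any callable(body) -> Elasticsearch-style response dict,
    for example a thin wrapper around a client's search method.
    """
    cursor = None
    while True:
        hits = search(build_query(cursor))["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit seed the next page.
        cursor = hits[-1]["sort"]
```

Because each page starts from the sort values of the previous page's last hit, the loop avoids the deep-paging cost of from/size offsets on large archives.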

Written back

Per document

DS_EntryType_Count
DS_EntryType_List
DS_EntryType_Values
DS_EntryType_Index
DS_ProfiledAt
DS_Status = FINISHED
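
As a rough illustration of the write-back step, the sketch below de-duplicates findings and assembles a partial update using the field names listed above. The shape of each field (a count, a list of distinct types, parallel lists of matched texts and offsets) is a guess made for this example; the source names the fields but does not define their contents.

```python
from datetime import datetime, timezone


def build_update(findings, now=None):
    """De-duplicate findings and build the partial document to write back.

    Each finding is a dict with "type", "text" and "start" keys
    (a hypothetical shape chosen for this sketch).
    """
    seen = set()
    unique = []
    for f in findings:
        key = (f["type"], f["text"], f["start"])
        if key not in seen:  # keep the first occurrence only
            seen.add(key)
            unique.append(f)
    now = now or datetime.now(timezone.utc)
    return {
        # Field contents below are assumptions, not the documented schema.
        "DS_EntryType_Count": len(unique),
        "DS_EntryType_List": sorted({f["type"] for f in unique}),
        "DS_EntryType_Values": [f["text"] for f in unique],
        "DS_EntryType_Index": [f["start"] for f in unique],
        "DS_ProfiledAt": now.isoformat(),
        "DS_Status": "FINISHED",
    }
```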

Back-pressure on a 1000-deep queue keeps the whole pipeline within a fixed memory envelope, no matter how large the archive.
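
The back-pressure idea can be shown with a bounded queue whose put call blocks when the queue is full. To keep the example deterministic, this sketch uses threads and queue.Queue as a stand-in for the multiprocessing queue and worker processes described above; the task shape and the worker body are placeholders.

```python
import queue
import threading

QUEUE_DEPTH = 1000  # depth named in the text
NUM_WORKERS = 2     # default worker count named in the text
STOP = object()     # sentinel telling a worker to exit


def worker(tasks, results):
    """Pull tasks until the sentinel arrives."""
    while True:
        task = tasks.get()
        if task is STOP:
            break
        # Stand-in for running the detection core on the task.
        results.append(task["id"])


def run(doc_ids, depth=QUEUE_DEPTH, workers=NUM_WORKERS):
    """Feed tasks through a bounded queue and return the processed ids."""
    tasks = queue.Queue(maxsize=depth)
    results = []
    pool = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(workers)
    ]
    for t in pool:
        t.start()
    for doc_id in doc_ids:
        # put() blocks while the queue is full: this is the back-pressure.
        tasks.put({"id": doc_id})
    for _ in pool:
        tasks.put(STOP)  # one sentinel per worker
    for t in pool:
        t.join()
    return results
```

However many ids are fed in, at most depth tasks wait in the queue at once, which is what bounds memory.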

Section 05 · Inside the core

Screen, route, then match

Before any model runs, handle_doc filters out work it should not do, then picks the right tool for the language. Only after that do the matchers fire.

Step 1

Decide what to process

Only whitelisted document types are profiled (doc, docx, pdf, eml, txt and more). Structured formats like json, xml and log are skipped. Text longer than 20,000 characters is flagged and truncated, and unknown-language content is dropped.
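
A minimal sketch of this screening step, assuming the document type arrives as a file extension and an undetected language is represented as None. The whitelist below holds only the types named in the text; the real list is longer.

```python
MAX_CHARS = 20_000  # truncation limit named in the text

# Only the types named in the text; the real whitelist contains more.
ALLOWED_TYPES = {"doc", "docx", "pdf", "eml", "txt"}


def screen(doc_type, text, language):
    """Return (text_to_profile, truncated) or None if the document is skipped.

    `language` is None when detection failed (an assumed convention).
    """
    if doc_type.lower() not in ALLOWED_TYPES:
        return None  # json, xml, log and anything else not whitelisted
    if language is None:
        return None  # unknown-language content is dropped
    if len(text) > MAX_CHARS:
        return text[:MAX_CHARS], True  # flagged and truncated
    return text, False
```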

Step 2

Pick the language strategy

If a language has grammatical dependency patterns, the whole text is analysed at once. Otherwise the text is split into sentences and each is analysed in turn, with the multilingual model as a universal fallback.
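
The routing decision could be sketched as follows. The language sets are hypothetical subsets chosen for the example, and the regex splitter is a naive stand-in for the universal sentence splitter.

```python
import re

# Hypothetical subsets; the real service ships 15+ dedicated models.
DEDICATED_MODELS = {"en", "da", "de"}
DEPENDENCY_LANGS = {"en", "da"}  # languages assumed to have grammar patterns
FALLBACK_MODEL = "multilingual"


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def plan(language, text):
    """Return (model_name, chunks_to_analyse) for a document."""
    model = language if language in DEDICATED_MODELS else FALLBACK_MODEL
    if language in DEPENDENCY_LANGS:
        return model, [text]  # whole text analysed at once
    return model, split_sentences(text)  # sentence by sentence
```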

Section 06 · Detection algorithms

Three lenses on the same text

Each finding carries a prefixed label so downstream systems know how it was found. Up to three lenses run over the same text and their results are merged.

01 · NER

Named entities

spaCy's statistical model picks out people, organisations, places, dates and times. Names of the right shape (two to three distinct alphabetic words, no digits) are promoted to full-name findings.

labels: S_PER, S_PER_FULL, S_ORG, S_GPE, S_LOC, S_DATE
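
The full-name rule can be read as a small predicate. This sketch takes the description literally: two to three whitespace-separated words, all alphabetic, all distinct. Under that strict reading, hyphenated or apostrophised names such as "Anne-Marie" fail the check; whether the real implementation treats them differently is not stated in the source.

```python
def is_full_name(span_text):
    """True for two to three distinct, purely alphabetic words."""
    words = span_text.split()
    if not 2 <= len(words) <= 3:
        return False
    # str.isalpha() rejects digits and punctuation but accepts letters like 'ø'.
    if not all(w.isalpha() for w in words):
        return False
    # Words must be distinct, compared case-insensitively.
    return len({w.casefold() for w in words}) == len(words)


def label_person(span_text):
    """Promote a person entity to S_PER_FULL when it has full-name shape."""
    return "S_PER_FULL" if is_full_name(span_text) else "S_PER"
```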

02 · Keywords

Phrase matching

A lemma-aware phrase matcher scans for terms from per-customer dictionaries of sensitive vocabulary, so it catches inflected forms rather than only exact strings.

labels: S_K_<TYPE>
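
To show why lemma-aware matching catches inflected forms, here is a toy matcher in plain Python. It does not use spaCy's phrase matcher: a hand-written lemma table stands in for the model's lemmatizer, and the dictionary entries are invented for the example.

```python
import re

# Toy lemma table standing in for a real lemmatizer.
LEMMAS = {"diagnosed": "diagnose", "diagnoses": "diagnose"}

# Hypothetical dictionary: type -> phrases, each phrase a tuple of lemmas.
DICTIONARY = {
    "HEALTH": [("diagnose",), ("blood", "pressure")],
    "UNION": [("trade", "union")],
}


def tokenize(text):
    """Return (token, start, end) triples for each word."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def match_keywords(text):
    """Find dictionary phrases by comparing lemma sequences."""
    tokens = tokenize(text)
    lemmas = [LEMMAS.get(t.lower(), t.lower()) for t, _, _ in tokens]
    findings = []
    for type_, phrases in DICTIONARY.items():
        for phrase in phrases:
            n = len(phrase)
            for i in range(len(lemmas) - n + 1):
                if tuple(lemmas[i:i + n]) == phrase:
                    start, end = tokens[i][1], tokens[i + n - 1][2]
                    findings.append({
                        "type": f"S_K_{type_}",
                        "text": text[start:end],
                        "start": start,
                        "end": end,
                    })
    return findings
```

The dictionary holds only the lemma "diagnose", yet the surface form "diagnosed" in a sentence still matches, which an exact-string search would miss.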

03 · Grammar

Dependency matching

Grammatical patterns confirm a real claim: a person or allowed pronoun is the subject, and a sensitive keyword is the object. This cuts false positives by demanding context, not just a word.

labels: S_S_<TYPE>
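
The subject-plus-object rule can be illustrated on hand-annotated tokens. This sketch does not call spaCy's dependency matcher; the Token class, the pronoun list and the keyword table are hypothetical, and the dependency labels follow common conventions (nsubj for subject, dobj or obj for object).

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str            # dependency label relative to the head
    head: int           # index of the head token in the sentence
    ent_type: str = ""  # entity type, e.g. "PERSON"


ALLOWED_PRONOUNS = {"he", "she", "they"}  # hypothetical list
SENSITIVE = {"diabetes": "HEALTH"}        # hypothetical keyword table


def match_claims(tokens):
    """Find sensitive objects whose verb has a person or pronoun subject."""
    findings = []
    for tok in tokens:
        type_ = SENSITIVE.get(tok.text.lower())
        if type_ is None or tok.dep not in {"dobj", "obj"}:
            continue
        verb = tok.head
        for subj in tokens:
            is_subject = subj.head == verb and subj.dep == "nsubj"
            is_person = (
                subj.ent_type == "PERSON"
                or subj.text.lower() in ALLOWED_PRONOUNS
            )
            if is_subject and is_person:
                findings.append({
                    "type": f"S_S_{type_}",
                    "text": tok.text,
                    "subject": subj.text,
                })
    return findings
```

With these tables, "Anna has diabetes" yields an S_S_HEALTH finding, while "The leaflet describes diabetes" yields none because its subject is neither a person nor an allowed pronoun.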

Section 07 · What it detects

Special categories and identifiers

The exact categories are driven by per-customer dictionaries loaded from Elasticsearch, so each subscription can enable only the types it cares about.

GDPR special categories

Sensitive personal data

Health & symptoms, Religious orientation, Political orientation, Sexual orientation, Ethnic origin, Criminal behaviour, Union membership, Employee termination, Employee warning

Named entities

Personal identifiers

Person, Full name, Organisation, Location, Place / GPE, Date, Time, Nationality / group

Section 08 · Reach

Built for scale and reach

15+ language models

3 detection methods

1k task queue depth

9 sensitive categories

Dedicated models ship for English, Danish, German, Dutch, French, Italian, Spanish, Swedish, Norwegian, Finnish, Polish, Portuguese, Lithuanian, Croatian (also serving Serbian, Bosnian and Montenegrin) and Ukrainian, with a multilingual model covering everything else and a universal sentence splitter underneath.

Section 09 · Under the hood

The stack

Web layer	Flask + Gunicorn :8000
NLP engine	spaCy 3.8 + Torch (CPU)
Concurrency	multiprocessing pool
Data store	Elasticsearch 8.11
Dictionaries	per-company, via IAM service
Packaging	Docker, python:3.12
Resilience	process recycle every 12h
Noise control	false-positive suppression lists

In one line

Poll, profile, persist. Repeat.

The AI Profiler turns a quiet archive of documents into a labelled map of personal data, all inside the customer's own search cluster, so compliance work starts from facts rather than guesswork.

Repository: github.com/dataandmore/ai-profiler. Document set 2026-05-27. Component descriptions reflect the standard behaviour of the named libraries (spaCy, Flask, Elasticsearch) and the documented design of the Profiler service. Where the implementation differs in detail, edit this document to match the source.