Data & More · NLP Engineering · github.com/dataandmore/ai-profiler

Know exactly what personal data is hiding in the files.

A multilingual NLP engine that scans documents already indexed in Elasticsearch, detects names, places and dates, then classifies sensitive language into GDPR special categories and writes every finding back onto the document. The data never leaves the customer's own search cluster.

Poll, profile, persist. Repeat.

15+ dedicated language models
3 detection algorithms
9 sensitive categories
1k task queue depth
Section 01 · Overview

One engine, two front doors

The same detection core (handle_doc) powers a high-volume background pipeline and a live request endpoint. Batch work flows left to right through the scheduler, a bounded queue and a pool of spaCy workers; the realtime endpoint feeds a single piece of text straight into the core and returns JSON.

[Diagram: two paths into the detection core. Batch data path: Elasticsearch (index: data, DS_Status: REQUEST_AI) → scheduler (polls every cycle, search-after, 50 per page) → queue (maxsize 1000) → worker pool (spaCy, SPACY_WORKERS: 2) → handle_doc, the detection core (1 · guardrails, 2 · language routing, 3 · matchers) → Elasticsearch (DS_Status: FINISHED). Realtime endpoint: POST /profile-text (single text, on demand) → handle_doc → JSON back.]
Section 02 · The job

Find the sensitive data, label it, hand it back

The Profiler reads documents already indexed in Elasticsearch. For each one it detects names, places and dates, then looks for language that reveals special categories of personal data: health, religion, politics, sexual orientation, ethnicity, criminal history, union membership and employment actions such as terminations and warnings. Every match is tagged with a type, the matched text and its position, then saved back onto the document, so compliance work starts from facts rather than guesswork.

Section 03 · Two ways in

A pipeline and an endpoint

Mode A

Scheduled pipeline

A background thread continuously polls Elasticsearch for documents flagged DS_Status: REQUEST_AI and feeds them to a pool of worker processes. This is how bulk archives get profiled.

Mode B

REST endpoint

A POST /profile-text route profiles a single piece of text on demand, returning colour-coded matches as JSON. A /health route reports liveness.
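A client call to the realtime route might be built as in the sketch below. Only the route, the method and the JSON response are documented here; the request field name `text`, the helper name `build_profile_request` and the localhost base URL are assumptions made for illustration.

```python
import json
import urllib.request


def build_profile_request(base_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the /profile-text route."""
    # Assumption: the endpoint accepts a JSON object with a "text" field.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/profile-text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_profile_request(
    "http://localhost:8000", "Jane was treated for diabetes."
)
# Sending it would be: urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it keeps the payload easy to inspect before anything touches the network.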

Section 04 · The batch pipeline

From flagged document to finished profile

On startup the app waits for the Elasticsearch cluster to turn healthy, then launches the scheduler and the worker pool. The cycle below repeats forever.

  • 01  Scheduler queries Elasticsearch. Every cycle it scans the data index for documents where DS_Status = REQUEST_AI, paging 50 at a time with search-after.
  • 02  Each document becomes a task. Documents are wrapped as tasks and pushed onto a shared multiprocessing queue. If the queue fills, the scheduler pauses, providing natural back-pressure so memory stays bounded.
  • 03  Workers pick up tasks. A pool of spaCy worker processes (2 by default) pulls tasks in parallel, runs the detection core and merges any pre-existing labels already on the document.
  • 04  Detection core runs. The text is screened, language-routed and passed through up to three detection algorithms. This is handle_doc, expanded below.
  • 05  Results written back. Findings are de-duplicated and bulk-updated onto the document, and the status is set to FINISHED. Even on error the status flips to FINISHED, so nothing is reprocessed endlessly.
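Step 01 can be sketched as a generator that walks the flagged documents one page at a time. The `search_after` request parameter and the `sort` values on each hit are standard Elasticsearch features; the `search` callable, the `doc_key` sort field and the use of a `term` query on `DS_Status` are assumptions, and `fake_search` exists only so the sketch runs without a cluster.

```python
from typing import Callable, Iterator

PAGE_SIZE = 50  # page size stated in step 01


def iter_flagged(search: Callable[[dict], dict]) -> Iterator[dict]:
    """Yield every hit with DS_Status == REQUEST_AI, one page at a time.

    `search` is a hypothetical callable that takes a request body and
    returns an Elasticsearch-style response dict.
    """
    body = {
        "size": PAGE_SIZE,
        "query": {"term": {"DS_Status": "REQUEST_AI"}},
        # Assumption: a unique, sortable field exists to give a stable order.
        "sort": [{"doc_key": "asc"}],
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body = {**body, "search_after": hits[-1]["sort"]}


def fake_search(body: dict) -> dict:
    """Stand-in for the cluster: 120 documents sorted by an integer key."""
    docs = [{"_id": str(i), "sort": [i]} for i in range(120)]
    start = body.get("search_after", [-1])[0] + 1
    return {"hits": {"hits": docs[start:start + body["size"]]}}


ids = [hit["_id"] for hit in iter_flagged(fake_search)]
```

Because each page resumes from the last hit's sort values rather than from an offset, the cost of a page does not grow with how deep into the result set the scheduler already is.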
Written back

Per document

DS_EntryType_Count
DS_EntryType_List
DS_EntryType_Values
DS_EntryType_Index
DS_ProfiledAt
DS_Status = FINISHED
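The rule that a document ends up FINISHED even when profiling fails can be sketched as below. The function names and the ISO 8601 timestamp format are assumptions, and the `DS_EntryType_*` fields are left out because their exact shapes are not described here.

```python
import logging
from datetime import datetime, timezone
from typing import Callable, List, Tuple


def profile_safely(
    text: str, detect: Callable[[str], List[dict]]
) -> Tuple[List[dict], dict]:
    """Run detection and return (findings, status_update); never raises."""
    findings: List[dict] = []
    try:
        findings = detect(text)
    except Exception:
        # A failed document is still closed out below, so the scheduler
        # does not pick it up again on the next cycle.
        logging.exception("profiling failed; marking document FINISHED anyway")
    status_update = {
        "DS_Status": "FINISHED",
        # Assumption: an ISO 8601 UTC timestamp; the real format is not stated.
        "DS_ProfiledAt": datetime.now(timezone.utc).isoformat(),
    }
    return findings, status_update


def broken_detector(text: str) -> List[dict]:
    raise ValueError("model crashed")  # simulated failure


findings, update = profile_safely("some text", broken_detector)
```

The trade-off is that a document which failed looks the same as one with no findings unless the error is recorded somewhere else, such as the log line above.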
Back-pressure on a 1000-deep queue keeps the whole pipeline within a fixed memory envelope, no matter how large the archive.
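A bounded queue gives this back-pressure almost for free: `put` blocks while the queue is full. The sketch below uses the standard library's `queue.Queue` with a tiny bound so the effect is visible; `multiprocessing.Queue(maxsize=1000)` has the same blocking `put` and raises the same `queue.Full` on timeout. The `offer` helper and the task shape are hypothetical.

```python
import queue


def offer(tasks: queue.Queue, task: dict, timeout: float) -> bool:
    """Try to enqueue a task; False means the queue stayed full for `timeout` seconds."""
    try:
        tasks.put(task, timeout=timeout)
        return True
    except queue.Full:
        return False


tasks: queue.Queue = queue.Queue(maxsize=3)  # tiny bound for the demo; the text uses 1000

# Nothing is consuming yet, so only the first three tasks fit.
accepted = [offer(tasks, {"doc_id": i}, timeout=0.01) for i in range(5)]

# Once a worker takes a task, a slot opens and the producer can continue.
tasks.get()
accepted_after_get = offer(tasks, {"doc_id": 5}, timeout=0.01)
```

A scheduler that calls `put` without a timeout simply waits at that line until a worker frees a slot, which is what keeps the number of in-flight documents capped.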
Section 05 · Inside the core

Screen, route, then match

Before any model runs, handle_doc filters out work it should not do, then picks the right tool for the language. Only after that do the matchers fire.

[Diagram: document text + detected language → Guardrails (whitelisted types only; skip json, xml, log; truncate > 20,000 chars; drop unknown language) → Routing (has dependency patterns? yes → whole text, no → split sentences) → Matchers (NER, keyword phrases, dependency grammar) → de-duplicate + merge labels → write back.]
Step 1

Decide what to process

Only whitelisted document types are profiled (doc, docx, pdf, eml, txt and more). Structured formats like json, xml and log are skipped. Text longer than 20,000 characters is flagged and truncated, and unknown-language content is dropped.
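The screening rules above fit in a few lines. This is a minimal sketch, not the actual `handle_doc` code: the function name `screen`, the way an unknown language is represented (`None` or the string "unknown") and the shortened whitelist are assumptions.

```python
from typing import Optional, Tuple

ALLOWED_TYPES = {"doc", "docx", "pdf", "eml", "txt"}  # the real whitelist has more entries
MAX_CHARS = 20_000


def screen(
    doc_type: str, language: Optional[str], text: str
) -> Optional[Tuple[str, bool]]:
    """Return (text_to_profile, was_truncated), or None when the document is skipped."""
    if doc_type.lower() not in ALLOWED_TYPES:
        return None  # json, xml, log and anything else not whitelisted
    if not language or language == "unknown":
        return None  # unknown-language content is dropped
    if len(text) > MAX_CHARS:
        return text[:MAX_CHARS], True  # flagged and truncated
    return text, False
```

Putting the cheap checks first means no model is loaded or run for documents that would be rejected anyway.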

Step 2

Pick the language strategy

If a language has grammatical dependency patterns, the whole text is analysed at once. Otherwise the text is split into sentences and each is analysed in turn, with the multilingual model as a universal fallback.
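The routing decision can be sketched as a function that returns the units of text to analyse. The set of languages with dependency patterns is an assumption, and the regular expression is a naive stand-in for the universal sentence splitter, which is not described in detail here.

```python
import re
from typing import List

# Assumption: which languages ship dependency patterns is not listed in the text.
DEPENDENCY_LANGUAGES = {"en", "da"}


def plan_units(text: str, language: str) -> List[str]:
    """Return the pieces of text that will each be analysed in one pass."""
    if language in DEPENDENCY_LANGUAGES:
        return [text]  # grammar patterns see the whole text at once
    # Naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


whole = plan_units("Jane was ill. She recovered.", "en")
split = plan_units("Jane was ill. She recovered.", "fi")
```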

Section 06 · Detection algorithms

Three lenses on the same text

Each finding carries a prefixed label so downstream systems know how it was found. The three lenses run over the same sentence and their results are merged.

[Diagram: one sentence, three lenses, applied to "Jane was treated for diabetes." Named entity recognition (statistical model; people, dates): S_PER · S_PER_FULL · S_DATE, "Jane" promoted to a name finding. Keyword phrase matcher (lemma-aware; catches inflections): S_K_HEALTH, "diabetes" from the health dictionary. Dependency grammar (subject + sensitive object): S_S_HEALTH, subject "Jane" + object "diabetes" confirmed.]
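Merging the three result sets and dropping duplicates might look like the sketch below. Each finding carries a type, the matched text and its position, as described earlier; the dictionary keys `label`, `text` and `start`, and the choice of all three as the identity of a finding, are assumptions.

```python
from typing import Dict, List


def merge_findings(*batches: List[Dict]) -> List[Dict]:
    """Concatenate batches in order, keeping the first copy of each finding."""
    seen = set()
    merged = []
    for batch in batches:
        for finding in batch:
            key = (finding["label"], finding["text"], finding["start"])
            if key not in seen:
                seen.add(key)
                merged.append(finding)
    return merged


# Findings for "Jane was treated for diabetes." ("diabetes" starts at offset 21).
ner = [{"label": "S_PER", "text": "Jane", "start": 0}]
keywords = [{"label": "S_K_HEALTH", "text": "diabetes", "start": 21}]
existing = [{"label": "S_PER", "text": "Jane", "start": 0}]  # label already on the document

merged = merge_findings(existing, ner, keywords)
```

Because the label is part of the key, a keyword hit and a grammar hit on the same word both survive, which keeps the information about how each finding was made.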
01 · NER

Named entities

spaCy's statistical model picks out people, organisations, places, dates and times. Names of the right shape (two to three distinct alphabetic words, no digits) are promoted to full-name findings.

labels: S_PER, S_PER_FULL, S_ORG, S_GPE, S_LOC, S_DATE
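The shape test for promoting a person entity to a full-name finding can be sketched as follows. Reading "distinct" as "no repeated word, ignoring case" is an interpretation, and the function name is hypothetical.

```python
def is_full_name(entity_text: str) -> bool:
    """True for two to three distinct, purely alphabetic words."""
    words = entity_text.split()
    if not 2 <= len(words) <= 3:
        return False
    if not all(word.isalpha() for word in words):
        return False  # digits, hyphens and apostrophes all fail str.isalpha
    return len({word.lower() for word in words}) == len(words)
```

A check based only on `str.isalpha` also rejects hyphenated names and names with apostrophes, so a production rule would likely need to be more forgiving than this sketch.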
02 · Keywords

Phrase matching

A lemma-aware phrase matcher scans for terms from per-customer dictionaries of sensitive vocabulary, so it catches inflected forms rather than only exact strings.

labels: S_K_<TYPE>
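To show why matching on lemmas catches inflected forms, the toy sketch below reduces each word to a base form before looking it up. It is a stand-in, not spaCy's matcher: the lemma table, the dictionary entries and the function name are all invented for illustration, and it handles single words only, whereas a phrase matcher also covers multi-word terms.

```python
import re
from typing import List, Tuple

# Toy lemma table standing in for a real lemmatizer (assumption).
LEMMAS = {"diabetics": "diabetic"}
# Hypothetical per-customer dictionary: type name -> base-form terms.
DICTIONARY = {"HEALTH": {"diabetes", "diabetic"}}


def keyword_findings(text: str) -> List[Tuple[str, str, int]]:
    """Return (label, matched_text, start_offset) for each dictionary hit."""
    findings = []
    for match in re.finditer(r"[^\W\d_]+", text):  # runs of letters
        word = match.group().lower()
        lemma = LEMMAS.get(word, word)
        for type_name, terms in DICTIONARY.items():
            if lemma in terms:
                findings.append((f"S_K_{type_name}", match.group(), match.start()))
    return findings


hits = keyword_findings("Two diabetics attended.")
```

The dictionary holds only the base form "diabetic", yet the plural in the text is still found because the comparison happens on the lemma.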
03 · Grammar

Dependency matching

Grammatical patterns confirm a real claim: a person or allowed pronoun is the subject, and a sensitive keyword is the object. This cuts false positives by demanding context, not just a word.

labels: S_S_<TYPE>
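The subject-plus-object rule can be illustrated on a hand-built parse of "Jane was treated for diabetes." The sketch below does not call spaCy; the `Token` structure, the pronoun allow-list and the dictionary entry are assumptions, and the dependency labels follow the style of spaCy's English parser.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    dep: str                 # dependency label
    head: int                # index of the head token; the root points at itself
    is_person: bool = False  # True when NER tagged the token as a person


SUBJECT_DEPS = {"nsubj", "nsubjpass"}
OBJECT_DEPS = {"dobj", "pobj"}
ALLOWED_PRONOUNS = {"he", "she"}  # hypothetical allow-list
HEALTH_TERMS = {"diabetes"}       # hypothetical dictionary entry


def ancestors(tokens: List[Token], index: int) -> List[int]:
    """Indices visited when following head links from a token up to the root."""
    path = []
    while tokens[index].head != index:
        index = tokens[index].head
        path.append(index)
    return path


def confirmed_claims(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Pair each person-like subject with sensitive objects under the same verb."""
    claims = []
    for subject in tokens:
        if subject.dep not in SUBJECT_DEPS:
            continue
        if not (subject.is_person or subject.text.lower() in ALLOWED_PRONOUNS):
            continue
        for index, candidate in enumerate(tokens):
            is_sensitive_object = (
                candidate.dep in OBJECT_DEPS
                and candidate.text.lower() in HEALTH_TERMS
            )
            if is_sensitive_object and subject.head in ancestors(tokens, index):
                claims.append(("S_S_HEALTH", subject.text, candidate.text))
    return claims


sentence = [
    Token("Jane", "nsubjpass", 2, is_person=True),
    Token("was", "auxpass", 2),
    Token("treated", "ROOT", 2),
    Token("for", "prep", 2),
    Token("diabetes", "pobj", 3),
]
claims = confirmed_claims(sentence)
```

A sentence such as "Diabetes is common." produces no claim under this rule, because its subject is neither a person nor an allowed pronoun. That is the false-positive reduction the grammar lens is there for.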
Section 07 · What it detects

Special categories and identifiers

The exact categories are driven by per-customer dictionaries loaded from Elasticsearch, so each subscription can enable only the types it cares about.

GDPR special categories

Sensitive personal data

Health & symptoms · Religious orientation · Political orientation · Sexual orientation · Ethnic origin · Criminal behaviour · Union membership · Employee termination · Employee warning
Named entities

Personal identifiers

Person · Full name · Organisation · Location · Place / GPE · Date · Time · Nationality / group
Section 08 · Reach

Built for scale and reach

15+ language models
3 detection methods
1k task queue depth
9 sensitive categories

Dedicated models ship for English, Danish, German, Dutch, French, Italian, Spanish, Swedish, Norwegian, Finnish, Polish, Portuguese, Lithuanian, Croatian (also serving Serbian, Bosnian and Montenegrin) and Ukrainian, with a multilingual model covering everything else and a universal sentence splitter underneath.

Section 09 · Under the hood

The stack

Web layer: Flask + Gunicorn :8000
NLP engine: spaCy 3.8 + Torch (CPU)
Concurrency: multiprocessing pool
Data store: Elasticsearch 8.11
Dictionaries: per-company, via IAM service
Packaging: Docker, python:3.12
Resilience: process recycle every 12h
Noise control: false-positive suppression lists
In one line

Poll, profile, persist. Repeat.

The AI Profiler turns a quiet archive of documents into a labelled map of personal data, all inside the customer's own search cluster, so compliance work starts from facts rather than guesswork.
