Data & More · NLP Engineering · github.com/dataandmore/ai-profiler
Know exactly what personal data is hiding in the files.
A multilingual NLP engine that scans documents already indexed in Elasticsearch, detects names, places and dates, then classifies sensitive language into GDPR special categories and writes every finding back onto the document. The data never leaves the customer's own search cluster.
Poll, profile, persist. Repeat.
15+ dedicated language models
3 detection algorithms
9 sensitive categories
1k task queue depth
Section 01 · Overview
One engine, two front doors
The same detection core (handle_doc) powers a high-volume background pipeline and a live request endpoint. Batch work flows left to right through the scheduler, a bounded queue and a pool of spaCy workers; the realtime endpoint feeds a single piece of text straight into the core and returns JSON.
Diagram legend: batch data path · realtime endpoint · worker / result · detection core
Section 02 · The job
Find the sensitive data, label it, hand it back
The Profiler reads documents already indexed in Elasticsearch. For each one it detects names, places and dates, then looks for language that reveals special categories of personal data: health, religion, politics, sexual orientation, ethnicity, criminal history, union membership and employment actions. Every match is tagged with a type, the matched text and its position, then saved back onto the document, so compliance work starts from facts rather than guesswork.
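As a rough illustration of a finding that carries a type, the matched text and its position, here is a minimal sketch. The `Finding` class, its field names, the `find_term` helper and the `HEALTH` label are all hypothetical; the text does not specify the stored schema, and a plain case-insensitive search stands in for the real detection logic.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """One match: its kind, the text that matched, and where it sits."""
    type: str   # category or entity label (naming is hypothetical)
    text: str   # the matched span, copied from the document
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


def find_term(document_text: str, term: str, finding_type: str) -> list[Finding]:
    """Tag every case-insensitive occurrence of `term` in the text."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [
        Finding(finding_type, match.group(0), match.start(), match.end())
        for match in pattern.finditer(document_text)
    ]


sample = "Diabetes runs in the family; diabetes was diagnosed in 2019."
findings = find_term(sample, "diabetes", "HEALTH")
print(json.dumps([asdict(f) for f in findings], indent=2))
```

Storing offsets alongside the matched text lets a reviewer jump to the exact span instead of searching the document again.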
Section 03 · Two ways in
A pipeline and an endpoint
Mode A
Scheduled pipeline
A background thread continuously polls Elasticsearch for documents flagged DS_Status: REQUEST_AI and feeds them to a pool of worker processes. This is how bulk archives get profiled.
Mode B
REST endpoint
A POST /profile-text route profiles a single piece of text on demand, returning colour-coded matches as JSON. A /health route reports liveness.
Section 04 · The batch pipeline
From flagged document to finished profile
On startup the app waits for the Elasticsearch cluster to turn healthy, then launches the scheduler and the worker pool. The cycle below repeats forever.
01 Scheduler queries Elasticsearch. Every cycle it scans the data index for documents where DS_Status is REQUEST_AI, paging 50 at a time with search_after.
02 Each document becomes a task. Documents are wrapped as tasks and pushed onto a shared multiprocessing queue. If the queue fills, the scheduler pauses, providing natural back-pressure so memory stays bounded.
03 Workers pick up tasks. A pool of spaCy worker processes (2 by default) pulls tasks in parallel, runs the detection core and merges the results with any labels already on the document.
04 Detection core runs. The text is screened, language-routed and passed through up to three detection algorithms. This is handle_doc, expanded below.
05 Results written back. Findings are de-duplicated and bulk-updated onto the document, and the status is set to FINISHED. Even on error the status flips to FINISHED, so nothing is reprocessed endlessly.
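The paging in step 01 might look roughly like the sketch below. It only builds request bodies and follows the cursor; `search` is any callable that takes a body and returns an Elasticsearch-style response. The sort field `doc_id`, the use of a `term` query, and the in-memory `fake_search` are assumptions made for illustration, not details from the project.

```python
from typing import Any, Callable, Iterator, Optional

PAGE_SIZE = 50  # from the text: 50 documents per page


def build_query(search_after: Optional[list] = None) -> dict[str, Any]:
    """Build one page request for documents waiting to be profiled."""
    body: dict[str, Any] = {
        "size": PAGE_SIZE,
        "query": {"term": {"DS_Status": "REQUEST_AI"}},
        # search_after needs a stable sort; this sort field is an assumption.
        "sort": [{"doc_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iter_flagged(search: Callable[[dict], dict]) -> Iterator[dict]:
    """Yield every flagged hit, resuming each page after the last hit's sort values."""
    cursor = None
    while True:
        hits = search(build_query(cursor))["hits"]["hits"]
        if not hits:
            return
        yield from hits
        cursor = hits[-1]["sort"]


def fake_search(body: dict) -> dict:
    """In-memory stand-in for a search call over 120 fake documents."""
    start = body.get("search_after", [-1])[0] + 1
    ids = range(start, min(start + body["size"], 120))
    return {"hits": {"hits": [{"_id": str(i), "sort": [i]} for i in ids]}}


print(sum(1 for _ in iter_flagged(fake_search)))
```

Unlike offset-based paging, each request resumes from the previous page's last sort values, so the cost per page does not grow as the scan moves deeper into the index.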
Back-pressure on a 1000-deep queue keeps the whole pipeline within a fixed memory envelope, no matter how large the archive.
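The back-pressure idea can be sketched with a bounded queue whose `put` blocks when it is full. To keep the example small and portable it uses threads and `queue.Queue` in place of the worker processes and multiprocessing queue described above, and a queue depth of 4 instead of 1000; `profile`, `worker` and `scheduler` are hypothetical names.

```python
import queue
import threading

QUEUE_DEPTH = 4   # the text cites 1000; kept tiny here so the bound is visible
WORKER_COUNT = 2  # the default pool size given in the text
STOP = object()   # sentinel telling a worker to exit

tasks: queue.Queue = queue.Queue(maxsize=QUEUE_DEPTH)
statuses: dict[int, str] = {}
peak_depth = 0


def profile(doc_id: int) -> None:
    """Hypothetical stand-in for the detection core; fails on some documents."""
    if doc_id % 7 == 0:
        raise ValueError(f"cannot profile document {doc_id}")


def worker() -> None:
    while True:
        doc_id = tasks.get()
        if doc_id is STOP:
            return
        try:
            profile(doc_id)
        except ValueError:
            pass  # a real worker would log the failure
        finally:
            # Mark the document done even on error, so it is not picked up again.
            statuses[doc_id] = "FINISHED"


def scheduler(doc_ids) -> None:
    global peak_depth
    for doc_id in doc_ids:
        tasks.put(doc_id)  # blocks while the queue is full: the back-pressure
        peak_depth = max(peak_depth, tasks.qsize())
    for _ in range(WORKER_COUNT):
        tasks.put(STOP)


threads = [threading.Thread(target=worker) for _ in range(WORKER_COUNT)]
for thread in threads:
    thread.start()
scheduler(range(100))
for thread in threads:
    thread.join()
print(len(statuses), peak_depth <= QUEUE_DEPTH)
```

Because the producer can never run more than `QUEUE_DEPTH` tasks ahead of the consumers, the number of documents held in memory is capped regardless of how many are waiting in the index.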
Section 05 · Inside the core
Screen, route, then match
Before any model runs, handle_doc filters out work it should not do, then picks the right tool for the language. Only after that do the matchers fire.
Step 1
Decide what to process
Only whitelisted document types are profiled (doc, docx, pdf, eml, txt and more). Structured formats like json, xml and log are skipped. Text longer than 20,000 characters is flagged and truncated, and unknown-language content is dropped.
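The screening rules above could be expressed along these lines. This is a sketch, not the project's code: `screen` and `Screened` are hypothetical names, the whitelist holds only the five types the text names (it says there are more), and an unknown language is represented here as `None`.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_TYPES = {"doc", "docx", "pdf", "eml", "txt"}  # partial: the text says "and more"
MAX_CHARS = 20_000


@dataclass
class Screened:
    text: str        # the text that will actually be profiled
    truncated: bool  # True when the original exceeded MAX_CHARS


def screen(doc_type: str, language: Optional[str], text: str) -> Optional[Screened]:
    """Return the text to profile, or None when the document should be skipped."""
    if doc_type.lower() not in ALLOWED_TYPES:
        return None  # covers structured formats such as json, xml and log
    if language is None:
        return None  # unknown-language content is dropped
    if len(text) > MAX_CHARS:
        return Screened(text[:MAX_CHARS], truncated=True)
    return Screened(text, truncated=False)


print(screen("json", "en", "{}"))
print(screen("pdf", "en", "a" * 25_000).truncated)
```

Running the cheap checks before any model is loaded means skipped documents cost almost nothing.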
Step 2
Pick the language strategy
If grammatical dependency patterns are defined for the detected language, the whole text is analysed at once. Otherwise the text is split into sentences and each is analysed in turn. Languages without a dedicated model fall back to the multilingual model.
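One way to picture the routing decision is the sketch below. The language sets are invented for illustration (the text does not say which languages have dependency patterns), model names are reduced to language codes, and a naive regex splitter stands in for the universal sentence splitter.

```python
import re

DEPENDENCY_LANGUAGES = {"en", "da"}          # hypothetical: languages with patterns
DEDICATED_MODELS = {"en", "da", "de", "fr"}  # hypothetical subset of dedicated models
FALLBACK_MODEL = "multilingual"


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def plan(language: str, text: str) -> list[tuple[str, str]]:
    """Return (model, chunk) pairs describing how the text would be analysed."""
    model = language if language in DEDICATED_MODELS else FALLBACK_MODEL
    if language in DEPENDENCY_LANGUAGES:
        return [(model, text)]  # whole text in a single pass
    return [(model, sentence) for sentence in split_sentences(text)]


print(plan("en", "Anna is ill. She stayed home."))
print(plan("ja", "One. Two."))
```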
Section 06 · Detection algorithms
Three lenses on the same text
Each finding carries a prefixed label so downstream systems know how it was found. The three lenses run over the same sentence and their results are merged.
01 · NER
Named entities
spaCy's statistical model picks out people, organisations, places, dates and times. Names of the right shape (two to three distinct alphabetic words, no digits) are promoted to full-name findings.
02 · Phrase matching
A lemma-aware phrase matcher scans for terms from per-customer dictionaries of sensitive vocabulary, so it catches inflected forms rather than only exact strings.
labels: S_K_<TYPE>
03 · Grammar
Dependency matching
Grammatical patterns confirm a real claim: a person or allowed pronoun is the subject, and a sensitive keyword is the object. This cuts false positives by demanding context, not just a word.
labels: S_S_<TYPE>
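Two of the rules in this section lend themselves to a short sketch: the shape test that promotes a name to a full-name finding, and the merging of results from several lenses. Both functions and the `Finding` tuple are hypothetical, the `HEALTH` type is an example value, and the shape test is deliberately strict: `str.isalpha` also rejects hyphens and apostrophes, which the real rule may well allow.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    label: str  # prefixed label, e.g. S_K_<TYPE> or S_S_<TYPE>
    text: str
    start: int
    end: int


def is_full_name(entity_text: str) -> bool:
    """True for two to three distinct, purely alphabetic words."""
    words = entity_text.split()
    if not 2 <= len(words) <= 3:
        return False
    if not all(word.isalpha() for word in words):
        return False  # rejects digits and punctuation
    return len({word.casefold() for word in words}) == len(words)


def merge_findings(*lenses: list[Finding]) -> list[Finding]:
    """Combine findings from all lenses, drop exact duplicates, sort by position."""
    unique = {finding for lens in lenses for finding in lens}
    return sorted(unique, key=lambda f: (f.start, f.end, f.label))


# Findings for the sentence "Anna has diabetes."
keyword_hits = [Finding("S_K_HEALTH", "diabetes", 9, 17),
                Finding("S_K_HEALTH", "diabetes", 9, 17)]
grammar_hits = [Finding("S_S_HEALTH", "diabetes", 9, 17)]

print(is_full_name("Anna Maria Jensen"), is_full_name("Agent 47"))
print(merge_findings(keyword_hits, grammar_hits))
```

Keeping the prefix in the label means the same span can appear once per lens, so a downstream system can still tell a bare keyword hit from a grammatically confirmed one.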
Section 07 · What it detects
Special categories and identifiers
The exact categories are driven by per-customer dictionaries loaded from Elasticsearch, so each subscription can enable only the types it cares about.
Person · Full name · Organisation · Location · Place / GPE · Date · Time · Nationality / group
Section 08 · Reach
Built for scale and reach
15+ language models
3 detection methods
1k task queue depth
9 sensitive categories
Dedicated models ship for English, Danish, German, Dutch, French, Italian, Spanish, Swedish, Norwegian, Finnish, Polish, Portuguese, Lithuanian, Croatian (also serving Serbian, Bosnian and Montenegrin) and Ukrainian, with a multilingual model covering everything else and a universal sentence splitter underneath.
Section 09 · Under the hood
The stack
Web layer: Flask + Gunicorn on port 8000
NLP engine: spaCy 3.8 + Torch (CPU)
Concurrency: multiprocessing pool
Data store: Elasticsearch 8.11
Dictionaries: per-company, via IAM service
Packaging: Docker, python:3.12
Resilience: process recycle every 12h
Noise control: false-positive suppression lists
In one line
Poll, profile, persist. Repeat.
The AI Profiler turns a quiet archive of documents into a labelled map of personal data, all inside the customer's own search cluster, so compliance work starts from facts rather than guesswork.