Data & More · Engineering · Deep dive index

The whole platform, on one map.

Data & More is a multi-tenant GDPR and data-governance platform. It connects to an organisation's email, file storage and chat systems, reads everything it finds, classifies every document for personal and sensitive data, and helps the data owner act on it. This page is the map: what the platform does, how the parts fit, and where to read the detail.

You cannot govern what you have not first read, classified, and understood.

~42 microservices
2 core languages: Python, Java
7 pipeline stages, source to report
1 shared store: Elasticsearch
Section 01 · How it works

Classify, Verify, Delete, always on

The platform runs continuously, not on a fixed quarterly cadence. The process has a name, Classify, Verify, Delete: the platform classifies what sits in the archive, the data owner verifies the findings on their own ground, and the result is deletion (or archive, edit or restrict) carried out in the source system. Sources and exceptions are the one-time configuration that feeds the loop.

[Diagram: the always-on loop of Classify, Verify, Delete]
  • Classify: the platform's job
  • Verify: the data owner's job
  • Delete: the continuous goal
Where the data comes from

Connected sources

Each source is wired through one of the platform's ingestion connectors. The cycle starts here and writes findings back here too.

Office 365 · Exchange · SharePoint · OneDrive · Teams · Outlook · Gmail · Google Drive · File shares
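
The idea of one connector per source, each reading from its source and writing findings back to it, can be sketched as a small interface. This is a minimal illustration only: `Connector`, `InMemoryConnector`, `crawl`, `write_back` and the source keys are hypothetical names, not the platform's actual API.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class Document:
    source: str   # illustrative source label, e.g. "gmail"
    doc_id: str
    text: str


class Connector(Protocol):
    """Hypothetical connector shape: read from, and write back to, one source."""

    def crawl(self) -> Iterator[Document]: ...
    def write_back(self, doc_id: str, finding: str) -> None: ...


class InMemoryConnector:
    """Toy connector backed by a dict, standing in for a real source."""

    def __init__(self, source: str, items: dict[str, str]) -> None:
        self.source = source
        self.items = items
        self.findings: dict[str, str] = {}

    def crawl(self) -> Iterator[Document]:
        for doc_id, text in self.items.items():
            yield Document(self.source, doc_id, text)

    def write_back(self, doc_id: str, finding: str) -> None:
        # Findings are stored next to the source they came from.
        self.findings[doc_id] = finding


# One connector per source, keyed by an assumed source name.
connectors = {
    "gmail": InMemoryConnector("gmail", {"m1": "Hello from Anna"}),
    "file_share": InMemoryConnector("file_share", {"f1": "Q3 budget"}),
}
```

Keeping every source behind the same two operations means the rest of the loop does not need to know which system a document came from.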
What Delete means in practice

Three ways to delete

Deletion is never automatic. The data owner reviews each finding locally and picks one of three actions, recorded in the audit log.

  • Restrict: the record stays put but is flagged for restricted access.
  • Archive (retention): the record is moved to an archive with a defined retention rule.
  • Delete: the record is permanently removed from the source system.
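
The three actions and the audit record can be modelled roughly as follows. This is a sketch under assumptions: the class names, the fields of `AuditEntry` and the rule that an entry needs a named data owner are illustrative, not the platform's real schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Action(Enum):
    RESTRICT = "restrict"  # record stays put, flagged for restricted access
    ARCHIVE = "archive"    # record moved to an archive with a retention rule
    DELETE = "delete"      # record permanently removed from the source system


@dataclass(frozen=True)
class AuditEntry:
    finding_id: str
    action: Action
    decided_by: str        # hypothetical field: the data owner who confirmed
    decided_at: datetime


@dataclass
class AuditLog:
    entries: list[AuditEntry] = field(default_factory=list)

    def record(self, finding_id: str, action: Action, decided_by: str) -> AuditEntry:
        # Nothing is automatic: an action without a named owner is rejected.
        if not decided_by:
            raise ValueError("an action needs a data owner who confirmed it")
        entry = AuditEntry(finding_id, action, decided_by, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry
```

A call such as `AuditLog().record("finding-1", Action.ARCHIVE, "owner@example.com")` appends one entry; the same call with an empty owner raises `ValueError` and leaves the log unchanged.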
Section 02 · Conceptual model

How the parts fit together

Every request enters through one TLS-terminating NGINX proxy and is routed to the application tier: the Vue client, the main Flask API, and the IAM, analytics and LLM services. The API hands slow work to an asynchronous backbone of Celery workers over RabbitMQ, which also carries the scan, ingest and enforce events for the Java tier. The heavy lifting (crawling sources, extracting text, enforcing policy) runs in that Java tier. Underneath it all sits the data layer, with Elasticsearch as the document store every service shares.

[Diagram: conceptual model, marked "not 100% accurate"]

Edge
  • NGINX: TLS 1.2/1.3 · reverse proxy · rate limit · IP allowlist

App tier
  • Client: Vue 3 SPA · 281 components · 24+ locales
  • API: Flask 3 · Celery · :8000 · the central hub
  • IAM: Flask · JWT · :5000 · LDAP / Active Directory
  • Analytics: Flask · pandas · :6000 · reports, charts, PDF

Messaging
  • Celery workers: reindex · bulk ops · alerts
  • RabbitMQ: task broker · v4.2
  • Task Worker: async job runner · writes results

Store
  • Elasticsearch: 9.x · shared document store
  • PostgreSQL: 17.x · IAM · pgvector
  • Backup: daily backups · 180d

Services
  • java_core: scan · ingest · enforce
  • java_profiler: regex · NER · FastText
  • Enforcer: policy actions
  • OCR: image · text · MRZ
  • Data Subject Mgr: person lookup

Ingestion
  • Graph Ingestion: Microsoft Graph
  • EWS Ingestion: Exchange Web Services
  • SP Ingestion: SharePoint
  • Google Ingestion: Workspace · Drive
  • Web Ingestion: URLs · scraping

Arrow labels: async work · sync HTTP · reads · writes · feeds
Legend: request and data path · persisted store / output · sub-component inside a tier

The full service estate, gateway, tenancy and delivery model are covered in Platform Architecture below.
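
The hand-off from the API to the asynchronous backbone can be illustrated with the standard library alone. In this sketch a `queue.Queue` and a thread stand in for RabbitMQ and a Celery worker; `handle_request`, `worker` and the job names are hypothetical, and none of this is the platform's actual code.

```python
import queue
import threading

# Stand-in for the broker; the platform itself uses RabbitMQ with Celery workers.
tasks: queue.Queue = queue.Queue()
results: dict[str, str] = {}


def handle_request(job_id: str, kind: str) -> dict[str, str]:
    """API side: enqueue the slow work and answer immediately."""
    tasks.put((job_id, kind))
    return {"job_id": job_id, "status": "accepted"}


def worker() -> None:
    """Worker side: drain the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()
            break
        job_id, kind = item
        results[job_id] = f"{kind} done"  # placeholder for reindex, bulk ops, alerts
        tasks.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("job-1", "reindex")
tasks.put(None)  # tell the worker to stop
tasks.join()     # wait until everything queued has been processed
thread.join()
```

The point of the pattern is that the request returns as soon as the job is queued, while the result is written later by whichever worker picks it up.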

Section 03 · End to end

From a raw source to an enforced policy

A document makes the same journey every time. A source is configured once; from then on the platform crawls it, extracts its text, profiles it for personal data, checks it against the tenant's policies, acts on the verdict, and reports the result. Cheap, deterministic stages run first; the expensive AI and enforcement steps run only on what reaches them.

[Diagram: pipeline, Ingest → Profile → Validate → Enforce → Report]
  • 1 · Ingest: java_core SCAN + INGEST · collector · graph · ews · google · Apache Tika · OCR
  • 2 · Profile: classify content · Logic Profile (java_profiler) · AI Profile (ai-profiler, spaCy)
  • 3 · Validate: PolicyValidator · retention rules · sensitivity class · decide action
  • 4 · Enforce: PolicyEnforcer · delete · move · archive · tag · no-op
  • 5 · Report: analytics · dashboards · alerts · notify · PDF / Excel
All stages read and write Elasticsearch.
Legend: main pipeline path · shared store, touched at every stage · output to the user
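
The ordering principle described above, cheap deterministic stages first and expensive steps only on what reaches them, can be sketched in a few lines. The function names, the email regex and the toy policy here are assumptions made for illustration; they are not the platform's profilers or validators.

```python
import re
from dataclasses import dataclass, field

# Illustrative pattern only; real profiling covers far more than email addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Doc:
    doc_id: str
    text: str
    findings: list[str] = field(default_factory=list)
    action: str = "no-op"


calls = {"cheap": 0, "expensive": 0}


def cheap_profile(doc: Doc) -> bool:
    """Deterministic check; True means the document needs a closer look."""
    calls["cheap"] += 1
    hits = EMAIL.findall(doc.text)
    doc.findings.extend(hits)
    return bool(hits)


def expensive_profile(doc: Doc) -> None:
    """Placeholder for a costly model-based step."""
    calls["expensive"] += 1
    doc.findings.append("reviewed-by-model")


def validate(doc: Doc) -> None:
    """Toy policy: any finding means the document is tagged."""
    doc.action = "tag" if doc.findings else "no-op"


def run(docs: list[Doc]) -> list[Doc]:
    for doc in docs:
        if cheap_profile(doc):      # cheap stage runs on every document
            expensive_profile(doc)  # expensive stage runs only on survivors
        validate(doc)
    return docs
```

Running `run` on one document that contains an email address and one that does not calls the cheap stage twice and the expensive stage once, which is the saving the ordering is meant to deliver.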

What the platform does, in six verbs

01 · Connect

Wire in email, file storage and chat through one connector per source.

02 · Scan & read

Crawl each source and extract text, including OCR for images and scans.

03 · Classify

Profile every document for personal and sensitive data against the taxonomy.

04 · Decide

Check findings against the tenant's retention and sensitivity policies.

05 · Act

Edit, archive, restrict or delete, always confirmed by the data owner.

06 · Report

Dashboards, alerts and exportable reports for evidence and audit.

Section 04 · Read the detail

Inside this deep dive

The map above stays deliberately shallow. Four focused field manuals carry the detail, each one owning a distinct layer of the platform with no overlap between them.

Where this fits · Start here

One map, four field manuals.

This overview is the entry point to the Deep Dive section. Read it for the shape of the whole platform, then follow a link above to the layer you need. Each deep dive is self-contained and assumes only what is on this page.

deep-dive · support.dataandmore.com/en/knowledge/deep-dive