Data & More · Engineering · Platform field manual

Forty-two services,
one governed document.

Data & More is a multi-tenant GDPR and data-governance platform. It connects to an organisation's email, file storage and chat systems, scans what it finds, profiles every document for personal data, and enforces retention and deletion policies. This is how the pieces fit together: a polyglot microservices estate with a single source of truth at its centre.

You cannot govern what you have not first read, classified, and understood.

~42 microservices

2 core languages (Python, Java)

7 pipeline stages, source to report

1 shared store: Elasticsearch

Section 01 · Architecture

The platform at a glance

Every request enters through one TLS-terminating NGINX proxy and is routed to the application tier: the Vue client, the main Flask API, and the IAM, analytics and LLM services. The API hands slow work to an asynchronous backbone (Celery over RabbitMQ for tasks, Kafka for high-throughput document events). The heavy lifting (crawling sources, extracting text, enforcing policy) runs in the Java tier. Underneath it all sits the data layer, with Elasticsearch as the document store every service shares.

[Architecture diagram. Legend: request and data path; persisted store / output; sub-component inside a tier.]

The monorepo dm_main is the orchestration layer: each service lives in its own Git repository and is included as a submodule, and the services are wired together by roughly 42 Docker Compose files (one per environment and tier).

Section 02 · Data flow

From a raw source to an enforced policy

A document makes the same journey every time. A source is configured once; from then on the platform crawls it, extracts its text, profiles it for personal data, checks it against the tenant's policies, acts on the verdict, and reports the result. Cheap, deterministic stages run first; the expensive AI and enforcement steps run only on what reaches them.

[Pipeline diagram. Legend: main pipeline path; shared store, touched at every stage; output to the user.]
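The ordering described above, cheap stages first and expensive ones only for documents that survive them, can be sketched in Python as a chain of stages that stops early. This is an illustration only: the Document class, the stage functions and the "@" check are hypothetical stand-ins, not the platform's actual stages or data model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Document:
    """Hypothetical minimal document record used only for this sketch."""
    doc_id: str
    text: str = ""
    findings: List[str] = field(default_factory=list)
    verdict: Optional[str] = None


# A stage mutates the document and returns True to continue, False to stop.
Stage = Callable[[Document], bool]


def extract(doc: Document) -> bool:
    doc.text = doc.text.strip()
    return bool(doc.text)  # an empty document has nothing to profile


def profile(doc: Document) -> bool:
    # Cheap deterministic check standing in for real profiling.
    doc.findings = [word for word in doc.text.split() if "@" in word]
    return bool(doc.findings)  # no personal data found: stop here


def check_policy(doc: Document) -> bool:
    # Stand-in for the expensive step; reached only when findings exist.
    doc.verdict = "review"
    return True


STAGES: List[Stage] = [extract, profile, check_policy]


def run_pipeline(doc: Document, stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of the stages that executed."""
    executed = []
    for stage in stages:
        executed.append(stage.__name__)
        if not stage(doc):
            break
    return executed
```

A document with no findings never reaches check_policy, which is the property the pipeline ordering is meant to preserve.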

Section 03 · Frontend

The Vue 3 single-page app

The client is a custom-built single-page application that uses no off-the-shelf component library and is styled with its own SCSS design system. It is where administrators connect sources, build classification rules, manage policies, read reports, and talk to documents through the RAG chat module. Role-based views adapt the interface to each persona.

What ships in it

Multi-tenant dashboard with role-based views for Admin, Adminlite, AD, PM and DPO.
Document browser with tree and grid layouts, plus PDF, Word and JSON preview.
Classification builder for tags, dictionaries, algorithms and document classes.
Policy management for GDPR retention and deletion rules.
Chat / LLM module, a RAG assistant over indexed documents with streaming responses.
Source wizards for Exchange, SharePoint, Google, IMAP and file shares.

Frontend stack

Framework	Vue 3.5
Build	Vite 6 · TS 5.9
State	Vuex 4.1
Router	Vue Router 4.6
i18n	24+ languages
Charts	Chart.js
HTTP	Axios + JWT

~400 routes · 281 components · 35+ API modules

Section 04 · Python tier

The application and AI services

The Python tier holds the platform's brain: the central API, identity, analytics, the LLM and chat stack, OCR, data-loss prevention, and the spaCy-based AI profiler reviewed elsewhere. Most run Flask behind Gunicorn; notifications run on FastAPI.

Service	Framework	Port	Purpose
api	Flask 3 · Celery	8000	Main REST API: documents, policies, sources, config
iam	Flask 3.1 · Gunicorn	5000	Auth, JWT, LDAP/AD, role-based access
analytics	Flask · pandas	6000	Reporting, scikit-learn, Plotly, PDF generation
chat-api	Flask · LlamaIndex	8000	RAG pipeline, embeddings, multi-provider LLM
llm-management	Flask (async)	8000	LLM orchestration, pgvector, OpenAI/Ollama/Claude
ocr	Flask · Gunicorn	8080	EasyOCR, Tesseract, YOLO, PyMuPDF, MRZ
dlp	Flask · Gunicorn	8000	Data loss prevention, webhook and regex scanning
py-dm-notify	FastAPI · Uvicorn	8002	Real-time notifications, Redis pub/sub, email
ai-profiler	Flask · Gunicorn	n/a	spaCy NER and dictionary profiling, 11+ languages
alerts	Celery worker	n/a	Background alert processing over RabbitMQ

Hub

Main API

Python 3.12, Flask 3, Celery 5.4, SQLAlchemy 2. Owns document and source CRUD, policies, users, billing (Stripe, HubSpot), external integrations, and Celery workers for reindexing and bulk jobs.

Identity

IAM

JWT authentication with RSA key pairs, LDAP and Active Directory for enterprise SSO, role-based access control, and Alembic-managed migrations. Tenancy is scoped here.

Insight

Analytics

Statistical and ML-based reporting on scanned data, Plotly charts exported to static images, PDF reports via WeasyPrint, and a local Ollama option for AI-assisted analysis.

Section 05 · Java tier

The heavyweight processing engine

java_core

Scan, ingest, enforce

Java 11, running as Kafka consumers in distinct modes. It is the workhorse that touches the actual data.

SCAN crawls mailboxes, sites and shares.
INGEST extracts text with Apache Tika into Elasticsearch.
DELTA handles incremental updates.
CLEANUP removes stale data.

Enforcement lives here too: PolicyValidator checks documents against rules, and PolicyEnforcer executes the resulting action (delete, move, tag, revert or no-op).
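The split between validating and enforcing can be illustrated with a short Python sketch. It is not the Java implementation: only the five action names come from the text above, while RetentionRule, validate and enforce are hypothetical names, and the age-based rule is an assumed example of a retention policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Action(Enum):
    DELETE = "delete"
    MOVE = "move"
    TAG = "tag"
    REVERT = "revert"
    NO_OP = "no-op"


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical rule: apply `action` once a document exceeds `max_age_days`."""
    max_age_days: int
    action: Action


def validate(last_modified: date, rule: RetentionRule, today: date) -> Action:
    """Decide which action applies; this step never touches the data source."""
    if today - last_modified > timedelta(days=rule.max_age_days):
        return rule.action
    return Action.NO_OP


def enforce(doc_id: str, action: Action) -> str:
    """Report the action that would run; a real enforcer would call the source system."""
    if action is Action.NO_OP:
        return f"{doc_id}: left unchanged"
    return f"{doc_id}: {action.value} requested"
```

Keeping the decision separate from the side effect means a verdict can be logged or reviewed before anything is deleted or moved.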

Service	Framework	Java	Purpose
java_core	Kafka consumer	11	Scanning, ingestion, policy enforcement
java_profiler	Spring Boot 2.7	17	Regex and NER entity extraction, language detection (FastText)
collector	Spring Boot	17	Document collection and format conversion
graph-ingestion	Spring Boot	17	Microsoft Graph ingestion, management, enforcement
ews	Spring Boot	25	Exchange Web Services integration
google	Spring Boot	n/a	Google Workspace integration
known-persons	Spring Boot	17	Person and entity database and lookup

java_profiler exposes /profile-text, /profile/{fileId} and /language (FastText lid.176, 176 languages). Its pipeline runs tokenization, pre-tokenization matching, entity matching, validation, scoring and tagging.
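The matching, validation, scoring and tagging steps can be pictured with a small Python sketch. It is not java_profiler's code: the two regular expressions, the Luhn checksum used as the validation step, the score values and the 0.5 tagging threshold are all assumptions chosen for illustration, and tokenization is left out.

```python
import re

# Hypothetical patterns; the real profiler's rules are not shown in this manual.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def luhn_valid(number: str) -> bool:
    """Validate a candidate card number with the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def profile_text(text: str) -> dict:
    """Match entities, validate them, score them, then derive tags."""
    entities = []
    for match in CARD_PATTERN.finditer(text):
        # Assumed scores: a checksum failure makes a card match unlikely.
        score = 0.9 if luhn_valid(match.group()) else 0.2
        entities.append({"type": "card_number", "value": match.group(), "score": score})
    for match in EMAIL_PATTERN.finditer(text):
        entities.append({"type": "email", "value": match.group(), "score": 0.8})
    # Only entities at or above the assumed threshold contribute a tag.
    tags = sorted({e["type"] for e in entities if e["score"] >= 0.5})
    return {"entities": entities, "tags": tags}
```

Validation is what keeps a pattern match from becoming a false positive: a 16-digit order number matches the card pattern but usually fails the checksum, so it scores low and produces no tag.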

Section 06 · Data & infrastructure

Where everything is kept

Elasticsearch 9.3

The core data layer and the one store every service shares: scanned documents, profiles, tags, policies, audit logs and dictionaries. It is backed up to S3 daily with 180-day retention, plus monthly SLM snapshots.

PostgreSQL 17.6

The relational database for IAM (users, roles, tenants), subscriptions and DLP audit data, with the pgvector extension storing LLM embeddings.

RabbitMQ & Kafka

RabbitMQ brokers Celery task queues for the API, alerts and notifications. Kafka carries the high-throughput scan, ingest and enforce events that the Java tier consumes.

Redis

Caching, session storage and pub/sub for real-time notifications, used by py-dm-notify and for DLP deduplication.

AWS S3

Elasticsearch backup storage, SSL certificate distribution and offline subscription bundles, organised per deployment, company and workspace.

AWS ECR

The container registry. Images are tagged by environment (dev02, l3, l5, l7, stage) and pulled into each tier's Docker Compose stack.

Section 07 · Networking & security

One door, scoped by tenant

NGINX routing

HTTPS only, TLS 1.2/1.3 with HSTS, OCSP stapling and modern ciphers. Admin tools are reachable on VPN ranges only.

/	client:9000
/api/	api:8000
/api/auth/	iam:5000
/api/dashboard/	analytics:6000
/api/llm-management/	llm:8000
/api/dlp/	dlp:8000
/kibana/ · /flower/	VPN only
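NGINX prefix locations select the longest matching prefix, which is why /api/auth/ reaches IAM even though /api/ also matches. The Python sketch below models that rule using the table above. It is a simplification: the function name resolve_upstream is hypothetical, the VPN-only routes are left out, and the real configuration may use other location modifiers or regex locations that change precedence.

```python
# Prefix-to-upstream table copied from the routing list above.
ROUTES = {
    "/": "client:9000",
    "/api/": "api:8000",
    "/api/auth/": "iam:5000",
    "/api/dashboard/": "analytics:6000",
    "/api/llm-management/": "llm:8000",
    "/api/dlp/": "dlp:8000",
}


def resolve_upstream(path: str) -> str:
    """Return the upstream for the longest route prefix that matches `path`."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        raise ValueError(f"path must start with '/': {path!r}")
    return ROUTES[max(matches, key=len)]
```

For example, resolve_upstream("/api/auth/login") picks iam:5000, while resolve_upstream("/api/documents") falls back to api:8000.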

Authentication flow

Headers enforce X-Frame-Options: DENY, X-Content-Type-Options: nosniff and XSS protection, with a 10 MB upload limit and Microsoft Graph webhook ranges allowlisted for DLP.

Section 08 · Deployment

Environments and delivery

Environment	Compose file	Purpose
dev01/02/03	docker-compose.dev.yml	Full stack, ~30 services
local	docker-compose.local.yml	Lightweight, remote infra
L2 to L7	docker-compose.l{n}.yml	Production customer tiers
stage	docker-compose.stage.yml	Pre-production validation
on-prem	docker-compose.bosch.yml	Customer-specific on-premises
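The naming convention in the table can be expressed as a small lookup. This Python sketch is a convenience illustration, not part of the deployment tooling: compose_file_for is a hypothetical helper, and the customer-specific on-premises file is deliberately left out because its name does not follow from the environment name.

```python
import re

# Environments whose compose file name is fixed, taken from the table above.
FIXED_FILES = {
    "local": "docker-compose.local.yml",
    "stage": "docker-compose.stage.yml",
}


def compose_file_for(environment: str) -> str:
    """Map an environment name to its Docker Compose file name."""
    env = environment.lower()
    if env in FIXED_FILES:
        return FIXED_FILES[env]
    if re.fullmatch(r"dev0[1-3]", env):
        return "docker-compose.dev.yml"  # dev01 to dev03 share one file
    tier = re.fullmatch(r"l([2-7])", env)
    if tier:
        return f"docker-compose.l{tier.group(1)}.yml"
    raise ValueError(f"unknown environment: {environment!r}")
```

Rejecting unknown names, rather than guessing a file, keeps a typo from silently selecting the wrong stack.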

Pipeline and watch

GitHub Actions run per-service workflows; PRs open per repository.
Ansible provisions and configures the servers.
Upgrade scripts carry versioned migrations.
Kibana for ES dashboards and logs.
Portainer for container management.
Flower for Celery queue monitoring.
Metricbeat for system and ES metrics.

Section 09 · At a glance

The whole stack on one page

Layer	Technology
Frontend	Vue 3, Vite, TypeScript, Vuex, SCSS, Chart.js, Axios
API gateway	NGINX 1.29, TLS termination, routing, IP allowlisting
REST APIs	Flask 3.x, FastAPI, Gunicorn, Uvicorn
Async tasks	Celery 5.x, RabbitMQ 4.2
Event streaming	Apache Kafka, ZooKeeper
Processing	Java 11/17, Spring Boot, Apache Tika
ML / NLP	spaCy 3.8, FastText, sentence-transformers
LLM / RAG	LlamaIndex, OpenAI, Ollama, Claude, pgvector
OCR	EasyOCR, Tesseract, YOLO, PyMuPDF
Search / storage	Elasticsearch 9.3
Relational DB	PostgreSQL 17.6 with pgvector
Auth	JWT (RSA), LDAP / Active Directory, Google OAuth
Infrastructure	Docker Compose, Ansible, GitHub Actions, AWS ECR / S3
Monitoring	Kibana, Portainer, Flower, Metricbeat

Section 10 · How it holds together

Communication and tenancy

Each service talks the right way

Synchronous HTTP carries client to API traffic. RabbitMQ and Celery carry async tasks. Kafka streams the scan, ingest and enforce events. Redis handles pub/sub for notifications. Elasticsearch is the shared substrate every service reads and writes, and Docker internal DNS provides service discovery on a common network.

One platform, many tenants

Every organisation is a tenant identified by company_id. JWT tokens carry that context, Elasticsearch indices are tenant-scoped, IAM manages each tenant's users and roles, and Stripe tracks per-tenant billing. A superadmin can impersonate any tenant for support, through a separate token mechanism.
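One way to picture tenant scoping is a helper that derives an index name from the company_id in an already verified token. This Python sketch rests on assumptions: the "documents-<company_id>" naming scheme and the tenant_index function are hypothetical, company_id is assumed to be a string, and signature verification is assumed to have happened before the claims reach this code.

```python
import re

# Elasticsearch index names must be lowercase; this pattern also blocks wildcards.
_COMPANY_ID = re.compile(r"[a-z0-9_-]+")


def tenant_index(claims: dict, kind: str = "documents") -> str:
    """Build a tenant-scoped index name from verified JWT claims.

    The naming scheme is an assumption made for this sketch only.
    """
    company_id = claims.get("company_id")
    if not isinstance(company_id, str) or not _COMPANY_ID.fullmatch(company_id):
        raise PermissionError("token carries no usable company_id")
    return f"{kind}-{company_id}"
```

Refusing to build a name when the claim is missing or contains characters such as "*" prevents a malformed token from widening a query across tenants.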

In one line · The shape of it

Connect, scan, classify, decide, act, report.

From a configured mailbox to an enforced retention rule, every document follows one path through a polyglot estate held together by a single shared store. The architecture is plural by design; the source of truth is not.

~42 services · Python + Java core · Elasticsearch at the centre

Data & More platform field manual. Generated from a codebase exploration dated 2026-05-27. Versions and service details reflect that snapshot of dm_main and its submodules; where the implementation has since moved on, edit this document to match the source. Component descriptions reflect the standard behaviour of the named technologies (Flask, Spring Boot, Elasticsearch, spaCy, Kafka, and the rest).
