Data & More is a multi-tenant GDPR and data-governance platform. It connects to an organisation's email, file storage and chat systems, scans what it finds, profiles every document for personal data, and enforces retention and deletion policies. This is how the pieces fit together: a polyglot microservices estate with a single source of truth at its centre.
You cannot govern what you have not first read, classified, and understood.
~42 microservices
2 core languages (Python, Java)
7 pipeline stages, source to report
1 shared store: Elasticsearch
Section 01 · Architecture
The platform at a glance
Every request enters through one TLS-terminating NGINX proxy and is routed to the application tier: the Vue client, the main Flask API, and the IAM, analytics and LLM services. The API hands slow work to an asynchronous backbone (Celery over RabbitMQ for tasks, Kafka for high-throughput document events). The heavy lifting (crawling sources, extracting text, enforcing policy) runs in the Java tier. Underneath it all sits the data layer, with Elasticsearch as the document store every service shares.
Diagram legend: request and data path · persisted store / output · sub-component inside a tier
The monorepo dm_main is the orchestration layer: each service lives in its own Git repository and is included as a submodule, and the services are wired together by roughly 42 Docker Compose files (one per environment and tier).
Section 02 · Data flow
From a raw source to an enforced policy
A document makes the same journey every time. A source is configured once; from then on the platform crawls it, extracts its text, profiles it for personal data, checks it against the tenant's policies, acts on the verdict, and reports the result. Cheap, deterministic stages run first; the expensive AI and enforcement steps run only on what reaches them.
Diagram legend: main pipeline path · shared store, touched at every stage · output to the user
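The journey described above can be sketched as an ordered list of stages in which a document stops as soon as a stage has nothing to hand on. This is a minimal Python illustration covering only the middle of the path (extract, profile, validate, enforce). The `Document` class, the stage functions and the naive "@" check are all hypothetical stand-ins, and the early exit is one reading of "run only on what reaches them", not the platform's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Document:
    """Hypothetical in-memory document; field names are assumptions."""
    doc_id: str
    raw: bytes
    text: str = ""
    tags: List[str] = field(default_factory=list)
    verdict: Optional[str] = None
    visited: List[str] = field(default_factory=list)


def extract(doc: Document) -> bool:
    # Stand-in for text extraction; empty documents stop here.
    doc.text = doc.raw.decode("utf-8", errors="ignore").strip()
    return bool(doc.text)


def profile(doc: Document) -> bool:
    # Stand-in for personal-data profiling: a naive "@" check, illustration only.
    if "@" in doc.text:
        doc.tags.append("email-address")
    return bool(doc.tags)


def validate(doc: Document) -> bool:
    # Stand-in policy check: tagged documents get a "delete" verdict.
    doc.verdict = "delete" if "email-address" in doc.tags else "no-op"
    return doc.verdict != "no-op"


def enforce(doc: Document) -> bool:
    # A real enforcer would act on the source system; here we only record it.
    doc.tags.append(f"enforced:{doc.verdict}")
    return True


# Cheap, deterministic stages first; later stages see only what survives.
STAGES: List[Tuple[str, Callable[[Document], bool]]] = [
    ("extract", extract),
    ("profile", profile),
    ("validate", validate),
    ("enforce", enforce),
]


def run_pipeline(doc: Document) -> Document:
    for name, stage in STAGES:
        doc.visited.append(name)
        if not stage(doc):
            break
    return doc
```

A document with no detectable personal data leaves the pipeline after `profile`, so the validation and enforcement stand-ins never run on it.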
Section 03 · Frontend
The Vue 3 single-page app
The client is a custom-built single-page application, built without an off-the-shelf component library and styled with its own SCSS design system. It is where administrators connect sources, build classification rules, manage policies, read reports, and query documents through the RAG chat module. Role-based views adapt the interface to each persona.
What ships in it
Multi-tenant dashboard with role-based views for Admin, Adminlite, AD, PM and DPO.
Document browser with tree and grid layouts, plus PDF, Word and JSON preview.
Classification builder for tags, dictionaries, algorithms and document classes.
Policy management for GDPR retention and deletion rules.
Chat / LLM module, a RAG assistant over indexed documents with streaming responses.
Source wizards for Exchange, SharePoint, Google, IMAP and file shares.
Frontend stack
Framework: Vue 3.5
Build: Vite 6 · TS 5.9
State: Vuex 4.1
Router: Vue Router 4.6
i18n: 24+ languages
Charts: Chart.js
HTTP: Axios + JWT
~400 routes · 281 components · 35+ API modules
Section 04 · Python tier
The application and AI services
The Python tier holds the platform's brain: the central API, identity, analytics, the LLM and chat stack, OCR, data-loss prevention, and the spaCy-based AI profiler reviewed elsewhere. Most run Flask behind Gunicorn; notifications run on FastAPI.
Service         Framework             Port  Purpose
api             Flask 3 · Celery      8000  Main REST API: documents, policies, sources, config
iam             Flask 3.1 · Gunicorn  5000  Auth, JWT, LDAP/AD, role-based access
analytics       Flask · pandas        6000  Reporting, scikit-learn, Plotly, PDF generation
chat-api        Flask · LlamaIndex    8000  RAG pipeline, embeddings, multi-provider LLM
llm-management  Flask (async)         8000  LLM orchestration, pgvector, OpenAI/Ollama/Claude
ocr             Flask · Gunicorn      8080  EasyOCR, Tesseract, YOLO, PyMuPDF, MRZ
dlp             Flask · Gunicorn      8000  Data loss prevention, webhook and regex scanning
py-dm-notify    FastAPI · Uvicorn     8002  Real-time notifications, Redis pub/sub, email
ai-profiler     Flask · Gunicorn      n/a   spaCy NER and dictionary profiling, 11+ languages
alerts          Celery worker         n/a   Background alert processing over RabbitMQ
Hub
Main API
Python 3.12, Flask 3, Celery 5.4, SQLAlchemy 2. Owns document and source CRUD, policies, users, billing (Stripe, HubSpot), external integrations, and Celery workers for reindexing and bulk jobs.
Identity
IAM
JWT authentication with RSA key pairs, LDAP and Active Directory for enterprise SSO, role-based access control, and Alembic-managed migrations. Tenancy is scoped here.
Insight
Analytics
Statistical and ML-based reporting on scanned data, Plotly charts exported to static images, PDF reports via WeasyPrint, and a local Ollama option for AI-assisted analysis.
Section 05 · Java tier
The heavyweight processing engine
java_core
Scan, ingest, enforce
Java 11, running as Kafka consumers in distinct modes. It is the workhorse that touches the actual data.
SCAN crawls mailboxes, sites and shares.
INGEST extracts text with Apache Tika into Elasticsearch.
DELTA handles incremental updates.
CLEANUP removes stale data.
Enforcement lives here too: PolicyValidator checks documents against rules, and PolicyEnforcer executes the resulting action (delete, move, tag, revert or no-op).
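The split between checking and acting can be illustrated with a short sketch. The real PolicyValidator and PolicyEnforcer live in the Java tier and their interfaces are not shown here, so the following is written in Python purely for illustration. The `RetentionRule` shape, the age-based rule and the log format are assumptions; only the five action names come from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import FrozenSet, List, Tuple


class Action(Enum):
    DELETE = "delete"
    MOVE = "move"
    TAG = "tag"
    REVERT = "revert"
    NO_OP = "no-op"


@dataclass(frozen=True)
class RetentionRule:
    tag: str            # hypothetical: classification tag the rule applies to
    max_age_days: int   # hypothetical: retention period in days
    action: Action


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tags: FrozenSet[str]
    last_modified: date


def validate(doc: Doc, rules: List[RetentionRule], today: date) -> Action:
    """Return the action of the first rule the document violates, else NO_OP."""
    for rule in rules:
        expired = today - doc.last_modified > timedelta(days=rule.max_age_days)
        if rule.tag in doc.tags and expired:
            return rule.action
    return Action.NO_OP


def enforce(doc: Doc, action: Action, log: List[Tuple[str, str]]) -> None:
    """Record the action; a real enforcer would call the source system."""
    if action is not Action.NO_OP:
        log.append((doc.doc_id, action.value))
```

Keeping the verdict (`validate`) separate from the side effect (`enforce`) means a verdict can be computed, inspected or reported without touching the source system.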
Service          Framework        Java  Purpose
java_core        Kafka consumer   11    Scanning, ingestion, policy enforcement
java_profiler    Spring Boot 2.7  17    Regex and NER entity extraction, language detection (FastText)
collector        Spring Boot      17    Document collection and format conversion
graph-ingestion  Spring Boot      17    Microsoft Graph ingestion, management, enforcement
ews              Spring Boot      25    Exchange Web Services integration
google           Spring Boot      n/a   Google Workspace integration
known-persons    Spring Boot      17    Person and entity database and lookup
java_profiler exposes /profile-text, /profile/{fileId} and /language (FastText lid.176, 176 languages). Its pipeline runs tokenization, pre-tokenization matching, entity matching, validation, scoring and tagging.
Section 06 · Data & infrastructure
Where everything is kept
Elasticsearch 9.3
The core data layer and the one store every service shares: scanned documents, profiles, tags, policies, audit logs and dictionaries. Daily S3 backups with 180-day retention and monthly SLM snapshots.
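As a small worked example of the 180-day retention figure, the following sketch picks out the daily backups that have aged out. It assumes a backup taken exactly 180 days ago is still kept; the function name and that boundary choice are illustrative, not taken from the platform's backup tooling.

```python
from datetime import date, timedelta
from typing import Iterable, List

RETENTION_DAYS = 180  # figure from the text above


def expired_backups(backup_dates: Iterable[date], today: date) -> List[date]:
    """Return backup dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    # Assumption: a backup exactly RETENTION_DAYS old is still retained.
    return sorted(d for d in backup_dates if d < cutoff)
```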
PostgreSQL 17.6
The relational database for IAM (users, roles, tenants), subscriptions and DLP audit data, with the pgvector extension storing LLM embeddings.
RabbitMQ & Kafka
RabbitMQ brokers Celery task queues for the API, alerts and notifications. Kafka carries the high-throughput scan, ingest and enforce events that the Java tier consumes.
Redis
Caching, session storage and pub/sub for real-time notifications, used by py-dm-notify and for DLP deduplication.
AWS S3
Elasticsearch backup storage, SSL certificate distribution and offline subscription bundles, organised per deployment, company and workspace.
AWS ECR
The container registry. Images are tagged by environment (dev02, l3, l5, l7, stage) and pulled into each tier's Docker Compose stack.
Section 07 · Networking & security
One door, scoped by tenant
NGINX routing
HTTPS only, TLS 1.2/1.3 with HSTS, OCSP stapling and modern ciphers. Admin tools are reachable on VPN ranges only.
/                     client:9000
/api/                 api:8000
/api/auth/            iam:5000
/api/dashboard/       analytics:6000
/api/llm-management/  llm:8000
/api/dlp/             dlp:8000
/kibana/ · /flower/   VPN only
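The routing table above can be read as longest-prefix matching, which is how NGINX resolves plain prefix locations. The sketch below models that in Python, assuming the proxy uses prefix locations only (no regex locations). The `resolve` function is hypothetical, and the upstreams for the VPN-only tools are not given in the table, so a placeholder name is returned for them.

```python
from typing import Dict, Tuple

# Path prefixes and upstreams as listed in the routing table.
ROUTES: Dict[str, str] = {
    "/": "client:9000",
    "/api/": "api:8000",
    "/api/auth/": "iam:5000",
    "/api/dashboard/": "analytics:6000",
    "/api/llm-management/": "llm:8000",
    "/api/dlp/": "dlp:8000",
}

# Admin tools reachable on VPN ranges only.
VPN_ONLY: Tuple[str, ...] = ("/kibana/", "/flower/")


def resolve(path: str, on_vpn: bool = False) -> str:
    """Pick the upstream whose prefix is the longest match for the path."""
    for prefix in VPN_ONLY:
        if path.startswith(prefix):
            if not on_vpn:
                raise PermissionError(f"{prefix} is reachable on VPN ranges only")
            # Placeholder: the table does not name these upstreams.
            return prefix.strip("/")
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        raise ValueError(f"path must start with '/': {path!r}")
    return ROUTES[max(matches, key=len)]
```

Under this reading, `/api/auth/login` goes to `iam:5000` rather than `api:8000`, because `/api/auth/` is the longer matching prefix.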
Authentication flow
Response headers set X-Frame-Options: DENY and X-Content-Type-Options: nosniff and enable XSS protection. Uploads are capped at 10 MB, and Microsoft Graph webhook ranges are allowlisted for DLP.
Section 08 · Deployment
Environments and delivery
Environment  Compose file              Purpose
dev01/02/03  docker-compose.dev.yml    Full stack, ~30 services
local        docker-compose.local.yml  Lightweight, remote infra
L2 to L7     docker-compose.l{n}.yml   Production customer tiers
stage        docker-compose.stage.yml  Pre-production validation
on-prem      docker-compose.bosch.yml  Customer-specific on-premises
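The environment table above maps directly onto a small lookup. The helper below is a sketch of that mapping in Python; the function name `compose_file` and the accepted spellings (for example `l5` or `L5`) are assumptions, while the file names are the ones listed in the table.

```python
import re


def compose_file(environment: str) -> str:
    """Return the Compose file name the table lists for an environment."""
    env = environment.strip().lower()
    if re.fullmatch(r"dev0[1-3]", env):
        return "docker-compose.dev.yml"
    if env in ("local", "stage"):
        return f"docker-compose.{env}.yml"
    tier = re.fullmatch(r"l([2-7])", env)
    if tier:
        return f"docker-compose.l{tier.group(1)}.yml"
    if env == "on-prem":
        # The only on-premises file named in the table.
        return "docker-compose.bosch.yml"
    raise ValueError(f"unknown environment: {environment!r}")
```

The result could then be passed to `docker compose -f <file> up -d` to bring that stack up.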
Pipeline and watch
GitHub Actions run per-service workflows; PRs open per repository.
Synchronous HTTP carries client to API traffic. RabbitMQ and Celery carry async tasks. Kafka streams the scan, ingest and enforce events. Redis handles pub/sub for notifications. Elasticsearch is the shared substrate every service reads and writes, and Docker internal DNS provides service discovery on a common network.
One platform, many tenants
Every organisation is a tenant identified by company_id. JWTs carry that context, Elasticsearch indices are tenant-scoped, IAM manages each tenant's users and roles, and Stripe tracks per-tenant billing. A superadmin can impersonate any tenant for support, through a separate token mechanism.
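One way to picture tenant scoping is a helper that derives the Elasticsearch index name from the company_id claim of an already-verified token. This is a sketch under stated assumptions: the `<base>-<company_id>` naming pattern and the `tenant_index` function are hypothetical, and signature verification is assumed to have happened upstream in IAM. The lowercasing reflects the Elasticsearch rule that index names must be lowercase.

```python
from typing import Any, Mapping


def tenant_index(claims: Mapping[str, Any], base: str) -> str:
    """Build a tenant-scoped index name from verified JWT claims.

    Assumes `claims` comes from a token whose signature was already checked.
    """
    raw = claims.get("company_id")
    if raw is None or isinstance(raw, bool):
        raise PermissionError("token carries no company_id")
    company_id = str(raw)
    # Reject wildcards, commas and other characters that could widen the scope.
    if not company_id.isalnum():
        raise PermissionError("company_id contains unexpected characters")
    return f"{base}-{company_id.lower()}"
```

Because the index name is derived from the token rather than from a request parameter, a caller cannot ask for another tenant's index by changing the URL or body.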
In one line · The shape of it
Connect, scan, classify, decide, act, report.
From a configured mailbox to an enforced retention rule, every document follows one path through a polyglot estate held together by a single shared store. The architecture is plural by design; the source of truth is not.
~42 services · Python + Java core · Elasticsearch at the centre