Data & More · Engineering · Platform field manual

Forty-two services,
one governed document.

Data & More is a multi-tenant GDPR and data-governance platform. It connects to an organisation's email, file storage and chat systems, scans what it finds, profiles every document for personal data, and enforces retention and deletion policies. This is how the pieces fit together: a polyglot microservices estate with a single source of truth at its centre.

You cannot govern what you have not first read, classified, and understood.

~42 microservices
2 core languages (Python, Java)
7 pipeline stages, source to report
1 shared store: Elasticsearch
Section 01 · Architecture

The platform at a glance

Every request enters through one TLS-terminating NGINX proxy and is routed to the application tier: the Vue client, the main Flask API, and the IAM, analytics and LLM services. The API hands slow work to an asynchronous backbone (Celery over RabbitMQ for tasks, Kafka for high-throughput document events). The heavy lifting (crawling sources, extracting text, enforcing policy) runs in the Java tier. Underneath it all sits the data layer, with Elasticsearch as the document store every service shares.
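The hand-off from the API to the asynchronous backbone can be sketched with the standard library alone. This is a minimal illustration, not the platform's code: an in-process `queue.Queue` stands in for RabbitMQ, a thread stands in for a Celery worker, and `reindex_document`, `handle_request` and the 202-style response are hypothetical names chosen for the example.

```python
import queue
import threading
import uuid

# In-process stand-ins (assumptions): a FIFO queue for the RabbitMQ broker
# and a dict for a result backend keyed by task id.
task_queue = queue.Queue()
results = {}


def reindex_document(doc_id):
    """Hypothetical slow job; in the platform this would run in a Celery worker."""
    return f"reindexed {doc_id}"


def worker():
    """Consume jobs until a None sentinel arrives."""
    while True:
        item = task_queue.get()
        try:
            if item is None:
                return
            task_id, doc_id = item
            results[task_id] = reindex_document(doc_id)
        finally:
            # Mark the item as handled so task_queue.join() can return.
            task_queue.task_done()


def handle_request(doc_id):
    """API-side handler: enqueue the job and answer at once with a task id."""
    task_id = uuid.uuid4().hex
    task_queue.put((task_id, doc_id))
    return {"status": 202, "task_id": task_id}
```

The point of the shape is that `handle_request` never waits for the slow job: it returns a task id immediately, and the caller looks the result up later. A worker is started with `threading.Thread(target=worker, daemon=True).start()` and stopped by putting `None` on the queue.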

Figure: platform architecture, by tier.

  • Edge: NGINX (TLS 1.2/1.3 · reverse proxy · rate limit · IP allowlist).
  • App tier: Client (Vue 3 SPA · 281 components · 24+ locales); API (Flask 3 · Celery · :8000, the central hub); IAM (Flask · JWT · :5000 · LDAP / Active Directory); Analytics (Flask · pandas · :6000 · reports, charts, PDF).
  • Messaging (async work): Celery workers (reindex · bulk ops · alerts); RabbitMQ (task broker · v4.2); Kafka (high-throughput events · scan · ingest · enforce).
  • Processing (Java services): java_core (scan · ingest · enforce); java_profiler (regex · NER · FastText); collector (format convert); graph · ews (Microsoft 365); google (Workspace); known-persons.
  • Data layer: Elasticsearch (9.x · shared document store); PostgreSQL (17.x · IAM · pgvector); AWS S3 (daily backups · 180d).
Legend: request and data path · persisted store / output · sub-component inside a tier.
The monorepo dm_main is the orchestration layer: each service lives in its own git repository, included as a submodule, and is wired together by roughly 42 Docker Compose files (one per environment and tier).
Section 02 · Data flow

From a raw source to an enforced policy

A document makes the same journey every time. A source is configured once; from then on the platform crawls it, extracts its text, profiles it for personal data, checks it against the tenant's policies, acts on the verdict, and reports the result. Cheap, deterministic stages run first; the expensive AI and enforcement steps run only on what reaches them.

Figure: the seven pipeline stages. All stages read and write Elasticsearch.

  1 · Configure: Client UI wizard for Exchange, SharePoint, Google, IMAP and file shares.
  2 · Collect: java_core SCAN, with collector, graph, ews and google, crawls the sources.
  3 · Ingest: java_core INGEST extracts text with Apache Tika, runs OCR for scans, and writes text to ES.
  4 · Profile: java_profiler, ai-profiler (spaCy) and known-persons classify content.
  5 · Evaluate: PolicyValidator applies retention rules and sensitivity class to decide the action.
  6 · Enforce: PolicyEnforcer executes delete, move, archive, tag or no-op.
  7 · Report: analytics produces dashboards, alerts and notifications, and PDF / Excel output.
Legend: main pipeline path · shared store, touched at every stage · output to the user.
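The "cheap stages first" ordering described above can be sketched as a chain of stages in which any stage may stop the journey early. The stage bodies, the `Document` fields and the early-exit conditions are illustrative assumptions; the real stages run in separate services and exchange state through Elasticsearch.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str = ""
    tags: set = field(default_factory=set)
    action: str = "no-op"
    trace: list = field(default_factory=list)  # stages that actually ran


def ingest(doc):
    doc.trace.append("ingest")
    # Assumption: a document with no extractable text goes no further.
    return doc if doc.text.strip() else None


def profile(doc):
    doc.trace.append("profile")
    if "@" in doc.text:  # toy stand-in for real personal-data detection
        doc.tags.add("email")
    # Assumption: nothing personal found, so no policy needs evaluating.
    return doc if doc.tags else None


def evaluate(doc):
    doc.trace.append("evaluate")
    doc.action = "delete" if "email" in doc.tags else "no-op"
    return doc


STAGES = (ingest, profile, evaluate)


def run_pipeline(doc, stages=STAGES):
    """Run stages in order; a stage returning None ends the run early."""
    for stage in stages:
        if stage(doc) is None:
            break
    return doc
```

A document that carries no text never reaches `profile`, and one with no personal data never reaches `evaluate`, which is the cost-saving property the ordering is meant to provide.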
Section 03 · Frontend

The Vue 3 single-page app

The client is a custom-built single-page application: it uses no off-the-shelf component library and is styled with its own SCSS design system. It is where administrators connect sources, build classification rules, manage policies, read reports, and talk to documents through the RAG chat module. Role-based views adapt the interface to each persona.

What ships in it

  • Multi-tenant dashboard with role-based views for Admin, Adminlite, AD, PM and DPO.
  • Document browser with tree and grid layouts, plus PDF, Word and JSON preview.
  • Classification builder for tags, dictionaries, algorithms and document classes.
  • Policy management for GDPR retention and deletion rules.
  • Chat / LLM module, a RAG assistant over indexed documents with streaming responses.
  • Source wizards for Exchange, SharePoint, Google, IMAP and file shares.
Frontend stack

Framework: Vue 3.5
Build: Vite 6 · TS 5.9
State: Vuex 4.1
Router: Vue Router 4.6
i18n: 24+ languages
Charts: Chart.js
HTTP: Axios + JWT

~400 routes · 281 components · 35+ API modules
Section 04 · Python tier

The application and AI services

The Python tier holds the platform's brain: the central API, identity, analytics, the LLM and chat stack, OCR, data-loss prevention, and the spaCy-based AI profiler reviewed elsewhere. Most run Flask behind Gunicorn; notifications run on FastAPI.

| Service | Framework | Port | Purpose |
| --- | --- | --- | --- |
| api | Flask 3 · Celery | 8000 | Main REST API: documents, policies, sources, config |
| iam | Flask 3.1 · Gunicorn | 5000 | Auth, JWT, LDAP/AD, role-based access |
| analytics | Flask · pandas | 6000 | Reporting, scikit-learn, Plotly, PDF generation |
| chat-api | Flask · LlamaIndex | 8000 | RAG pipeline, embeddings, multi-provider LLM |
| llm-management | Flask (async) | 8000 | LLM orchestration, pgvector, OpenAI/Ollama/Claude |
| ocr | Flask · Gunicorn | 8080 | EasyOCR, Tesseract, YOLO, PyMuPDF, MRZ |
| dlp | Flask · Gunicorn | 8000 | Data loss prevention, webhook and regex scanning |
| py-dm-notify | FastAPI · Uvicorn | 8002 | Real-time notifications, Redis pub/sub, email |
| ai-profiler | Flask · Gunicorn | n/a | spaCy NER and dictionary profiling, 11+ languages |
| alerts | Celery worker | n/a | Background alert processing over RabbitMQ |
Hub

Main API

Python 3.12, Flask 3, Celery 5.4, SQLAlchemy 2. Owns document and source CRUD, policies, users, billing (Stripe, HubSpot), external integrations, and Celery workers for reindexing and bulk jobs.

Identity

IAM

JWT authentication with RSA key pairs, LDAP and Active Directory for enterprise SSO, role-based access control, and Alembic-managed migrations. Tenancy is scoped here.

Insight

Analytics

Statistical and ML-based reporting on scanned data, Plotly charts exported to static images, PDF reports via WeasyPrint, and a local Ollama option for AI-assisted analysis.

Section 05 · Java tier

The heavyweight processing engine

java_core

Scan, ingest, enforce

Java 11, running as Kafka consumers in distinct modes. It is the workhorse that touches the actual data.

  • SCAN crawls mailboxes, sites and shares.
  • INGEST extracts text with Apache Tika into Elasticsearch.
  • DELTA handles incremental updates.
  • CLEANUP removes stale data.
Enforcement lives here too: PolicyValidator checks documents against rules, PolicyEnforcer executes delete, move, tag, revert or no-op.
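The split between deciding and acting can be sketched as two small functions. This is a Python illustration of the idea only: the actual PolicyValidator and PolicyEnforcer are Java components whose rule format is not shown here, so `RetentionRule`, its fields, and the dry-run `enforce` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Action names taken from the enforcement list above.
ACTIONS = {"delete", "move", "tag", "revert", "no-op"}


@dataclass(frozen=True)
class RetentionRule:
    """Hypothetical rule shape: applies to documents carrying `tag`."""
    tag: str
    max_age_days: int
    action: str


def validate(doc_tags, last_modified, rules, today):
    """Return the action of the first matching, expired rule, else 'no-op'."""
    for rule in rules:
        expired = today - last_modified > timedelta(days=rule.max_age_days)
        if rule.tag in doc_tags and expired:
            return rule.action
    return "no-op"


def enforce(action):
    """Dry-run enforcer: reject unknown actions, report what would happen."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return f"dry run: {action}"
```

Keeping `validate` free of side effects means a verdict can be computed, logged and reviewed before anything is deleted or moved.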
| Service | Framework | Java | Purpose |
| --- | --- | --- | --- |
| java_core | Kafka consumer | 11 | Scanning, ingestion, policy enforcement |
| java_profiler | Spring Boot 2.7 | 17 | Regex and NER entity extraction, language detection (FastText) |
| collector | Spring Boot | 17 | Document collection and format conversion |
| graph-ingestion | Spring Boot | 17 | Microsoft Graph ingestion, management, enforcement |
| ews | Spring Boot | 25 | Exchange Web Services integration |
| google | Spring Boot | n/a | Google Workspace integration |
| known-persons | Spring Boot | 17 | Person and entity database and lookup |
java_profiler exposes /profile-text, /profile/{fileId} and /language (FastText lid.176, 176 languages). Its pipeline runs tokenization, pre-tokenization matching, entity matching, validation, scoring and tagging.
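A toy version of such a pipeline, loosely following the stages named above, might look like this in Python. Everything specific in it is an assumption made for illustration: the email and card-number patterns, the Luhn checksum as the validation step, the two-name dictionary and the scores are not java_profiler's actual rules.

```python
import re

# Illustrative patterns and dictionary (assumptions, not the real rule set).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")
NAME_DICTIONARY = {"alice", "bob"}
MIN_SCORE = 0.5


def luhn_valid(number):
    """Validation step: Luhn checksum over the digits of a candidate."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def profile_text(text):
    entities = []
    # Pre-tokenization matching: patterns that span token boundaries.
    for m in EMAIL_RE.finditer(text):
        entities.append({"type": "email", "value": m.group(), "score": 0.9})
    for m in CARD_RE.finditer(text):
        # Validation: drop candidates that fail the checksum.
        if luhn_valid(m.group()):
            entities.append(
                {"type": "card_number", "value": m.group().strip(), "score": 0.95}
            )
    # Tokenization, then entity matching against a dictionary.
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in NAME_DICTIONARY:
            entities.append({"type": "name", "value": token, "score": 0.6})
    # Scoring and tagging: one tag per entity type above the threshold.
    tags = sorted({e["type"] for e in entities if e["score"] >= MIN_SCORE})
    return {"entities": entities, "tags": tags}
```

The validation step is what keeps a regex-only profiler from tagging every long digit run as a card number: a candidate that fails the checksum is discarded before scoring.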
Section 06 · Data & infrastructure

Where everything is kept

Elasticsearch 9.3

The core data layer and the one store every service shares: scanned documents, profiles, tags, policies, audit logs and dictionaries. Daily S3 backups with 180-day retention and monthly SLM snapshots.

PostgreSQL 17.6

The relational database for IAM (users, roles, tenants), subscriptions and DLP audit data, with the pgvector extension storing LLM embeddings.

RabbitMQ & Kafka

RabbitMQ brokers Celery task queues for the API, alerts and notifications. Kafka carries the high-throughput scan, ingest and enforce events that the Java tier consumes.

Redis

Caching, session storage and pub/sub for real-time notifications, used by py-dm-notify and for DLP deduplication.

AWS S3

Elasticsearch backup storage, SSL certificate distribution and offline subscription bundles, organised per deployment, company and workspace.

AWS ECR

The container registry. Images are tagged by environment (dev02, l3, l5, l7, stage) and pulled into each tier's Docker Compose stack.

Section 07 · Networking & security

One door, scoped by tenant

NGINX routing

HTTPS only, TLS 1.2/1.3 with HSTS, OCSP stapling and modern ciphers. Admin tools are reachable on VPN ranges only.

  • / → client:9000
  • /api/ → api:8000
  • /api/auth/ → iam:5000
  • /api/dashboard/ → analytics:6000
  • /api/llm-management/ → llm:8000
  • /api/dlp/ → dlp:8000
  • /kibana/ · /flower/ → VPN only
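How one proxy fans requests out to the right service can be illustrated with longest-prefix matching, which is how NGINX chooses among plain prefix locations. The sketch below reads the routing list above as prefix-to-upstream pairs (for example `/api/auth/` to `iam:5000`); that reading, the trailing slashes, and the `resolve_upstream` helper are assumptions, and the VPN-only admin paths are left out.

```python
# Prefix -> upstream, as read from the routing list above (an interpretation).
ROUTES = {
    "/": "client:9000",
    "/api/": "api:8000",
    "/api/auth/": "iam:5000",
    "/api/dashboard/": "analytics:6000",
    "/api/llm-management/": "llm:8000",
    "/api/dlp/": "dlp:8000",
}


def resolve_upstream(path):
    """Return the upstream whose prefix is the longest match for `path`."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    # "/" always matches, so `matches` is never empty here.
    return ROUTES[max(matches, key=len)]
```

Because the longest prefix wins, `/api/auth/login` reaches IAM even though `/api/` also matches, while any other `/api/...` path falls through to the main API.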

Authentication flow

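  1 · The client (Vue SPA) sends credentials to IAM at /api/auth.
  2 · IAM validates them against LDAP / AD or the local DB.
  3 · On success, a JWT access token and a refresh token are returned.
  4 · Axios attaches the token to every API call; company_id is scoped in the claims, and the token auto-refreshes on a 401.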
Headers enforce X-Frame-Options: DENY, X-Content-Type-Options: nosniff and XSS protection, with a 10 MB upload limit and Microsoft Graph webhook ranges allowlisted for DLP.
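The "auto-refresh on 401" behaviour in the authentication flow can be sketched as a thin client wrapper. The real client does this in Axios; the Python below only shows the control flow, and `ApiClient`, `send` and `refresh` are hypothetical names with injected transports so the logic can run without a network.

```python
class ApiClient:
    """Hypothetical client: retries once after refreshing a rejected token."""

    def __init__(self, send, refresh, access_token, refresh_token):
        self._send = send        # (path, access_token) -> (status, body)
        self._refresh = refresh  # (refresh_token) -> new access token
        self.access_token = access_token
        self.refresh_token = refresh_token

    def get(self, path):
        status, body = self._send(path, self.access_token)
        if status == 401:
            # Token expired or rejected: refresh once, then retry once.
            self.access_token = self._refresh(self.refresh_token)
            status, body = self._send(path, self.access_token)
        return status, body
```

Retrying exactly once is a deliberate choice in this sketch: if the refreshed token is also rejected, the second 401 is returned to the caller rather than looping.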
Section 08 · Deployment

Environments and delivery

| Environment | Compose file | Purpose |
| --- | --- | --- |
| dev01/02/03 | docker-compose.dev.yml | Full stack, ~30 services |
| local | docker-compose.local.yml | Lightweight, remote infra |
| L2 to L7 | docker-compose.l{n}.yml | Production customer tiers |
| stage | docker-compose.stage.yml | Pre-production validation |
| on-prem | docker-compose.bosch.yml | Customer-specific on-premises |
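The environment-to-file mapping in the table above is regular enough to express as a small lookup. The helper name `compose_file` and the exact environment spellings it accepts are assumptions; it simply encodes the rows of the table.

```python
import re


def compose_file(environment):
    """Map an environment name to its Compose file, per the table above."""
    env = environment.lower()
    if re.fullmatch(r"dev0[1-3]", env):
        return "docker-compose.dev.yml"  # dev01/02/03 share one file
    if env in ("local", "stage"):
        return f"docker-compose.{env}.yml"
    tier = re.fullmatch(r"l([2-7])", env)
    if tier:
        return f"docker-compose.l{tier.group(1)}.yml"  # L2 to L7
    if env == "on-prem":
        return "docker-compose.bosch.yml"
    raise ValueError(f"unknown environment: {environment}")
```

A deploy script could then pass the result to `docker compose -f`, and an unknown environment fails loudly instead of silently picking a default stack.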

Pipeline and watch

  • GitHub Actions run per-service workflows; PRs open per repository.
  • Ansible provisions and configures the servers.
  • Upgrade scripts carry versioned migrations.
  • Kibana for ES dashboards and logs.
  • Portainer for container management.
  • Flower for Celery queue monitoring.
  • Metricbeat for system and ES metrics.
Section 09 · At a glance

The whole stack on one page

| Layer | Technology |
| --- | --- |
| Frontend | Vue 3, Vite, TypeScript, Vuex, SCSS, Chart.js, Axios |
| API gateway | NGINX 1.29, TLS termination, routing, IP allowlisting |
| REST APIs | Flask 3.x, FastAPI, Gunicorn, Uvicorn |
| Async tasks | Celery 5.x, RabbitMQ 4.2 |
| Event streaming | Apache Kafka, Zookeeper |
| Processing | Java 11/17, Spring Boot, Apache Tika |
| ML / NLP | spaCy 3.8, FastText, sentence-transformers |
| LLM / RAG | LlamaIndex, OpenAI, Ollama, Claude, pgvector |
| OCR | EasyOCR, Tesseract, YOLO, PyMuPDF |
| Search / storage | Elasticsearch 9.3 |
| Relational DB | PostgreSQL 17.6 with pgvector |
| Auth | JWT (RSA), LDAP / Active Directory, Google OAuth |
| Infrastructure | Docker Compose, Ansible, GitHub Actions, AWS ECR / S3 |
| Monitoring | Kibana, Portainer, Flower, Metricbeat |
Section 10 · How it holds together

Communication and tenancy

Each service talks the right way

Synchronous HTTP carries client to API traffic. RabbitMQ and Celery carry async tasks. Kafka streams the scan, ingest and enforce events. Redis handles pub/sub for notifications. Elasticsearch is the shared substrate every service reads and writes, and Docker internal DNS provides service discovery on a common network.

One platform, many tenants

Every organisation is a tenant identified by company_id. JWT tokens carry that context, Elasticsearch indices are tenant-scoped, IAM manages each tenant's users and roles, and Stripe tracks per-tenant billing. A superadmin can impersonate any tenant for support, through a separate token mechanism.
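Tenant scoping of this kind usually comes down to deriving every index name from the token's claims rather than from anything the caller supplies. The sketch below assumes the claims have already been signature-verified, and the `{kind}-{company_id}` naming scheme and both function names are hypothetical; the platform's real index naming is not described here.

```python
def tenant_index(company_id, kind="documents"):
    """Hypothetical naming scheme: one index per kind and tenant."""
    return f"{kind}-{company_id}".lower()


def index_for_request(claims, requested_company_id, kind="documents"):
    """Resolve the index for a request, refusing cross-tenant access.

    `claims` is assumed to be the payload of an already verified JWT.
    """
    token_company = claims.get("company_id")
    if token_company is None:
        raise PermissionError("token carries no company_id claim")
    if str(token_company) != str(requested_company_id):
        raise PermissionError("token is scoped to a different tenant")
    # Build the name from the token's claim, never from the request.
    return tenant_index(token_company, kind)
```

Building the index name from the verified claim means a forged or mistaken `company_id` in the request can at worst be rejected, never served.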

In one line · The shape of it

Connect, scan, classify, decide, act, report.

From a configured mailbox to an enforced retention rule, every document follows one path through a polyglot estate held together by a single shared store. The architecture is plural by design; the source of truth is not.

~42 services · Python + Java core · Elasticsearch at the centre