Platform Architecture

Data Pipeline Overview

How data moves through the Data & More GDPR compliance platform: from source ingestion through automated classification to policy enforcement and audit.

1. External Data Sources
Origin of customer data to be governed
Microsoft 365
Outlook / SharePoint / OneDrive / Teams
Primary enterprise data source covering email, documents, file storage, and team communication channels.
Proto: EWS / Graph API
Data: Emails, attachments, documents, chat
Auth: OAuth 2.0 / Azure AD
Google Workspace
Gmail / Drive
Proto: Google APIs
Auth: OAuth 2.0 / Service Account
Network Fileshare
SMB / NFS
Data: Files, directories, permissions
Azure AD
Users / Groups
Proto: Microsoft Graph API
Data: Identities, group memberships
Websites
HTTP crawl
Data: Web pages, published content
API calls and crawling | OAuth tokens | Incremental sync
2. Connector Services
Source-specific ingestion adapters
ews
Java / Spring Boot
Exchange Web Services connector for Outlook, OneDrive, SharePoint, and Teams. Supports crawl, move, delete, validate, and revert actions.
Lang: Java (Spring Boot)
Output: Elasticsearch
graph-ingestion
Scala / Maven
Microsoft Graph API connector with three sub-services: graph-management, dm-graph-ingestion, and graph-enforcer.
Connects: ES + RabbitMQ
google
Python
Google Workspace ingestion: Gmail and Google Drive scanning.
Output: Elasticsearch
fileshare-service
Java
Network file scanning via SMB/NFS.
Output: Elasticsearch
collector
Java / Gradle
Universal batch collector for generic source types.
Output: API, then Elasticsearch
website-source
Python
HTTP/HTTPS web crawling service.
Output: ES + RabbitMQ
HTTP batch ingestion | Direct ES indexing
3. Core API Layer
Central orchestrator and auth
api
Python / Flask :8000
REST API with 20+ blueprint modules: documents, sources, policies, tags, reports, users, configuration, and more.
Port: 8000
Connects: ES, PostgreSQL, RabbitMQ, IAM
task-worker
Python / Celery
Async job processor via RabbitMQ broker.
Broker: RabbitMQ (AMQP :5672)
Monitor: Flower (:5555)
iam
Python / Flask :5000
Identity and Access Management with JWT, company-scoped isolation.
Port: 5000
Connects: PostgreSQL
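As a rough illustration of what company-scoped isolation could look like downstream of IAM, the sketch below wraps an Elasticsearch query so that it only matches documents belonging to the company named in the decoded JWT claims. The claim name `company_id`, the document field `company_id`, and the helper `scope_query_to_company` are assumptions made for this example; the actual claim layout used by iam is not described here.

```python
from typing import Any, Dict


def scope_query_to_company(query: Dict[str, Any], claims: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap an Elasticsearch query in a bool filter on the caller's company.

    `claims` stands for the already verified JWT payload. The claim and
    field name "company_id" are hypothetical.
    """
    company_id = claims.get("company_id")
    if not company_id:
        # Refuse to run an unscoped query rather than leak across tenants.
        raise PermissionError("token carries no company scope")
    return {
        "bool": {
            "must": [query],
            "filter": [{"term": {"company_id": company_id}}],
        }
    }


scoped = scope_query_to_company(
    {"match": {"content": "passport"}},
    {"sub": "user-1", "company_id": "acme"},
)
```

Putting the company constraint in `filter` rather than `must` keeps it out of relevance scoring, so the user's own query alone determines ranking.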
dlp
Python / Flask + Celery :5050
Data Loss Prevention service with dedicated PostgreSQL database (dlp_dam) and RabbitMQ connection.
Port: 5050
Connects: ES, PostgreSQL, RabbitMQ
chat-api
Python
LLM-powered document Q&A for chat-based GDPR compliance.
Connects: ES, IAM, LLM-management
Celery tasks via AMQP :5672 | Periodic scheduled jobs
4. Processing Workers
Celery tasks consumed from RabbitMQ
Classification
Celery task
PII detection, sensitivity scoring, language identification.
Tasks: classification_base, global_classification
Output: DS_PiiScore, DS_SensitivityScore
ai-profiler
Python / spaCy / Celery
NLP-based entity extraction and document profiling.
Output: ES profiler results
Policy Enforcement
Celery task
Applies retention rules (delete, archive, quarantine) with a full audit trail.
Tasks: enforce, update_policies, purge
Audit: deleted_data ES index
Tag Assignment
Celery task
Delta tagging and document class assignment.
Output: DS_Tags, DS_DocumentClass
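One plausible reading of "delta tagging" is that the worker writes only the difference between the tags a document already carries and the tags it should carry, instead of rewriting the whole DS_Tags field. The sketch below illustrates that reading only; the function `tag_delta` and the example tag names are hypothetical, and the platform's actual algorithm may differ.

```python
from typing import Set, Tuple


def tag_delta(current: Set[str], desired: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (tags_to_add, tags_to_remove) needed to turn `current` into `desired`."""
    to_add = desired - current
    to_remove = current - desired
    return to_add, to_remove


# Example: a document tagged "hr" and "draft" should end up as "hr" and "contract".
to_add, to_remove = tag_delta({"hr", "draft"}, {"hr", "contract"})
```

When both sets are identical, both results are empty and the document does not need to be updated at all.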
ocr
Python
Text extraction from scanned documents and images.
Output: Enriched document in ES
DataSubject Manager
Java
VIP and entity matching against known persons database.
Output: ES data subject metadata
ES / PG Sync
Celery task
Bidirectional synchronization between Elasticsearch and PostgreSQL.
Tasks: es2postgre, syncpostgres
Backup Worker
Celery task
Daily S3 exports as JSON batches (750 docs/file), 180-day retention.
Target: S3 (180-day retention)
REST :9200 | SQL :5432 | AMQP :5672 | S3 API
5. Data Stores
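The batching rule above (750 documents per JSON file) can be sketched as follows. This is an illustrative outline, not the worker's actual code: the object key layout `company/day/batch-NNNNN.json` and the function names are assumptions, and the upload to S3 is left out.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple

BATCH_SIZE = 750  # documents per exported file, as stated above


def batch_documents(docs: Iterable[Dict[str, Any]],
                    batch_size: int = BATCH_SIZE) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive lists of at most `batch_size` documents."""
    batch: List[Dict[str, Any]] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, partially filled batch
        yield batch


def export_batches(docs: Iterable[Dict[str, Any]],
                   company: str, day: str) -> Iterator[Tuple[str, str]]:
    """Yield (object_key, json_body) pairs; the key layout is hypothetical."""
    for index, batch in enumerate(batch_documents(docs)):
        key = f"{company}/{day}/batch-{index:05d}.json"
        yield key, json.dumps(batch)


exports = list(export_batches(({"id": i} for i in range(1600)), "acme", "2024-01-01"))
```

Because `batch_documents` is a generator, the documents can be streamed from a scroll or search-after cursor without holding a full index in memory.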
Persistence and messaging infrastructure
Elasticsearch 9.x
Primary index :9200
Document search, classification data, audit logs, and system events. The central nervous system of the platform.
data sources accounts tags policies deleted_data analytics algorithms document_classes logs history
Port: 9200 (HTTP), 9300 (transport)
Version: 9.3.1
Backup: Monthly SLM snapshots to S3
PostgreSQL 17.x
Relational state :5432
Users, configuration, DLP state, IAM accounts, PowerBI datasets, and vector embeddings.
Databases: default, dlp_dam, iam, powerbi, vector
Version: 17.6 (ECR)
Volume: postgres_data
RabbitMQ 4.x
Message broker :5672
Celery task queue over AMQP. Management UI on :15672.
Version: 4.2.2
Volume: rabbitmq_4
AWS S3
Backup storage
Stores daily JSON index exports per company/workspace (180-day retention) and monthly ES snapshots.
Retention: 180 days (auto-cleanup)
Health queries | Email notifications | Metrics collection
6. Monitoring and Alerting
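The 180-day auto-cleanup amounts to a date cutoff. The sketch below shows one way to select expired exports, assuming each export is known by its object key and export date. The function `expired_keys` and the key names are hypothetical, and in practice the same effect could be achieved with a bucket lifecycle rule instead of application code.

```python
from datetime import date, timedelta
from typing import Dict, List

RETENTION_DAYS = 180  # retention window stated above


def expired_keys(exports: Dict[str, date], today: date) -> List[str]:
    """Return the keys of exports strictly older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(key for key, exported_on in exports.items() if exported_on < cutoff)


today = date(2025, 7, 1)
inventory = {
    "acme/old.json": today - timedelta(days=181),     # past the window
    "acme/edge.json": today - timedelta(days=180),    # exactly on the cutoff, kept
    "acme/recent.json": today - timedelta(days=1),
}
to_delete = expired_keys(inventory, today)
```

Whether an export exactly 180 days old is kept or removed is a policy choice; this sketch keeps it.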
Observability and system health
Management Tool | Alerts
Python / 40+ alert types
Monitors ingestion speed, policy progress, ES index health, certificate expiry, and Azure sync status.
Connects: ES, RabbitMQ
py-dm-notify
Python
Email delivery for compliance notifications and reports.
analytics
Python / Flask
Compliance metrics aggregation and KPI reporting.
Connects: Elasticsearch
flower
Web UI :5555
Real-time Celery worker and task dashboard.
metricbeat
Elastic Stack
Collects system and container metrics for monitoring ES.
HTTP / JSON REST | JWT auth via IAM
7. Frontend and User Interface
User-facing applications
client
Vue 3 / TypeScript / Vite :9000
Primary web application. Document search, classification management, policy administration, compliance reports, and case logs.
Stack: Vue 3, TypeScript, Vite, Vuex, Chart.js
chat-client
Frontend
LLM chat interface for document Q&A.
Backend: chat-api
nginx
Reverse proxy :443 :80
SSL termination via Let's Encrypt. Routes: / to client, /api to api, /iam to iam.
Version: 1.24.0
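The routing table above (/ to client, /api to api, /iam to iam) follows a longest-prefix rule. The sketch below models that rule in Python to make the behaviour explicit. It is not nginx configuration, and it simplifies nginx's prefix matching by requiring a path-segment boundary; the function `resolve_upstream` is hypothetical.

```python
ROUTES = {
    "/": "client",
    "/api": "api",
    "/iam": "iam",
}


def resolve_upstream(path: str) -> str:
    """Pick the upstream whose prefix is the longest match for `path`.

    Simplification: a prefix only matches on a segment boundary, so
    "/apiary" falls through to the default "/" route.
    """
    best = "/"
    for prefix in ROUTES:
        if prefix == "/":
            continue
        if path == prefix or path.startswith(prefix + "/"):
            if len(prefix) > len(best):
                best = prefix
    return ROUTES[best]


upstream = resolve_upstream("/api/documents")
```

Any path that matches neither /api nor /iam is served by the client application.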
Document Lifecycle
Source → Connector crawls → API indexes to ES → Workers classify → Tags and policies applied → Policy enforcement → Audit trail → S3 backup
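The lifecycle can be read as a fixed, ordered sequence of stages. The sketch below encodes that order as an enumeration so that the successor of any stage is unambiguous. The member names are invented for this example and do not correspond to identifiers in the platform.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Document lifecycle stages in the order listed above (names are hypothetical)."""
    SOURCE = 1
    CONNECTOR_CRAWL = 2
    API_INDEX = 3
    CLASSIFY = 4
    TAG_AND_POLICY = 5
    ENFORCE = 6
    AUDIT = 7
    S3_BACKUP = 8


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the final stage."""
    if stage is Stage.S3_BACKUP:
        return None
    return Stage(stage + 1)


after_index = next_stage(Stage.API_INDEX)
```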
Key Connections
API ↔ ES: REST :9200
API ↔ PG: SQL :5432
API → RabbitMQ: AMQP :5672
Workers ↔ ES: REST :9200
Workers → PG: SQL :5432
Workers → S3: AWS SDK
Client → API: HTTP :8000
API → IAM: JWT :5000
NGINX → *: :443 reverse proxy
Docker network: dm (internal DNS)