Data & More — Data Pipeline Architecture

External Data Sources

Origin of customer data to be governed

Microsoft 365

Outlook / SharePoint / OneDrive / Teams

Primary enterprise data source covering email, documents, file storage, and team communication channels.

Google Workspace

Gmail / Drive

Network Fileshare

SMB / NFS

Azure AD

Users / Groups

Websites

HTTP crawl

API calls and crawling | OAuth tokens | Incremental sync
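
The "Incremental sync" label can be illustrated with a cursor-based loop. This is a minimal sketch, assuming each source can list items modified after a stored timestamp; `SourceItem`, `incremental_sync`, and `fake_fetch` are hypothetical names for illustration and are not taken from any connector.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SourceItem:
    item_id: str
    modified_at: int  # hypothetical: epoch seconds of the last modification


def incremental_sync(
    fetch_changes: Callable[[Optional[int]], Iterable[SourceItem]],
    cursor: Optional[int],
) -> Tuple[List[SourceItem], Optional[int]]:
    """Return items changed since `cursor` and the cursor to store for the next run."""
    changed = list(fetch_changes(cursor))
    if not changed:
        return [], cursor  # nothing new: keep the old cursor
    new_cursor = max(item.modified_at for item in changed)
    return changed, new_cursor


# In-memory stand-in for a real source API (assumption for illustration only).
_ITEMS = [SourceItem("a", 100), SourceItem("b", 200), SourceItem("c", 300)]


def fake_fetch(since: Optional[int]) -> List[SourceItem]:
    return [i for i in _ITEMS if since is None or i.modified_at > since]


first, cursor = incremental_sync(fake_fetch, None)      # first run: full crawl
second, cursor2 = incremental_sync(fake_fetch, cursor)  # second run: no changes
```

A strict "greater than" comparison can miss items that share the cursor's timestamp, so real change feeds usually rely on an opaque delta token issued by the source instead.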

Connector Services

Source-specific ingestion adapters

ews

Java / Spring Boot

Exchange Web Services connector for Outlook, OneDrive, SharePoint, and Teams. Supports crawl, move, delete, validate, and revert actions.

graph-ingestion

Scala / Maven

MS Graph API with three sub-services: graph-management, dm-graph-ingestion, graph-enforcer.

google

Python

Google Workspace ingestion: Gmail and Google Drive scanning.

fileshare-service

Java

Network file scanning via SMB/NFS.

collector

Java / Gradle

Universal batch collector for generic source types.

website-source

Python

HTTP/HTTPS web crawling service.

HTTP batch ingestion | Direct ES indexing
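
For direct ES indexing, batches are typically sent to Elasticsearch's `_bulk` endpoint as newline-delimited JSON. The sketch below only builds that request body; the index name, the `id` field, and `build_bulk_body` itself are assumptions for illustration, and how the connectors actually index is not stated here.

```python
import json
from typing import Dict, Iterable


def build_bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Serialize documents into the NDJSON body expected by the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index the document under its own id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body(
    "documents-demo",  # hypothetical index name
    [{"id": "1", "title": "Contract"}, {"id": "2", "title": "Invoice"}],
)
```

The resulting string would be posted with the `application/x-ndjson` content type.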

Core API Layer

Central orchestrator and auth

api

Python / Flask :8000

REST API with 20+ blueprint modules: documents, sources, policies, tags, reports, users, configuration, and more.

task-worker

Python / Celery

Async job processor via RabbitMQ broker.

iam

Python / Flask :5000

Identity and Access Management with JWT authentication and company-scoped isolation.
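
One way to picture JWT-based, company-scoped isolation is a signed token that carries a company claim, checked on every request. This is a stdlib-only sketch using HS256; the `company_id` claim name and the `authorize` helper are assumptions, and it omits expiry and header validation that a maintained JWT library would handle.

```python
import base64
import hashlib
import hmac
import json
from typing import Dict


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: Dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes) -> Dict:
    header, payload, signature = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise PermissionError("invalid signature")
    return json.loads(_b64url_decode(payload))


def authorize(token: str, secret: bytes, resource_company_id: str) -> bool:
    """Allow access only when the token's company matches the resource's company."""
    claims = verify_token(token, secret)
    return claims.get("company_id") == resource_company_id


token = sign_token({"sub": "user-1", "company_id": "acme"}, b"demo-secret")
```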

dlp

Python / Flask + Celery :5050

Data Loss Prevention service with dedicated PostgreSQL database (dlp_dam) and RabbitMQ connection.

chat-api

Python

LLM-powered document Q&A for chat-based GDPR compliance.

Celery tasks via AMQP :5672 Periodic scheduled jobs

Processing Workers

Celery tasks consumed from RabbitMQ

Classification

Celery task

PII detection, sensitivity scoring, language identification.
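
PII detection and sensitivity scoring can be sketched with pattern matching and per-type weights. The patterns and weights below are hypothetical and deliberately simplistic; they are not the platform's classifiers.

```python
import re
from typing import Dict

# Hypothetical patterns and weights, chosen for illustration only.
PII_PATTERNS = {
    "email": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 1),
    "iban": (re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"), 3),
}


def detect_pii(text: str) -> Dict[str, int]:
    """Count matches per PII type, omitting types with no match."""
    counts = {}
    for name, (pattern, _weight) in PII_PATTERNS.items():
        found = len(pattern.findall(text))
        if found:
            counts[name] = found
    return counts


def sensitivity_score(counts: Dict[str, int]) -> int:
    """Weighted sum of matches: more sensitive types contribute more."""
    return sum(PII_PATTERNS[name][1] * n for name, n in counts.items())


sample = "Contact jane.doe@example.com, IBAN DK5000400440116243."
counts = detect_pii(sample)
score = sensitivity_score(counts)
```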

ai-profiler

Python / spaCy / Celery

NLP-based entity extraction and document profiling.

Policy Enforcement

Celery task

Applies retention rules (delete, archive, quarantine) with a full audit trail.
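
A retention rule can be modelled as a document class, a maximum age, and one of the three actions, with each decision recorded for audit. This is a sketch under those assumptions; `RetentionRule`, `AuditEntry`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

ALLOWED_ACTIONS = {"delete", "archive", "quarantine"}


@dataclass(frozen=True)
class RetentionRule:
    document_class: str
    max_age_days: int
    action: str

    def __post_init__(self) -> None:
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action}")


@dataclass(frozen=True)
class AuditEntry:
    document_id: str
    action: str
    age_days: int
    max_age_days: int
    decided_on: date


def evaluate(
    document_id: str,
    document_class: str,
    created_on: date,
    today: date,
    rules: List[RetentionRule],
) -> Optional[AuditEntry]:
    """Return an audit entry for the first matching rule, or None if the document is kept."""
    age_days = (today - created_on).days
    for rule in rules:
        if rule.document_class == document_class and age_days > rule.max_age_days:
            return AuditEntry(document_id, rule.action, age_days, rule.max_age_days, today)
    return None


rules = [RetentionRule("invoice", 365, "archive")]
old = evaluate("doc-1", "invoice", date(2020, 1, 1), date(2025, 1, 1), rules)
recent = evaluate("doc-2", "invoice", date(2024, 12, 1), date(2025, 1, 1), rules)
```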

Tag Assignment

Celery task

Delta tagging and document class assignment.
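
If delta tagging means writing only the difference between a document's current and desired tags (an assumption; the term is not defined here), the core is a pair of set differences. `tag_delta` is a hypothetical helper.

```python
from typing import Set, Tuple


def tag_delta(current: Set[str], desired: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (tags to add, tags to remove) so that current becomes desired."""
    return desired - current, current - desired


to_add, to_remove = tag_delta({"hr", "draft"}, {"hr", "contract"})
```

Documents whose tag sets already match produce two empty sets and need no write.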

ocr

Python

Text extraction from scanned documents and images.

DataSubject Manager

Java

VIP and entity matching against a known-persons database.

ES / PG Sync

Celery task

Bidirectional synchronization between Elasticsearch and PostgreSQL.
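
Two-way synchronization needs a rule for deciding which side wins when both hold a record. The sketch below uses last-write-wins on an `updated_at` value, which is an assumed strategy rather than the documented one; `Record` and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Record:
    value: str
    updated_at: int  # hypothetical: epoch seconds of the last write


def reconcile(
    es: Dict[str, Record], pg: Dict[str, Record]
) -> Tuple[Dict[str, Record], Dict[str, Record]]:
    """Return (writes to apply to Elasticsearch, writes to apply to PostgreSQL)."""
    to_es: Dict[str, Record] = {}
    to_pg: Dict[str, Record] = {}
    for key in es.keys() | pg.keys():
        in_es, in_pg = es.get(key), pg.get(key)
        if in_es is None:
            to_es[key] = in_pg            # only PostgreSQL has it
        elif in_pg is None:
            to_pg[key] = in_es            # only Elasticsearch has it
        elif in_es.updated_at > in_pg.updated_at:
            to_pg[key] = in_es            # Elasticsearch copy is newer
        elif in_pg.updated_at > in_es.updated_at:
            to_es[key] = in_pg            # PostgreSQL copy is newer
        # Equal timestamps: treated as already in sync.
    return to_es, to_pg


es_side = {"a": Record("old", 1), "b": Record("only-es", 5)}
pg_side = {"a": Record("new", 2), "c": Record("only-pg", 3)}
to_es, to_pg = reconcile(es_side, pg_side)
```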

Backup Worker

Celery task

Daily S3 exports as JSON batches (750 docs/file), 180-day retention.
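
The batching and retention figures above translate into two small calculations: splitting documents into files of 750 and expiring exports older than 180 days. This sketch builds payloads in memory only; the object-key layout and the function names are assumptions, and no S3 call is made.

```python
import json
from datetime import date, timedelta
from typing import Dict, Iterator, List

BATCH_SIZE = 750       # documents per JSON file
RETENTION_DAYS = 180   # exports older than this are eligible for removal


def batches(docs: List[Dict], size: int = BATCH_SIZE) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def export_files(company: str, day: date, docs: List[Dict]) -> Dict[str, str]:
    """Map hypothetical object keys to JSON payloads, one per batch."""
    return {
        f"{company}/{day.isoformat()}/part-{n:05d}.json": json.dumps(batch)
        for n, batch in enumerate(batches(docs))
    }


def is_expired(export_day: date, today: date) -> bool:
    return (today - export_day).days > RETENTION_DAYS


docs = [{"id": i} for i in range(1600)]
files = export_files("acme", date(2025, 1, 1), docs)  # 750 + 750 + 100 documents
cutoff = date(2025, 1, 1) + timedelta(days=RETENTION_DAYS)
```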

REST :9200 | SQL :5432 | AMQP :5672 | S3 API

Data Stores

Persistence and messaging infrastructure

Elasticsearch 9.x

Primary index :9200

Document search, classification data, audit logs, and system events. The central nervous system of the platform.

data sources accounts tags policies deleted_data analytics algorithms document_classes logs history

PostgreSQL 17.x

Relational state :5432

Users, configuration, DLP state, IAM accounts, PowerBI datasets, and vector embeddings.

Databases: default, dlp_dam, iam, powerbi, vector

RabbitMQ 4.x

Message broker :5672

Celery task queue over AMQP. Management UI on :15672.

AWS S3

Backup storage

Daily JSON index exports per company/workspace, retained for 180 days, plus monthly ES snapshots.

Health queries | Email notifications | Metrics collection

Monitoring and Alerting

Observability and system health

Management Tool | Alerts

Python / 40+ alert types

Monitors ingestion speed, policy progress, ES index health, certificate expiry, and Azure sync status.

py-dm-notify

Python

Email delivery for compliance notifications and reports.

analytics

Python / Flask

Compliance metrics aggregation and KPI reporting.

flower

Web UI :5555

Real-time Celery worker and task dashboard.

metricbeat

Elastic Stack

Collects system and container metrics for monitoring ES.

HTTP / JSON | REST | JWT auth via IAM

Frontend and User Interface

User-facing applications

client

Vue 3 / TypeScript / Vite :9000

Primary web application. Document search, classification management, policy administration, compliance reports, and case logs.

chat-client

Frontend

LLM chat interface for document Q&A.

nginx

Reverse proxy :443 :80

TLS termination with Let's Encrypt certificates. Routes: / to client, /api to api, /iam to iam.
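
The three routes behave like longest-prefix matching: a request goes to the most specific prefix that its path starts with. This Python sketch mirrors that rule for illustration; it is not the nginx configuration, and `resolve_upstream` is a hypothetical helper.

```python
ROUTES = {"/": "client", "/api": "api", "/iam": "iam"}


def resolve_upstream(path: str) -> str:
    """Pick the upstream whose prefix is the longest one matching the path."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        raise ValueError(f"path must start with '/': {path!r}")
    return ROUTES[max(matches, key=len)]


upstream = resolve_upstream("/api/documents")
```

Because "/" matches every path, it acts as the fallback to the client application.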