A staged OCR pipeline that turns a raw PDF into clean text and document-level signals. Each pass is deliberate: a free text layer where it exists, a fast Tesseract baseline, a combined vision worker for signatures, faces and ID detection, a specialist MRZ reader, and EasyOCR as the deep-learning last mile.
Because you can't garble text you read with the right tool the first time.
5 sequential passes
3 detectors in one worker
2 OCR engines, fast then deep
1 page image, loaded once
Section 01 · Overview
The pipeline at a glance
Pages move left to right. The MRZ pass only fires when the ID-or-not detector says a page is an identity document; everything else continues to the rich OCR pass. Each stage emits annotations the next stage can use as hints.
Diagram legend: main data path · conditional branch (ID only) · vision signal · document annotation
Section 02 · Rationale
Why layer the work at all
A real document corpus is heterogeneous: born-digital PDFs, scanned letters with a coffee stain, multilingual contracts, and passports with a machine-readable zone all land in the same inbox. The pipeline routes each page through cheap, deterministic passes first and reserves expensive deep-learning OCR for the cases that earn it. Vision signals ride alongside the text, so a consumer can ask "is this contract signed?" or "is this upload actually a passport?" without re-reading the file.
Pass 01 · PDF reader
The front door
Two PDFs that look identical to a person can be wildly different inside: one a born-digital export with a perfect text layer, the other a phone photo flattened to PDF at 72 DPI and rotated four degrees. The reader normalises that asymmetry so downstream OCR never has to.
What it does
Rasterises each page to a standard 300 DPI so engines see consistent character sizes.
Extracts the embedded text layer when present — short-circuiting OCR entirely for born-digital files.
Deskews, dewarps and fixes orientation, then crops scanner-bed borders.
Splits multi-page documents into a per-page stream the rest of the pipeline handles independently.
Unique advantage
Free text wins. A born-digital PDF skips OCR: the embedded text is available immediately and carries no recognition errors.
Consistent input. Every later model can assume an upright, sane-DPI image.
One source of truth. The image rendered here is reused by every later worker — no double rasterisation.
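The 300 DPI target and the text-layer short-circuit can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual reader: `render_size` relies only on the fact that default PDF user space is 72 points per inch, while `has_usable_text_layer` and its `min_chars` threshold are hypothetical names and values chosen for the example.

```python
from typing import Tuple

PDF_POINTS_PER_INCH = 72  # default PDF user space: 1 point = 1/72 inch


def render_size(width_pt: float, height_pt: float, dpi: int = 300) -> Tuple[int, int]:
    """Pixel dimensions of a page rasterised at the given DPI."""
    scale = dpi / PDF_POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def has_usable_text_layer(text: str, min_chars: int = 20) -> bool:
    """Assumed heuristic: enough non-whitespace characters to trust the layer.

    The threshold is illustrative; a real reader would likely also check
    for garbled encodings before skipping OCR.
    """
    return len("".join(text.split())) >= min_chars


letter_px = render_size(612, 792)        # US Letter in points
a4_px = render_size(595.276, 841.89)     # A4 in points
```

US Letter comes out at 2550 x 3300 pixels and A4 at 2480 x 3508, which is why a fixed DPI gives every later engine consistent character sizes regardless of how the file was produced.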
Pass 02 · Tesseract
The fast baseline
Tesseract is the workhorse: fast, CPU-friendly, 100+ languages, and structured output — word boxes, line boxes, per-word confidence. For the long tail of clean office documents it solves the problem outright.
What it does
Runs the LSTM recogniser over each page and emits hOCR / TSV with words, lines and boxes.
Attaches a per-word confidence score that decides where EasyOCR needs a second pass.
Returns reading order and layout, so paragraphs reconstruct cleanly.
Unique advantage
Speed and cost. Pure CPU, no GPU dependency — scales horizontally, stays cheap.
Deterministic. Same input, same output: easy to test, cache and diff in CI.
Layout-aware. Word boxes and reading order are first-class outputs.
Confidence is a routing signal. Low-confidence regions become EasyOCR's input.
Tesseract is deliberately the first OCR pass — good enough for most pages. The expensive passes only run where it falls short.
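To make the routing signal concrete, here is a small sketch that reads Tesseract's TSV output and picks the words that would be handed to EasyOCR. The column names follow Tesseract's TSV format (word rows have `level` 5, non-word rows carry a `conf` of -1); the sample data, the function name and the threshold of 60 are assumptions for illustration only.

```python
import csv
import io
from typing import Dict, List

# Hand-written sample in Tesseract's TSV layout (not real engine output).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t40\t50\t400\t30\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t40\t50\t120\t30\t96.4\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t170\t50\t90\t30\t41.2\tN0.\n"
    "5\t1\t1\t1\t1\t3\t270\t50\t80\t30\t88.0\t1042\n"
)

WORD_LEVEL = "5"  # TSV rows at level 5 describe single words


def low_confidence_words(tsv: str, threshold: float = 60.0) -> List[Dict]:
    """Return word boxes whose confidence falls below the (assumed) threshold."""
    # QUOTE_NONE: recognised text may contain quote characters.
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    flagged = []
    for row in reader:
        if row["level"] != WORD_LEVEL:
            continue  # skip page, block, paragraph and line rows
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        if not text or conf < 0:
            continue
        if conf < threshold:
            box = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            flagged.append({"text": text, "conf": conf, "box": box})
    return flagged


flagged = low_confidence_words(SAMPLE_TSV)
```

In this sample only the misread "N0." falls below the threshold, so a single small region would go to the slow path rather than the whole page.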
Pass 03 · Combined worker
Signatures, faces, and one decision
The combined worker is the vision lane. Rather than running three jobs that each reload the page image, allocate a tensor and warm a model, it runs three detectors over the same in-memory image in one pass — producing document-level signals that text alone can't answer.
Detector A
Signature
Most contracts only matter once countersigned. A small CNN scans for ink-like strokes that read as a handwritten signature and returns boxes plus a score.
Turns "is this contract executed?" into a boolean.
Box plus page index lets a UI jump to the signed line.
The nearest text line — the printed name — can be paired to the box.
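Pairing a signature box with the printed name beside it could look like the sketch below. It assumes the simplest possible rule, the text line whose centre is closest to the centre of the signature box; the `Box` and `TextLine` types and the `nearest_line` helper are hypothetical and not part of any detector's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    width: float
    height: float

    @property
    def center(self) -> Tuple[float, float]:
        return (self.left + self.width / 2, self.top + self.height / 2)


@dataclass(frozen=True)
class TextLine:
    text: str
    box: Box


def nearest_line(signature: Box, lines: List[TextLine]) -> Optional[TextLine]:
    """Return the OCR line closest to the signature, or None if there are no lines."""
    if not lines:
        return None
    sx, sy = signature.center

    def squared_distance(line: TextLine) -> float:
        lx, ly = line.box.center
        return (lx - sx) ** 2 + (ly - sy) ** 2

    return min(lines, key=squared_distance)


signature_box = Box(left=100, top=500, width=200, height=60)
ocr_lines = [
    TextLine("Terms and conditions", Box(50, 100, 400, 20)),
    TextLine("Jane Doe", Box(110, 570, 180, 20)),
]
signer = nearest_line(signature_box, ocr_lines)
```

A production version would probably restrict the search to lines directly below or beside the box, since a signature block can sit close to unrelated text.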
Detector B
YuNet
YuNet is a tiny face detector built to run in real time on commodity hardware. Here it isn't about faces — it's about portraits as a document feature.
A face in the corner is a strong cue for an ID, passport or licence.
About 1 ms per crop on CPU — cheap enough to run on every page.
Presence of a face can drive redaction or special handling.
Detector C
ID or not
A binary classifier that reads the whole page and answers one question: is this an identity document? Its job is to decide which pages reach the MRZ specialist.
Avoids running MRZ on every page — most have none.
Uses YuNet's face hit and Tesseract's text as features.
A small head over a small backbone: fast, easy to calibrate.
Why combined? Loading the image, colour-converting, resizing and warming a model account for most of any detector's latency. Running all three over one shared image — one process, one tight loop — collapses that overhead and keeps the annotations self-consistent, because all three saw the same pixels.
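The shared-image idea can be sketched as one loop that loads the page once and passes earlier signals to later detectors, so the ID classifier can use the face hit as a feature. Everything here is illustrative: `run_combined`, the stub detectors and the toy image are assumptions, standing in for real models.

```python
from typing import Any, Callable, Dict, List, Tuple

Image = Any  # stand-in for a decoded page image (for example a NumPy array)
Detector = Callable[[Image, Dict[str, Any]], Dict[str, Any]]


def run_combined(load_image: Callable[[], Image],
                 detectors: List[Tuple[str, Detector]]) -> Dict[str, Dict[str, Any]]:
    """Load the page once, then run each detector over the same pixels in order."""
    image = load_image()
    signals: Dict[str, Dict[str, Any]] = {}
    for name, detect in detectors:
        # Later detectors can read the signals produced by earlier ones.
        signals[name] = detect(image, signals)
    return signals


# Hypothetical stubs so the sketch runs without any model weights.
loads: List[int] = []


def load_page() -> Image:
    loads.append(1)  # record each load to show it happens only once
    return [[0, 255], [255, 0]]  # toy 2x2 "image"


def detect_signature(image: Image, signals: Dict[str, Any]) -> Dict[str, Any]:
    return {"signed": False, "boxes": []}


def detect_faces(image: Image, signals: Dict[str, Any]) -> Dict[str, Any]:
    return {"count": 1}


def classify_id(image: Image, signals: Dict[str, Any]) -> Dict[str, Any]:
    # Toy rule: a real classifier would combine this with text features.
    return {"is_id": signals["faces"]["count"] > 0}


signals = run_combined(load_page, [
    ("signature", detect_signature),
    ("faces", detect_faces),
    ("id_or_not", classify_id),
])
```

Ordering the detectors explicitly is a design choice in this sketch: it keeps the dependency of the ID classifier on the face detector visible instead of hiding it inside the worker.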
Pass 04 · MRZ
The passport specialist
The machine-readable zone at the foot of a passport or ID card follows the strict format defined in ICAO Doc 9303: a fixed character set (A–Z 0–9 <), fixed positions, and built-in check digits. A generic OCR engine will read it but tends to confuse look-alike characters such as O/0 or 1/I. This pass is purpose-built for the strip.
What it does
Runs only when the ID-or-not classifier flags the page — so it never wastes work.
Uses an MRZ-tuned recogniser and grammar that only emit valid MRZ characters.
Parses the fields: type, issuing country, surname, given names, document number, nationality, date of birth, sex, expiry.
Verifies check digits, so a mis-read or carelessly altered field is caught by arithmetic, not by guessing.
Unique advantage
Structured, not free text. KYC code consumes a typed record, not a string.
Self-validating. Check digits give an arithmetic consistency check that generic OCR output can't offer.
Narrow domain, high accuracy. Small character set, fixed layout — a specialist wins by a wide margin.
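The check-digit arithmetic is simple enough to show in full. In ICAO Doc 9303, digits keep their value, letters A to Z map to 10 to 35, the filler `<` counts as 0, and the characters are weighted 7, 3, 1 repeating before taking the sum modulo 10. The sketch below applies that to the second line of a passport-format (TD3) MRZ, using the specimen line from the standard; `parse_td3_line2` is a hypothetical helper covering only a few fields, not a complete parser.

```python
import string
from typing import Any, Dict

WEIGHTS = (7, 3, 1)
# 0-9 keep their value, A-Z map to 10-35, the filler '<' counts as 0.
VALUES = {c: i for i, c in enumerate(string.digits + string.ascii_uppercase)}
VALUES["<"] = 0


def check_digit(field: str) -> int:
    """ICAO 9303 check digit; raises KeyError on characters outside the MRZ set."""
    return sum(VALUES[c] * WEIGHTS[i % 3] for i, c in enumerate(field)) % 10


def parse_td3_line2(line: str) -> Dict[str, Any]:
    """Extract a few fields from line 2 of a TD3 (passport) MRZ and verify them."""
    if len(line) != 44:
        raise ValueError("a TD3 MRZ line has exactly 44 characters")
    checked = {
        "document_number": (line[0:9], line[9]),
        "date_of_birth": (line[13:19], line[19]),
        "expiry": (line[21:27], line[27]),
    }
    record: Dict[str, Any] = {
        "nationality": line[10:13].replace("<", ""),
        "sex": line[20],
    }
    for name, (value, digit) in checked.items():
        record[name] = value.replace("<", "")
        record[name + "_valid"] = (
            digit in string.digits and check_digit(value) == int(digit)
        )
    return record


SPECIMEN = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
record = parse_td3_line2(SPECIMEN)
# Simulate a classic OCR confusion: the digit 0 read as the letter O.
misread = parse_td3_line2(SPECIMEN.replace("L898902", "L8989O2", 1))
```

For the specimen all three checks pass; in the simulated misread the document number no longer matches its check digit, which is exactly the kind of error the arithmetic is meant to expose.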
Pass 05 · EasyOCR
The deep-learning last mile
EasyOCR is a deep-learning OCR — a CRAFT text detector plus a CRNN recogniser on PyTorch. Heavier than Tesseract, but it shines exactly where Tesseract struggles: low-contrast or stylized fonts, curved or rotated text, photos of receipts, and many non-Latin scripts. Here it is the fallback and the rich-OCR pass.
What it does
Re-reads the regions where Tesseract returned low confidence — surgical, not whole-page.
Handles non-Latin scripts a given Tesseract deployment isn't configured for.
Provides a second opinion: agreement between the two engines raises confidence sharply.
Unique advantage
Robustness. CNN recognition copes with photos, perspective and unusual fonts.
Coverage. 80+ languages out of the box.
Targeted. Only runs where Tesseract wasn't confident, so the slow path stays small.
Cheap first, expensive only where needed — that is what keeps the pipeline fast on average without losing the long tail.
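One way to turn the "second opinion" into a number is sketched below. It puts both scores on a 0 to 1 scale (Tesseract reports 0 to 100, EasyOCR 0 to 1), boosts confidence when the two readings agree, and otherwise keeps the more confident one. The noisy-OR formula treats the two engines as independent, which is an assumption made for this example, and `reconcile` is a hypothetical helper rather than part of either library.

```python
from typing import Tuple


def _clamp(value: float) -> float:
    return max(0.0, min(value, 1.0))


def reconcile(tess_text: str, tess_conf: float,
              easy_text: str, easy_conf: float) -> Tuple[str, float]:
    """Merge two readings of the same region.

    tess_conf is on Tesseract's 0-100 scale, easy_conf on EasyOCR's 0-1 scale.
    """
    t = _clamp(tess_conf / 100.0)
    e = _clamp(easy_conf)
    tess_text, easy_text = tess_text.strip(), easy_text.strip()
    if tess_text == easy_text:
        # Assumed noisy-OR: both engines would have to be wrong independently.
        return tess_text, 1.0 - (1.0 - t) * (1.0 - e)
    # Disagreement: keep whichever engine was more confident.
    return (easy_text, e) if e > t else (tess_text, t)


disagree = reconcile("N0.", 41.2, "No.", 0.93)
agree = reconcile("Total", 55.0, "Total", 0.80)
```

When the engines disagree, the more confident EasyOCR reading "No." replaces Tesseract's "N0."; when they agree on "Total", two middling scores combine to roughly 0.91.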
Section 03 · Routing
The decision a page makes
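The routing rules described in the passes above can be condensed into one small function. This is a sketch under stated assumptions: the pass names are illustrative labels, the vision worker is assumed to run on every page, and the MRZ pass is modelled as a side branch that does not prevent EasyOCR re-reads on the same page.

```python
from typing import List


def plan_passes(has_text_layer: bool, is_id: bool, low_conf_regions: int) -> List[str]:
    """List the passes a page goes through, given the signals gathered so far."""
    passes = ["pdf_reader"]
    if not has_text_layer:
        passes.append("tesseract")      # born-digital pages skip OCR entirely
    passes.append("combined_worker")    # assumed to run on every page
    if is_id:
        passes.append("mrz")            # conditional branch, ID pages only
    if not has_text_layer and low_conf_regions > 0:
        passes.append("easyocr")        # only where Tesseract was unsure
    return passes


born_digital = plan_passes(has_text_layer=True, is_id=False, low_conf_regions=0)
clean_scan = plan_passes(has_text_layer=False, is_id=False, low_conf_regions=0)
passport_photo = plan_passes(has_text_layer=False, is_id=True, low_conf_regions=3)
```

A born-digital page touches only the reader and the vision worker, a clean scan adds Tesseract, and a noisy passport photo is the only case here that pays for all five passes.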
Section 04 · At a glance
What each pass adds
| Pass | Output | Cost | When it shines |
|---|---|---|---|
| PDF reader | normalized images, text layer | very low | born-digital PDFs |
| Tesseract | word boxes + confidence | low · CPU | clean office documents |
| Signature | signed flag + boxes | low | contracts, compliance |
| YuNet | face boxes, count | very low | spotting portraits / IDs |
| ID-or-not | isID, document type | low | routing to MRZ |
| MRZ | typed record + check digits | medium | passports, national IDs |
| EasyOCR | text + confidence | high · GPU-friendly | stylized, noisy, multilingual |
Section 05 · Principles
What the pipeline believes
Cheap first, expensive only when earned
Every pass exists because the one before it can't handle a specific failure mode. Tesseract carries the bulk; EasyOCR is invoked only for the regions Tesseract read with low confidence; MRZ runs only when the classifier flags the page as an ID.
Signals, not just text
OCR is necessary but rarely sufficient. The combined worker emits document-level signals — signed, portrait present, identity document — that the business logic downstream actually needs.
Specialists beat generalists in narrow domains
A passport MRZ is a tiny, strict format with check digits. A specialist parser is more accurate, and self-validating, in a way no generic OCR can be on the same strip.
Share the work
The combined worker exists because loading and preprocessing the image is the slow part. Three detectors over one in-memory image collapse that overhead and keep the annotations consistent.