Data & More · Engineering · github.com/dataandmore/ocr

One document, five passes, zero guesswork.

A staged OCR pipeline that turns a raw PDF into clean text and document-level signals. Each pass is deliberate: a free text layer where it exists, a fast Tesseract baseline, a combined vision worker for signatures, faces and ID detection, a specialist MRZ reader, and EasyOCR as the deep-learning last mile.

Because you can't garble text you read with the right tool the first time.

  • 5 sequential passes
  • 3 detectors in one worker
  • 2 OCR engines, fast then deep
  • 1 page image, loaded once
Section 01 · Overview

The pipeline at a glance

Pages move left to right. The MRZ pass only fires when the ID-or-not detector says a page is an identity document; everything else continues to the rich OCR pass. Each stage emits annotations the next stage can use as hints.

[Figure: pipeline diagram, left to right. INPUT: PDF reader (pdfium / PyMuPDF; pages → images + text; deskew · 300 DPI). FAST TEXT: Tesseract (LSTM fast baseline OCR; word boxes · confidence; layout · reading order; cheap · deterministic). PARALLEL VISION: combined worker, three detectors on one image load: Signature (signed? where?), YuNet (face box, portrait?), ID-or-not classifier (route?). SPECIALIST, if ID: MRZ (passport / ID strip; ICAO 9303 parser; check digits verified). DEEP OCR, always: EasyOCR (CRAFT + CRNN; stylized · noisy; 80+ languages; last mile). Legend: main data path; conditional branch (ID only); vision signal; document annotation.]
Section 02 · Rationale

Why layer the work at all

A real document corpus is heterogeneous: born-digital PDFs, scanned letters with a coffee stain, multilingual contracts, and passports with a machine-readable zone all land in the same inbox. The pipeline routes each page through cheap, deterministic passes first and reserves expensive deep-learning OCR for the cases that earn it. Vision signals ride alongside the text, so a consumer can ask "is this contract signed?" or "is this upload actually a passport?" without re-reading the file.

Pass 01 · PDF reader

The front door

Two PDFs that look identical to a person can be wildly different inside: one a born-digital export with a perfect text layer, the other a phone photo flattened to PDF at 72 DPI and rotated four degrees. The reader normalises that asymmetry so downstream OCR never has to.

What it does

  • Rasterises each page to a standard 300 DPI so engines see consistent character sizes.
  • Extracts the embedded text layer when present — short-circuiting OCR entirely for born-digital files.
  • Deskews, dewarps and fixes orientation, then crops scanner-bed borders.
  • Splits multi-page documents into a per-page stream the rest of the pipeline handles independently.

Unique advantage

  • Free text wins. A born-digital PDF skips OCR: near-instant, with no recognition errors to introduce.
  • Consistent input. Every later model can assume an upright, sane-DPI image.
  • One source of truth. The image rendered here is reused by every later worker — no double rasterisation.
[Figure: a tilted raw scan beside its deskewed 300 DPI rendering.]
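
The text-layer short-circuit described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual reader: the `MIN_TEXT_CHARS` threshold, the `PageInput` record and the function names are hypothetical, and the PyMuPDF calls are imported lazily so the decision logic can be tried without the library installed.

```python
from dataclasses import dataclass

RENDER_DPI = 300      # the standard resolution named in the list above
MIN_TEXT_CHARS = 20   # hypothetical threshold for "has a usable text layer"


@dataclass
class PageInput:
    index: int
    text_layer: str   # embedded text, "" when the page has none
    needs_ocr: bool   # False when the text layer is judged good enough


def has_usable_text_layer(text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Heuristic: enough non-whitespace characters to trust the embedded text."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def classify_page(index: int, text: str) -> PageInput:
    """Decide per page whether OCR can be skipped."""
    return PageInput(index, text, needs_ocr=not has_usable_text_layer(text))


def read_pdf(path: str):
    """Yield (PageInput, pixmap) for each page. Requires PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    with fitz.open(path) as doc:
        for index, page in enumerate(doc):
            info = classify_page(index, page.get_text())
            # Rendered once here; later workers would reuse this image.
            pixmap = page.get_pixmap(dpi=RENDER_DPI)
            yield info, pixmap
```

A character-count threshold is the simplest possible test; a real reader would probably also check that the text layer is not itself the output of a poor earlier OCR run.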
Pass 02 · Tesseract

The fast baseline

[Figure: a sample invoice ("Invoice No. 2026-0427", "Total: 12,480.00", "Due 30 May 2026", "VAT 19283746") overlaid with word boxes + confidence.]

Tesseract is the workhorse: fast, CPU-friendly, 100+ languages, and structured output (word boxes, line boxes, per-word confidence). For the bulk of clean office documents it solves the problem outright.

What it does

  • Runs the LSTM recogniser over each page and emits hOCR / TSV with words, lines and boxes.
  • Attaches a per-word confidence score that decides where EasyOCR needs a second pass.
  • Returns reading order and layout, so paragraphs reconstruct cleanly.

Unique advantage

  • Speed and cost. Pure CPU, no GPU dependency — scales horizontally, stays cheap.
  • Deterministic. Same input, same output: easy to test, cache and diff in CI.
  • Layout-aware. Word boxes and reading order are first-class outputs.
  • Confidence is a routing signal. Low-confidence regions become EasyOCR's input.
Tesseract is deliberately the first OCR pass — good enough for most pages. The expensive passes only run where it falls short.
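
To make "confidence is a routing signal" concrete, here is a small sketch that picks out the word boxes worth a second pass. It assumes word data shaped like the dictionary returned by pytesseract's `image_to_data(..., output_type=Output.DICT)`, where `conf` runs from 0 to 100 and `-1` marks rows that are not words. The `CONF_THRESHOLD` value and the sample numbers are made up for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, width, height

CONF_THRESHOLD = 60.0  # hypothetical cut-off on Tesseract's 0-100 scale


def low_confidence_boxes(
    data: Dict[str, list], threshold: float = CONF_THRESHOLD
) -> List[Box]:
    """Return the boxes of recognised words whose confidence is below threshold."""
    boxes: List[Box] = []
    rows = zip(
        data["text"], data["conf"],
        data["left"], data["top"], data["width"], data["height"],
    )
    for text, conf, left, top, width, height in rows:
        score = float(conf)  # some versions report conf as a string
        if score < 0 or not str(text).strip():
            continue  # structural row (block, paragraph, line) or empty word
        if score < threshold:
            boxes.append((left, top, width, height))
    return boxes


# Illustrative rows only; real values come from the Tesseract pass.
sample = {
    "text":   ["", "Invoice", "N0.", "2O26-O427"],
    "conf":   [-1, 96, 41, 38],
    "left":   [0, 10, 90, 130],
    "top":    [0, 12, 12, 12],
    "width":  [300, 70, 30, 95],
    "height": [40, 18, 18, 18],
}
regions = low_confidence_boxes(sample)
```

Here `regions` holds the two boxes for the misread words, which is the kind of input a targeted EasyOCR pass would receive.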
Pass 03 · Combined worker

Signatures, faces, and one decision

The combined worker is the vision lane. Rather than running three jobs that each reload the page image, allocate a tensor and warm a model, it runs three detectors over the same in-memory image in one pass — producing document-level signals that text alone can't answer.

[Figure: the combined worker. One page image, loaded once, feeds three detectors: the signature detector (CNN over the page, boxes + score; example output signed = true, box=(14,130,96,38), p=0.93), the YuNet face detector (tiny, fast, runs on CPU; example output faces = 1, portrait, score 0.98) and the ID-or-not classifier (binary head, "is this an ID?"; example output isID = true, which routes to the MRZ pass). Caption: one image, one batch; three detectors share decode + preprocess; ~3× faster than three separate workers.]
Detector A

Signature

Most contracts only matter once countersigned. A small CNN scans for ink-like strokes that read as a handwritten signature and returns boxes plus a score.

  • Turns "is this contract executed?" into a boolean.
  • Box plus page index lets a UI jump to the signed line.
  • The nearest text line — the printed name — can be paired to the box.
Detector B

YuNet

YuNet is a tiny face detector built to run in real time on commodity hardware. Here it isn't about faces — it's about portraits as a document feature.

  • A face in the corner is a strong cue for an ID, passport or licence.
  • About 1 ms per crop on CPU — cheap enough to run on every page.
  • Presence of a face can drive redaction or special handling.
Detector C

ID or not

A binary classifier that reads the whole page and answers one question: is this an identity document? Its job is to decide which pages reach the MRZ specialist.

  • Avoids running MRZ on every page — most have none.
  • Uses YuNet's face hit and Tesseract's text as features.
  • A small head over a small backbone: fast, easy to calibrate.
Why combined? Loading the image, colour-converting, resizing and warming a model account for most of any detector's latency. Running all three over one shared image — one process, one tight loop — collapses that overhead and keeps the annotations self-consistent, because all three saw the same pixels.
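
The shared-image idea can be sketched in a few lines. Everything here is a stand-in: the detector functions are stubs returning fixed values, and `run_combined_worker` is a hypothetical name, not the worker's real entry point. The point is only that decoding and preprocessing happen once and that a later detector can read an earlier one's result, as the ID-or-not classifier does with the face hit.

```python
from typing import Any, Callable, Dict, List, Tuple

Annotations = Dict[str, Dict[str, Any]]
Detector = Callable[[Any, Annotations], Dict[str, Any]]


def run_combined_worker(
    load_image: Callable[[], Any],
    preprocess: Callable[[Any], Any],
    detectors: List[Tuple[str, Detector]],
) -> Annotations:
    """Decode and preprocess once, then run every detector on the same image."""
    image = preprocess(load_image())
    annotations: Annotations = {}
    for name, detect in detectors:
        # Later detectors may read earlier results, e.g. the face hit.
        annotations[name] = detect(image, annotations)
    return annotations


# Stub loader and detectors standing in for the real models (all hypothetical).
load_count = 0


def load_page_image() -> List[List[int]]:
    global load_count
    load_count += 1
    return [[0] * 4 for _ in range(4)]  # placeholder pixels


def detect_signature(image, annotations):
    return {"signed": True, "score": 0.93}


def detect_faces(image, annotations):
    return {"faces": 1}


def classify_id(image, annotations):
    return {"is_id": annotations["face"]["faces"] > 0}


result = run_combined_worker(
    load_page_image,
    lambda image: image,  # identity preprocess, enough for the sketch
    [("signature", detect_signature), ("face", detect_faces), ("id", classify_id)],
)
```

Passing the detectors as an ordered list keeps the dependency explicit: the ID classifier must run after the face detector whose output it uses.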
Pass 04 · MRZ

The passport specialist

The machine-readable zone at the foot of a passport or ID card follows a strict format defined in ICAO Doc 9303: a fixed character set (A–Z, 0–9 and <), fixed positions, and built-in check digits. A generic OCR engine will read it but tends to confuse O with 0 or 1 with I. This pass is purpose-built for the strip.

What it does

  • Runs only when the ID-or-not classifier flags the page — so it never wastes work.
  • Uses an MRZ-tuned recogniser and grammar that only emit valid MRZ characters.
  • Parses the fields: type, issuing country, surname, given names, document number, nationality, date of birth, sex, expiry.
  • Verifies check digits, so a mis-read or altered field that breaks the checksum is caught by arithmetic, not by guessing.

Unique advantage

  • Structured, not free text. KYC code consumes a typed record, not a string.
  • Self-validating. Check digits give an arithmetic consistency check that generic OCR output lacks.
  • Narrow domain, high accuracy. Small character set, fixed layout — a specialist wins by a wide margin.
[Figure: a specimen UTOPIA passport (SURNAME: DOE, GIVEN: JANE, DOB: 1985-04-12) with its MRZ strip, P<UTODOE<<JANE<<<<<<<<<<<< and L898902C36UTO7408122F12<<06, feeding the parser → typed record · check ✓.]
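
The check-digit arithmetic is small enough to show in full. The sketch below follows the ICAO 9303 rule: digits keep their value, A to Z map to 10 to 35, the filler `<` counts as 0, the weights 7, 3, 1 repeat across the field, and the check digit is the weighted sum modulo 10. The function names are illustrative, and the sample values are the document-number and date fields visible in the specimen strip above.

```python
import string

_WEIGHTS = (7, 3, 1)
# '0'-'9' map to 0-9, 'A'-'Z' map to 10-35, and the filler '<' counts as 0.
_VALUES = {ch: i for i, ch in enumerate(string.digits + string.ascii_uppercase)}
_VALUES["<"] = 0


def check_digit(field: str) -> int:
    """Weighted sum of character values with repeating weights 7, 3, 1, modulo 10."""
    try:
        total = sum(_VALUES[ch] * _WEIGHTS[i % 3] for i, ch in enumerate(field))
    except KeyError as exc:
        raise ValueError(f"invalid MRZ character: {exc.args[0]!r}") from None
    return total % 10


def field_is_valid(field: str, digit: str) -> bool:
    """True when the printed check digit matches the computed one."""
    return digit.isdigit() and check_digit(field) == int(digit)


# Fields read off the specimen strip: document number, then a six-digit date.
document_ok = field_is_valid("L898902C3", "6")
date_ok = field_is_valid("740812", "2")

# A classic misread: the letter O in place of the digit 0.
misread_ok = field_is_valid("L8989O2C3", "6")
```

Both specimen fields validate, while the O-for-0 misread does not, so the parser can reject it or ask for another read. A single check digit cannot catch every error, since two mistakes can cancel out.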
Pass 05 · EasyOCR

The deep-learning last mile

[Figure: the same stylized, low-contrast line read by both engines. Tesseract, confidence 0.41: |nv01ce N0. 2O26-O427. EasyOCR (CRAFT detector + CRNN), confidence 0.96: Invoice No. 2026-0427.]

EasyOCR is a deep-learning OCR — a CRAFT text detector plus a CRNN recogniser on PyTorch. Heavier than Tesseract, but it shines exactly where Tesseract struggles: low-contrast or stylized fonts, curved or rotated text, photos of receipts, and many non-Latin scripts. Here it is the fallback and the rich-OCR pass.

What it does

  • Re-reads the regions where Tesseract returned low confidence — surgical, not whole-page.
  • Handles non-Latin scripts a given Tesseract deployment isn't configured for.
  • Provides a second opinion: agreement between the two engines raises confidence sharply.

Unique advantage

  • Robustness. CNN recognition copes with photos, perspective and unusual fonts.
  • Coverage. 80+ languages out of the box.
  • Targeted. Only runs where Tesseract wasn't confident, so the slow path stays small.
Cheap first, expensive only where needed — that is what keeps the pipeline fast on average without losing the long tail.
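
One way the "second opinion" could be combined is sketched below. The merge rule is an assumption made for illustration, not the pipeline's documented behaviour: when both engines return the same text, the confidences are combined as if the two errors were independent (a noisy-OR), and when they disagree the more confident reading wins. `Reading` and `merge_readings` are hypothetical names, and confidences are taken on a 0 to 1 scale.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0


def merge_readings(fast: Reading, deep: Reading) -> Reading:
    """Combine a Tesseract reading with an EasyOCR reading of the same region."""
    if fast.text == deep.text:
        # Agreement: treat the engines' errors as independent (an assumption).
        combined = 1.0 - (1.0 - fast.confidence) * (1.0 - deep.confidence)
        return Reading(fast.text, combined)
    # Disagreement: keep whichever engine was more confident.
    return max(fast, deep, key=lambda reading: reading.confidence)


agreed = merge_readings(Reading("Total", 0.8), Reading("Total", 0.9))
disputed = merge_readings(
    Reading("|nv01ce N0. 2O26-O427", 0.41),
    Reading("Invoice No. 2026-0427", 0.96),
)
```

The independence assumption is optimistic, because both engines see the same degraded pixels; a production system would more likely calibrate the combined score against labelled data.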
Section 03 · Routing

The decision a page makes

[Figure: routing flowchart. Nodes: Render page (PDF reader); Tesseract pass (words + confidence); Combined worker (signature · YuNet · ID-or-not); Low-confidence regions? (decide if EasyOCR is needed); MRZ pass (only if isID = true); EasyOCR pass (surgical, then global); Final record (text + signals + MRZ).]
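
The routing decision can be written down as a small pure function. This is a sketch under stated assumptions: the `PageState` fields and pass names are hypothetical, and it assumes that a page with a usable text layer skips both OCR engines but still goes through the combined worker, which the text implies but does not state outright.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageState:
    has_text_layer: bool          # from the PDF reader
    is_id: bool                   # from the ID-or-not classifier
    low_confidence_regions: int   # from the Tesseract pass


def plan_passes(page: PageState) -> List[str]:
    """List the passes a page goes through, in order."""
    passes = ["pdf_reader"]
    if not page.has_text_layer:
        passes.append("tesseract")
    passes.append("combined_worker")  # vision signals are wanted either way
    if page.is_id:
        passes.append("mrz")
    if not page.has_text_layer and page.low_confidence_regions > 0:
        passes.append("easyocr")
    return passes


born_digital = plan_passes(PageState(True, False, 0))
clean_scan = plan_passes(PageState(False, False, 0))
passport_photo = plan_passes(PageState(False, True, 5))
```

In a running pipeline these signals arrive one pass at a time, so the same conditions would be checked incrementally; collecting them in one record just makes the rules easy to test.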
Section 04 · At a glance

What each pass adds

Pass       | Output                        | Cost                | When it shines
PDF reader | normalized images, text layer | very low            | born-digital PDFs
Tesseract  | word boxes + confidence       | low · CPU           | clean office documents
Signature  | signed flag + boxes           | low                 | contracts, compliance
YuNet      | face boxes, count             | very low            | spotting portraits / IDs
ID-or-not  | isID, document type           | low                 | routing to MRZ
MRZ        | typed record + check digits   | medium              | passports, national IDs
EasyOCR    | text + confidence             | high · GPU-friendly | stylized, noisy, multilingual
Section 05 · Principles

What the pipeline believes

Cheap first, expensive only when earned

Every pass exists because the one before it can't handle a specific failure mode. Tesseract carries the bulk; EasyOCR is invoked only for the words it couldn't read; MRZ only when the page is truly an ID.

Signals, not just text

OCR is necessary but rarely sufficient. The combined worker emits document-level signals — signed, portrait present, identity document — that the business logic downstream actually needs.

Specialists beat generalists in narrow domains

A passport MRZ is a tiny, strict format with check digits. A specialist parser is more accurate, and self-validating, in a way no generic OCR can be on the same strip.

Share the work

The combined worker exists because loading and preprocessing the image is the slow part. Three detectors over one in-memory image collapse that overhead and keep the annotations consistent.