Data & More · Engineering · github.com/dataandmore/ocr

One document, five passes, zero guesswork.

A staged OCR pipeline that turns a raw PDF into clean text and document-level signals. Each pass is deliberate: a free text layer where it exists, a fast Tesseract baseline, a combined vision worker for signatures, faces and ID detection, a specialist MRZ reader, and EasyOCR as the deep-learning last mile.

Because you can't garble text you read with the right tool the first time.

5 sequential passes

3 detectors in one worker

2 OCR engines, fast then deep

1 page image, loaded once

Section 01 · Overview

The pipeline at a glance

Pages move left to right. The MRZ pass only fires when the ID-or-not detector says a page is an identity document; everything else continues to the rich OCR pass. Each stage emits annotations the next stage can use as hints.

Diagram legend: main data path · conditional branch (ID only) · vision signal · document annotation

Section 02 · Rationale

Why layer the work at all

A real document corpus is heterogeneous: born-digital PDFs, scanned letters with a coffee stain, multilingual contracts, and passports with a machine-readable zone all land in the same inbox. The pipeline routes each page through cheap, deterministic passes first and reserves expensive deep-learning OCR for the cases that earn it. Vision signals ride alongside the text, so a consumer can ask "is this contract signed?" or "is this upload actually a passport?" without re-reading the file.

Pass 01 · PDF reader

The front door

Two PDFs that look identical to a person can be wildly different inside: one a born-digital export with a perfect text layer, the other a phone photo flattened to PDF at 72 DPI and rotated four degrees. The reader normalises that asymmetry so downstream OCR never has to.

What it does

Rasterises each page to a standard 300 DPI so engines see consistent character sizes.
Extracts the embedded text layer when present — short-circuiting OCR entirely for born-digital files.
Deskews, dewarps and fixes orientation, then crops scanner-bed borders.
Splits multi-page documents into a per-page stream the rest of the pipeline handles independently.
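The first two steps above come down to simple arithmetic and one decision, sketched here in Python. This is an illustration, not the repository's code: raster_size, has_usable_text_layer, plan_page and the min_chars threshold are hypothetical names and values. The only fixed fact is that PDF page sizes are measured in points, 72 to the inch.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch
TARGET_DPI = 300      # rasterisation target named in the text


def raster_size(width_pt: float, height_pt: float, dpi: int = TARGET_DPI) -> tuple:
    """Pixel dimensions of a page rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def has_usable_text_layer(embedded_text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: trust the layer if it holds enough real characters."""
    return sum(not ch.isspace() for ch in embedded_text) >= min_chars


@dataclass
class PagePlan:
    size_px: tuple   # (width, height) of the shared page image
    needs_ocr: bool  # False when the embedded text layer is good enough


def plan_page(width_pt: float, height_pt: float, embedded_text: str) -> PagePlan:
    """Decide the raster size and whether OCR can be skipped for one page."""
    return PagePlan(
        size_px=raster_size(width_pt, height_pt),
        needs_ocr=not has_usable_text_layer(embedded_text),
    )
```

An A4 page (595 × 842 pt) comes out at 2479 × 3508 pixels. A page with no embedded text is flagged for OCR, while a born-digital page with a real text layer is not.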

Unique advantage

Free text wins. A born-digital PDF skips OCR — instant, perfect, error-free.
Consistent input. Every later model can assume an upright, sane-DPI image.
One source of truth. The image rendered here is reused by every later worker — no double rasterisation.

Pass 02 · Tesseract

The fast baseline

Tesseract is the workhorse: fast, CPU-friendly, 100+ languages, and structured output — word boxes, line boxes, per-word confidence. For the long tail of clean office documents it solves the problem outright.

What it does

Runs the LSTM recogniser over each page and emits hOCR / TSV with words, lines and boxes.
Attaches a per-word confidence score that decides where EasyOCR needs a second pass.
Returns reading order and layout, so paragraphs reconstruct cleanly.
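To make the confidence routing concrete, here is a minimal sketch that reads Tesseract's TSV output and collects the words worth a second look. The column names (level, left, top, width, height, conf, text) are the ones Tesseract writes. The threshold of 60 and the function name low_confidence_words are assumptions for illustration, and the sample rows are made up.

```python
import csv
import io

WORD_LEVEL = 5         # in Tesseract TSV, level 5 rows are individual words
CONF_THRESHOLD = 60.0  # hypothetical cut-off; tune on your own corpus


def low_confidence_words(tsv_text: str, threshold: float = CONF_THRESHOLD) -> list:
    """Return (text, conf, (left, top, width, height)) for words below threshold."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        if int(row["level"]) != WORD_LEVEL:
            continue  # page, block, paragraph and line rows carry no word text
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        if not text or conf < 0:
            continue  # empty cells and conf == -1 are not recognised words
        if conf < threshold:
            box = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            flagged.append((text, conf, box))
    return flagged


HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]
ROWS = [
    ["1", "1", "0", "0", "0", "0", "0", "0", "2479", "3508", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "120", "200", "180", "40", "96.5", "Invoice"],
    ["5", "1", "1", "1", "1", "2", "320", "200", "90", "40", "41.2", "N0."],
    ["5", "1", "1", "1", "1", "3", "430", "200", "150", "40", "88.0", "20417"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [HEADER] + ROWS)

flagged = low_confidence_words(SAMPLE_TSV)
```

In this sample only the doubtful "N0." is flagged, along with its box, which is what a later pass would crop and re-read.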

Unique advantage

Speed and cost. Pure CPU, no GPU dependency — scales horizontally, stays cheap.
Deterministic. Same input, same output: easy to test, cache and diff in CI.
Layout-aware. Word boxes and reading order are first-class outputs.
Confidence is a routing signal. Low-confidence regions become EasyOCR's input.

Tesseract is deliberately the first OCR pass — good enough for most pages. The expensive passes only run where it falls short.

Pass 03 · Combined worker

Signatures, faces, and one decision

The combined worker is the vision lane. Rather than running three jobs that each reload the page image, allocate a tensor and warm a model, it runs three detectors over the same in-memory image in one pass — producing document-level signals that text alone can't answer.

Detector A

Signature

Most contracts only matter once countersigned. A small CNN scans for ink-like strokes that read as a handwritten signature and returns boxes plus a score.

Turns "is this contract executed?" into a boolean.
Box plus page index lets a UI jump to the signed line.
The nearest text line — the printed name — can be paired to the box.

Detector B

YuNet

YuNet is a tiny face detector built to run in real time on commodity hardware. Here it isn't about faces — it's about portraits as a document feature.

A face in the corner is a strong cue for an ID, passport or licence.
About 1 ms per crop on CPU — cheap enough to run on every page.
Presence of a face can drive redaction or special handling.

Detector C

ID or not

A binary classifier that reads the whole page and answers one question: is this an identity document? Its job is to decide which pages reach the MRZ specialist.

Avoids running MRZ on every page — most have none.
Uses YuNet's face hit and Tesseract's text as features.
A small head over a small backbone: fast, easy to calibrate.
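The idea of combining a face hit with text cues can be shown with a deliberately simple stand-in: a hand-weighted logistic score over three features. This is not the pipeline's classifier, which the text describes as a small neural head over a backbone. The weights, bias, keyword list and function names below are all hypothetical.

```python
import math
import re

# An MRZ-like line: 30 to 44 characters of A-Z, 0-9 and '<', with at least one '<'.
MRZ_LINE = re.compile(r"^(?=.*<)[A-Z0-9<]{30,44}$")
ID_KEYWORDS = ("PASSPORT", "IDENTITY CARD", "DRIVING LICENCE", "NATIONALITY")

# Hypothetical hand-set parameters; a real model would learn these.
WEIGHTS = {"face": 1.5, "mrz_lines": 2.5, "keyword": 1.0}
BIAS = -3.0


def features(face_count: int, ocr_text: str) -> dict:
    """Turn a face count and the page text into three numeric features."""
    lines = [ln.strip() for ln in ocr_text.splitlines()]
    mrz_hits = sum(1 for ln in lines if MRZ_LINE.match(ln))
    upper = ocr_text.upper()
    return {
        "face": 1.0 if face_count > 0 else 0.0,
        "mrz_lines": float(min(mrz_hits, 2)),  # cap so one feature can't dominate
        "keyword": 1.0 if any(kw in upper for kw in ID_KEYWORDS) else 0.0,
    }


def id_probability(face_count: int, ocr_text: str) -> float:
    """Logistic score in (0, 1); higher means more likely an identity document."""
    f = features(face_count, ocr_text)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in f.items())
    return 1.0 / (1.0 + math.exp(-z))
```

A page with a portrait, the word "passport" and two MRZ-shaped lines scores close to 1. A plain letter with no face scores close to 0.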

Why combined? Loading the image, colour-converting, resizing and warming a model account for most of any detector's latency. Running all three over one shared image — one process, one tight loop — collapses that overhead and keeps the annotations self-consistent, because all three saw the same pixels.
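A minimal sketch of that shared-image loop follows. It assumes each detector is a plain callable that receives the decoded image plus the annotations produced so far, so a later detector can reuse an earlier one's result. CombinedWorker, fake_load and the stub detectors are hypothetical names for illustration. Real detectors would wrap a signature CNN, YuNet and the ID classifier.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

Image = Any  # stand-in for a decoded page image, e.g. a NumPy array
Annotations = Dict[str, Dict[str, Any]]
Detector = Callable[[Image, Annotations], Dict[str, Any]]


@dataclass
class CombinedWorker:
    load_image: Callable[[str], Image]
    detectors: Dict[str, Detector] = field(default_factory=dict)

    def run(self, page_path: str) -> Annotations:
        image = self.load_image(page_path)  # decoded exactly once per page
        annotations: Annotations = {}
        for name, detect in self.detectors.items():  # insertion order is kept
            annotations[name] = detect(image, annotations)
        return annotations


# Stubs that record how often the image is loaded.
load_calls = []


def fake_load(path: str) -> Image:
    load_calls.append(path)
    return {"path": path}


worker = CombinedWorker(
    load_image=fake_load,
    detectors={
        "signature": lambda img, ann: {"signed": False, "boxes": []},
        "face": lambda img, ann: {"count": 1},
        # The ID decision reuses the face result instead of re-detecting.
        "id_or_not": lambda img, ann: {"is_id": ann["face"]["count"] > 0},
    },
)
annotations = worker.run("page-001.png")
```

Three detectors ran, but load_calls holds a single entry. Ordering the detectors also lets the ID decision consume the face hit without a second pass over the pixels.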

Pass 04 · MRZ

The passport specialist

The machine-readable zone at the foot of a passport or ID card follows a strict format defined in ICAO Doc 9303: a fixed character set (A–Z 0–9 <), fixed positions, and built-in check digits. A generic OCR engine will read it but tends to confuse O with 0 or 1 with I. This pass is purpose-built for the strip.

What it does

Runs only when the ID-or-not classifier flags the page — so it never wastes work.
Uses an MRZ-tuned recogniser and grammar that only emit valid MRZ characters.
Parses the fields: type, issuing country, surname, given names, document number, nationality, date of birth, sex, expiry.
Verifies check digits, so a mis-read field, or one altered without recomputing its check digit, is caught by arithmetic rather than by guessing.
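The check-digit arithmetic is small enough to show in full. ICAO 9303 assigns digits their own value, letters A to Z the values 10 to 35 and the filler < the value 0, multiplies by the repeating weights 7, 3, 1, and takes the sum modulo 10. The sketch below applies that to the second line of a passport-size (TD3) MRZ, using the specimen line from the ICAO documentation. The names Td3Line2 and parse_td3_line2 are hypothetical, and only three of the check digits are verified to keep the example short.

```python
import re
from dataclasses import dataclass

MRZ_CHARSET = re.compile(r"^[A-Z0-9<]+$")
WEIGHTS = (7, 3, 1)


def char_value(ch: str) -> int:
    """Digits keep their value, '<' is 0, letters A..Z map to 10..35."""
    if "0" <= ch <= "9":
        return int(ch)
    if ch == "<":
        return 0
    return ord(ch) - ord("A") + 10


def check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, modulo 10."""
    return sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field)) % 10


@dataclass
class Td3Line2:
    document_number: str
    nationality: str
    birth_date: str   # YYMMDD
    sex: str
    expiry_date: str  # YYMMDD
    checks_ok: bool


def parse_td3_line2(line: str) -> Td3Line2:
    """Slice the fixed-position fields of TD3 line 2 and verify three check digits."""
    if len(line) != 44 or not MRZ_CHARSET.match(line):
        raise ValueError("TD3 line 2 must be 44 characters of A-Z, 0-9 or '<'")
    doc, doc_cd = line[0:9], line[9]
    birth, birth_cd = line[13:19], line[19]
    expiry, expiry_cd = line[21:27], line[27]
    checks_ok = all(
        cd.isdigit() and check_digit(value) == int(cd)
        for value, cd in ((doc, doc_cd), (birth, birth_cd), (expiry, expiry_cd))
    )
    return Td3Line2(doc, line[10:13], birth, line[20], expiry, checks_ok)


SPECIMEN = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
record = parse_td3_line2(SPECIMEN)
misread = parse_td3_line2(SPECIMEN.replace("L898902C3", "L8989O2C3"))
```

The specimen passes all three checks. Swapping the 0 in the document number for a letter O, the classic OCR slip, changes the computed digit from 6 to 0, so misread.checks_ok is False.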

Unique advantage

Structured, not free text. KYC code consumes a typed record, not a string.
Self-validating. Check digits give an arithmetic consistency check that generic OCR output does not carry.
Narrow domain, high accuracy. Small character set, fixed layout — a specialist wins by a wide margin.

Pass 05 · EasyOCR

The deep-learning last mile

EasyOCR is a deep-learning OCR library: a CRAFT text detector plus a CRNN recogniser on PyTorch. Heavier than Tesseract, but it shines exactly where Tesseract struggles: low-contrast or stylized fonts, curved or rotated text, photos of receipts, and many non-Latin scripts. Here it is both the fallback and the rich-OCR pass.

What it does

Re-reads the regions where Tesseract returned low confidence — surgical, not whole-page.
Handles non-Latin scripts a given Tesseract deployment isn't configured for.
Provides a second opinion: agreement between the two engines raises confidence sharply.
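One way to turn the second opinion into a number is sketched below. The merge rule is an assumption, not the pipeline's documented behaviour. When both engines read the same text, the confidences are combined as if the two errors were independent. When they disagree, the more confident reading wins. Reading and merge_readings are hypothetical names. Confidences are taken as values between 0 and 1, so Tesseract's 0 to 100 scores would need dividing by 100 first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    text: str
    confidence: float  # normalised to the range 0.0 .. 1.0


def merge_readings(fast: Reading, deep: Reading) -> Reading:
    """Combine the fast engine's reading with the deep engine's re-read."""
    if fast.text.strip() == deep.text.strip():
        # Agreement: treat the two error chances as independent (an assumption),
        # so the combined chance of error is their product.
        combined = 1.0 - (1.0 - fast.confidence) * (1.0 - deep.confidence)
        return Reading(fast.text.strip(), combined)
    # Disagreement: keep whichever engine was more confident.
    return max(fast, deep, key=lambda r: r.confidence)


agreed = merge_readings(Reading("Total", 0.55), Reading("Total", 0.80))
disputed = merge_readings(Reading("T0tal", 0.40), Reading("Total", 0.85))
```

Two middling scores of 0.55 and 0.80 on the same word merge to about 0.91. In the disputed case the deep engine's "Total" is kept at its own 0.85.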

Unique advantage

Robustness. The CRNN recogniser copes with photos, perspective and unusual fonts.
Coverage. 80+ languages out of the box.
Targeted. Only runs where Tesseract wasn't confident, so the slow path stays small.

Cheap first, expensive only where needed — that is what keeps the pipeline fast on average without losing the long tail.

Section 03 · Routing

The decision a page makes
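The routing rules stated in the pass descriptions can be written as one small function. This is a sketch under stated assumptions: the combined worker runs on every page, the MRZ pass is an extra branch for ID pages, and EasyOCR runs only where Tesseract left low-confidence regions. The function name route and the pass labels are hypothetical.

```python
def route(has_text_layer: bool, is_id: bool, low_conf_regions: int) -> list:
    """Return the ordered list of passes one page goes through."""
    passes = ["pdf_reader"]
    if not has_text_layer:
        passes.append("tesseract")    # born-digital pages skip OCR entirely
    passes.append("combined_worker")  # assumed to run on every page
    if is_id:
        passes.append("mrz")          # conditional branch, ID pages only
    if not has_text_layer and low_conf_regions > 0:
        passes.append("easyocr")      # re-read only what Tesseract doubted
    return passes


born_digital = route(has_text_layer=True, is_id=False, low_conf_regions=0)
clean_scan = route(has_text_layer=False, is_id=False, low_conf_regions=0)
passport_photo = route(has_text_layer=False, is_id=True, low_conf_regions=2)
```

A born-digital page touches two passes, a clean scan three, and a photographed passport with doubtful regions all five.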

Section 04 · At a glance

What each pass adds

Pass	Output	Cost	When it shines
PDF reader	normalized images, text layer	very low	born-digital PDFs
Tesseract	word boxes + confidence	low · CPU	clean office documents
Signature	signed flag + boxes	low	contracts, compliance
YuNet	face boxes, count	very low	spotting portraits / IDs
ID-or-not	isID, document type	low	routing to MRZ
MRZ	typed record + check digits	medium	passports, national IDs
EasyOCR	text + confidence	high · GPU-friendly	stylized, noisy, multilingual

Section 05 · Principles

What the pipeline believes

Cheap first, expensive only when earned

Every pass exists because the one before it can't handle a specific failure mode. Tesseract carries the bulk; EasyOCR is invoked only for the words it couldn't read; MRZ only when the page is truly an ID.

Signals, not just text

OCR is necessary but rarely sufficient. The combined worker emits document-level signals — signed, portrait present, identity document — that the business logic downstream actually needs.

Specialists beat generalists in narrow domains

A passport MRZ is a tiny, strict format with check digits. A specialist parser is more accurate, and self-validating, in a way no generic OCR can be on the same strip.

Share the work

The combined worker exists because loading and preprocessing the image is the slow part. Three detectors over one in-memory image collapse that overhead and keep the annotations consistent.

Repository: github.com/dataandmore/ocr. Document set 2026-05-27. Component descriptions reflect the standard behaviour of the named libraries — Tesseract, OpenCV YuNet, EasyOCR — and the typical role of an ICAO 9303 MRZ parser. The repository was not publicly readable at the time of writing; where the implementation differs in detail, edit this document to match the source.
