Breton-French OCR Pipeline — From Old Books to AI Translator

The Pipeline

Each book goes through 7 stages, from scanned PDF to training corpus.

📄

Extract

PDF → PNG at 300 DPI via PyMuPDF

→

✨

Enhance

CLAHE, DocRes (CVPR 2024), PreP-OCR

→

👁️

OCR

Bilingual extraction via VLM (GPT-5.2, Gemini, Claude)

→

🔍

Review

Automated quality control + human correction

→

📊

Evaluate

WER & CER against gold references

→

📦

Corpus

Deduplication, merge → final JSONL

→

🤖

m2m100

Fine-tuning the translation model

The Corpus

10 digitized historical books, spanning 80 years of Breton publications. Browse the corpus ↗

0 bilingual pairs

0 books

0 processed pages

1865–1944 time period

Book	Author	Year	Type	Pages	PDF
Manuel Breton-Français	Toullec	1865	Lexicon	87	📄
Colloque Français et Breton	Le Lourec	1884	Phrasebook	72	📄
Lexique Breton-Français	Normant	1902	Lexicon	71	📄
Vocabulaire Français-Breton	Le Gonidec	1919	Dictionary	313	📄
Geriadur Gallek ha Brezonek	Anonymous	1927	Lexicon	22	📄
Cours élémentaire de Breton	Roparz	1930	Textbook	31	📄
Le Français par le Breton	Le Bozec	1933	Textbook	78	📄
Yez hon Tadoù	Seite	1940	Course	96	📄
Ker Vreiz — 1er Cours de Breton	Daniel	1944	Textbook	37	📄

corpus/bozec_methode_1933.jsonl

📊 Performance Evaluation Metrics in progress

Error rates (CER / WER) per book and language, measured against manually annotated gold references, with translation evaluation based on sacrebleu and chrF2.

Book	Post-OCR — Breton		Post-OCR — French		Post AI-correction
Book	CER	WER	CER	WER	CER	WER
Toullec — Lexique	—	—	—	—	—	—
Colloque Lourec	1.1%	1.3%	1.1%	1.1%	—	—
Normant — Lexique	—	—	—	—	—	—
Le Gonidec — Vocabulaire	—	—	—	—	—	—
Geriadur — Medical Lexicon	5.2%	7.6%	5.9%	6.8%	—	—
Roparz — Elementary Course	—	—	—	—	—	—
Bozec — Method	6.6%	13.3%	4.8%	9.6%	—	—
Yez hon Tadou	—	—	—	—	—	—
Daniel — Ker Vreiz	—	—	—	—	—	—

Error Typology

Distribution of errors between silences (missing pairs) and noise (extra pairs).

Book	Post-OCR		Post AI-correction
Book	Silences	Noise	Silences	Noise
Toullec — Lexique	—	—	—	—
Colloque Lourec	92.3%	7.7%	—	—
Normant — Lexique	—	—	—	—
Le Gonidec — Vocabulaire	—	—	—	—
Geriadur — Medical Lexicon	0.5%	0.0%	—	—
Roparz — Elementary Course	—	—	—	—
Bozec — Method	83.3%	16.7%	—	—
Yez hon Tadou	—	—	—	—
Daniel — Ker Vreiz	—	—	—	—

m2m100 Training

The extracted corpus feeds the training of Breton→French neural machine translation models.

🌍 m2m100 (Meta)

Base multilingual model (418M parameters) covering 100 languages.

⚙️ madlad400 (QLoRa)

T5-style encoder-decoder architecture.

📈 Objective

Fine-tuning on our corpus to specialize Breton→French translation.

Tech Stack

🐍

Python 3.11

Modular pipeline, unified CLI

👁️

Vision LMs

GPT-5.2, Gemini 3.1, Claude Sonnet 4

🔥

PyTorch

DocRes, PreP-OCR

🤗

Hugging Face

m2m100, tokenizers, Trainer

📐

CVPR 2024

DocRes — document restoration

⚡

Batch API

Gemini Batch — 50% cost reduction

About

Morgane Bona-Pellissier is a master's student in Natural Language Processing at Université Paris Nanterre after completing a PhD in Translation Studies (University of Geneva, 2023). Her research focuses on neural machine translation and under-resourced and minoritized languages; she speaks Catalan and studies Breton.

LinkedIn ↗ Personal website (soon)