Automated extraction of Breton-French bilingual corpora from historical books (1865–1944) using Vision Language Models, then training neural machine translation models.
Each book goes through 7 stages, from scanned PDF to training corpus.
PDF → PNG at 300 DPI via PyMuPDF
CLAHE, DocRes (CVPR 2024), PreP-OCR
Bilingual extraction via VLM (GPT-5.2, Gemini, Claude)
Automated quality control + human correction
WER & CER against gold references
Deduplication, merge → final JSONL
Fine-tuning the translation model
10 digitized historical books, spanning 80 years of Breton publications. Browse the corpus ↗
| Book | Author | Year | Type | Pages | |
|---|---|---|---|---|---|
| Manuel Breton-Français | Toullec | 1865 | Lexicon | 87 | 📄 |
| Colloque Français et Breton | Le Lourec | 1884 | Phrasebook | 72 | 📄 |
| Lexique Breton-Français | Normant | 1902 | Lexicon | 71 | 📄 |
| Vocabulaire Français-Breton | Le Gonidec | 1919 | Dictionary | 313 | 📄 |
| Geriadur Gallek ha Brezonek | Anonymous | 1927 | Lexicon | 22 | 📄 |
| Cours élémentaire de Breton | Roparz | 1930 | Textbook | 31 | 📄 |
| Le Français par le Breton | Le Bozec | 1933 | Textbook | 78 | 📄 |
| Yez hon Tadoù | Seite | 1940 | Course | 96 | 📄 |
| Ker Vreiz — 1er Cours de Breton | Daniel | 1944 | Textbook | 37 | 📄 |
Error rates (CER / WER) per book and language, measured against manually annotated gold references.
| Book | Post-OCR — Breton | Post-OCR — French | Post AI-correction | |||
|---|---|---|---|---|---|---|
| CER | WER | CER | WER | CER | WER | |
| Toullec — Lexique | — | — | — | — | — | — |
| Colloque Lourec | 1.1% | 1.3% | 1.1% | 1.1% | — | — |
| Normant — Lexique | — | — | — | — | — | — |
| Le Gonidec — Vocabulaire | — | — | — | — | — | — |
| Geriadur — Medical Lexicon | 5.2% | 7.6% | 5.9% | 6.8% | — | — |
| Roparz — Elementary Course | — | — | — | — | — | — |
| Bozec — Method | 6.6% | 13.3% | 4.8% | 9.6% | — | — |
| Yez hon Tadou | — | — | — | — | — | — |
| Daniel — Ker Vreiz | — | — | — | — | — | — |
Distribution of errors between silences (missing pairs) and noise (extra pairs).
| Book | Post-OCR | Post AI-correction | ||
|---|---|---|---|---|
| Silences | Noise | Silences | Noise | |
| Toullec — Lexique | — | — | — | — |
| Colloque Lourec | 92.3% | 7.7% | — | — |
| Normant — Lexique | — | — | — | — |
| Le Gonidec — Vocabulaire | — | — | — | — |
| Geriadur — Medical Lexicon | 0.5% | 0.0% | — | — |
| Roparz — Elementary Course | — | — | — | — |
| Bozec — Method | 83.3% | 16.7% | — | — |
| Yez hon Tadou | — | — | — | — |
| Daniel — Ker Vreiz | — | — | — | — |
The extracted corpus feeds the training of Breton↔French neural machine translation models.
Base multilingual model (418M parameters) covering 100 languages. Fine-tuned on our corpus to specialize the Breton-French pair.
Variant already pre-trained for Breton by Loïc Grobol. Additional training with our historical data.
Build a high-performing translator for Breton, capable of handling historical spelling variations and specialized vocabulary.
Modular pipeline, unified CLI
GPT-5.2, Gemini 3.1, Claude Sonnet 4
DocRes, PreP-OCR
m2m100, tokenizers, Trainer
DocRes — document restoration
Gemini Batch — 50% cost reduction
Morgane Bona-Pellissier is a master's student in Natural Language Processing at Université Paris Nanterre after completing a PhD in Translation Studies (University of Geneva, 2023). Her research focuses on neural machine translation and under-resourced and minoritized languages; she speaks Catalan and studies Breton.