From old books to AI translator

Automated extraction of Breton-French bilingual corpora from historical books (1865–1944) using Vision Language Models, followed by training of neural machine translation models on the extracted data.

Illustration: from an old book to digital data

The Pipeline

Each book goes through 7 stages, from scanned PDF to training corpus.

📄

Extract

PDF → PNG at 300 DPI via PyMuPDF

Enhance

CLAHE, DocRes (CVPR 2024), PreP-OCR

👁️

OCR

Bilingual extraction via VLM (GPT-5.2, Gemini, Claude)

🔍

Review

Automated quality control + human correction

📊

Evaluate

WER & CER against gold references

📦

Corpus

Deduplication, merge → final JSONL

🤖

m2m100

Fine-tuning the translation model
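The Corpus stage (deduplication, then merge into a final JSONL) could look roughly like the sketch below. It assumes one JSON object per line with `br` and `fr` keys; those field names, the `normalize` helper, and `merge_corpora` are hypothetical and not taken from the project's code. Only exact duplicates (after Unicode and whitespace normalization) are dropped here; the real pipeline may use a different criterion.

```python
import json
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse whitespace so trivial variants compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def merge_corpora(paths, out_path, src_key="br", tgt_key="fr"):
    """Merge per-book JSONL files into one file, skipping duplicate pairs.

    The key names are assumptions made for this sketch.
    Returns the number of pairs written.
    """
    seen = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue  # tolerate blank lines
                    record = json.loads(line)
                    key = (normalize(record[src_key]), normalize(record[tgt_key]))
                    if key in seen:
                        continue  # pair already written from an earlier line or book
                    seen.add(key)
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    kept += 1
    return kept
```

Writing with `ensure_ascii=False` keeps accented characters such as `ù` or `é` readable in the output file instead of escaping them.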

The Corpus

10 digitized historical books, spanning 80 years of Breton publications.

1865–1944 time period
| Book | Author | Year | Type | Pages |
|---|---|---|---|---|
| Manuel Breton-Français | Toullec | 1865 | Lexicon | 87 |
| Colloque Français et Breton | Le Lourec | 1884 | Phrasebook | 72 |
| Lexique Breton-Français | Normant | 1902 | Lexicon | 71 |
| Vocabulaire Français-Breton | Le Gonidec | 1919 | Dictionary | 313 |
| Geriadur Gallek ha Brezonek | Anonymous | 1927 | Lexicon | 22 |
| Cours élémentaire de Breton | Roparz | 1930 | Textbook | 31 |
| Le Français par le Breton | Le Bozec | 1933 | Textbook | 78 |
| Yez hon Tadoù | Seite | 1940 | Course | 96 |
| Ker Vreiz: 1er Cours de Breton | Daniel | 1944 | Textbook | 37 |

corpus/bozec_methode_1933.jsonl

📊 Performance Evaluation (metrics in progress)

Error rates (CER / WER) per book and language, measured against manually annotated gold references.

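| Book | Post-OCR Breton CER | Post-OCR Breton WER | Post-OCR French CER | Post-OCR French WER | Post AI-correction CER | Post AI-correction WER |
|---|---|---|---|---|---|---|
| Toullec: Lexique | | | | | | |
| Colloque Lourec | 1.1% | 1.3% | 1.1% | 1.1% | | |
| Normant: Lexique | | | | | | |
| Le Gonidec: Vocabulaire | | | | | | |
| Geriadur: Medical Lexicon | 5.2% | 7.6% | 5.9% | 6.8% | | |
| Roparz: Elementary Course | | | | | | |
| Bozec: Method | 6.6% | 13.3% | 4.8% | 9.6% | | |
| Yez hon Tadoù | | | | | | |
| Daniel: Ker Vreiz | | | | | | |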
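As a rough illustration of how CER and WER are commonly computed (edit distance divided by the length of the gold reference, in characters or in words), here is a minimal sketch. The function names are illustrative and not taken from the project. The page does not state how text is normalized before scoring or how scores are aggregated per book, so neither is modeled here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits per reference word (whitespace tokens)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because a single wrong character makes the whole word count as an error, WER is usually higher than CER on the same text, which matches the pattern in the table above.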

Error Typology

Distribution of errors between silences (missing pairs) and noise (extra pairs).

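| Book | Post-OCR Silences | Post-OCR Noise | Post AI-correction Silences | Post AI-correction Noise |
|---|---|---|---|---|
| Toullec: Lexique | | | | |
| Colloque Lourec | 92.3% | 7.7% | | |
| Normant: Lexique | | | | |
| Le Gonidec: Vocabulaire | | | | |
| Geriadur: Medical Lexicon | 0.5% | 0.0% | | |
| Roparz: Elementary Course | | | | |
| Bozec: Method | 83.3% | 16.7% | | |
| Yez hon Tadoù | | | | |
| Daniel: Ker Vreiz | | | | |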
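One way to count silences and noise, given the definitions above (missing pairs and extra pairs), is a set comparison between the gold pairs and the extracted pairs. The sketch below assumes exact matching of `(breton, french)` tuples and reports each error type as a share of all errors; both choices are assumptions, and the project may match pairs more leniently or normalize the figures differently.

```python
def error_typology(gold_pairs, extracted_pairs):
    """Count silences (gold pairs not extracted) and noise (extracted pairs not in gold).

    Pairs are compared by exact equality, which is an assumption of this sketch.
    """
    gold = set(gold_pairs)
    extracted = set(extracted_pairs)
    silences = len(gold - extracted)
    noise = len(extracted - gold)
    total = silences + noise
    return {
        "silences": silences,
        "noise": noise,
        # Shares of all errors; defined as 0.0 when there are no errors at all.
        "silence_share": silences / total if total else 0.0,
        "noise_share": noise / total if total else 0.0,
    }
```

Under exact matching, a pair extracted with a single OCR typo counts once as a silence and once as noise, so a practical implementation would probably align near-identical pairs before counting.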

m2m100 Training

The extracted corpus feeds the training of Breton↔French neural machine translation models.

🌍 m2m100 (Meta)

Base multilingual model (418M parameters) covering 100 languages. Fine-tuned on our corpus to specialize the Breton-French pair.

⚙️ m2m100_br_fr

A variant already pre-trained for Breton by Loïc Grobol, trained further on our historical data.

📈 Objective

Build a high-performing translator for Breton, capable of handling historical spelling variations and specialized vocabulary.
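To make the setup above more concrete, here is a hedged sketch of two pieces: turning extracted pairs into examples for both translation directions, and running a single translation with the public 418M checkpoint through Hugging Face `transformers`. The example format and the helper names are hypothetical, and the page does not describe the actual training configuration. `translate` needs `transformers`, PyTorch, and `sentencepiece` installed, and downloads the model weights on first use.

```python
def to_bidirectional_examples(pairs):
    """Turn (breton, french) pairs into examples for br->fr and fr->br.

    The dictionary layout is an assumption made for this sketch.
    """
    examples = []
    for br, fr in pairs:
        examples.append({"src_lang": "br", "tgt_lang": "fr", "source": br, "target": fr})
        examples.append({"src_lang": "fr", "tgt_lang": "br", "source": fr, "target": br})
    return examples


def translate(text, src_lang="br", tgt_lang="fr", model_name="facebook/m2m100_418M"):
    """Translate one sentence with an M2M100 checkpoint."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # M2M100 selects the output language by forcing the first generated token.
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

A fine-tuned checkpoint saved locally could be passed as `model_name` in place of the base model, for example to compare outputs before and after training on the historical corpus.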

Tech Stack

🐍

Python 3.11

Modular pipeline, unified CLI

👁️

Vision LMs

GPT-5.2, Gemini 3.1, Claude Sonnet 4

🔥

PyTorch

DocRes, PreP-OCR

🤗

Hugging Face

m2m100, tokenizers, Trainer

📐

CVPR 2024

DocRes — document restoration

Batch API

Gemini Batch — 50% cost reduction

About

Morgane Bona-Pellissier is a master's student in Natural Language Processing at Université Paris Nanterre after completing a PhD in Translation Studies (University of Geneva, 2023). Her research focuses on neural machine translation and under-resourced and minoritized languages; she speaks Catalan and studies Breton.