Devlog #20 — Automating the EN→UK i18n Pipeline: Python, Machine Translation & Parallel Trees

Manually translating 200+ simulation pages into Ukrainian was never going to scale. This post describes the automation pipeline we built: HTML structure diffing, segment-level machine translation, post-edit workflow, hreflang injection, and the parallel /uk/ directory tree that mirrors every English simulation page.

The Problem: Two Parallel Content Trees

mysimulator.uk serves two full content trees: English (root) and Ukrainian (/uk/). Every simulation has a corresponding Ukrainian page. When we add a new simulation or update content on the English side, the Ukrainian version needs to stay in sync — or the site becomes inconsistent for Ukrainian users.

Initially this was a manual process. It worked for the first 30 simulations. By simulation 80 it was a bottleneck. By 150 it was untenable. The solution: a Python pipeline that detects changes in English HTML, extracts translatable text nodes, translates them, and writes back structured Ukrainian HTML.

Pipeline Overview

1
Diff detection

Compare the English index.html against the existing /uk/index.html by hashing the text content of each translatable English element and checking it against the hash recorded when the Ukrainian page was last generated. Only changed segments are queued for translation; unchanged text is left as-is to avoid re-translating human-edited copy.

2
HTML parsing & text extraction

BeautifulSoup 4 parses the source HTML. Text from h1, h2, h3, p, li, and title elements, together with the content of meta[name=description] and alt attributes, is extracted as a flat list of segments. Structure is preserved via XPath-like slot IDs.

3
Machine translation

Segments are batched and sent to the translation API in groups of 50. Technical terms — simulation names, algorithm names, proper nouns, units — are protected via a glossary and XML-tag placeholders so they pass through untranslated.

4
Post-edit review queue

Translated segments with a confidence score below 0.85 are written to a review_queue.json file. A human reviewer can open this file, correct entries, and re-run the pipeline. Reviewed segments are cached in translation_memory.json so they're never re-translated.

5
HTML reconstruction

Translated segments are injected back into the HTML structure using the slot IDs from step 2. URL attributes (href, src, action) are rewritten to resolve relative paths correctly under /uk/slug/index.html.

6
Hreflang & canonical injection

The script inserts correct <link rel="canonical"> and <link rel="alternate" hreflang> tags for both the English and Ukrainian versions, pointing each to the other.
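
Step 6 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the names `alternate_links` and `inject_links`, the `BASE_URL` constant (including the https scheme), and the example slug are assumptions. It inserts the tags as plain strings before `</head>` instead of going through a parsed tree, and it assumes the page does not already carry canonical or hreflang tags.

```python
import re
from html import escape

BASE_URL = "https://mysimulator.uk"  # assumption: scheme and host


def alternate_links(slug: str, lang: str) -> str:
    """Build the canonical and hreflang <link> tags for one language version.

    slug is the page path without a language prefix, e.g. "pendulum/".
    lang is "en" or "uk" and selects which URL becomes the canonical one.
    """
    urls = {
        "en": f"{BASE_URL}/{slug}",
        "uk": f"{BASE_URL}/uk/{slug}",
    }
    lines = [f'<link rel="canonical" href="{escape(urls[lang])}">']
    # Each version lists both alternates, so EN and UK point at each other.
    for code, url in urls.items():
        lines.append(
            f'<link rel="alternate" hreflang="{code}" href="{escape(url)}">'
        )
    return "\n".join(lines)


def inject_links(html: str, slug: str, lang: str) -> str:
    """Insert the link block just before the first </head>."""
    block = alternate_links(slug, lang)
    new_html, count = re.subn(
        r"</head>",
        lambda m: block + "\n" + m.group(0),  # callable avoids escape handling
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("no </head> found")
    return new_html
```

Calling `inject_links(page, "pendulum/", "uk")` on the Ukrainian page and `inject_links(page, "pendulum/", "en")` on the English one would give each version a self-referencing canonical plus the same pair of alternates.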

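# translate_sims.py: simplified core loop
from bs4 import BeautifulSoup
import json, hashlib, pathlib

# Block-level elements whose text is translated.
TRANS_TAGS = {'h1', 'h2', 'h3', 'p', 'li', 'title'}

def extract_segments(html_path: pathlib.Path) -> list[dict]:
    """Collect translatable text from one page as a flat list of segments."""
    soup = BeautifulSoup(html_path.read_text(encoding='utf-8'), 'html.parser')
    segments = []
    # The index i counts every matching tag in document order, so IDs stay
    # aligned with apply_translations() as long as both parse the same HTML.
    for i, tag in enumerate(soup.find_all(TRANS_TAGS)):
        # Normalise whitespace without gluing words across inline tags
        # (get_text(strip=True) would turn "a <b>b</b>" into "ab").
        text = ' '.join(tag.get_text().split())
        if len(text) > 3:  # skip empty or near-empty elements
            segments.append({
                'id': f'{tag.name}_{i}',
                'text': text,
                # Short content hash used for change detection.
                'hash': hashlib.sha1(text.encode('utf-8')).hexdigest()[:8],
            })
    return segments

def apply_translations(source_html: str, translations: dict) -> str:
    """Return source_html with translated text swapped in by segment ID."""
    soup = BeautifulSoup(source_html, 'html.parser')
    for i, tag in enumerate(soup.find_all(TRANS_TAGS)):
        key = f'{tag.name}_{i}'
        if key in translations:
            # Simplification: clear() also drops inline markup (links, <em>)
            # inside the element; only the translated text is put back.
            tag.clear()
            tag.append(translations[key])
    return str(soup)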
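
To connect the two functions above with steps 1, 3, and 4, the selection and triage logic might look like the sketch below. Only the batch size of 50 and the 0.85 threshold come from the description; the manifest layout (slot ID mapped to hash), the shape of the MT results, and the names `changed_segments`, `batched`, and `triage` are assumptions for illustration. The sketch also assumes low-confidence segments are held back until reviewed, which the post does not state.

```python
from typing import Iterator

BATCH_SIZE = 50          # batch size from step 3
REVIEW_THRESHOLD = 0.85  # confidence threshold from step 4


def changed_segments(segments: list[dict], manifest: dict[str, str]) -> list[dict]:
    """Keep segments whose hash differs from the one stored in the manifest.

    manifest is assumed to map segment ID -> hash at last translation.
    Segments missing from the manifest count as changed.
    """
    return [s for s in segments if manifest.get(s["id"]) != s["hash"]]


def batched(items: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def triage(
    results: list[dict], threshold: float = REVIEW_THRESHOLD
) -> tuple[dict[str, str], list[dict]]:
    """Split MT results into accepted translations and a review queue.

    Each result is assumed to look like
    {"id": "p_3", "text": "...", "confidence": 0.91}.
    """
    accepted: dict[str, str] = {}
    review: list[dict] = []
    for result in results:
        if result["confidence"] < threshold:
            review.append(result)  # would be written to review_queue.json
        else:
            accepted[result["id"]] = result["text"]
    return accepted, review
```

The `accepted` dictionary has the same shape as the `translations` argument of `apply_translations`, so the two can be chained directly.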

Handling Technical Terms: Glossary Protection

Physics and maths terms such as "Navier-Stokes", "Runge-Kutta", and "leaky integrate-and-fire" don't translate well, so they need to pass through unchanged. Our glossary approach wraps protected terms in XML-style placeholder tags before translation:

# Input:  "The Runge-Kutta RK4 integrator solves stiff ODEs."
# After protection:
#   "The <P1/> integrator solves stiff ODEs."
#   protected_map = {'P1': 'Runge-Kutta RK4'}
# After MT:
#   "Інтегратор <P1/> розв'язує жорсткі ЗДР."
# After restoration:
#   "Інтегратор Runge-Kutta RK4 розв'язує жорсткі ЗДР."
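
The protect/restore round trip traced above could be implemented along these lines. This is a sketch under assumptions: the function names `protect` and `restore` and the contents of `GLOSSARY` are illustrative, and the real glossary format is not shown in this post.

```python
import re

# Illustrative glossary; the real list would be much longer.
GLOSSARY = [
    "Runge-Kutta RK4",
    "Runge-Kutta",
    "Navier-Stokes",
    "leaky integrate-and-fire",
]


def protect(text: str, glossary: list[str]) -> tuple[str, dict[str, str]]:
    """Replace glossary terms with <Pn/> placeholders.

    Returns the masked text and a map from placeholder name to original term.
    """
    protected_map: dict[str, str] = {}
    if not glossary:
        return text, protected_map
    # Longest terms first so "Runge-Kutta RK4" wins over "Runge-Kutta".
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))

    def swap(match: re.Match) -> str:
        key = f"P{len(protected_map) + 1}"
        protected_map[key] = match.group(0)
        return f"<{key}/>"

    return pattern.sub(swap, text), protected_map


def restore(text: str, protected_map: dict[str, str]) -> str:
    """Put the original terms back; unknown placeholders are left untouched."""
    return re.sub(
        r"<(P\d+)/>",
        lambda m: protected_map.get(m.group(1), m.group(0)),
        text,
    )
```

Because each occurrence gets its own numbered placeholder, the translation engine is free to reorder terms within the sentence, as it does in the Ukrainian example above.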

Sync Checking in CI

A GitHub Actions workflow runs on every push to main. It compares the hash manifest of English pages against the Ukrainian manifest and reports any pages that are out of sync:

# .github/workflows/i18n-sync-check.yml (excerpt)
jobs:
  sync-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check EN/UK sync
        run: python _check_sync.py --report --fail-on-drift
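
For orientation, here is a minimal sketch of what a script behind that command could do. Only the `--report` and `--fail-on-drift` flags appear in the workflow; the manifest file names, the `--en-manifest` and `--uk-manifest` options, the JSON layout (page slug mapped to hash), and the function names are assumptions, not the actual `_check_sync.py`.

```python
import argparse
import json
import pathlib


def find_drift(en_manifest: dict[str, str], uk_manifest: dict[str, str]) -> list[str]:
    """Return slugs whose English hash is missing from, or differs in, the UK manifest."""
    return sorted(
        slug
        for slug, digest in en_manifest.items()
        if uk_manifest.get(slug) != digest
    )


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Report EN/UK pages that are out of sync")
    parser.add_argument("--report", action="store_true", help="print each drifted page")
    parser.add_argument("--fail-on-drift", action="store_true", help="exit non-zero on drift")
    # Hypothetical manifest locations; the real script may store them elsewhere.
    parser.add_argument("--en-manifest", default="manifest.en.json")
    parser.add_argument("--uk-manifest", default="manifest.uk.json")
    args = parser.parse_args(argv)

    en = json.loads(pathlib.Path(args.en_manifest).read_text(encoding="utf-8"))
    uk = json.loads(pathlib.Path(args.uk_manifest).read_text(encoding="utf-8"))
    drifted = find_drift(en, uk)

    if args.report:
        for slug in drifted:
            print(f"out of sync: {slug}")
    # A non-zero exit code is what makes the CI step fail.
    return 1 if (drifted and args.fail_on_drift) else 0

# As a script, the file would end with: raise SystemExit(main())
```

Separating `--report` from `--fail-on-drift` lets the same script serve as a soft warning locally and a hard gate in CI.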

Lessons learned: The hardest part wasn't the translation itself — it was preserving HTML structure faithfully. BeautifulSoup's tree serialisation occasionally collapses self-closing tags and silently changes attribute order. We switched to a slot-based reconstruction strategy (rather than in-tree mutation) to get byte-stable output.
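
One way to picture slot-based reconstruction, as opposed to mutating the parsed tree, is a template in which each translatable span has been replaced by a marker, with everything else kept as the original bytes. The `{{slot:ID}}` marker syntax, the `fill_slots` name, and the fallback-to-English behaviour are hypothetical; the post does not show how slots are actually stored.

```python
import re
from html import escape

# Hypothetical marker syntax: {{slot:h1_0}}, {{slot:p_3}}, ...
SLOT_RE = re.compile(r"\{\{slot:([A-Za-z0-9_]+)\}\}")


def fill_slots(template: str, translations: dict[str, str], fallback: dict[str, str]) -> str:
    """Substitute slot markers with translated text.

    Text outside the markers is never parsed or re-serialised, so tags,
    attribute order, and self-closing syntax come out exactly as they went in.
    Slots without a translation fall back to the source-language text.
    """
    def replace(match: re.Match) -> str:
        slot_id = match.group(1)
        text = translations.get(slot_id, fallback[slot_id])
        # Escape &, <, > so translated text cannot break the markup.
        return escape(text, quote=False)

    return SLOT_RE.sub(replace, template)
```

Running the same template and translations through `fill_slots` twice yields identical strings, which is the property that keeps diffs between pipeline runs limited to real content changes.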