The Problem: Two Parallel Content Trees
mysimulator.uk serves two full content trees: English (root) and Ukrainian (/uk/). Every simulation has a corresponding Ukrainian page. When we add a new simulation or update content on the English side, the Ukrainian version needs to stay in sync, or the site becomes inconsistent for Ukrainian users.
Initially this was a manual process. It worked for the first 30 simulations. By simulation 80 it was a bottleneck. By 150 it was untenable. The solution: a Python pipeline that detects changes in English HTML, extracts translatable text nodes, translates them, and writes back structured Ukrainian HTML.
Pipeline Overview
Detect what changed between the English index.html and the existing /uk/index.html by hashing the text content of each translatable English element and comparing it with the hash recorded for that segment when it was last translated. Only changed segments are queued for translation. Unchanged text is left as-is to avoid re-translating human-edited copy.
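The change-detection idea can be sketched roughly as follows. This is a minimal illustration, assuming each segment is hashed with the same 8-character SHA-1 prefix used in the simplified core loop, and that the hashes from the last translation run are available as a dict keyed by slot ID. The names `text_hash`, `changed_segments`, and `previous_hashes`, and the sample page text, are made up for this sketch.

```python
import hashlib


def text_hash(text: str) -> str:
    """Short content hash: first 8 hex characters of SHA-1."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def changed_segments(segments: list[dict], previous_hashes: dict[str, str]) -> list[dict]:
    """Return segments that are new or whose text differs from the recorded hash.

    `previous_hashes` maps slot ID -> hash recorded at the last translation
    run (a hypothetical structure for this sketch).
    """
    return [
        seg for seg in segments
        if previous_hashes.get(seg["id"]) != text_hash(seg["text"])
    ]


# Made-up sample data: the heading is unchanged, the paragraph is new.
segments = [
    {"id": "h1_0", "text": "Double Pendulum"},
    {"id": "p_1", "text": "Drag the bob to set the initial angle."},
]
previous = {"h1_0": text_hash("Double Pendulum")}

queue = changed_segments(segments, previous)
print([seg["id"] for seg in queue])  # ['p_1']
```

Because unchanged segments never enter the queue, any Ukrainian text a reviewer has edited by hand stays exactly as it is.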
BeautifulSoup 4 parses the source HTML. Text is extracted from h1, h2, h3, p, li, and title elements, from the content attribute of meta[name=description], and from alt attributes, producing a flat list of segments. Structure is preserved via XPath-like slot IDs.
Segments are batched and sent to the translation API in groups of 50. Technical terms — simulation names, algorithm names, proper nouns, units — are protected via a glossary and XML-tag placeholders so they pass through untranslated.
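The batching step can be sketched as below. The text does not name the translation API, so the call itself is left out. `batched` is a hypothetical helper name, and the sample segments are generated only to show the group sizes.

```python
from typing import Iterator

BATCH_SIZE = 50  # group size stated in the text


def batched(segments: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` segments."""
    for start in range(0, len(segments), size):
        yield segments[start:start + size]


# Made-up sample: 120 segments split into groups of 50, 50 and 20.
segments = [{"id": f"p_{i}", "text": f"Sentence {i}."} for i in range(120)]
batches = list(batched(segments))
print([len(batch) for batch in batches])  # [50, 50, 20]
```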
Translated segments with a confidence score below 0.85 are
written to a review_queue.json file. A human
reviewer can open this file, correct entries, and re-run the
pipeline. Reviewed segments are cached in
translation_memory.json so they're never
re-translated.
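One way to picture the review step is the sketch below. It assumes each translation result is a dict with `id`, `hash`, `source`, `target`, and `confidence` keys, and that the translation memory maps a source hash to a reviewed translation. Both shapes are assumptions for illustration, as is the function name `route_results`. The queue is written to a temporary directory here rather than next to the script.

```python
import json
import pathlib
import tempfile

CONFIDENCE_THRESHOLD = 0.85  # value stated in the text


def route_results(results: list[dict], memory: dict[str, str]) -> tuple[dict, list]:
    """Split results into accepted translations and a human review queue.

    `memory` maps source hash -> reviewed translation and always wins over
    fresh machine output (assumed behaviour for this sketch).
    """
    accepted: dict[str, str] = {}
    review_queue: list[dict] = []
    for res in results:
        if res["hash"] in memory:
            accepted[res["id"]] = memory[res["hash"]]
        elif res["confidence"] < CONFIDENCE_THRESHOLD:
            review_queue.append(res)
        else:
            accepted[res["id"]] = res["target"]
    return accepted, review_queue


# Made-up results: one confident, one uncertain, one already reviewed.
results = [
    {"id": "p_1", "hash": "aaaa1111", "source": "Reset",
     "target": "Скинути", "confidence": 0.97},
    {"id": "p_2", "hash": "bbbb2222", "source": "Step size",
     "target": "Розмір кроку", "confidence": 0.61},
    {"id": "p_3", "hash": "cccc3333", "source": "Phase space",
     "target": "Фазовий простір?", "confidence": 0.40},
]
memory = {"cccc3333": "Фазовий простір"}

accepted, review_queue = route_results(results, memory)

# Write the queue as readable UTF-8 JSON so a reviewer can edit it by hand.
queue_path = pathlib.Path(tempfile.mkdtemp()) / "review_queue.json"
queue_path.write_text(
    json.dumps(review_queue, ensure_ascii=False, indent=2), encoding="utf-8"
)
```

In a fuller pipeline the memory lookup would more likely happen before the API call, so reviewed segments are never sent for translation at all.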
Translated segments are injected back into the HTML structure using the slot IDs assigned during extraction. Relative URLs in href, src, and action attributes are rewritten so that they still resolve correctly from /uk/slug/index.html.
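A rough sketch of the URL rewriting, using only the standard library. It assumes the English page lives at /slug/index.html, the Ukrainian copy at /uk/slug/index.html, and that a rewritten URL should keep pointing at the same resource (for example shared assets). Links that ought to switch to a Ukrainian page instead are not handled here. `rewrite_url` and the example slug are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def rewrite_url(url: str, en_page: str, uk_page: str) -> str:
    """Re-express a relative URL from the English page so that it resolves
    to the same target when used from the Ukrainian page."""
    parts = urlsplit(url)
    # Absolute URLs, protocol-relative URLs, root-relative paths, and
    # fragment-only or query-only references need no change.
    if parts.scheme or parts.netloc or not parts.path or url.startswith("/"):
        return url
    # Resolve against the English page to get the site-absolute target.
    target = urlsplit(urljoin(en_page, url))
    rel = posixpath.relpath(target.path, posixpath.dirname(uk_page))
    if target.path.endswith("/") and not rel.endswith("/"):
        rel += "/"  # relpath drops trailing slashes; restore for directories
    if target.query:
        rel += "?" + target.query
    if target.fragment:
        rel += "#" + target.fragment
    return rel


EN_PAGE = "/double-pendulum/index.html"      # example slug, made up
UK_PAGE = "/uk/double-pendulum/index.html"

print(rewrite_url("../assets/app.js", EN_PAGE, UK_PAGE))  # ../../assets/app.js
print(rewrite_url("style.css", EN_PAGE, UK_PAGE))         # ../../double-pendulum/style.css
print(rewrite_url("https://example.com/x.js", EN_PAGE, UK_PAGE))  # unchanged
```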
The script inserts correct
<link rel="canonical"> and
<link rel="alternate" hreflang> tags for both
the English and Ukrainian versions, pointing each to the other.
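The tags for one page could be generated along these lines. This is a sketch under assumptions: the site is served over HTTPS, page URLs end in a trailing slash, each version is its own canonical, and both versions list both languages as alternates. The function name `alternate_links` and the example slug are made up.

```python
from html import escape

SITE = "https://mysimulator.uk"  # domain from the text; scheme and layout assumed


def alternate_links(slug: str, lang: str) -> list[str]:
    """Build canonical and hreflang <link> tags for one language version."""
    urls = {
        "en": f"{SITE}/{slug}/",
        "uk": f"{SITE}/uk/{slug}/",
    }
    # Each version is canonical for itself ...
    tags = [f'<link rel="canonical" href="{escape(urls[lang])}">']
    # ... and lists every language version, including itself, as an alternate.
    for code, url in urls.items():
        tags.append(
            f'<link rel="alternate" hreflang="{code}" href="{escape(url)}">'
        )
    return tags


for tag in alternate_links("double-pendulum", "uk"):
    print(tag)
```

Emitting the same pair of alternate tags on both pages keeps the references reciprocal, which is what ties the two versions together for search engines.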
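# translate_sims.py: simplified core loop
from bs4 import BeautifulSoup
import hashlib
import pathlib

# Elements whose text is translated. find_all() accepts a list of tag names.
TRANS_TAGS = ['h1', 'h2', 'h3', 'p', 'li', 'title']


def extract_segments(html_path: pathlib.Path) -> list[dict]:
    """Return one segment dict per translatable element in the file."""
    soup = BeautifulSoup(html_path.read_text(encoding='utf-8'), 'html.parser')
    segments = []
    for i, tag in enumerate(soup.find_all(TRANS_TAGS)):
        text = tag.get_text(strip=True)
        # Skip empty or near-empty elements (3 characters or fewer).
        if len(text) > 3:
            segments.append({
                'id': f'{tag.name}_{i}',  # slot ID: tag name + match index
                'text': text,
                'hash': hashlib.sha1(text.encode('utf-8')).hexdigest()[:8],
            })
    return segments


def apply_translations(source_html: str, translations: dict) -> str:
    """Replace the content of each translated slot and return the new HTML."""
    soup = BeautifulSoup(source_html, 'html.parser')
    # Enumerate exactly as extract_segments() does so the slot IDs line up.
    for i, tag in enumerate(soup.find_all(TRANS_TAGS)):
        key = f'{tag.name}_{i}'
        if key in translations:
            # Simplification: clear() also drops inline markup inside the tag.
            tag.clear()
            tag.append(translations[key])
    return str(soup)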
Handling Technical Terms: Glossary Protection
Physics and maths terms such as "Navier-Stokes", "Runge-Kutta", and "leaky integrate-and-fire" don't translate well and need to pass through unchanged. Our glossary approach wraps protected terms in XML-style placeholder tags before translation:
# Input: "The Runge-Kutta RK4 integrator solves stiff ODEs."
# After protection:
# "The <P1/> integrator solves stiff ODEs."
# protected_map = {'P1': 'Runge-Kutta RK4'}
# After MT:
# "Інтегратор <P1/> розв'язує жорсткі ЗДР."
# After restoration:
# "Інтегратор Runge-Kutta RK4 розв'язує жорсткі ЗДР."
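The protect and restore steps traced above can be sketched with regular expressions. This is an illustration only: the glossary entries are sample terms from the text, `protect` and `restore` are hypothetical names, and the machine-translation call is replaced by the translated string from the trace.

```python
import re

# Sample glossary entries taken from the text.
GLOSSARY = ["Runge-Kutta RK4", "Navier-Stokes", "leaky integrate-and-fire"]


def protect(text: str, glossary: list[str]) -> tuple[str, dict[str, str]]:
    """Replace glossary terms with <Pn/> placeholders; return text and mapping."""
    protected_map: dict[str, str] = {}
    if not glossary:
        return text, protected_map
    # Longest terms first, so a long term wins over a shorter one it contains.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))

    def _swap(match: re.Match) -> str:
        key = f"P{len(protected_map) + 1}"
        protected_map[key] = match.group(0)
        return f"<{key}/>"

    return pattern.sub(_swap, text), protected_map


def restore(text: str, protected_map: dict[str, str]) -> str:
    """Put the original terms back in place of their placeholders."""
    return re.sub(
        r"<(P\d+)/>",
        lambda m: protected_map.get(m.group(1), m.group(0)),
        text,
    )


source = "The Runge-Kutta RK4 integrator solves stiff ODEs."
masked, mapping = protect(source, GLOSSARY)
print(masked)   # The <P1/> integrator solves stiff ODEs.

# Stand-in for the machine-translation output, copied from the trace above.
translated = "Інтегратор <P1/> розв'язує жорсткі ЗДР."
print(restore(translated, mapping))
```

Unknown placeholders are left in place by `restore`, so a mangled tag shows up in the output instead of silently disappearing.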
Sync Checking in CI
A GitHub Actions workflow runs on every push to main. It
compares the hash manifest of English pages against the Ukrainian
manifest and reports any pages that are out of sync:
# .github/workflows/i18n-sync-check.yml (excerpt)
jobs:
  sync-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check EN/UK sync
        run: python _check_sync.py --report --fail-on-drift
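The internals of _check_sync.py are not shown here, so the following is only a guess at the kind of comparison it performs. It assumes each manifest maps a page slug to a dict of slot ID and source hash. The manifest layout and the names `find_drift` and `exit_code` are hypothetical.

```python
def find_drift(en_manifest: dict[str, dict[str, str]],
               uk_manifest: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each out-of-sync page to the slot IDs whose source hash differs."""
    drift: dict[str, list[str]] = {}
    for page, en_hashes in en_manifest.items():
        uk_hashes = uk_manifest.get(page, {})
        stale = sorted(
            slot for slot, digest in en_hashes.items()
            if uk_hashes.get(slot) != digest
        )
        if stale:
            drift[page] = stale
    return drift


def exit_code(drift: dict[str, list[str]], fail_on_drift: bool) -> int:
    """Non-zero exit status makes the CI job fail when drift is found."""
    return 1 if (drift and fail_on_drift) else 0


# Made-up manifests: one paragraph changed, one page not translated yet.
en = {
    "double-pendulum": {"h1_0": "3f2a9c01", "p_1": "77aa10ff"},
    "lorenz": {"h1_0": "c0ffee12"},
}
uk = {
    "double-pendulum": {"h1_0": "3f2a9c01", "p_1": "0000dead"},
}

drift = find_drift(en, uk)
for page, slots in drift.items():
    print(f"{page}: {', '.join(slots)}")
print("exit code:", exit_code(drift, fail_on_drift=True))
```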
Lessons learned: The hardest part wasn't the translation itself; it was preserving HTML structure faithfully. In our experience, BeautifulSoup's tree serialisation occasionally collapses self-closing tags and silently changes attribute order. We switched to a slot-based reconstruction strategy (rather than the in-tree mutation used in the simplified loop above) to get byte-stable output.
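To illustrate why splicing gives byte-stable output, here is a small sketch that replaces text at known character offsets and copies everything else verbatim. It is one possible reading of "slot-based reconstruction", not necessarily the production implementation. How slot offsets are located in the source is left out, and `splice` is a hypothetical name.

```python
def splice(source: str, replacements: list[tuple[int, int, str]]) -> str:
    """Replace [start, end) character spans in `source`.

    Everything outside the given spans is copied through untouched, so
    tag syntax and attribute order cannot change.
    """
    out: list[str] = []
    cursor = 0
    for start, end, new_text in sorted(replacements):
        if start < cursor:
            raise ValueError("overlapping slots")
        out.append(source[cursor:start])
        out.append(new_text)
        cursor = end
    out.append(source[cursor:])
    return "".join(out)


html = '<p class="lead" id="intro">Hello</p><br/><img alt="x" src="a.png">'
start = html.index("Hello")
result = splice(html, [(start, start + len("Hello"), "Привіт")])
print(result)
```

Only the slot text differs between input and output, which makes diffs of the generated Ukrainian pages easy to review.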