Usul Pipeline

A resilient, cost-optimized ingestion + LLM-translation pipeline that turns scattered source texts into a clean, structured corpus.

Solo — data engineering + LLM orchestration2024 — presentLive in production3 min read

Pythonasyncio / httpxClaude (Anthropic)Supabase / PostgresPrompt cachingYAML configs

Source private · walkthrough on request

The problem

Building a multilingual library means ingesting hundreds of books from sources that all structure their HTML differently, full of footnote markers, page artifacts, and inconsistent numbering — then translating ~35,000 passages faithfully.

LLM translation at this scale is slow, expensive, and failure-prone: API calls time out, rate-limit, and occasionally drop or fabricate content. A run can take hours, so a crash three hours in must not lose work, and bad output must never silently reach the database.

Approach & key decisions

Crash-safe, resumable checkpoints

Every translated passage is checkpointed with an atomic write (temp file → rename, with a backup), so a crash or reboot resumes exactly where it stopped. Checkpoints also detect stale entries: if a re-parse shifts the source text, the cached translation is invalidated and redone rather than trusted.

Context-aware, narration-aware translation

A sliding three-passage context window keeps terminology consistent without re-sending whole chapters, and narration chains (isnād + matn) are grouped into a single call so a hadith isn't fragmented across requests. This trades a little prompt size for materially more consistent, coherent output.

Two-tier quality gating

A cheap deterministic linter catches truncation, untranslated Arabic, and house-style violations for free; a second LLM 'fidelity' pass flags omissions, additions, and distortions against the source. Flagged passages are auto-re-translated. Two tiers keep cost low while still catching the failures that matter in a high-stakes domain.

Cost control: prompt caching + dynamic budgets

A stable system + glossary prefix is prompt-cached for roughly a 90% reduction on the cached portion, and the output token budget scales with input size so giant chapters don't time out while small ones stay cheap. Concurrency uses a semaphore with per-chapter windowing to avoid latency cliffs on very large sections.

Three-tier glossary for terminology fidelity

A locked core glossary, domain glossaries, and per-book glossaries stack to enforce consistent translation of specialized terms — the difference between a plausible translation and a faithful one in this domain.

How it fits together

Usul Pipeline — ingestion & translationArchitecture

Sources

heterogeneous HTML

Scrape

httpx async + semaphore

Parse

vol → chapter → passage

Translate

Claude · context window

QA gate

lint + fidelity

Upload

Supabase / Postgres

⟲ Atomic checkpoints — resumable across the translate → QA → upload stages

Every stage from translate onward writes atomic, resumable checkpoints, so a multi-hour run survives a crash and continues exactly where it stopped.

Results

220 books processed and ~35,000+ passages translated and uploaded to the live library.
Roughly 90% cost reduction on the cached prompt prefix via prompt caching, with per-book cost tracked.
Long multi-hour runs survive crashes and resume with no lost work, thanks to atomic checkpointing.
Two independent QA tiers gate every passage before it reaches the database.

Honest limitations

There's no formal automated test suite yet — quality is currently validated by extensive manual audit scripts. Porting the highest-value audits into pytest is the obvious next step.
The per-source parsers share a lot of copy-pasted logic; the clear refactor is a small adapter interface — which is exactly the seed of the planned open-source ingestion framework.
It's operationally a personal pipeline; productizing it (clean SDK + CLI + docs) is on the roadmap.