Data engineering & LLM orchestration
Usul Pipeline
A resilient, cost-optimized ingestion + LLM-translation pipeline that turns scattered source texts into a clean, structured corpus.
- 220 books processed
- 35k+ passages translated
- ~90% prompt-cache savings
- 2-tier QA fidelity gates
The domain
The pipeline ingests classical Arabic scholarly texts from several heterogeneous web sources and produces faithful English translations. Dropped or fabricated content is unacceptable in this setting, so quality gating matters as much as throughput.
01 · Problem
The problem
Building a multilingual library means ingesting hundreds of books from sources that each structure their HTML differently and fill it with footnote markers, page artifacts, and inconsistent numbering, and then translating ~35,000 passages faithfully.
LLM translation at this scale is slow, expensive, and failure-prone: API calls time out or get rate-limited, and the model occasionally drops or fabricates content. A run can take hours, so a crash three hours in must not lose work, and bad output must never silently reach the database.
02 · Approach
Approach & key decisions
Crash-safe, resumable checkpoints
Every translated passage is checkpointed with an atomic write (temp file → rename, with a backup of the previous checkpoint), so a crash or reboot resumes exactly where it stopped. Checkpoints also detect stale entries: if a re-parse changes the source text, the cached translation is invalidated and redone rather than trusted.
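A minimal sketch of the checkpoint pattern described above, not the pipeline's actual code. It assumes a single JSON checkpoint file and a SHA-256 hash of the source text as the staleness check; the function names (`save_checkpoint`, `cached_translation`) and the entry layout are hypothetical.

```python
import hashlib
import json
import os
import shutil
import tempfile
from pathlib import Path


def source_hash(text: str) -> str:
    """Fingerprint of the source passage, used to detect stale entries."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def save_checkpoint(path: Path, entries: dict) -> None:
    """Write the checkpoint atomically: temp file, fsync, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        # Keep the previous good copy as a backup before replacing it.
        shutil.copy2(path, path.with_suffix(path.suffix + ".bak"))
    # The temp file lives in the same directory so the rename stays atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(entries, tmp, ensure_ascii=False)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path: Path) -> dict:
    """Return the saved entries, or an empty dict on a first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def cached_translation(entries: dict, passage_id: str, source_text: str):
    """Return the cached translation, or None if it is missing or stale."""
    entry = entries.get(passage_id)
    if entry is None or entry["source_hash"] != source_hash(source_text):
        return None
    return entry["translation"]
```

On resume, a caller would load the checkpoint, call `cached_translation` for each passage, and only send the passages that return `None` back to the model.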
Context-aware, narration-aware translation
A sliding three-passage context window keeps terminology consistent without re-sending whole chapters, and narration chains (isnād + matn) are grouped into a single call so a hadith isn't fragmented across requests. This trades a little prompt size for materially more consistent, coherent output.
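The grouping and windowing can be sketched as follows. This is an illustration under assumptions: the `Passage` type, the `chain_id` field that marks the isnād and matn of one narration, and `build_requests` are all hypothetical names, and the context here is simply the preceding passages rather than whatever the real prompts carry.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Passage:
    passage_id: str
    arabic: str
    # Hypothetical marker: passages of one narration share a chain_id.
    chain_id: Optional[str] = None


def group_narrations(passages: List[Passage]) -> List[List[Passage]]:
    """Merge consecutive passages that share a chain_id into one unit."""
    units: List[List[Passage]] = []
    for p in passages:
        same_chain = (
            units
            and p.chain_id is not None
            and units[-1][-1].chain_id == p.chain_id
        )
        if same_chain:
            units[-1].append(p)
        else:
            units.append([p])
    return units


def build_requests(
    passages: List[Passage], window: int = 3
) -> List[Tuple[List[Passage], List[Passage]]]:
    """Pair each unit with the `window` passages that precede it."""
    requests = []
    seen: List[Passage] = []
    for unit in group_narrations(passages):
        context = seen[-window:] if window > 0 else []
        requests.append((context, unit))
        seen.extend(unit)
    return requests
```

Each `(context, unit)` pair maps to one model call: the context carries recent terminology, and the unit is translated as a whole so a narration is never split.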
Two-tier quality gating
A cheap deterministic linter catches truncation, untranslated Arabic, and house-style violations for free; a second LLM 'fidelity' pass flags omissions, additions, and distortions against the source. Flagged passages are auto-re-translated. Two tiers keep cost low while still catching the failures that matter in a high-stakes domain.
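The first tier might look something like the sketch below. The specific checks, the issue labels, and the `min_ratio` threshold are assumptions for illustration; house-style rules and the LLM fidelity pass are not shown.

```python
import re
from typing import List

# Two or more consecutive characters from the basic Arabic Unicode block.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]{2,}")
# Characters a complete English sentence is expected to end with.
TERMINAL_PUNCT = (".", "!", "?", '"', "'", ")", "]", "\u201d", "\u2019")


def lint_translation(source: str, translation: str, min_ratio: float = 0.5) -> List[str]:
    """Return a list of issue labels; an empty list means the passage passes."""
    text = translation.strip()
    if not text:
        return ["empty"]
    issues = []
    if ARABIC_RUN.search(text):
        issues.append("untranslated_arabic")
    if not text.endswith(TERMINAL_PUNCT):
        # A missing sentence ending is a cheap signal of truncation.
        issues.append("no_terminal_punctuation")
    if len(text) < min_ratio * len(source.strip()):
        # Assumed heuristic: output far shorter than the source is suspect.
        issues.append("suspiciously_short")
    return issues
```

Only passages that return an empty list would move on to the more expensive fidelity pass; anything else can be queued for re-translation without spending a second model call.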
Cost control: prompt caching + dynamic budgets
A stable system + glossary prefix is prompt-cached for roughly a 90% cost reduction on the cached portion, and the output token budget scales with input size so giant chapters don't time out while small ones stay cheap. Concurrency uses a semaphore with per-chapter windowing to avoid latency cliffs on very large sections.
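A rough sketch of the budget scaling and bounded concurrency, assuming an async `translate(text, max_tokens)` callable supplied by the caller. Every number here (the 1.5 ratio, the floor and ceiling, the four-characters-per-token estimate, the window size) is a placeholder, not a value from the pipeline, and prompt caching itself is provider-specific and not shown.

```python
import asyncio


def estimate_tokens(text: str) -> int:
    """Crude placeholder estimate: about four characters per token."""
    return max(1, len(text) // 4)


def output_token_budget(
    input_tokens: int, ratio: float = 1.5, floor: int = 512, ceiling: int = 16000
) -> int:
    """Scale the output budget with input size, clamped to [floor, ceiling]."""
    return max(floor, min(ceiling, int(input_tokens * ratio)))


async def translate_chapter(passages, translate, max_concurrency=4, window=50):
    """Translate one chapter in fixed-size windows with bounded concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(text):
        async with sem:  # at most max_concurrency calls in flight
            budget = output_token_budget(estimate_tokens(text))
            return await translate(text, budget)

    results = []
    # Windowing keeps a very large chapter from queuing every task at once.
    for start in range(0, len(passages), window):
        batch = passages[start:start + window]
        results.extend(await asyncio.gather(*(worker(t) for t in batch)))
    return results
```

Because `asyncio.gather` preserves argument order, the results line up with the input passages even though calls finish out of order.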
Three-tier glossary for terminology fidelity
A locked core glossary, domain glossaries, and per-book glossaries stack to enforce consistent translation of specialized terms — the difference between a plausible translation and a faithful one in this domain.
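One way the three tiers could stack, shown as a sketch. The precedence is an assumption: per-book entries override domain entries, and the locked core overrides both. The function name and the sample terms are illustrative only.

```python
from typing import Dict


def stack_glossaries(
    core: Dict[str, str], domain: Dict[str, str], book: Dict[str, str]
) -> Dict[str, str]:
    """Merge the tiers; later updates win, so the locked core is applied last."""
    merged = dict(domain)
    merged.update(book)  # per-book choices refine the domain glossary
    merged.update(core)  # locked core terms can never be overridden
    return merged


core = {"حديث": "hadith"}
domain = {"فقه": "jurisprudence", "حديث": "tradition"}
book = {"فقه": "fiqh"}
glossary = stack_glossaries(core, domain, book)
```

The merged mapping is what would be rendered into the stable prompt prefix, so the same term choices reach every call for a given book.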
03 · Architecture
How it fits together
04 · Results
Results
- 220 books processed and 35,000+ passages translated and uploaded to the live library.
- Roughly 90% cost reduction on the cached prompt prefix via prompt caching, with per-book cost tracked.
- Long multi-hour runs survive crashes and resume with no lost work, thanks to atomic checkpointing.
- Two independent QA tiers gate every passage before it reaches the database.
05 · Tradeoffs
Honest limitations
- There's no formal automated test suite yet — quality is currently validated by extensive manual audit scripts. Porting the highest-value audits into pytest is the obvious next step.
- The per-source parsers share a lot of copy-pasted logic; the clear refactor is a small adapter interface — which is exactly the seed of the planned open-source ingestion framework.
- It's operationally a personal pipeline; productizing it (clean SDK + CLI + docs) is on the roadmap.
06 · Next
What I'd do next
- Extract the ingestion core into a config-driven, source-agnostic open-source framework with a typed SDK and CLI.
- Add a translation-QA eval set so fidelity changes are measured, not assumed.