Skip to content
← Work

Data engineering & LLM orchestration

Usul Pipeline

A resilient, cost-optimized ingestion + LLM-translation pipeline that turns scattered source texts into a clean, structured corpus.

Solo — data engineering + LLM orchestration2024 — presentLive in production3 min read
Pythonasyncio / httpxClaude (Anthropic)Supabase / PostgresPrompt cachingYAML configs
Source private · walkthrough on request
220
books processed
35k+
passages translated
~90%
prompt-cache savings
2-tier
QA fidelity gates

The domain

It ingests classical Arabic scholarly texts from several heterogeneous web sources and produces faithful English translations — a setting where dropped or fabricated content is unacceptable, so quality gating matters as much as throughput.

01 · Problem

The problem

Building a multilingual library means ingesting hundreds of books from sources that all structure their HTML differently, full of footnote markers, page artifacts, and inconsistent numbering — then translating ~35,000 passages faithfully.

LLM translation at this scale is slow, expensive, and failure-prone: API calls time out, rate-limit, and occasionally drop or fabricate content. A run can take hours, so a crash three hours in must not lose work, and bad output must never silently reach the database.

02 · Approach

Approach & key decisions

Crash-safe, resumable checkpoints

Every translated passage is checkpointed with an atomic write (temp file → rename, with a backup), so a crash or reboot resumes exactly where it stopped. Checkpoints also detect stale entries: if a re-parse shifts the source text, the cached translation is invalidated and redone rather than trusted.

Context-aware, narration-aware translation

A sliding three-passage context window keeps terminology consistent without re-sending whole chapters, and narration chains (isnād + matn) are grouped into a single call so a hadith isn't fragmented across requests. This trades a little prompt size for materially more consistent, coherent output.

Two-tier quality gating

A cheap deterministic linter catches truncation, untranslated Arabic, and house-style violations for free; a second LLM 'fidelity' pass flags omissions, additions, and distortions against the source. Flagged passages are auto-re-translated. Two tiers keep cost low while still catching the failures that matter in a high-stakes domain.

Cost control: prompt caching + dynamic budgets

A stable system + glossary prefix is prompt-cached for roughly a 90% reduction on the cached portion, and the output token budget scales with input size so giant chapters don't time out while small ones stay cheap. Concurrency uses a semaphore with per-chapter windowing to avoid latency cliffs on very large sections.

Three-tier glossary for terminology fidelity

A locked core glossary, domain glossaries, and per-book glossaries stack to enforce consistent translation of specialized terms — the difference between a plausible translation and a faithful one in this domain.

03 · Architecture

How it fits together

Usul Pipeline — ingestion & translationArchitecture
Sources
heterogeneous HTML
Scrape
httpx async + semaphore
Parse
vol → chapter → passage
Translate
Claude · context window
QA gate
lint + fidelity
Upload
Supabase / Postgres
⟲ Atomic checkpoints — resumable across the translate → QA → upload stages
Every stage from translate onward writes atomic, resumable checkpoints, so a multi-hour run survives a crash and continues exactly where it stopped.

04 · Results

Results

  • 220 books processed and ~35,000+ passages translated and uploaded to the live library.
  • Roughly 90% cost reduction on the cached prompt prefix via prompt caching, with per-book cost tracked.
  • Long multi-hour runs survive crashes and resume with no lost work, thanks to atomic checkpointing.
  • Two independent QA tiers gate every passage before it reaches the database.

05 · Tradeoffs

Honest limitations

  • There's no formal automated test suite yet — quality is currently validated by extensive manual audit scripts. Porting the highest-value audits into pytest is the obvious next step.
  • The per-source parsers share a lot of copy-pasted logic; the clear refactor is a small adapter interface — which is exactly the seed of the planned open-source ingestion framework.
  • It's operationally a personal pipeline; productizing it (clean SDK + CLI + docs) is on the roadmap.

06 · Next

What I'd do next

  • Extract the ingestion core into a config-driven, source-agnostic open-source framework with a typed SDK and CLI.
  • Add a translation-QA eval set so fidelity changes are measured, not assumed.