
AI / retrieval systems

Citation-Grounded RAG

A retrieval-augmented answer engine where every claim is grounded in a specific source — hybrid retrieval, verifiable citations, an abstain path, all measured by an eval harness.

Solo: retrieval, generation, eval, and UI · 2026 · Live in production · 4 min read
Next.js 16 · TypeScript · Claude (native citations) · OpenAI embeddings · BM25 + RRF · Vercel
  • 266 passages, public-domain
  • 0.91 recall@5 (hybrid)
  • 1.00 faithfulness (LLM judge)
  • 3/3 out-of-corpus questions abstained

The domain

The public demo answers over the U.S. founding documents (the Constitution, the Bill of Rights, and the Federalist Papers, all public domain), so every answer is externally checkable. The same engine is built to drop onto the Shia Library corpus, where citation and abstention matter most.

01 · Problem

The problem

Most 'chat with your docs' demos paste chunks into a prompt and hope. On high-stakes text that's dangerous: the model produces fluent, confident prose with invented '[1]' markers, and there's no way to tell a grounded answer from a hallucinated one — or to know when the honest answer is 'the sources don't say.'

I wanted the opposite defaults: every claim tied to a verifiable span, a system that declines rather than guesses, and — critically — numbers that prove it works, so changes are measurable instead of vibes.

02 · Approach

Approach & key decisions

Hybrid retrieval: BM25 + embeddings, fused with RRF

Lexical BM25 catches exact terms, names, and citation numbers ('Article I, Section 8'); dense embeddings catch paraphrase and concepts. Reciprocal Rank Fusion combines the two ranked lists without needing their scores to be comparable. The corpus sits behind a small interface, so the same pipeline runs in-process for the demo and over Postgres/pgvector in production.
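The fusion step can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name, the passage ids, and the smoothing constant `k = 60` (a commonly used default) are assumptions, since the write-up does not state them.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
// with ranks starting at 1. Only ranks are used, never the raw scores,
// which is why BM25 and cosine scores do not need to be comparable.
type DocId = string;

function reciprocalRankFusion(
  rankedLists: DocId[][],
  k = 60, // assumed smoothing constant; not stated in the write-up
): Array<{ id: DocId; score: number }> {
  const scores = new Map<DocId, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id for a stable order.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Hypothetical passage ids, best match first in each list.
const bm25Ranking: DocId[] = ["art1-sec8", "fed-10", "amend-1"];
const denseRanking: DocId[] = ["fed-10", "fed-51", "art1-sec8"];
const fused = reciprocalRankFusion([bm25Ranking, denseRanking]);
```

In this toy input, passages that appear in both lists rise above passages that only one retriever found, which is the behavior RRF is meant to give.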

Citations that can't be hallucinated

Each source is sent as an Anthropic document block with citations enabled, so the model returns text spans tied to a specific source with exact character offsets — not prompt-engineered '[1]'s. The UI renders those as inline chips that jump to the cited passage, so a reader can check any sentence in one click.
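A rough sketch of the two sides of that exchange, written as plain data shapes so it runs without the SDK. The field names follow Anthropic's documented citations format for plain-text documents (`citations: { enabled: true }` on the request, `char_location` entries on the response); the `Passage` type and both helper functions are hypothetical names introduced here for illustration.

```typescript
// Hypothetical passage shape; the write-up does not show the real one.
interface Passage {
  id: string;
  title: string;
  text: string;
}

// Request side: one document block per retrieved passage.
interface DocumentBlock {
  type: "document";
  source: { type: "text"; media_type: "text/plain"; data: string };
  title: string;
  citations: { enabled: true };
}

// Response side: a citation with character offsets into one document.
interface CharLocationCitation {
  type: "char_location";
  cited_text: string;
  document_index: number; // position of the document in the request
  start_char_index: number;
  end_char_index: number;
}

interface TextBlock {
  type: "text";
  text: string;
  citations?: CharLocationCitation[] | null;
}

interface CitedSegment {
  text: string;
  sources: Array<{ passageId: string; start: number; end: number; quote: string }>;
}

function toDocumentBlocks(passages: Passage[]): DocumentBlock[] {
  return passages.map((p) => ({
    type: "document" as const,
    source: { type: "text" as const, media_type: "text/plain" as const, data: p.text },
    title: p.title,
    citations: { enabled: true as const },
  }));
}

// Map each response text block back to the passages it cites.
function toCitedSegments(blocks: TextBlock[], passages: Passage[]): CitedSegment[] {
  return blocks.map((block) => {
    const sources: CitedSegment["sources"] = [];
    for (const c of block.citations ?? []) {
      const passage = passages[c.document_index];
      if (!passage) continue; // ignore citations outside the documents sent
      sources.push({
        passageId: passage.id,
        start: c.start_char_index,
        end: c.end_char_index,
        quote: c.cited_text,
      });
    }
    return { text: block.text, sources };
  });
}
```

Because `document_index` refers to the order of the document blocks in the request, the passages must be kept in that same order when mapping citations back to chips.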

Abstention as a feature

The prompt instructs the model to return a sentinel when the retrieved sources don't support an answer, and the UI shows 'no supported answer' instead of guessing. A retrieval-confidence gate can also pre-abstain before a generation call is spent: the top cosine similarity is roughly 0.65 for answerable questions versus roughly 0.15 for unanswerable ones, which leaves a wide margin for a threshold.
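The two abstain paths might look like the sketch below. The sentinel string, the threshold of 0.4 (picked only because it sits between the two reported clusters), and all function names are assumptions for illustration; the write-up does not give the real values.

```typescript
const ABSTAIN_SENTINEL = "NO_SUPPORTED_ANSWER"; // hypothetical sentinel string
const MIN_TOP_COSINE = 0.4; // assumed threshold between ~0.65 and ~0.15

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Best cosine similarity between the query and any passage vector.
// An empty corpus yields -1, which always falls below the threshold.
function topCosine(query: number[], passageVectors: number[][]): number {
  let best = -1;
  for (const v of passageVectors) {
    best = Math.max(best, cosineSimilarity(query, v));
  }
  return best;
}

type Decision =
  | { kind: "abstain"; reason: "low-retrieval-confidence" | "model-sentinel" }
  | { kind: "answer"; text: string };

// Gate 1: skip generation entirely when retrieval looks hopeless.
function shouldPreAbstain(top: number, threshold = MIN_TOP_COSINE): boolean {
  return top < threshold;
}

// Gate 2: honor the sentinel the prompt asks the model to emit.
function interpretModelOutput(text: string): Decision {
  return text.trim() === ABSTAIN_SENTINEL
    ? { kind: "abstain", reason: "model-sentinel" }
    : { kind: "answer", text };
}
```

Keeping the two gates separate means the cheap retrieval check can be toggled on or off without touching how the model's own abstention is handled.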

Streaming with inline citations

Answers stream token-by-token over newline-delimited JSON; each Anthropic content block becomes a segment, so citation chips attach inline exactly where they belong as they arrive, rather than all bunched at the end.
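One way to picture the wire format: each event is a single JSON object on its own line, and the client buffers partial lines until the newline arrives. The event shape below (`text`, `citation`, `done`, keyed by a segment index) is a hypothetical schema for illustration, not the project's actual protocol.

```typescript
// Hypothetical event schema: one JSON object per line.
type StreamEvent =
  | { type: "text"; segment: number; delta: string }
  | { type: "citation"; segment: number; passageId: string; start: number; end: number }
  | { type: "done" };

// JSON.stringify escapes newlines inside strings, so "\n" is a safe delimiter.
function encodeEvent(event: StreamEvent): string {
  return JSON.stringify(event) + "\n";
}

// Network chunks can end mid-line, so keep the unfinished tail in a buffer.
class NdjsonDecoder {
  private buffer = "";

  push(chunk: string): StreamEvent[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // last piece is incomplete (or empty)
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as StreamEvent);
  }
}

interface Segment {
  text: string;
  citations: Array<{ passageId: string; start: number; end: number }>;
}

// Fold one event into the segment list; assumes segments arrive in order.
function applyEvent(segments: Segment[], event: StreamEvent): Segment[] {
  if (event.type === "done") return segments;
  const next = segments.slice();
  const current = next[event.segment] ?? { text: "", citations: [] };
  next[event.segment] =
    event.type === "text"
      ? { text: current.text + event.delta, citations: current.citations }
      : {
          text: current.text,
          citations: current.citations.concat([
            { passageId: event.passageId, start: event.start, end: event.end },
          ]),
        };
  return next;
}
```

Because a citation event carries the index of the segment it belongs to, the UI can attach the chip to that segment as soon as the event arrives instead of waiting for the end of the stream.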

Measured, not asserted — an eval harness

A golden set scores retrieval (recall@k, MRR) and answer faithfulness (an LLM judge), and the recall gate runs in CI so a regression fails the build. The harness doubles as an ablation: it's how I know what dense, hybrid, rerank, and query-rewrite each contribute, and by how much.
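The retrieval half of such a harness is small enough to sketch. The definitions here are assumptions, since the write-up does not spell them out: recall@k is taken as the fraction of a question's relevant passages found in the top k, averaged over questions, and MRR as the mean reciprocal rank of the first relevant passage. The `GoldenCase` shape and the gate function are hypothetical.

```typescript
// Hypothetical golden-set record: labeled relevant ids plus what retrieval returned.
interface GoldenCase {
  question: string;
  relevant: string[]; // passage ids a correct answer needs
  retrieved: string[]; // ranked retrieval output, best first
}

function recallAtK(c: GoldenCase, k: number): number {
  if (c.relevant.length === 0) return 0;
  const top = new Set(c.retrieved.slice(0, k));
  const hits = c.relevant.filter((id) => top.has(id)).length;
  return hits / c.relevant.length;
}

// 1 / rank of the first relevant passage, or 0 if none was retrieved.
function reciprocalRank(c: GoldenCase): number {
  const relevant = new Set(c.relevant);
  const index = c.retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

function evaluateRetrieval(cases: GoldenCase[], k: number): { recall: number; mrr: number } {
  return {
    recall: mean(cases.map((c) => recallAtK(c, k))),
    mrr: mean(cases.map(reciprocalRank)),
  };
}

// CI gate: throwing makes a Node script exit non-zero, which fails the build.
function assertRecallGate(recall: number, gate: number): void {
  if (recall < gate) {
    throw new Error(`recall@k ${recall.toFixed(2)} is below the gate ${gate.toFixed(2)}`);
  }
}
```

Running the same `evaluateRetrieval` over the output of each pipeline variant (BM25 only, dense only, hybrid, hybrid plus rerank) is what turns the harness into an ablation.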

03 · Architecture

How it fits together

Figure: Citation-Grounded RAG, query & answer flow (streamed end-to-end).

Question → Hybrid retrieval: BM25 lexical (exact terms · names · §-numbers) + dense embeddings (paraphrase · concepts) → Reciprocal Rank Fusion → top-k (+ optional rerank) → Claude · document blocks, citations enabled (native cited spans, not made-up [1]s) → Streamed cited answer ([n] chips jump to source), or Abstain (no supported evidence).

Evaluation (CI-gated): recall@k · MRR (retrieval ablation) · faithfulness (LLM judge) · abstain accuracy (out-of-corpus).

Lexical and dense retrieval are fused with Reciprocal Rank Fusion; sources go to Claude as document blocks with citations enabled, so every answer span links to a verifiable source — or the engine abstains. An eval harness scores the whole path and gates CI.

04 · Results

Results

  • On a 22-question golden set (k=5): hybrid retrieval reaches recall@5 0.91 / MRR 0.81; adding an LLM reranker lifts that to 0.96 / 0.92. Answer faithfulness scores 1.00 and out-of-corpus questions abstain 3/3.
  • Native citations mean every sentence links to a verifiable source span — clicking a chip jumps to the exact passage.
  • Runs three ways from one codebase: fully offline (BM25 + a deterministic answer, no keys), dense + hybrid with an embeddings key, and live streamed cited answers with Claude.
  • CI fails the build if recall@5 regresses below the gate, so retrieval quality is protected, not assumed.

05 · Tradeoffs

Honest limitations

  • The honest ablation result: on this concept-heavy corpus, dense-alone (recall@5 0.96) edged out naïve RRF hybrid (0.91), because unweighted fusion can dilute a strong retriever. Hybrid earns its keep on keyword-, name-, and citation-heavy queries (and on the real Arabic + English corpus); the clean fix for the dilution is the reranker.
  • Reranking and query rewriting each add an LLM call, so they're quality/latency tradeoffs I left as measured, opt-in stages rather than always-on.
  • The demo corpus is 266 passages, chosen to be public and checkable, not to prove scale. The scale story lives in the Shia Library corpus this is built to plug into.
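The dilution effect in the first limitation can be seen with a small worked example. It assumes the standard RRF form with a smoothing constant of k = 60 (the write-up does not state the value used) and two hypothetical passages: A is ranked first by the dense retriever but missing from the BM25 list, while B is ranked fifth by both.

```latex
\[
\mathrm{RRF}(d) = \sum_{r \in \{\text{BM25},\ \text{dense}\}} \frac{1}{k + \mathrm{rank}_r(d)},
\qquad k = 60 \ \text{(assumed)}
\]
\[
\mathrm{RRF}(A) = \frac{1}{60 + 1} \approx 0.0164,
\qquad
\mathrm{RRF}(B) = \frac{1}{60 + 5} + \frac{1}{60 + 5} = \frac{2}{65} \approx 0.0308
\]
```

Under these assumptions B outranks A even though the stronger retriever put A first. More generally, a passage at rank r in both lists beats a passage that is first in only one list whenever 2/(60 + r) > 1/61, which holds for every r below 62.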

06 · Next

What I'd do next

  • Wire the production corpus path: hybrid FTS + pgvector + RRF over the real ~66k-passage corpus, with citations that deep-link to the source pages.
  • Promote the reranker and the retrieval-confidence pre-abstain gate into the default pipeline, behind a latency budget.