Coverage is the ceiling: what actually moved my retrieval numbers

I spent a week adding the retrieval upgrades everyone recommends to a grounded question-answering system — MMR for diversity, RAG-fusion, HyDE, multi-query expansion, a per-source diversity cap. Most did nothing. One made the system measurably worse. The two changes that actually moved the numbers weren't on anyone's list.

The thread connecting all of it: on a real corpus the ceiling is what the corpus contains, not how cleverly you rank it — and the only way to tell a real win from a plausible one is to measure end to end.

The ceiling is the corpus, not the ranker

My corpus is large but uneven — deep on some subjects, thin on others. No amount of re-ranking conjures a passage that isn't there. Once I scored retrieval against a golden set, the pattern was obvious: questions failed where the corpus was thin, and no ranker fixed those. The fashionable tricks were all fighting over the questions that were already answerable.

The win that looked like cleverness — and wasn't

The biggest single improvement was almost embarrassingly simple. One enormous book contributed so many passages that it flooded the top-k and crowded everything else out — so the model kept grounding its answer in that one source instead of corroborating across several. Capping how many passages any one book may contribute — to three — doubled the grounded-answer rate (26% → 52%) and lifted faithfulness from 0.48 to 0.65. One parameter, no extra model calls.

The same trick, a different corpus, the opposite result

Then I ported that cap to a second system — a public demo answering over the U.S. founding documents — fully expecting another free win. It lowered recall, from 0.91 to 0.86.

mode                     recall@5   MRR
hybrid (RRF)               0.909    0.809
hybrid + per-source cap    0.864    0.799   <- worse here

The reason is the whole point. On the first corpus the flooding book was mostly irrelevant, so capping it made room for better sources. On the founding documents the source that floods a query is usually the relevant one — ask about Federalist No. 10 and its own paragraphs should dominate. Capping it threw away the right answer. Same mechanism, opposite sign, because the corpora differ. So it ships measured and off by default — not deleted, just gated behind a number the eval can re-check.

recall@k lied to me

For a while I tuned against recall@k because it's cheap to compute. It kept disagreeing with reality. The per-book cap barely changed recall@k yet doubled grounding, because grounding depends on the mix of sources the model sees, not just whether the one relevant passage is somewhere in the list. The lesson I keep relearning: optimize the metric you actually care about, end to end, even when a cheaper proxy is sitting right there.

Two judges, because one of them lies

Faithfulness is itself scored by an LLM judge, so I run two — a strong one (Claude) and a cheap one (gpt-4o-mini) — both shown the full cited sources. On one obviously-grounded answer the cheap judge returned a flat 0.00 while the strong judge gave 1.00. That single disagreement is why the cheap judge is a counter-check, not the headline, and why CI gates on the reliable one.

What I'd tell past me

Build the eval harness before the clever retrieval. Almost every upgrade I assumed would help was either noise or corpus-specific, and a couple were negative. The MMR / RAG-fusion / HyDE / multi-query family all got rejected on this corpus — not because they're bad ideas, but because the numbers said no here. The harness is the part that compounds: it's what let the cap earn its place in one system and get benched in another, on evidence instead of vibes.