
AI infrastructure

LLM Gateway

A provider-agnostic LLM gateway whose semantic cache is proven correct — precision, false-positive rate, and a CI gate the commercial tier doesn't ship.

Solo: gateway, cache, guard, eval, observability · 2026 · In progress · 6 min read
Next.js 16 · TypeScript · Claude Haiku (intent judge) · OpenAI embeddings · OpenTelemetry gen_ai.* · Vercel
  • 1.00 cache precision (guarded)
  • 0% false-positive rate
  • 41 adversarial eval pairs
  • $0 per cache hit

The domain

Unlike the rest of this portfolio, this one isn't about any corpus — it's the infrastructure layer that sits in front of any LLM app, which is the point: it shows range beyond retrieval. The codename is "veritas"; the eval set is everyday factual questions (capitals, boiling points, refund policies), chosen so the cache's correctness is externally checkable by anyone.

01 · Problem

The problem

A semantic cache turns a paraphrase into a free, instant answer — but two questions can embed close yet need opposite answers. The classic is negation: 'is X safe?' and 'is X not safe?' sit almost on top of each other in embedding space, so a cache that fires on that serves a confident, wrong, cached answer.

The whole industry reports a cache hit rate; almost nobody reports cache-hit precision. GPTCache, for one, publishes hit-ratio and recall but no precision — so it's structurally blind to its worst failure. A team shipping an off-the-shelf semantic cache at a default threshold is running an unmeasured, likely double-digit false-positive rate, and nobody in the commercial tier proves otherwise.

02 · Approach

Approach & key decisions

Two-tier cache, scoped so a paraphrase isn't a wildcard

Tier-1 is an exact match on a scoped hash (system prompt + query + model + sampling params) — free and 100% precise. Tier-2 is embedding cosine for paraphrases. The scope key is computed before any lookup, so the same question under a different system prompt or model is a different request, not a hit — scope isolation is itself a measured guard.
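
A minimal sketch of the two-tier lookup described above, under stated assumptions. Every name here (`Scope`, `Entry`, `scopeKey`, `TwoTierCache`) is hypothetical and not taken from the project's source. The scope is serialized to a canonical string instead of being hashed, purely to keep the example dependency-free, and Tier-2 is assumed to compare embeddings only within the same scope bucket.

```typescript
// Hypothetical types and names; not the project's actual API.
interface Scope {
  system: string;
  model: string;
  params: { temperature: number; maxTokens: number };
}

interface Entry {
  query: string;
  embedding: number[];
  answer: string;
}

type Lookup =
  | { tier: "exact" | "semantic"; answer: string; score: number }
  | { tier: "miss" };

// Canonical string for the scope. A real gateway would likely hash this;
// the raw string is an assumption made to avoid any imports.
function scopeKey(s: Scope): string {
  return JSON.stringify([s.system, s.model, s.params.temperature, s.params.maxTokens]);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

class TwoTierCache {
  private threshold: number;
  private exact = new Map<string, string>();     // scope + query -> answer
  private semantic = new Map<string, Entry[]>(); // scope -> entries

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  put(scope: Scope, entry: Entry): void {
    const key = scopeKey(scope);
    this.exact.set(key + "\u0000" + entry.query, entry.answer);
    const bucket = this.semantic.get(key) ?? [];
    bucket.push(entry);
    this.semantic.set(key, bucket);
  }

  get(scope: Scope, query: string, embedding: number[]): Lookup {
    // The scope key is computed before any lookup, so a different
    // system prompt or model can never see this scope's entries.
    const key = scopeKey(scope);
    const hit = this.exact.get(key + "\u0000" + query);
    if (hit !== undefined) return { tier: "exact", answer: hit, score: 1 };

    let best: Entry | undefined;
    let bestScore = -Infinity;
    for (const e of this.semantic.get(key) ?? []) {
      const s = cosine(embedding, e.embedding);
      if (s > bestScore) {
        best = e;
        bestScore = s;
      }
    }
    if (best !== undefined && bestScore >= this.threshold) {
      return { tier: "semantic", answer: best.answer, score: bestScore };
    }
    return { tier: "miss" };
  }
}
```

In this sketch a Tier-2 result is only a candidate: the intent guard would still have to approve it before the cached answer is replayed.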

An adversarial eval that scores the metric everyone omits

A 41-pair golden set deliberately stacked with the hard cases — negations, scope-flips (today/tomorrow, adults/children, 2023/2024), and similar-but-different questions — each labelled by whether the cache should hit. The harness scores precision, recall, and false-positive rate of cache hits and sweeps a precision-recall curve, reproducing the published result that no fixed threshold separates correct from incorrect hits: on real embeddings the negations score higher than the legitimate paraphrases.
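
The scoring described above can be sketched as follows. The `ScoredPair` shape and the function names are assumptions for illustration, since the golden-set schema is not shown here. The false-positive rate is taken as wrong hits divided by all pairs that should not hit, and precision is reported as 1 when nothing hits at all, which is a convention chosen for this sketch.

```typescript
// Hypothetical shapes; the real golden-set schema is not shown in this write-up.
interface ScoredPair {
  similarity: number;    // cosine similarity between the two questions
  shouldHit: boolean;    // golden label: is a cache hit correct here?
  guardAllows?: boolean; // optional guard verdict; undefined means "no guard"
}

interface Metrics {
  threshold: number;
  precision: number;
  recall: number;
  falsePositiveRate: number;
}

function score(pairs: ScoredPair[], threshold: number): Metrics {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const p of pairs) {
    const hit = p.similarity >= threshold && (p.guardAllows ?? true);
    if (hit && p.shouldHit) tp++;
    else if (hit && !p.shouldHit) fp++;
    else if (!hit && p.shouldHit) fn++;
    else tn++;
  }
  return {
    threshold,
    // Assumed convention: with no hits at all, precision is reported as 1.
    precision: tp + fp === 0 ? 1 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    falsePositiveRate: fp + tn === 0 ? 0 : fp / (fp + tn),
  };
}

// One Metrics row per threshold gives the precision-recall curve.
function sweep(pairs: ScoredPair[], thresholds: number[]): Metrics[] {
  return thresholds.map((t) => score(pairs, t));
}
```

If a should-not-hit pair scores above a should-hit pair, no single threshold can admit the second while rejecting the first, which is the failure the sweep is meant to expose.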

A two-tier guard — the difference between a hit rate and a correct hit rate

On the top semantic candidate: a cheap, keyless deterministic check runs first (negation, antonym pairs, scope-member and number mismatches), then a Claude Haiku judge confirms intent on whatever survives. The deterministic tier catches the lexical flips for free; the judge catches the semantic ones that tier can't ('avoid learning X', swapped unit conversions). This is the same cheap-linter-then-LLM-fidelity-pass shape as my ingestion pipeline's QA gates.
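
A rough sketch of that check-then-judge shape. The word lists, function names, and the `Judge` type are all hypothetical, and the judge is injected as a plain function so the example stays keyless; in the project that role is played by Claude Haiku. Contractions such as "can't" are deliberately not treated as negations here and are left to the judge.

```typescript
type Verdict = "allow" | "block";

// The judge is injected; any async function with this shape works in the sketch.
type Judge = (cached: string, incoming: string) => Promise<Verdict>;

// Hypothetical, deliberately tiny word lists for illustration only.
const NEGATIONS = new Set(["not", "no", "never"]);
const OPPOSED_PAIRS: [string, string][] = [
  ["safe", "unsafe"],
  ["increase", "decrease"],
  ["today", "tomorrow"],
  ["adults", "children"],
];

function tokens(s: string): string[] {
  return s.toLowerCase().match(/[a-z]+|\d+(?:\.\d+)?/g) ?? [];
}

function deterministicGuard(cached: string, incoming: string): Verdict {
  const ta = tokens(cached);
  const tb = tokens(incoming);

  // Negation flip: the two questions carry a different number of negators.
  const neg = (t: string[]) => t.filter((w) => NEGATIONS.has(w)).length;
  if (neg(ta) !== neg(tb)) return "block";

  // Number mismatch: for example 2023 vs 2024.
  const nums = (t: string[]) => t.filter((w) => /^\d/.test(w)).sort().join(",");
  if (nums(ta) !== nums(tb)) return "block";

  // Antonym or scope-member flip: one side says x, the other says y instead.
  const sa = new Set(ta);
  const sb = new Set(tb);
  for (const [x, y] of OPPOSED_PAIRS) {
    const flipped =
      (sa.has(x) && sb.has(y) && !sb.has(x)) ||
      (sa.has(y) && sb.has(x) && !sb.has(y));
    if (flipped) return "block";
  }
  return "allow";
}

async function guard(cached: string, incoming: string, judge: Judge): Promise<Verdict> {
  if (deterministicGuard(cached, incoming) === "block") return "block";
  try {
    return await judge(cached, incoming);
  } catch {
    // Fail closed: a judge outage becomes a cache miss, never a wrong answer.
    return "block";
  }
}
```

The ordering matters for cost: the judge is only called for candidates the free check could not reject.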

An inverted CI gate, kept honest and keyless

Where a RAG harness gates on recall, this one gates on the false-positive rate: the build fails if the shipped pipeline would serve a wrong cached answer above a budget. CI runs keyless and deterministically by committing the golden-set embeddings and the judge's verdicts — the same committed-artifact trick my citation-grounded RAG uses for its corpus.
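
One way such a gate could look, as a sketch only. `CommittedPair`, `fpGate`, and the budget parameter are assumed names; the idea is that similarities and guard verdicts are read from committed artifacts, so the check needs no API keys and gives the same answer on every run.

```typescript
// Hypothetical shape of one committed golden-set record.
interface CommittedPair {
  similarity: number;              // precomputed from committed embeddings
  shouldHit: boolean;              // golden label
  guardVerdict: "allow" | "block"; // committed guard verdict for this pair
}

interface GateResult {
  ok: boolean;
  falsePositiveRate: number;
  message: string;
}

function fpGate(pairs: CommittedPair[], threshold: number, budget: number): GateResult {
  let falsePositives = 0;
  let negatives = 0;
  for (const p of pairs) {
    if (p.shouldHit) continue; // only should-not-hit pairs can be false positives
    negatives++;
    if (p.similarity >= threshold && p.guardVerdict === "allow") falsePositives++;
  }
  const rate = negatives === 0 ? 0 : falsePositives / negatives;
  const ok = rate <= budget;
  const pct = (x: number) => (x * 100).toFixed(1) + "%";
  return {
    ok,
    falsePositiveRate: rate,
    message: `${ok ? "PASS" : "FAIL"}: false-positive rate ${pct(rate)} (budget ${pct(budget)})`,
  };
}

// In a CI script, the caller would print result.message and exit non-zero
// when result.ok is false, which is what fails the build.
```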

Observability and resilience, measured not asserted

Every request emits an OpenTelemetry gen_ai.*-shaped event. /api/metrics returns hit rate by tier, TTFT and total latency as percentiles (not averages), dollars spent and saved, and a failover-rescued rate. A circuit breaker plus pre-first-token failover commits to a provider only once it yields its first event: a failure before that point is a rescue, and a failure after it surfaces the partial answer and the error honestly, because once part of the answer has been streamed to the client a clean failover is no longer possible.
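
The commit-on-first-event rule can be sketched with async generators. This is an illustration under assumptions: `Provider`, `GatewayEvent`, and `streamWithFailover` are hypothetical names, and the circuit breaker and metrics are left out to keep the failover logic visible.

```typescript
// Hypothetical provider interface: a name plus a streaming call.
interface Provider {
  name: string;
  stream: (prompt: string) => AsyncIterable<string>;
}

type GatewayEvent =
  | { type: "token"; provider: string; text: string }
  | { type: "error"; provider: string; message: string };

async function* streamWithFailover(
  providers: Provider[],
  prompt: string,
): AsyncGenerator<GatewayEvent> {
  let lastError = "no providers configured";
  for (const p of providers) {
    let committed = false;
    try {
      for await (const text of p.stream(prompt)) {
        committed = true; // first event received: committed to this provider
        yield { type: "token", provider: p.name, text };
      }
      return; // stream finished cleanly
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (committed) {
        // Failure after the first token: surface the error next to the
        // partial answer instead of silently switching providers.
        yield { type: "error", provider: p.name, message };
        return;
      }
      // Failure before the first token: a rescue, so try the next provider.
      lastError = message;
    }
  }
  yield { type: "error", provider: "none", message: lastError };
}
```

A caller would iterate the generator with `for await` and forward each event to the client; counting how often the second or later provider produced the first token would give a failover-rescued rate.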

03 · Architecture

How it fits together

Architecture: LLM Gateway request lifecycle

  • Request: scoped key · system · query · model · params
  • Two-tier cache, Tier-1 exact: scoped hash · free · 100% precise
  • Two-tier cache, Tier-2 semantic: embedding cosine ≥ τ
  • Two-tier cache, intent guard: deterministic flips → Haiku judge
  • Cache hit → replay: $0 · ~0 ms
  • Or miss → provider chain: anthropic → openai → mock
  • Pre-first-token failover: commit on first token, then partial + error

Evaluation (CI-gated on false-positive rate)

  • Precision · recall: cache-hit decision
  • False-positive rate: negations must not hit
  • Threshold sweep: guard dominates the curve

A scoped key gates every cache decision; Tier-1 exact and Tier-2 semantic (behind a two-tier intent guard) serve a hit at $0, or a miss streams from the provider chain with pre-first-token failover. An eval harness scores cache-hit precision and false-positive rate, and gates CI on the latter.

04 · Results

Results

  • On the 41-pair adversarial set with real OpenAI embeddings, the naïve semantic cache at threshold 0.92 has a 50% false-positive rate. The two-tier guard lifts cache-hit precision 0.37 → 0.70 (deterministic) → 1.00 (full guard) at equal recall, taking the false-positive rate to 0%.
  • The guard dominates the whole precision-recall curve — precision 1.00 / FP 0% at every threshold — so it can operate at a lower threshold (0.78) for recall 0.82 at 0% FP, where the raw cache's false-positive rate is 83%. The raw cache only reaches 0% FP at threshold 0.96, where recall collapses to 0.35.
  • Verified live end-to-end against real providers: a paraphrase is served from cache at $0 and ~0 ms, a near-identical negation is correctly blocked, and a forced provider outage fails over to the fallback before the first token (with the model correctly swapped).
  • Runs and gates green with no API keys: the gateway serves a deterministic mock, and the eval scores the cache decision over committed embeddings + guard verdicts — so CI catches a regression that would serve wrong answers, with no secrets.
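
The threshold choice in the second result above can be read as a small selection rule: among operating points within the false-positive budget, take the one with the highest recall. The sketch below assumes that rule; `OperatingPoint` and `pickOperatingPoint` are hypothetical names, and breaking ties toward the higher threshold is an assumption made so that fewer candidates reach the judge.

```typescript
interface OperatingPoint {
  threshold: number;
  recall: number;
  falsePositiveRate: number;
}

// Returns the best point within budget, or undefined if none qualifies.
function pickOperatingPoint(
  curve: OperatingPoint[],
  fpBudget: number,
): OperatingPoint | undefined {
  return curve
    .filter((p) => p.falsePositiveRate <= fpBudget)
    .sort((a, b) => b.recall - a.recall || b.threshold - a.threshold)[0];
}
```

Fed the two thresholds quoted above with a 0% budget, the raw cache (83% false positives at 0.78, 0% at 0.96) can only select 0.96 at recall 0.35, while the guarded cache (0% at both) selects 0.78 at recall 0.82.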

05 · Tradeoffs

Honest limitations

  • The cache is deliberately conservative: at a high threshold it misses some legitimate paraphrases (a cost miss, not a correctness one). The guard doesn't hurt recall — it's what lets you safely lower the threshold to recover it — but recall is still a threshold choice that trades against how many candidates reach the judge.
  • The deterministic guard is a cheap pre-filter, not the authority: it catches lexical flips and intentionally leaves ambiguous negations ('without', 'can't') to the judge. The headline 1.00 precision is the full guard (deterministic + judge); the judge fails closed, so a judge outage is a cache miss, never a wrong answer.
  • Routing is deliberately unshipped. The honest reading of the RouterArena benchmark is that most 'smart' routers don't beat a single strong model; a transparent cascade exists but is off by default, because a routing-quality eval — does it beat both always-cheap and always-strong without collapsing to one model — is the bar to clear, and that's future work.
  • The eval set is 41 pairs: enough to make the point and gate CI, not a population estimate.
  • The caches, rate limiter, and metrics ring are per-instance; a real deployment would back them with a shared store (Redis).
  • Cost figures come from a dated pricing snapshot, so they are estimates rather than exact figures.
  • The engine is complete, reviewed, and measured, but the public playground and deploy are still to come — so for now the proof lives in the eval harness and the source, not a hosted demo.

06 · Next

What I'd do next

  • Ship the interactive playground — type a query, then a paraphrase that hits at $0, then a near-identical negation that correctly misses, with an outage toggle that shows pre-first-token failover — and deploy it.
  • Grow the adversarial golden set well past 41 pairs and add a per-prompt learned threshold (vCache-style δ error bound) as a measured arm against the fixed threshold.
  • Back the cache and metrics with a shared store, and add a reproducible cost/latency benchmark so the savings story is quantified at scale.