Experiments

Measured, not assumed

Retrieval and embedding questions I'm answering on my own corpus. Each one is listed openly with its honest status; results get written up when they're real.

In progress

Embedding benchmark on the corpus

Which embedding model gives the best retrieval (recall@k, nDCG) on multilingual classical text — and is the most expensive one actually worth it?

Most projects pick an embedding model by reputation. I want a number on my own data before committing.

Planned

GraphRAG vs. vanilla RAG

Does modelling narrator/transmission relationships as a graph measurably beat flat vector retrieval for multi-hop scholarly questions?

GraphRAG is widely claimed to help; I want a reproducible, honest comparison on a real corpus rather than a vibe.
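To make "multi-hop" concrete, here is a minimal sketch of the kind of lookup a graph makes easy and a flat vector index does not: finding every transmission path between two narrators within a hop limit. The graph shape, the function name `chain_paths`, and the `narrator_*` ids are all hypothetical placeholders, not the project's actual data model.

```python
from collections import deque

def chain_paths(graph, source, target, max_hops):
    """Return every path from source to target that uses at most max_hops edges.

    graph: dict mapping a narrator id to the ids they transmitted to
    (an assumed adjacency-list layout, chosen only for illustration).
    """
    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget used up on this branch
        for nxt in graph.get(node, []):
            if nxt not in path:  # avoid cycles
                queue.append(path + [nxt])
    return paths

# Hypothetical toy graph: two independent two-hop chains from a to d.
toy_graph = {
    "narrator_a": ["narrator_b", "narrator_c"],
    "narrator_b": ["narrator_d"],
    "narrator_c": ["narrator_d"],
    "narrator_d": [],
}
two_hop = chain_paths(toy_graph, "narrator_a", "narrator_d", max_hops=2)
```

A flat retriever would have to surface both intermediate passages from similarity alone; the comparison is about whether the explicit edges actually improve answers on the golden set.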

Planned

Domain-adapted embeddings

Can an embedding model fine-tuned on Arabic/Persian religious text beat off-the-shelf models on retrieval, and by how much?

General-purpose embeddings tend to underperform on specialised, multilingual text. This tests whether adaptation pays for itself.

Planned

Quantization: recall vs. memory vs. latency

How much retrieval quality do I lose by compressing embeddings (PQ / binary) — and how much cheaper does serving get?

Storing and serving millions of full-precision vectors is expensive; the tradeoff curve should be measured, not guessed.
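As a rough illustration of the tradeoff, the sketch below uses sign-based binary quantization (one bit per dimension) with Hamming-distance search, and compares its top-k against exact dot-product search. It is pure Python for readability; a real run would use packed arrays or a vector index. The function names, the toy vectors, and the 768-dimension float32 assumption in the memory arithmetic are illustrative, not measurements from the corpus.

```python
def binarize(vector):
    """Sign-based binary code: bit i is set when vector[i] > 0."""
    code = 0
    for i, x in enumerate(vector):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

def top_k_hamming(query, vectors, k):
    q = binarize(query)
    dists = [(hamming(q, binarize(v)), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(dists)[:k]]

def top_k_exact(query, vectors, k):
    scores = [(-sum(a * b for a, b in zip(query, v)), i)
              for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores)[:k]]

def overlap_at_k(exact, approx):
    """Share of the exact top-k that the compressed search also returned."""
    return len(set(exact) & set(approx)) / len(exact)

# Memory per vector, assuming float32 storage (4 bytes per dimension).
def full_precision_bytes(dim):
    return dim * 4

def binary_bytes(dim):
    return dim // 8

# Toy data only: three 4-dimensional vectors and one query.
vectors = [
    [1.0, 0.5, -0.2, 0.3],
    [-1.0, -0.5, 0.2, -0.3],
    [0.9, 0.4, 0.1, 0.2],
]
query = [1.0, 0.6, -0.1, 0.4]
exact = top_k_exact(query, vectors, k=2)
approx = top_k_hamming(query, vectors, k=2)
```

Under the float32 assumption, a 768-dimensional vector takes 3072 bytes at full precision and 96 bytes as a binary code, a 32x reduction. What that costs in recall is exactly what the experiment has to measure on the real corpus.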

Methodology

How I run these: a fixed golden set of questions over the real corpus, retrieval scored with recall@k and nDCG, and answer quality checked for faithfulness against the cited sources. Results get written up only once the numbers are real — no cherry-picking, and the eval set is versioned so comparisons stay honest.
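For reference, a minimal sketch of the two retrieval metrics as they could be computed over a golden set. The golden-set layout (a list of dicts with `question` and `relevance` keys), the `retrieve` callable, and the choice of linear gain in nDCG are assumptions made for illustration; the actual harness may differ.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(retrieved, relevance, k):
    """relevance maps doc id -> graded gain; unlisted docs count as 0.

    Assumes retrieved contains no duplicate ids.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in retrieved]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def evaluate(golden_set, retrieve, k):
    """Mean recall@k and nDCG@k over a golden set (hypothetical layout)."""
    recalls, ndcgs = [], []
    for item in golden_set:
        ranked = retrieve(item["question"])
        relevant = [d for d, g in item["relevance"].items() if g > 0]
        recalls.append(recall_at_k(ranked, relevant, k))
        ndcgs.append(ndcg_at_k(ranked, item["relevance"], k))
    n = len(golden_set)
    return {"recall": sum(recalls) / n, "ndcg": sum(ndcgs) / n}
```

Keeping `retrieve` as a plain callable means the same versioned golden set can score any embedding model or index configuration without changing the scoring code.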