Experiments
Measured, not assumed
Retrieval and embedding questions I'm answering on my own corpus. I'm listing them openly, each with its honest status; results get written up once they're real.
Embedding benchmark on the corpus
Which embedding model gives the best retrieval (recall@k, nDCG) on multilingual classical text — and is the most expensive one actually worth it?
Most projects pick an embedding model by reputation. I want a number on my own data before committing.
GraphRAG vs. vanilla RAG
Does modelling narrator/transmission relationships as a graph measurably beat flat vector retrieval for multi-hop scholarly questions?
GraphRAG is widely claimed to help; I want a reproducible, honest comparison on a real corpus rather than a vibe.
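To make "multi-hop" concrete, here is a minimal sketch of the kind of lookup a transmission graph answers directly: is there a chain of transmission between two narrators, and what is it? This is not a GraphRAG implementation, only an illustration of the graph side of the comparison. The narrator ids, the `(student, teacher)` edge format, and the function names are all hypothetical.

```python
from collections import deque

def build_graph(edges):
    """Adjacency list from (student, teacher) pairs: student transmits from teacher."""
    graph = {}
    for student, teacher in edges:
        graph.setdefault(student, []).append(teacher)
        graph.setdefault(teacher, [])
    return graph

def transmission_chain(graph, start, target, max_hops=4):
    """Breadth-first search for the shortest chain from start to target.

    Returns the chain as a list of narrator ids, or None if no chain
    exists within max_hops edges.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 >= max_hops:
            continue  # hop budget used up, do not expand further
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical placeholder edges, not real corpus data.
EDGES = [
    ("narrator_d", "narrator_c"),
    ("narrator_c", "narrator_b"),
    ("narrator_b", "narrator_a"),
    ("narrator_d", "narrator_e"),
]

if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(transmission_chain(graph, "narrator_d", "narrator_a"))
```

A flat vector index has to recover the same chain from whichever text chunks happen to mention each link, which is the difference the experiment is meant to quantify.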
Domain-adapted embeddings
Can an embedding model fine-tuned on Arabic/Persian religious text beat off-the-shelf models on retrieval, and by how much?
General-purpose embeddings tend to underperform on specialised, multilingual text. This tests whether adaptation pays for itself.
Quantization: recall vs. memory vs. latency
How much retrieval quality do I lose by compressing embeddings (product quantization or binary codes), and how much cheaper does serving get?
Storing and serving millions of full-precision vectors is expensive; the tradeoff curve should be measured, not guessed.
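As a rough illustration of one point on that tradeoff curve, the sketch below binarizes vectors by sign (one bit per dimension), ranks by Hamming distance, and reports how much of the exact top-k survives. It is a toy under stated assumptions: the vectors are random Gaussian stand-ins rather than real embeddings, float32 storage is assumed for the baseline, and every function name is hypothetical. Product quantization is not shown.

```python
import random

def binarize(vec):
    """Pack the sign of each dimension into one int (1 bit per dimension)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def top_k_exact(query, vectors, k):
    """Indices of the k vectors with the highest dot product (full precision)."""
    return sorted(range(len(vectors)), key=lambda i: -dot(query, vectors[i]))[:k]

def top_k_binary(query_code, codes, k):
    """Indices of the k codes with the smallest Hamming distance."""
    return sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))[:k]

def bytes_float32(dim):
    return 4 * dim           # assumed baseline: 4 bytes per dimension

def bytes_binary(dim):
    return (dim + 7) // 8    # 1 bit per dimension, rounded up to whole bytes

def mean_overlap(n=500, dim=64, n_queries=20, k=10, seed=0):
    """Mean fraction of the exact top-k that the binary search also returns."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    codes = [binarize(v) for v in vectors]
    total = 0.0
    for _ in range(n_queries):
        q = [rng.gauss(0, 1) for _ in range(dim)]
        exact = set(top_k_exact(q, vectors, k))
        approx = set(top_k_binary(binarize(q), codes, k))
        total += len(exact & approx) / k
    return total / n_queries

if __name__ == "__main__":
    dim = 64
    print("memory ratio:", bytes_float32(dim) / bytes_binary(dim))
    print("mean top-10 overlap with exact search:", mean_overlap(dim=dim))
```

The memory side is simple arithmetic (32x smaller than float32 at one bit per dimension). The quality side depends on the embedding model and the corpus, which is why it has to be measured on real vectors rather than read off a toy like this one.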
Methodology
How I run these: a fixed golden set of questions over the real corpus, retrieval scored with recall@k and nDCG, and answer quality checked for faithfulness against the cited sources. Results get written up only once the numbers are real — no cherry-picking, and the eval set is versioned so comparisons stay honest.
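For reference, the two retrieval metrics can be sketched as follows. This is a minimal version under assumptions of my own choosing for illustration: each golden-set item is a dict with a `question` and a `gains` mapping from document id to graded relevance, `search` is any callable that returns a ranked list of unique document ids, and nDCG uses the linear-gain, log2-discount form. The names are hypothetical and the faithfulness check is not covered here.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with graded gains; ids missing from `gains` count as 0.

    Assumes `retrieved` holds unique ids, otherwise a repeated hit
    would be counted more than once.
    """
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 2)
        for rank, doc in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(golden_set, search, k=10):
    """Mean recall@k and nDCG@k of `search` over a fixed golden set."""
    if not golden_set:
        raise ValueError("golden set is empty")
    recalls, ndcgs = [], []
    for item in golden_set:
        ranked = search(item["question"])
        gains = item["gains"]
        relevant = {doc for doc, gain in gains.items() if gain > 0}
        recalls.append(recall_at_k(ranked, relevant, k))
        ndcgs.append(ndcg_at_k(ranked, gains, k))
    n = len(golden_set)
    return {"recall@k": sum(recalls) / n, "ndcg@k": sum(ndcgs) / n}
```

Because `evaluate` only needs a `search` callable, the same golden set can score every embedding model, the graph retriever, and each quantized index without changing the metric code.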