Search & knowledge platform
Shia Library
Multilingual full-text + semantic search over a large classical-text corpus, served fast and access-controlled.
- ~66k passages indexed
- ~11.7k static pages
- 3-layer caching + RLS
- Live at shialibrary.com
The domain
The corpus is classical Twelver Shia Islamic scholarship (hadith and related texts) in Arabic with English translations. Arabic orthography makes search relevance hard in this domain, and correctness is high-stakes.
01 · Problem
The problem
Searching classical Arabic text is deceptively hard. The same word appears with or without diacritics (harakat), with variant letterforms (alef and ya variants), and with elongation marks (tatweel) — so a naïve query for a term silently misses most of its real occurrences.
At the same time, keyword search alone can't answer conceptual questions, and pure semantic search loses exact matches on names, citations, and hadith numbers that scholars depend on. The content is also sensitive: some books are access-restricted, and the full corpus must not be trivially scrapeable.
It also has to be fast and cheap to serve: the corpus is read-mostly, with tens of thousands of passages spread across ~11,700 pre-rendered pages.
02 · Approach
Approach & key decisions
Diacritic-insensitive Arabic search at the database layer
An IMMUTABLE `normalize_arabic()` SQL function strips harakat and tatweel and folds letter variants, backed by a GIN trigram index so normalized matching stays fast. A client-side mirror of the same normalization drives result highlighting, while the original glyphs are preserved verbatim for display. I chose SQL-level normalization (indexed, not per-query) so search stays fast as the corpus grows.
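As a rough illustration of the client-side mirror, the sketch below folds text the same way and maps a normalized match back onto the original, fully vocalized string. The exact code-point ranges and the names `normalizeArabic` and `findMatchRange` are assumptions for this example; the real folding set has to match whatever the SQL function does.

```javascript
// Assumed folding set: harakat (plus superscript alef), tatweel,
// alef variants -> bare alef, alef maksura -> ya.
const HARAKAT = /[\u064B-\u065F\u0670]/g;
const TATWEEL = /\u0640/g;
const ALEF_VARIANTS = /[\u0622\u0623\u0625\u0671]/g;
const ALEF_MAKSURA = /\u0649/g;

/**
 * Fold Arabic text into a diacritic-insensitive form for matching.
 * @param {string} text
 * @returns {string}
 */
function normalizeArabic(text) {
  return text
    .replace(HARAKAT, "")
    .replace(TATWEEL, "")
    .replace(ALEF_VARIANTS, "\u0627")
    .replace(ALEF_MAKSURA, "\u064A");
}

/**
 * Find the first match of `query` in `original`, ignoring diacritics,
 * and return its range in ORIGINAL string indices so the untouched
 * glyphs can be wrapped in a highlight element.
 * @param {string} original
 * @param {string} query
 * @returns {{start: number, end: number} | null}
 */
function findMatchRange(original, query) {
  const normalizedQuery = normalizeArabic(query);
  if (normalizedQuery === "") return null;

  // Build the normalized text alongside a map back to original indices.
  let normalized = "";
  const sourceIndex = [];
  for (let i = 0; i < original.length; i++) {
    const folded = normalizeArabic(original[i]);
    if (folded === "") continue; // stripped mark: no normalized position
    normalized += folded;
    sourceIndex.push(i);
  }

  const at = normalized.indexOf(normalizedQuery);
  if (at === -1) return null;

  const start = sourceIndex[at];
  let end = sourceIndex[at + normalizedQuery.length - 1] + 1;
  // Extend over trailing marks so the highlight covers the whole word form.
  while (end < original.length && normalizeArabic(original[end]) === "") end++;
  return { start, end };
}
```

Calling `original.slice(range.start, range.end)` on the result yields the matched text with its diacritics intact, which is the part that gets highlighted.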
Hybrid retrieval: keyword + semantic, not either/or
Postgres `tsvector` full-text search handles exact terms and phrases; for conceptual queries, a semantic 'smart search' path requests an OpenAI embedding through a Supabase Edge Function. Pure vector search missed exact names and citation numbers, and pure keyword search missed paraphrases, so both paths exist and the query's intent decides which one runs.
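One way such intent-based routing could look is sketched below. The heuristics (digits, quoted phrases, question words, a five-word threshold) and the name `chooseSearchPath` are illustrative assumptions, not the rules the site actually uses.

```javascript
/**
 * Pick a retrieval path from the shape of the query.
 * All thresholds and patterns here are hypothetical.
 * @param {string} query
 * @returns {"keyword" | "semantic"}
 */
function chooseSearchPath(query) {
  const trimmed = query.trim();

  // Citation or hadith numbers (ASCII or Arabic-Indic digits) need exact matching.
  const hasDigits = /[0-9\u0660-\u0669]/.test(trimmed);
  // A quoted phrase signals the user wants that exact wording.
  const hasQuotedPhrase = /"[^"]+"/.test(trimmed);
  if (hasDigits || hasQuotedPhrase) return "keyword";

  // Questions and longer free-form queries lean conceptual.
  const endsWithQuestionMark = /[?\u061F]$/.test(trimmed);
  const startsWithQuestionWord = /^(what|why|how|who|when|where)\b/i.test(trimmed);
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  if (endsWithQuestionMark || startsWithQuestionWord || wordCount >= 5) {
    return "semantic";
  }

  // Short queries such as names default to the cheaper keyword path.
  return "keyword";
}
```

Defaulting to the keyword path keeps the common case free of embedding calls; only queries that look conceptual pay for the semantic route.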
Access control + anti-scrape as a first-class concern
Row-level security separates anonymous, authenticated, and admin roles; restricted books are invisible to the anon role. Bulk reads go through a service-role path with anon `SELECT` revoked, so the full corpus isn't trivially scrapeable while public pages stay cacheable.
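The visibility rules described above can be modeled in plain JavaScript, for example as a reference for tests. This is only a model: the real enforcement lives in Postgres RLS policies. The role strings, the `restricted` field, and the assumption that authenticated and admin users can see restricted books are all hypothetical.

```javascript
/**
 * Model of which books a role can see.
 * Assumption: only the anon role is blocked from restricted books.
 * @param {"anon" | "authenticated" | "admin"} role
 * @param {{id: number, restricted: boolean}[]} books
 * @returns {{id: number, restricted: boolean}[]}
 */
function visibleBooks(role, books) {
  switch (role) {
    case "anon":
      return books.filter((book) => !book.restricted);
    case "authenticated":
    case "admin":
      return books;
    default:
      // Unknown roles see nothing rather than everything.
      return [];
  }
}

/**
 * Model of the bulk-read gate: only the server-side service role
 * may read the corpus in bulk; anon has SELECT revoked on that path.
 * @param {string} role
 * @returns {boolean}
 */
function canBulkRead(role) {
  return role === "service_role";
}
```

Failing closed on an unrecognized role mirrors how RLS behaves when no policy matches: no rows are returned.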
Three layers of caching with request de-duplication
`unstable_cache` (with `React.cache` to de-dupe within a request), the client router cache, and ISR each cover a different read pattern. The semantic path is additionally cached by query for a day so repeated questions don't re-pay the embedding cost. Public fetchers never read cookies, which keeps them edge-cacheable.
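The day-long cache on the semantic path, together with de-duplication of identical lookups, can be illustrated with a small in-memory stand-in. This is not Next.js's `unstable_cache` API; `createQueryCache`, its options, and the lowercase-and-trim cache key are assumptions made for the sketch.

```javascript
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

/**
 * Wrap an expensive lookup (e.g. an embedding call) so that identical
 * queries share one in-flight promise and one cached result per TTL window.
 * @param {(key: string) => unknown} fetcher
 * @param {{ttlMs?: number, now?: () => number}} [options]
 * @returns {(query: string) => Promise<unknown>}
 */
function createQueryCache(fetcher, { ttlMs = ONE_DAY_MS, now = Date.now } = {}) {
  /** @type {Map<string, {promise: Promise<unknown>, expiresAt: number}>} */
  const entries = new Map();

  return function cachedLookup(query) {
    // Hypothetical key: trimmed and lowercased query text.
    const key = query.trim().toLowerCase();

    const hit = entries.get(key);
    if (hit && hit.expiresAt > now()) return hit.promise;

    // Storing the promise (not the value) de-duplicates concurrent callers.
    const promise = new Promise((resolve) => resolve(fetcher(key)));
    entries.set(key, { promise, expiresAt: now() + ttlMs });

    // Do not keep failures cached for a whole day.
    promise.catch(() => {
      if (entries.get(key)?.promise === promise) entries.delete(key);
    });
    return promise;
  };
}
```

Injecting `now` makes expiry testable without waiting, and evicting rejected promises means a transient upstream failure is retried on the next request.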
An in-app CMS, not raw database edits
A command-palette CMS edits passages in place with a full edit history and diffs, and cache invalidation by tag means edits apply with no visible card jump. Treating content operations as a product — auditable and zero-downtime — beats hand-editing rows.
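A minimal sketch of what one edit-history entry might contain follows. It uses a simple common-prefix/common-suffix diff, a deliberately basic stand-in for whatever diffing the CMS really performs, and the `passage:<id>` tag format and field names are hypothetical.

```javascript
/**
 * Find the single changed span between two strings by trimming the
 * shared prefix and suffix. Good enough for small in-place edits.
 * @param {string} before
 * @param {string} after
 * @returns {{at: number, removed: string, added: string}}
 */
function diffSpan(before, after) {
  const shortest = Math.min(before.length, after.length);

  let prefix = 0;
  while (prefix < shortest && before[prefix] === after[prefix]) prefix++;

  let suffix = 0;
  while (
    suffix < shortest - prefix &&
    before[before.length - 1 - suffix] === after[after.length - 1 - suffix]
  ) {
    suffix++;
  }

  return {
    at: prefix,
    removed: before.slice(prefix, before.length - suffix),
    added: after.slice(prefix, after.length - suffix),
  };
}

/**
 * Append an auditable entry to the history and report which cache tag
 * should be invalidated so readers see the edit.
 * @param {object[]} history
 * @param {{passageId: number, before: string, after: string, editor: string, at?: Date}} edit
 * @returns {{history: object[], invalidateTag: string}}
 */
function recordEdit(history, { passageId, before, after, editor, at = new Date() }) {
  const entry = {
    passageId,
    editor,
    at: at.toISOString(),
    diff: diffSpan(before, after),
  };
  // Return a new array so earlier history snapshots stay untouched.
  return { history: [...history, entry], invalidateTag: `passage:${passageId}` };
}
```

Keeping the diff rather than two full copies of the text makes each history entry small and directly renderable as a before/after view.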
03 · Architecture
How it fits together
[Architecture diagram with two lanes: Retrieval and Admin lane]
04 · Results
Results
- ~66,000 passages are searchable through diacritic-insensitive Arabic search, English full-text search, and semantic vector search.
- ~11,700 pages are statically pre-rendered, served from Vercel's edge with multi-layer caching.
- 47 test files (unit + accessibility + Playwright smoke) run in CI, alongside secret-scanning and a hardened Content-Security-Policy.
- Roughly 260 commits of clean, prefixed history — built and operated solo.
05 · Tradeoffs
Honest limitations
- It's written in JavaScript with JSDoc rather than TypeScript — a deliberate next step is migrating the data layer to TS for compile-time safety on Supabase queries.
- Test coverage is strongest at the unit level; end-to-end coverage of full search→save→read flows is thinner than I'd want.
- The interesting engineering here is search quality and access control, not raw scale — the corpus is tens of thousands of rows, not millions.
06 · Next
What I'd do next
- Add citation-grounded RAG on top of the existing retrieval, so questions get answered with inline, verifiable provenance.
- Promote the keyword + vector paths into one explicit hybrid ranker with reranking and a relevance eval set.