Search & knowledge platform

Shia Library

Multilingual full-text + semantic search over a large classical-text corpus, served fast and access-controlled.

Solo (full design, build, and operation) · 2024 to present · Live in production · 3 min read

Next.js 15 · React 19 · Supabase / Postgres · pgvector · OpenAI embeddings · MUI · Vercel · Playwright

Live site · Source private · walkthrough on request

~66k: passages indexed
~11.7k: static pages
3-layer: caching + RLS
Live: shialibrary.com

The domain

The corpus is classical Twelver Shia Islamic scholarship (hadith and related texts) in Arabic with English translations — a domain where search relevance is hard (Arabic orthography) and correctness is high-stakes.

01 · Problem

The problem

Searching classical Arabic text is deceptively hard. The same word appears with or without diacritics (harakat), with variant letterforms (alef and ya variants), and with elongation marks (tatweel) — so a naïve query for a term silently misses most of its real occurrences.

At the same time, keyword search alone can't answer conceptual questions, and pure semantic search loses exact matches on names, citations, and hadith numbers that scholars depend on. The content is also sensitive: some books are access-restricted, and the full corpus must not be trivially scrapeable.

It also has to be fast and cheap to serve: a read-mostly corpus of tens of thousands of passages (~66,000) spread across ~11,700 pre-rendered pages.

02 · Approach

Approach & key decisions

Diacritic-insensitive Arabic search at the database layer

An IMMUTABLE `normalize_arabic()` SQL function strips harakat and tatweel and folds letter variants. Because the function is IMMUTABLE, Postgres can index its output, and a GIN trigram index over the normalized text keeps matching fast. A client-side mirror of the same normalization drives result highlighting, while the original glyphs are preserved verbatim for display. I chose SQL-level normalization (computed once at index time, not on every query) so search stays fast as the corpus grows.
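A minimal sketch of what such a client-side mirror could look like, not the production code. The exact characters folded by the real `normalize_arabic()` are not listed here, so the ranges below (harakat U+064B to U+0652 plus superscript alef, tatweel, alef variants folded to bare alef, alef maksura folded to ya) are assumptions, and `normalizeArabic` and `findMatchRanges` are hypothetical names.

```javascript
// Assumed folding rules; the production SQL function may differ.
const HARAKAT = /[\u064B-\u0652\u0670]/g;            // fathatan..sukun, superscript alef
const TATWEEL = /\u0640/g;                           // elongation mark
const ALEF_VARIANTS = /[\u0622\u0623\u0625\u0671]/g; // madda, hamza above/below, wasla
const ALEF_MAKSURA = /\u0649/g;                      // folded to ya (assumption)

/** @param {string} text @returns {string} */
function normalizeArabic(text) {
  return text
    .replace(HARAKAT, "")
    .replace(TATWEEL, "")
    .replace(ALEF_VARIANTS, "\u0627")
    .replace(ALEF_MAKSURA, "\u064A");
}

/**
 * Find where `query` occurs in `original`, ignoring diacritics, and return
 * [start, end) offsets into the ORIGINAL string so the untouched glyphs
 * can be wrapped for highlighting.
 * @param {string} original
 * @param {string} query
 * @returns {Array<[number, number]>}
 */
function findMatchRanges(original, query) {
  const needle = normalizeArabic(query);
  if (!needle) return [];

  // Normalize one character at a time, remembering which original index
  // each surviving character came from.
  let normalized = "";
  const origIndex = [];
  for (let i = 0; i < original.length; i++) {
    const folded = normalizeArabic(original[i]);
    if (folded) {
      normalized += folded;
      origIndex.push(i);
    }
  }

  const ranges = [];
  let from = 0;
  for (;;) {
    const hit = normalized.indexOf(needle, from);
    if (hit === -1) break;
    const start = origIndex[hit];
    let end = origIndex[hit + needle.length - 1] + 1;
    // Keep trailing diacritics attached to the last matched letter.
    while (end < original.length && normalizeArabic(original[end]) === "") end++;
    ranges.push([start, end]);
    from = hit + needle.length;
  }
  return ranges;
}
```

Returning offsets into the original string, rather than highlighting the normalized copy, is what lets the display keep its diacritics while the match itself ignores them.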

Hybrid retrieval: keyword + semantic, not either/or

Postgres `tsvector` full-text search handles exact terms and phrases; a semantic 'smart search' path requests an OpenAI embedding of the query through a Supabase Edge Function to serve conceptual queries. Pure vector search missed exact names and citation numbers, and pure keyword search missed paraphrases, so both paths exist and the one used is chosen by query intent.
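How intent is detected is not spelled out above, so the following is one hypothetical heuristic rather than the actual routing logic: quoted phrases and digits (hadith or citation numbers) signal an exact-match need, while longer or question-shaped input is treated as conceptual. The function name `chooseSearchPath` and the four-word threshold are assumptions for illustration.

```javascript
/**
 * Hypothetical intent router: decide which retrieval path a query takes.
 * The real selection logic is not described in this write-up.
 * @param {string} query
 * @returns {"keyword" | "semantic"}
 */
function chooseSearchPath(query) {
  const q = query.trim();
  const hasQuotedPhrase = /"[^"]+"/.test(q);
  const hasDigits = /[0-9\u0660-\u0669]/.test(q); // ASCII or Arabic-Indic digits
  const wordCount = q.split(/\s+/).filter(Boolean).length;
  const looksLikeQuestion = /[?\u061F]$/.test(q); // "?" or the Arabic question mark

  // Exact-match signals win: citations and quoted phrases need literal hits.
  if (hasQuotedPhrase || hasDigits) return "keyword";
  // Longer or question-shaped input is treated as a conceptual query.
  if (looksLikeQuestion || wordCount >= 4) return "semantic";
  // Short bare terms (names, single words) stay on the keyword path.
  return "keyword";
}
```

For example, `chooseSearchPath("hadith 1234")` takes the keyword path, while `chooseSearchPath("what do the narrations say about patience?")` takes the semantic one.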

Access control + anti-scrape as a first-class concern

Row-level security separates anonymous, authenticated, and admin roles; restricted books are invisible to the anon role. Bulk reads go through a service-role path with anon `SELECT` revoked, so the full corpus isn't trivially scrapeable while public pages stay cacheable.

Three layers of caching with request de-duplication

`unstable_cache` (with `React.cache` to de-dupe within a request), the client router cache, and ISR each cover a different read pattern. The semantic path is additionally cached by query for a day so repeated questions don't re-pay the embedding cost. Public fetchers never read cookies, which keeps them edge-cacheable.
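In the app this caching is done with Next.js primitives; the framework-free sketch below only illustrates the two ideas behind the semantic path's cache: key by query with a one-day lifetime, and share a single in-flight request between concurrent callers. `createQueryCache` and its options are hypothetical names, and the key normalization (trim plus lowercase) is an assumption.

```javascript
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

/**
 * Framework-free illustration of "cache by query for a day".
 * @param {(query: string) => Promise<any>} fetcher  the expensive call
 * @param {{ ttlMs?: number, now?: () => number }} [options]
 */
function createQueryCache(fetcher, { ttlMs = ONE_DAY_MS, now = Date.now } = {}) {
  /** @type {Map<string, { expires: number, promise: Promise<any> }>} */
  const entries = new Map();

  return function cachedFetch(query) {
    // Trivial key normalization so "Patience " and "patience" share an entry.
    const key = query.trim().toLowerCase();
    const hit = entries.get(key);
    if (hit && hit.expires > now()) return hit.promise;

    // Store the promise itself: concurrent callers for the same key share
    // one in-flight request instead of each paying for an embedding.
    const promise = Promise.resolve()
      .then(() => fetcher(key))
      .catch((err) => {
        // Do not keep failures cached for a day.
        if (entries.get(key)?.promise === promise) entries.delete(key);
        throw err;
      });
    entries.set(key, { expires: now() + ttlMs, promise });
    return promise;
  };
}
```

Passing `now` in as an option keeps expiry testable without waiting a real day.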

An in-app CMS, not raw database edits

A command-palette CMS edits passages in place with a full edit history and diffs, and cache invalidation by tag means an edit shows up without the passage card visibly jumping. Treating content operations as a product (auditable and zero-downtime) beats hand-editing rows.
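To make "edit history with diffs" concrete, here is a small sketch under stated assumptions: edits are append-only, each revision stores the previous text plus a word-level diff (computed with a standard longest-common-subsequence table), and the edit reports a cache tag to invalidate. The record shape, the `passage:<id>` tag format, and the names `diffWords` and `applyEdit` are hypothetical; in a Next.js app the returned tag would presumably be handed to something like `revalidateTag`.

```javascript
/**
 * Word-level diff via a longest-common-subsequence table.
 * @param {string} before
 * @param {string} after
 * @returns {Array<{ op: "same" | "del" | "add", text: string }>}
 */
function diffWords(before, after) {
  const a = before.split(/\s+/).filter(Boolean);
  const b = after.split(/\s+/).filter(Boolean);
  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push({ op: "same", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ op: "del", text: a[i++] });
    } else {
      out.push({ op: "add", text: b[j++] });
    }
  }
  while (i < a.length) out.push({ op: "del", text: a[i++] });
  while (j < b.length) out.push({ op: "add", text: b[j++] });
  return out;
}

/**
 * Append-only edit: keep the previous text, record the diff, and report
 * which cache tag should be invalidated. Does not mutate its input.
 */
function applyEdit(passage, newText, editor, at = new Date()) {
  const revision = {
    editor,
    at: at.toISOString(),
    previousText: passage.text,
    diff: diffWords(passage.text, newText),
  };
  return {
    passage: { ...passage, text: newText, history: [...passage.history, revision] },
    invalidateTag: `passage:${passage.id}`, // hypothetical tag format
  };
}
```

Keeping `previousText` alongside the diff means any revision can be restored directly, without replaying diffs.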

03 · Architecture

How it fits together

Architecture: Shia Library request & data flow

Browser (AR · EN UI) → Next.js App Router (SSG / ISR) → Vercel Edge (3-layer cache)

Retrieval: Arabic search (normalize_arabic() + GIN trigram) · English FTS (tsvector + GIN) · Semantic (edge fn → embeddings)

Supabase · Postgres + pgvector (RLS: anon / auth / admin)

Admin lane: In-app CMS (command palette) → Edit + history (full diffs) → Invalidate (by cache tag)

Reads use cached, cookie-free anon fetchers (edge-cacheable); bulk access goes through a service-role path with anonymous SELECT revoked to deter scraping.

04 · Results

Results

~66,000 passages are searchable across Arabic (diacritic-insensitive), English full-text, and semantic vector search.
~11,700 pages are statically pre-rendered, served from Vercel's edge with multi-layer caching.
47 test files (unit + accessibility + Playwright smoke) run in CI, alongside secret-scanning and a hardened Content-Security-Policy.
Roughly 260 commits of clean, prefixed history — built and operated solo.

05 · Tradeoffs

Honest limitations

It's written in JavaScript with JSDoc rather than TypeScript — a deliberate next step is migrating the data layer to TS for compile-time safety on Supabase queries.
Test coverage is strongest at the unit level; end-to-end coverage of full search→save→read flows is thinner than I'd want.
The interesting engineering here is search quality and access control, not raw scale — the corpus is tens of thousands of rows, not millions.

06 · Next

What I'd do next

Add citation-grounded RAG on top of the existing retrieval, so questions get answered with inline, verifiable provenance.
Promote the keyword + vector paths into one explicit hybrid ranker with reranking and a relevance eval set.
