Making Arabic search actually find things

The first time I searched my own library for a word I knew was in there, it returned almost nothing. The text was full of the word — just not the way I'd typed it. That's the whole problem with Arabic search in one sentence: the same word is written many ways.

Why naïve search fails

Arabic carries optional diacritics (harakat) that mark short vowels, an elongation character (tatweel) used purely for typesetting, and several interchangeable letterforms — the alef and ya variants in particular. A passage might store a word fully vocalised; a user almost always types it bare. To a database doing exact or even tsvector matching, those are different strings.
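To make the mismatch concrete, here is a small illustration in JavaScript. The sample word (kitab, "book") and the three spellings are my own example, not data from the library; Unicode escapes are used so the code points stay unambiguous.

```javascript
// The same word stored three ways.
const bare = "\u0643\u062A\u0627\u0628";                  // kaf, ta, alef, ba
const vocalised = "\u0643\u0650\u062A\u064E\u0627\u0628"; // kasra and fatha added
const stretched = "\u0643\u062A\u0640\u0627\u0628";       // tatweel added after ta

// A naive substring test fails: the extra marks sit between the base letters.
const naiveHits = [vocalised, stretched].filter((s) => s.includes(bare));

console.log(bare.length, vocalised.length, stretched.length); // 4 6 5
console.log(naiveHits.length);                                // 0
```

Nothing about the stored strings is wrong; they simply contain characters the typed query does not.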

Normalize once, at the database layer

The fix is to compare a normalized form of both the stored text and the query: strip harakat and tatweel, fold the variant letters to a canonical form, and match on that. The important decision is where this happens. Normalizing every stored row at query time is slow and leaves nothing for an index to use. Instead I made it an IMMUTABLE SQL function (PostgreSQL only accepts immutable functions in index expressions) and built a trigram index over its output, so each row's normalized form is computed when the row is written and searches hit the index.

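-- pg_trgm provides gin_trgm_ops; enable it once per database
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- match on the normalized form, backed by a GIN trigram index
CREATE INDEX idx_passages_ar_norm
  ON passages USING gin (normalize_arabic(arabic) gin_trgm_ops);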

SELECT id, arabic
FROM passages
WHERE normalize_arabic(arabic) ILIKE '%' || normalize_arabic($1) || '%';

The original glyphs are never modified — they're preserved verbatim for display, because scholars care about the exact vocalised text. Normalization is purely a matching concern.
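The rules themselves are small enough to write out. What follows is a minimal sketch in JavaScript, not the body of the SQL function: the exact character ranges (harakat as U+064B to U+0652 plus the dagger alef U+0670, four alef variants, alef maqsura folded to ya) are my assumption about a reasonable rule set, and `matches` is a hypothetical helper.

```javascript
// Assumed rule set (illustrative; the real SQL function may differ):
//   U+064B..U+0652  harakat: tanwin, fatha, damma, kasra, shadda, sukun
//   U+0670          superscript (dagger) alef
//   U+0640          tatweel
const STRIP = /[\u064B-\u0652\u0670\u0640]/g;
const ALEF_VARIANTS = /[\u0622\u0623\u0625\u0671]/g; // madda, hamza above, hamza below, wasla
const ALEF_MAQSURA = /\u0649/g;

function normalizeArabic(text) {
  return text
    .replace(STRIP, "")               // drop marks that carry no base letter
    .replace(ALEF_VARIANTS, "\u0627") // fold to bare alef
    .replace(ALEF_MAQSURA, "\u064A"); // fold alef maqsura to ya
}

// Hypothetical helper: compare normalized forms; the stored string is never changed.
function matches(stored, query) {
  return normalizeArabic(stored).includes(normalizeArabic(query));
}

const stored = "\u0643\u0650\u062A\u064E\u0627\u0628"; // vocalised "kitab"
console.log(matches(stored, "\u0643\u062A\u0627\u0628")); // true
console.log(stored.length);                               // 6, original untouched
```

Because the function is pure (same input, same output, no external state), the same logic qualifies as IMMUTABLE when written in SQL.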

Mirror the same logic on the client

Highlighting matches in the UI needs the same normalization, or the highlight drifts off the real word. So the client carries a mirror of the normalization rules, applied only for highlight alignment. The definition of "what counts as the same word" now lives in two places, SQL and JS, and keeping those two in agreement is the part worth testing hard.
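One way to keep the highlight aligned is to normalize character by character and remember where each surviving character came from. This is a sketch under assumptions: `normalizeWithMap` and `highlightRanges` are hypothetical names, and the rule set (same code points as the usual harakat, tatweel, alef and ya handling) is illustrative rather than the library's actual client code.

```javascript
// Assumed rules, applied one code unit at a time so positions can be tracked.
const DROP = /[\u064B-\u0652\u0670\u0640]/; // harakat, dagger alef, tatweel
const FOLD = {
  "\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627", "\u0671": "\u0627",
  "\u0649": "\u064A",
};

function normalizeWithMap(text) {
  let normalized = "";
  const map = []; // map[i] = index in `text` of the character behind normalized[i]
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (DROP.test(ch)) continue;
    normalized += FOLD[ch] ?? ch;
    map.push(i);
  }
  return { normalized, map };
}

// Returns [start, end) offsets into the ORIGINAL text for each match.
function highlightRanges(text, query) {
  const { normalized, map } = normalizeWithMap(text);
  const needle = normalizeWithMap(query).normalized;
  const ranges = [];
  if (needle.length === 0) return ranges;
  let from = 0;
  while (true) {
    const hit = normalized.indexOf(needle, from);
    if (hit === -1) break;
    const start = map[hit];
    // Extend over trailing dropped marks so a final vowel is highlighted too.
    let end = map[hit + needle.length - 1] + 1;
    while (end < text.length && DROP.test(text[end])) end++;
    ranges.push([start, end]);
    from = hit + needle.length;
  }
  return ranges;
}

// "qara'a kitaban", fully vocalised second word; the query is bare "kitab".
const text = "\u0642\u0631\u0623 \u0643\u0650\u062A\u064E\u0627\u0628\u064B\u0627";
const ranges = highlightRanges(text, "\u0643\u062A\u0627\u0628");
console.log(ranges); // one range: start 4, end 11
```

The returned offsets index the original string, so the UI can wrap `text.slice(start, end)` in a highlight without ever displaying the normalized form.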

Keyword isn't enough either

Normalized matching nails exact terms, names, and citation numbers. It can't answer a conceptual question. For that there's a second, semantic path built on embeddings — and the two together (exact + conceptual) are what the library actually runs. But the unglamorous normalization layer is what made search trustworthy first.