
SageSearch version 3.0 uses an inverted index (term → document mapping) for fast lookups.
1. Tokenization
   - Each document is split into tokens using regex patterns:
     - English: alphanumeric sequences (`[A-Za-z0-9_]+`)
     - Japanese: Hiragana, Katakana, and Kanji Unicode ranges
   - In case-insensitive mode, tokens are lowercased.
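A minimal sketch of such a tokenizer in Python (the exact Unicode ranges, function name, and flag are illustrative assumptions, not SageSearch's actual API):

```python
import re

# Token pattern: English alphanumeric runs plus Japanese script ranges.
# The Unicode ranges below are illustrative approximations.
TOKEN_RE = re.compile(
    r"[A-Za-z0-9_]+"        # English words / identifiers
    r"|[\u3040-\u309F]+"    # Hiragana
    r"|[\u30A0-\u30FF]+"    # Katakana
    r"|[\u4E00-\u9FFF]+"    # CJK Unified Ideographs (Kanji)
)

def tokenize(text, case_insensitive=True):
    """Split text into tokens; lowercase them in case-insensitive mode."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if case_insensitive else tokens
```

For example, `tokenize("Full-Text Search")` yields `["full", "text", "search"]`; a mixed Kanji/Katakana string splits at the script boundary.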
2. Index Building
   - For each document:
     - Count the term frequency (TF) of each term.
     - Store it in vocab: term → {doc_id: frequency}.
   - Keep the full document text and metadata (path, title) in a docs array.
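The index-building loop might look like the following sketch (the document schema and function name are assumptions; a simplified English-only tokenizer stands in for the full one):

```python
import re
from collections import Counter

def build_index(documents):
    """documents: iterable of dicts with 'path', 'title', and 'text' keys
    (an assumed schema for illustration)."""
    vocab = {}   # term -> {doc_id: term frequency}
    docs = []    # full text + metadata, addressed by doc_id
    for doc_id, doc in enumerate(documents):
        docs.append(doc)
        # Simplified English-only tokenization for brevity.
        terms = re.findall(r"[a-z0-9_]+", doc["text"].lower())
        for term, tf in Counter(terms).items():
            vocab.setdefault(term, {})[doc_id] = tf
    return vocab, docs
```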
3. Scoring (TF-IDF)
   - TF (Term Frequency): how often a term appears in a document.
   - IDF (Inverse Document Frequency): log((total_docs + 1) / (docs_with_term + 0.5)); rare terms get a higher weight.
   - Score: TF × IDF per term, summed across query terms.
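The IDF formula above translates directly into code; a small sketch (function names are illustrative):

```python
import math

def idf(total_docs, docs_with_term):
    # The +1 and +0.5 smoothing terms keep the ratio positive and avoid
    # division by zero for terms that appear in no documents.
    return math.log((total_docs + 1) / (docs_with_term + 0.5))

def term_score(tf, total_docs, docs_with_term):
    # Per-term contribution; a document's score for a query is the sum
    # of these over all query terms.
    return tf * idf(total_docs, docs_with_term)
```

In a 100-document collection, a term found in 2 documents contributes more per occurrence than one found in 90, since log(101/2.5) > log(101/90.5).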
4. Search Process
   - Tokenize the query.
   - Look up each term in vocab to get candidate documents.
   - Score each candidate (TF × IDF).
   - Rank by score and return the top K results.
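The four steps combine into a single loop. A hedged sketch, assuming vocab maps term → {doc_id: frequency} and docs is the document array from step 2 (names and signature are illustrative):

```python
import math
import re

def search(query, vocab, docs, top_k=10):
    """Tokenize, look up, score, and rank (simplified English-only tokenizer)."""
    terms = re.findall(r"[a-z0-9_]+", query.lower())
    scores = {}  # doc_id -> accumulated TF-IDF score
    for term in terms:
        postings = vocab.get(term, {})  # only candidate documents are touched
        if not postings:
            continue
        weight = math.log((len(docs) + 1) / (len(postings) + 0.5))  # IDF
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(docs[doc_id], s) for doc_id, s in ranked[:top_k]]
```

Note that only documents appearing in some query term's postings are ever scored, which is what keeps the loop cheap on large collections.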
5. Fallback for Partial Matches
   - If the index returns nothing (or for CJK queries), fall back to substring matching across all documents to catch partial tokens.
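The fallback is a plain linear scan; a minimal sketch (function name and document schema are assumptions):

```python
def substring_fallback(query, docs, top_k=10):
    # Linear scan over every document's full text: slower than the index,
    # but it catches partial tokens and CJK queries the tokenizer may miss.
    matches = [doc for doc in docs if query.lower() in doc["text"].lower()]
    return matches[:top_k]
```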
This keeps queries fast even on large collections, since you only scan documents that contain at least one query term.