
SageSearch version 3.0 uses an inverted index (term → document mapping) for fast lookups.
1. Tokenization
   - Each document is split into tokens using regex patterns:
     - English: alphanumeric sequences (`[A-Za-z0-9_]+`)
     - Japanese: Hiragana, Katakana, and Kanji Unicode ranges
   - In case-insensitive mode, tokens are lowercased.
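A minimal sketch of such a tokenizer in Python (the exact Unicode ranges, function name, and flag are illustrative assumptions, not SageSearch's actual API):

```python
import re

# Token pattern: English alphanumeric runs plus Japanese script ranges.
# The Unicode ranges below are illustrative approximations.
TOKEN_RE = re.compile(
    r"[A-Za-z0-9_]+"        # English words / identifiers
    r"|[\u3040-\u309F]+"    # Hiragana
    r"|[\u30A0-\u30FF]+"    # Katakana
    r"|[\u4E00-\u9FFF]+"    # CJK Unified Ideographs (Kanji)
)

def tokenize(text, case_insensitive=True):
    """Split text into tokens; lowercase them in case-insensitive mode."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if case_insensitive else tokens
```

For example, `tokenize("Full-Text Search")` yields `["full", "text", "search"]`; a mixed Kanji/Katakana string splits at the script boundary.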
2. Index Building
   - For each document:
     - Count the term frequency (TF) of each term.
     - Store it in vocab: term → {doc_id: frequency}.
   - Keep the full document text and metadata (path, title) in a docs array.
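The index-building loop might look like the following sketch (the document schema and function name are assumptions; a simplified English-only tokenizer stands in for the full one):

```python
import re
from collections import Counter

def build_index(documents):
    """documents: iterable of dicts with 'path', 'title', and 'text' keys
    (an assumed schema for illustration)."""
    vocab = {}   # term -> {doc_id: term frequency}
    docs = []    # full text + metadata, addressed by doc_id
    for doc_id, doc in enumerate(documents):
        docs.append(doc)
        # Simplified English-only tokenization for brevity.
        terms = re.findall(r"[a-z0-9_]+", doc["text"].lower())
        for term, tf in Counter(terms).items():
            vocab.setdefault(term, {})[doc_id] = tf
    return vocab, docs
```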
3. Scoring (TF-IDF)
   - TF (Term Frequency): how often a term appears in a document.
   - IDF (Inverse Document Frequency): log((total_docs + 1) / (docs_with_term + 0.5)); rare terms get a higher weight.
   - Score: TF × IDF per term, summed across query terms.
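The IDF formula above translates directly into code; a small sketch (function names are illustrative):

```python
import math

def idf(total_docs, docs_with_term):
    # The +1 and +0.5 smoothing terms keep the ratio positive and avoid
    # division by zero for terms that appear in no documents.
    return math.log((total_docs + 1) / (docs_with_term + 0.5))

def term_score(tf, total_docs, docs_with_term):
    # Per-term contribution; a document's score for a query is the sum
    # of these over all query terms.
    return tf * idf(total_docs, docs_with_term)
```

In a 100-document collection, a term found in 2 documents contributes more per occurrence than one found in 90, since log(101/2.5) > log(101/90.5).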
4. Search Process
   - Tokenize the query.
   - Look up each term in vocab to get candidate documents.
   - Score each candidate (TF × IDF).
   - Rank by score and return the top K results.
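The four steps combine into a single loop. A hedged sketch, assuming vocab maps term → {doc_id: frequency} and docs is the document array from step 2 (names and signature are illustrative):

```python
import math
import re

def search(query, vocab, docs, top_k=10):
    """Tokenize, look up, score, and rank (simplified English-only tokenizer)."""
    terms = re.findall(r"[a-z0-9_]+", query.lower())
    scores = {}  # doc_id -> accumulated TF-IDF score
    for term in terms:
        postings = vocab.get(term, {})  # only candidate documents are touched
        if not postings:
            continue
        weight = math.log((len(docs) + 1) / (len(postings) + 0.5))  # IDF
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(docs[doc_id], s) for doc_id, s in ranked[:top_k]]
```

Note that only documents appearing in some query term's postings are ever scored, which is what keeps the loop cheap on large collections.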
5. Fallback for Partial Matches
   - If the index returns nothing (or for CJK queries), fall back to substring matching across all documents to catch partial tokens.
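The fallback is a plain linear scan; a minimal sketch (function name and document schema are assumptions):

```python
def substring_fallback(query, docs, top_k=10):
    # Linear scan over every document's full text: slower than the index,
    # but it catches partial tokens and CJK queries the tokenizer may miss.
    matches = [doc for doc in docs if query.lower() in doc["text"].lower()]
    return matches[:top_k]
```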
This keeps queries fast even on large collections, since you only scan documents that contain at least one query term.