Information Retrieval is the process of finding relevant documents or information from large collections based on user queries. IR systems power search engines, recommendation platforms, digital libraries, and modern AI applications.
Traditional Retrieval Methods
Boolean Retrieval
Principle: Uses Boolean operators (AND, OR, NOT) to combine query terms
How It Works:
- Users specify exact Boolean logic: “machine AND learning NOT deep”
- System retrieves documents matching exactly this logic
- Simple but rigid
Limitations:
- No ranking of results
- All matching documents treated equally
- Difficult for users to formulate queries
- Vocabulary mismatch problems
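The mechanics above can be sketched with Python sets, where AND maps to intersection and NOT to difference. The toy documents and the query "machine AND learning NOT deep" are illustrative, not from any real corpus.

```python
# Minimal Boolean retrieval sketch: terms map to sets of document IDs.
docs = {
    1: "machine learning basics",
    2: "deep learning with neural networks",
    3: "machine learning without deep models",
}

# Build term -> set of documents containing it
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Evaluate "machine AND learning NOT deep":
# intersection for AND, set difference for NOT
result = (index.get("machine", set())
          & index.get("learning", set())
          - index.get("deep", set()))
```

Note that `result` is an unordered set: every matching document is treated equally, which is exactly the "no ranking" limitation listed above.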
Bag of Words (BoW) Model
Principle: Represents documents as unordered collections of words
How It Works:
- Count frequency of each word in documents
- Query is also represented as word frequencies
- Simple matching provides basic relevance
Limitations:
- Treats all words equally (ignores importance)
- No consideration of word order or context
- Common words (the, and, is) pollute relevance scores
TF-IDF (Term Frequency-Inverse Document Frequency)
Principle: Weight words by both frequency and rarity
Components:
- Term Frequency (TF): How often a term appears in a document
- Inverse Document Frequency (IDF): Rarity of term across all documents
- Score = TF × IDF: Balances frequency with uniqueness
Formula:
IDF(term) = log(total_documents / documents_containing_term)
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
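The formula transcribes directly into code. This sketch uses raw term counts for TF (variants exist, e.g. log-scaled TF) over a three-document toy corpus:

```python
import math

# TF-IDF over a toy corpus, using raw counts for TF.
docs = [
    "machine learning is fun".split(),
    "deep learning is powerful".split(),
    "machine translation uses learning".split(),
]
N = len(docs)

def idf(term):
    df = sum(term in d for d in docs)  # documents containing the term
    return math.log(N / df)

def tf_idf(term, doc):
    return doc.count(term) * idf(term)

# "learning" occurs in all 3 docs -> IDF = log(3/3) = 0, so it
# contributes nothing; "machine" (2 of 3 docs) gets positive weight.
```

A term that appears in every document receives weight zero, which is how IDF suppresses ubiquitous words that raw TF would overvalue.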
Advantages:
- Simple and interpretable
- Effective for many traditional IR tasks
- Computationally efficient
Limitations:
- Biases toward longer documents (they accumulate more term occurrences)
- Assumes term independence (no semantic relationships)
- No handling of synonyms or semantic meaning
- Fails with vocabulary mismatch
BM25 (Okapi Best Matching 25)
Principle: Probabilistic ranking function improving upon TF-IDF
Key Improvements:
- Term Frequency Saturation:
- TF-IDF increases linearly with term repetition
- BM25 saturates after a threshold
- Additional occurrences add diminishing value
- Document Length Normalization:
- Longer documents naturally contain more terms
- BM25 normalizes term frequency by document length relative to the corpus average, so long documents are not unfairly favored
- Parameter k1 controls saturation, parameter b controls length normalization
BM25 Formula:
Score(D,Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- qi = query term i
- f(qi,D) = frequency of qi in document D
- |D| = length of document D
- avgdl = average document length
- k1, b = tuning parameters (typically k1=1.2, b=0.75)
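The scoring function above can be transcribed almost line for line. One assumption in this sketch: the IDF component is computed with the common smoothed variant log((N − df + 0.5)/(df + 0.5) + 1) (as used in Lucene), since the formula above leaves IDF unspecified.

```python
import math

# BM25 scoring sketch over a toy corpus, with standard parameters.
docs = [
    "machine learning is fun".split(),
    "deep learning with neural networks and more learning".split(),
    "machine translation".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length
k1, b = 1.2, 0.75                      # typical tuning parameters

def idf(term):
    df = sum(term in d for d in docs)
    # Smoothed IDF variant (assumption; keeps values non-negative)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for q in query:
        f = doc.count(q)  # f(qi, D)
        score += idf(q) * (f * (k1 + 1)) / (
            f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

scores = [bm25("machine learning".split(), d) for d in docs]
```

Because of saturation, the long second document with "learning" twice does not score twice as high as a short document with it once; raising k1 would let repetition count for more.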
Advantages:
- Most effective traditional IR method
- Robust and widely adopted
- Balances multiple ranking factors
- Standard baseline for comparisons
Limitations:
- Still doesn’t capture semantic meaning
- Term independence assumption
- Hyperparameter tuning required
- Struggles with vocabulary mismatch
Inverted Index (Lexical Search Foundation)
Principle: Maps words to document IDs for fast lookup
Structure:
Word → [Document IDs where word appears]
"machine" → [Doc1, Doc3, Doc5, ...]
"learning" → [Doc2, Doc3, Doc4, ...]
How It Works:
- During indexing: Extract words from each document, build word→doc mapping
- During search: Look up query words, retrieve associated document lists
- Combine results based on query logic (AND, OR)
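The indexing and search steps above amount to building posting lists and combining them with set operations; a minimal sketch over three toy documents:

```python
# Inverted index sketch: word -> posting list of document IDs.
docs = {
    "Doc1": "machine learning",
    "Doc2": "learning systems",
    "Doc3": "machine learning systems",
}

# Indexing: extract words, build word -> doc-ID mapping
index = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        index.setdefault(word, []).append(doc_id)

# Search: AND = intersection of posting lists, OR = union
def search_and(*words):
    postings = [set(index.get(w, [])) for w in words]
    return set.intersection(*postings) if postings else set()

def search_or(*words):
    return set().union(*(index.get(w, []) for w in words))
```

Lookup cost depends on posting-list length rather than corpus size, which is why this structure makes exact keyword matching so fast.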
Advantages:
- Extremely fast for exact keyword matching
- Efficient storage and updates
- Foundation for BM25 and lexical search
Limitations:
- Only exact word matching
- No semantic understanding
- Doesn’t handle typos or variations
- Requires stopword removal and stemming
Sentence Transformers (SBERT)
Principle: BERT-based encoder optimized for sentence-level embeddings
Key Components:
- Transformer Encoder (e.g., BERT-base):
- Processes input tokens through 12 transformer layers
- Generates token-level embeddings (up to 512 tokens × 768 dims)
- Pooling Layer:
- Aggregates token embeddings into fixed sentence embedding
- Common strategies:
- Mean pooling: Average all token embeddings
- Max pooling: Take maximum across each dimension
- [CLS] token: Use output of special classification token
- Siamese Network Training:
- Two identical SBERT networks process different sentences
- Trained with contrastive loss on sentence pairs
- Output: Similarity score between sentence embeddings
Process:
Sentence A: "The cat is sleeping"
↓
[BERT Layers] → Token Embeddings (8 tokens × 768)
↓
[Mean Pooling] → Sentence Embedding (768 dims)
Sentence B: "The feline is napping"
↓
[BERT Layers] → Token Embeddings (9 tokens × 768)
↓
[Mean Pooling] → Sentence Embedding (768 dims)
Cosine Similarity → 0.94 (very similar)
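The mean-pooling and cosine-similarity steps in this pipeline can be sketched in plain Python. The 4-dimensional "token embeddings" below are hand-made stand-ins for BERT's 768-dimensional outputs, so the resulting similarity value is illustrative only:

```python
import math

# Mean pooling: average token vectors into one sentence embedding.
def mean_pooling(token_embeddings):
    n = len(token_embeddings)
    dims = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dims)]

# Cosine similarity between two sentence embeddings.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim token embeddings (NOT real BERT outputs) for two sentences
sent_a = [[0.2, 0.1, 0.9, 0.3], [0.4, 0.0, 0.8, 0.2], [0.3, 0.2, 0.7, 0.1]]
sent_b = [[0.3, 0.1, 0.8, 0.2], [0.2, 0.1, 0.9, 0.3]]

sim = cosine_similarity(mean_pooling(sent_a), mean_pooling(sent_b))
```

Note that pooling yields a fixed-size embedding regardless of sentence length, which is what lets sentences of different token counts (8 vs. 9 above) be compared directly.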