Information Retrieval is the process of finding relevant documents or information from large collections based on user queries. IR systems power search engines, recommendation platforms, digital libraries, and modern AI applications.

Traditional Retrieval Methods

Boolean Retrieval

Principle: Uses Boolean operators (AND, OR, NOT) to combine query terms

How It Works:

  • Users specify exact Boolean logic: “machine AND learning NOT deep”
  • System retrieves documents matching exactly this logic
  • Simple but rigid
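The exact-logic matching above can be sketched with Python sets; the toy corpus and document IDs below are made up for illustration:

```python
# Toy corpus: document ID -> text (illustrative, not from a real dataset).
docs = {
    1: "machine learning basics",
    2: "deep learning with neural networks",
    3: "machine learning without deep models",
}

# Map each word to the set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# Query: "machine AND learning NOT deep" as explicit set operations.
matched = (index.get("machine", set())
           & index.get("learning", set())) - index.get("deep", set())
```

Note that the result is an unordered set: every matching document is "equally relevant", which is exactly the no-ranking limitation listed below.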

Limitations:

  • No ranking of results
  • All matching documents treated equally
  • Difficult for users to formulate queries
  • Vocabulary mismatch problems

Bag of Words (BoW) Model

Principle: Represents documents as unordered collections of words

How It Works:

  • Count frequency of each word in documents
  • Query is also represented as word frequencies
  • Simple matching provides basic relevance
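A minimal BoW sketch using `collections.Counter`; the sentences and the simple sum-of-counts relevance score are illustrative:

```python
from collections import Counter

def bow(text):
    # Unordered word-frequency representation of a text.
    return Counter(text.lower().split())

doc = bow("the cat sat on the mat")
query = bow("cat mat")

# Basic relevance: total occurrences of query terms in the document.
score = sum(doc[term] for term in query)
```

Notice that a query containing "the" would inflate the score by 2 here, showing how common words pollute relevance.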

Limitations:

  • Treats all words equally (ignores importance)
  • No consideration of word order or context
  • Common words (the, and, is) pollute relevance scores

TF-IDF (Term Frequency-Inverse Document Frequency)

Principle: Weight words by both frequency and rarity

Components:

  • Term Frequency (TF): How often a term appears in a document
  • Inverse Document Frequency (IDF): Rarity of term across all documents
  • Score = TF × IDF: Balances frequency with uniqueness

Formula:

IDF(term) = log(total_documents / documents_containing_term)
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
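The formula translates directly to code; this sketch assumes a tiny illustrative corpus and uses the plain log(N/df) IDF defined above:

```python
import math

# Toy corpus (illustrative sentences).
docs = [
    "machine learning is fun",
    "deep learning is powerful",
    "machine translation uses learning",
]
N = len(docs)
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    # Term Frequency: raw count of the term in one document.
    return doc_tokens.count(term)

def idf(term):
    # Inverse Document Frequency: log(total docs / docs containing term).
    df = sum(1 for toks in tokenized if term in toks)
    return math.log(N / df) if df else 0.0

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)
```

A term appearing in every document (here "learning") gets IDF = log(1) = 0, so ubiquitous words contribute nothing to the score.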

Advantages:

  • Simple and interpretable
  • Effective for many traditional IR tasks
  • Computationally efficient

Limitations:

  • Biased toward longer documents (they accumulate more term occurrences)
  • Assumes term independence (no semantic relationships)
  • No handling of synonyms or semantic meaning
  • Fails with vocabulary mismatch

BM25 (Okapi Best Matching 25)

Principle: Probabilistic ranking function improving upon TF-IDF

Key Improvements:

  1. Term Frequency Saturation:
    • TF-IDF increases linearly with term repetition
    • BM25 saturates after a threshold
    • Additional occurrences add diminishing value
  2. Document Length Normalization:
    • Longer documents naturally contain more terms
    • BM25 normalizes term frequency by document length relative to the corpus average, so long documents aren’t unfairly favored
    • Parameter k1 controls saturation, parameter b controls length normalization

BM25 Formula:

Score(D,Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))

Where:
- qi = query term i
- f(qi,D) = frequency of qi in document D
- |D| = length of document D
- avgdl = average document length
- k1, b = tuning parameters (typically k1=1.2, b=0.75)
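The formula can be transcribed term by term; the sketch below assumes a toy tokenized corpus and reuses the plain log(N/df) IDF from the TF-IDF section (production systems often use a smoothed IDF variant instead):

```python
import math

def bm25_score(query_terms, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score one document D against query Q using the BM25 formula above."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N   # average document length
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus_tokens if q in d)  # docs containing q
        if df == 0:
            continue
        idf = math.log(N / df)                        # IDF(qi), as defined earlier
        f = doc_tokens.count(q)                       # f(qi, D)
        # Denominator implements saturation (k1) and length normalization (b).
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

# Toy tokenized corpus (illustrative).
corpus = [
    ["machine", "learning"],
    ["deep", "learning", "nets"],
    ["cooking", "recipes"],
]
```

Because f appears in both numerator and denominator, the per-term contribution saturates as f grows, which is the key behavioral difference from TF-IDF's linear growth.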

Advantages:

  • Most effective traditional IR method
  • Robust and widely adopted
  • Balances multiple ranking factors
  • Standard baseline for comparisons

Limitations:

  • Still doesn’t capture semantic meaning
  • Term independence assumption
  • Hyperparameter tuning required
  • Struggles with vocabulary mismatch

Inverted Index (Lexical Search Foundation)

Principle: Maps words to document IDs for fast lookup

Structure:

Word → [Document IDs where word appears]
"machine" → [Doc1, Doc3, Doc5, ...]
"learning" → [Doc2, Doc3, Doc4, ...]

How It Works:

  1. During indexing: Extract words from each document, build word→doc mapping
  2. During search: Look up query words, retrieve associated document lists
  3. Combine results based on query logic (AND, OR)
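The indexing and search steps above can be sketched as follows; the corpus and doc IDs are illustrative, and the posting lists are kept as sets here so the AND combination is a plain intersection:

```python
from collections import defaultdict

# Toy corpus (illustrative document IDs and text).
docs = {
    "Doc1": "machine vision systems",
    "Doc2": "learning to rank",
    "Doc3": "machine learning methods",
}

# Indexing: extract words from each document, build word -> doc-ID mapping.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

# Search with AND semantics: intersect posting sets for each query word.
def search_and(query):
    postings = [inverted.get(w, set()) for w in query.split()]
    return set.intersection(*postings) if postings else set()
```

Swapping the intersection for a union would give OR semantics; real engines also store positions and frequencies in each posting to support phrase queries and BM25 scoring.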

Advantages:

  • Extremely fast for exact keyword matching
  • Efficient storage and updates
  • Foundation for BM25 and lexical search

Limitations:

  • Only exact word matching
  • No semantic understanding
  • Doesn’t handle typos or variations
  • Requires stopword removal and stemming

Sentence Transformers (SBERT)

Principle: BERT-based encoder optimized for sentence-level embeddings

Key Components:

  1. Transformer Encoder (e.g., BERT-base):
    • Processes input tokens through 12 transformer layers
    • Generates token-level embeddings (up to 512 tokens × 768 dims)
  2. Pooling Layer:
    • Aggregates token embeddings into fixed sentence embedding
    • Common strategies:
      • Mean pooling: Average all token embeddings
      • Max pooling: Take maximum across each dimension
      • [CLS] token: Use output of special classification token
  3. Siamese Network Training:
    • Two identical SBERT networks process different sentences
    • Trained with contrastive loss on sentence pairs
    • Output: Similarity score between sentence embeddings

Process:

Sentence A: "The cat is sleeping"
    ↓
[BERT Layers] → Token Embeddings (8 tokens × 768)
    ↓
[Mean Pooling] → Sentence Embedding (768 dims)

Sentence B: "The feline is napping"
    ↓
[BERT Layers] → Token Embeddings (9 tokens × 768)
    ↓
[Mean Pooling] → Sentence Embedding (768 dims)

Cosine Similarity → 0.94 (very similar)
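The mean-pooling and cosine-similarity steps can be sketched with NumPy. The token matrices below are random stand-ins for real BERT outputs (4 dims instead of 768, purely illustrative), so the resulting similarity value carries no meaning here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for BERT token embeddings (toy random values, not a real model):
# sentence A -> 8 tokens x 4 dims, sentence B -> 9 tokens x 4 dims.
tokens_a = rng.normal(size=(8, 4))
tokens_b = rng.normal(size=(9, 4))

def mean_pool(token_embeddings):
    # Average over the token axis: variable-length input, fixed-size output.
    return token_embeddings.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

emb_a, emb_b = mean_pool(tokens_a), mean_pool(tokens_b)
similarity = cosine(emb_a, emb_b)  # value in [-1, 1]
```

The point of the pooling step is visible in the shapes: sentences of 8 and 9 tokens both become fixed-size vectors, which is what makes direct cosine comparison possible.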