Information Retrieval is the process of finding relevant documents or information from large collections based on user queries. IR systems power search engines, recommendation platforms, digital libraries, and modern AI applications.
Traditional Retrieval Methods
Boolean Retrieval
Principle: Uses Boolean operators (AND, OR, NOT) to combine query terms
How It Works:
- Users specify exact Boolean logic: “machine AND learning NOT deep”
- System retrieves documents matching exactly this logic
- Simple but rigid
Limitations:
- No ranking of results
- All matching documents treated equally
- Difficult for users to formulate queries
- Vocabulary mismatch problems
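The mechanics above can be sketched with Python sets, where AND maps to intersection and NOT to difference. The toy documents and the query "machine AND learning NOT deep" are illustrative, not from any real corpus.

```python
# Minimal Boolean retrieval sketch: terms map to sets of document IDs.
docs = {
    1: "machine learning basics",
    2: "deep learning with neural networks",
    3: "machine learning without deep models",
}

# Build term -> set of documents containing it
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Evaluate "machine AND learning NOT deep":
# intersection for AND, set difference for NOT
result = (index.get("machine", set())
          & index.get("learning", set())
          - index.get("deep", set()))
```

Note that `result` is an unordered set: every matching document is treated equally, which is exactly the "no ranking" limitation listed above.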
Bag of Words (BoW) Model
Principle: Represents documents as unordered collections of words
How It Works:
- Count frequency of each word in documents
- Query is also represented as word frequencies
- Simple matching provides basic relevance
Limitations:
- Treats all words equally (ignores importance)
- No consideration of word order or context
- Common words (the, and, is) pollute relevance scores
TF-IDF (Term Frequency-Inverse Document Frequency)
Principle: Weight words by both frequency and rarity
Components:
- Term Frequency (TF): How often a term appears in a document
- Inverse Document Frequency (IDF): Rarity of term across all documents
- Score = TF × IDF: Balances frequency with uniqueness
Formula:
IDF(term) = log(total_documents / documents_containing_term)
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
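The formula transcribes directly into code. This sketch uses raw term counts for TF (variants exist, e.g. log-scaled TF) over a three-document toy corpus:

```python
import math

# TF-IDF over a toy corpus, using raw counts for TF.
docs = [
    "machine learning is fun".split(),
    "deep learning is powerful".split(),
    "machine translation uses learning".split(),
]
N = len(docs)

def idf(term):
    df = sum(term in d for d in docs)  # documents containing the term
    return math.log(N / df)

def tf_idf(term, doc):
    return doc.count(term) * idf(term)

# "learning" occurs in all 3 docs -> IDF = log(3/3) = 0, so it
# contributes nothing; "machine" (2 of 3 docs) gets positive weight.
```

A term that appears in every document receives weight zero, which is how IDF suppresses ubiquitous words that raw TF would overvalue.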
Advantages:
- Simple and interpretable
- Effective for many traditional IR tasks
- Computationally efficient
Limitations:
- Biases toward longer documents (they accumulate more term occurrences)
- Assumes term independence (no semantic relationships)
- No handling of synonyms or semantic meaning
- Fails with vocabulary mismatch
BM25 (Okapi Best Matching 25)
Principle: Probabilistic ranking function improving upon TF-IDF
Key Improvements:
- Term Frequency Saturation:
- TF-IDF increases linearly with term repetition
- BM25 saturates after a threshold
- Additional occurrences add diminishing value
- Document Length Normalization:
- Longer documents naturally contain more terms
- BM25 normalizes term frequency by document length relative to the corpus average, so long documents are not unfairly favored
- Parameter k1 controls saturation, parameter b controls length normalization
BM25 Formula:
Score(D,Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- qi = query term i
- f(qi,D) = frequency of qi in document D
- |D| = length of document D
- avgdl = average document length
- k1, b = tuning parameters (typically k1=1.2, b=0.75)
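The scoring function above can be transcribed almost line for line. One assumption in this sketch: the IDF component is computed with the common smoothed variant log((N − df + 0.5)/(df + 0.5) + 1) (as used in Lucene), since the formula above leaves IDF unspecified.

```python
import math

# BM25 scoring sketch over a toy corpus, with standard parameters.
docs = [
    "machine learning is fun".split(),
    "deep learning with neural networks and more learning".split(),
    "machine translation".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length
k1, b = 1.2, 0.75                      # typical tuning parameters

def idf(term):
    df = sum(term in d for d in docs)
    # Smoothed IDF variant (assumption; keeps values non-negative)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for q in query:
        f = doc.count(q)  # f(qi, D)
        score += idf(q) * (f * (k1 + 1)) / (
            f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

scores = [bm25("machine learning".split(), d) for d in docs]
```

Because of saturation, the long second document with "learning" twice does not score twice as high as a short document with it once; raising k1 would let repetition count for more.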
Advantages:
- Most effective traditional IR method
- Robust and widely adopted
- Balances multiple ranking factors
- Standard baseline for comparisons
Limitations:
- Still doesn’t capture semantic meaning
- Term independence assumption
- Hyperparameter tuning required
- Struggles with vocabulary mismatch
Inverted Index (Lexical Search Foundation)
Principle: Maps words to document IDs for fast lookup
Structure:
Word → [Document IDs where word appears]
"machine" → [Doc1, Doc3, Doc5, ...]
"learning" → [Doc2, Doc3, Doc4, ...]
How It Works:
- During indexing: Extract words from each document, build word→doc mapping
- During search: Look up query words, retrieve associated document lists
- Combine results based on query logic (AND, OR)
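The indexing and search steps above amount to building posting lists and combining them with set operations; a minimal sketch over three toy documents:

```python
# Inverted index sketch: word -> posting list of document IDs.
docs = {
    "Doc1": "machine learning",
    "Doc2": "learning systems",
    "Doc3": "machine learning systems",
}

# Indexing: extract words, build word -> doc-ID mapping
index = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        index.setdefault(word, []).append(doc_id)

# Search: AND = intersection of posting lists, OR = union
def search_and(*words):
    postings = [set(index.get(w, [])) for w in words]
    return set.intersection(*postings) if postings else set()

def search_or(*words):
    return set().union(*(index.get(w, []) for w in words))
```

Lookup cost depends on posting-list length rather than corpus size, which is why this structure makes exact keyword matching so fast.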
Advantages:
- Extremely fast for exact keyword matching
- Efficient storage and updates
- Foundation for BM25 and lexical search
Limitations:
- Only exact word matching
- No semantic understanding
- Doesn’t handle typos or variations
- Requires stopword removal and stemming
Sentence Transformers (SBERT)
Principle: BERT-based encoder optimized for sentence-level embeddings
Key Components:
- Transformer Encoder (e.g., BERT-base):
- Processes input tokens through 12 transformer layers
- Generates token-level embeddings (up to 512 tokens × 768 dims)
- Pooling Layer:
- Aggregates token embeddings into fixed sentence embedding
- Common strategies:
- Mean pooling: Average all token embeddings
- Max pooling: Take maximum across each dimension
- [CLS] token: Use output of special classification token
- Siamese Network Training:
- Two identical SBERT networks process different sentences
- Trained with contrastive loss on sentence pairs
- Output: Similarity score between sentence embeddings
Process:
Sentence A: "The cat is sleeping"
↓
[BERT Layers] → Token Embeddings (8 tokens × 768)
↓
[Mean Pooling] → Sentence Embedding (768 dims)
Sentence B: "The feline is napping"
↓
[BERT Layers] → Token Embeddings (9 tokens × 768)
↓
[Mean Pooling] → Sentence Embedding (768 dims)
Cosine Similarity → 0.94 (very similar)
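The mean-pooling and cosine-similarity steps in this pipeline can be sketched in plain Python. The 4-dimensional "token embeddings" below are hand-made stand-ins for BERT's 768-dimensional outputs, so the resulting similarity value is illustrative only:

```python
import math

# Mean pooling: average token vectors into one sentence embedding.
def mean_pooling(token_embeddings):
    n = len(token_embeddings)
    dims = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dims)]

# Cosine similarity between two sentence embeddings.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim token embeddings (NOT real BERT outputs) for two sentences
sent_a = [[0.2, 0.1, 0.9, 0.3], [0.4, 0.0, 0.8, 0.2], [0.3, 0.2, 0.7, 0.1]]
sent_b = [[0.3, 0.1, 0.8, 0.2], [0.2, 0.1, 0.9, 0.3]]

sim = cosine_similarity(mean_pooling(sent_a), mean_pooling(sent_b))
```

Note that pooling yields a fixed-size embedding regardless of sentence length, which is what lets sentences of different token counts (8 vs. 9 above) be compared directly.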