Model Compression Techniques

1. Pruning

Concept: Remove unnecessary weights from the model.

Structured Pruning

  • Remove entire neurons, filters, or layers
  • Maintains model structure
  • Easy to optimize on standard hardware
  • Typical accuracy loss: 5-10%

Example:
Original: 100 neurons
Pruned (50%): 50 neurons
Speed improvement: ~2x
Accuracy loss: ~5-8%

Unstructured Pruning

  • Remove individual weights
  • Better compression ratios
  • Requires special hardware support
  • Typical accuracy loss: 2-5%

Example:
Original: 100 neurons × 100 weights = 10,000 weights
Pruned (80%): 2,000 weights remaining
Speed improvement: ~2-3x (with hardware support)
Accuracy loss: ~2-4%
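
The idea behind unstructured (magnitude) pruning can be sketched in plain Python; this is a toy illustration of the selection rule, not a framework API:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights.

    weights:  flat list of floats
    sparsity: fraction of weights to remove (0.0-1.0)
    """
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.03, 0.2, -0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
# -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.2, 0.0]
```

In practice the same rule is applied tensor-wise, e.g. via PyTorch's `torch.nn.utils.prune.l1_unstructured`.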

Tools

  • Neural Magic - Pruned model zoo
  • PyTorch built-in pruning API

2. Quantization

Concept: Reduce numerical precision of weights and activations.

Common Quantization Schemes

Scheme | Bits | Range                          | Typical Loss
FP32   | 32   | -3.4e38 to 3.4e38              | Baseline
FP16   | 16   | -65,504 to 65,504              | < 1%
BF16   | 16   | -3.4e38 to 3.4e38              | < 1%
INT8   | 8    | -128 to 127                    | 2-5%
INT4   | 4    | -8 to 7                        | 5-10%
FP8    | 8    | ±448 (E4M3) or ±57,344 (E5M2)  | 1-3%

Benefits

  • 2-4x smaller model size
  • 2-4x faster inference
  • Lower memory usage
  • Reduced bandwidth

Trade-offs

  • Small accuracy loss (typically a few percent, depending on scheme)
  • Requires careful calibration
  • Some hardware limitations
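
The core of INT8 quantization is a scale factor mapping floats onto the integer range. A minimal symmetric per-tensor sketch in plain Python (real toolchains also handle calibration, per-channel scales, and zero-points):

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]  # store these as int8
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats at inference time."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by scale / 2
```

Calibration (choosing the scale from representative data rather than the raw max) is what the "careful calibration" trade-off above refers to.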

Quantization Methods

Post-Training Quantization (PTQ)

  • Apply after training
  • Fast, simple
  • Slight accuracy loss
  • Best choice for most applications

Quantization-Aware Training (QAT)

  • Train with quantization in mind
  • Better accuracy
  • More complex
  • Use when accuracy critical
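
QAT works by inserting "fake quantization" (quantize-then-dequantize) into the forward pass, so the training loss already reflects rounding error; the backward pass usually treats it as identity (the straight-through estimator). A minimal sketch of the forward-pass operation:

```python
def fake_quantize(x, scale):
    """Quantize-dequantize in one step, as used in a QAT forward pass.
    Simulates INT8 rounding and clipping while staying in float."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale

# During QAT, each weight passes through fake_quantize before use,
# so the network learns weights that survive rounding.
w_q = fake_quantize(0.123, scale=0.01)  # ≈ 0.12
```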

Tools

  • GPTQ - Excellent LLM quantization
  • AWQ - Activation-aware quantization; often better accuracy than GPTQ at the same bit-width
  • Ollama - Easy local serving of pre-quantized models
  • vLLM - Built-in quantization support

3. Knowledge Distillation

Concept: Train a smaller model (student) to mimic a larger model (teacher).

Process

  1. Teacher Training: Train large, accurate model
  2. Temperature Scaling: Teacher generates soft targets (probabilities instead of hard labels)
  3. Student Training: Smaller model learns to match teacher’s outputs
  4. Fine-tuning: Optional fine-tuning for task-specific accuracy
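
Steps 2-3 can be sketched with temperature-scaled soft targets and a matching loss. This is a minimal plain-Python illustration; real implementations typically combine this term with the hard-label loss and scale it by T²:

```python
import math

def softmax_with_temperature(logits, T):
    """Higher T softens the distribution, exposing the teacher's
    relative preferences between classes, not just its argmax."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T):
    """Cross-entropy between teacher and student soft targets."""
    p = softmax_with_temperature(teacher_logits, T)
    q = softmax_with_temperature(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]
soft = softmax_with_temperature(teacher, T=3.0)
# At T=3 the target is much softer than a one-hot label
```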

Benefits

  • Smaller model maintains teacher’s quality
  • Better than training student from scratch
  • 5-10x smaller models possible

Trade-offs

  • Teacher model must be available
  • Training time (but inference is fast)
  • Limited by teacher quality

Example

Teacher: Llama 70B (70B params)
  ↓
Knowledge Distillation
  ↓
Student: Custom 7B model
  ↓
Result: 7B model retaining much of the 70B teacher's quality on the distilled tasks

4. Pseudo-Labeling

Concept: Use large model to generate labels for unlabeled data, then train smaller model.

Process

  1. Large model labels unlabeled data
  2. Smaller model trains on pseudo-labeled data
  3. Often combined with knowledge distillation
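
The steps above usually include a confidence filter so the student only trains on labels the teacher is sure about. A minimal sketch (`toy_teacher` is a stand-in for the large model's predict function):

```python
def pseudo_label(unlabeled, teacher_predict, threshold=0.9):
    """Keep only examples the teacher labels with high confidence."""
    dataset = []
    for x in unlabeled:
        label, confidence = teacher_predict(x)
        if confidence >= threshold:
            dataset.append((x, label))  # becomes student training data
    return dataset

# Toy teacher: confident on short inputs, unsure on long ones
def toy_teacher(x):
    return ("short" if len(x) < 5 else "long",
            0.95 if len(x) < 5 else 0.6)

data = pseudo_label(["hi", "hey", "a longer sample"], toy_teacher)
# -> [("hi", "short"), ("hey", "short")]
```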

Best For

  • Limited labeled data
  • Domain-specific applications
  • Few-shot learning

Streaming & Real-Time Optimization

Streaming Inference

Concept: Generate output in chunks while still receiving input.

Benefits

  • Lower latency (first token appears quickly)
  • Better user experience
  • Processes input incrementally

Implementation

def streaming_inference(audio_stream):
    for audio_chunk in audio_stream:
        # Process chunk
        tokens = encode(audio_chunk)
 
        # Generate response incrementally
        for output_token in model.generate_streaming(tokens):
            yield output_token  # Send to user immediately

Metrics

  • Time-to-First-Token (TTFT): Time until first output
  • Tokens-Per-Second (TPS): Throughput of generation

Targets:

  • TTFT: < 100ms
  • TPS: > 10 tokens/sec

Delayed Sequence Modeling (DSM)

Concept: Buffer input briefly before committing to an output, trading a small fixed delay for more context and better accuracy.

When Used

  • Streaming ASR: Buffer 100-300ms of audio before finalizing words
  • Streaming TTS: Buffer tokens for smoother prosody
  • Real-time Translation: Buffer words for better context

Trade-offs

  • Small latency increase
  • Better accuracy and naturalness
  • Improved output quality

Example

Without DSM:
Input: "Hel..." → Output: "Hel..."
Latency: low, but may need correction

With DSM (300ms buffer):
Input: "Hello..." (buffered) → Output: "Hello"
Latency: +300ms, but accurate first-time
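
The buffering itself is simple; a sketch that groups fixed-length chunks into ~300ms windows before processing (chunk durations here are assumed, for illustration):

```python
def buffered_stream(chunks, buffer_ms, chunk_ms=100):
    """Accumulate ~buffer_ms of audio before emitting each window,
    so each decision sees more context than a single chunk."""
    buffer, buffered = [], 0
    for chunk in chunks:
        buffer.append(chunk)
        buffered += chunk_ms
        if buffered >= buffer_ms:
            yield list(buffer)   # process the whole window at once
            buffer, buffered = [], 0
    if buffer:
        yield list(buffer)       # flush the remainder at end of stream

windows = list(buffered_stream(["c1", "c2", "c3", "c4", "c5"], buffer_ms=300))
# -> [["c1", "c2", "c3"], ["c4", "c5"]]
```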

Hardware Acceleration

Inference Servers

vLLM

  • Fast LLM serving
  • PagedAttention for memory efficiency
  • Batch processing support
  • Streaming support
# Install
pip install vllm
 
# Serve model
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --quantization awq  # Use AWQ quantization

Triton Server

  • Multi-framework serving
  • Model optimization
  • Dynamic batching
  • Ensemble support
# Create model repository
mkdir -p triton-models/voicellm/1
cp model.onnx triton-models/voicellm/1/
 
# Serve
docker run --gpus all -p 8000:8000 \
  nvcr.io/nvidia/tritonserver:latest \
  tritonserver --model-repository=/models

MLC-LLM

  • Optimized for edge and mobile
  • Quantized model support
  • Cross-platform compilation
# Install
pip install mlc-llm
 
# Compile model
mlc_llm compile models/llama-2-7b

LiteRT (formerly TensorFlow Lite)

  • Google’s edge inference runtime
  • Optimized for on-device
  • Small footprint (1-5MB for TTS model)
from ai_edge_litert.interpreter import Interpreter
 
# Load optimized model
interpreter = Interpreter(model_path="model_quantized.tflite")
interpreter.allocate_tensors()
 
# Run inference (after writing inputs via interpreter.set_tensor)
interpreter.invoke()

Specialized Hardware

Hardware   | Best For                   | Speed-up    | Cost
GPU (H100) | Throughput, batch          | 100x CPU    | $$$$
TPU        | Large batch training       | 50-100x CPU | $$$ (cloud only)
NPU        | Edge inference, streaming  | 10-20x CPU  | $$$
CPU        | Universal, low cost        | Baseline    | $

Caching Strategies

Attention Cache (KV Cache)

Problem: Without caching, attention recomputes the keys and values of every previous token at each generation step

Solution: Cache Key and Value matrices

Without KV cache:
Step 1: Compute KV for token 1 → Output token 2
Step 2: Recompute KV for tokens 1-2 → Output token 3
Step 3: Recompute KV for tokens 1-3 → Output token 4
Cost: O(n²) computation

With KV cache:
Step 1: Compute KV for token 1 → Output token 2, cache KV
Step 2: Reuse cached KV + compute new KV → Output token 3
Step 3: Reuse cached KVs + compute new KV → Output token 4
Cost: O(n) computation

Savings: 2-10x speedup depending on sequence length
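
The O(n²) vs O(n) difference can be made concrete by counting key/value computations; a toy sketch:

```python
def attention_cost(seq_len, use_kv_cache):
    """Count key/value computations needed to generate seq_len tokens."""
    ops = 0
    for step in range(1, seq_len + 1):
        if use_kv_cache:
            ops += 1      # only the newest token's K/V is computed
        else:
            ops += step   # K/V recomputed for the whole prefix
    return ops

n = 1000
without = attention_cost(n, use_kv_cache=False)  # 500500 -> O(n^2)
cached = attention_cost(n, use_kv_cache=True)    # 1000   -> O(n)
```

The cost is the extra memory to hold the cache, which is why techniques like vLLM's PagedAttention exist.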

Embedding Cache

Concept: Cache pre-computed embeddings for repeated inputs.

# Cache speaker embeddings
speaker_embedding_cache = {}
 
for speaker_id in speaker_ids:
    if speaker_id not in speaker_embedding_cache:
        speaker_embedding_cache[speaker_id] = \
            compute_speaker_embedding(speaker_id)  # embed this speaker's reference audio
 
    embedding = speaker_embedding_cache[speaker_id]
    # Use embedding for TTS

Response Cache

Concept: Cache common queries and responses.

response_cache = {}
 
def get_response(query):
    # Check cache
    if query in response_cache:
        return response_cache[query]
 
    # Compute if not cached
    response = model(query)
    response_cache[query] = response
    return response

Savings: 5-100x speedup for repeated queries


Batch Processing

Dynamic Batching

Process multiple requests together:

# Collect requests for 100ms
requests = queue.get_batch(timeout=0.1)
 
# Process together
outputs = model.batch_infer(requests)
 
# Send back
for req, output in zip(requests, outputs):
    req.respond(output)

Throughput Improvement: 5-20x (depending on batch size)

Batch Size Impact

Batch Size | Latency | Throughput
1          | 100ms   | 10 req/s
4          | 120ms   | 33 req/s
8          | 150ms   | 53 req/s
16         | 200ms   | 80 req/s
32         | 300ms   | 107 req/s

Sweet Spot: Usually 4-8 for real-time, 16-32 for batch


End-to-End Optimization Example

Scenario: Deploy Whisper + Llama + TTS

Original Setup

  • Whisper Medium (769M) + Llama 13B + Kokoro TTS
  • Cost: H100 @ $5.50/hr
  • Latency: 2-3 seconds
  • Throughput: 5 req/s

Optimizations

  1. Quantize Whisper

    • Medium → INT8
    • Size: 769M params; ~3.1GB (FP32) → ~0.8GB (INT8)
    • Speed: 1.3x faster
  2. Quantize Llama

    • 13B → INT8
    • Size: 26GB → 6.5GB
    • Speed: 2-3x faster
  3. Distill TTS

    • Use smaller Kokoro variant
    • Size: 82M → 40M
    • Speed: 1.5x faster
  4. Add Caching

    • Cache embeddings
    • Cache common responses
    • Speed: 2-3x faster on repeat queries
  5. Enable Streaming

    • Output first token in 100ms
    • Better UX

Result

Metric        | Original | Optimized | Improvement
Model Size    | ~30GB    | ~7GB      | 4.3x smaller
GPU Required  | H100     | A100      | $$ savings
Latency (avg) | 2.5s     | 0.8s      | 3x faster
Latency (p99) | 4s       | 1.2s      | 3.3x faster
Throughput    | 5 req/s  | 15 req/s  | 3x higher
Cost          | $5.50/hr | $2.50/hr  | 55% reduction

Monitoring Optimization

Key Metrics

metrics = {
    'model_size': '7GB',
    'inference_time': 0.8,  # seconds
    'throughput': 15,       # req/s
    'memory_used': 6.2,     # GB
    'gpu_util': 0.85,       # 85%
    'ttft': 0.1,            # 100ms
    'tps': 12.5,            # tokens/sec
}
 
# Alert if degradation
if metrics['inference_time'] > 1.0:
    alert("Inference time degraded!")

Profiling

import cProfile
import pstats
 
profiler = cProfile.Profile()
profiler.enable()
 
output = model(input)
 
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(10)  # top 10 hotspots by cumulative time