Model Compression Techniques
1. Pruning
Concept: Remove unnecessary weights from the model.
Structured Pruning
- Remove entire neurons, filters, or layers
- Maintains model structure
- Easy to optimize on standard hardware
- Typical accuracy loss: 5-10%
Original: 100 neurons
Pruned (50%): 50 neurons
Speed improvement: ~2x
Accuracy loss: ~5-8%
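The neuron-count example above can be sketched with NumPy; the L2-norm importance score and the `prune_neurons` helper are illustrative choices, not a specific framework API:

```python
import numpy as np

def prune_neurons(weight, keep_ratio=0.5):
    """Structured pruning: drop whole neurons (rows of the weight
    matrix) with the smallest L2 norm, keeping the layer dense."""
    n_keep = max(1, int(weight.shape[0] * keep_ratio))
    norms = np.linalg.norm(weight, axis=1)       # one importance score per neuron
    keep = np.sort(np.argsort(norms)[-n_keep:])  # indices of the strongest neurons
    return weight[keep]                          # smaller, still-dense matrix

w = np.random.randn(100, 64)                     # 100 neurons, 64 inputs each
w_pruned = prune_neurons(w, keep_ratio=0.5)
print(w_pruned.shape)  # (50, 64) -- half the neurons, standard dense layout
```

Because the result is a smaller dense matrix, it runs on standard hardware with no sparse-kernel support.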
Unstructured Pruning
- Remove individual weights
- Better compression ratios
- Requires special hardware support
- Typical accuracy loss: 2-5%
Original: 100 neurons × 100 weights = 10,000 weights
Pruned (80%): 2,000 weights remaining
Speed improvement: ~2-3x (with hardware support)
Accuracy loss: ~2-4%
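A minimal NumPy sketch of magnitude-based unstructured pruning, matching the 80% figure above; `prune_weights` is a hypothetical helper, not a library call:

```python
import numpy as np

def prune_weights(weight, sparsity=0.8):
    """Unstructured pruning: zero out the smallest-magnitude weights
    individually, leaving a sparse matrix of the same shape."""
    flat = np.abs(weight).ravel()
    k = int(flat.size * sparsity)
    threshold = np.partition(flat, k)[k]   # magnitude cutoff for survival
    mask = np.abs(weight) >= threshold
    return weight * mask, mask

w = np.random.randn(100, 100)              # 10,000 weights
w_sparse, mask = prune_weights(w, sparsity=0.8)
print(int(mask.sum()))                     # 2000 weights remain
```

The matrix keeps its shape, so realizing the speedup requires sparse kernels or hardware support, as noted above.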
Tools
- Neural Magic - Pruned model zoo
- PyTorch built-in pruning API
2. Quantization
Concept: Reduce numerical precision of weights and activations.
Common Quantization Schemes
| Scheme | Bits | Range | Loss |
|---|---|---|---|
| FP32 | 32 | -3.4e38 to 3.4e38 | Baseline |
| FP16 | 16 | -65,504 to 65,504 | < 1% |
| BF16 | 16 | -3.4e38 to 3.4e38 | < 1% |
| INT8 | 8 | -128 to 127 | 2-5% |
| INT4 | 4 | -8 to 7 | 5-10% |
| FP8 | 8 | ±448 (E4M3) or ±57,344 (E5M2) | 1-3% |
Benefits
- 2-4x smaller model size
- 2-4x faster inference
- Lower memory usage
- Reduced bandwidth
Trade-offs
- Small accuracy loss (usually < 2%)
- Requires careful calibration
- Some hardware limitations
Quantization Methods
Post-Training Quantization (PTQ)
- Apply after training
- Fast, simple
- Slight accuracy loss
- Best choice for most applications
Quantization-Aware Training (QAT)
- Train with quantization in mind
- Better accuracy
- More complex
- Use when accuracy critical
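As a concrete PTQ illustration, here is a minimal symmetric per-tensor INT8 scheme in NumPy, with the scale calibrated from the tensor's maximum absolute value (real toolkits add per-channel scales and calibration datasets):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric post-training quantization: map floats to INT8 using a
    single per-tensor scale chosen from the calibration max."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes)   # 0.25 -> 4x smaller
```

The round-trip error is bounded by half the scale, which is where the small accuracy loss in the table above comes from.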
Tools
- GPTQ - Excellent LLM quantization
- AWQ - Better than GPTQ in some cases
- Ollama - Easy local serving of pre-quantized models
- vLLM - Built-in quantization support
3. Knowledge Distillation
Concept: Train a smaller model (student) to mimic a larger model (teacher).
Process
- Teacher Training: Train large, accurate model
- Temperature Scaling: Teacher generates soft targets (probabilities instead of hard labels)
- Student Training: Smaller model learns to match teacher’s outputs
- Fine-tuning: Optional fine-tuning for task-specific accuracy
Benefits
- Smaller model maintains teacher’s quality
- Better than training student from scratch
- 5-10x smaller models possible
Trade-offs
- Teacher model must be available
- Training time (but inference is fast)
- Limited by teacher quality
Example
Teacher: Llama 70B (70B params)
↓
Knowledge Distillation
↓
Student: Custom 7B model
↓
Result: 7B model with near-70B quality
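The soft-target part of the student's objective can be sketched as a temperature-scaled KL divergence (a common formulation; real setups usually mix in a hard-label cross-entropy term as well):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T                              # temperature softening
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the standard distillation recipe."""
    p = softmax(teacher_logits, T)         # teacher's soft targets
    q = softmax(student_logits, T)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
    return float(kl * T * T)

teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[3.5, 1.2, 0.4]])
print(distillation_loss(student, teacher))  # small positive number
```

Higher temperatures flatten the teacher's distribution, exposing the relative probabilities of wrong classes, which is the extra signal the student learns from.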
4. Pseudo-Labeling
Concept: Use large model to generate labels for unlabeled data, then train smaller model.
Process
- Large model labels unlabeled data
- Smaller model trains on pseudo-labeled data
- Often combined with knowledge distillation
Best For
- Limited labeled data
- Domain-specific applications
- Few-shot learning
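The process above, sketched in Python; the `(label, confidence)` model API and the 0.9 threshold are assumptions for illustration:

```python
def pseudo_label(large_model, unlabeled, threshold=0.9):
    """Keep only predictions the large model is confident about;
    low-confidence examples stay unlabeled rather than adding noise."""
    labeled = []
    for x in unlabeled:
        label, confidence = large_model(x)  # assumed (label, prob) API
        if confidence >= threshold:
            labeled.append((x, label))
    return labeled

# Toy stand-in for a large model: confident only on even numbers.
def toy_model(x):
    return ("even" if x % 2 == 0 else "odd", 0.95 if x % 2 == 0 else 0.6)

data = pseudo_label(toy_model, range(10))
print(len(data))  # only the confident examples survive
```

The smaller model then trains on `data` as if it were human-labeled, optionally combined with a distillation loss.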
Streaming & Real-Time Optimization
Streaming Inference
Concept: Generate output in chunks while still receiving input.
Benefits
- Lower latency (first token appears quickly)
- Better user experience
- Processes input incrementally
Implementation
```python
def streaming_inference(audio_stream):
    # `encode` and `model` stand in for the app's tokenizer and model
    for audio_chunk in audio_stream:
        # Process chunk
        tokens = encode(audio_chunk)
        # Generate response incrementally
        for output_token in model.generate_streaming(tokens):
            yield output_token  # Send to user immediately
```

Metrics
- Time-to-First-Token (TTFT): Time until first output
- Tokens-Per-Second (TPS): Throughput of generation
Targets:
- TTFT: < 100ms
- TPS: > 10 tokens/sec
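Both metrics can be measured directly on any streaming generator; the toy `fake_stream` below stands in for `model.generate_streaming`:

```python
import time

def measure_streaming(token_stream):
    """Measure time-to-first-token and tokens-per-second over a
    streaming generator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else float("inf")
    return ttft, tps

# Toy generator standing in for a model's streaming output.
def fake_stream(n=20, delay=0.005):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_streaming(fake_stream())
print(f"TTFT: {ttft*1000:.1f} ms, TPS: {tps:.1f} tokens/s")
```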
Delayed Sequence Modeling (DSM)
Concept: Buffer audio slightly before processing to optimize computation.
When Used
- Streaming ASR: Buffer 100-300ms of audio before finalizing words
- Streaming TTS: Buffer tokens for smoother prosody
- Real-time Translation: Buffer words for better context
Trade-offs
- Small latency increase
- Better accuracy and naturalness
- Improved output quality
Example
Without DSM:
Input: "Hel..." → Output: "Hel..."
Latency: low, but may need correction
With DSM (300ms buffer):
Input: "Hello..." (buffered) → Output: "Hello"
Latency: +300ms, but accurate first-time
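A minimal sketch of the buffering step, assuming fixed-length 100 ms chunks arriving as bytes:

```python
def buffered_chunks(audio_stream, buffer_ms=300, chunk_ms=100):
    """Accumulate ~buffer_ms of audio before emitting, trading a fixed
    startup delay for more context per processing step."""
    chunks_per_buffer = max(1, buffer_ms // chunk_ms)
    buffer = []
    for chunk in audio_stream:
        buffer.append(chunk)
        if len(buffer) >= chunks_per_buffer:
            yield b"".join(buffer)
            buffer = []
    if buffer:                     # flush the tail at end of stream
        yield b"".join(buffer)

stream = (b"x" * 10 for _ in range(7))   # seven toy 100 ms chunks
out = list(buffered_chunks(stream))
print([len(b) for b in out])             # [30, 30, 10]
```

Each emitted buffer carries 300 ms of context, which is what lets the recognizer finalize "Hello" instead of "Hel...".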
Hardware Acceleration
Inference Servers
vLLM
- Fast LLM serving
- PagedAttention for memory efficiency
- Batch processing support
- Streaming support
```shell
# Install
pip install vllm

# Serve model (AWQ quantization requires an AWQ-quantized checkpoint)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --quantization awq
```

Triton Server
- Multi-framework serving
- Model optimization
- Dynamic batching
- Ensemble support
```shell
# Create model repository
mkdir -p triton-models/voicellm/1
cp model.onnx triton-models/voicellm/1/

# Serve (mount the repository so /models exists inside the container)
docker run --gpus all -p 8000:8000 \
    -v "$(pwd)/triton-models:/models" \
    nvcr.io/nvidia/tritonserver:latest \
    tritonserver --model-repository=/models
```

MLC-LLM
- Optimized for edge and mobile
- Quantized model support
- Cross-platform compilation
```shell
# Install
pip install mlc-llm

# Compile model
mlc_llm compile models/llama-2-7b
```

LiteRT (formerly TensorFlow Lite)
- Google’s edge inference runtime
- Optimized for on-device
- Small footprint (1-5MB for TTS model)
```python
from ai_edge_litert.interpreter import Interpreter

# Load optimized model
interpreter = Interpreter(model_path="model_quantized.tflite")
interpreter.allocate_tensors()   # set up input/output buffers

# Run inference
interpreter.invoke()
```

Specialized Hardware
| Hardware | Best For | Speed-up | Cost |
|---|---|---|---|
| GPU (H100) | Throughput, batch | 100x CPU | $$$$ |
| TPU | Large batch training | 50-100x CPU | $$$ (cloud only) |
| NPU | Edge inference, streaming | 10-20x CPU | $$$ |
| CPU | Universal, low cost | Baseline | $ |
Caching Strategies
Attention Cache (KV Cache)
Problem: Recomputing all previous tokens every step
Solution: Cache Key and Value matrices
Without KV cache:
Step 1: Compute KV for token 1 → Output token 2
Step 2: Recompute KV for tokens 1-2 → Output token 3
Step 3: Recompute KV for tokens 1-3 → Output token 4
Cost: O(n²) computation
With KV cache:
Step 1: Compute KV for token 1 → Output token 2, cache KV
Step 2: Reuse cached KV + compute new KV → Output token 3
Step 3: Reuse cached KVs + compute new KV → Output token 4
Cost: O(n) computation
Savings: 2-10x speedup depending on sequence length
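A NumPy sketch of the cached path: each step computes K/V only for the newest token and appends them to the cache before attending (single head, no learned projections, for clarity):

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())   # stable softmax over positions
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
keys, values = [], []                   # the KV cache
outputs = []
for step in range(4):
    q = rng.normal(size=d)              # query for the newest token
    k = rng.normal(size=d)              # K/V computed once for this token...
    v = rng.normal(size=d)
    keys.append(k)                      # ...then cached and reused every step
    values.append(v)
    outputs.append(attention(q, np.stack(keys), np.stack(values)))
print(len(outputs), outputs[-1].shape)
```

Without the cache, `k` and `v` would be recomputed for every past token at every step, which is the O(n²) cost described above.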
Embedding Cache
Concept: Cache pre-computed embeddings for repeated inputs.
```python
# Cache speaker embeddings so each one is computed only once
speaker_embedding_cache = {}

for speaker_id in speaker_ids:
    if speaker_id not in speaker_embedding_cache:
        # speaker_audio maps each id to its reference audio
        speaker_embedding_cache[speaker_id] = \
            compute_speaker_embedding(speaker_audio[speaker_id])
    embedding = speaker_embedding_cache[speaker_id]
    # Use embedding for TTS
```

Response Cache
Concept: Cache common queries and responses.
```python
response_cache = {}

def get_response(query):
    # Check cache
    if query in response_cache:
        return response_cache[query]
    # Compute if not cached
    response = model(query)
    response_cache[query] = response
    return response
```

Savings: 5-100x speedup for repeated queries
Batch Processing
Dynamic Batching
Process multiple requests together:
```python
# `queue` and `model` are illustrative server components

# Collect requests for up to 100ms
requests = queue.get_batch(timeout=0.1)

# Process together
outputs = model.batch_infer(requests)

# Send back
for req, output in zip(requests, outputs):
    req.respond(output)
```

Throughput Improvement: 5-20x (depending on batch size)
Batch Size Impact
| Batch Size | Latency | Throughput |
|---|---|---|
| 1 | 100ms | 10 req/s |
| 4 | 120ms | 33 req/s |
| 8 | 150ms | 53 req/s |
| 16 | 200ms | 80 req/s |
| 32 | 300ms | 107 req/s |
Sweet Spot: Usually 4-8 for real-time, 16-32 for batch
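The throughput column follows directly from batch size and latency, since a full batch completes together:

```python
def throughput(batch_size, latency_s):
    """Requests per second when batch_size requests finish together."""
    return batch_size / latency_s

# Reproduce the table above: latency grows sub-linearly with batch size,
# so throughput keeps rising even as per-request latency creeps up.
for bs, lat in [(1, 0.100), (4, 0.120), (8, 0.150), (16, 0.200), (32, 0.300)]:
    print(f"batch {bs:2d}: {throughput(bs, lat):5.1f} req/s")
```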
End-to-End Optimization Example
Scenario: Deploy Whisper + Llama + TTS
Original Setup
- Whisper Medium (769M) + Llama 13B + Kokoro TTS
- Cost: H100 @ $5.50/hr
- Latency: 2-3 seconds
- Throughput: 5 req/s
Optimizations
1. Quantize Whisper
- Medium → INT8
- Size: ~3.1GB (FP32) → ~0.8GB
- Speed: 1.3x faster
2. Quantize Llama
- 13B → INT4
- Size: 26GB (FP16) → ~6.5GB
- Speed: 2-3x faster
3. Distill TTS
- Use smaller Kokoro variant
- Size: 82M → 40M
- Speed: 1.5x faster
4. Add Caching
- Cache embeddings
- Cache common responses
- Speed: 2-3x faster on repeat queries
5. Enable Streaming
- Output first token in ~100ms
- Better UX
Result
| Metric | Original | Optimized | Improvement |
|---|---|---|---|
| Model Size | ~30GB | ~7GB | 4.3x |
| GPU required | H100 | A100 | $$ savings |
| Latency (avg) | 2.5s | 0.8s | 3x faster |
| Latency (p99) | 4s | 1.2s | 3.3x faster |
| Throughput | 5 req/s | 15 req/s | 3x higher |
| Cost | $5.50/hr | $2.50/hr | 55% reduction |
Monitoring Optimization
Key Metrics
```python
metrics = {
    'model_size': '7GB',
    'inference_time': 0.8,   # seconds
    'throughput': 15,        # req/s
    'memory_used': 6.2,      # GB
    'gpu_util': 0.85,        # 85%
    'ttft': 0.1,             # 100ms
    'tps': 12.5,             # tokens/sec
}

# Alert if degradation (alert() is the app's notification hook)
if metrics['inference_time'] > 1.0:
    alert("Inference time degraded!")
```

Profiling
```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
output = model(input)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(20)  # Show where time is spent
```