Model Compression Techniques
1. Pruning
Concept: Remove unnecessary weights from the model.
Structured Pruning
- Remove entire neurons, filters, or layers
- Maintains model structure
- Easy to optimize on standard hardware
- Typical accuracy loss: 5-10%
Original: 100 neurons
Pruned (50%): 50 neurons
Speed improvement: ~2x
Accuracy loss: ~5-8%
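The neuron-count example above can be sketched with NumPy; the L2-norm importance score and the `prune_neurons` helper are illustrative choices, not a specific framework API:

```python
import numpy as np

def prune_neurons(weight, keep_ratio=0.5):
    """Structured pruning: drop whole neurons (rows of the weight
    matrix) with the smallest L2 norm, keeping the layer dense."""
    n_keep = max(1, int(weight.shape[0] * keep_ratio))
    norms = np.linalg.norm(weight, axis=1)       # one importance score per neuron
    keep = np.sort(np.argsort(norms)[-n_keep:])  # indices of the strongest neurons
    return weight[keep]                          # smaller, still-dense matrix

w = np.random.randn(100, 64)                     # 100 neurons, 64 inputs each
w_pruned = prune_neurons(w, keep_ratio=0.5)
print(w_pruned.shape)  # (50, 64) -- half the neurons, standard dense layout
```

Because the result is a smaller dense matrix, it runs on standard hardware with no sparse-kernel support.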
Unstructured Pruning
- Remove individual weights
- Better compression ratios
- Requires special hardware support
- Typical accuracy loss: 2-5%
Original: 100 neurons × 100 weights = 10,000 weights
Pruned (80%): 2,000 weights remaining
Speed improvement: ~2-3x (with hardware support)
Accuracy loss: ~2-4%
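A minimal NumPy sketch of magnitude-based unstructured pruning, matching the 80% figure above; `prune_weights` is a hypothetical helper, not a library call:

```python
import numpy as np

def prune_weights(weight, sparsity=0.8):
    """Unstructured pruning: zero out the smallest-magnitude weights
    individually, leaving a sparse matrix of the same shape."""
    flat = np.abs(weight).ravel()
    k = int(flat.size * sparsity)
    threshold = np.partition(flat, k)[k]   # magnitude cutoff for survival
    mask = np.abs(weight) >= threshold
    return weight * mask, mask

w = np.random.randn(100, 100)              # 10,000 weights
w_sparse, mask = prune_weights(w, sparsity=0.8)
print(int(mask.sum()))                     # 2000 weights remain
```

The matrix keeps its shape, so realizing the speedup requires sparse kernels or hardware support, as noted above.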
Tools
- Neural Magic - Pruned model zoo
- PyTorch built-in pruning API
2. Quantization
Concept: Reduce numerical precision of weights and activations.
Common Quantization Schemes
| Scheme | Bits | Range | Loss |
|---|---|---|---|
| FP32 | 32 | -3.4e38 to 3.4e38 | Baseline |
| FP16 | 16 | -65,504 to 65,504 | < 1% |
| BF16 | 16 | -3.4e38 to 3.4e38 | < 1% |
| INT8 | 8 | -128 to 127 | 2-5% |
| INT4 | 4 | -8 to 7 | 5-10% |
| FP8 | 8 | ±448 (E4M3) or ±57,344 (E5M2) | 1-3% |
Benefits
- 2-4x smaller model size
- 2-4x faster inference
- Lower memory usage
- Reduced bandwidth
Trade-offs
- Small accuracy loss (usually < 2%)
- Requires careful calibration
- Some hardware limitations
Quantization Methods
Post-Training Quantization (PTQ)
- Apply after training
- Fast, simple
- Slight accuracy loss
- Best choice for most applications
Quantization-Aware Training (QAT)
- Train with quantization in mind
- Better accuracy
- More complex
- Use when accuracy critical
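As a concrete PTQ illustration, here is a minimal symmetric per-tensor INT8 scheme in NumPy, with the scale calibrated from the tensor's maximum absolute value (real toolkits add per-channel scales and calibration datasets):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric post-training quantization: map floats to INT8 using a
    single per-tensor scale chosen from the calibration max."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes)   # 0.25 -> 4x smaller
```

The round-trip error is bounded by half the scale, which is where the small accuracy loss in the table above comes from.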
Tools
- GPTQ - Excellent LLM quantization
- AWQ - Better than GPTQ in some cases
- Ollama - Easy local serving of pre-quantized models
- vLLM - Built-in quantization support
3. Knowledge Distillation
Concept: Train a smaller model (student) to mimic a larger model (teacher).
Process
- Teacher Training: Train large, accurate model
- Temperature Scaling: Teacher generates soft targets (probabilities instead of hard labels)
- Student Training: Smaller model learns to match teacher’s outputs
- Fine-tuning: Optional fine-tuning for task-specific accuracy
Benefits
- Smaller model maintains teacher’s quality
- Better than training student from scratch
- 5-10x smaller models possible
Trade-offs
- Teacher model must be available
- Training time (but inference is fast)
- Limited by teacher quality
Example
Teacher: Llama 70B (70B params)
↓
Knowledge Distillation
↓
Student: Custom 7B model
↓
Result: 7B model with near-70B quality
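The soft-target part of the student's objective can be sketched as a temperature-scaled KL divergence (a common formulation; real setups usually mix in a hard-label cross-entropy term as well):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T                              # temperature softening
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the standard distillation recipe."""
    p = softmax(teacher_logits, T)         # teacher's soft targets
    q = softmax(student_logits, T)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
    return float(kl * T * T)

teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[3.5, 1.2, 0.4]])
print(distillation_loss(student, teacher))  # small positive number
```

Higher temperatures flatten the teacher's distribution, exposing the relative probabilities of wrong classes, which is the extra signal the student learns from.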
4. Pseudo-Labeling
Concept: Use large model to generate labels for unlabeled data, then train smaller model.
Process
- Large model labels unlabeled data
- Smaller model trains on pseudo-labeled data
- Often combined with knowledge distillation
Best For
- Limited labeled data
- Domain-specific applications
- Few-shot learning
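The process above, sketched in Python; the `(label, confidence)` model API and the 0.9 threshold are assumptions for illustration:

```python
def pseudo_label(large_model, unlabeled, threshold=0.9):
    """Keep only predictions the large model is confident about;
    low-confidence examples stay unlabeled rather than adding noise."""
    labeled = []
    for x in unlabeled:
        label, confidence = large_model(x)  # assumed (label, prob) API
        if confidence >= threshold:
            labeled.append((x, label))
    return labeled

# Toy stand-in for a large model: confident only on even numbers.
def toy_model(x):
    return ("even" if x % 2 == 0 else "odd", 0.95 if x % 2 == 0 else 0.6)

data = pseudo_label(toy_model, range(10))
print(len(data))  # only the confident examples survive
```

The smaller model then trains on `data` as if it were human-labeled, optionally combined with a distillation loss.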
Streaming & Real-Time Optimization
Streaming Inference
Concept: Generate output in chunks while still receiving input.
Benefits
- Lower latency (first token appears quickly)
- Better user experience
- Processes input incrementally
Implementation
```python
def streaming_inference(audio_stream):
    # `encode` and `model` stand in for the app's tokenizer and model
    for audio_chunk in audio_stream:
        # Process chunk
        tokens = encode(audio_chunk)
        # Generate response incrementally
        for output_token in model.generate_streaming(tokens):
            yield output_token  # Send to user immediately
```

Metrics
- Time-to-First-Token (TTFT): Time until first output
- Tokens-Per-Second (TPS): Throughput of generation
Targets:
- TTFT: < 100ms
- TPS: > 10 tokens/sec
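Both metrics can be measured directly on any streaming generator; the toy `fake_stream` below stands in for `model.generate_streaming`:

```python
import time

def measure_streaming(token_stream):
    """Measure time-to-first-token and tokens-per-second over a
    streaming generator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else float("inf")
    return ttft, tps

# Toy generator standing in for a model's streaming output.
def fake_stream(n=20, delay=0.005):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_streaming(fake_stream())
print(f"TTFT: {ttft*1000:.1f} ms, TPS: {tps:.1f} tokens/s")
```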
Delayed Sequence Modeling (DSM)
Concept: Buffer audio slightly before processing to optimize computation.
When Used
- Streaming ASR: Buffer 100-300ms of audio before finalizing words
- Streaming TTS: Buffer tokens for smoother prosody
- Real-time Translation: Buffer words for better context
Trade-offs
- Small latency increase
- Better accuracy and naturalness
- Improved output quality
Example
Without DSM:
Input: "Hel..." → Output: "Hel..."
Latency: low, but may need correction
With DSM (300ms buffer):
Input: "Hello..." (buffered) → Output: "Hello"
Latency: +300ms, but accurate first-time
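A minimal sketch of the buffering step, assuming fixed-length 100 ms chunks arriving as bytes:

```python
def buffered_chunks(audio_stream, buffer_ms=300, chunk_ms=100):
    """Accumulate ~buffer_ms of audio before emitting, trading a fixed
    startup delay for more context per processing step."""
    chunks_per_buffer = max(1, buffer_ms // chunk_ms)
    buffer = []
    for chunk in audio_stream:
        buffer.append(chunk)
        if len(buffer) >= chunks_per_buffer:
            yield b"".join(buffer)
            buffer = []
    if buffer:                     # flush the tail at end of stream
        yield b"".join(buffer)

stream = (b"x" * 10 for _ in range(7))   # seven toy 100 ms chunks
out = list(buffered_chunks(stream))
print([len(b) for b in out])             # [30, 30, 10]
```

Each emitted buffer carries 300 ms of context, which is what lets the recognizer finalize "Hello" instead of "Hel...".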
Hardware Acceleration
Inference Servers
vLLM
- Fast LLM serving
- PagedAttention for memory efficiency
- Batch processing support
- Streaming support
```shell
# Install
pip install vllm

# Serve model (AWQ quantization requires an AWQ-quantized checkpoint)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --quantization awq
```

Triton Server
- Multi-framework serving
- Model optimization
- Dynamic batching
- Ensemble support
```shell
# Create model repository
mkdir -p triton-models/voicellm/1
cp model.onnx triton-models/voicellm/1/

# Serve (mount the repository so /models exists inside the container)
docker run --gpus all -p 8000:8000 \
    -v "$(pwd)/triton-models:/models" \
    nvcr.io/nvidia/tritonserver:latest \
    tritonserver --model-repository=/models
```

MLC-LLM
- Optimized for edge and mobile
- Quantized model support
- Cross-platform compilation
```shell
# Install
pip install mlc-llm

# Compile model
mlc_llm compile models/llama-2-7b
```

LiteRT (formerly TensorFlow Lite)
- Google’s edge inference runtime
- Optimized for on-device
- Small footprint (1-5MB for TTS model)
```python
from ai_edge_litert.interpreter import Interpreter

# Load optimized model
interpreter = Interpreter(model_path="model_quantized.tflite")
interpreter.allocate_tensors()   # set up input/output buffers

# Run inference
interpreter.invoke()
```

Specialized Hardware
| Hardware | Best For | Speed-up | Cost |
|---|---|---|---|
| GPU (H100) | Throughput, batch | 100x CPU | $$$$ |
| TPU | Large batch training | 50-100x CPU | $$$ (cloud only) |
| NPU | Edge inference, streaming | 10-20x CPU | $$$ |
| CPU | Universal, low cost | Baseline | $ |
Caching Strategies
Attention Cache (KV Cache)
Problem: Recomputing all previous tokens every step
Solution: Cache Key and Value matrices
Without KV cache:
Step 1: Compute KV for token 1 → Output token 2
Step 2: Recompute KV for tokens 1-2 → Output token 3
Step 3: Recompute KV for tokens 1-3 → Output token 4
Cost: O(n²) computation
With KV cache:
Step 1: Compute KV for token 1 → Output token 2, cache KV
Step 2: Reuse cached KV + compute new KV → Output token 3
Step 3: Reuse cached KVs + compute new KV → Output token 4
Cost: O(n) computation
Savings: 2-10x speedup depending on sequence length
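A NumPy sketch of the cached path: each step computes K/V only for the newest token and appends them to the cache before attending (single head, no learned projections, for clarity):

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())   # stable softmax over positions
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 8
keys, values = [], []                   # the KV cache
outputs = []
for step in range(4):
    q = rng.normal(size=d)              # query for the newest token
    k = rng.normal(size=d)              # K/V computed once for this token...
    v = rng.normal(size=d)
    keys.append(k)                      # ...then cached and reused every step
    values.append(v)
    outputs.append(attention(q, np.stack(keys), np.stack(values)))
print(len(outputs), outputs[-1].shape)
```

Without the cache, `k` and `v` would be recomputed for every past token at every step, which is the O(n²) cost described above.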
Embedding Cache
Concept: Cache pre-computed embeddings for repeated inputs.
```python
# Cache speaker embeddings so each one is computed only once
speaker_embedding_cache = {}

for speaker_id in speaker_ids:
    if speaker_id not in speaker_embedding_cache:
        # speaker_audio maps each id to its reference audio
        speaker_embedding_cache[speaker_id] = \
            compute_speaker_embedding(speaker_audio[speaker_id])
    embedding = speaker_embedding_cache[speaker_id]
    # Use embedding for TTS
```

Response Cache
Concept: Cache common queries and responses.
```python
response_cache = {}

def get_response(query):
    # Check cache
    if query in response_cache:
        return response_cache[query]
    # Compute if not cached
    response = model(query)
    response_cache[query] = response
    return response
```

Savings: 5-100x speedup for repeated queries
Batch Processing
Dynamic Batching
Process multiple requests together:
```python
# `queue` and `model` are illustrative server components

# Collect requests for up to 100ms
requests = queue.get_batch(timeout=0.1)

# Process together
outputs = model.batch_infer(requests)

# Send back
for req, output in zip(requests, outputs):
    req.respond(output)
```

Throughput Improvement: 5-20x (depending on batch size)
Batch Size Impact
| Batch Size | Latency | Throughput |
|---|---|---|
| 1 | 100ms | 10 req/s |
| 4 | 120ms | 33 req/s |
| 8 | 150ms | 53 req/s |
| 16 | 200ms | 80 req/s |
| 32 | 300ms | 107 req/s |
Sweet Spot: Usually 4-8 for real-time, 16-32 for batch
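The throughput column follows directly from batch size and latency, since a full batch completes together:

```python
def throughput(batch_size, latency_s):
    """Requests per second when batch_size requests finish together."""
    return batch_size / latency_s

# Reproduce the table above: latency grows sub-linearly with batch size,
# so throughput keeps rising even as per-request latency creeps up.
for bs, lat in [(1, 0.100), (4, 0.120), (8, 0.150), (16, 0.200), (32, 0.300)]:
    print(f"batch {bs:2d}: {throughput(bs, lat):5.1f} req/s")
```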
End-to-End Optimization Example
Scenario: Deploy Whisper + Llama + TTS
Original Setup
- Whisper Medium (769M) + Llama 13B + Kokoro TTS
- Cost: H100 @ $5.50/hr
- Latency: 2-3 seconds
- Throughput: 5 req/s
Optimizations
1. Quantize Whisper
- Medium → INT8
- Size: ~3.1GB (FP32) → ~0.8GB
- Speed: 1.3x faster
2. Quantize Llama
- 13B → INT4
- Size: 26GB (FP16) → ~6.5GB
- Speed: 2-3x faster
3. Distill TTS
- Use smaller Kokoro variant
- Size: 82M → 40M
- Speed: 1.5x faster
4. Add Caching
- Cache embeddings
- Cache common responses
- Speed: 2-3x faster on repeat queries
5. Enable Streaming
- Output first token in ~100ms
- Better UX
Result
| Metric | Original | Optimized | Improvement |
|---|---|---|---|
| Model Size | ~30GB | ~7GB | 4.3x |
| GPU required | H100 | A100 | $$ savings |
| Latency (avg) | 2.5s | 0.8s | 3x faster |
| Latency (p99) | 4s | 1.2s | 3.3x faster |
| Throughput | 5 req/s | 15 req/s | 3x higher |
| Cost | $5.50/hr | $2.50/hr | 55% reduction |
Monitoring Optimization
Key Metrics
```python
metrics = {
    'model_size': '7GB',
    'inference_time': 0.8,   # seconds
    'throughput': 15,        # req/s
    'memory_used': 6.2,      # GB
    'gpu_util': 0.85,        # 85%
    'ttft': 0.1,             # 100ms
    'tps': 12.5,             # tokens/sec
}

# Alert if degradation (alert() is the app's notification hook)
if metrics['inference_time'] > 1.0:
    alert("Inference time degraded!")
```

Profiling
```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
output = model(input)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(20)  # Show where time is spent
```