Voice Agents Deployment
Overview
Deploying voice agents requires selecting appropriate hardware, cloud platforms, models, and scaling strategies. This guide covers GPU architectures, cloud platforms, cost analysis, and deployment patterns.
GPU Architecture Comparison
Different GPU architectures offer varying performance characteristics for voice model inference.
| Architecture | Release | Example GPUs | Tensor Gen | Special Precision | Approx Tensor TFLOPS | Notable Features |
|---|---|---|---|---|---|---|
| Turing | 2018 | T4, RTX 20s | 1st Gen | FP16, INT8 | ~65 (T4, FP16) | First Tensor Cores, RT cores for RTX |
| Ampere | 2020 | A100, A10G, RTX 30s | 3rd Gen | TF32, FP16, BF16, INT8 | ~312 (A100, FP16) | TF32 training mode, high memory |
| Ada Lovelace | 2022/23 | L4, L40S, RTX 40s | 4th Gen | FP16, BF16, INT8, sparsity | ~180 (L40S, FP16) | Better efficiency, higher clocks |
| Hopper | 2022 | H100, H200 | 4th Gen (Transformer Engine) | FP8, FP16, BF16, INT8 | ~1000 (H100, FP8) | Dynamic mixed-precision |
| Blackwell | 2024+ | B200 | 5th Gen (Transformer Engine) | FP4, FP8, FP16, INT8 | ~2000 (est. FP4/8) | Ultra-low precision for LLMs |
GPU Selection Guidelines
| Model Size | Recommended GPU | Use Case |
|---|---|---|
| Small (< 1B params) | T4, L4 | Inference, edge, low-cost |
| Medium (1-13B params) | A10G, L40S | Balanced inference |
| Large (13-70B params) | A100, H100 | High-throughput, fine-tuning |
| XL (70B+ params) | H100, H200, B200 | Multi-model deployment |
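The memory column behind these recommendations follows a rule of thumb: weights times bytes-per-parameter, plus headroom for KV cache and activations. A minimal sketch (the 20% overhead factor is an illustrative assumption, not a fixed constant):

```python
def gpu_memory_gb(params_billion: float, bytes_per_param: int = 2,
                  overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a model.

    bytes_per_param: 2 for FP16/BF16, 1 for INT8/FP8, 4 for FP32.
    overhead: assumed headroom for KV cache and activations.
    """
    return params_billion * bytes_per_param * overhead

# An 8B model in FP16 needs roughly 19 GB -> A10G/L40S tier, not a T4.
print(round(gpu_memory_gb(8), 1))   # → 19.2
```

Quantizing to INT8 (`bytes_per_param=1`) halves the estimate, which is why quantized 13B models still fit mid-tier GPUs.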
Cloud Deployment Platforms
Platform Comparison Matrix
| Feature | Lightning AI | Modal | Koyeb | RunPod | Fal AI | Cerebrium AI |
|---|---|---|---|---|---|---|
| GPU Types | H100, A100, L40S, T4, A10G | T4, L4, A10G, A100, L40S, H100 | RTX 4000, L4, A6000, L40S, A100, H100 | A4000, A5000, A6000, A100, H100 | A6000, A100, H100, H200, B200 | T4, L4, A10, A100, H100, H200 |
| Monthly Fee | $50/mo | $250 compute/month | $29/mo | None | None | None |
| Cold Start | 78–537s | 1–4s | <200ms | ~200ms–12s | "Near-instant" | ~2s (GPU) |
| Docker Support | ✅ Custom Docker | ✅ Custom Docker | ✅ GitHub deploys | ✅ Full Docker | ✅ fal deploy | ✅ Custom Docker |
| VPC/Networking | ✅ Enterprise VPC | ❌ Internal mesh only | ❌ No external VPC | ✅ Internal VPC | ❌ No external VPC | Region isolation |
| Compliance | SOC2, HIPAA | HIPAA (BAA) | ❌ | SOC2/HIPAA (in progress) | Custom licenses | SOC2, HIPAA |
Pricing Comparison
| Provider | T4 $/hr | A100 $/hr | H100 $/hr |
|---|---|---|---|
| Lightning AI | $0.68 | $2.71 | $5.52 |
| Modal | $0.59 | $2.50 | $3.95 |
| Koyeb | $0.50 | $2.00 | $3.30 |
| RunPod | $0.40–0.58 | $1.33–2.72 | $2.17–4.47 |
| Fal AI | — | $0.99 | $1.89 |
| Cerebrium AI | $0.59 | $2.48 | $4.68 |
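Hourly rates are easiest to compare on a monthly, always-on basis (rate × 24 × 30). A small sketch using the A100 rates from the table above; treat the numbers as snapshots, since provider pricing changes frequently:

```python
# A100 $/hr, taken from the pricing comparison table above.
HOURLY_RATES = {
    "Modal": 2.50, "Koyeb": 2.00, "Fal AI": 0.99, "Cerebrium AI": 2.48,
}

def monthly_cost(rate_per_hr: float, hours_per_day: float = 24,
                 days: int = 30) -> float:
    """Cost of an always-on (or partially-on) instance over one month."""
    return rate_per_hr * hours_per_day * days

for provider, rate in sorted(HOURLY_RATES.items(), key=lambda kv: kv[1]):
    print(f"{provider}: ${monthly_cost(rate):,.2f}/month")
    # e.g. "Fal AI: $712.80/month"
```

The spread is large: an always-on A100 ranges from roughly $700 to $1,800 per month across these providers.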
Platform Recommendations
For MVP/Testing
- Fal AI - Low cost, fast cold start
- Modal - Good balance, reliable
- RunPod - Affordable, flexible
For Production
- Lightning AI - Enterprise features, HIPAA
- Modal - Reliability, monitoring
- Cerebrium AI - Full-featured, compliance
For Prototyping
- Koyeb - Simple deployment, GitHub integration
- RunPod - Flexibility, pay-per-second
Enterprise GPU Hosting Summary
| Provider | H100 $/hr | A100 $/hr | T4 $/hr | REST API | WebSocket | Docker | Serverless GPU | Scale-to-Zero | VPC |
|---|---|---|---|---|---|---|---|---|---|
| Lightning AI | $0.42+ | $0.42+ | $0.42+ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Modal | $3.95 | $2.50 | $0.58 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Koyeb | $3.30 | $2.00 | — | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| RunPod | $2.17–4.47 | $1.33–2.72 | $0.40–0.58 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Fal AI | $1.89 | $0.99 | — | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Cerebrium AI | $4.68 | $2.48 | $0.58 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Replicate | $5.49 | $5.04 | $0.81 | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| Anyscale | Contact | Contact | Contact | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Together AI | Contact | Contact | Contact | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Paperspace | $5.95 | $3.09–3.18 | $0.56 | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
LLM & Speech Model Specifications
Recommended Large Language Models
| Model | Parameters | GPU Memory | Context Length |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | ~16GB | 128k tokens |
| Llama 3.1 70B Instruct | 70B | ~140GB | 128k tokens |
| Qwen 2.5 7B | 7B | ~14GB | 32k tokens |
Text-to-Speech Models
| Model | Parameters | GPU Memory | Notes |
|---|---|---|---|
| Kokoro | 82M | 326MB | FP32, very efficient |
| Fish Speech | Large | ~2-4GB | High quality |
| Higgs Audio | Large | ~4-8GB | Unified tokenizer |
Speech-to-Text Models
| Model | Parameters | GPU Memory | Notes |
|---|---|---|---|
| Whisper-Medium | 769M | ~5GB | FP16, balanced |
| Whisper-Large | 1.5B | ~10GB | FP16, high accuracy |
| Canary | Custom | ~4-6GB | Better on translation |
Deployment Cost Estimation
Hourly Cost Breakdown
| Provider | Configuration | Cost/Hour | 30-Day 24/7 Cost |
|---|---|---|---|
| Lightning AI | H100 (80GB), 26 CPU, 1513 TPS | $2.70/hr | $1,944 |
| Lightning AI | A100 (80GB), 30 CPU, 312 TPS | $1.55/hr | $1,116 |
| Together AI | Llama inference | $0.88/1M tokens | Variable |
| Together AI | Whisper STT | $0.09/hr | $64.80 |
| Modal | A100 (40GB) | $2.10/hr | $1,512 |
| Modal | H100 | $3.95/hr | $2,844 |
Cost Optimization Strategies
- Model Selection
  - Smaller models (8B vs 70B) save 4-5x cost
  - Quantized models (INT8, FP8) save 2-3x cost
  - Specialized models (Canary vs Whisper) often cheaper
- Batching
  - Process multiple requests together
  - Can improve throughput by 5-10x
  - Requires buffering (adds latency)
- Caching
  - Cache embeddings
  - Cache model outputs for common queries
  - Reduce redundant computation
- Scaling Strategy
  - Auto-scale during peak hours
  - Scale to zero during off-peak
  - Estimated savings: 30-50%
- Request Optimization
  - Compress audio before processing
  - Use smaller quantized models for preprocessing
  - Progressive quality increase (start with a small model, upgrade if needed)
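The scale-to-zero saving is easy to sanity-check with simple arithmetic. A sketch assuming a hypothetical 14-hour daily peak window and the $3.95/hr H100 rate from the pricing tables (real savings are somewhat lower, since cold starts and scale-up lag are not free):

```python
def autoscale_savings(rate_per_hr: float, peak_hours_per_day: float,
                      days: int = 30) -> tuple[float, float, float]:
    """Compare always-on cost vs scale-to-zero outside peak hours.

    Assumes instant cold starts -- an idealization.
    """
    always_on = rate_per_hr * 24 * days
    autoscaled = rate_per_hr * peak_hours_per_day * days
    return always_on, autoscaled, 1 - autoscaled / always_on

always, scaled, saved = autoscale_savings(3.95, peak_hours_per_day=14)
print(f"${always:.0f} vs ${scaled:.0f} ({saved:.0%} saved)")
# → $2844 vs $1659 (42% saved)
```

A 14-hour active window lands near the middle of the 30-50% savings range quoted above.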
Voice Agent Architecture Patterns
Pattern 1: Synchronous Request-Response
User Input → Model Inference → User Output
(Wait for response, simple but has latency)
Use Case: One-shot queries, non-interactive
Latency: 200ms-2s
Platforms: Any cloud provider
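A minimal sketch of the synchronous pipeline. The `stt`, `llm`, and `tts` functions here are placeholder stubs, not a real API; in a deployment they would call your actual models:

```python
# Placeholder stages -- stand-ins for real STT/LLM/TTS model calls.
def stt(audio: bytes) -> str:
    return "hello"

def llm(text: str) -> str:
    return f"echo: {text}"

def tts(text: str) -> bytes:
    return text.encode()

def handle_request(audio_chunk: bytes) -> bytes:
    """Synchronous request-response: the caller blocks until the full
    reply exists. Total latency = STT + LLM + TTS, which is why this
    pattern sits in the 200ms-2s range."""
    return tts(llm(stt(audio_chunk)))

print(handle_request(b"..."))   # → b'echo: hello'
```

Because every stage completes before the next begins, this pattern is simple to deploy anywhere but cannot hide any stage's latency.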
Pattern 2: Streaming (Server-Sent Events)
User Input → Streaming Model → Partial Outputs (streamed)
Use Case: Long-form generation, chat
Latency: 0-50ms to first token, then ~100ms chunks
Platforms: Lightning AI, Modal, Koyeb (streaming support)
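Streaming maps naturally onto a generator: each `yield` becomes one SSE event forwarded by the web framework. A sketch with a hard-coded token list standing in for real model output:

```python
from typing import Iterator

def stream_reply(prompt: str) -> Iterator[str]:
    """Yield partial outputs as they are produced, instead of waiting
    for the whole response. A framework like FastAPI would forward each
    chunk to the client as a Server-Sent Event."""
    for token in ["Hel", "lo ", "wor", "ld"]:  # stand-in for model tokens
        yield token

# The client sees the first token almost immediately, then steady chunks.
print("".join(stream_reply("hi")))   # → Hello world
```

The key property is that time-to-first-token, not total generation time, dominates perceived latency.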
Pattern 3: Full-Duplex (WebSocket)
User Audio ← → Model (simultaneous)
(Listen while speaking)
Use Case: Real-time conversation (Moshi-like)
Latency: <200ms
Platforms: Modal, Koyeb (requires WebSocket)
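Full duplex means the inbound and outbound directions run concurrently. A sketch using asyncio, with queues standing in for the WebSocket (the audio chunks are placeholder strings):

```python
import asyncio

async def listener(incoming: asyncio.Queue, transcript: list) -> None:
    """Consume user audio continuously -- even while the agent speaks."""
    while (chunk := await incoming.get()) is not None:
        transcript.append(chunk)

async def speaker(outgoing: list) -> None:
    """Produce agent audio concurrently with the listener task."""
    for chunk in ("agent-1", "agent-2"):
        outgoing.append(chunk)
        await asyncio.sleep(0)  # yield control to the listener

async def main() -> None:
    incoming = asyncio.Queue()
    transcript, outgoing = [], []
    for c in ("user-1", "user-2", None):  # None signals end of stream
        incoming.put_nowait(c)
    # Both directions run at once -- the essence of full duplex.
    await asyncio.gather(listener(incoming, transcript), speaker(outgoing))
    print(transcript, outgoing)

asyncio.run(main())
```

In a real deployment, `incoming` and `outgoing` would be the two directions of one WebSocket connection, with the model consuming and producing audio frames on both.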
Deployment Considerations
Latency Requirements
| Application | Target Latency | RTF Needed |
|---|---|---|
| Batch processing | 10+ seconds | > 0.1 |
| Non-interactive | 1-5 seconds | 0.05-0.1 |
| Interactive chat | 500ms-1s | 0.01-0.05 |
| Real-time voice | <200ms | < 0.01 |
Throughput Planning
Example: Customer Support Bot
Peak load: 100 concurrent users
Avg request: 10 seconds
Throughput needed: ~10 requests/second
GPU capacity:
- H100 at peak: ~15-20 requests/second
- Recommendation: 1 H100 sufficient
- Redundancy: 2-3 H100s for high availability
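The worked example follows Little's law: required throughput equals concurrency divided by average time-in-system. A sketch using the ~15 req/s per-H100 figure from above (that figure is workload-dependent, so treat it as a planning input, not a constant):

```python
import math

def required_throughput(concurrent_users: int, avg_request_s: float) -> float:
    """Little's law: arrival rate = concurrency / time-in-system."""
    return concurrent_users / avg_request_s

def gpus_needed(rps_needed: float, rps_per_gpu: float) -> int:
    """Round up -- a fractional GPU still means provisioning a whole one."""
    return math.ceil(rps_needed / rps_per_gpu)

rps = required_throughput(100, 10)   # the customer-support example above
print(rps, gpus_needed(rps, 15))     # → 10.0 1
```

Doubling peak load to 200 users pushes the requirement to 20 req/s, which still fits one H100 at the upper end of its range but leaves no headroom, hence the 2-3 node recommendation for high availability.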
High Availability Deployment
Load Balancer
├─ Region 1: 2× GPU nodes (active-active)
├─ Region 2: 2× GPU nodes (failover)
└─ Region 3: 1× GPU node (failover)
Benefits:
- Handles 2x peak traffic
- Survives region outage
- Auto-healing of failed nodes
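The failover order in the diagram can be sketched as a simple health-aware node selection; the region and node names below are hypothetical, mirroring the diagram:

```python
from typing import Optional

# Hypothetical node pool matching the diagram above.
NODES = {
    "region-1": ["r1-gpu-a", "r1-gpu-b"],   # active-active
    "region-2": ["r2-gpu-a", "r2-gpu-b"],   # failover
    "region-3": ["r3-gpu-a"],               # failover
}

def pick_node(healthy: set) -> Optional[str]:
    """Prefer region-1; fall through to failover regions when it is down."""
    for region in ("region-1", "region-2", "region-3"):
        candidates = [n for n in NODES[region] if n in healthy]
        if candidates:
            return candidates[0]
    return None  # total outage: every node unhealthy

# Region 1 outage: traffic shifts to region 2.
print(pick_node({"r2-gpu-a", "r2-gpu-b", "r3-gpu-a"}))   # → r2-gpu-a
```

A production load balancer would additionally spread traffic across healthy nodes within a region rather than always picking the first, but the fall-through ordering is the same.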
Monitoring & Operations
Key Metrics
- Latency: p50, p95, p99 latency
- Throughput: Requests per second
- GPU Utilization: % of GPU compute used
- Memory: GPU memory usage
- Queue Length: Requests waiting
- Error Rate: % of failed requests
Alerting
Set up alerts for:
- P99 latency > threshold
- GPU utilization > 90%
- Error rate > 1%
- Queue length > 10
- Memory usage > 85%
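These checks can be codified as a threshold table evaluated against live metrics. A sketch; the metric names and the 500ms p99 limit are illustrative assumptions, the other limits come from the list above:

```python
# Alert thresholds; p99 limit of 500ms is an assumed example value.
THRESHOLDS = {
    "p99_latency_ms": 500,
    "gpu_util_pct": 90,
    "error_rate_pct": 1.0,
    "queue_length": 10,
    "mem_usage_pct": 85,
}

def check_alerts(metrics: dict) -> list[str]:
    """Return the names of metrics that crossed their alert threshold.
    Missing metrics are treated as 0 (i.e. not alerting)."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

print(check_alerts({"p99_latency_ms": 620, "gpu_util_pct": 75,
                    "queue_length": 12}))
# → ['p99_latency_ms', 'queue_length']
```

In practice these rules would live in your monitoring system (e.g. as alerting rules) rather than application code, but the threshold-per-metric shape is the same.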
Next Steps
- See 07-Evaluation-Testing.md for quality assurance
- See 08-Inference-Optimization.md for making inference faster/cheaper
- See 09-Resources.md for tools and platforms