Voice Agents Deployment
Overview
Deploying voice agents requires selecting appropriate hardware, cloud platforms, models, and scaling strategies. This guide covers GPU architectures, cloud platforms, cost analysis, and deployment patterns.
GPU Architecture Comparison
Different GPU architectures offer varying performance characteristics for voice model inference.
| Architecture | Release | Example GPUs | Tensor Gen | Special Precision | Approx Tensor TFLOPS | Notable Features |
|---|---|---|---|---|---|---|
| Turing | 2018 | T4, RTX 20s | 1st Gen | FP16, INT8 | ~65 (T4, FP16) | First Tensor Cores, RT cores for RTX |
| Ampere | 2020 | A100, A10G, RTX 30s | 3rd Gen | TF32, FP16, BF16, INT8 | ~312 (A100, FP16) | TF32 training mode, high memory |
| Ada Lovelace | 2022/23 | L4, L40S, RTX 40s | 4th Gen | FP16, BF16, INT8, sparsity | ~180 (L40S, FP16) | Better efficiency, higher clocks |
| Hopper | 2022 | H100, H200 | 4th Gen (Transformer Engine) | FP8, FP16, BF16, INT8 | ~1000 (H100, FP8) | Dynamic mixed-precision |
| Blackwell | 2024+ | B200 | 5th Gen (Transformer Engine) | FP4, FP8, FP16, INT8 | ~2000 (est. FP4/8) | Ultra-low precision for LLMs |
GPU Selection Guidelines
| Model Size | Recommended GPU | Use Case |
|---|---|---|
| Small (< 1B params) | T4, L4 | Inference, edge, low-cost |
| Medium (1-13B params) | A10G, L40S | Balanced inference |
| Large (13-70B params) | A100, H100 | High-throughput, fine-tuning |
| XL (70B+ params) | H100, H200, B200 | Multi-model deployment |
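The memory column behind these recommendations follows a rule of thumb: weights times bytes-per-parameter, plus headroom for KV cache and activations. A minimal sketch (the 20% overhead factor is an illustrative assumption, not a fixed constant):

```python
def gpu_memory_gb(params_billion: float, bytes_per_param: int = 2,
                  overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a model.

    bytes_per_param: 2 for FP16/BF16, 1 for INT8/FP8, 4 for FP32.
    overhead: assumed headroom for KV cache and activations.
    """
    return params_billion * bytes_per_param * overhead

# An 8B model in FP16 needs roughly 19 GB -> A10G/L40S tier, not a T4.
print(round(gpu_memory_gb(8), 1))   # → 19.2
```

Quantizing to INT8 (`bytes_per_param=1`) halves the estimate, which is why quantized 13B models still fit mid-tier GPUs.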
Cloud Deployment Platforms
Platform Comparison Matrix
| Feature | Lightning AI | Modal | Koyeb | RunPod | Fal AI | Cerebrium AI |
|---|---|---|---|---|---|---|
| GPU Types | H100, A100, L40S, T4, A10G | T4, L4, A10G, A100, L40S, H100 | RTX 4000, L4, A6000, L40S, A100, H100 | A4000, A5000, A6000, A100, H100 | A6000, A100, H100, H200, B200 | T4, L4, A10, A100, H100, H200 |
| Monthly Fee | $50/mo | $250 compute/month | $29/mo | None | None | None |
| Cold Start | 78–537s | 1–4s | <200ms | ~200ms–12s | "Near-instant" | ~2s (GPU) |
| Docker Support | ✅ Custom Docker | ✅ Custom Docker | ✅ GitHub deploys | ✅ Full Docker | ✅ fal deploy | ✅ Custom Docker |
| VPC/Networking | ✅ Enterprise VPC | ❌ Internal mesh only | ❌ No external VPC | ✅ Internal VPC | ❌ No external VPC | Region isolation |
| Compliance | SOC2, HIPAA | HIPAA (BAA) | ❌ | SOC2/HIPAA (in progress) | Custom licenses | SOC2, HIPAA |
Pricing Comparison
| Provider | T4 $/hr | A100 $/hr | H100 $/hr |
|---|---|---|---|
| Lightning AI | $0.68 | $2.71 | $5.52 |
| Modal | $0.59 | $2.50 | $3.95 |
| Koyeb | $0.50 | $2.00 | $3.30 |
| RunPod | $0.40–0.58 | $1.33–2.72 | $2.17–4.47 |
| Fal AI | — | $0.99 | $1.89 |
| Cerebrium AI | $0.59 | $2.48 | $4.68 |
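Hourly rates are easiest to compare on a monthly, always-on basis (rate × 24 × 30). A small sketch using the A100 rates from the table above; treat the numbers as snapshots, since provider pricing changes frequently:

```python
# A100 $/hr, taken from the pricing comparison table above.
HOURLY_RATES = {
    "Modal": 2.50, "Koyeb": 2.00, "Fal AI": 0.99, "Cerebrium AI": 2.48,
}

def monthly_cost(rate_per_hr: float, hours_per_day: float = 24,
                 days: int = 30) -> float:
    """Cost of an always-on (or partially-on) instance over one month."""
    return rate_per_hr * hours_per_day * days

for provider, rate in sorted(HOURLY_RATES.items(), key=lambda kv: kv[1]):
    print(f"{provider}: ${monthly_cost(rate):,.2f}/month")
    # e.g. "Fal AI: $712.80/month"
```

The spread is large: an always-on A100 ranges from roughly $700 to $1,800 per month across these providers.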
Platform Recommendations
For MVP/Testing
- Fal AI - Low cost, fast cold start
- Modal - Good balance, reliable
- RunPod - Affordable, flexible
For Production
- Lightning AI - Enterprise features, HIPAA
- Modal - Reliability, monitoring
- Cerebrium AI - Full-featured, compliance
For Prototyping
- Koyeb - Simple deployment, GitHub integration
- RunPod - Flexibility, pay-per-second
Enterprise GPU Hosting Summary
| Provider | H100 $/hr | A100 $/hr | T4 $/hr | REST API | WebSocket | Docker | Serverless GPU | Scale-to-Zero | VPC |
|---|---|---|---|---|---|---|---|---|---|
| Lightning AI | $0.42+ | $0.42+ | $0.42+ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Modal | $3.95 | $2.50 | $0.58 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Koyeb | $3.30 | $2.00 | — | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| RunPod | $2.17–4.47 | $1.33–2.72 | $0.40–0.58 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Fal AI | $1.89 | $0.99 | — | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Cerebrium AI | $4.68 | $2.48 | $0.58 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Replicate | $5.49 | $5.04 | $0.81 | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| Anyscale | Contact | Contact | Contact | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Together AI | Contact | Contact | Contact | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Paperspace | $5.95 | $3.09–3.18 | $0.56 | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
LLM & Speech Model Specifications
Recommended Large Language Models
| Model | Parameters | GPU Memory | Context Length |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | ~16GB | 128k tokens |
| Llama 3.1 70B Instruct | 70B | ~140GB | 128k tokens |
| Qwen 2.5 7B | 7B | ~14GB | 32k tokens |
Text-to-Speech Models
| Model | Parameters | GPU Memory | Notes |
|---|---|---|---|
| Kokoro | 82M | 326MB | FP32, very efficient |
| Fish Speech | Large | ~2-4GB | High quality |
| Higgs Audio | Large | ~4-8GB | Unified tokenizer |
Speech-to-Text Models
| Model | Parameters | GPU Memory | Notes |
|---|---|---|---|
| Whisper-Medium | 769M | ~5GB | FP16, balanced |
| Whisper-Large | 1.5B | ~10GB | FP16, high accuracy |
| Canary | Custom | ~4-6GB | Better on translation |
Deployment Cost Estimation
Hourly Cost Breakdown
| Provider | Configuration | Cost/Hour | 30-Day 24/7 Cost |
|---|---|---|---|
| Lightning AI | H100 (80GB), 26 CPU, 1513 TPS | $2.70/hr | $1,944 |
| Lightning AI | A100 (80GB), 30 CPU, 312 TPS | $1.55/hr | $1,116 |
| Together AI | Llama inference | $0.88/1M tokens | Variable |
| Together AI | Whisper STT | $0.09/hr | $64.80 |
| Modal | A100 (40GB) | $2.10/hr | $1,512 |
| Modal | H100 | $3.95/hr | $2,844 |
Cost Optimization Strategies
- Model Selection
  - Smaller models (8B vs 70B) save 4-5x cost
  - Quantized models (INT8, FP8) save 2-3x cost
  - Specialized models (Canary vs Whisper) often cheaper
- Batching
  - Process multiple requests together
  - Can improve throughput by 5-10x
  - Requires buffering (adds latency)
- Caching
  - Cache embeddings
  - Cache model outputs for common queries
  - Reduce redundant computation
- Scaling Strategy
  - Auto-scale during peak hours
  - Scale to zero during off-peak
  - Estimated savings: 30-50%
- Request Optimization
  - Compress audio before processing
  - Use smaller quantized models for preprocessing
  - Progressive quality increase (start with a small model, upgrade if needed)
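The scale-to-zero saving is easy to sanity-check with simple arithmetic. A sketch assuming a hypothetical 14-hour daily peak window and the $3.95/hr H100 rate from the pricing tables (real savings are somewhat lower, since cold starts and scale-up lag are not free):

```python
def autoscale_savings(rate_per_hr: float, peak_hours_per_day: float,
                      days: int = 30) -> tuple[float, float, float]:
    """Compare always-on cost vs scale-to-zero outside peak hours.

    Assumes instant cold starts -- an idealization.
    """
    always_on = rate_per_hr * 24 * days
    autoscaled = rate_per_hr * peak_hours_per_day * days
    return always_on, autoscaled, 1 - autoscaled / always_on

always, scaled, saved = autoscale_savings(3.95, peak_hours_per_day=14)
print(f"${always:.0f} vs ${scaled:.0f} ({saved:.0%} saved)")
# → $2844 vs $1659 (42% saved)
```

A 14-hour active window lands near the middle of the 30-50% savings range quoted above.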
Voice Agent Architecture Patterns
Pattern 1: Synchronous Request-Response
User Input → Model Inference → User Output
(Wait for response, simple but has latency)
Use Case: One-shot queries, non-interactive
Latency: 200ms-2s
Platforms: Any cloud provider
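A minimal sketch of the synchronous pipeline. The `stt`, `llm`, and `tts` functions here are placeholder stubs, not a real API; in a deployment they would call your actual models:

```python
# Placeholder stages -- stand-ins for real STT/LLM/TTS model calls.
def stt(audio: bytes) -> str:
    return "hello"

def llm(text: str) -> str:
    return f"echo: {text}"

def tts(text: str) -> bytes:
    return text.encode()

def handle_request(audio_chunk: bytes) -> bytes:
    """Synchronous request-response: the caller blocks until the full
    reply exists. Total latency = STT + LLM + TTS, which is why this
    pattern sits in the 200ms-2s range."""
    return tts(llm(stt(audio_chunk)))

print(handle_request(b"..."))   # → b'echo: hello'
```

Because every stage completes before the next begins, this pattern is simple to deploy anywhere but cannot hide any stage's latency.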
Pattern 2: Streaming (Server-Sent Events)
User Input → Streaming Model → Partial Outputs (streamed)
Use Case: Long-form generation, chat
Latency: 0-50ms to first token, then ~100ms chunks
Platforms: Lightning AI, Modal, Koyeb (streaming support)
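Streaming maps naturally onto a generator: each `yield` becomes one SSE event forwarded by the web framework. A sketch with a hard-coded token list standing in for real model output:

```python
from typing import Iterator

def stream_reply(prompt: str) -> Iterator[str]:
    """Yield partial outputs as they are produced, instead of waiting
    for the whole response. A framework like FastAPI would forward each
    chunk to the client as a Server-Sent Event."""
    for token in ["Hel", "lo ", "wor", "ld"]:  # stand-in for model tokens
        yield token

# The client sees the first token almost immediately, then steady chunks.
print("".join(stream_reply("hi")))   # → Hello world
```

The key property is that time-to-first-token, not total generation time, dominates perceived latency.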
Pattern 3: Full-Duplex (WebSocket)
User Audio ← → Model (simultaneous)
(Listen while speaking)
Use Case: Real-time conversation (Moshi-like)
Latency: <200ms
Platforms: Modal, Koyeb (requires WebSocket)
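Full duplex means the inbound and outbound directions run concurrently. A sketch using asyncio, with queues standing in for the WebSocket (the audio chunks are placeholder strings):

```python
import asyncio

async def listener(incoming: asyncio.Queue, transcript: list) -> None:
    """Consume user audio continuously -- even while the agent speaks."""
    while (chunk := await incoming.get()) is not None:
        transcript.append(chunk)

async def speaker(outgoing: list) -> None:
    """Produce agent audio concurrently with the listener task."""
    for chunk in ("agent-1", "agent-2"):
        outgoing.append(chunk)
        await asyncio.sleep(0)  # yield control to the listener

async def main() -> None:
    incoming = asyncio.Queue()
    transcript, outgoing = [], []
    for c in ("user-1", "user-2", None):  # None signals end of stream
        incoming.put_nowait(c)
    # Both directions run at once -- the essence of full duplex.
    await asyncio.gather(listener(incoming, transcript), speaker(outgoing))
    print(transcript, outgoing)

asyncio.run(main())
```

In a real deployment, `incoming` and `outgoing` would be the two directions of one WebSocket connection, with the model consuming and producing audio frames on both.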
Deployment Considerations
Latency Requirements
| Application | Target Latency | RTF Needed |
|---|---|---|
| Batch processing | 10+ seconds | > 0.1 |
| Non-interactive | 1-5 seconds | 0.05-0.1 |
| Interactive chat | 500ms-1s | 0.01-0.05 |
| Real-time voice | <200ms | < 0.01 |
Throughput Planning
Example: Customer Support Bot
Peak load: 100 concurrent users
Avg request: 10 seconds
Throughput needed: ~10 requests/second
GPU capacity:
- H100 at peak: ~15-20 requests/second
- Recommendation: 1 H100 sufficient
- Redundancy: 2-3 H100s for high availability
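The worked example follows Little's law: required throughput equals concurrency divided by average time-in-system. A sketch using the ~15 req/s per-H100 figure from above (that figure is workload-dependent, so treat it as a planning input, not a constant):

```python
import math

def required_throughput(concurrent_users: int, avg_request_s: float) -> float:
    """Little's law: arrival rate = concurrency / time-in-system."""
    return concurrent_users / avg_request_s

def gpus_needed(rps_needed: float, rps_per_gpu: float) -> int:
    """Round up -- a fractional GPU still means provisioning a whole one."""
    return math.ceil(rps_needed / rps_per_gpu)

rps = required_throughput(100, 10)   # the customer-support example above
print(rps, gpus_needed(rps, 15))     # → 10.0 1
```

Doubling peak load to 200 users pushes the requirement to 20 req/s, which still fits one H100 at the upper end of its range but leaves no headroom, hence the 2-3 node recommendation for high availability.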
High Availability Deployment
Load Balancer
├─ Region 1: 2× GPU nodes (active-active)
├─ Region 2: 2× GPU nodes (failover)
└─ Region 3: 1× GPU node (failover)
Benefits:
- Handles 2x peak traffic
- Survives region outage
- Auto-healing of failed nodes
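The failover order in the diagram can be sketched as a simple health-aware node selection; the region and node names below are hypothetical, mirroring the diagram:

```python
from typing import Optional

# Hypothetical node pool matching the diagram above.
NODES = {
    "region-1": ["r1-gpu-a", "r1-gpu-b"],   # active-active
    "region-2": ["r2-gpu-a", "r2-gpu-b"],   # failover
    "region-3": ["r3-gpu-a"],               # failover
}

def pick_node(healthy: set) -> Optional[str]:
    """Prefer region-1; fall through to failover regions when it is down."""
    for region in ("region-1", "region-2", "region-3"):
        candidates = [n for n in NODES[region] if n in healthy]
        if candidates:
            return candidates[0]
    return None  # total outage: every node unhealthy

# Region 1 outage: traffic shifts to region 2.
print(pick_node({"r2-gpu-a", "r2-gpu-b", "r3-gpu-a"}))   # → r2-gpu-a
```

A production load balancer would additionally spread traffic across healthy nodes within a region rather than always picking the first, but the fall-through ordering is the same.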
Monitoring & Operations
Key Metrics
- Latency: p50, p95, p99 latency
- Throughput: Requests per second
- GPU Utilization: % of GPU compute used
- Memory: GPU memory usage
- Queue Length: Requests waiting
- Error Rate: % of failed requests
Alerting
Set up alerts for:
- P99 latency > threshold
- GPU utilization > 90%
- Error rate > 1%
- Queue length > 10
- Memory usage > 85%
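These checks can be codified as a threshold table evaluated against live metrics. A sketch; the metric names and the 500ms p99 limit are illustrative assumptions, the other limits come from the list above:

```python
# Alert thresholds; p99 limit of 500ms is an assumed example value.
THRESHOLDS = {
    "p99_latency_ms": 500,
    "gpu_util_pct": 90,
    "error_rate_pct": 1.0,
    "queue_length": 10,
    "mem_usage_pct": 85,
}

def check_alerts(metrics: dict) -> list[str]:
    """Return the names of metrics that crossed their alert threshold.
    Missing metrics are treated as 0 (i.e. not alerting)."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

print(check_alerts({"p99_latency_ms": 620, "gpu_util_pct": 75,
                    "queue_length": 12}))
# → ['p99_latency_ms', 'queue_length']
```

In practice these rules would live in your monitoring system (e.g. as alerting rules) rather than application code, but the threshold-per-metric shape is the same.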
Next Steps
- See 07-Evaluation-Testing.md for quality assurance
- See 08-Inference-Optimization.md for making inference faster/cheaper
- See 09-Resources.md for tools and platforms