Voice Agents Deployment

Overview

Deploying voice agents requires selecting appropriate hardware, cloud platforms, models, and scaling strategies. This guide covers GPU architectures, cloud platforms, cost analysis, and deployment patterns.


GPU Architecture Comparison

Different GPU architectures offer varying performance characteristics for voice model inference.

| Architecture | Release | Example GPUs | Tensor Gen | Special Precision | Approx. Tensor TFLOPS | Notable Features |
|---|---|---|---|---|---|---|
| Turing | 2018 | T4, RTX 20s | 2nd Gen | FP16, INT8 | ~65 (T4, FP16) | Early Tensor Cores; RT cores for RTX |
| Ampere | 2020 | A100, A10G, RTX 30s | 3rd Gen | TF32, FP16, BF16, INT8 | ~312 (A100, FP16) | TF32 training mode, high memory |
| Ada Lovelace | 2022/23 | L4, L40S, RTX 40s | 4th Gen | FP16, BF16, INT8, sparsity | ~180 (L40S, FP16) | Better efficiency, higher clocks |
| Hopper | 2022 | H100, H200 | 4th Gen (Transformer Engine) | FP8, FP16, BF16, INT8 | ~1000 (H100, FP8) | Dynamic mixed-precision |
| Blackwell | 2024+ | B200 | 5th Gen (2nd-gen Transformer Engine) | FP4, FP8, FP16, INT8 | ~2000 (est., FP4/FP8) | Ultra-low precision for LLMs |

GPU Selection Guidelines

| Model Size | Recommended GPU | Use Case |
|---|---|---|
| Small (< 1B params) | T4, L4 | Inference, edge, low-cost |
| Medium (1–13B params) | A10G, L40S | Balanced inference |
| Large (13–70B params) | A100, H100 | High-throughput, fine-tuning |
| XL (70B+ params) | H100, H200, B200 | Multi-GPU deployment |
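These guidelines follow a simple rule of thumb: model weights occupy roughly parameters × bytes per parameter. A minimal sketch (precision sizes are the standard ones; actual usage adds KV cache and runtime buffers on top):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only VRAM estimate: parameters x bytes per parameter.
    FP16/BF16 = 2 bytes, INT8/FP8 = 1, FP4 = 0.5. Real deployments add
    KV cache and runtime buffers on top of this."""
    return params_billions * bytes_per_param

print(estimate_vram_gb(8))        # 16.0 GB -> fits a single L40S (48 GB)
print(estimate_vram_gb(70))       # 140.0 GB -> needs 2x 80 GB A100/H100 at FP16
print(estimate_vram_gb(70, 1.0))  # 70.0 GB -> a single 80 GB GPU with INT8
```

These numbers line up with the model specification table later in this guide (~16GB for an 8B model, ~140GB for 70B at FP16).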

Cloud Deployment Platforms

Platform Comparison Matrix

| Feature | Lightning AI | Modal | Koyeb | RunPod | Fal AI | Cerebrium AI |
|---|---|---|---|---|---|---|
| GPU Types | H100, A100, L40S, T4, A10G | T4, L4, A10G, A100, L40S, H100 | RTX 4000, L4, A6000, L40S, A100, H100 | A4000, A5000, A6000, A100, H100 | A6000, A100, H100, H200, B200 | T4, L4, A10, A100, H100, H200 |
| Monthly Fee | $50/mo | $250 compute/month | $29/mo | None | None | None |
| Cold Start | 78–537s | 1–4s | <200ms | ~200ms–12s | "Near-instant" | ~2s (GPU) |
| Docker Support | ✅ Custom Docker | ✅ Custom Docker | ✅ GitHub deploys | ✅ Full Docker | ✅ fal deploy | ✅ Custom Docker |
| VPC/Networking | ✅ Enterprise VPC | ❌ Internal mesh only | ❌ No external VPC | ✅ Internal VPC | ❌ No external VPC | Region isolation |
| Compliance | SOC2, HIPAA | HIPAA (BAA) | SOC2/HIPAA (in progress) | Custom licenses | — | SOC2, HIPAA |

Pricing Comparison

| Provider | T4 $/hr | A100 $/hr | H100 $/hr |
|---|---|---|---|
| Lightning AI | $0.68 | $2.71 | $5.52 |
| Modal | $0.59 | $2.50 | $3.95 |
| Koyeb | $0.50 | $2.00 | $3.30 |
| RunPod | $0.40–0.58 | $1.33–2.72 | $2.17–4.47 |
| Fal AI | — | $0.99 | $1.89 |
| Cerebrium AI | $0.59 | $2.48 | $4.68 |
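The table can be compared programmatically as a sanity check. A small sketch using the H100 column above (the rates are a snapshot copied from this table; GPU prices change frequently, so verify before committing):

```python
# On-demand H100 rates ($/hr) from the pricing table above.
H100_RATES = {
    "Lightning AI": 5.52,
    "Modal": 3.95,
    "Koyeb": 3.30,
    "RunPod": 2.17,  # low end of the $2.17-4.47 range
    "Fal AI": 1.89,
    "Cerebrium AI": 4.68,
}

# Pick the cheapest provider for this GPU class
cheapest = min(H100_RATES, key=H100_RATES.get)
print(cheapest, H100_RATES[cheapest])  # Fal AI 1.89
```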

Platform Recommendations

For MVP/Testing

  • Fal AI - Low cost, fast cold start
  • Modal - Good balance, reliable
  • RunPod - Affordable, flexible

For Production

  • Lightning AI - Enterprise features, HIPAA
  • Modal - Reliability, monitoring
  • Cerebrium AI - Full-featured, compliance

For Prototyping

  • Koyeb - Simple deployment, GitHub integration
  • RunPod - Flexibility, pay-per-second

Enterprise GPU Hosting Summary

| Provider | H100 $/hr | A100 $/hr | T4 $/hr |
|---|---|---|---|
| Lightning AI | $0.42+ | $0.42+ | $0.42+ |
| Modal | $3.95 | $2.50 | $0.58 |
| Koyeb | $3.30 | $2.00 | — |
| RunPod | $2.17–4.47 | $1.33–2.72 | $0.40–0.58 |
| Fal AI | $1.89 | $0.99 | — |
| Cerebrium AI | $4.68 | $2.48 | $0.58 |
| Replicate | $5.49 | $5.04 | $0.81 |
| Anyscale | Contact | Contact | Contact |
| Together AI | Contact | Contact | Contact |
| Paperspace | $5.95 | $3.09–3.18 | $0.56 |

REST API, WebSocket, Docker, serverless GPU, scale-to-zero, and VPC support vary by provider; see the platform comparison matrix above for the six primary platforms.

LLM & Speech Model Specifications

| Model | Parameters | GPU Memory | Context Length |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | ~16GB | 128k tokens |
| Llama 3.1 70B Instruct | 70B | ~140GB | 128k tokens |
| Qwen 2.5 7B | 7B | ~14GB | 32k tokens |
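Beyond the weights, long contexts add KV-cache memory per active sequence. A sketch of the cache-size arithmetic; the defaults approximate Llama 3.1 8B's GQA configuration (32 layers, 8 KV heads, head dim 128 are assumptions here, so verify against the model config):

```python
def kv_cache_gb(seq_len: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV-cache size: 2 (K and V) x layers x kv_heads x head_dim
    x seq_len x bytes per element (2 for FP16). Defaults approximate
    Llama 3.1 8B with grouped-query attention."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# A full 128k-token context adds ~17 GB of cache on top of the ~16 GB of weights
print(round(kv_cache_gb(128_000), 1))  # 16.8
```

This is why a 128k-context 8B model does not fit comfortably on a 24 GB GPU once long requests arrive, even though the weights alone would.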

Text-to-Speech Models

| Model | Parameters | GPU Memory | Notes |
|---|---|---|---|
| Kokoro | 82M | 326MB | FP32, very efficient |
| Fish Speech | Large | ~2-4GB | High quality |
| Higgs Audio | Large | ~4-8GB | Unified tokenizer |

Speech-to-Text Models

| Model | Parameters | GPU Memory | Notes |
|---|---|---|---|
| Whisper-Medium | 769M | ~5GB | FP16, balanced |
| Whisper-Large | 1.5B | ~10GB | FP16, high accuracy |
| Canary | Custom | ~4-6GB | Better on translation |

Deployment Cost Estimation

Hourly Cost Breakdown

| Provider | Configuration | Cost/Hour | 30-Day 24/7 Cost |
|---|---|---|---|
| Lightning AI | H100 (80GB), 26 CPU, 1513 TPS | $2.70/hr | $1,944 |
| Lightning AI | A100 (80GB), 30 CPU, 312 TPS | $1.55/hr | $1,116 |
| Together AI | Llama inference | $0.88/1M tokens | Variable |
| Together AI | Whisper STT | $0.09/hr | $64.80 |
| Modal | A100 (40GB) | $2.10/hr | $1,512 |
| Modal | H100 (plus ~$0.0473/CPU-core-hr) | $3.95/hr | $2,844+ |
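The 30-day figures above are just the hourly rate times 720 hours. A small helper; the `utilization` knob is an illustrative assumption for modeling scale-to-zero, not a provider API:

```python
HOURS_PER_MONTH = 30 * 24  # 720

def monthly_cost(rate_per_hour: float, utilization: float = 1.0) -> float:
    """30-day always-on cost; scale by utilization if the deployment
    scales to zero during off-peak hours."""
    return rate_per_hour * HOURS_PER_MONTH * utilization

print(round(monthly_cost(2.70)))       # 1944 -- matches the H100 row above
print(round(monthly_cost(2.70, 0.5)))  # 972  -- roughly 50% savings off-peak
```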

Cost Optimization Strategies

  1. Model Selection

    • Smaller models (8B vs 70B) cut costs roughly 4-5x
    • Quantized models (INT8, FP8) cut costs 2-3x
    • Specialized models (Canary vs Whisper) often cheaper
  2. Batching

    • Process multiple requests together
    • Can improve throughput by 5-10x
    • Requires buffering (adds latency)
  3. Caching

    • Cache embeddings
    • Cache model outputs for common queries
    • Reduce redundant computation
  4. Scaling Strategy

    • Auto-scale during peak hours
    • Scale to zero during off-peak
    • Estimated savings: 30-50%
  5. Request Optimization

    • Compress audio before processing
    • Use smaller quantized models for preprocessing
    • Progressive quality increase (start with small model, upgrade if needed)
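Strategy 2 (batching) can be sketched as a micro-batcher that buffers requests until a size or time limit is hit. `MicroBatcher` and its parameters are hypothetical names, and this is a synchronous sketch; production servers (e.g. vLLM, Triton) batch asynchronously with continuous batching:

```python
import time

class MicroBatcher:
    """Collect requests for up to `max_wait_s` or `max_batch` items, then
    flush them as one batch: a little added latency buys GPU throughput."""

    def __init__(self, process_batch, max_batch=8, max_wait_s=0.05):
        self.process_batch = process_batch  # e.g. a batched model forward pass
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.pending = []
        self.deadline = None

    def submit(self, request):
        if not self.pending:
            self.deadline = time.monotonic() + self.max_wait_s
        self.pending.append(request)
        if len(self.pending) >= self.max_batch or time.monotonic() >= self.deadline:
            return self.flush()
        return None  # still buffering

    def flush(self):
        batch, self.pending = self.pending, []
        return self.process_batch(batch)

batcher = MicroBatcher(process_batch=lambda reqs: [r.upper() for r in reqs],
                       max_batch=3, max_wait_s=10)
batcher.submit("a")
batcher.submit("b")
print(batcher.submit("c"))  # ['A', 'B', 'C'] -- flushed on reaching max_batch
```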

Voice Agent Architecture Patterns

Pattern 1: Synchronous Request-Response

User Input → Model Inference → User Output
(Wait for response, simple but has latency)

Use Case: One-shot queries, non-interactive
Latency: 200ms–2s
Platforms: Any cloud provider

Pattern 2: Streaming (Server-Sent Events)

User Input → Streaming Model → Partial Outputs (streamed)

Use Case: Long-form generation, chat
Latency: 0–50ms (first token), then ~100ms chunks
Platforms: Lightning AI, Modal, Koyeb (WebSocket support)
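The streaming pattern can be modeled as a generator that yields partial output as it is produced. The SSE framing and the actual model call are omitted; `stream_tokens` is a stand-in, not a real model API:

```python
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming model: yield output chunk by chunk so the
    client can render text (or play audio) before generation finishes."""
    for word in f"echo: {prompt}".split():
        yield word + " "  # a real server would wrap each chunk as an SSE event

# The client consumes partial output immediately instead of waiting for the reply
chunks = list(stream_tokens("hello world"))
print(chunks)  # ['echo: ', 'hello ', 'world ']
```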

Pattern 3: Full-Duplex (WebSocket)

User Audio ← → Model (simultaneous)
(Listen while speaking)

Use Case: Real-time conversation (Moshi-like)
Latency: <200ms
Platforms: Modal, Koyeb (requires WebSocket)
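The full-duplex property (both directions active at once) can be sketched with asyncio, using a queue and lists as stand-ins for the two directions of a WebSocket:

```python
import asyncio

async def listen(inbound: asyncio.Queue, transcript: list) -> None:
    """Consume user audio frames until the end-of-stream marker (None)."""
    while (frame := await inbound.get()) is not None:
        transcript.append(frame)

async def speak(spoken: list) -> None:
    """Emit agent audio chunks concurrently with listening."""
    for chunk in ("hi,", "how", "can", "I", "help?"):
        spoken.append(chunk)
        await asyncio.sleep(0)  # yield control, as a real socket send would

async def duplex_session():
    inbound: asyncio.Queue = asyncio.Queue()
    for frame in ("user-frame-1", "user-frame-2", None):  # None = end of stream
        inbound.put_nowait(frame)
    transcript, spoken = [], []
    # Both directions run at the same time: that is the full-duplex property
    await asyncio.gather(listen(inbound, transcript), speak(spoken))
    return transcript, spoken

transcript, spoken = asyncio.run(duplex_session())
print(transcript)  # ['user-frame-1', 'user-frame-2']
print(spoken)
```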


Deployment Considerations

Latency Requirements

| Application | Target Latency | RTF Needed |
|---|---|---|
| Batch processing | 10+ seconds | > 0.1 |
| Non-interactive | 1-5 seconds | 0.05-0.1 |
| Interactive chat | 500ms-1s | 0.01-0.05 |
| Real-time voice | <200ms | < 0.01 |
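RTF (real-time factor) in this table is processing time divided by audio duration, so lower is faster:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF < 1 means faster than real time; real-time voice needs a large
    margin (the table above targets < 0.01, i.e. 100x faster than playback)."""
    return processing_s / audio_s

# Transcribing 10 s of audio in 0.08 s:
rtf = real_time_factor(0.08, 10.0)
print(rtf)  # 0.008 -> fast enough for real-time voice
```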

Throughput Planning

Example: Customer Support Bot

Peak load: 100 concurrent users
Avg request: 10 seconds
Throughput needed: ~10 requests/second

GPU capacity:
- H100 at peak: ~15-20 requests/second
- Recommendation: 1 H100 sufficient
- Redundancy: 2-3 H100s for high availability
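The arithmetic above is Little's law (concurrency = arrival rate × service time, rearranged for the arrival rate). A sketch that also folds in spare capacity; the ~15 req/s H100 figure is the estimate from this example, not a benchmark:

```python
import math

def required_gpus(concurrent_users: int, avg_request_s: float,
                  gpu_capacity_rps: float, redundancy: int = 1) -> int:
    """Little's law: arrival rate = concurrency / avg service time.
    Divide by per-GPU capacity, round up, and add spare nodes for failover."""
    arrival_rps = concurrent_users / avg_request_s  # 100 / 10 = 10 req/s
    return math.ceil(arrival_rps / gpu_capacity_rps) + redundancy

# 100 concurrent users, 10 s requests, an H100 handling ~15 req/s:
print(required_gpus(100, 10, 15))               # 2 (1 active + 1 spare)
print(required_gpus(100, 10, 15, redundancy=2)) # 3 for high availability
```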

High Availability Deployment

Load Balancer
  ├─ Region 1: 2× GPU nodes (active-active)
  ├─ Region 2: 2× GPU nodes (failover)
  └─ Region 3: 1× GPU node (failover)

Benefits:
- Handles 2x peak traffic
- Survives region outage
- Auto-healing of failed nodes

Monitoring & Operations

Key Metrics

  • Latency: p50, p95, p99 latency
  • Throughput: Requests per second
  • GPU Utilization: % of GPU compute used
  • Memory: GPU memory usage
  • Queue Length: Requests waiting
  • Error Rate: % of failed requests
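The p50/p95/p99 figures can be computed from raw latency samples with a nearest-rank approach. A stdlib-only sketch; monitoring systems use streaming estimators (t-digest and similar) instead of sorting every sample:

```python
def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: sort and index. Fine for small dashboards;
    not suitable for high-volume streaming metrics."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [120, 95, 110, 480, 105, 98, 130, 900, 115, 102]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Note how the tail percentiles are dominated by the two slow outliers (480 and 900 ms), which is exactly why alerting below keys on p99 rather than the average.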

Alerting

Set up alerts for:

  • P99 latency > threshold
  • GPU utilization > 90%
  • Error rate > 1%
  • Queue length > 10
  • Memory usage > 85%
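These thresholds can be expressed as data-driven rules. In this sketch the rule names, the 500 ms p99 threshold, and `check_alerts` are illustrative placeholders (the source only says "p99 latency > threshold"); tune the values for your workload:

```python
# Alert thresholds from the list above; the p99 cutoff is a placeholder.
ALERT_RULES = {
    "p99_latency_ms":  lambda v: v > 500,
    "gpu_utilization": lambda v: v > 0.90,
    "error_rate":      lambda v: v > 0.01,
    "queue_length":    lambda v: v > 10,
    "memory_usage":    lambda v: v > 0.85,
}

def check_alerts(metrics: dict) -> list:
    """Return the names of all metrics currently breaching their rule."""
    return [name for name, breached in ALERT_RULES.items()
            if name in metrics and breached(metrics[name])]

snapshot = {"p99_latency_ms": 620, "gpu_utilization": 0.72,
            "error_rate": 0.002, "queue_length": 14, "memory_usage": 0.80}
print(check_alerts(snapshot))  # ['p99_latency_ms', 'queue_length']
```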

Next Steps