Over the past few months I’ve been building a fully open-source voice agent, exploring the stack end-to-end and learning a ton along the way. Now I’m ready to share everything I discovered.

The best part? In 2025 you actually can build one yourself. With today’s open-source models and frameworks, you can piece together a real-time voice agent that listens, reasons, and talks back almost like a human, without relying on closed platforms.

Let’s walk through the building blocks, step by step.

The Core Pipeline

At a high level, a modern voice agent looks like this:

User audio → VAD (detect when speech starts and stops) → STT (transcribe) → LLM (reason and respond) → TTS (synthesize speech) → audio back to the user

Pretty simple on paper, but each step has its own challenges. Let’s dig deeper.

Speech-to-Text (STT)

Speech is a continuous audio wave; it doesn’t come with clear sentence boundaries or pauses built in. That’s where Voice Activity Detection (VAD) comes in:

  • VAD (Voice Activity Detection): Detects when the user starts and stops talking. Without it, your bot either cuts you off too soon or stares at you blankly.

Once the boundaries are clear, the audio is passed into an STT model for transcription.

| Factor | Silero VAD | WebRTC VAD | TEN VAD | Yamnet VAD | Cobra (Picovoice) |
| --- | --- | --- | --- | --- | --- |
| Accuracy | State-of-the-art, >95% in multi-noise | Good for silence/non-silence; lower speech/noise discrimination | High, lower false positives than WebRTC/Silero | Good, multi-class capable | Top-tier (see Picovoice benchmarks) |
| Latency | <1ms per 30+ms chunk (CPU/GPU/ONNX) | 10–30ms frame decision, ultra low-lag | 2–5ms (real-time capable) | 5–10ms/classify | 5–10ms/classify |
| Chunk Size | 30, 60, 100ms selector | 10–30ms | 20ms, 40ms custom | 30–50ms | 30–50ms |
| Noise Robustness | Excellent, trained on 100+ noises | Poor for some background noise/overlapping speech | Excellent | Moderate | Excellent |
| Language Support | 6000+ languages, no domain restriction | Language-agnostic, good for basic speech/silence | Language-agnostic | Multi-language possible | Language-agnostic |
| Footprint | ~2MB JIT, <1MB ONNX, minimal CPU/edge | ~158KB binary, extremely light | ~400KB | ~2MB (.tflite format) | Small, edge-ready |
| Streaming Support | Yes, supports real-time pipelines | Yes, designed for telecom/audio streams | Yes, real-time | Yes | Yes |
| Integration | Python, ONNX, PyTorch, Pipecat, edge/IoT | C/C++/Python, embedded/web/mobile | Python, C++, web | TensorFlow Lite APIs | Python, C, web, WASM |
| Licensing | MIT (commercial/edge/distribution OK) | BSD (very permissive) | Apache 2.0, open | Apache 2.0 | Apache 2.0 |

Silero VAD is the gold standard, and Pipecat has built-in support for it, so that’s what I chose:

  • Sub-1ms per chunk on CPU
  • Just 2MB in size
  • Handles 6000+ languages
  • Works with 8kHz & 16kHz audio
  • MIT license (unrestricted use)
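
To show how little code the VAD step takes, here’s a minimal sketch that runs Silero VAD over a recorded clip via torch.hub (the file name is a placeholder; inside the agent, Pipecat’s Silero integration does this on the live audio stream):

```python
import torch

# Load Silero VAD from torch.hub (downloads the ~2MB model on first run).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# Read a 16kHz mono clip; "user_audio.wav" is a placeholder file name.
wav = read_audio("user_audio.wav", sampling_rate=16000)

# Returns a list of {"start": ..., "end": ...} sample offsets for speech regions.
speech_segments = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_segments)
```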

Here’s what to focus on when choosing an STT model for a voice agent:

  • Accuracy:
    • Word Error Rate (WER): Measures transcription mistakes (lower is better).
      • Example: WER 5% means 5 mistakes per 100 words.
    • Sentence-level correctness: Some models may get individual words right but fail on sentence structure.
  • Multilingual support: If your users speak multiple languages, check language coverage.
  • Noise tolerance: Can it handle background noise, music, or multiple speakers?
  • Accent/voice variation handling: Works across accents, genders, and speech speeds.
  • Voice Activity Detection (VAD) integration: Detects when speech starts and ends.
  • Streaming: Most STT models work in batch mode (great for YouTube captions, bad for live conversations). For real-time agents, we need streaming output: words should appear while you’re still speaking.
  • Low Latency: Even 300–500ms delays feel unnatural. Target sub-second responses.
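
To make the WER criterion above concrete, here’s a quick sketch that scores a hypothesis transcript against a reference, assuming the jiwer package is installed:

```python
from jiwer import wer

reference = "turn off the lights in the living room"
hypothesis = "turn of the lights in living room"

# WER = (substitutions + deletions + insertions) / number of words in the reference
print(f"WER: {wer(reference, hypothesis):.2%}")
```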

Whisper is usually the first name that comes to mind for speech-to-text: it has a large community, numerous variants, and OpenAI behind it.

OpenAI Whisper Family

  • Whisper Large V3 — State-of-the-art accuracy with multilingual support
  • Faster-Whisper — Optimized implementation using CTranslate2
  • Distil-Whisper — Lightweight for resource-constrained environments
  • WhisperX — Enhanced timestamps and speaker diarization

NVIDIA also offers some interesting STT models, though I haven’t tried them yet since Whisper works well for my use case. I’m just listing them here for you to explore:

Here’s the comparison table:

| Model | WER (EN, Public Bench.) | Multilingual | Noise/Accent/Voice | Sentence Accuracy | VAD Integration | Streaming | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Whisper Large V3 | 2–5% | 99+ | Excellent | Excellent | Yes (Silero) | Batch† | ~700ms† |
| Faster-Whisper | 2–5% | 99+ | Excellent | Excellent | Yes (Silero) | Yes | ~300ms‡ |
| Canary 1B | 3.06% (MLS EN) | 4 (EN, DE, ES, FR) | Top-tier, fair on voice/gender/age | Excellent | Yes | Yes | ~500ms–<1s |
| Parakeet TDT 0.6B | 5–7% | 3 (EN, DE, FR) | Good | Very Good | Yes | Yes | Ultra Low (~3,400x real-time) |

Why I Chose FastWhisper

After testing, my pick is FastWhisper, an optimized inference engine for Whisper.

Key Advantages:

  • 12.5× faster than original Whisper
  • 3× faster than Faster-Whisper with batching
  • Sub-200ms latency possible with proper tuning
  • Same accuracy as Whisper
  • Runs on CPU & GPU with automatic fallback

It’s built on CTranslate2 (a C++ inference engine), supports batching, and integrates neatly with VAD.
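
As a rough illustration of the transcription step, here’s a minimal sketch using the faster-whisper package’s API (the CTranslate2-based engines share this general shape; the model size, device, and file name are placeholders to tune for your hardware):

```python
from faster_whisper import WhisperModel

# "large-v3" gives the best accuracy; distil/medium variants trade accuracy for speed.
model = WhisperModel("large-v3", device="auto", compute_type="int8")

# vad_filter=True runs Silero VAD internally to skip non-speech audio.
segments, info = model.transcribe("user_audio.wav", beam_size=5, vad_filter=True)

print(f"Detected language: {info.language}")
for segment in segments:  # segments is a generator, so results stream as they decode
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```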

For more, you can check the Speech to Text AI Model & Provider Leaderboard.

Large Language Model (LLM)

Once speech is transcribed, the text goes into an LLM: the “brain” of your agent.

What we want in an LLM for voice agents:

  • Understands prompts, history, and context
  • Generates responses quickly
  • Supports tool calls (search, RAG, memory, APIs)

Leading Open-Source LLMs

Meta Llama Family

  • Llama 3.3 70B — Open-source leader
  • Llama 3.2 (1B, 3B, 11B) — Scaled for different deployments
  • 128K context window — remembers long conversations
  • Tool calling support — built-in function execution

Others

My Choice: Llama 3.3 70B Versatile

Why?

  • Large context window → keeps conversations coherent
  • Tool use built-in
  • Widely supported in the open-source community
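
For illustration, here’s a minimal sketch of calling the model with a tool over an OpenAI-compatible API. The endpoint, API key, model ID, and the get_weather tool are all assumptions for the example; swap in whatever server you use to host Llama 3.3 (Groq, vLLM, Ollama, etc.):

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint serving Llama 3.3 70B works; the base_url,
# key, and model ID below are assumptions (this one matches Groq's naming).
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_API_KEY")

# A hypothetical tool, just to show the shape of function calling.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a concise, friendly voice assistant."},
        {"role": "user", "content": "What's the weather like in Paris right now?"},
    ],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model decided to call a tool
    print(message.tool_calls[0].function.name, message.tool_calls[0].function.arguments)
else:                   # or it answered directly
    print(message.content)
```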

Text-to-Speech (TTS)

Now the agent needs to speak back, and this is where quality can make or break the experience.

A poor TTS voice instantly ruins immersion. The key requirements are:

  • Low latency: avoid awkward pauses
  • Natural speech: no robotic tone
  • Streaming output: start speaking mid-sentence

Open-Source TTS Models I’ve Tried

There are plenty of open-source TTS models available. Here’s a snapshot of the ones I experimented with:

  • Kokoro-82M — Lightweight, #1 on HuggingFace TTS Arena, blazing fast
  • Chatterbox — Built on Llama, fast inference, rising adoption
  • XTTS-v2 — Zero-shot voice cloning, 17 languages, streaming support
  • FishSpeech — Natural dialogue flow
  • Orpheus — Scales from 150M–3B
  • Dia — A TTS model capable of generating ultra-realistic dialogue in one pass.

| Factor | Kokoro-82M | Chatterbox | XTTS-v2 | FishSpeech | Orpheus |
| --- | --- | --- | --- | --- | --- |
| Voice Naturalness | Human-like, top-rated in community | Very natural, quickly improving | High, especially with good samples | Natural, especially for dialogue | Good, scales with model size |
| Expressiveness / Emotion | Moderate, some emotional range | Good, improving | High, can mimic sample emotion | Moderate, aims for conversational flow | Moderate-High, model-dependent |
| Accent / Language Coverage | 8+ languages (EN, JP, ZH, FR, more) | EN-focused, expanding | 17+ languages, strong global support | Several; focus varies | Varies by checkpoint (3B supports many) |
| Latency / Inference | <300ms for any length, streaming-first | Fast inference, suitable for real-time | ~500ms (depends on hardware), good streaming support | ~400ms, streaming variants | 3B: ~1s+ (large), 150M: fast (CPU/no-GPU) |
| Streaming Support | Yes, natural dialogue with chunked streaming | Yes | Yes, early output | Yes | Yes (3B may be slower) |
| Resource Usage | Extremely light (<300MB), great for CPU/edge | Moderate (500M params), GPU preferred | Moderate-high, 500M+ params, GPU preferred | Moderate, CPU/GPU | 150M–3B options (higher = more GPU/memory) |
| Quantization / Optimization | 8-bit available, runs on most hardware | Some support | Yes, 8-bit/4-bit | Yes | Yes |
| Voice Cloning / Custom | Not by default, needs training | Via fine-tuning | Zero-shot (few seconds of target voice) | Beta, improving cloning | Fine-tuning supported for custom voices |
| Documentation / Community | Active, rich demos, open source, growing | Good docs, quickly growing | Very large (Coqui), strong docs | Medium but positive community | Medium, active research group |
| License | Apache 2.0 (commercial OK) | Commercial/proprietary use may require license | LGPL-3.0, open (see repo) | See repo, mostly permissive | Apache 2.0 |
| Pretrained Voices / Demos | Yes (multiple voices, demos available) | Yes, continually adding more | Yes, huge library, instant demo | Yes | Yes (many public models on Hugging Face) |

Why I Chose Kokoro-82M

Key Advantages:

  • 5–15× smaller than competing models while maintaining high quality
  • Runs under 300MB — edge-device friendly
  • Sub-300ms latency
  • High-fidelity 24kHz audio
  • Streaming-first design — natural conversation flow

Limitations:

  • No zero-shot voice cloning (uses a fixed voice library)
  • Less expressive than XTTS-v2
  • Relatively new model with a smaller community

You can also check out my minimal Kokoro-FastAPI server to experiment with it:
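
If you’d rather call the model directly before wiring it into a server, here’s a minimal sketch based on the kokoro Python package’s KPipeline interface (the lang_code and voice values are examples; check the model card for the full list):

```python
import soundfile as sf
from kokoro import KPipeline

# 'a' selects American English; other codes cover the additional languages.
pipeline = KPipeline(lang_code="a")

text = "Hello! I'm your open-source voice agent. How can I help you today?"

# The pipeline yields audio chunk by chunk, which is what makes streaming playback easy.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"kokoro_chunk_{i}.wav", audio, 24000)  # 24kHz output
```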

Speech-to-Speech Models

Speech-to-Speech (S2S) models represent an exciting advancement in AI, combining speech recognition, language understanding, and text-to-speech synthesis into a single, end-to-end pipeline. These models allow natural, real-time conversations by converting speech input directly into speech output, reducing latency and minimizing intermediate processing steps.

Some notable models in this space include:

  • Moshi: Developed by Kyutai-Labs, Moshi is a state-of-the-art speech-text foundation model designed for real-time full-duplex dialogue. Unlike traditional voice agents that process ASR, LLM, and TTS separately, Moshi handles the entire flow end-to-end.

  • CSM (Conversational Speech Model): A speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

  • VALL-E & VALL-E X (Microsoft): These models support zero-shot voice conversion and speech-to-speech synthesis from limited voice samples.

  • AudioLM (Google Research): Leverages language modeling on audio tokens to generate high-quality speech continuation and synthesis.

Among these, I’ve primarily worked with Moshi. I’ve implemented it on a FastAPI server with streaming support, which allows you to test and interact with it in real-time. You can explore the FastAPI implementation here: FastAPI + Moshi GitHub.

Framework (The Glue)

Finally, you need something to tie all the pieces together: streaming audio, message passing, and orchestration.

Open-Source Frameworks

Pipecat

  • Purpose-built for voice-first agents
  • Streaming-first (ultra-low latency)
  • Modular design — swap models easily
  • Active community

Vocode

  • Developer-friendly, good docs
  • Direct telephony integration
  • Smaller community, less active

LiveKit Agents

  • Based on WebRTC
  • Supports voice, video, text
  • Self-hosting options

Traditional Orchestration

  • LangChain — great for docs, weak at streaming
  • LlamaIndex — RAG-focused, not optimized for voice
  • Custom builds — total control, but high overhead

Why I Recommend Pipecat

Voice-Centric Features

  • Streaming-first, frame-based pipeline (TTS can start before text is done)
  • Smart Turn Detection v2 (intonation-aware)
  • Built-in interruption handling
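
To make “frame-based pipeline” concrete, here’s a rough sketch of how the pieces compose. The service objects are stand-ins for whichever STT/LLM/TTS integrations you pick, and exact class names and import paths vary between Pipecat versions, so treat this as the shape of the code rather than something to copy-paste:

```python
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask


async def run_agent(transport, stt, llm, tts):
    # transport, stt, llm, and tts are pre-built Pipecat services (e.g. a WebRTC
    # transport, Whisper STT, an OpenAI-compatible LLM service pointed at Llama 3.3,
    # and a Kokoro TTS service). Frames flow through the list from top to bottom.
    pipeline = Pipeline([
        transport.input(),   # audio frames from the user (VAD runs here)
        stt,                 # audio frames -> text frames
        llm,                 # text frames -> streamed response text frames
        tts,                 # response text -> audio frames, started mid-sentence
        transport.output(),  # audio frames back to the user
    ])

    runner = PipelineRunner()
    await runner.run(PipelineTask(pipeline))


# asyncio.run(run_agent(transport, stt, llm, tts))  # wire up real services first
```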

Production Ready

  • Sub-500ms latency achievable
  • Efficient for long-running agents
  • Excellent docs + examples
  • Strong, growing community

Real-World Performance

  • ~500ms voice-to-voice latency in production
  • Works with Twilio + phone systems
  • Supports multi-agent orchestration
  • Scales to thousands of concurrent users

| Feature | Pipecat | Vocode | LiveKit | LangChain |
| --- | --- | --- | --- | --- |
| Voice-First Design | ✅ | ✅ | ✅ | ⚠️ |
| Real-Time Streaming | ✅ | ✅ | ✅ | ⚠️ |
| Vendor Neutral | ✅ | ✅ | ⚠️ | ✅ |
| Turn Detection | ✅ Smart V2 | ⚠️ Basic | | |
| Community Activity | ✅ High | ⚠️ Moderate | ✅ High | ✅ High |
| Learning Curve | ⚠️ Moderate | ⚠️ Moderate | ❌ Steep | ✅ Easy |

What’s Next

In this first part, we’ve covered the core tech stack and models needed to build a real-time voice agent.

In the next part of the series, we’ll dive into integration with Pipecat, explore our voice architecture, and walk through deployment strategies. Later, we’ll show how to enhance your agent with RAG (Retrieval-Augmented Generation), memory features, and other advanced capabilities to make your voice assistant truly intelligent.

Stay tuned: the next guide will turn all these building blocks into a working, real-time voice agent you can actually deploy.

I’ve created a GitHub repository VoiceAgentGuide for this series, where we can store our notes and related resources. Don’t forget to check it out and share your feedback. Feel free to contribute or add missing content by submitting a pull request (PR).

Resources