Open-Source Text-to-Speech (TTS)

Vocoders

Audio Codecs

  • EnCodec - Meta’s neural audio codec
  • SoundStream - Google’s audio codec
  • Mimi - Kyutai’s efficient audio codec
  • SNAC - Multi-scale (hierarchical) neural audio codec

Speech-to-Text (STT) & Speech Recognition

ASR Models

  • Whisper - OpenAI’s robust speech recognition model
  • Canary - NVIDIA’s multilingual ASR and speech translation model
  • Wav2Vec 2.0 - Meta’s self-supervised speech representation model
  • WavLM - Microsoft’s self-supervised speech model
  • SpeechBrain - PyTorch toolkit for speech
  • Kaldi - Traditional ASR toolkit

Voice Activity Detection (VAD)

Language Models for Speech

Voice Agent Frameworks & Platforms

Complete Voice Agent Solutions

Real-Time Voice Processing

Development Tools & Libraries

Audio Processing

ML Frameworks

Inference Optimization

  • vLLM - Fast LLM serving
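  • MLC-LLM - Compiler-based LLM deployment across devices
  • TensorRT - NVIDIA inference optimizer
  • ONNX - Open model interchange format for optimized inference
  • LiteRT - Google’s on-device inference runtime (formerly TensorFlow Lite)
  • Ollama - Simple tool for running models locally
  • LM Studio - Desktop GUI for running local models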

Model Compression

Deployment & Infrastructure

GPU Cloud Providers

Evaluation & Testing

Voice Agent Testing

  • aiewf-eval - Multi-turn voice evaluation
  • Coval - Voice agent testing framework
  • Roark - Voice quality assessment

Audio Quality Assessment

Learning Resources

Audio Processing Fundamentals

Technical Blogs & Papers

Community Resources

Research Papers & Articles

Streaming & Real-Time Development

Model Collections & Zoo

HuggingFace Hub

Pre-trained Model Collections

Specialized Tools

Indian Language Support

  • Veena - Hindi and English text-to-speech model
  • Gnani.ai - Speech AI for Indian languages
  • Bolna AI - Voice agents for India

Performance Monitoring

Additional Resources