Open-Source Text-to-Speech (TTS)
Full-Featured TTS Libraries
- Glow-TTS - Fast, flow-based TTS
- Tacotron 2 - Sequence-to-sequence TTS
- Parler TTS - HuggingFace controllable TTS
- Fish Speech - Fast streaming TTS
- Kokoro TTS - Efficient high-quality TTS
- KittenTTS - Lightweight TTS
- Orpheus TTS - Llama-based TTS
- Piper TTS - Fast, local TTS optimized for Raspberry Pi
- Amphion - Audio generation toolkit
- Nari Labs TTS - Open-source neural TTS
Vocoders
- HiFi-GAN - GAN-based vocoder (widely used)
- BigVGAN - Extended HiFi-GAN
- UnivNet - Universal vocoder
- Firefly-GAN - Used in Fish Speech
Audio Codecs
- EnCodec - Meta’s neural audio codec
- SoundStream - Google’s audio codec
- Mimi - Kyutai’s efficient audio codec
- SNAC - Hierarchical audio codec
Speech-to-Text (STT) & Speech Recognition
ASR Models
- Whisper - OpenAI’s robust speech recognition
- Canary - NVIDIA’s ASR/translation
- Wav2Vec 2.0 - Meta’s self-supervised model
- WavLM - Microsoft’s speech model
- SpeechBrain - PyTorch toolkit for speech
- Kaldi - Traditional ASR framework
Voice Activity Detection (VAD)
- Silero VAD - Fast, lightweight VAD
- RNNoise - Noise suppression + VAD
- DeepFilterNet - Advanced denoising
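To make concrete what a VAD returns, here is a minimal energy-threshold sketch in plain Python. It is not how Silero VAD or RNNoise work (both rely on trained neural networks); it only illustrates the frame-by-frame speech/non-speech decision. The function names, the 160-sample frame length, and the 0.1 threshold are assumptions chosen for illustration.

```python
import math

def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's RMS energy."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def energy_vad(samples, frame_len=160, threshold=0.1):
    """Return one boolean per frame: True where RMS energy exceeds the threshold.

    frame_len=160 is 10 ms at 16 kHz; both defaults are illustrative assumptions.
    """
    return [r > threshold for r in frame_rms(samples, frame_len)]

# Synthetic test signal: two frames of silence, then two frames of a 440 Hz tone.
sr = 16000
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(320)]
flags = energy_vad(silence + tone)
```

A fixed energy threshold breaks down quickly in noisy audio, which is the main reason to reach for one of the model-based tools listed above.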
Language Models for Speech
Voice Agent Frameworks & Platforms
Complete Voice Agent Solutions
- Bolna AI - India-focused voice agents
- Moshi - Kyutai’s full-duplex voice model
- Inworld AI - Character AI for voice
- Cartesia AI - Ultra-low-latency voice (uses state space models)
- ElevenLabs - Commercial TTS + voice AI
- Resemble.ai Chatterbox - Advanced TTS
- Neuphonic - Real-time TTS synthesis
- Gradium - Voice synthesis platform
- Trillet - Voice generation platform
- Sesame - Voice assistant platform
- Fluents - Language learning with AI voice
- Gnani.ai - Indian language voice AI
- AI Coustics - AI-based voice enhancement
Real-Time Voice Processing
- Replicate - API-based model serving
- Fal AI - Fast inference API
- Together AI - LLM inference platform
- RunPod - GPU cloud for ML
- Modal Labs - Serverless GPU computing
- Cerebrium - Inference optimization platform
- Lightning AI - Model serving and deployment
- Koyeb - Serverless platform
- Anyscale - Ray-based distributed computing
Development Tools & Libraries
Audio Processing
- Librosa - Audio analysis library (Python)
- SoundFile - Audio file I/O
- PyAudio - Audio input/output
- TorchAudio - PyTorch audio processing
- julius - Audio signal processing
- Torch Audiotransforms - Audio transforms
ML Frameworks
- PyTorch - Primary framework for voice models
- TensorFlow - Alternative framework
- HuggingFace Transformers - Pre-trained models
- JAX - High-performance ML
Inference Optimization
- vLLM - Fast LLM serving
- MLC-LLM - Optimized LLM serving
- TensorRT - NVIDIA inference optimizer
- ONNX - Model format for optimization
- LiteRT - Google’s edge inference
- Ollama - Easy local model running
- LM Studio - GUI for local models
Model Compression
- GPTQ - Post-training weight quantization method
- AWQ - Activation-aware weight quantization
- LLMLingua - Microsoft prompt compression
- NeuralMagic - Pruned model zoo
- BitsAndBytes - 8-bit and 4-bit quantization
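As a rough picture of what weight quantization does, the sketch below applies plain round-to-nearest absmax quantization to a short list of weights. This is a deliberately simpler technique than GPTQ or AWQ, which choose quantized values using calibration data; the function names and sample weights are assumptions made up for the example.

```python
def quantize_absmax(weights, bits=8):
    """Round-to-nearest symmetric quantization with a single absmax scale.

    Returns (integer codes, scale). Not GPTQ or AWQ: no calibration data is used.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the symmetric range [-qmax, qmax] after rounding.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.64, 0.0]  # hypothetical weights
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
# Worst-case reconstruction error is about half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Each restored weight differs from the original by at most about `scale / 2`, so a single large outlier inflates the error for every other weight. Reducing that effect is part of what the more elaborate methods above aim for.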
Deployment & Infrastructure
GPU Cloud Providers
- Lambda Labs - GPU cloud
- Paperspace - Gradient ML platform
- Civo - Kubernetes cloud with GPU
Evaluation & Testing
Voice Agent Testing
- aiewf-eval - Multi-turn voice evaluation
- Coval - Voice agent testing framework
- Roark - Voice quality assessment
Audio Quality Assessment
- pesq-python - PESQ scoring
- SEWA - Emotion recognition
- SpeechBrain - Speaker verification
Learning Resources
Audio Processing Fundamentals
- HuggingFace Audio Course - Complete audio course
- LearnOpenCV - Speech to Speech - Speech-to-speech tutorial
- LearnOpenCV - Automatic Speech Recognition - ASR deep dive
Technical Blogs & Papers
- AssemblyAI Blog - Residual Vector Quantization - RVQ explained
- Speech.Zone - Speech synthesis course
- Modal LLM Almanac - LLM deployment guide
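Residual vector quantization, the topic of the AssemblyAI post listed above, can be sketched in a few lines: each stage picks the nearest codeword for whatever the previous stages left unexplained, and decoding sums the chosen codewords. The two tiny hand-written codebooks below are assumptions for illustration; real codecs learn their codebooks from data and use far more entries.

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the codeword closest to vec (squared Euclidean)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected codeword from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical codebooks: a coarse first stage and a finer second stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]
vec = [1.2, 0.9]
codes = rvq_encode(vec, codebooks)
approx = rvq_decode(codes, codebooks)
```

Only the list of indices needs to be stored or transmitted, one small integer per stage, and adding stages refines the approximation.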
Community Resources
- Speech Synthesis GitHub Curated List - TTS projects
- Awesome Speech Processing - Speech ML resources
- Awesome Audio - Audio processing resources
Research Papers & Articles
- Google Research - Speech to Retrieval (S2R) - Voice search innovation
- Model Memorization in Machine Learning - Privacy considerations
- Groq LPU Design - Alternative inference hardware
Streaming & Real-Time Development
- Gabber Dev - Real-time AI audio apps
- Speech.Zone - Real-time TTS - Real-time voice synthesis
- Way with Words - Voice recording platform
Model Collections & Zoos
HuggingFace Hub
- Text-to-Speech Models - TTS model collection
- Automatic Speech Recognition - ASR models
- Speech Classification - Audio classification
Pre-trained Model Collections
- NVIDIA NeMo - NVIDIA’s speech models
- OpenAI Whisper Models - Whisper variants
- Meta PyTorch Audio - PyTorch audio models
Specialized Tools
Indian Language Support
- Veena - Hindi and English voice model
- Gnani.ai - Indian languages speech AI
- Bolna AI - Voice agents for India
Performance Monitoring
- HuggingFace Model Eval - Model evaluation spaces
- Memorizz - Model memorization testing
- MLflow - ML experiment tracking
Additional Resources
- Triton Inference Server - Multi-framework inference
- ZipVoice - Voice synthesis
- RealtimeTTS - Real-time TTS library