Voice LLM Resources - Tools, Libraries & References

Mar 01, 2026 · 4 min read

resources
tools
libraries
references

Open-Source Text-to-Speech (TTS)

Full-Featured TTS Libraries

Glow-TTS - Fast, flow-based TTS
Tacotron 2 - Sequence-to-sequence TTS
Parler TTS - HuggingFace controllable TTS
Fish Speech - Fast streaming TTS
Kokoro TTS - Efficient high-quality TTS
KittenTTS - Lightweight TTS
Orpheus TTS - Llama-based TTS
Piper TTS - Fast, local TTS optimized for Raspberry Pi
Amphion - Audio generation toolkit
Nari Labs TTS - Open-source neural TTS

Vocoders

HiFi-GAN - GAN-based vocoder (widely used)
BigVGAN - Extended HiFi-GAN
UnivNet - Universal vocoder
Firefly-GAN - Used in Fish Speech

Audio Codecs

EnCodec - Meta’s neural audio codec
SoundStream - Google’s audio codec
Mimi - Kyutai’s efficient audio codec
SNAC - Hierarchical audio codec

Speech-to-Text (STT) & Speech Recognition

ASR Models

Whisper - OpenAI’s robust speech recognition
Canary - NVIDIA’s ASR/translation
Wav2Vec 2.0 - Meta’s self-supervised model
WavLM - Microsoft’s speech model
SpeechBrain - PyTorch toolkit for speech
Kaldi - Traditional ASR framework

Voice Activity Detection (VAD)

Silero VAD - Fast, lightweight VAD
RNNoise - Noise suppression + VAD
DeepFilterNet - Advanced denoising

Language Models for Speech

Ultravox - Voice input LLM
Veena - Hindi/English voice model

Voice Agent Frameworks & Platforms

Complete Voice Agent Solutions

Bolna AI - India-focused voice agents
Moshi - Kyutai’s full-duplex voice model
Inworld AI - Character AI for voice
Cartesia AI - Ultra-low latency voice (uses State Space Models)
ElevenLabs - Commercial TTS + voice AI
Resemble.ai Chatterbox - Advanced TTS
Neuphonic - Real-time TTS synthesis
Gradium - Voice synthesis platform
Trillet - Voice generation platform
Sesame - Voice assistant platform
Fluents - Language learning with AI voice
Gnani.ai - Indian language voice AI
AI Coustics - Voice enhancement and AI
Rime

Real-Time Voice Processing

Replicate - API-based model serving
Fal AI - Fast inference API
Together AI - LLM inference platform
RunPod - GPU cloud for ML
Modal Labs - Serverless GPU computing
Cerebrium - Inference optimization platform
Lightning AI - Model serving and deployment
Koyeb - Serverless platform
Anyscale - Ray-based distributed computing

Development Tools & Libraries

Audio Processing

Librosa - Audio analysis library (Python)
SoundFile - Audio file I/O
PyAudio - Audio input/output
TorchAudio - PyTorch audio processing
julius - Audio signal processing
Torch Audiotransforms - Audio transforms

ML Frameworks

PyTorch - Primary framework for voice models
TensorFlow - Alternative framework
HuggingFace Transformers - Pre-trained models
JAX - High-performance ML

Inference Optimization

vLLM - Fast LLM serving
MLC-LLM - Optimized LLM serving
TensorRT - NVIDIA inference optimizer
ONNX - Model format for optimization
LiteRT - Google’s edge inference
Ollama - Easy local model running
LM Studio - GUI for local models

Model Compression

GPTQ - Post-training weight quantization for LLMs
AWQ - Activation-aware weight quantization
LLMLingua - Microsoft prompt compression
Neural Magic - Pruned model zoo
bitsandbytes - 8-bit and 4-bit quantization

Deployment & Infrastructure

GPU Cloud Providers

Lambda Labs - GPU cloud
Paperspace - Gradient ML platform
Civo - Kubernetes cloud with GPU

Evaluation & Testing

Voice Agent Testing

aiewf-eval - Multi-turn voice evaluation
Coval - Voice agent testing framework
Roark - Voice quality assessment

Audio Quality Assessment

pesq-python - PESQ scoring
SEWA - Emotion recognition
SpeechBrain - Speaker verification

Learning Resources

Audio Processing Fundamentals

HuggingFace Audio Course - Complete audio course
LearnOpenCV - Speech to Speech - Excellent tutorial
LearnOpenCV - Automatic Speech Recognition - ASR deep dive

Technical Blogs & Papers

AssemblyAI Blog - Residual Vector Quantization - RVQ explained
Speech Zone - Speech synthesis course
Modal LLM Almanac - LLM deployment guide
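
Residual vector quantization (RVQ), the topic of the AssemblyAI post listed above, is the idea behind codecs such as EnCodec and SoundStream: each stage quantizes what the previous stage left over. The snippet below is a minimal pure-Python sketch of that idea, not any codec's actual implementation. The function names (`nearest`, `rvq_encode`, `rvq_decode`) and the tiny hand-written codebooks are assumptions made up for illustration; real codecs learn their codebooks during training.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, v: Vector) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(v: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize v stage by stage; each stage encodes the previous residual."""
    residual = list(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a fine stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25], [0.25, -0.25]]

codes = rvq_encode([1.2, 0.9], [coarse, fine])
print(codes)                              # [1, 1]
print(rvq_decode(codes, [coarse, fine]))  # [1.25, 1.0]
```

Each extra stage adds one small integer per frame, which is how these codecs trade bitrate for quality: dropping the later stages still decodes, just with a coarser reconstruction.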

Community Resources

Speech Synthesis GitHub Curated List - TTS projects
Awesome Speech Processing - Speech ML resources
Awesome Audio - Audio processing resources

Research Papers & Articles

Google Research - Speech to Retrieval (S2R) - Voice search innovation
Model Memorization in Machine Learning - Privacy considerations
Groq LPU Design - Alternative inference hardware

Streaming & Real-Time Development

Gabber Dev - Real-time AI audio apps
Speech.Zone - Real-time TTS - Real-time voice synthesis
Way with Words - Voice recording platform

Model Collections & Zoo

HuggingFace Hub

Text-to-Speech Models - TTS model collection
Automatic Speech Recognition - ASR models
Speech Classification - Audio classification

Pre-trained Model Collections

NVIDIA NeMo - NVIDIA’s speech models
OpenAI Whisper Models - Whisper variants
Meta PyTorch Audio - PyTorch audio models

Specialized Tools

Indian Language Support

Veena - Hindi and English voice model
Gnani.ai - Indian languages speech AI
Bolna AI - Voice agents for India

Performance Monitoring

HuggingFace Model Eval - Model evaluation spaces
Memorizz - Model memorization testing
MLflow - ML experiment tracking

Additional Resources

Triton Inference Server - Multi-framework inference
ZipVoice - Voice synthesis
RealtimeTTS - Real-time TTS library

Built with ♥ by K.Boopathi © 2026