Historical Evolution
WaveNet (2016) - The Breakthrough
Developed by Google DeepMind, WaveNet revolutionized TTS by directly modeling the raw waveform of the audio signal rather than predicting intermediate acoustic features.
Key innovations:
- Autoregressive generation: predicts one audio sample at a time
- Dilated causal convolutions for wide receptive fields
- Produced remarkably natural-sounding speech
- Set new standard for quality
Limitations:
- Very slow (one sample at a time)
- Computationally expensive
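The dilation schedule is what gives WaveNet its wide receptive field: doubling the dilation each layer grows the field exponentially with depth, not linearly. A minimal sketch of the arithmetic (kernel size 2 and the 1–512 doubling schedule repeated in blocks follow the configuration described in the WaveNet paper; the helper itself is just illustrative):

```python
# Sketch: receptive field of stacked dilated causal convolutions.
# Each layer with kernel size k and dilation d adds (k - 1) * d past
# samples to the receptive field.

def receptive_field(kernel_size: int, dilations: list) -> int:
    """Number of samples a stack of dilated causal convs can see."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# WaveNet-style schedule: dilation doubles each layer, repeated in blocks.
dilations = [2 ** i for i in range(10)] * 3   # 1, 2, 4, ..., 512, three times
print(receptive_field(2, dilations))          # -> 3070 samples
```

Thirty layers of kernel-size-2 convolutions thus cover about 3070 past samples, versus only 31 without dilation, which is why dilation is essential at raw-audio sample rates.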
Tacotron (2017) - End-to-End Learning
Tacotron introduced end-to-end TTS without requiring complex handcrafted features:
- Sequence-to-sequence architecture with attention
- Encoder-Decoder model
- Pipeline: text → mel-spectrogram → vocoder → audio
Benefits:
- Faster than WaveNet
- Learned attention mechanisms
- Natural prosody (pitch, timing)
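The attention step at the heart of a Tacotron-style decoder can be sketched in a few lines: at each decoder step, alignment weights over the encoder outputs are computed and their weighted sum becomes the context vector. The dot-product scoring and shapes below are illustrative only (the original model uses a learned content-based attention):

```python
import numpy as np

# Toy sketch of seq2seq attention in a Tacotron-style decoder.

def attention_context(query, encoder_states):
    """query: (d,), encoder_states: (T, d) -> context (d,), weights (T,)."""
    scores = encoder_states @ query          # one score per text position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over input positions
    context = weights @ encoder_states       # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))   # 6 encoded text positions, feature dim 8
q = rng.normal(size=8)          # current decoder state
ctx, w = attention_context(q, enc)
print(w.round(3), ctx.shape)
```

The learned weights are what let the model discover text-to-audio alignment on its own instead of relying on handcrafted alignment features.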
Deep Voice (2017) - Production-Ready TTS
Deep Voice was constructed entirely from deep neural networks:
- Simplified traditional TTS pipelines
- Components:
- Grapheme-to-phoneme model - Converts text to phonetic representation
- Segmentation model - Locates phoneme boundaries to annotate training data
- Phoneme duration prediction model - Predicts how long each sound lasts
- Fundamental frequency (F0) prediction model - Pitch prediction
- Audio synthesis model - WaveNet variant for actual synthesis
Achievements:
- Real-time inference
- Significant speedups over WaveNet
- Production-quality audio
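The hand-off between the duration model and the synthesis model can be sketched as a simple expansion: each phoneme is repeated for as many frames as its predicted duration covers. The durations below are made up for illustration, not model outputs:

```python
# Sketch: expanding per-phoneme durations into a frame-level sequence,
# the interface between a duration predictor and a synthesis model.

def expand_phonemes(phonemes, durations_ms, frame_ms=10):
    """Repeat each phoneme for its predicted duration, in frames."""
    frames = []
    for ph, dur in zip(phonemes, durations_ms):
        frames.extend([ph] * round(dur / frame_ms))
    return frames

phonemes = ["HH", "EH", "L", "OW"]   # "hello"
durations = [60, 90, 50, 120]        # hypothetical predicted durations (ms)
frames = expand_phonemes(phonemes, durations)
print(len(frames))                   # -> 32 frames at 10 ms each
```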
KOKORO - Latent Diffusion TTS
KOKORO is a cutting-edge TTS system using latent space manipulation and adversarial training.
Pipeline
1. Text to Latent Representation
- Input: “Hello, how are you?”
- Maps text to latent space capturing speech style and semantics
- Includes phonetic and linguistic characteristics
2. Style Manipulation (Latent Diffusion)
- Adjusts latent variables for speech style
- Controls: tone, pitch, speed, emotion
- Example: cheerful vs. formal tone
3. Adversarial Training (Discriminator Feedback)
- Uses WavLM (large speech model) as discriminator
- Refines generated speech for naturalness
- Improves quality iteratively
4. Direct Waveform Generation
- Uses ISTFTNet to generate audio waveform directly
- Skips intermediate representations (spectrogram)
- Produces high-quality output
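The ISTFTNet idea in the final step is that the network predicts per-frame magnitude and phase, and a plain inverse STFT with overlap-add turns those into samples, so no neural upsampling to the waveform is needed. In the sketch below the "predicted" magnitude and phase simply come from analyzing a test tone, purely to demonstrate the reconstruction path:

```python
import numpy as np

# Sketch: waveform reconstruction from per-frame magnitude + phase via
# inverse STFT with weighted overlap-add (the ISTFTNet-style final stage).

def istft(mag, phase, hop, win):
    spec = mag * np.exp(1j * phase)                  # complex spectrum per frame
    frames = np.fft.irfft(spec, n=len(win), axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + len(win))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + len(win)] += frame     # overlap-add
        norm[i * hop:i * hop + len(win)] += win ** 2
    return out / np.maximum(norm, 1e-8)              # window-energy normalization

n_fft, hop = 256, 64
win = np.hanning(n_fft)
t = np.arange(2048)
x = np.sin(2 * np.pi * 440 * t / 24000)              # 440 Hz tone at 24 kHz
frames = np.stack([x[i:i + n_fft] * win
                   for i in range(0, len(x) - n_fft, hop)])
spec = np.fft.rfft(frames, axis=1)
y = istft(np.abs(spec), np.angle(spec), hop, win)    # near-exact in the interior
```

Because the inverse STFT is a fixed, cheap operation, the model only has to predict spectra, which is part of why this design streams quickly.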
Key Features
- Streaming inference: 83 tokens per second
- CNN-based tokenizer for fast tokenization
- Efficient with good quality
Fish Speech - Slow/Fast Transformer Architecture
Fish Speech uses a two-transformer approach for text-to-speech.
Pipeline
1. Input Text
- Text is tokenized into subword or word-level tokens
2. Slow Transformer
- Text Embedding: Converts tokens to embeddings
- Contextualization: Passes through transformer layers
- Output: Abstract linguistic features (meaning-focused)
3. Fast Transformer
- Feature Fusion: Combines Slow Transformer output with acoustic features
- Acoustic Refinement: Adjusts output for acoustic details
- Output: Refined speech features
4. Grouped Finite Scalar Quantization (GFSQ)
- Feature Grouping: Groups features into categories
- Scalar Quantization: Converts to efficient form
- Codebook Indexing: Maps to codebook indices
5. Firefly-GAN (Vocoder)
- Convolution Blocks: Processes quantized features
- Output: Audio waveform reconstruction
6. Generated Audio
- Final speech with natural intonation and rhythm
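The finite scalar quantization at the core of the GFSQ step can be sketched directly: each feature dimension is squashed to a bounded range and rounded to one of a few levels, and the per-dimension levels combine into a single codebook index with no learned codebook. The level counts below are tiny and illustrative, not Fish Speech's actual configuration; grouping (the "G" in GFSQ) just applies this independently to feature sub-groups:

```python
import numpy as np

# Sketch: finite scalar quantization (FSQ) with a mixed-radix index.

def fsq_quantize(z, levels):
    """z: (d,) raw features; levels: (d,) odd level counts per dimension."""
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half          # squash into [-half, half]
    codes = np.round(bounded) + half     # integer code in [0, levels - 1]
    return codes.astype(int)

def codes_to_index(codes, levels):
    """Combine per-dimension codes into one codebook index."""
    index, base = 0, 1
    for c, L in zip(codes, levels):
        index += int(c) * base
        base *= int(L)
    return index

levels = np.array([5, 5, 3])             # implicit codebook: 5 * 5 * 3 = 75 entries
z = np.array([0.9, -2.0, 0.1])
codes = fsq_quantize(z, levels)
print(codes, codes_to_index(codes, levels))   # -> [3 0 1] 28
```

Since the "codebook" is just the grid of rounded levels, there is nothing to learn or collapse, which is the efficiency argument for this style of quantization.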
Key Features
- Two-stage processing (semantic → acoustic)
- Efficient quantization
- High-quality synthesis
Chatterbox TTS - Multi-Stage Pipeline
Chatterbox is an advanced TTS system with specialized components.
Pipeline
1. Text Processing: Tokenization
- EnTokenizer: Breaks text into tokens
- Example: “Hello, how are you today?” → [“Hello”, “how”, “are”, “you”, “today”, “?”]
2. Core Sequence Modeling
- T3 Model: Fine-tuned Llama 3 transformer
- Processes tokens through neural network
- Understands structure, meaning, and intent
- Influences context, tone, and emotion
3. Audio Processing: Tokenization
- S3Tokenizer: Converts 16kHz reference audio into tokens
- Breaks audio into features: pitch, rhythm, loudness
- Creates “language” for audio processing
4. Voice Encoding: Speaker Identity
- Voice Encoder: Extracts speaker’s unique voice embedding
- Mel Spectrogram: Analyzes audio frequency components
- Creates “signature” for specific speaker
5. Conditioning Systems
- T3 Conditioning: Guides text tokens → speech tokens
- S3 Conditioning: Refines using reference speech and emotions
- Ensures emotional tone is matched
6. Speech Generation
- S3 Conditional Flow Matching: Generates high-quality acoustics
- S3Token2Mel Converter: Converts tokens to mel spectrograms
- S3Token2Wav Vocoder: Uses HiFiGAN and ConvRNN architecture
- Output: High-fidelity speech waveform
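The mel-spectrogram analysis in the voice-encoding step pools a linear-frequency spectrum through triangular, mel-spaced filters. The sketch below builds such a filterbank from scratch; the parameter values are conventional choices for 16 kHz speech, not Chatterbox's actual configuration:

```python
import numpy as np

# Sketch: triangular mel filterbank mapping FFT bins to mel bands.

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Filters of shape (n_mels, n_fft // 2 + 1), triangular in frequency."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank(n_mels=80, n_fft=400, sr=16000)   # 16 kHz reference audio
print(fb.shape)   # -> (80, 201)
```

Multiplying a power spectrogram by this matrix yields the mel spectrogram from which the speaker "signature" embedding is derived.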
Key Features
- Multi-stage processing
- Flow-based generative model
- Advanced prosody modeling
- High-fidelity output
Orpheus TTS - Llama-Based Audio Generation
Orpheus adapts Llama-3B transformer for speech generation using a two-stage process.
Pipeline
Stage 1: Text-to-Token Generation
- Uses Llama-3B model adapted for speech
- Processes text through transformer layers
- Generates discrete audio tokens
- Employs causal attention mechanisms
Stage 2: Audio Synthesis
- Uses SNAC (Multi-Scale Neural Audio Codec)
- Decodes audio tokens into waveform
- SNAC uses hierarchical audio tokenization
- Multiple levels of tokens for accurate reconstruction
- Output: 24kHz audio
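The hierarchical tokenization can be summarized as a token budget: coarse levels run at a low frame rate and finer levels run faster, so most tokens are spent on fine acoustic detail. The base rate and level count below are illustrative, not SNAC's published configuration:

```python
# Sketch: token budget in a multi-scale (SNAC-style) codec, where each
# finer level doubles the frame rate of the one below it.

def tokens_per_second(base_rate_hz, n_levels):
    """One codebook per level; level k runs at base_rate * 2**k."""
    rates = [base_rate_hz * 2 ** k for k in range(n_levels)]
    return rates, sum(rates)

rates, total = tokens_per_second(base_rate_hz=12, n_levels=3)
print(rates, total)    # -> [12, 24, 48] 84
```

The language model only has to predict this modest token stream; the codec decoder turns it back into 24 kHz audio.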
Key Features
- Based on large language model (Llama)
- Hierarchical tokenization (SNAC)
- 24kHz audio quality
- Efficient token-to-audio conversion
Higgs Audio - Unified Audio Tokenizer
Higgs Audio v2 introduces a unified tokenizer operating at just 25 frames per second.
Key Components
1. AudioVerse Dataset
- Over 10 million hours of audio
- Languages: English, Chinese (Mandarin), Korean, German, Spanish
- Includes: Speech, music, and sound events
- Automated annotation using multiple ASR and classification models
2. Unified Audio Tokenizer
- Operates at 25 frames per second (very efficient)
- Matches the quality of tokenizers running at twice the bitrate
- 24 kHz high-fidelity across speech, music, and sound events
- Unified semantic and acoustic capture
- 2.0 kbps bitrate - significantly more efficient
- Non-diffusion encoder/decoder for fast batch inference
3. DualFFN Architecture
- Dual Feed-Forward Network - audio-specific expert adapter
- Preserves 91% of original LLM training speed
- Minimal computational overhead
- Significantly improves WER and speaker similarity
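The tokenizer numbers above are worth sanity-checking: at 25 frames per second and roughly 2.0 kbps, each frame carries about 80 bits. One way to spend that budget, shown below as an assumed split rather than the actual design, is 8 codebooks of 1024 entries (10 bits) each:

```python
import math

# Back-of-the-envelope check of the Higgs Audio tokenizer figures.
frame_rate = 25           # frames per second
bitrate = 2000            # bits per second (2.0 kbps)
bits_per_frame = bitrate / frame_rate
print(bits_per_frame)     # -> 80.0

# Hypothetical split of the 80-bit budget (not the published design):
codebooks, entries = 8, 1024
assumed = codebooks * math.log2(entries)
print(assumed)            # -> 80.0 bits per frame under this split
```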
Key Features
- Unified semantic and acoustic tokenization
- Extremely efficient bitrate
- Works with speech, music, and sound events
- Fast batch inference
Parler TTS
- Uses GFPQ quantization
- Highly controllable voice characteristics
- Character-level voice descriptions
KittenTTS
- Fast, lightweight TTS
- Good for edge devices
- Open-source
Nari Labs TTS
- Open-source neural TTS
- Community-driven development
Piper TTS
- Fast, local neural TTS
- Optimized for Raspberry Pi 4
- Good for embedded systems
TTS System Components Comparison
| Component | Purpose | Examples |
|---|---|---|
| Text Processor | Normalizes text, handles special characters | Tokenizer, G2P converter |
| Acoustic Model | Maps text → acoustic features | Transformer, RNN |
| Duration Predictor | Predicts phoneme durations | Small transformer |
| Vocoder | Converts features → waveform | HiFi-GAN, WaveGlow |
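The four components in the table can be wired into a minimal pipeline skeleton. Every stage below is a stub returning dummy data; the point is the data flow (text → features + durations → waveform), not the models themselves:

```python
# Sketch: generic TTS pipeline skeleton with stub components.

def text_processor(text):
    return text.lower().replace("?", "").split()       # toy tokenizer

def acoustic_model(tokens):
    return [[0.0] * 80 for _ in tokens]                # one dummy frame per token

def duration_predictor(tokens):
    return [max(len(t), 1) for t in tokens]            # toy durations (frames)

def vocoder(features):
    return [0.0] * (len(features) * 256)               # 256 samples per frame

def synthesize(text):
    tokens = text_processor(text)
    durations = duration_predictor(tokens)
    frames = []
    for feat, d in zip(acoustic_model(tokens), durations):
        frames.extend([feat] * d)                      # expand to frame rate
    return vocoder(frames)

audio = synthesize("Hello, how are you?")
print(len(audio))
```

Real systems differ in which stages are merged or learned end-to-end, but the interfaces between them look much like this.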
Voice Cloning in TTS
Voice cloning allows TTS to generate speech in a specific person’s voice.
Approaches
1. Reference Audio Approach
- Provide example audio of target speaker
- Extract speaker embedding
- Condition model on embedding
2. Speaker Adaptation
- Fine-tune model on speaker’s audio
- Learn speaker-specific parameters
- More accurate but requires more data
3. Prompt Learning
- Use recent speech context as prompt
- Model learns to continue in same voice
- Efficient, requires minimal data
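The reference-audio approach can be sketched end to end: an encoder maps audio to a fixed-size speaker embedding, and cosine similarity between embeddings tells same-speaker clips apart. The "encoder" below is a stand-in that averages random frame features, purely to show the shapes and the comparison; a real system would use a trained speaker-verification model:

```python
import numpy as np

# Sketch: speaker embeddings and cosine-similarity comparison.

def fake_speaker_encoder(frames):
    """frames: (T, d) frame features -> (d,) L2-normalized embedding."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def cosine_similarity(a, b):
    return float(a @ b)            # both embeddings are unit-length

rng = np.random.default_rng(1)
speaker_bias = rng.normal(size=64)                    # stands in for a voice
clip_a = rng.normal(size=(100, 64)) + speaker_bias    # same speaker
clip_b = rng.normal(size=(120, 64)) + speaker_bias
clip_c = rng.normal(size=(110, 64)) - speaker_bias    # different speaker

ea, eb, ec = map(fake_speaker_encoder, (clip_a, clip_b, clip_c))
print(cosine_similarity(ea, eb) > cosine_similarity(ea, ec))   # True
```

Conditioning the TTS model on such an embedding is what steers generation toward the target voice without retraining.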
Tools
- SpeechBrain - Speaker verification and adaptation