Historical Evolution
WaveNet (2016) - The Breakthrough
Developed by Google DeepMind, WaveNet revolutionized TTS by directly modeling the raw waveform of the audio signal rather than predicting intermediate acoustic features.
Key innovations:
- Autoregressive generation: predicts one audio sample at a time
- Dilated causal convolutions for wide receptive fields
- Produced remarkably natural-sounding speech
- Set new standard for quality
Limitations:
- Very slow (one sample at a time)
- Computationally expensive
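The dilation schedule is what gives WaveNet its wide receptive field: doubling the dilation each layer grows the field exponentially with depth, not linearly. A minimal sketch of the arithmetic (kernel size 2 and the 1–512 doubling schedule repeated in blocks follow the configuration described in the WaveNet paper; the helper itself is just illustrative):

```python
# Sketch: receptive field of stacked dilated causal convolutions.
# Each layer with kernel size k and dilation d adds (k - 1) * d past
# samples to the receptive field.

def receptive_field(kernel_size: int, dilations: list) -> int:
    """Number of samples a stack of dilated causal convs can see."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# WaveNet-style schedule: dilation doubles each layer, repeated in blocks.
dilations = [2 ** i for i in range(10)] * 3   # 1, 2, 4, ..., 512, three times
print(receptive_field(2, dilations))          # -> 3070 samples
```

Thirty layers of kernel-size-2 convolutions thus cover about 3070 past samples, versus only 31 without dilation, which is why dilation is essential at raw-audio sample rates.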
Tacotron (2017) - End-to-End Learning
Tacotron introduced end-to-end TTS without requiring complex handcrafted features:
- Sequence-to-sequence architecture with attention
- Encoder-Decoder model
- Pipeline: text → mel-spectrogram → vocoder → audio
Benefits:
- Faster than WaveNet
- Learned attention mechanisms
- Natural prosody (pitch, timing)
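The attention step at the heart of a Tacotron-style decoder can be sketched in a few lines: at each decoder step, alignment weights over the encoder outputs are computed and their weighted sum becomes the context vector. The dot-product scoring and shapes below are illustrative only (the original model uses a learned content-based attention):

```python
import numpy as np

# Toy sketch of seq2seq attention in a Tacotron-style decoder.

def attention_context(query, encoder_states):
    """query: (d,), encoder_states: (T, d) -> context (d,), weights (T,)."""
    scores = encoder_states @ query          # one score per text position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over input positions
    context = weights @ encoder_states       # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))   # 6 encoded text positions, feature dim 8
q = rng.normal(size=8)          # current decoder state
ctx, w = attention_context(q, enc)
print(w.round(3), ctx.shape)
```

The learned weights are what let the model discover text-to-audio alignment on its own instead of relying on handcrafted alignment features.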
Deep Voice (2017) - Production-Ready TTS
Deep Voice was constructed entirely from deep neural networks:
- Simplified traditional TTS pipelines
- Components:
- Grapheme-to-phoneme model - Converts text to phonetic representation
- Segmentation model - Locates phoneme boundaries to annotate training data
- Phoneme duration prediction model - Predicts how long each sound lasts
- Fundamental frequency (F0) prediction model - Pitch prediction
- Audio synthesis model - WaveNet variant for actual synthesis
Achievements:
- Real-time inference
- Significant speedups over WaveNet
- Production-quality audio
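The hand-off between the duration model and the synthesis model can be sketched as a simple expansion: each phoneme is repeated for as many frames as its predicted duration covers. The durations below are made up for illustration, not model outputs:

```python
# Sketch: expanding per-phoneme durations into a frame-level sequence,
# the interface between a duration predictor and a synthesis model.

def expand_phonemes(phonemes, durations_ms, frame_ms=10):
    """Repeat each phoneme for its predicted duration, in frames."""
    frames = []
    for ph, dur in zip(phonemes, durations_ms):
        frames.extend([ph] * round(dur / frame_ms))
    return frames

phonemes = ["HH", "EH", "L", "OW"]   # "hello"
durations = [60, 90, 50, 120]        # hypothetical predicted durations (ms)
frames = expand_phonemes(phonemes, durations)
print(len(frames))                   # -> 32 frames at 10 ms each
```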
KOKORO - Latent Diffusion TTS
KOKORO is a cutting-edge TTS system using latent space manipulation and adversarial training.
Pipeline
1. Text to Latent Representation
- Input: “Hello, how are you?”
- Maps text to latent space capturing speech style and semantics
- Includes phonetic and linguistic characteristics
2. Style Manipulation (Latent Diffusion)
- Adjusts latent variables for speech style
- Controls: tone, pitch, speed, emotion
- Example: cheerful vs. formal tone
3. Adversarial Training (Discriminator Feedback)
- Uses WavLM (large speech model) as discriminator
- Refines generated speech for naturalness
- Improves quality iteratively
4. Direct Waveform Generation
- Uses ISTFTNet to generate audio waveform directly
- Skips intermediate representations (spectrogram)
- Produces high-quality output
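The ISTFTNet idea in the final step is that the network predicts per-frame magnitude and phase, and a plain inverse STFT with overlap-add turns those into samples, so no neural upsampling to the waveform is needed. In the sketch below the "predicted" magnitude and phase simply come from analyzing a test tone, purely to demonstrate the reconstruction path:

```python
import numpy as np

# Sketch: waveform reconstruction from per-frame magnitude + phase via
# inverse STFT with weighted overlap-add (the ISTFTNet-style final stage).

def istft(mag, phase, hop, win):
    spec = mag * np.exp(1j * phase)                  # complex spectrum per frame
    frames = np.fft.irfft(spec, n=len(win), axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + len(win))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + len(win)] += frame     # overlap-add
        norm[i * hop:i * hop + len(win)] += win ** 2
    return out / np.maximum(norm, 1e-8)              # window-energy normalization

n_fft, hop = 256, 64
win = np.hanning(n_fft)
t = np.arange(2048)
x = np.sin(2 * np.pi * 440 * t / 24000)              # 440 Hz tone at 24 kHz
frames = np.stack([x[i:i + n_fft] * win
                   for i in range(0, len(x) - n_fft, hop)])
spec = np.fft.rfft(frames, axis=1)
y = istft(np.abs(spec), np.angle(spec), hop, win)    # near-exact in the interior
```

Because the inverse STFT is a fixed, cheap operation, the model only has to predict spectra, which is part of why this design streams quickly.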
Key Features
- Streaming inference: 83 tokens per second
- CNN-based tokenizer for fast tokenization
- Efficient with good quality
Fish Speech - Slow/Fast Transformer Architecture
Fish Speech uses a two-transformer approach for text-to-speech.
Pipeline
1. Input Text
- Text is tokenized into subword or word-level tokens
2. Slow Transformer
- Text Embedding: Converts tokens to embeddings
- Contextualization: Passes through transformer layers
- Output: Abstract linguistic features (meaning-focused)
3. Fast Transformer
- Feature Fusion: Combines Slow Transformer output with acoustic features
- Acoustic Refinement: Adjusts output for acoustic details
- Output: Refined speech features
4. Grouped Finite Scalar Quantization (GFSQ)
- Feature Grouping: Groups features into categories
- Scalar Quantization: Converts to efficient form
- Codebook Indexing: Maps to codebook indices
5. Firefly-GAN (Vocoder)
- Convolution Blocks: Processes quantized features
- Output: Audio waveform reconstruction
6. Generated Audio
- Final speech with natural intonation and rhythm
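The finite scalar quantization at the core of the GFSQ step can be sketched directly: each feature dimension is squashed to a bounded range and rounded to one of a few levels, and the per-dimension levels combine into a single codebook index with no learned codebook. The level counts below are tiny and illustrative, not Fish Speech's actual configuration; grouping (the "G" in GFSQ) just applies this independently to feature sub-groups:

```python
import numpy as np

# Sketch: finite scalar quantization (FSQ) with a mixed-radix index.

def fsq_quantize(z, levels):
    """z: (d,) raw features; levels: (d,) odd level counts per dimension."""
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half          # squash into [-half, half]
    codes = np.round(bounded) + half     # integer code in [0, levels - 1]
    return codes.astype(int)

def codes_to_index(codes, levels):
    """Combine per-dimension codes into one codebook index."""
    index, base = 0, 1
    for c, L in zip(codes, levels):
        index += int(c) * base
        base *= int(L)
    return index

levels = np.array([5, 5, 3])             # implicit codebook: 5 * 5 * 3 = 75 entries
z = np.array([0.9, -2.0, 0.1])
codes = fsq_quantize(z, levels)
print(codes, codes_to_index(codes, levels))   # -> [3 0 1] 28
```

Since the "codebook" is just the grid of rounded levels, there is nothing to learn or collapse, which is the efficiency argument for this style of quantization.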
Key Features
- Two-stage processing (semantic → acoustic)
- Efficient quantization
- High-quality synthesis
Chatterbox TTS - Multi-Stage Pipeline
Chatterbox is an advanced TTS system with specialized components.
Pipeline
1. Text Processing: Tokenization
- EnTokenizer: Breaks text into tokens
- Example: “Hello, how are you today?” → [“Hello”, “how”, “are”, “you”, “today”, “?”]
2. Core Sequence Modeling
- T3 Model: Fine-tuned Llama 3 transformer
- Processes tokens through neural network
- Understands structure, meaning, and intent
- Influences context, tone, and emotion
3. Audio Processing: Tokenization
- S3Tokenizer: Converts 16kHz reference audio into tokens
- Breaks audio into features: pitch, rhythm, loudness
- Creates “language” for audio processing
4. Voice Encoding: Speaker Identity
- Voice Encoder: Extracts speaker’s unique voice embedding
- Mel Spectrogram: Analyzes audio frequency components
- Creates “signature” for specific speaker
5. Conditioning Systems
- T3 Conditioning: Guides text tokens → speech tokens
- S3 Conditioning: Refines using reference speech and emotions
- Ensures emotional tone is matched
6. Speech Generation
- S3 Conditional Flow Matching: Generates high-quality acoustics
- S3Token2Mel Converter: Converts tokens to mel spectrograms
- S3Token2Wav Vocoder: Uses HiFiGAN and ConvRNN architecture
- Output: High-fidelity speech waveform
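The mel-spectrogram analysis in the voice-encoding step pools a linear-frequency spectrum through triangular, mel-spaced filters. The sketch below builds such a filterbank from scratch; the parameter values are conventional choices for 16 kHz speech, not Chatterbox's actual configuration:

```python
import numpy as np

# Sketch: triangular mel filterbank mapping FFT bins to mel bands.

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Filters of shape (n_mels, n_fft // 2 + 1), triangular in frequency."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank(n_mels=80, n_fft=400, sr=16000)   # 16 kHz reference audio
print(fb.shape)   # -> (80, 201)
```

Multiplying a power spectrogram by this matrix yields the mel spectrogram from which the speaker "signature" embedding is derived.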
Key Features
- Multi-stage processing
- Flow-based generative model
- Advanced prosody modeling
- High-fidelity output
Orpheus TTS - Llama-Based Audio Generation
Orpheus adapts Llama-3B transformer for speech generation using a two-stage process.
Pipeline
Stage 1: Text-to-Token Generation
- Uses Llama-3B model adapted for speech
- Processes text through transformer layers
- Generates discrete audio tokens
- Employs causal attention mechanisms
Stage 2: Audio Synthesis
- Uses SNAC (Multi-Scale Neural Audio Codec)
- Decodes audio tokens into waveform
- SNAC uses hierarchical audio tokenization
- Multiple levels of tokens for accurate reconstruction
- Output: 24kHz audio
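The hierarchical tokenization can be summarized as a token budget: coarse levels run at a low frame rate and finer levels run faster, so most tokens are spent on fine acoustic detail. The base rate and level count below are illustrative, not SNAC's published configuration:

```python
# Sketch: token budget in a multi-scale (SNAC-style) codec, where each
# finer level doubles the frame rate of the one below it.

def tokens_per_second(base_rate_hz, n_levels):
    """One codebook per level; level k runs at base_rate * 2**k."""
    rates = [base_rate_hz * 2 ** k for k in range(n_levels)]
    return rates, sum(rates)

rates, total = tokens_per_second(base_rate_hz=12, n_levels=3)
print(rates, total)    # -> [12, 24, 48] 84
```

The language model only has to predict this modest token stream; the codec decoder turns it back into 24 kHz audio.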
Key Features
- Based on large language model (Llama)
- Hierarchical tokenization (SNAC)
- 24kHz audio quality
- Efficient token-to-audio conversion
Higgs Audio - Unified Audio Tokenizer
Higgs Audio v2 introduces a unified tokenizer operating at just 25 frames per second.
Key Components
1. AudioVerse Dataset
- Over 10 million hours of audio
- Languages: English, Chinese (Mandarin), Korean, German, Spanish
- Includes: Speech, music, and sound events
- Automated annotation using multiple ASR and classification models
2. Unified Audio Tokenizer
- Operates at 25 frames per second (very efficient)
- Matches the quality of tokenizers running at twice the bitrate
- 24 kHz high-fidelity across speech, music, and sound events
- Unified semantic and acoustic capture
- 2.0 kbps bitrate - significantly more efficient
- Non-diffusion encoder/decoder for fast batch inference
3. DualFFN Architecture
- Dual Feed-Forward Network - audio-specific expert adapter
- Preserves 91% of original LLM training speed
- Minimal computational overhead
- Significantly improves WER and speaker similarity
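The tokenizer numbers above are worth sanity-checking: at 25 frames per second and roughly 2.0 kbps, each frame carries about 80 bits. One way to spend that budget, shown below as an assumed split rather than the actual design, is 8 codebooks of 1024 entries (10 bits) each:

```python
import math

# Back-of-the-envelope check of the Higgs Audio tokenizer figures.
frame_rate = 25           # frames per second
bitrate = 2000            # bits per second (2.0 kbps)
bits_per_frame = bitrate / frame_rate
print(bits_per_frame)     # -> 80.0

# Hypothetical split of the 80-bit budget (not the published design):
codebooks, entries = 8, 1024
assumed = codebooks * math.log2(entries)
print(assumed)            # -> 80.0 bits per frame under this split
```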
Key Features
- Unified semantic and acoustic tokenization
- Extremely efficient bitrate
- Works with speech, music, and sound events
- Fast batch inference
Parler TTS
- Uses GFPQ quantization
- Highly controllable voice characteristics
- Character-level voice descriptions
KittenTTS
- Fast, lightweight TTS
- Good for edge devices
- Open-source
Nari Labs TTS
- Open-source neural TTS
- Community-driven development
Piper TTS
- Fast, local neural TTS
- Optimized for Raspberry Pi 4
- Good for embedded systems
TTS System Components Comparison
| Component | Purpose | Examples |
|---|---|---|
| Text Processor | Normalizes text, handles special characters | Tokenizer, G2P converter |
| Acoustic Model | Maps text → acoustic features | Transformer, RNN |
| Duration Predictor | Predicts phoneme durations | Small transformer |
| Vocoder | Converts features → waveform | HiFi-GAN, WaveGlow |
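The four components in the table can be wired into a minimal pipeline skeleton. Every stage below is a stub returning dummy data; the point is the data flow (text → features + durations → waveform), not the models themselves:

```python
# Sketch: generic TTS pipeline skeleton with stub components.

def text_processor(text):
    return text.lower().replace("?", "").split()       # toy tokenizer

def acoustic_model(tokens):
    return [[0.0] * 80 for _ in tokens]                # one dummy frame per token

def duration_predictor(tokens):
    return [max(len(t), 1) for t in tokens]            # toy durations (frames)

def vocoder(features):
    return [0.0] * (len(features) * 256)               # 256 samples per frame

def synthesize(text):
    tokens = text_processor(text)
    durations = duration_predictor(tokens)
    frames = []
    for feat, d in zip(acoustic_model(tokens), durations):
        frames.extend([feat] * d)                      # expand to frame rate
    return vocoder(frames)

audio = synthesize("Hello, how are you?")
print(len(audio))
```

Real systems differ in which stages are merged or learned end-to-end, but the interfaces between them look much like this.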
Voice Cloning in TTS
Voice cloning allows TTS to generate speech in a specific person’s voice.
Approaches
1. Reference Audio Approach
- Provide example audio of target speaker
- Extract speaker embedding
- Condition model on embedding
2. Speaker Adaptation
- Fine-tune model on speaker’s audio
- Learn speaker-specific parameters
- More accurate but requires more data
3. Prompt Learning
- Use recent speech context as prompt
- Model learns to continue in same voice
- Efficient, requires minimal data
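The reference-audio approach can be sketched end to end: an encoder maps audio to a fixed-size speaker embedding, and cosine similarity between embeddings tells same-speaker clips apart. The "encoder" below is a stand-in that averages random frame features, purely to show the shapes and the comparison; a real system would use a trained speaker-verification model:

```python
import numpy as np

# Sketch: speaker embeddings and cosine-similarity comparison.

def fake_speaker_encoder(frames):
    """frames: (T, d) frame features -> (d,) L2-normalized embedding."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def cosine_similarity(a, b):
    return float(a @ b)            # both embeddings are unit-length

rng = np.random.default_rng(1)
speaker_bias = rng.normal(size=64)                    # stands in for a voice
clip_a = rng.normal(size=(100, 64)) + speaker_bias    # same speaker
clip_b = rng.normal(size=(120, 64)) + speaker_bias
clip_c = rng.normal(size=(110, 64)) - speaker_bias    # different speaker

ea, eb, ec = map(fake_speaker_encoder, (clip_a, clip_b, clip_c))
print(cosine_similarity(ea, eb) > cosine_similarity(ea, ec))   # True
```

Conditioning the TTS model on such an embedding is what steers generation toward the target voice without retraining.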
Tools
- SpeechBrain - Speaker verification and adaptation