Historical Evolution

WaveNet (2016) - The Breakthrough

Developed by Google DeepMind, WaveNet revolutionized TTS by modeling the raw audio waveform directly, one sample at a time, instead of relying on earlier concatenative or parametric pipelines.

Key innovations:

  • Autoregressive generation: predicts one audio sample at a time
  • Dilated causal convolutions for wide receptive fields
  • Produced remarkably natural-sounding speech
  • Set new standard for quality

Limitations:

  • Very slow (one sample at a time)
  • Computationally expensive
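WaveNet's two core ideas can be sketched in a few lines of NumPy: a causal convolution only looks backward in time, and doubling the dilation each layer grows the receptive field exponentially. All names below are illustrative stand-ins, not WaveNet's actual code.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: output at time t sees only
    x[t], x[t - dilation], x[t - 2*dilation], ... (never the future)."""
    k = len(kernel)
    pad = dilation * (k - 1)                 # left-pad: no future leakage
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            y[t] += kernel[i] * xp[pad + t - i * dilation]
    return y

# Doubling the dilation each layer (1, 2, 4, ..., 512) gives a
# kernel-size-2 stack a receptive field of 1024 samples in 10 layers.
dilations = [2 ** i for i in range(10)]
receptive_field = 1 + sum(d for d in dilations)
print(receptive_field)  # 1024
```

Slow generation falls out of the same picture: sampling is a loop that runs this stack once per output sample, which is why later systems traded the autoregressive loop away.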

Tacotron (2017) - End-to-End Learning

Tacotron introduced end-to-end TTS without requiring complex handcrafted features:

  • Sequence-to-sequence architecture with attention
  • Encoder-Decoder model
  • Input: text → Output: mel-spectrogram → Vocoder → Audio

Benefits:

  • Faster than WaveNet
  • Learned attention mechanisms
  • Natural prosody (pitch, timing)
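The "learned attention" above can be made concrete with a toy dot-product attention step: the decoder scores every encoder output and mixes them into a context vector. Real Tacotron learns the encodings and uses a location-sensitive attention variant; this shows only the skeleton.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy attention over 4 encoder states (one per input character).
encoder_out = np.eye(4)                         # toy character encodings
decoder_state = np.array([0.0, 5.0, 0.0, 0.0])  # decoder "asks" for char 1
scores = encoder_out @ decoder_state            # similarity to each input
weights = softmax(scores)                       # alignment distribution
context = weights @ encoder_out                 # mixed input for the decoder
print(weights.argmax())  # 1
```

At each decoder step the alignment shifts forward across the text, which is what produces natural timing without handcrafted duration rules.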

Deep Voice (2017) - Production-Ready TTS

Deep Voice was constructed entirely from deep neural networks:

  • Simplified traditional TTS pipelines
  • Components:
    1. Grapheme-to-phoneme model - Converts text to a phonetic representation
    2. Segmentation model - Locates phoneme boundaries to annotate training data
    3. Phoneme duration prediction model - Predicts how long each phoneme lasts
    4. Fundamental frequency (F0) prediction model - Predicts pitch
    5. Audio synthesis model - A WaveNet variant that produces the final audio
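Wiring the components together, the inference path (segmentation is only used to annotate training data) composes like this; every function below is a dummy stand-in for one of Deep Voice's neural models:

```python
# Dummy stand-ins for Deep Voice's models; the point is only
# how the inference pipeline composes.
def grapheme_to_phoneme(text):
    lexicon = {"hi": ["HH", "AY"]}          # toy lexicon, not a real G2P
    return lexicon.get(text.lower(), [])

def predict_durations(phonemes):
    return [0.1] * len(phonemes)            # seconds per phoneme (dummy)

def predict_f0(phonemes):
    return [120.0] * len(phonemes)          # Hz per phoneme (dummy)

def synthesize(phonemes, durations, f0):
    # The real audio synthesis model is a WaveNet variant conditioned
    # on these features; here we just return the implied audio length.
    return sum(durations)

phonemes = grapheme_to_phoneme("hi")
length = synthesize(phonemes, predict_durations(phonemes), predict_f0(phonemes))
print(phonemes, length)
```

Explicit duration and F0 stages are what distinguish this design from end-to-end models like Tacotron, which learn those quantities implicitly through attention.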

Achievements:

  • Real-time inference
  • Significant speedups over WaveNet
  • Production-quality audio

KOKORO - Latent Diffusion TTS

KOKORO is a cutting-edge TTS system using latent space manipulation and adversarial training.

Pipeline

  1. Text to Latent Representation

    • Input: “Hello, how are you?”
    • Maps text to latent space capturing speech style and semantics
    • Includes phonetic and linguistic characteristics
  2. Style Manipulation (Latent Diffusion)

    • Adjusts latent variables for speech style
    • Controls: tone, pitch, speed, emotion
    • Example: cheerful vs. formal tone
  3. Adversarial Training (Discriminator Feedback)

    • Uses WavLM (large speech model) as discriminator
    • Refines generated speech for naturalness
    • Improves quality iteratively
  4. Direct Waveform Generation

    • Uses ISTFTNet to generate audio waveform directly
    • Skips intermediate representations (spectrogram)
    • Produces high-quality output
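Step 4 relies on the fact that once a network predicts STFT frames, the waveform follows from a cheap, deterministic inverse transform. A minimal overlap-add inverse STFT (non-overlapping rectangular window, purely for illustration; ISTFTNet's actual windowing differs):

```python
import numpy as np

def istft(frames, hop):
    """Minimal inverse STFT: inverse-FFT each frame and overlap-add.
    ISTFTNet predicts the frames with a network, then recovers the
    waveform with this cheap step instead of a full neural vocoder."""
    n_fft = (frames.shape[1] - 1) * 2
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, spec in enumerate(frames):
        out[i * hop : i * hop + n_fft] += np.fft.irfft(spec)
    return out

# Round trip: analysis frames of a sine wave come back as the original.
x = np.sin(2 * np.pi * 5 * np.arange(64) / 64)
frames = np.stack([np.fft.rfft(x[i:i + 16]) for i in range(0, 64, 16)])
y = istft(frames, hop=16)
print(np.allclose(x, y))  # True
```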

Key Features

  • Streaming inference: 83 tokens per second
  • CNN-based tokenizer for fast tokenization
  • Efficient with good quality

Fish Speech - Slow/Fast Transformer Architecture

Fish Speech uses a two-transformer approach for text-to-speech.

Pipeline

  1. Input Text

    • Text is tokenized into subword or word-level tokens
  2. Slow Transformer

    • Text Embedding: Converts tokens to embeddings
    • Contextualization: Passes through transformer layers
    • Output: Abstract linguistic features (meaning-focused)
  3. Fast Transformer

    • Feature Fusion: Combines Slow Transformer output with acoustic features
    • Acoustic Refinement: Adjusts output for acoustic details
    • Output: Refined speech features
  4. Grouped Finite Scalar Quantization (GFSQ)

    • Feature Grouping: Splits the feature vector into independent groups
    • Scalar Quantization: Rounds each value to a small, fixed set of levels
    • Codebook Indexing: Maps each quantized group to a codebook index
  5. Firefly-GAN (Vocoder)

    • Convolution Blocks: Processes quantized features
    • Output: Audio waveform reconstruction
  6. Generated Audio

    • Final speech with natural intonation and rhythm
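Step 4 can be sketched concretely. Finite scalar quantization snaps each dimension to a few fixed levels, and "grouped" means subvectors are quantized independently; the level counts and group sizes below are illustrative assumptions, not Fish Speech's actual configuration.

```python
import numpy as np

def fsq(z, levels):
    """Finite scalar quantization: clamp each dimension to [-1, 1],
    snap it to one of `levels[d]` evenly spaced values, and return the
    quantized vector plus its flat codebook index."""
    L = np.asarray(levels)
    z = np.clip(z, -1.0, 1.0)
    q = np.round((z + 1) / 2 * (L - 1))       # per-dim integer codes
    index = 0
    for code, n in zip(q.astype(int), L):     # mixed-radix flattening
        index = index * n + code
    return q / (L - 1) * 2 - 1, int(index)

# "Grouped": split the feature vector and quantize each group on its own.
z = np.array([0.2, -0.7, 0.9, 0.1])
groups = z.reshape(2, 2)                      # two groups of two dims
indices = [fsq(g, levels=[5, 5])[1] for g in groups]
print(indices)  # [11, 22]
```

Unlike vector quantization, there is no learned codebook to collapse: the "codebook" is implicit in the fixed level grid, which keeps training stable and lookup trivial.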

Key Features

  • Two-stage processing (semantic → acoustic)
  • Efficient quantization
  • High-quality synthesis

Chatterbox TTS - Multi-Stage Pipeline

Chatterbox is an advanced TTS system with specialized components.

Pipeline

1. Text Processing: Tokenization

  • EnTokenizer: Breaks text into tokens
  • Example: “Hello, how are you today?” → [“Hello”, “how”, “are”, “you”, “today”, “?”]

2. Core Sequence Modeling

  • T3 Model: Fine-tuned Llama 3 transformer
  • Processes tokens through neural network
  • Understands structure, meaning, and intent
  • Influences context, tone, and emotion

3. Audio Processing: Tokenization

  • S3Tokenizer: Converts 16kHz reference audio into tokens
  • Breaks audio into features: pitch, rhythm, loudness
  • Creates a token “language” for audio processing

4. Voice Encoding: Speaker Identity

  • Voice Encoder: Extracts speaker’s unique voice embedding
  • Mel Spectrogram: Analyzes audio frequency components
  • Creates a “signature” for the specific speaker

5. Conditioning Systems

  • T3 Conditioning: Guides text tokens → speech tokens
  • S3 Conditioning: Refines using reference speech and emotions
  • Ensures emotional tone is matched

6. Speech Generation

  • S3 Conditional Flow Matching: Generates high-quality acoustics
  • S3Token2Mel Converter: Converts tokens to mel spectrograms
  • S3Token2Wav Vocoder: Uses HiFiGAN and ConvRNN architecture
  • Output: High-fidelity speech waveform
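Step 4's speaker "signature" can be illustrated with a toy encoder: any function that maps a mel spectrogram to a unit vector such that clips of the same voice land near each other. Mean pooling stands in for Chatterbox's real trained Voice Encoder.

```python
import numpy as np

def voice_embedding(mel):
    """Stand-in encoder: collapse a mel spectrogram (frames x mel bins)
    into one L2-normalized speaker vector via mean pooling."""
    v = mel.mean(axis=0)
    return v / np.linalg.norm(v)

def similarity(a, b):
    return float(np.dot(a, b))        # cosine similarity of unit vectors

rng = np.random.default_rng(0)
ref_mel = rng.random((200, 80))                  # reference speaker clip
emb = voice_embedding(ref_mel)
same = voice_embedding(ref_mel[:100])            # same voice, shorter clip
# a spectrally different "speaker": energy tilted across mel bins
other = voice_embedding(rng.random((200, 80)) * np.linspace(0.1, 2.0, 80))
print(similarity(emb, same) > similarity(emb, other))  # True
```

Conditioning generation on such an embedding (step 5) is what lets the system match the reference speaker's timbre.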

Key Features

  • Multi-stage processing
  • Flow-based generative model
  • Advanced prosody modeling
  • High-fidelity output

Orpheus TTS - Llama-Based Audio Generation

Orpheus adapts Llama-3B transformer for speech generation using a two-stage process.

Pipeline

Stage 1: Text-to-Token Generation

  • Uses Llama-3B model adapted for speech
  • Processes text through transformer layers
  • Generates discrete audio tokens
  • Employs causal attention mechanisms

Stage 2: Audio Synthesis

  • Uses SNAC (Speech Neural Audio Codec)
  • Decodes audio tokens into waveform
  • SNAC uses hierarchical audio tokenization
  • Multiple levels of tokens for accurate reconstruction
  • Output: 24kHz audio
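The hierarchical idea behind SNAC can be sketched as multiple token streams at different rates that are upsampled to a common rate and summed at decode time. The codebooks here are random stand-ins, not SNAC's learned codec.

```python
import numpy as np

# Random stand-in codebooks: 3 levels, 16 entries each, mapping a token
# to a short waveform snippet.
rng = np.random.default_rng(0)
frame_len = 8
codebooks = [rng.standard_normal((16, frame_len)) for _ in range(3)]

def decode(token_streams):
    """Coarse streams run at lower rates; repeat-upsample them to the
    finest rate, look up each token's snippet, and sum the levels."""
    finest = len(token_streams[-1])
    out = np.zeros(finest * frame_len)
    for level, tokens in enumerate(token_streams):
        up = np.repeat(tokens, finest // len(tokens))
        for i, tok in enumerate(up):
            out[i * frame_len:(i + 1) * frame_len] += codebooks[level][tok]
    return out

audio = decode([[3], [1, 7], [2, 5, 0, 9]])  # 1x, 2x, 4x token rates
print(audio.shape)  # (32,)
```

Coarse levels capture slow structure cheaply while fine levels add detail, which is why hierarchical tokenization reconstructs accurately at a modest token rate.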

Key Features

  • Based on large language model (Llama)
  • Hierarchical tokenization (SNAC)
  • 24kHz audio quality
  • Efficient token-to-audio conversion

Higgs Audio - Unified Audio Tokenizer

Higgs Audio v2 introduces a unified tokenizer operating at just 25 frames per second.

Key Components

1. AudioVerse Dataset

  • Over 10 million hours of audio
  • Languages: English, Chinese (Mandarin), Korean, German, Spanish
  • Includes: Speech, music, and sound events
  • Automated annotation using multiple ASR and classification models

2. Unified Audio Tokenizer

  • Operates at 25 frames per second (very efficient)
  • Matches the quality of tokenizers that run at twice the bitrate
  • 24 kHz high-fidelity across speech, music, and sound events
  • Unified semantic and acoustic capture
  • 2.0 kbps bitrate - significantly more efficient
  • Non-diffusion encoder/decoder for fast batch inference

3. DualFFN Architecture

  • Dual Feed-Forward Network - audio-specific expert adapter
  • Preserves 91% of original LLM training speed
  • Minimal computational overhead
  • Significantly improves WER and speaker similarity
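The routing idea can be sketched as follows: an audio-specific feed-forward block sits alongside the LLM's original one, and each token takes the path matching its modality, so text behavior is left untouched. The shapes and routing rule here are illustrative assumptions, not Higgs Audio's actual implementation.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
W_text = rng.standard_normal((d, d))    # the LLM's original FFN (stand-in)
W_audio = rng.standard_normal((d, d))   # added audio-expert FFN (stand-in)

def dual_ffn(x, is_audio):
    """Route each token by modality. x: (tokens, d);
    is_audio: boolean mask selecting audio tokens."""
    out = np.empty_like(x)
    out[~is_audio] = x[~is_audio] @ W_text   # original path, unchanged
    out[is_audio] = x[is_audio] @ W_audio    # audio expert path
    return out

x = rng.standard_normal((3, d))
mask = np.array([False, True, False])
y = dual_ffn(x, mask)
print(np.allclose(y[0], x[0] @ W_text))  # True: text path unchanged
```

Because each token still passes through exactly one FFN, the extra expert adds parameters but little per-token compute, consistent with the small reported training-speed overhead.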

Key Features

  • Unified semantic and acoustic tokenization
  • Extremely efficient bitrate
  • Works with speech, music, and sound events
  • Fast batch inference

Parler TTS

  • Uses GFPQ quantization
  • Highly controllable voice characteristics
  • Character-level voice descriptions

KittenTTS

  • Fast, lightweight TTS
  • Good for edge devices
  • Open-source

Nari Labs TTS

  • Open-source neural TTS
  • Community-driven development

Piper TTS

  • Fast, local neural TTS
  • Optimized for Raspberry Pi 4
  • Good for embedded systems

TTS System Components Comparison

Component           Purpose                                      Examples
Text Processor      Normalizes text, handles special characters  Tokenizer, G2P converter
Acoustic Model      Maps text → acoustic features                Transformer, RNN
Duration Predictor  Predicts phoneme durations                   Small transformer
Vocoder             Converts features → waveform                 HiFi-GAN, WaveGlow

Voice Cloning in TTS

Voice cloning allows TTS to generate speech in a specific person’s voice.

Approaches

  1. Reference Audio Approach

    • Provide example audio of target speaker
    • Extract speaker embedding
    • Condition model on embedding
  2. Speaker Adaptation

    • Fine-tune model on speaker’s audio
    • Learn speaker-specific parameters
    • More accurate but requires more data
  3. Prompt Learning

    • Use recent speech context as prompt
    • Model learns to continue in same voice
    • Efficient, requires minimal data
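The reference-audio approach (approach 1) boils down to two operations, sketched here with stand-in components rather than any specific system's API: extract a speaker embedding, then concatenate it onto every frame of the text features that condition the acoustic model.

```python
import numpy as np

def speaker_embedding(audio):
    """Stand-in encoder: a fixed random projection maps a waveform to a
    stable unit vector (real systems train this encoder)."""
    rng = np.random.default_rng(42)      # fixed seed -> deterministic map
    proj = rng.standard_normal((len(audio), 8))
    e = audio @ proj
    return e / np.linalg.norm(e)

def condition(text_feats, spk_emb):
    """Condition the acoustic model by concatenating the speaker
    vector onto every text-frame feature."""
    tiled = np.tile(spk_emb, (len(text_feats), 1))
    return np.concatenate([text_feats, tiled], axis=1)

audio = np.sin(np.linspace(0, 20, 1000))     # stand-in reference clip
text_feats = np.ones((5, 16))                # 5 frames of text features
cond = condition(text_feats, speaker_embedding(audio))
print(cond.shape)  # (5, 24)
```

Speaker adaptation and prompt learning skip this explicit embedding: the former bakes the voice into the weights, the latter into the autoregressive context.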

Tools