Voice LLM Fundamentals

1. Fourier Transform - Understanding Sound

What is the Fourier Transform?

The Fourier Transform is a mathematical operation that transforms a signal from the time domain to the frequency domain.

At a high level, it takes a signal (typically a waveform) that changes over time and decomposes it into a sum of sinusoidal waves (sine and cosine functions). These sinusoidal waves have specific frequencies, amplitudes, and phases. The resulting transformed signal reveals how much of each frequency is present in the original signal.

Types of Fourier Transforms

  1. Continuous Fourier Transform (CFT): Used for continuous-time signals, like real-world analog signals (e.g., sound waves).

  2. Discrete Fourier Transform (DFT): Used for discrete-time signals (digital signals), where the signal is sampled at regular intervals. The DFT can be computed efficiently using the Fast Fourier Transform (FFT) algorithm.
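The DFT/FFT relationship is easy to see in code. This NumPy sketch samples a signal built from two known sinusoids and recovers their frequencies from the FFT magnitude spectrum (the signal and sampling rate are arbitrary illustrative choices):

```python
import numpy as np

# Sample a 3 Hz sine plus a weaker 7 Hz sine for 1 second at 64 Hz.
fs = 64                          # sampling rate (samples per second)
t = np.arange(fs) / fs           # 64 time points in [0, 1)
signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)

# The FFT computes the DFT efficiently; rfft keeps only the
# non-negative frequency bins of a real-valued signal.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(fs, d=1 / fs)    # bin index -> frequency in Hz

# The two largest magnitude peaks sit exactly at the component frequencies.
peaks = sorted(freqs[np.argsort(spectrum)[-2:]].tolist())
print(peaks)    # [3.0, 7.0]
```

Because both components land exactly on DFT bins here, there is no spectral leakage; real-world signals usually need windowing before the transform.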

Time Domain vs Frequency Domain

Time Domain

The time domain is a representation of a signal as a function of time. In this domain, the signal is described by how its values (or amplitude) change over time. It’s the most direct way to represent a signal, showing you exactly how the signal behaves at any given point in time.

  • X-axis: Time (typically in seconds)
  • Y-axis: Amplitude (or some other characteristic like voltage, current, or pressure)
  • Time-domain signals give us the “shape” of the signal at any given moment
  • A sound wave (like your voice or a musical note) that varies over time can be shown as a waveform in the time domain

Frequency Domain

The frequency domain is a representation of a signal as a function of frequency, rather than time. It shows what frequencies are present in the signal and how strong each frequency is. In other words, it tells you how much of each frequency component (sine wave) is present in the signal.

  • X-axis: Frequency (typically in Hertz, Hz, or cycles per second)
  • Y-axis: Amplitude (or power), which shows how strong each frequency component is
  • The frequency domain provides information about the signal’s frequency components—whether it’s a pure tone, a complex mixture, or noise
  • Many signals, especially periodic ones (like sound waves or electrical signals), are easier to analyze in the frequency domain

2. Audio Codecs & Neural Compression

Neural Compression Concept

Neural Compression aims to transform various data types, be it in pixel form (images), waveforms (audio), or frame sequences (video), into more compact representations, such as vectors.

Residual Vector Quantization (RVQ)

What is RVQ?

  • RVQ is a technique to compress vectors (like audio embeddings) into a few integers for efficient storage and transmission
  • It achieves higher fidelity than basic quantization methods, especially at low bitrates

How RVQ Works

  1. Codebook Quantization: A set of representative vectors, called “codebook vectors,” is learned. Each input vector is mapped to its closest codebook vector and represented by that vector’s index.
  2. Residual Calculation: The difference between the original vector and the chosen codebook vector is calculated (the “residual vector”)
  3. Iterative Quantization: The residual vector is further quantized using a new codebook, and a new residual is calculated. This process repeats for multiple iterations.
  4. Representation: The original vector is represented by a list of indices, each corresponding to a chosen codebook vector in different iterations.
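The four steps above can be sketched in plain NumPy. The codebooks here are random and untrained (real systems learn them, as discussed below), so the reconstruction is crude; the point is only the encode/decode mechanics. Reserving entry 0 as the zero vector is an illustrative trick, not part of standard RVQ: it lets a stage “abstain,” so the residual can never grow.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left over by the previous stage."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]      # leftover goes to the next stage
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction = sum of the chosen vector from every stage.
    return sum(cb[i] for cb, i in zip(codebooks, indices))

dim, K, stages = 8, 16, 4
# Untrained random codebooks; entry 0 is the zero vector (see lead-in).
codebooks = [
    np.vstack([np.zeros((1, dim)), rng.normal(size=(K - 1, dim))])
    for _ in range(stages)
]

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)           # just `stages` small integers
x_hat = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(x - x_hat)))
```

With trained codebooks each stage captures progressively finer detail, which is why adding stages raises both bitrate and fidelity.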

RVQ in EnCodec (An Audio Compression Model)

  • EnCodec uses RVQ to compress audio embeddings, achieving good quality even at low bitrates (around 6 kbps)
  • The number of RVQ iterations controls the bitrate and quality trade-off

Learning Codebook Vectors

  • Initially, K-means clustering over encoder outputs can provide a good starting set of codebook vectors
  • For better performance, codebook vectors are fine-tuned during model training:
    • Codebook Update: Codebook vectors are slightly moved towards the encoded vectors they represent
    • Commitment Loss: The encoder is penalized for producing vectors far from any codebook vector, encouraging it to produce easily quantizable representations
    • Random Restarts: Unused codebook vectors are relocated to areas where the encoder frequently produces vectors
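A toy NumPy sketch of the codebook-update and random-restart ideas (VQ-VAE-style, not EnCodec’s actual training code; the batch, learning rate, and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
K, dim = 8, 4
codebook = rng.normal(size=(K, dim))
usage = np.zeros(K)                        # how often each entry is chosen

def train_step(batch, codebook, lr=0.1):
    """Move each chosen codebook vector slightly toward the encoder
    outputs assigned to it (the 'codebook update')."""
    idx = np.argmin(
        ((batch[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
    for k in np.unique(idx):
        target = batch[idx == k].mean(axis=0)
        codebook[k] += lr * (target - codebook[k])
    usage[np.bincount(idx, minlength=K) > 0] += 1
    # Commitment loss: what the encoder would be penalized with for
    # producing vectors far from their nearest codebook entry.
    return ((batch - codebook[idx]) ** 2).mean()

def random_restart(batch, codebook):
    # Relocate never-used entries to vectors the encoder actually produced.
    dead = usage == 0
    codebook[dead] = batch[rng.integers(len(batch), size=int(dead.sum()))]

batch = rng.normal(size=(64, dim))         # stand-in for encoder outputs
losses = [train_step(batch, codebook) for _ in range(20)]
random_restart(batch, codebook)
print(losses[0], losses[-1])               # loss shrinks as entries move
```

In real training the encoder is updated jointly via the commitment loss; here the “encoder outputs” are a fixed random batch, so only the codebook side is shown.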

Key Benefits & Applications

  • RVQ enables efficient audio compression with smaller file sizes than traditional formats like MP3
  • It has potential applications in music streaming, voice assistants, and other audio-related technologies

3. Voice Activity Detection (VAD)

Voice Activity Detection (VAD) is the process of determining whether a given audio segment contains speech or is silence/background noise.

VAD Pipeline

  1. Audio Input: The system receives a continuous audio stream, perhaps from a microphone.

  2. Framing: The audio is split into small chunks or frames, typically around 20 milliseconds each. This is similar to taking snapshots of the audio at rapid intervals.

  3. Feature Extraction: For each frame, the system computes features that can indicate speech. Common features include:

    • Short-Time Energy (STE): Measures the energy in the frame. Speech usually has higher energy than silence
    • Zero-Crossing Rate (ZCR): Counts how often the audio signal crosses the zero amplitude axis, indicating frequency content
  4. Classification: The system uses these features to decide whether each frame contains speech. This can be as simple as thresholding (if energy > threshold, it’s speech) or as complex as using a trained neural network model.

  5. Speech Segmentation: Consecutive frames classified as speech are grouped together to form speech segments.

  6. Output: The speech segments are then passed on to the next component in the pipeline for further processing.
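The pipeline above, using only the short-time-energy feature and a fixed threshold, can be sketched as follows (the threshold and synthetic test signal are illustrative; production systems like Silero VAD use learned models instead):

```python
import numpy as np

def simple_vad(audio, sample_rate, frame_ms=20, energy_thresh=0.01):
    """Energy-threshold VAD: flag 20 ms frames whose short-time energy
    exceeds a threshold, then merge runs of speech frames into
    (start_sec, end_sec) segments."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)

    energy = (frames ** 2).mean(axis=1)        # step 3: feature extraction
    is_speech = energy > energy_thresh         # step 4: classification

    segments, start = [], None                 # step 5: segmentation
    for i, s in enumerate(is_speech):
        if s and start is None:
            start = i
        elif not s and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments

# Synthetic input: 1 s of near-silence with a loud tone from 0.3 s to 0.6 s.
sr = 16000
t = np.arange(sr) / sr
audio = 0.001 * np.random.default_rng(2).normal(size=sr)
audio[int(0.3 * sr):int(0.6 * sr)] += 0.5 * np.sin(2 * np.pi * 220 * t[: int(0.3 * sr)])

segments = simple_vad(audio, sr)
print(segments)    # [(0.3, 0.6)]
```

A fixed energy threshold fails under loud background noise, which is exactly why trained classifiers (step 4’s “complex” option) dominate in practice.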

| Model | Parameters | RTF (CPU) | Architecture | Built-in VAD | Python Support | Memory Usage | Quality (PESQ) |
|---|---|---|---|---|---|---|---|
| RNNoise | 0.06M | 0.03 | GRU + DSP Hybrid | Yes | Limited | Very Low (~2MB) | 2.33 |
| DeepFilterNet2 | 2.31M | 0.04 | Two-stage ERB | No | Yes (PyTorch) | Low (~10MB) | 3.08 |
| SpeechBrain SepFormer | 22.3M | ~0.15 | Transformer-based | No | Yes (PyTorch) | High (~100MB+) | 3.15 |
| Silero VAD | 0.002M | <0.001 | RNN-based VAD | Yes (primary) | Yes (PyTorch/ONNX) | Very Low (~2MB) | N/A (VAD only) |
| Facebook Denoiser | 60.8M | 0.8 | U-Net Encoder-Decoder | No | Yes (PyTorch) | High (~250MB+) | 3.07 |

4. Phonemes, Graphemes & Bytes

Phonemes

Phonemes are the smallest units of sound in a language that can distinguish one word from another.

  • Think of them as sound building blocks
  • They’re not letters—they’re sounds

| Word | Spoken Sounds (Phonemes) |
|---|---|
| cat | /k/ /æ/ /t/ |
| bat | /b/ /æ/ /t/ |

Most Text-to-Speech (TTS) models don’t generate speech directly from letters. They first:

  1. Convert text to phonemes (pronunciation)
  2. Then generate audio from that
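Step 1 is often a dictionary lookup. This toy grapheme-to-phoneme (G2P) sketch is purely illustrative; real front ends use large pronunciation lexicons (e.g. CMUdict) plus a model for out-of-vocabulary words:

```python
# Toy pronunciation lexicon -- entries invented for illustration,
# using the phonemes from the table above.
TOY_LEXICON = {
    "cat": ["k", "æ", "t"],
    "bat": ["b", "æ", "t"],
}

def text_to_phonemes(text):
    """Step 1: convert text to a flat phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        if word not in TOY_LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(TOY_LEXICON[word])
    return phonemes

# Step 2 (phonemes -> audio) would be handled by an acoustic model
# plus a vocoder; here we only show the intermediate representation.
print(text_to_phonemes("cat bat"))    # ['k', 'æ', 't', 'b', 'æ', 't']
```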

Graphemes

Graphemes are the written characters (letters or symbols) of a language.

| Word | Graphemes |
|---|---|
| Cat | C, A, T |
| She | S, H, E |

Bytes

Bytes, in this context, refer to byte-level encodings of text—raw numerical representations of characters in a specific encoding format (usually UTF-8).

For the word “Hi”, the bytes (in UTF-8) are:

  • “H” → 72
  • “i” → 105

So the byte input is: [72, 105]
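This is exactly what Python’s built-in UTF-8 encoder produces; note that non-ASCII characters expand to more than one byte, which byte-level models must learn to handle:

```python
text = "Hi"
data = text.encode("utf-8")          # raw UTF-8 bytes
print(list(data))                    # [72, 105]

# A non-ASCII character occupies multiple bytes under UTF-8:
print(list("é".encode("utf-8")))     # [195, 169]
```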

Byte-based models (like Bark or ByT5) bypass tokenizers and G2P (grapheme-to-phoneme) tools. They learn pronunciation from raw byte sequences.


5. Common Vocoders in Speech Synthesis

A vocoder is a crucial component that converts intermediate representations (like spectrograms or tokens) into actual audio waveforms.

Commonly Used Vocoders

  • WaveNet - Dilated convolutions, autoregressive, high quality but slow
  • Tacotron - Encoder-Decoder with attention; strictly an acoustic model that outputs spectrograms, so it relies on a separate vocoder (originally Griffin-Lim)
  • WaveGlow - Flow-based, invertible, faster than WaveNet
  • HiFi-GAN - GAN-based, fast and high-quality (most widely used)
  • EnCodec - Neural audio codec with RVQ compression
  • Mimi - Neural audio codec (used in Moshi)
  • BigVGAN - Extended HiFi-GAN for improved quality
  • DiffWave - Diffusion-based vocoder

Vocoder Types

GAN-based Vocoders

  • Most widely used for fast and high-quality speech synthesis
  • Examples: HiFi-GAN, BigVGAN

Neural Audio Codecs

  • Combine compression and synthesis
  • Examples: EnCodec, Mimi, SoundStream

Flow-based Vocoders

  • Invertible models
  • Example: WaveGlow

Diffusion-based Vocoders

  • Recent approach using diffusion models
  • Example: DiffWave

6. History of TTS Technology

WaveNet (2016)

A key breakthrough was WaveNet, a deep neural network developed by Google DeepMind in 2016. WaveNet revolutionized TTS by directly modeling the raw waveform of an audio signal, producing remarkably natural-sounding speech and setting a new standard for quality. Its ability to capture intricate temporal dependencies in audio signals was unprecedented.

Tacotron & Deep Voice (2017)

Following WaveNet, end-to-end TTS models such as Tacotron and Deep Voice emerged, capable of generating speech directly from text input without requiring complex handcrafted features.

Deep Voice, a production-quality TTS system released in 2017, was constructed entirely from deep neural networks, laying the groundwork for truly end-to-end neural speech synthesis. It simplified traditional TTS pipelines by replacing all components with neural networks and minimizing the reliance on hand-engineered features.

Deep Voice Components:

  1. Grapheme-to-phoneme model - Converts text to phonetic representation
  2. Segmentation model - Used for training data annotation
  3. Phoneme duration prediction model - Determines how long each sound should be
  4. Fundamental frequency (F0) prediction model - Predicts pitch
  5. Audio synthesis model - A variant of WaveNet that generates the actual audio

These models employed attention mechanisms and sequence-to-sequence architectures to learn the mapping between text and speech, resulting in more fluent and expressive synthetic speech.


Further Reading