Voice LLM - Two Processing Methods

Voice LLMs understand audio using two distinct approaches:

Method 1: Raw Waveform (Time-Domain) Processing

Direct processing of the original audio signal.

Characteristics

  • Audio is treated as a sequence of amplitude values over time
  • Example: 44,100 samples per second for 44.1 kHz audio
  • No information loss (everything the microphone captured is preserved)
  • High-dimensional input → needs large models and lots of compute
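The points above can be made concrete with a few lines of NumPy; the tone frequency and duration are purely illustrative:

```python
import numpy as np

# 1 second of a 440 Hz sine tone sampled at 44.1 kHz: the "raw
# waveform" is just this sequence of amplitude values over time.
sample_rate = 44_100          # samples per second (CD quality)
duration_s = 1.0
t = np.arange(int(sample_rate * duration_s)) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)

print(waveform.shape)   # (44100,) -> one amplitude value per sample
```

Even one second of audio is a 44,100-dimensional input, which is why time-domain models need so much compute.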

Advantages

  • Complete information preservation
  • Captures subtle acoustic details

Disadvantages

  • Computationally expensive
  • Requires large models
  • High memory usage

Example Models

  • Wav2Vec 2.0 (Meta) - Self-supervised pre-training
  • HuBERT (Meta) - Masked prediction over raw waveform input
  • WaveNet (DeepMind) - For TTS

Method 2: Frequency-Domain Processing

Indirect processing using the frequency spectrum instead of raw samples.

Approach

  • Convert waveform to frequency representation using Fourier Transform
  • Work with frequency features instead of amplitude samples
  • Apply perceptual scaling (Mel scale)

Representations

  • Spectrogram: time on one axis, frequency on the other, color/intensity for amplitude
  • Mel Spectrogram: frequency axis warped to match human ear perception
  • MFCC: Mel-Frequency Cepstral Coefficients (a compact, decorrelated form)
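A spectrogram can be computed with a short-time Fourier transform (STFT). Below is a minimal NumPy-only sketch (production code would use torchaudio or librosa, and would add mel filtering on top); the window and stride sizes are the ones commonly used for speech features:

```python
import numpy as np

def stft_magnitude(waveform, n_fft=400, hop=160):
    """Naive STFT: slide a window over the signal and FFT each frame.
    At 16 kHz, n_fft=400 (~25 ms) and hop=160 (~10 ms) are typical
    window/stride choices for speech features."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-redundant half of the spectrum
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, time_frames)

sr = 16_000
t = np.arange(sr) / sr                      # 1 s of audio
audio = np.sin(2 * np.pi * 440.0 * t)       # pure 440 Hz tone
spec = stft_magnitude(audio)
print(spec.shape)   # (201, 98): 201 frequency bins x 98 time frames
```

Note the dimensionality drop: 16,000 raw samples become a 201 x 98 grid, and mel scaling would further compress the 201 bins to the usual 80.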

Advantages

  • Reduced dimensionality
  • Aligns with human perception
  • Well suited to CNNs (a spectrogram can be treated like an image)

Example Models

  • Whisper (converts raw → log-mel spectrogram)
  • Tacotron (for TTS)
  • DeepSpeech

SpeechLM Components

A typical SpeechLM consists of three major components:

Component 1: Speech Tokenizer

Purpose: Convert continuous audio signals into latent representations, which are then quantized into discrete tokens.

Tokenizer Objectives

Semantic Understanding Objective

  • Extracts semantic features from audio
  • Enables tasks like Automatic Speech Recognition (ASR)
  • Prioritizes meaning and content over acoustic details
  • Examples: Wav2vec 2.0, W2v-BERT, WavLM

Acoustic Generation Objective

  • Captures acoustic features for high-quality speech synthesis
  • Focused on speech fidelity rather than semantic content
  • Examples: SoundStream, EnCodec

Mixed Objective

  • Balances both semantic understanding AND acoustic generation
  • Best for speech-to-speech tasks
  • Examples: SpeechTokenizer, Mimi
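Whatever the objective, the final quantization step shared by these tokenizers is a nearest-neighbour lookup against a codebook. The sketch below uses a random codebook for illustration; real tokenizers such as SoundStream and EnCodec learn the codebook (and use residual vector quantization with several of them):

```python
import numpy as np

rng = np.random.default_rng(0)

# A "codebook" of 512 vectors; real tokenizers train these end-to-end,
# here they are random purely for illustration.
codebook = rng.normal(size=(512, 64))

def tokenize(frames):
    """Map each continuous encoder frame to the id of its nearest
    codebook entry -> a sequence of discrete speech tokens."""
    # squared Euclidean distance from every frame to every code vector
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

frames = rng.normal(size=(50, 64))   # 50 continuous encoder frames
tokens = tokenize(frames)
print(tokens.shape)   # (50,) -> one discrete token id per frame
```

The language model then treats these integer ids exactly like text token ids.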

Component 2: Language Model

Purpose: Core reasoning and generation engine.

Characteristics

  • Inspired by Large Language Models (LLMs)
  • Usually transformer or decoder-only architecture
  • Examples: OPT, LLaMA
  • Adapted from text-based models by incorporating the speech tokenizer's outputs
  • Capable of handling both text and speech modalities
  • Vocabulary expanded to include both token types

Autoregressive Generation

  • Generates tokens one at a time
  • Each token conditioned on previous tokens
  • Allows open-ended generation
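A toy sketch of both ideas: the vocabulary is expanded by appending speech token ids after the text ids, and generation is a greedy loop where each step conditions on everything so far. The model here is a deterministic random stand-in, not a real transformer:

```python
import numpy as np

TEXT_VOCAB = 32_000          # base LLM vocabulary (illustrative size)
SPEECH_CODES = 1_024         # speech tokenizer codebook size
VOCAB = TEXT_VOCAB + SPEECH_CODES   # expanded joint vocabulary
# speech token k is stored as id TEXT_VOCAB + k

def fake_model(context):
    """Stand-in for the transformer: returns logits over the joint
    vocabulary. A real SpeechLM would run attention over `context`."""
    rng = np.random.default_rng(len(context))   # deterministic per step
    return rng.normal(size=VOCAB)

def generate(prompt, n_steps=5):
    tokens = list(prompt)
    for _ in range(n_steps):
        logits = fake_model(tokens)             # conditioned on all prior tokens
        tokens.append(int(np.argmax(logits)))   # greedy: emit one token at a time
    return tokens

out = generate([101, 102], n_steps=5)
print(len(out))   # 7 = 2 prompt tokens + 5 generated
```

Because each emitted id can fall in either the text range or the speech range, the same loop serves text output, speech output, or a mix.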

Component 3: Token-to-Speech Synthesizer (Vocoder)

Purpose: Convert discrete speech tokens back into audio waveforms.

Synthesis Pipelines

Direct Synthesis

  • Converts speech tokens → waveforms directly
  • Fast and straightforward
  • Works with acoustic-focused tokenizers

Input-Enhanced Synthesis

  • Tokens → Continuous latent representation → Vocoder → Waveform
  • Helpful when tokens lack acoustic details
  • Better quality but slightly slower
  • Works with semantic-focused tokenizers
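The two pipelines differ only in whether an intermediate latent is produced. The sketch below uses hypothetical placeholder callables (`token_to_wave`, `token_to_latent`, `vocoder`), not a real library API:

```python
# Sketch of the two pipelines; the three callables are hypothetical
# placeholders standing in for trained networks.

def direct_synthesis(tokens, token_to_wave):
    # Acoustic tokens already carry enough detail to decode directly.
    return token_to_wave(tokens)

def input_enhanced_synthesis(tokens, token_to_latent, vocoder):
    # Semantic tokens are first expanded into a richer continuous
    # latent; a vocoder then turns that latent into a waveform.
    latent = token_to_latent(tokens)
    return vocoder(latent)

# Toy stand-ins: each token becomes 160 output samples (10 ms at 16 kHz).
wave_a = direct_synthesis([7, 8, 9], lambda toks: [0.0] * (160 * len(toks)))
wave_b = input_enhanced_synthesis(
    [7, 8, 9],
    token_to_latent=lambda toks: [[0.0] * 8 for _ in toks],
    vocoder=lambda latent: [0.0] * (160 * len(latent)),
)
print(len(wave_a), len(wave_b))   # 480 480
```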

Vocoder Types

GAN-based Vocoders (Most common)

  • Fast and high-quality
  • Examples: HiFi-GAN, BigVGAN

Neural Audio Codecs

  • Combine compression and synthesis
  • Examples: EnCodec, Mimi

Other Types

  • WaveNet (high quality, slow)
  • WaveGlow (flow-based)
  • DiffWave (diffusion-based)

Canary - High-Performance ASR/Translation Model

Canary is a state-of-the-art model for speech recognition and translation, achieving high accuracy without “web-scale” data.

Architecture

FastConformer-based Attention Encoder-Decoder (AED)

The Encoder: FastConformer

  • Speech-optimized variant of the Conformer architecture
  • Features an increased downsampling factor (8x instead of the Conformer's 4x)
  • Benefit: ~2.8x speedup in processing without losing modeling capacity

Key Features

  1. Task Prompting (like Whisper)

    • Special tokens guide tasks
    • <|transcribe|> for transcription
    • <|translate|> for translation
    • New controls for punctuation and capitalization (PnC) via <|pnc|> and <|nopnc|> tokens
  2. Efficient Processing

    • Increased downsampling reduces computation
    • ~2.8x speedup over the baseline Conformer
    • Maintains modeling quality
  3. Pre-trained Initialization

    • Encoder initialized from pre-trained weights
    • Converges faster
    • Achieves better metrics than training from scratch
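Task prompting boils down to assembling a short sequence of control tokens that the decoder is conditioned on. The exact layout and language tokens below are assumptions for illustration; only the `<|transcribe|>`, `<|translate|>`, `<|pnc|>`, and `<|nopnc|>` tokens come from the text above:

```python
# Sketch of control-token prompting; the ordering and the language
# tokens (e.g. <|en|>) are illustrative assumptions, not Canary's
# exact internal prompt format.
def build_prompt(task, pnc, source_lang, target_lang):
    assert task in ("transcribe", "translate")
    return [
        f"<|{source_lang}|>",                 # language of the input speech
        f"<|{task}|>",                        # task selector
        f"<|{target_lang}|>",                 # language of the output text
        "<|pnc|>" if pnc else "<|nopnc|>",    # punctuation & capitalization
    ]

print(build_prompt("translate", pnc=True, source_lang="en", target_lang="de"))
# ['<|en|>', '<|translate|>', '<|de|>', '<|pnc|>']
```

Changing a single token switches the model between transcription and translation, or toggles punctuation, with no architectural change.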

Performance

  • Achieves state-of-the-art accuracy on ASR/AST tasks
  • Works well without massive datasets
  • Faster inference than comparable models
  • Better than Whisper on many benchmarks

Speech-Augmented Language Model (SALM)

SALM integrates speech processing with a large language model, enabling single models to understand both spoken audio and text.

Core Concept

  • Convert raw audio → vectors
  • Make vectors compatible with LLM
  • Let LLM reason over audio “as if it were text tokens”
  • Bridges spoken and written language

Two-Component Architecture

  1. Speech Encoder (from ASR models)

    • Converts audio → embeddings
  2. Large Language Model (LLM)

    • Performs reasoning and generation

Component Breakdown

Audio Perception Module

Converts raw audio into embeddings for LLM consumption.

Subcomponents:

  • Preprocessor

    • Converts waveform → time-frequency representation
    • Example: Mel Spectrogram
    • Standard: 80 frequency bins
  • Encoder

    • Strong ASR encoder (e.g., FastConformer from Canary)
    • Extracts high-level semantic information
    • Output: 1024-dimensional embeddings
  • Modality Adapter

    • Adjusts encoder outputs for LLM expectations
    • Often a projection or small Conformer stack
    • Ensures structural alignment
  • Projector

    • Linear layer mapping audio dimensions to LLM dimensions
    • Example: 1024 → 4096 (for 4B parameter LLM)
    • Makes audio indistinguishable from text embeddings
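The projector is just a matrix multiply; a NumPy sketch with the dimensions from the example above (weights are random stand-ins for the trained layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Projector: one linear layer mapping encoder frames into the LLM's
# embedding space. Dimensions follow the 1024 -> 4096 example above;
# the weights are random stand-ins for the trained parameters.
W = rng.normal(scale=0.02, size=(1024, 4096))
b = np.zeros(4096)

audio_frames = rng.normal(size=(12, 1024))  # 12 encoder output frames
llm_ready = audio_frames @ W + b            # (12, 4096)
print(llm_ready.shape)  # same shape as 12 text-token embeddings
```

After this step the LLM cannot tell, from shape alone, whether an embedding came from text or audio.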

Large Language Model

Standard transformer-based model (e.g., Qwen-2.5B):

  • Accepts both text embeddings and audio embeddings
  • Generates text output
  • No architectural changes needed
  • Vocabulary includes audio placeholder token: <|audio_locator|>

Complete Pipeline: Audio to Output

Let’s trace a 1-second audio clip saying “Cat” through SALM:

Stage 1: Preprocessing

  • Input: Raw waveform at 16,000 Hz
  • Shape: [1, 16000] (Batch, Samples)
  • Operation: Sliding windows (~25 ms) with ~10 ms stride
  • Output: Mel Spectrogram [1, 80, 100] (Batch, Frequencies, Time)

Stage 2: Encoding

  • Model: FastConformer encoder
  • Operation: Semantic extraction with temporal subsampling
  • Subsampling: 100 frames → ~12 frames (8× reduction)
  • Output: [1, 1024, 12] (Batch, Encoder_Dim, Time)

Stage 3: Projection

  • Input: [1, 1024, 12]
  • Transpose: [1, 12, 1024]
  • Linear Projection: Linear(1024 → 4096)
  • Output: [1, 12, 4096] (each timestep is LLM-compatible embedding)

Stage 4: LLM Processing

  • Text input: “Transcribe:”
  • Placeholder: <|audio_locator|>
  • Final sequence: Text tokens + Audio embeddings
  • Total: ~15 embeddings of 4096 dims each
  • Processing: Standard transformer attention across both text and audio
  • Output: Generated tokens for transcription
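The four stages can be traced end-to-end with shape bookkeeping alone. The arrays below are random stand-ins for the real mel transform, encoder, and embeddings; only the shapes matter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: 1 s of 16 kHz audio -> 80-bin mel spectrogram (~100 frames)
waveform = rng.normal(size=(1, 16_000))   # [batch, samples]
mel = rng.normal(size=(1, 80, 100))       # stand-in for the mel transform

# Stage 2: encoder with 8x temporal subsampling -> [1, 1024, 12]
enc = rng.normal(size=(1, 1024, mel.shape[2] // 8))

# Stage 3: transpose, then project 1024 -> 4096
W = rng.normal(scale=0.02, size=(1024, 4096))
audio_emb = enc.transpose(0, 2, 1) @ W    # [1, 12, 4096]

# Stage 4: splice the audio embeddings where <|audio_locator|> sits in
# the text embedding sequence ("Transcribe:" -> say 3 text embeddings)
text_emb = rng.normal(size=(1, 3, 4096))
sequence = np.concatenate([text_emb, audio_emb], axis=1)
print(sequence.shape)   # (1, 15, 4096): ~15 embeddings enter the LLM
```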

Mental Model

  • Audio converted to short sequence of vectors
  • Vectors shaped to be indistinguishable from text embeddings
  • LLM reasons over both intent (text) and content (audio) with same attention mechanism
  • SALM doesn’t add new speech reasoning — it reformats speech so LLM can already reason over it

NLU in Speech Models

Natural Language Understanding (NLU) models are designed to enable computers to understand the meaning and intent behind human language, both written and spoken.

Role in Speech Models

  • Critical for voice agents
  • Enable accurate task understanding
  • Support context-aware responses
  • Improve dialogue quality

Integration

  • Often combined with speech tokenizers
  • Upstream of response generation
  • Helps models understand user intent
  • Enables more intelligent routing and handling

Advances in Speech Language Models (Research Insights)

Speech Tokenization

The choice of tokenizer significantly impacts downstream performance:

  • Semantic tokenizers: Better for ASR, understanding
  • Acoustic tokenizers: Better for synthesis, voice quality
  • Mixed tokenizers: Balance between both

End-to-End Learning

Models benefit from joint training of:

  • Speech perception
  • Language reasoning
  • Speech generation

Separate training results in suboptimal performance.

Context and Coherence

Long-form speech generation requires:

  • Proper context management
  • Consistent speaker identity
  • Coherent semantic flow
  • Emotional consistency