Voice LLM - Two Processing Methods
Voice LLMs understand audio using two distinct approaches:
Method 1: Raw Waveform (Time-Domain) Processing
Direct processing of the original audio signal.
Characteristics
- Audio treated as sequence of amplitude values over time
- Example: 44,100 samples per second in 44.1kHz audio
- No loss of information (everything captured by the microphone is preserved)
- High-dimensional input → needs large models and lots of compute
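The numbers above can be made concrete with a small sketch: one second of a pure tone sampled at 44.1 kHz, built with nothing but the standard library (the 440 Hz frequency is an arbitrary choice for illustration):

```python
import math

# A raw waveform is just a long sequence of amplitude samples.
SAMPLE_RATE = 44_100  # samples per second (44.1 kHz)
FREQ_HZ = 440.0       # A4 pitch, chosen arbitrarily for the example

waveform = [
    math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)  # one second of audio
]

print(len(waveform))   # 44100 samples for a single second
print(waveform[:3])    # each entry is just an amplitude value
```

A model working in the time domain has to consume all 44,100 values per second, which is why this route demands large models and heavy compute.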
Advantages
- Complete information preservation
- Captures subtle acoustic details
Disadvantages
- Computationally expensive
- Requires large models
- High memory usage
Example Models
- Wav2Vec 2.0 (Meta) - Self-supervised pre-training
- Whisper (OpenAI) - Accepts raw audio but converts it to log-mel spectrograms before encoding (see Method 2)
- WaveNet (DeepMind) - For TTS
Method 2: Frequency-Domain Processing
Indirect processing using frequency spectrum instead of raw samples.
Approach
- Convert waveform to frequency representation using Fourier Transform
- Work with frequency features instead of amplitude samples
- Apply perceptual scaling (Mel scale)
Representations
- Spectrogram: Time vs. frequency, with color (intensity) representing amplitude
- Mel Spectrogram: Scaled to match human ear perception
- MFCC: Mel-Frequency Cepstral Coefficients (compact form)
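As a rough sketch of how a waveform becomes a time-frequency representation, the following frames a signal into ~25 ms windows with a ~10 ms hop and takes a naive DFT of each frame. It is deliberately minimal: a real mel spectrogram would also apply a window function and a mel filterbank, and would use an FFT rather than this O(n²) loop.

```python
import cmath
import math

def stft_magnitudes(samples, sample_rate, win_ms=25, hop_ms=10):
    """Frame a waveform and return one magnitude spectrum per frame."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]
    spectra = []
    for frame in frames:
        n = len(frame)
        # Naive DFT, fine for a sketch; use an FFT in practice.
        spec = [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame)))
                for k in range(n // 2 + 1)]
        spectra.append(spec)
    return spectra

# 0.1 s of a 100 Hz tone at 1 kHz sampling, kept tiny so the naive DFT is fast
sr = 1000
tone = [math.sin(2 * math.pi * 100 * t / sr) for t in range(sr // 10)]
spectra = stft_magnitudes(tone, sr)
print(len(spectra), len(spectra[0]))  # (time frames, frequency bins)
```

The key payoff is dimensionality: each frame of 25 raw samples is summarized by a handful of frequency bins, and in a real pipeline the mel filterbank compresses this further.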
Advantages
- Reduced dimensionality
- Aligns with human perception
- Works well with CNNs (spectrograms can be treated like images)
Example Models
- Whisper (converts raw → log-mel spectrogram)
- Tacotron (for TTS)
- DeepSpeech
SpeechLM Components
A typical SpeechLM consists of three major components:
Component 1: Speech Tokenizer
Purpose: Convert continuous audio signals into latent representations, which are then quantized into discrete tokens.
Tokenizer Objectives
Semantic Understanding Objective
- Extracts semantic features from audio
- Enables tasks like Automatic Speech Recognition (ASR)
- Prioritizes meaning and content over acoustic details
- Examples: Wav2vec 2.0, W2v-BERT, WavLM
Acoustic Generation Objective
- Captures acoustic features for high-quality speech synthesis
- Focused on speech fidelity rather than semantic content
- Examples: SoundStream, EnCodec
Mixed Objective
- Balances both semantic understanding AND acoustic generation
- Best for speech-to-speech tasks
- Examples: SpeechTokenizer, Mimi
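The common step across all three objectives is quantization: continuous encoder outputs are snapped to a finite codebook, and the codebook index becomes the discrete token. A minimal sketch of that step, with made-up 2-dimensional codebook vectors:

```python
# VQ-style tokenization sketch: map each continuous frame embedding to the
# index of its nearest codebook vector. All values here are invented.

def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(frame, codebook[i]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # 4 "tokens"
frames = [(0.1, -0.2), (0.9, 0.1), (0.4, 0.8)]               # encoder outputs

tokens = [nearest_code(f, codebook) for f in frames]
print(tokens)  # → [0, 1, 2]: discrete IDs the language model can consume
```

The objectives above differ in what the encoder is trained to put into those frame embeddings (meaning, acoustics, or both), not in this quantization mechanic itself.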
Component 2: Language Model
Purpose: Core reasoning and generation engine.
Characteristics
- Inspired by Large Language Models (LLMs)
- Usually transformer or decoder-only architecture
- Examples: OPT, LLaMA
- Adapted from text-based models by incorporating a speech tokenizer
- Capable of handling both text and speech modalities
- Vocabulary expanded to include both token types
Autoregressive Generation
- Generates tokens one at a time
- Each token conditioned on previous tokens
- Allows open-ended generation
Component 3: Token-to-Speech Synthesizer (Vocoder)
Purpose: Convert discrete speech tokens back into audio waveforms.
Synthesis Pipelines
Direct Synthesis
- Converts speech tokens → waveforms directly
- Fast and straightforward
- Works with acoustic-focused tokenizers
Input-Enhanced Synthesis
- Tokens → Continuous latent representation → Vocoder → Waveform
- Helpful when tokens lack acoustic details
- Better quality but slightly slower
- Works with semantic-focused tokenizers
Vocoder Types
GAN-based Vocoders (Most common)
- Fast and high-quality
- Examples: HiFi-GAN, BigVGAN
Neural Audio Codecs
- Combine compression and synthesis
- Examples: EnCodec, Mimi
Other Types
- WaveNet (high quality, slow)
- WaveGlow (flow-based)
- DiffWave (diffusion-based)
Canary - High-Performance ASR/Translation Model
Canary is a state-of-the-art model for speech recognition and translation, achieving high accuracy without “web-scale” data.
Architecture
FastConformer-based Attention Encoder-Decoder (AED)
The Encoder: FastConformer
- Speech-specific modification of the Conformer encoder
- Features increased downsampling factor
- Benefit: 2.8x speedup in processing without losing modeling capacity
Key Features
Task Prompting (like Whisper)
- Special tokens guide tasks: <|transcribe|> for transcription, <|translate|> for translation
- New controls for punctuation and capitalization (PnC) via <|pnc|> and <|nopnc|> tokens
Efficient Processing
- Increased downsampling reduces computation
- 2.8x speedup over standard transformer
- Maintains modeling quality
Pre-trained Initialization
- Encoder initialized from pre-trained weights
- Converges faster
- Achieves better metrics than training from scratch
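Assembling the control tokens might look like the following sketch. The token names come from the notes above, but the exact prompt layout is an assumption for illustration:

```python
# Whisper/Canary-style task prompting: special tokens prepended to the
# decoder input select the task and output formatting.

def build_prompt(task: str, pnc: bool) -> str:
    """Build a decoder prompt prefix from a task name and a PnC flag."""
    task_token = {"transcribe": "<|transcribe|>", "translate": "<|translate|>"}[task]
    pnc_token = "<|pnc|>" if pnc else "<|nopnc|>"
    return task_token + pnc_token

print(build_prompt("transcribe", pnc=True))   # → <|transcribe|><|pnc|>
print(build_prompt("translate", pnc=False))   # → <|translate|><|nopnc|>
```

Because the task is selected by tokens rather than by architecture, one model serves transcription and translation, with and without punctuation and capitalization.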
Performance
- Achieves state-of-the-art accuracy on ASR/AST tasks
- Works well without massive datasets
- Faster inference than comparable models
- Better than Whisper on many benchmarks
Speech-Augmented Language Model (SALM)
SALM integrates speech processing with a large language model, enabling single models to understand both spoken audio and text.
Core Concept
- Convert raw audio → vectors
- Make vectors compatible with LLM
- Let LLM reason over audio “as if it were text tokens”
- Bridges spoken and written language
Two-Component Architecture
Speech Encoder (from ASR models)
- Converts audio → embeddings
Large Language Model (LLM)
- Performs reasoning and generation
Component Breakdown
Audio Perception Module
Converts raw audio into embeddings for LLM consumption.
Subcomponents:
Preprocessor
- Converts waveform → time-frequency representation
- Example: Mel Spectrogram
- Standard: 80 frequency bins
Encoder
- Strong ASR encoder (e.g., FastConformer from Canary)
- Extracts high-level semantic information
- Output: 1024-dimensional embeddings
Modality Adapter
- Adjusts encoder outputs for LLM expectations
- Often a projection or small Conformer stack
- Ensures structural alignment
Projector
- Linear layer mapping audio dimensions to LLM dimensions
- Example: 1024 → 4096 (for 4B parameter LLM)
- Makes audio indistinguishable from text embeddings
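The projector itself is just one linear layer. A pure-Python sketch with scaled-down dimensions (8 → 32 standing in for the real 1024 → 4096), and random placeholder weights rather than trained values:

```python
import random

# Scaled-down stand-ins for the real sizes (1024 -> 4096) so this
# pure-Python sketch runs instantly; the operation is identical.
ENC_DIM, LLM_DIM = 8, 32
random.seed(0)
W = [[random.gauss(0.0, 0.02) for _ in range(ENC_DIM)] for _ in range(LLM_DIM)]

def project(audio_embedding):
    """One linear layer: map an encoder frame to an LLM-sized vector."""
    return [sum(w * x for w, x in zip(row, audio_embedding)) for row in W]

frame = [0.0] * ENC_DIM
frame[0] = 1.0                 # toy stand-in for one encoder output frame
llm_vector = project(frame)
print(len(llm_vector))         # one LLM-dimension vector per audio frame
```

After this mapping, each audio frame has exactly the shape of a text token embedding, which is what lets the LLM attend over both without modification.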
Large Language Model
Standard transformer-based model (e.g., Qwen-2.5B):
- Accepts both text embeddings and audio embeddings
- Generates text output
- No architectural changes needed
- Vocabulary includes audio placeholder token: <|audio_locator|>
Complete Pipeline: Audio to Output
Let’s trace a 1-second audio clip saying “Cat” through SALM:
Stage 1: Preprocessing
- Input: Raw waveform at 16,000 Hz
- Shape: [1, 16000] (Batch, Samples)
- Operation: Sliding windows (~25 ms) with ~10 ms stride
- Output: Mel Spectrogram [1, 80, 100] (Batch, Frequencies, Time)
Stage 2: Encoding
- Model: FastConformer encoder
- Operation: Semantic extraction with temporal subsampling
- Subsampling: 100 frames → ~12 frames (8× reduction)
- Output: [1, 1024, 12] (Batch, Encoder_Dim, Time)
Stage 3: Projection
- Input: [1, 1024, 12]
- Transpose: [1, 12, 1024]
- Linear Projection: Linear(1024 → 4096)
- Output: [1, 12, 4096] (each timestep is LLM-compatible embedding)
Stage 4: LLM Processing
- Text input: “Transcribe:”
- Placeholder: <|audio_locator|>
- Final sequence: Text tokens + Audio embeddings
- Total: ~15 embeddings of 4096 dims each
- Processing: Standard transformer attention across both text and audio
- Output: Generated tokens for transcription
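The shape bookkeeping across the four stages can be checked with simple arithmetic, assuming the ~10 ms hop and 8× subsampling factor stated above:

```python
# Reproduce the tensor shapes from the walkthrough for a 1-second clip.

def salm_shapes(seconds=1.0, sample_rate=16_000, hop_ms=10,
                n_mels=80, enc_dim=1024, llm_dim=4096, subsample=8):
    samples = int(seconds * sample_rate)       # Stage 1 input: raw samples
    mel_frames = int(seconds * 1000 / hop_ms)  # 100 frames at a 10 ms hop
    enc_frames = mel_frames // subsample       # 8x temporal reduction
    return {
        "waveform":  (1, samples),             # [Batch, Samples]
        "mel":       (1, n_mels, mel_frames),  # [Batch, Freqs, Time]
        "encoded":   (1, enc_dim, enc_frames), # [Batch, Enc_Dim, Time]
        "projected": (1, enc_frames, llm_dim), # [Batch, Time, LLM_Dim]
    }

for name, shape in salm_shapes().items():
    print(name, shape)
```

Note how aggressively the temporal axis shrinks: 16,000 samples become just 12 LLM-compatible embeddings, which is what makes attending over audio affordable.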
Mental Model
- Audio converted to short sequence of vectors
- Vectors shaped to be indistinguishable from text embeddings
- LLM reasons over both intent (text) and content (audio) with same attention mechanism
- SALM doesn’t add new speech reasoning — it reformats speech so LLM can already reason over it
NLU in Speech Models
Natural Language Understanding (NLU) models are designed to enable computers to understand the meaning and intent behind human language, both written and spoken.
Role in Speech Models
- Critical for voice agents
- Enable accurate task understanding
- Support context-aware responses
- Improve dialogue quality
Integration
- Often combined with speech tokenizers
- Upstream of response generation
- Helps models understand user intent
- Enables more intelligent routing and handling
Advances in Speech Language Models (Research Insights)
Speech Tokenization
The choice of tokenizer significantly impacts downstream performance:
- Semantic tokenizers: Better for ASR, understanding
- Acoustic tokenizers: Better for synthesis, voice quality
- Mixed tokenizers: Balance between both
End-to-End Learning
Models benefit from joint training of:
- Speech perception
- Language reasoning
- Speech generation
Separate training results in suboptimal performance.
Context and Coherence
Long-form speech generation requires:
- Proper context management
- Consistent speaker identity
- Coherent semantic flow
- Emotional consistency