Voice LLM - Two Processing Methods
Voice LLMs understand audio using two distinct approaches:
Method 1: Raw Waveform (Time-Domain) Processing
Direct processing of the original audio signal.
Characteristics
- Audio treated as sequence of amplitude values over time
- Example: 44,100 samples per second in 44.1kHz audio
- No loss of information (everything captured by the microphone is preserved)
- High-dimensional input → needs large models and lots of compute
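The numbers above can be made concrete with a small sketch: one second of a pure tone sampled at 44.1 kHz, built with nothing but the standard library (the 440 Hz frequency is an arbitrary choice for illustration):

```python
import math

# A raw waveform is just a long sequence of amplitude samples.
SAMPLE_RATE = 44_100  # samples per second (44.1 kHz)
FREQ_HZ = 440.0       # A4 pitch, chosen arbitrarily for the example

waveform = [
    math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)  # one second of audio
]

print(len(waveform))   # 44100 samples for a single second
print(waveform[:3])    # each entry is just an amplitude value
```

A model working in the time domain has to consume all 44,100 values per second, which is why this route demands large models and heavy compute.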
Advantages
- Complete information preservation
- Captures subtle acoustic details
Disadvantages
- Computationally expensive
- Requires large models
- High memory usage
Example Models
- Wav2Vec 2.0 (Meta) - Self-supervised pre-training
- Whisper (OpenAI) - Accepts raw audio but converts it to log-mel spectrograms before encoding (see Method 2)
- WaveNet (DeepMind) - For TTS
Method 2: Frequency-Domain Processing
Indirect processing using frequency spectrum instead of raw samples.
Approach
- Convert waveform to frequency representation using Fourier Transform
- Work with frequency features instead of amplitude samples
- Apply perceptual scaling (Mel scale)
Representations
- Spectrogram: Time vs. frequency, with color (intensity) representing amplitude
- Mel Spectrogram: Scaled to match human ear perception
- MFCC: Mel-Frequency Cepstral Coefficients (compact form)
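As a rough sketch of how a waveform becomes a time-frequency representation, the following frames a signal into ~25 ms windows with a ~10 ms hop and takes a naive DFT of each frame. It is deliberately minimal: a real mel spectrogram would also apply a window function and a mel filterbank, and would use an FFT rather than this O(n²) loop.

```python
import cmath
import math

def stft_magnitudes(samples, sample_rate, win_ms=25, hop_ms=10):
    """Frame a waveform and return one magnitude spectrum per frame."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]
    spectra = []
    for frame in frames:
        n = len(frame)
        # Naive DFT, fine for a sketch; use an FFT in practice.
        spec = [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame)))
                for k in range(n // 2 + 1)]
        spectra.append(spec)
    return spectra

# 0.1 s of a 100 Hz tone at 1 kHz sampling, kept tiny so the naive DFT is fast
sr = 1000
tone = [math.sin(2 * math.pi * 100 * t / sr) for t in range(sr // 10)]
spectra = stft_magnitudes(tone, sr)
print(len(spectra), len(spectra[0]))  # (time frames, frequency bins)
```

The key payoff is dimensionality: each frame of 25 raw samples is summarized by a handful of frequency bins, and in a real pipeline the mel filterbank compresses this further.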
Advantages
- Reduced dimensionality
- Aligns with human perception
- Works well with CNNs (spectrograms can be treated like images)
Example Models
- Whisper (converts raw → log-mel spectrogram)
- Tacotron (for TTS)
- DeepSpeech
SpeechLM Components
A typical SpeechLM consists of three major components:
Component 1: Speech Tokenizer
Purpose: Convert continuous audio signals into latent representations, which are then quantized into discrete tokens.
Tokenizer Objectives
Semantic Understanding Objective
- Extracts semantic features from audio
- Enables tasks like Automatic Speech Recognition (ASR)
- Prioritizes meaning and content over acoustic details
- Examples: Wav2vec 2.0, W2v-BERT, WavLM
Acoustic Generation Objective
- Captures acoustic features for high-quality speech synthesis
- Focused on speech fidelity rather than semantic content
- Examples: SoundStream, EnCodec
Mixed Objective
- Balances both semantic understanding AND acoustic generation
- Best for speech-to-speech tasks
- Examples: SpeechTokenizer, Mimi
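The common step across all three objectives is quantization: continuous encoder outputs are snapped to a finite codebook, and the codebook index becomes the discrete token. A minimal sketch of that step, with made-up 2-dimensional codebook vectors:

```python
# VQ-style tokenization sketch: map each continuous frame embedding to the
# index of its nearest codebook vector. All values here are invented.

def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(frame, codebook[i]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # 4 "tokens"
frames = [(0.1, -0.2), (0.9, 0.1), (0.4, 0.8)]               # encoder outputs

tokens = [nearest_code(f, codebook) for f in frames]
print(tokens)  # → [0, 1, 2]: discrete IDs the language model can consume
```

The objectives above differ in what the encoder is trained to put into those frame embeddings (meaning, acoustics, or both), not in this quantization mechanic itself.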
Component 2: Language Model
Purpose: Core reasoning and generation engine.
Characteristics
- Inspired by Large Language Models (LLMs)
- Usually transformer or decoder-only architecture
- Examples: OPT, LLaMA
- Adapted from text-based models by incorporating a speech tokenizer
- Capable of handling both text and speech modalities
- Vocabulary expanded to include both token types
Autoregressive Generation
- Generates tokens one at a time
- Each token conditioned on previous tokens
- Allows open-ended generation
Component 3: Token-to-Speech Synthesizer (Vocoder)
Purpose: Convert discrete speech tokens back into audio waveforms.
Synthesis Pipelines
Direct Synthesis
- Converts speech tokens → waveforms directly
- Fast and straightforward
- Works with acoustic-focused tokenizers
Input-Enhanced Synthesis
- Tokens → Continuous latent representation → Vocoder → Waveform
- Helpful when tokens lack acoustic details
- Better quality but slightly slower
- Works with semantic-focused tokenizers
Vocoder Types
GAN-based Vocoders (Most common)
- Fast and high-quality
- Examples: HiFi-GAN, BigVGAN
Neural Audio Codecs
- Combine compression and synthesis
- Examples: EnCodec, Mimi
Other Types
- WaveNet (high quality, slow)
- WaveGlow (flow-based)
- DiffWave (diffusion-based)
Canary - High-Performance ASR/Translation Model
Canary is a state-of-the-art model for speech recognition and translation, achieving high accuracy without “web-scale” data.
Architecture
FastConformer-based Attention Encoder-Decoder (AED)
The Encoder: FastConformer
- Speech-specific modification of the Conformer encoder
- Features increased downsampling factor
- Benefit: 2.8x speedup in processing without losing modeling capacity
Key Features
Task Prompting (like Whisper)
- Special tokens guide tasks: <|transcribe|> for transcription, <|translate|> for translation
- New controls for punctuation and capitalization (PnC) via <|pnc|> and <|nopnc|> tokens
Efficient Processing
- Increased downsampling reduces computation
- 2.8x speedup over standard transformer
- Maintains modeling quality
Pre-trained Initialization
- Encoder initialized from pre-trained weights
- Converges faster
- Achieves better metrics than training from scratch
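Assembling the control tokens might look like the following sketch. The token names come from the notes above, but the exact prompt layout is an assumption for illustration:

```python
# Whisper/Canary-style task prompting: special tokens prepended to the
# decoder input select the task and output formatting.

def build_prompt(task: str, pnc: bool) -> str:
    """Build a decoder prompt prefix from a task name and a PnC flag."""
    task_token = {"transcribe": "<|transcribe|>", "translate": "<|translate|>"}[task]
    pnc_token = "<|pnc|>" if pnc else "<|nopnc|>"
    return task_token + pnc_token

print(build_prompt("transcribe", pnc=True))   # → <|transcribe|><|pnc|>
print(build_prompt("translate", pnc=False))   # → <|translate|><|nopnc|>
```

Because the task is selected by tokens rather than by architecture, one model serves transcription and translation, with and without punctuation and capitalization.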
Performance
- Achieves state-of-the-art accuracy on ASR/AST tasks
- Works well without massive datasets
- Faster inference than comparable models
- Better than Whisper on many benchmarks
Speech-Augmented Language Model (SALM)
SALM integrates speech processing with a large language model, enabling single models to understand both spoken audio and text.
Core Concept
- Convert raw audio → vectors
- Make vectors compatible with LLM
- Let LLM reason over audio “as if it were text tokens”
- Bridges spoken and written language
Two-Component Architecture
Speech Encoder (from ASR models)
- Converts audio → embeddings
Large Language Model (LLM)
- Performs reasoning and generation
Component Breakdown
Audio Perception Module
Converts raw audio into embeddings for LLM consumption.
Subcomponents:
Preprocessor
- Converts waveform → time-frequency representation
- Example: Mel Spectrogram
- Standard: 80 frequency bins
Encoder
- Strong ASR encoder (e.g., FastConformer from Canary)
- Extracts high-level semantic information
- Output: 1024-dimensional embeddings
Modality Adapter
- Adjusts encoder outputs for LLM expectations
- Often a projection or small Conformer stack
- Ensures structural alignment
Projector
- Linear layer mapping audio dimensions to LLM dimensions
- Example: 1024 → 4096 (for 4B parameter LLM)
- Makes audio indistinguishable from text embeddings
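The projector itself is just one linear layer. A pure-Python sketch with scaled-down dimensions (8 → 32 standing in for the real 1024 → 4096), and random placeholder weights rather than trained values:

```python
import random

# Scaled-down stand-ins for the real sizes (1024 -> 4096) so this
# pure-Python sketch runs instantly; the operation is identical.
ENC_DIM, LLM_DIM = 8, 32
random.seed(0)
W = [[random.gauss(0.0, 0.02) for _ in range(ENC_DIM)] for _ in range(LLM_DIM)]

def project(audio_embedding):
    """One linear layer: map an encoder frame to an LLM-sized vector."""
    return [sum(w * x for w, x in zip(row, audio_embedding)) for row in W]

frame = [0.0] * ENC_DIM
frame[0] = 1.0                 # toy stand-in for one encoder output frame
llm_vector = project(frame)
print(len(llm_vector))         # one LLM-dimension vector per audio frame
```

After this mapping, each audio frame has exactly the shape of a text token embedding, which is what lets the LLM attend over both without modification.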
Large Language Model
Standard transformer-based model (e.g., Qwen-2.5B):
- Accepts both text embeddings and audio embeddings
- Generates text output
- No architectural changes needed
- Vocabulary includes audio placeholder token: <|audio_locator|>
Complete Pipeline: Audio to Output
Let’s trace a 1-second audio clip saying “Cat” through SALM:
Stage 1: Preprocessing
- Input: Raw waveform at 16,000 Hz
- Shape: [1, 16000] (Batch, Samples)
- Operation: Sliding windows (~25 ms) with ~10 ms stride
- Output: Mel Spectrogram [1, 80, 100] (Batch, Frequencies, Time)
Stage 2: Encoding
- Model: FastConformer encoder
- Operation: Semantic extraction with temporal subsampling
- Subsampling: 100 frames → ~12 frames (8× reduction)
- Output: [1, 1024, 12] (Batch, Encoder_Dim, Time)
Stage 3: Projection
- Input: [1, 1024, 12]
- Transpose: [1, 12, 1024]
- Linear Projection: Linear(1024 → 4096)
- Output: [1, 12, 4096] (each timestep is LLM-compatible embedding)
Stage 4: LLM Processing
- Text input: “Transcribe:”
- Placeholder: <|audio_locator|>
- Final sequence: Text tokens + Audio embeddings
- Total: ~15 embeddings of 4096 dims each
- Processing: Standard transformer attention across both text and audio
- Output: Generated tokens for transcription
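The shape bookkeeping across the four stages can be checked with simple arithmetic, assuming the ~10 ms hop and 8× subsampling factor stated above:

```python
# Reproduce the tensor shapes from the walkthrough for a 1-second clip.

def salm_shapes(seconds=1.0, sample_rate=16_000, hop_ms=10,
                n_mels=80, enc_dim=1024, llm_dim=4096, subsample=8):
    samples = int(seconds * sample_rate)       # Stage 1 input: raw samples
    mel_frames = int(seconds * 1000 / hop_ms)  # 100 frames at a 10 ms hop
    enc_frames = mel_frames // subsample       # 8x temporal reduction
    return {
        "waveform":  (1, samples),             # [Batch, Samples]
        "mel":       (1, n_mels, mel_frames),  # [Batch, Freqs, Time]
        "encoded":   (1, enc_dim, enc_frames), # [Batch, Enc_Dim, Time]
        "projected": (1, enc_frames, llm_dim), # [Batch, Time, LLM_Dim]
    }

for name, shape in salm_shapes().items():
    print(name, shape)
```

Note how aggressively the temporal axis shrinks: 16,000 samples become just 12 LLM-compatible embeddings, which is what makes attending over audio affordable.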
Mental Model
- Audio converted to short sequence of vectors
- Vectors shaped to be indistinguishable from text embeddings
- LLM reasons over both intent (text) and content (audio) with same attention mechanism
- SALM doesn’t add new speech reasoning — it reformats speech so LLM can already reason over it
NLU in Speech Models
Natural Language Understanding (NLU) models are designed to enable computers to understand the meaning and intent behind human language, both written and spoken.
Role in Speech Models
- Critical for voice agents
- Enable accurate task understanding
- Support context-aware responses
- Improve dialogue quality
Integration
- Often combined with speech tokenizers
- Upstream of response generation
- Helps models understand user intent
- Enables more intelligent routing and handling
Advances in Speech Language Models (Research Insights)
Speech Tokenization
The choice of tokenizer significantly impacts downstream performance:
- Semantic tokenizers: Better for ASR, understanding
- Acoustic tokenizers: Better for synthesis, voice quality
- Mixed tokenizers: Balance between both
End-to-End Learning
Models benefit from joint training of:
- Speech perception
- Language reasoning
- Speech generation
Separate training results in suboptimal performance.
Context and Coherence
Long-form speech generation requires:
- Proper context management
- Consistent speaker identity
- Coherent semantic flow
- Emotional consistency