Moshi - Full-Duplex Voice Interaction
Moshi is a state-of-the-art speech-to-speech model that enables natural, real-time conversations with AI.
Architecture Overview
[Your Voice]
↓
[Mimi Encoder]
↓
[Audio Tokens]
↓
[Helium LLM + RQ-Transformer]
↓
[Response Tokens + Inner Monologue]
↓
[Mimi Decoder]
↓
[Moshi's Voice]
Component 1: Mimi Encoder (Audio Compression)
Mimi is a neural audio codec: think of an MP3 encoder, but instead of compressing audio for storage, it converts voice into discrete tokens that a language model can consume.
How It Works
- Audio waveform input: Raw voice sampled 24,000 times per second (24 kHz)
- Feature extraction: Finds patterns in the waveform (pitch, tone, timbre)
- Quantization: Patterns are turned into discrete tokens using Residual Vector Quantization (RVQ)
- Each token represents a piece of your voice
- Efficient compression while preserving quality
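The quantization step above can be sketched as a toy Residual Vector Quantization pass. Each stage quantizes the residual left over by the previous stage, so later tokens refine earlier ones. The codebooks here are random stand-ins and the sizes are illustrative, not Mimi's real ones.

```python
import numpy as np

def rvq_encode(feature, codebooks):
    """Quantize one feature vector into one token id per codebook stage."""
    residual = feature.astype(float)
    tokens = []
    for cb in codebooks:                          # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest code vector
        tokens.append(idx)
        residual = residual - cb[idx]             # pass the residual on
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the feature by summing the chosen code vectors."""
    return sum(cb[idx] for idx, cb in zip(tokens, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(8)]  # 8 quantizer stages
feature = rng.normal(size=4)
tokens = rvq_encode(feature, codebooks)           # 8 discrete tokens
approx = rvq_decode(tokens, codebooks)            # approximate reconstruction
```

Each audio frame thus becomes a small stack of token ids, and decoding is just a sum of codebook lookups, which is what makes the representation cheap to transmit and generate.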
Component 2: Helium (Language Model)
Helium is a 7-billion-parameter language model, broadly similar in design to GPT-style LLMs.
Key Characteristics
- Doesn’t “hear” voice directly
- Reads tokens from Mimi encoder
- Understands the meaning behind the audio
- Processes tokenized speech (not text!)
- Infers what the user is saying
- Generates responses first as text tokens (like words)
- Maintains context across turns
Why Token-Based?
Using audio tokens instead of transcribed text allows the model to:
- Work with audio-specific features
- Preserve prosody and emotion
- Handle overlapping speech
- Enable more natural interaction
Component 3: RQ-Transformer (Dual-Stream Processing)
This is a custom neural network that handles simultaneous listening and speaking — like having two conversations at once.
Multi-Stream Processing
The RQ-Transformer manages two token streams:
- 👂 User stream:
  - Your voice → Mimi tokens → Helium processing
  - Continuous listening
- 🗣️ Response stream:
  - Helium response → Speech tokens → Moshi's voice
  - Continuous generation
Key Features
- Multi-stream transformer: Processes input and generates output simultaneously
- Full-duplex interaction: No turn-taking needed
- Neural finite-state machine: Controls conversation flow
- When to speak
- When to listen
- When to interrupt
- How to handle overlaps
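A minimal sketch of the multi-stream idea: at each time step (frame), the transformer's input sequence carries both the user's incoming tokens and Moshi's outgoing tokens, so generation can condition on what the user is saying right now. The token values here are dummy integers; the real model interleaves several RVQ codebook levels per frame.

```python
def interleave_frames(user_stream, moshi_stream):
    """Merge per-frame token groups from both speakers into one
    time-ordered sequence the transformer can attend over."""
    seq = []
    for user_toks, moshi_toks in zip(user_stream, moshi_stream):
        seq.append(("user", user_toks))    # what was just heard
        seq.append(("moshi", moshi_toks))  # what is being said
    return seq

frames = interleave_frames([[1, 2], [3, 4]], [[9], [8]])
# frames == [("user", [1, 2]), ("moshi", [9]),
#            ("user", [3, 4]), ("moshi", [8])]
```

Because both streams advance frame by frame, "interrupting" is nothing special: the user's tokens simply change while Moshi is mid-generation, and the model can react on the next frame.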
Inner Monologue Concept
Instead of going straight from thought to voice, Moshi first generates a response as text tokens, then converts to speech tokens.
Why two stages?
- Improves coherence and grammar
- Improves factual consistency of responses
- Makes system easier to control and debug
- Allows voice adaptation
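The two stages can be sketched as follows. `lm_step` and `tts_step` are hypothetical stand-ins for Helium and the speech-token head; the point is only the ordering: per frame, the text token is predicted first, and the speech tokens are conditioned on it.

```python
def generate_frame(context, lm_step, tts_step):
    """One frame of inner-monologue generation."""
    text_tok = lm_step(context)         # stage 1: decide WHAT to say
    speech_toks = tts_step(text_tok)    # stage 2: decide HOW to say it
    return text_tok, speech_toks

# Stub components for illustration:
lm = lambda ctx: "hi"
tts = lambda tok: [ord(c) for c in tok]
print(generate_frame([], lm, tts))      # ('hi', [104, 105])
```

Debuggability falls out of this structure: the intermediate text tokens are human-readable, so you can inspect what the model intended to say independently of how it sounded.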
Component 4: Mimi Decoder (Audio Synthesis)
Purpose: Convert response tokens back into natural-sounding speech
Process
- Text response from Helium → Audio tokens (TTS component)
- Mimi’s decoder reconstructs natural-sounding voice
- Real-time generation using tokens
- Output: Natural speech in Moshi’s voice
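The streaming property of the decoder can be sketched like this: each speech token becomes one short audio chunk that can be played immediately, rather than waiting for the full utterance. A sine burst stands in for real neural synthesis, and the token→pitch mapping is purely hypothetical.

```python
import math

SAMPLE_RATE = 24_000
CHUNK_SAMPLES = 1_920                   # 80 ms of audio per decoded frame

def decode_frame(token):
    """Map one token id to an 80 ms sine chunk (placeholder DSP)."""
    freq = 100 + token                  # hypothetical token->pitch mapping
    return [math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
            for n in range(CHUNK_SAMPLES)]

def stream_decode(tokens):
    for t in tokens:                    # emit audio incrementally
        yield decode_frame(t)

chunks = list(stream_decode([5, 17, 42]))
```

Because output is produced chunk by chunk, playback can begin as soon as the first frame of tokens is decoded.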
Full Example: Moshi in Action
Scenario
User asks: “What’s the weather like?”
Step-by-Step
1. USER SPEAKS
Audio: "What's the weather like?"
↓
2. MIMI ENCODING
Waveform → [24,000 Hz samples] → RVQ tokens
Example: [t1, t2, t3, ..., t_n]
↓
3. HELIUM PROCESSING (while still listening)
Input tokens → LLM inference
Output: "I don't have real-time weather data, but..."
↓
4. INNER MONOLOGUE
Text response → Text tokens
Tokens: [w1, w2, w3, ..., w_m]
↓
5. SPEECH TOKEN GENERATION
Text tokens → Speech tokens (TTS layer)
Speech tokens: [s1, s2, s3, ..., s_k]  (typically more speech tokens than text tokens)
↓
6. MIMI DECODING
Speech tokens → 24 kHz waveform
↓
7. SPEAKER HEARS
Audio output: "I don't have real-time weather data..."
(While potentially already responding to next input)
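The seven steps above can be sketched as a toy pipeline. Every function here is a stub, not Moshi's real API, and plain text stands in for waveforms and token streams; the point is the shape of the dataflow, not the components.

```python
def mimi_encode(audio):        # steps 1-2: pretend chars are RVQ tokens
    return [ord(c) for c in audio]

def helium(tokens):            # step 3: canned LLM response (stub)
    return "I don't have real-time weather data"

def to_speech_tokens(text):    # steps 4-5: text tokens -> speech tokens
    return [ord(c) % 256 for c in text]

def mimi_decode(tokens):       # step 6: back to "audio" (here: text)
    return "".join(chr(t) for t in tokens)

audio_in = "What's the weather like?"
reply = helium(mimi_encode(audio_in))          # inner monologue (text)
audio_out = mimi_decode(to_speech_tokens(reply))  # step 7: what is heard
```

In the real system these stages run concurrently on streams rather than as sequential function calls, which is exactly the difference the next section illustrates.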
Comparison: Traditional vs. Moshi
Traditional Pipeline (Sequential)
Speech → STT → Text → LLM → Response → TTS → Speech
↓
Each step waits for the previous one to complete
High latency, feels like turns
Moshi Pipeline (Parallel)
Speech → Mimi → Tokens ─┐
                        ├→ Helium + RQ → Output Tokens → Mimi → Speech
  (input keeps flowing) ┘
Listening and responding happen simultaneously
Lower latency, natural conversation flow
Other Speech-to-Speech Models
LLaMA-Omni
- Omni-modal model based on LLaMA
- Speech and text input/output
- Direct integration with language understanding
CosyVoice
- Focus on natural, expressive speech
- Multiple speaker support
- Emotion control capabilities
SpeechGPT-Gen
- GPT-based approach to speech generation
- Instruction-following capabilities
- Flexible voice control
Key Concepts
Real-Time vs. Offline
| Aspect | Real-Time (Moshi) | Offline |
|---|---|---|
| Latency | Sub-second | Multiple seconds |
| Streaming | Input/output streaming | Batch processing |
| Quality | Good (optimized for speed) | Excellent (more compute) |
| Use Case | Conversations, Call Centers | Batch audio processing |
Full-Duplex vs. Half-Duplex
- Full-Duplex (Moshi): Can listen and speak simultaneously
- Half-Duplex: Must alternate between listening and speaking
- Like walkie-talkies
- Simpler to implement
- Requires turn-taking
Latency Considerations
For natural conversation, latency should be:
- < 200 ms to handle interruptions
- < 500 ms for a natural feel
- < 1000 ms to remain acceptable
Moshi achieves sub-200ms latency through:
- Streaming input/output
- Parallel processing (listening while speaking)
- Efficient token generation
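A back-of-envelope latency budget makes these numbers concrete. The 80 ms frame comes from Mimi's published 12.5 Hz frame rate (1,920 samples at 24 kHz); the per-frame compute cost below is an assumed figure for illustration, not a measurement.

```python
SAMPLE_RATE = 24_000
FRAME_SAMPLES = 1_920                         # one codec frame

frame_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000  # 80.0 ms per frame
compute_ms = 40                               # assumed per-frame inference cost
# one frame buffered on input + one on output + compute:
total_ms = 2 * frame_ms + compute_ms
print(frame_ms, total_ms)                     # 80.0 200.0
```

Under these assumptions the budget lands at roughly 200 ms, which shows why frame size dominates: halving compute saves 20 ms, while halving the frame would save 80 ms.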
Streaming Architecture Details
Input Streaming
- Audio continuously fed to Mimi encoder
- Encoder produces tokens in real-time (low buffering)
- Tokens fed to Helium as they arrive
- Model makes predictions incrementally
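The input-streaming loop above amounts to a generator: each audio chunk is encoded and its tokens handed onward immediately, so the model never waits for the utterance to finish. `encode_chunk` is a stand-in for the Mimi encoder.

```python
def stream_tokens(audio_chunks, encode_chunk):
    """Yield tokens as soon as each incoming chunk is encoded."""
    for chunk in audio_chunks:
        yield from encode_chunk(chunk)   # tokens flow out per chunk

# Stub encoder: one "token" per sample value (illustrative only).
toy_encoder = lambda chunk: [v % 2048 for v in chunk]
tokens = list(stream_tokens([[1, 2], [3]], toy_encoder))
# tokens == [1, 2, 3]
```

Downstream consumers iterate over this generator, so a token produced from the first chunk can reach Helium while later chunks are still being recorded.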
Output Streaming
- Helium generates response tokens incrementally
- RQ-Transformer routes to speech synthesis
- Mimi decoder converts tokens to audio chunks
- Audio streamed to user in real-time
Synchronization
- RQ-Transformer manages timing
- Ensures output rate matches input rate
- Prevents audio underflow/overflow
- Handles interruptions naturally
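The underflow/overflow concern can be illustrated with a toy playback buffer: decoded frames are pushed as they are produced and pulled at the fixed playback rate. This is a generic jitter-buffer sketch, not Moshi's actual synchronization mechanism.

```python
from collections import deque

class JitterBuffer:
    def __init__(self, max_frames=8):
        self.q = deque()
        self.max_frames = max_frames
        self.underflows = 0              # pulls with nothing to play
        self.overflows = 0               # pushes beyond the buffer cap

    def push(self, frame):
        if len(self.q) >= self.max_frames:
            self.overflows += 1          # generation outran playback
            return
        self.q.append(frame)

    def pull(self):
        if self.q:
            return self.q.popleft()
        self.underflows += 1             # playback outran generation
        return None                      # caller emits silence

buf = JitterBuffer(max_frames=2)
buf.push("f1"); buf.push("f2"); buf.push("f3")   # third push overflows
first = buf.pull()                                # returns "f1"
```

Matching the token-generation rate to the 24 kHz playback rate keeps both counters at zero; any sustained mismatch shows up as audible gaps (underflow) or dropped audio (overflow).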