Moshi - Full-Duplex Voice Interaction

Moshi is a state-of-the-art speech-to-speech model that enables natural, real-time conversations with AI.

Architecture Overview

[Your Voice]
    ↓
[Mimi Encoder]
    ↓
[Audio Tokens]
    ↓
[Helium LLM + RQ-Transformer]
    ↓
[Response Tokens + Inner Monologue]
    ↓
[Mimi Decoder]
    ↓
[Moshi's Voice]

Component 1: Mimi Encoder (Audio Compression)

Mimi is a neural audio codec: like an MP3 encoder it compresses audio, but instead of compressing for storage, it converts voice into discrete tokens that a language model can process.

How It Works

  1. Audio waveform input: Raw voice sampled 24,000 times per second (24 kHz)
  2. Feature extraction: Finds patterns in the waveform (pitch, tone, timbre)
  3. Quantization: Patterns are turned into discrete tokens using Residual Vector Quantization (RVQ)
    • Each token represents a piece of your voice
    • Efficient compression while preserving quality
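The quantization step can be sketched in a few lines of Python. This is a toy illustration of RVQ with made-up codebooks and 2-dimensional features — Mimi's real codebooks are learned and far larger:

```python
# Minimal sketch of Residual Vector Quantization (RVQ) in pure Python.
# Codebooks and vector sizes are toy assumptions, not Mimi's real parameters.

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec into one token per codebook; each stage encodes the residual."""
    tokens, residual = [], list(vec)
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected entries to reconstruct an approximation of the input."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Toy example: 2 codebooks of 3 entries each, 2-dimensional feature vectors.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],   # coarse stage
    [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]],   # refines the residual
]
tokens = rvq_encode([1.1, 0.2], codebooks)
print(tokens, rvq_decode(tokens, codebooks))
```

Each additional codebook refines the residual left by the previous one, which is why RVQ compresses well while preserving quality.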

Component 2: Helium (Language Model)

Helium is a 7-billion-parameter language model, comparable in design to GPT.

Key Characteristics

  • Doesn’t “hear” raw audio directly
  • Reads discrete tokens from the Mimi encoder
  • Understands the meaning behind the audio
  • Processes tokenized speech rather than transcribed text
  • Infers what the user is saying
  • Generates responses first as text tokens, then converts them to speech tokens
  • Maintains context across turns

Why Token-Based?

Using audio tokens instead of transcribed text allows the model to:

  • Work with audio-specific features
  • Preserve prosody and emotion
  • Handle overlapping speech
  • Enable more natural interaction

Component 3: RQ-Transformer (Dual-Stream Processing)

This is a custom neural network that handles simultaneous listening and speaking — like having two conversations at once.

Multi-Stream Processing

The RQ-Transformer manages two token streams:

  • 👂 User stream:

    • Your voice → Mimi tokens → Helium processing
    • Continuous listening
  • 🗣️ Response stream:

    • Helium response → Speech tokens → Moshi’s voice
    • Continuous generation

Key Features

  1. Multi-stream transformer: Processes input and generates output simultaneously
  2. Full-duplex interaction: No turn-taking needed
  3. Neural finite-state machine: Controls conversation flow
    • When to speak
    • When to listen
    • When to interrupt
    • How to handle overlaps
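The full-duplex loop can be pictured as one step per audio frame, where input is consumed and output is emitted in the same step. The `toy_model` function below is a stand-in assumption, not Moshi's actual network:

```python
# Hedged sketch of a full-duplex step loop: at every frame the model both
# consumes a user token and emits a response token. The "model" here is a
# placeholder, not Moshi's actual Helium + RQ-Transformer.

def toy_model(user_token, history):
    """Stand-in for the model: stays silent until context accumulates."""
    history.append(user_token)
    return f"resp({user_token})" if len(history) > 2 else "<silence>"

def duplex_loop(user_stream):
    history, response_stream = [], []
    for user_token in user_stream:          # listening never stops...
        out = toy_model(user_token, history)
        response_stream.append(out)         # ...and speaking happens in step
    return response_stream

print(duplex_loop(["u1", "u2", "u3", "u4"]))
# One output frame per input frame: there is no turn-taking boundary.
```

Because silence is just another token, "when to speak" becomes an ordinary prediction rather than an external turn-taking rule.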

Inner Monologue Concept

Instead of going straight from thought to voice, Moshi first generates a response as text tokens, then converts to speech tokens.

Why two stages?

  • Improves coherence and grammar
  • Improves the linguistic quality of the generated speech
  • Makes the system easier to control and debug
  • Allows voice adaptation
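One way to picture the Inner Monologue is a per-frame token layout in which the text token leads and the acoustic tokens follow after a short delay. The codebook depth and delay below are illustrative assumptions, not Moshi's exact configuration:

```python
# Sketch of an Inner-Monologue token layout: the text (semantic) token for a
# frame is generated before the matching acoustic RVQ tokens. Sizes and the
# delay are illustrative assumptions.

RVQ_LEVELS = 8   # assumed number of acoustic codebooks per frame
DELAY = 1        # assumed delay (in frames) between text and acoustic tokens

def frame_tokens(t):
    text = f"w{t}"                         # text token, decided first
    audio_t = t - DELAY                    # acoustic tokens lag behind
    audio = [f"s{audio_t}_{q}" for q in range(RVQ_LEVELS)] if audio_t >= 0 else []
    return text, audio

for t in range(3):
    text, audio = frame_tokens(t)
    print(t, text, audio[:2])
```

The delay gives the model a moment to "decide what to say" in text before committing to how it sounds.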

Component 4: Mimi Decoder (Audio Synthesis)

Purpose: Convert response tokens back into natural-sounding speech

Process

  1. Text response from Helium → Audio tokens (TTS component)
  2. Mimi’s decoder reconstructs natural-sounding voice
  3. Real-time generation using tokens
  4. Output: Natural speech in Moshi’s voice
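The streaming nature of the decoder falls out of fixed-size frames. The sketch below mocks the decoder itself, but uses Mimi's reported 24 kHz sample rate and 12.5 Hz token frame rate, which give 80 ms of audio per frame:

```python
# Sketch of streaming decode: each token frame becomes a fixed-size chunk of
# 24 kHz audio. Frame rate per Mimi's reported figures; the decoder is a mock.

SAMPLE_RATE = 24_000     # Hz
FRAME_RATE = 12.5        # token frames per second (reported for Mimi)
SAMPLES_PER_FRAME = int(SAMPLE_RATE / FRAME_RATE)   # 1920 samples = 80 ms

def mock_decode(frame_tokens):
    """Stand-in for the Mimi decoder: returns silence of the right length."""
    return [0.0] * SAMPLES_PER_FRAME

def stream_decode(token_frames):
    for frame in token_frames:        # audio can start playing as frames arrive
        yield mock_decode(frame)

chunks = list(stream_decode([["s0"], ["s1"], ["s2"]]))
print(len(chunks), len(chunks[0]), "samples per 80 ms chunk")
```

Because every frame maps to a fixed 80 ms chunk, playback can begin as soon as the first frame is decoded rather than waiting for the full utterance.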

Full Example: Moshi in Action

Scenario

User asks: “What’s the weather like?”

Step-by-Step

1. USER SPEAKS
   Audio: "What's the weather like?"
   ↓

2. MIMI ENCODING
   Waveform → [24,000 Hz samples] → RVQ tokens
   Example: [t1, t2, t3, ..., t_n]
   ↓

3. HELIUM PROCESSING (while still listening)
   Input tokens → LLM inference
   Output: "I don't have real-time weather data, but..."
   ↓

4. INNER MONOLOGUE
   Text response → Text tokens
   Tokens: [w1, w2, w3, ..., w_m]
   ↓

5. SPEECH TOKEN GENERATION
   Text tokens → Speech tokens (TTS layer)
   Speech tokens: [s1, s2, s3, ..., s_m]
   ↓

6. MIMI DECODING
   Speech tokens → 24 kHz waveform
   ↓

7. USER HEARS
   Audio output: "I don't have real-time weather data..."
   (While potentially already responding to next input)
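The seven steps above can be tied together as a data-flow sketch. Every function here is a mock placeholder; only the shape of the pipeline mirrors the walkthrough:

```python
# End-to-end sketch of the walkthrough with mock components. Each function
# is a placeholder assumption; only the data flow mirrors the text.

def mimi_encode(audio):      return [f"t{i}" for i, _ in enumerate(audio)]
def helium(tokens):          return "I don't have real-time weather data"
def text_to_text_tokens(s):  return s.split()
def tts_layer(text_tokens):  return [f"s{i}" for i, _ in enumerate(text_tokens)]
def mimi_decode(speech):     return f"<{len(speech)} audio frames>"

audio_in = ["chunk1", "chunk2", "chunk3"]        # 1. user speaks
audio_tokens = mimi_encode(audio_in)             # 2. Mimi encoding
reply_text = helium(audio_tokens)                # 3. Helium processing
text_tokens = text_to_text_tokens(reply_text)    # 4. inner monologue
speech_tokens = tts_layer(text_tokens)           # 5. speech token generation
audio_out = mimi_decode(speech_tokens)           # 6-7. Mimi decoding → speaker
print(audio_out)
```

In the real system these stages run concurrently on a per-frame basis rather than sequentially on whole utterances, which is what makes the interaction full-duplex.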

Comparison: Traditional vs. Moshi

Traditional Pipeline (Sequential)

Speech → STT → Text → LLM → Response → TTS → Speech
↓
Each step waits for the previous one to complete
High latency; the exchange feels turn-based

Moshi Pipeline (Parallel)

Speech → Mimi → Tokens ━━┓
                         ┣→ Helium + RQ ━→ Output Tokens ━→ Mimi ━→ Speech
(continuous listening) ━━┛
Listening and responding happen simultaneously
Lower latency, natural conversation flow

Other Speech-to-Speech Models

LLaMA-Omni

  • Omni-modal model based on LLaMA
  • Speech and text input/output
  • Direct integration with language understanding

CosyVoice

  • Focus on natural, expressive speech
  • Multiple speaker support
  • Emotion control capabilities

SpeechGPT-Gen

  • GPT-based approach to speech generation
  • Instruction-following capabilities
  • Flexible voice control

Key Concepts

Real-Time vs. Offline

Aspect    | Real-Time (Moshi)          | Offline
----------|----------------------------|--------------------------
Latency   | Sub-second                 | Multiple seconds
Streaming | Input/output streaming     | Batch processing
Quality   | Good (optimized for speed) | Excellent (more compute)
Use Case  | Conversations, call centers| Batch audio processing

Full-Duplex vs. Half-Duplex

  • Full-Duplex (Moshi): Can listen and speak simultaneously
  • Half-Duplex: Must alternate between listening and speaking
    • Like walkie-talkies
    • Simpler to implement
    • Requires turn-taking

Latency Considerations

For natural conversation, latency should be:

  • < 200 ms to handle interruptions smoothly
  • < 500 ms to feel natural
  • < 1000 ms to remain acceptable

Moshi achieves latency of roughly 200 ms through:

  • Streaming input/output
  • Parallel processing (listening while speaking)
  • Efficient token generation
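A back-of-the-envelope latency budget shows how frame-based streaming keeps the total low. The 12.5 Hz frame rate is Mimi's reported figure; the per-step compute costs are illustrative assumptions:

```python
# Rough latency-budget arithmetic for a streaming, frame-based pipeline.
# Frame rate per Mimi's reported figures; per-step costs are assumptions.

FRAME_MS = 1000 / 12.5            # 80 ms of audio per token frame

budget = {
    "frame buffering": FRAME_MS,  # must wait for one frame of input audio
    "encode + LLM step": 40,      # assumed compute cost per frame
    "decode + playback": 40,      # assumed synthesis/output cost
}
total = sum(budget.values())
print(f"theoretical response latency ≈ {total:.0f} ms")
```

The dominant term is the one-frame buffering delay, which is why a high token frame rate matters as much as fast model inference.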

Streaming Architecture Details

Input Streaming

  1. Audio continuously fed to Mimi encoder
  2. Encoder produces tokens in real-time (low buffering)
  3. Tokens fed to Helium as they arrive
  4. Model makes predictions incrementally

Output Streaming

  1. Helium generates response tokens incrementally
  2. RQ-Transformer routes to speech synthesis
  3. Mimi decoder converts tokens to audio chunks
  4. Audio streamed to user in real-time

Synchronization

  • RQ-Transformer manages timing
  • Ensures output rate matches input rate
  • Prevents audio underflow/overflow
  • Handles interruptions naturally
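The rate-matching idea can be sketched with a bounded queue between the producer (decoder) and the consumer (audio playback). This is purely illustrative; Moshi's actual scheduling is internal to the model, not hand-coded:

```python
# Sketch of rate matching with a bounded queue: the producer may not outrun
# the consumer by more than a small buffer. Entirely illustrative.

from collections import deque

MAX_BUFFERED = 2   # assumed max frames queued ahead of playback

def run(frames, consume_every=2):
    """Producer adds one frame per tick; consumer drains every N ticks."""
    queue, played, dropped = deque(), [], 0
    for tick, frame in enumerate(frames):
        if len(queue) >= MAX_BUFFERED:
            dropped += 1               # overflow: producer must stall or skip
        else:
            queue.append(frame)
        if queue and tick % consume_every == 0:
            played.append(queue.popleft())
    return played, dropped

played, dropped = run(["f0", "f1", "f2", "f3", "f4", "f5"])
print(played, dropped)
```

With the buffer capped, a slow consumer forces the producer to back off instead of letting audio pile up, which is what prevents overflow; the mirror case (an empty queue at playback time) is underflow.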