Moshi - Full-Duplex Voice Interaction
Moshi is a state-of-the-art speech-to-speech model that enables natural, real-time conversations with AI.
Architecture Overview
[Your Voice]
↓
[Mimi Encoder]
↓
[Audio Tokens]
↓
[Helium LLM + RQ-Transformer]
↓
[Response Tokens + Inner Monologue]
↓
[Mimi Decoder]
↓
[Moshi's Voice]
Component 1: Mimi Encoder (Audio Compression)
Mimi is a neural audio codec: think of an MP3 encoder, but instead of compressing audio for storage, it converts voice into discrete tokens that a language model can consume.
How It Works
- Audio waveform input: Raw voice sampled 24,000 times per second (24 kHz)
- Feature extraction: Finds patterns in the waveform (pitch, tone, timbre)
- Quantization: Patterns are turned into discrete tokens using Residual Vector Quantization (RVQ)
- Each token represents a piece of your voice
- Efficient compression while preserving quality
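The quantization step above can be sketched as a toy Residual Vector Quantization pass. Each stage quantizes the residual left over by the previous stage, so later tokens refine earlier ones. The codebooks here are random stand-ins and the sizes are illustrative, not Mimi's real ones.

```python
import numpy as np

def rvq_encode(feature, codebooks):
    """Quantize one feature vector into one token id per codebook stage."""
    residual = feature.astype(float)
    tokens = []
    for cb in codebooks:                          # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest code vector
        tokens.append(idx)
        residual = residual - cb[idx]             # pass the residual on
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the feature by summing the chosen code vectors."""
    return sum(cb[idx] for idx, cb in zip(tokens, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(8)]  # 8 quantizer stages
feature = rng.normal(size=4)
tokens = rvq_encode(feature, codebooks)           # 8 discrete tokens
approx = rvq_decode(tokens, codebooks)            # approximate reconstruction
```

Each audio frame thus becomes a small stack of token ids, and decoding is just a sum of codebook lookups, which is what makes the representation cheap to transmit and generate.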
Component 2: Helium (Language Model)
Helium is a 7-billion-parameter language model, broadly similar in design to GPT-style LLMs.
Key Characteristics
- Doesn’t “hear” voice directly
- Reads tokens from Mimi encoder
- Understands the meaning behind the audio
- Processes tokenized speech (not text!)
- Infers what the user is saying
- Generates responses first as text tokens (like words)
- Maintains context across turns
Why Token-Based?
Using audio tokens instead of transcribed text allows the model to:
- Work with audio-specific features
- Preserve prosody and emotion
- Handle overlapping speech
- Enable more natural interaction
Component 3: RQ-Transformer (Dual-Stream Processing)
This is a custom neural network that handles simultaneous listening and speaking — like having two conversations at once.
Multi-Stream Processing
The RQ-Transformer manages two token streams:
- 👂 User stream:
  - Your voice → Mimi tokens → Helium processing
  - Continuous listening
- 🗣️ Response stream:
  - Helium response → Speech tokens → Moshi's voice
  - Continuous generation
Key Features
- Multi-stream transformer: Processes input and generates output simultaneously
- Full-duplex interaction: No turn-taking needed
- Neural finite-state machine: Controls conversation flow
- When to speak
- When to listen
- When to interrupt
- How to handle overlaps
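A minimal sketch of the multi-stream idea: at each time step (frame), the transformer's input sequence carries both the user's incoming tokens and Moshi's outgoing tokens, so generation can condition on what the user is saying right now. The token values here are dummy integers; the real model interleaves several RVQ codebook levels per frame.

```python
def interleave_frames(user_stream, moshi_stream):
    """Merge per-frame token groups from both speakers into one
    time-ordered sequence the transformer can attend over."""
    seq = []
    for user_toks, moshi_toks in zip(user_stream, moshi_stream):
        seq.append(("user", user_toks))    # what was just heard
        seq.append(("moshi", moshi_toks))  # what is being said
    return seq

frames = interleave_frames([[1, 2], [3, 4]], [[9], [8]])
# frames == [("user", [1, 2]), ("moshi", [9]),
#            ("user", [3, 4]), ("moshi", [8])]
```

Because both streams advance frame by frame, "interrupting" is nothing special: the user's tokens simply change while Moshi is mid-generation, and the model can react on the next frame.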
Inner Monologue Concept
Instead of going straight from thought to voice, Moshi first generates a response as text tokens, then converts to speech tokens.
Why two stages?
- Improves coherence and grammar
- Improves factual consistency of responses
- Makes system easier to control and debug
- Allows voice adaptation
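The two stages can be sketched as follows. `lm_step` and `tts_step` are hypothetical stand-ins for Helium and the speech-token head; the point is only the ordering: per frame, the text token is predicted first, and the speech tokens are conditioned on it.

```python
def generate_frame(context, lm_step, tts_step):
    """One frame of inner-monologue generation."""
    text_tok = lm_step(context)         # stage 1: decide WHAT to say
    speech_toks = tts_step(text_tok)    # stage 2: decide HOW to say it
    return text_tok, speech_toks

# Stub components for illustration:
lm = lambda ctx: "hi"
tts = lambda tok: [ord(c) for c in tok]
print(generate_frame([], lm, tts))      # ('hi', [104, 105])
```

Debuggability falls out of this structure: the intermediate text tokens are human-readable, so you can inspect what the model intended to say independently of how it sounded.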
Component 4: Mimi Decoder (Audio Synthesis)
Purpose: Convert response tokens back into natural-sounding speech
Process
- Text response from Helium → Audio tokens (TTS component)
- Mimi’s decoder reconstructs natural-sounding voice
- Real-time generation using tokens
- Output: Natural speech in Moshi’s voice
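The streaming property of the decoder can be sketched like this: each speech token becomes one short audio chunk that can be played immediately, rather than waiting for the full utterance. A sine burst stands in for real neural synthesis, and the token→pitch mapping is purely hypothetical.

```python
import math

SAMPLE_RATE = 24_000
CHUNK_SAMPLES = 1_920                   # 80 ms of audio per decoded frame

def decode_frame(token):
    """Map one token id to an 80 ms sine chunk (placeholder DSP)."""
    freq = 100 + token                  # hypothetical token->pitch mapping
    return [math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
            for n in range(CHUNK_SAMPLES)]

def stream_decode(tokens):
    for t in tokens:                    # emit audio incrementally
        yield decode_frame(t)

chunks = list(stream_decode([5, 17, 42]))
```

Because output is produced chunk by chunk, playback can begin as soon as the first frame of tokens is decoded.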
Full Example: Moshi in Action
Scenario
User asks: “What’s the weather like?”
Step-by-Step
1. USER SPEAKS
Audio: "What's the weather like?"
↓
2. MIMI ENCODING
Waveform → [24,000 Hz samples] → RVQ tokens
Example: [t1, t2, t3, ..., t_n]
↓
3. HELIUM PROCESSING (while still listening)
Input tokens → LLM inference
Output: "I don't have real-time weather data, but..."
↓
4. INNER MONOLOGUE
Text response → Text tokens
Tokens: [w1, w2, w3, ..., w_m]
↓
5. SPEECH TOKEN GENERATION
Text tokens → Speech tokens (TTS layer)
Speech tokens: [s1, s2, s3, ..., s_k]  (typically more speech tokens than text tokens)
↓
6. MIMI DECODING
Speech tokens → 24 kHz waveform
↓
7. SPEAKER HEARS
Audio output: "I don't have real-time weather data..."
(While potentially already responding to next input)
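The seven steps above can be sketched as a toy pipeline. Every function here is a stub, not Moshi's real API, and plain text stands in for waveforms and token streams; the point is the shape of the dataflow, not the components.

```python
def mimi_encode(audio):        # steps 1-2: pretend chars are RVQ tokens
    return [ord(c) for c in audio]

def helium(tokens):            # step 3: canned LLM response (stub)
    return "I don't have real-time weather data"

def to_speech_tokens(text):    # steps 4-5: text tokens -> speech tokens
    return [ord(c) % 256 for c in text]

def mimi_decode(tokens):       # step 6: back to "audio" (here: text)
    return "".join(chr(t) for t in tokens)

audio_in = "What's the weather like?"
reply = helium(mimi_encode(audio_in))          # inner monologue (text)
audio_out = mimi_decode(to_speech_tokens(reply))  # step 7: what is heard
```

In the real system these stages run concurrently on streams rather than as sequential function calls, which is exactly the difference the next section illustrates.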
Comparison: Traditional vs. Moshi
Traditional Pipeline (Sequential)
Speech → STT → Text → LLM → Response → TTS → Speech
↓
Each step waits for the previous one to complete
High latency, feels like turns
Moshi Pipeline (Parallel)
Speech → Mimi → Tokens ─┐
                        ├→ Helium + RQ → Output Tokens → Mimi → Speech
  (input keeps flowing) ┘
Listening and responding happen simultaneously
Lower latency, natural conversation flow
Other Speech-to-Speech Models
LLaMA-Omni
- Omni-modal model based on LLaMA
- Speech and text input/output
- Direct integration with language understanding
CosyVoice
- Focus on natural, expressive speech
- Multiple speaker support
- Emotion control capabilities
SpeechGPT-Gen
- GPT-based approach to speech generation
- Instruction-following capabilities
- Flexible voice control
Key Concepts
Real-Time vs. Offline
| Aspect | Real-Time (Moshi) | Offline |
|---|---|---|
| Latency | Sub-second | Multiple seconds |
| Streaming | Input/output streaming | Batch processing |
| Quality | Good (optimized for speed) | Excellent (more compute) |
| Use Case | Conversations, Call Centers | Batch audio processing |
Full-Duplex vs. Half-Duplex
- Full-Duplex (Moshi): Can listen and speak simultaneously
- Half-Duplex: Must alternate between listening and speaking
- Like walkie-talkies
- Simpler to implement
- Requires turn-taking
Latency Considerations
For natural conversation, latency should be:
- < 200 ms to handle interruptions
- < 500 ms for a natural feel
- < 1000 ms to remain acceptable
Moshi achieves sub-200ms latency through:
- Streaming input/output
- Parallel processing (listening while speaking)
- Efficient token generation
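A back-of-envelope latency budget makes these numbers concrete. The 80 ms frame comes from Mimi's published 12.5 Hz frame rate (1,920 samples at 24 kHz); the per-frame compute cost below is an assumed figure for illustration, not a measurement.

```python
SAMPLE_RATE = 24_000
FRAME_SAMPLES = 1_920                         # one codec frame

frame_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000  # 80.0 ms per frame
compute_ms = 40                               # assumed per-frame inference cost
# one frame buffered on input + one on output + compute:
total_ms = 2 * frame_ms + compute_ms
print(frame_ms, total_ms)                     # 80.0 200.0
```

Under these assumptions the budget lands at roughly 200 ms, which shows why frame size dominates: halving compute saves 20 ms, while halving the frame would save 80 ms.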
Streaming Architecture Details
Input Streaming
- Audio continuously fed to Mimi encoder
- Encoder produces tokens in real-time (low buffering)
- Tokens fed to Helium as they arrive
- Model makes predictions incrementally
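The input-streaming loop above amounts to a generator: each audio chunk is encoded and its tokens handed onward immediately, so the model never waits for the utterance to finish. `encode_chunk` is a stand-in for the Mimi encoder.

```python
def stream_tokens(audio_chunks, encode_chunk):
    """Yield tokens as soon as each incoming chunk is encoded."""
    for chunk in audio_chunks:
        yield from encode_chunk(chunk)   # tokens flow out per chunk

# Stub encoder: one "token" per sample value (illustrative only).
toy_encoder = lambda chunk: [v % 2048 for v in chunk]
tokens = list(stream_tokens([[1, 2], [3]], toy_encoder))
# tokens == [1, 2, 3]
```

Downstream consumers iterate over this generator, so a token produced from the first chunk can reach Helium while later chunks are still being recorded.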
Output Streaming
- Helium generates response tokens incrementally
- RQ-Transformer routes to speech synthesis
- Mimi decoder converts tokens to audio chunks
- Audio streamed to user in real-time
Synchronization
- RQ-Transformer manages timing
- Ensures output rate matches input rate
- Prevents audio underflow/overflow
- Handles interruptions naturally
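The underflow/overflow concern can be illustrated with a toy playback buffer: decoded frames are pushed as they are produced and pulled at the fixed playback rate. This is a generic jitter-buffer sketch, not Moshi's actual synchronization mechanism.

```python
from collections import deque

class JitterBuffer:
    def __init__(self, max_frames=8):
        self.q = deque()
        self.max_frames = max_frames
        self.underflows = 0              # pulls with nothing to play
        self.overflows = 0               # pushes beyond the buffer cap

    def push(self, frame):
        if len(self.q) >= self.max_frames:
            self.overflows += 1          # generation outran playback
            return
        self.q.append(frame)

    def pull(self):
        if self.q:
            return self.q.popleft()
        self.underflows += 1             # playback outran generation
        return None                      # caller emits silence

buf = JitterBuffer(max_frames=2)
buf.push("f1"); buf.push("f2"); buf.push("f3")   # third push overflows
first = buf.pull()                                # returns "f1"
```

Matching the token-generation rate to the 24 kHz playback rate keeps both counters at zero; any sustained mismatch shows up as audible gaps (underflow) or dropped audio (overflow).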