Evaluation Frameworks
aiewf-eval
Purpose: Framework for evaluating multi-turn LLM conversations with support for text, realtime audio, and speech-to-speech models.
Features
- Multi-turn conversation evaluation
- Text evaluation
- Realtime audio evaluation
- Speech-to-speech model support
- Customizable metrics
Usage
from aiewf_eval import Evaluator

evaluator = Evaluator()

# Evaluate a set of multi-turn conversations
results = evaluator.evaluate(
    conversations=test_conversations,
    metrics=['relevance', 'coherence', 'emotion']
)
Coval
Purpose: Testing framework for voice agents.
Features
- Voice agent testing
- Multi-turn evaluation
- Quality metrics
- Test scenarios
Use Cases
- Conversation quality assessment
- Multi-turn interaction testing
- Baseline comparison
Cekura
Roark
Alternative evaluation tools for voice agent testing.
Features
- Voice quality assessment
- Interaction testing
- Scenario-based evaluation
Evaluation Metrics
Speech Quality Metrics
PESQ (Perceptual Evaluation of Speech Quality)
- Range: -0.5 to 4.5
- Meaning:
- 4.5 = Perfect quality
- 4.0 = Excellent
- 3.5 = Good
- 3.0 = Fair
- < 3.0 = Poor
Use: Comparing audio quality across models/configurations
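As an illustration, the band thresholds above can be encoded in a small helper (the `pesq_band` function is hypothetical and only interprets a score; the score itself would come from a PESQ implementation such as the ITU-T P.862 reference tool):

```python
def pesq_band(score: float) -> str:
    """Map a raw PESQ score in [-0.5, 4.5] to the quality bands listed above.

    Hypothetical interpretation helper, not part of any PESQ library.
    """
    if not -0.5 <= score <= 4.5:
        raise ValueError("PESQ scores fall in [-0.5, 4.5]")
    if score == 4.5:
        return "perfect"
    if score >= 4.0:
        return "excellent"
    if score >= 3.5:
        return "good"
    if score >= 3.0:
        return "fair"
    return "poor"
```

A comparison across models then reduces to comparing bands (or raw scores) on the same reference/degraded audio pairs.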
MOS (Mean Opinion Score)
- Method: Have humans rate audio quality (1-5 scale)
- Average: Calculate mean across raters
- Scale:
- 5 = Excellent
- 4 = Good
- 3 = Fair
- 2 = Poor
- 1 = Bad
Use: Gold standard for audio quality, expensive but accurate
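A minimal MOS computation, assuming ratings are collected as a plain list of 1-5 integers:

```python
from statistics import mean

def mos(ratings: list[int]) -> float:
    """Mean Opinion Score: the average of 1-5 ratings across human raters."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings are on a 1-5 scale")
    return mean(ratings)

ratings = [4, 5, 4, 3, 4]
print(f"MOS: {mos(ratings):.2f}")  # MOS: 4.00
```

In practice MOS studies also report the number of raters and a confidence interval, since the mean alone hides rater disagreement.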
Speaker Similarity
- Metric: How closely generated speech matches target speaker
- Method: Cosine similarity of speaker embeddings
- Scale: 0-1 (1 = identical)
Use: Voice cloning/speaker matching evaluation
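The cosine-similarity step can be sketched in plain Python; producing the speaker embeddings themselves requires a speaker-encoder model, which is outside this sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker embeddings (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm
```

In a real pipeline both the generated and target utterances would be passed through the same speaker encoder, and the resulting embedding vectors compared with this function.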
Conversation Quality Metrics
Relevance
- Definition: How well a response addresses the user's input
- Scale: 0-1 or 1-5
- Evaluation Method: Manual or LLM-based
Coherence
- Definition: Logical flow and consistency of conversation
- Scale: 0-1 or 1-5
- Evaluation Method: Manual or automated
Completeness
- Definition: Whether a response fully answers the user's question
- Scale: 0-1 or 1-5
- Evaluation Method: Manual checklist
Natural Prosody
- Definition: How natural the speech sounds (tone, pacing, emphasis)
- Scale: 0-1 or 1-5
- Evaluation Method: Human raters
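Since these metrics are reported on either a 0-1 or a 1-5 scale, a small rescaling helper (illustrative, not from any framework above) keeps scores from different evaluation methods comparable:

```python
def to_unit_scale(score: float, lo: float = 1.0, hi: float = 5.0) -> float:
    """Linearly map a rating from [lo, hi] onto [0, 1]."""
    return (score - lo) / (hi - lo)

# A 4-out-of-5 relevance rating becomes 0.75 on the unit scale
assert to_unit_scale(4) == 0.75
```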
System Metrics
Latency
- P50: Median response time
- P95: 95th percentile response time
- P99: 99th percentile response time
Target:
- Interactive: < 500ms
- Real-time: < 200ms
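The P50/P95/P99 figures can be computed with a nearest-rank sketch over recorded latencies (one common convention; NumPy's `percentile` interpolates by default, so its results may differ slightly):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(p% of n) in sorted order."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [120, 180, 150, 480, 210, 160, 140, 900, 170, 130]
p50 = percentile(latencies_ms, 50)  # 160
p95 = percentile(latencies_ms, 95)  # 900
```

Note how a single slow outlier (900 ms) dominates the tail percentiles while leaving the median untouched, which is why P95/P99 matter for interactive targets.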
Throughput
- Definition: Requests processed per second
- Unit: RPS (requests per second)
- Target: Depends on load requirements
Error Rate
- Definition: % of failed requests
- Target: < 1% for production
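The error-rate definition above reduces to a one-liner; a guard for the zero-request case is worth keeping (illustrative helper, not from a specific framework):

```python
def error_rate(failed: int, total: int) -> float:
    """Failed requests as a percentage of all requests."""
    if total == 0:
        return 0.0
    return 100.0 * failed / total

# 7 failures out of 1,000 requests = 0.7%, within the < 1% production target
assert error_rate(7, 1000) == 0.7
```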