Evaluation Frameworks
aiewf-eval
Purpose: Framework for evaluating multi-turn LLM conversations with support for text, realtime audio, and speech-to-speech models.
Features
- Multi-turn conversation evaluation
- Text evaluation
- Realtime audio evaluation
- Speech-to-speech model support
- Customizable metrics
Usage
```python
from aiewf_eval import Evaluator

evaluator = Evaluator()

# Evaluate a set of multi-turn conversations against selected metrics
results = evaluator.evaluate(
    conversations=test_conversations,
    metrics=['relevance', 'coherence', 'emotion'],
)
```
GitHub
Coval
Purpose: Testing framework for voice agents.
Features
- Voice agent testing
- Multi-turn evaluation
- Quality metrics
- Test scenarios
Use Cases
- Conversation quality assessment
- Multi-turn interaction testing
- Baseline comparison
Website
Cekura and Roark
Alternative evaluation tools for voice agent testing.
Features
- Voice quality assessment
- Interaction testing
- Scenario-based evaluation
Website
Evaluation Metrics
Speech Quality Metrics
PESQ (Perceptual Evaluation of Speech Quality)
- Range: -0.5 to 4.5
- Meaning:
- 4.5 = Perfect quality
- 4.0 = Excellent
- 3.5 = Good
- 3.0 = Fair
- < 3.0 = Poor
Use: Comparing audio quality across models/configurations
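The PESQ scale above can be made concrete with a small helper. This is an illustrative mapping of a score into the quality bands listed, not part of any PESQ library (actual scores are computed against a reference signal with a PESQ implementation):

```python
def pesq_quality_band(score: float) -> str:
    """Map a PESQ score (-0.5 to 4.5) to the quality bands listed above."""
    if not -0.5 <= score <= 4.5:
        raise ValueError("PESQ scores fall in the range -0.5 to 4.5")
    if score >= 4.5:
        return "perfect"
    if score >= 4.0:
        return "excellent"
    if score >= 3.5:
        return "good"
    if score >= 3.0:
        return "fair"
    return "poor"
```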
MOS (Mean Opinion Score)
- Method: Have humans rate audio quality (1-5 scale)
- Average: Calculate mean across raters
- Scale:
- 5 = Excellent
- 4 = Good
- 3 = Fair
- 2 = Poor
- 1 = Bad
Use: Gold standard for audio quality, expensive but accurate
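As a sketch of the aggregation step, MOS is just the mean of per-rater scores; a rough confidence half-width (an assumption here, using a normal approximation rather than the fuller statistics a formal MOS study would use) helps judge whether you have enough raters:

```python
from math import sqrt
from statistics import mean, stdev

def mos(ratings):
    """Return the Mean Opinion Score and a rough 95% confidence half-width.

    `ratings` are per-rater scores on the 1-5 scale described above.
    """
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must be on the 1-5 scale")
    m = mean(ratings)
    # Normal-approximation half-width; a quick sanity check only.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, half_width
```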
Speaker Similarity
- Metric: How closely generated speech matches target speaker
- Method: Cosine similarity of speaker embeddings
- Scale: 0-1 (1 = identical)
Use: Voice cloning/speaker matching evaluation
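Given two speaker embeddings (from whatever speaker encoder your pipeline uses), the similarity computation itself is a plain cosine. Note that cosine similarity is mathematically in [-1, 1]; speaker-embedding similarities in practice land in roughly the 0-1 range cited above:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two speaker-embedding vectors (1 = identical direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)
```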
Insights
What is a good resolution rate for voice AI?
Leading enterprise voice AI deployments achieve 75-85% resolution rates, meaning roughly three out of four customer calls are fully resolved by the AI agent without human intervention. Resolution rates below 60% typically indicate significant room for improvement in the voice AI implementation.
How do you measure voice AI performance?
Modern voice AI performance measurement focuses on five key metrics: resolution rate, average handle time reduction, human agent productivity gains, post-escalation outcomes, and end-to-end customer journey success. These metrics require voice observability infrastructure to track systematically.
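As a minimal sketch of the first metric, resolution rate can be computed directly from call logs. The `Call` record and its field names are hypothetical; substitute whatever schema your observability stack emits:

```python
from dataclasses import dataclass

@dataclass
class Call:
    resolved_by_ai: bool   # AI completed the customer's task
    escalated: bool        # handed off to a human agent
    handle_time_s: float   # total call duration in seconds

def resolution_rate(calls):
    """Share of calls fully resolved by the AI without human intervention."""
    resolved = sum(1 for c in calls if c.resolved_by_ai and not c.escalated)
    return resolved / len(calls)
```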
What metrics matter for enterprise voice AI?
The metrics that matter most to enterprise buyers in 2026 are cost per resolution, human agent hours saved, and customer satisfaction maintained or improved. These business-outcome metrics have replaced audio quality scores and demo impressions as the primary evaluation criteria.
- Demo performance ≠ production performance. Controlled environments tell you almost nothing about real-world success. Demand production metrics from any voice AI platform you evaluate.
- Resolution rate is the new north star. If your AI voice agent isn't resolving 75%+ of interactions, you have work to do, regardless of how human it sounds.
- Customers accept voice AI agents when they work. Bot drop-off rates are declining industry-wide. The barrier isn't user acceptance; it's execution quality.
- Measure the complete journey. Cost per resolution, human hours saved, and customer satisfaction maintained are the metrics that matter to executives evaluating conversational AI platforms.
- Build AI agent evaluation infrastructure early. The gap between leaders and laggards is systematic testing and voice observability, not better technology.
Context-Aware Voice AI: The DoorDash Model
Think about the best support experiences you’ve had in apps. They don’t start with “How can I help you today?” They start with a prediction.
DoorDash and Uber example: When you open support in the DoorDash or Uber app, what’s the first thing you see? It’s not a generic menu. It’s a list of your most recent orders or rides—because if you ordered food 20 minutes ago, there’s a 90% chance that’s why you’re contacting support. So it leads with that context.
The voice AI equivalent:
Instead of:
“Hello, thank you for calling support. My name is Alex, and I’m a virtual assistant. How can I help you today?”
Try:
“Hi Melissa, this is your virtual assistant. Are you calling about your order from Chipotle that’s arriving in 10 minutes?”
Strategy 1: Use Recency Signals
What you know: Customer’s last transaction, order, appointment, or interaction
How to use it: Lead with that context
| Scenario | Generic Opening | Business Logic Opening |
| --- | --- | --- |
| E-commerce | "How can I help you?" | "Hi Sarah, are you calling about your order from yesterday that's out for delivery?" |
| Healthcare | "How can I direct your call?" | "Hi James, I see you have an appointment with Dr. Chen tomorrow at 2pm. Are you calling about that?" |
| Banking | "How can I assist you?" | "Hi Michael, I noticed a transaction at Target for $247 this morning. Are you calling about that?" |
| Telecom | "How can I help?" | "Hi Lisa, I see your bill is due in 3 days. Would you like to make a payment or discuss your charges?" |
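The recency-signal pattern above can be sketched as a greeting builder. The `last_order` record shape (`merchant`, `eta`) is hypothetical, standing in for whatever your order system returns:

```python
from datetime import datetime, timedelta

def context_aware_opening(name, last_order, now):
    """Build an opening line from the caller's most recent order, if still active.

    `last_order` is a hypothetical record like
    {"merchant": "Chipotle", "eta": datetime(...)}; None means no recent order.
    """
    if last_order and last_order["eta"] > now:
        minutes = int((last_order["eta"] - now).total_seconds() // 60)
        return (f"Hi {name}, this is your virtual assistant. Are you calling about "
                f"your order from {last_order['merchant']} that's arriving in "
                f"{minutes} minutes?")
    # Fall back to a generic opening when no recency signal applies
    return f"Hi {name}, how can I help you today?"
```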
Strategy 2: Use Behavioral Patterns
What you know: Why customers typically call at certain times or after certain events
How to use it: Predict based on patterns, not just individual data
Examples:
- Customer calls within 1 hour of placing an order → likely asking about order status or wanting to modify
- Customer calls the day after delivery → likely has an issue with the order
- Customer calls on the 15th of the month → likely asking about billing
- Customer calls after failed login attempts → likely locked out of account
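The four patterns above amount to a small rule table. A sketch, with illustrative (untuned) thresholds and made-up intent labels, ordered so the strongest signal wins:

```python
def predict_likely_intent(minutes_since_order, days_since_delivery,
                          day_of_month, recent_failed_logins):
    """Rank the behavioral signals above into a single most-likely call reason.

    Pass None for signals that don't apply. Thresholds mirror the example
    patterns and are illustrative, not tuned on real data.
    """
    if recent_failed_logins and recent_failed_logins >= 3:
        return "account_lockout"
    if minutes_since_order is not None and minutes_since_order <= 60:
        return "order_status_or_change"
    if days_since_delivery is not None and days_since_delivery <= 1:
        return "order_issue"
    if day_of_month == 15:
        return "billing"
    return "unknown"
```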
Strategy 3: Use Real-Time Signals
What you know: What’s happening right now in your systems
How to use it: Surface relevant context immediately
Examples:
- There’s an outage in the customer’s area → “Hi, I see you’re calling from the Portland area. We’re aware of a service disruption and crews are working on it. Estimated restoration is 4pm. Is there anything else I can help with?”
- The customer’s payment just declined → “Hi, I noticed your payment didn’t go through. Would you like to update your payment method?”
- The customer’s flight was just delayed → “Hi, I see your flight to Chicago has been delayed to 6:45pm. Would you like to rebook or get information about the delay?”
Strategy 4: Use Account Context
What you know: Customer’s account status, history, preferences
How to use it: Personalize the experience to their situation
Examples:
- Premium customer → different routing, acknowledge status
- Customer with open support ticket → “Are you following up on your case from Tuesday?”
- Customer who called yesterday → “I see we spoke yesterday about your refund. It’s been processed and you should see it in 2-3 business days. Is there anything else?”
Benchmark
Resources
- https://www.speechmatics.com/company/articles-and-news/speed-you-can-trust-the-stt-metrics-that-matter-for-voice-agents
- https://www.coval.dev/blog/voice-ai-evaluation-in-2026-the-5-metrics-that-actually-predict-production-success
- https://www.coval.dev/blog/voice-ai-drop-off-rate-the-metric-that-predicts-whether-customers-stay-or-hang-up