Evaluation Frameworks

aiewf-eval

Purpose: Framework for evaluating multi-turn LLM conversations with support for text, realtime audio, and speech-to-speech models.

Features

  • Multi-turn conversation evaluation
  • Text evaluation
  • Realtime audio evaluation
  • Speech-to-speech model support
  • Customizable metrics

Usage

from aiewf_eval import Evaluator

evaluator = Evaluator()

# Evaluate a set of multi-turn test conversations against the chosen metrics
results = evaluator.evaluate(
    conversations=test_conversations,  # list of conversations to score
    metrics=['relevance', 'coherence', 'emotion']
)

Coval

Purpose: Testing framework for voice agents.

Features

  • Voice agent testing
  • Multi-turn evaluation
  • Quality metrics
  • Test scenarios

Use Cases

  • Conversation quality assessment
  • Multi-turn interaction testing
  • Baseline comparison

Cekura

Roark

Purpose: Alternative evaluation tool for voice agent testing.

Features

  • Voice quality assessment
  • Interaction testing
  • Scenario-based evaluation

Evaluation Metrics

Speech Quality Metrics

PESQ (Perceptual Evaluation of Speech Quality)

  • Range: -0.5 to 4.5
  • Meaning:
    • 4.5 = Perfect quality
    • 4.0 = Excellent
    • 3.5 = Good
    • 3.0 = Fair
    • < 3.0 = Poor

Use: Comparing audio quality across models/configurations
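
A minimal sketch of scoring a generated clip against a clean reference, assuming the open-source pesq package (pip install pesq); the file names are hypothetical.

from scipy.io import wavfile
from pesq import pesq

# Hypothetical files: a clean reference and the model's generated audio.
rate_ref, ref = wavfile.read("reference.wav")
rate_deg, deg = wavfile.read("generated.wav")
assert rate_ref == rate_deg == 16000  # wideband mode expects 16 kHz audio

score = pesq(rate_ref, ref, deg, "wb")  # wideband PESQ, range -0.5 to 4.5
print(f"PESQ: {score:.2f}")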

MOS (Mean Opinion Score)

  • Method: Have humans rate audio quality (1-5 scale)
  • Average: Calculate mean across raters
  • Scale:
    • 5 = Excellent
    • 4 = Good
    • 3 = Fair
    • 2 = Poor
    • 1 = Bad

Use: Gold standard for audio quality; expensive but accurate
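
A short sketch of aggregating MOS ratings, assuming a hypothetical panel of four raters; the numbers are illustrative only.

import numpy as np

# Hypothetical ratings: rows are audio clips, columns are human raters (1-5).
ratings = np.array([
    [4, 5, 4, 4],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
])

mos_per_clip = ratings.mean(axis=1)  # MOS for each clip
mos = mos_per_clip.mean()            # overall MOS
ci95 = 1.96 * mos_per_clip.std(ddof=1) / np.sqrt(len(mos_per_clip))
print(f"MOS: {mos:.2f} +/- {ci95:.2f}")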

Speaker Similarity

  • Definition: How closely generated speech matches the target speaker
  • Method: Cosine similarity of speaker embeddings
  • Scale: 0-1 (1 = identical)

Use: Voice cloning/speaker matching evaluation
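
A minimal sketch of the embedding comparison, assuming embeddings already extracted by a speaker encoder; the vectors here are random placeholders.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity of two speaker embeddings; values near 1
    # indicate the generated voice closely matches the target.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; in practice these come from a speaker encoder.
target = np.random.rand(256)
generated = np.random.rand(256)
print(f"Speaker similarity: {cosine_similarity(target, generated):.3f}")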

Conversation Quality Metrics

Relevance

  • Definition: How well the response addresses the user's input
  • Scale: 0-1 or 1-5
  • Evaluation Method: Manual or LLM-based
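
A hedged sketch of the LLM-based method mentioned above, using the OpenAI chat completions API; the model name, prompt, and scoring scale are illustrative assumptions, not part of any tool listed here. The same pattern works for coherence and completeness.

from openai import OpenAI

client = OpenAI()

def score_relevance(user_input: str, response: str) -> float:
    # Ask an LLM judge for a 1-5 relevance rating; prompt is illustrative.
    prompt = (
        "Rate from 1 to 5 how well the response addresses the user input.\n"
        f"User input: {user_input}\n"
        f"Response: {response}\n"
        "Answer with a single number."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable judge works
        messages=[{"role": "user", "content": prompt}],
    )
    return float(result.choices[0].message.content.strip())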

Coherence

  • Definition: Logical flow and consistency of conversation
  • Scale: 0-1 or 1-5
  • Evaluation Method: Manual or automated

Completeness

  • Definition: Whether the response fully answers the user's question
  • Scale: 0-1 or 1-5
  • Evaluation Method: Manual checklist

Natural Prosody

  • Definition: How natural speech sounds (tone, pacing, emphasis)
  • Scale: 0-1 or 1-5
  • Evaluation Method: Human raters

System Metrics

Latency

  • P50: Median response time
  • P95: 95th percentile response time
  • P99: 99th percentile response time

Targets:

  • Interactive: < 500ms
  • Real-time: < 200ms
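
A quick sketch of computing these percentiles from measured response times; the latency samples are hypothetical.

import numpy as np

# Hypothetical end-to-end response times in milliseconds.
latencies_ms = np.array([120, 180, 210, 450, 95, 300, 160, 220, 510, 140])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50: {p50:.0f} ms, P95: {p95:.0f} ms, P99: {p99:.0f} ms")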

Throughput

  • Definition: Requests processed per second
  • Unit: RPS (requests per second)
  • Target: Depends on load requirements

Error Rate

  • Definition: Percentage of requests that fail
  • Target: < 1% for production
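
A small sketch of deriving both throughput and error rate from a request log; the log entries and window length are hypothetical.

# Hypothetical request log: (duration in seconds, success flag).
requests = [(0.12, True), (0.30, True), (0.25, False), (0.18, True)]
window_s = 60.0  # length of the measurement window in seconds

throughput_rps = len(requests) / window_s
error_rate = sum(1 for _, ok in requests if not ok) / len(requests)
print(f"Throughput: {throughput_rps:.2f} RPS, error rate: {error_rate:.1%}")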

Benchmark