Evaluation Frameworks

aiewf-eval

Purpose: Framework for evaluating multi-turn LLM conversations with support for text, realtime audio, and speech-to-speech models.

Features

  • Multi-turn conversation evaluation
  • Text evaluation
  • Realtime audio evaluation
  • Speech-to-speech model support
  • Customizable metrics

Usage

from aiewf_eval import Evaluator

evaluator = Evaluator()

# Evaluate a set of multi-turn test conversations against the chosen metrics
results = evaluator.evaluate(
    conversations=test_conversations,  # list of conversations to score
    metrics=['relevance', 'coherence', 'emotion']
)

Coval

Purpose: Testing framework for voice agents.

Features

  • Voice agent testing
  • Multi-turn evaluation
  • Quality metrics
  • Test scenarios

Use Cases

  • Conversation quality assessment
  • Multi-turn interaction testing
  • Baseline comparison

Cekura

Roark

Purpose: Alternative evaluation tool for voice agent testing.

Features

  • Voice quality assessment
  • Interaction testing
  • Scenario-based evaluation

Evaluation Metrics

Speech Quality Metrics

PESQ (Perceptual Evaluation of Speech Quality)

  • Range: -0.5 to 4.5
  • Meaning:
    • 4.5 = Perfect quality
    • 4.0 = Excellent
    • 3.5 = Good
    • 3.0 = Fair
    • < 3.0 = Poor

Use: Comparing audio quality across models/configurations
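
A minimal sketch of scoring a generated clip against a clean reference, assuming the open-source pesq package (pip install pesq); the file names are hypothetical.

from scipy.io import wavfile
from pesq import pesq

# Hypothetical files: a clean reference and the model's generated audio.
rate_ref, ref = wavfile.read("reference.wav")
rate_deg, deg = wavfile.read("generated.wav")
assert rate_ref == rate_deg == 16000  # wideband mode expects 16 kHz audio

score = pesq(rate_ref, ref, deg, "wb")  # wideband PESQ, range -0.5 to 4.5
print(f"PESQ: {score:.2f}")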

MOS (Mean Opinion Score)

  • Method: Have humans rate audio quality (1-5 scale)
  • Average: Calculate mean across raters
  • Scale:
    • 5 = Excellent
    • 4 = Good
    • 3 = Fair
    • 2 = Poor
    • 1 = Bad

Use: Gold standard for audio quality; expensive but accurate
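
A short sketch of aggregating MOS ratings, assuming a hypothetical panel of four raters; the numbers are illustrative only.

import numpy as np

# Hypothetical ratings: rows are audio clips, columns are human raters (1-5).
ratings = np.array([
    [4, 5, 4, 4],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
])

mos_per_clip = ratings.mean(axis=1)  # MOS for each clip
mos = mos_per_clip.mean()            # overall MOS
ci95 = 1.96 * mos_per_clip.std(ddof=1) / np.sqrt(len(mos_per_clip))
print(f"MOS: {mos:.2f} +/- {ci95:.2f}")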

Speaker Similarity

  • Definition: How closely generated speech matches the target speaker
  • Method: Cosine similarity of speaker embeddings
  • Scale: 0-1 (1 = identical)

Use: Voice cloning/speaker matching evaluation
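
A minimal sketch of the embedding comparison, assuming embeddings already extracted by a speaker encoder; the vectors here are random placeholders.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity of two speaker embeddings; values near 1
    # indicate the generated voice closely matches the target.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; in practice these come from a speaker encoder.
target = np.random.rand(256)
generated = np.random.rand(256)
print(f"Speaker similarity: {cosine_similarity(target, generated):.3f}")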

Conversation Quality Metrics

Relevance

  • Definition: How well the response addresses the user's input
  • Scale: 0-1 or 1-5
  • Evaluation Method: Manual or LLM-based
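
A hedged sketch of the LLM-based method mentioned above, using the OpenAI chat completions API; the model name, prompt, and scoring scale are illustrative assumptions, not part of any tool listed here. The same pattern works for coherence and completeness.

from openai import OpenAI

client = OpenAI()

def score_relevance(user_input: str, response: str) -> float:
    # Ask an LLM judge for a 1-5 relevance rating; prompt is illustrative.
    prompt = (
        "Rate from 1 to 5 how well the response addresses the user input.\n"
        f"User input: {user_input}\n"
        f"Response: {response}\n"
        "Answer with a single number."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable judge works
        messages=[{"role": "user", "content": prompt}],
    )
    return float(result.choices[0].message.content.strip())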

Coherence

  • Definition: Logical flow and consistency of conversation
  • Scale: 0-1 or 1-5
  • Evaluation Method: Manual or automated

Completeness

  • Definition: Whether the response fully answers the user's question
  • Scale: 0-1 or 1-5
  • Evaluation Method: Manual checklist

Natural Prosody

  • Definition: How natural speech sounds (tone, pacing, emphasis)
  • Scale: 0-1 or 1-5
  • Evaluation Method: Human raters

System Metrics

Latency

  • P50: Median response time
  • P95: 95th percentile response time
  • P99: 99th percentile response time

Targets:

  • Interactive: < 500ms
  • Real-time: < 200ms
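
A quick sketch of computing these percentiles from measured response times; the latency samples are hypothetical.

import numpy as np

# Hypothetical end-to-end response times in milliseconds.
latencies_ms = np.array([120, 180, 210, 450, 95, 300, 160, 220, 510, 140])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50: {p50:.0f} ms, P95: {p95:.0f} ms, P99: {p99:.0f} ms")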

Throughput

  • Definition: Requests processed per second
  • Unit: RPS (requests per second)
  • Target: Depends on load requirements

Error Rate

  • Definition: Percentage of requests that fail
  • Target: < 1% for production
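
A small sketch of deriving both throughput and error rate from a request log; the log entries and window length are hypothetical.

# Hypothetical request log: (duration in seconds, success flag).
requests = [(0.12, True), (0.30, True), (0.25, False), (0.18, True)]
window_s = 60.0  # length of the measurement window in seconds

throughput_rps = len(requests) / window_s
error_rate = sum(1 for _, ok in requests if not ok) / len(requests)
print(f"Throughput: {throughput_rps:.2f} RPS, error rate: {error_rate:.1%}")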

Benchmark