Eval
Types of LLM Evaluation Metrics
Intrinsic metrics: evaluate the model’s internal workings, such as perplexity and fluency.
- Perplexity: measures how well the model predicts a test dataset; lower perplexity indicates better performance (see the sketch after this list).
- Fluency: measures the coherence and naturalness of the generated text.
- BLEU (Bilingual Evaluation Understudy) Score: measures the similarity between the generated text and a reference text.
Extrinsic metrics: evaluate the model’s performance on specific tasks, such as question-answering and text classification.
- Accuracy: measures the proportion of correct predictions or answers.
- F1 Score: measures the balance between precision and recall.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score: measures the quality of generated summaries.
Hybrid metrics: combine intrinsic and extrinsic metrics to provide a more comprehensive evaluation.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering) Score: measures the similarity between generated and reference translations, taking into account the order of the words.
- G-Eval
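As a quick illustration of the intrinsic metrics above, here is a minimal sketch that computes perplexity for a test sentence with a small causal LM. The model (gpt2) and the sentence are placeholders, and it assumes the torch and transformers packages are installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels supplied, the model returns the mean cross-entropy over tokens;
    # perplexity is exp(loss).
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")  # lower = the model predicts the text better
```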
G-Eval
Uses GPT-4 with a chain-of-thought (CoT) approach to generate detailed evaluation steps for assessing NLG outputs.
How G-Eval Works
- Prompt with Criteria: The judge LLM (e.g., GPT-4) is given a task introduction and the evaluation criteria for the dimension being scored (e.g., coherence).
- Auto Chain-of-Thought: The LLM expands those criteria into detailed evaluation steps, as in the coherence example below.
- Form-Filling Scoring: The LLM then reads the source and the generated text and fills in a numeric score; aggregating over the model's score-token probabilities (or over repeated samples, as in the sketch after the example prompt) yields the final quality score.
Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic."
Evaluation Steps:
1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.
Example:
Source Text:
{{Document}}
Summary:
{{Summary}}
Evaluation Form (scores ONLY):
- Coherence:
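A minimal sketch of a G-Eval-style scorer built around the coherence prompt above. It assumes the openai Python client and an OPENAI_API_KEY, uses a placeholder judge model name, and approximates the paper's probability-weighted scoring by sampling the score several times and averaging.

```python
import re
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """Evaluate the coherence (1-5) of the summary below.
Source Text:
{document}
Summary:
{summary}
Evaluation Form (scores ONLY):
- Coherence:"""

def geval_coherence(document: str, summary: str, n_samples: int = 10) -> float:
    prompt = PROMPT_TEMPLATE.format(document=document, summary=summary)
    scores = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # sampling spread approximates the score distribution
        )
        match = re.search(r"[1-5]", response.choices[0].message.content)
        if match:
            scores.append(int(match.group()))
    # Averaging sampled scores approximates G-Eval's probability-weighted score.
    return sum(scores) / len(scores) if scores else 0.0
```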
Check it out here.
SelfCheckGPT
- BERTScore: Compares the generated text with reference samples using BERT embeddings.
- Question-Answering (QA): Generates questions from the text and checks consistency in answers.
- N-gram Analysis: Uses statistical properties of n-grams for consistency checks.
- Natural Language Inference (NLI): Uses entailment and contradiction probabilities.
- LLM Prompting: Queries LLMs directly to check consistency.
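The variants above all rest on the same idea: sample the LLM several times and check whether the main answer stays consistent with the samples. Below is a simplified sketch of the BERTScore variant, assuming the bert-score package; sample generation is left to the caller.

```python
from bert_score import score  # pip install bert-score

def selfcheck_bertscore(sentences: list[str], sampled_passages: list[str]) -> list[float]:
    """Per-sentence inconsistency scores in [0, 1]; higher = more likely hallucinated."""
    inconsistency = []
    for sentence in sentences:
        # Compare the sentence against every stochastically sampled passage.
        cands = [sentence] * len(sampled_passages)
        _, _, f1 = score(cands, sampled_passages, lang="en", verbose=False)
        # If no sample supports the sentence, the average F1 is low and the score is high.
        inconsistency.append(1.0 - f1.mean().item())
    return inconsistency

# Usage: generate one main answer plus N temperature-sampled answers from the same
# prompt, split the main answer into sentences, and pass both lists in here.
```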
Check out more on SelfCheckGPT.
DeepEval
An open-source LLM evaluation framework that includes:
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- Hallucination
- Toxicity
- Bias
- and more. GitHub
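A minimal usage sketch following DeepEval's documented quickstart pattern; exact class and argument names may vary across versions, and it assumes `pip install deepeval` plus an API key for the default judge model.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case: the user input and the output your application produced.
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="We ship within 3-5 business days.",
)

# The metric uses an LLM judge under the hood and fails below the threshold.
metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(test_cases=[test_case], metrics=[metric])
```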
LLM-as-Judge
- Use pairwise comparisons: Instead of asking the LLM to score a single output on a Likert scale, present it with two options and ask it to select the better one. This tends to lead to more stable results.
- Control for position bias: The order of options presented can bias the LLM’s decision. To mitigate this, do each pairwise comparison twice, swapping the order of pairs each time. Just be sure to attribute wins to the right option after swapping!
- Allow for ties: In some cases, both options may be equally good. Thus, allow the LLM to declare a tie so it doesn’t have to arbitrarily pick a winner.
- Use Chain-of-Thought: Asking the LLM to explain its decision before giving a final answer can increase eval reliability. As a bonus, this lets you use a weaker but faster LLM and still achieve similar results. Because this part of the pipeline is typically run in batch, the extra latency from CoT isn’t a problem.
- Control for response length: LLMs tend to bias toward longer responses. To mitigate this, ensure response pairs are similar in length.
When getting structured output from the LLM judge, prefer YAML: it is less verbose than JSON and hence consumes fewer tokens.
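Here is a sketch that combines the tips above: pairwise comparison, position-bias control via order swapping, an allowance for ties, a brief chain of thought, and YAML output. The judge model name and prompt wording are assumptions.

```python
import yaml  # pip install pyyaml
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two responses to the same question.
Question: {question}
Response A: {a}
Response B: {b}
Think step by step about which response is better, then answer in YAML with keys
`reasoning` (one short paragraph) and `winner` (one of A, B, tie)."""

def judge_once(question: str, a: str, b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
        temperature=0,
    )
    # Real code should strip markdown fences and validate the YAML before parsing.
    return yaml.safe_load(completion.choices[0].message.content)["winner"]

def judge_pair(question: str, first: str, second: str) -> str:
    verdict_1 = judge_once(question, first, second)  # `first` shown as A
    verdict_2 = judge_once(question, second, first)  # swapped order to control position bias
    # Map the swapped verdict back: a "B" win in round 2 is a win for `first`.
    verdict_2 = {"A": "B", "B": "A", "tie": "tie"}[verdict_2]
    return verdict_1 if verdict_1 == verdict_2 else "tie"  # disagreement -> tie
```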
Metrics for N-Gram Matching
- BLEU: Compares the generated text with reference completions, scoring between 0 (no match) and 1 (perfect match).
- ROUGE-N: Measures n-gram overlap between generated text and references.
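A short sketch of both metrics, assuming the sacrebleu and rouge-score packages; the prediction and reference strings are placeholders.

```python
import sacrebleu  # pip install sacrebleu
from rouge_score import rouge_scorer  # pip install rouge-score

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# BLEU: n-gram precision with a brevity penalty; sacrebleu reports it on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-1 / ROUGE-2: unigram and bigram overlap between prediction and reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
for pred, ref in zip(predictions, references):
    scores = scorer.score(ref, pred)  # (target, prediction)
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```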
Avoid asking the LLM judge for a numeric score (e.g., a rating out of 5), because it is unclear what to do with that number afterwards. Instead, ask for critiques and concrete suggestions for how the output can be improved. For more, check Creating a LLM-as-a-Judge That Drives Business Results.
Agent as judge
Using an agent as the judge: because the agent has access to tools (and can take multiple steps), it can perform better than a plain LLM-as-judge.
For more, check here.
Auto-Arena
Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
The Auto-Arena framework consists of three stages: Question Generation, Multi-round Peer Battles, and Committee Discussions. These stages run sequentially and are fully simulated with LLM-powered agents to evaluate the responses. Check it out here.
ChainPoll
A High Efficacy Method for LLM Hallucination Detection
The Correctness and Context Adherence metrics in the Galileo console are powered by ChainPoll; there is a ChainPoll-based metric for each of these cases.
- ChainPoll-Correctness uses ChainPoll to detect open-domain hallucinations.
- ChainPoll-Adherence uses ChainPoll to detect closed-domain hallucinations (responses not grounded in the provided context).
Steps
- Ask gpt-3.5-turbo whether the completion contained hallucination(s), using a detailed and carefully engineered prompt.
- Run step 1 multiple times, typically 5. (We use batch inference here for its speed and cost advantages.)
- Divide the number of “yes” answers from step 2 by the total number of answers to produce a score between 0 and 1.
I need you to verify the following statements for correctness using the ChainPoll method:
1. Break down the response into individual facts.
2. Verify each fact using reliable sources.
3. Identify any inconsistencies or errors.
4. Provide the correct information if any fact is incorrect.
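A sketch of the ChainPoll recipe described in the steps above: poll a judge model several times with a CoT-style hallucination question and report the fraction of "yes" votes. The model name, prompt wording, and answer parsing are assumptions.

```python
from openai import OpenAI

client = OpenAI()

CHAINPOLL_PROMPT = """Does the following completion contain any hallucination,
i.e. claims that are not supported by the context? Think step by step, then end
your answer with a single line that is exactly YES or NO.

Context:
{context}

Completion:
{completion}"""

def chainpoll_score(context: str, completion: str, n_polls: int = 5) -> float:
    """Fraction of polls in which the judge flags a hallucination (0.0-1.0)."""
    yes_votes = 0
    for _ in range(n_polls):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": CHAINPOLL_PROMPT.format(
                context=context, completion=completion)}],
            temperature=1.0,  # diversity across polls is what makes the vote informative
        )
        if response.choices[0].message.content.strip().upper().endswith("YES"):
            yes_votes += 1
    return yes_votes / n_polls
```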
Prometheus
Prometheus is a family of open-source language models specialized in evaluating other language models. By effectively simulating human judgments and proprietary LM-based evaluations, we aim to resolve the following issues:
- Fairness: Not relying on closed-source models for evaluations!
- Controllability: You don’t have to worry about GPT version updates or sending your private data to OpenAI by constructing internal evaluation pipelines.
- Affordability: If you already have GPUs, it is free to use!
You are a fair judge assistant tasked with providing clear, objective feedback
based on specific criteria, ensuring each assessment reflects the absolute
standards set for performance.
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a
reference answer that gets a score of 5, and a score rubric representing a
evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly
based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5.
You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for
criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate:
{instruction}
###Response to evaluate:
{response}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
{score_rubric}
###Feedback:
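A sketch of running the absolute-grading template above with an open Prometheus checkpoint via transformers; the checkpoint name is an assumption (substitute the one you actually use), and the template here is trimmed to its placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Trimmed stand-in for the full ###Task Description prompt shown above.
PROMETHEUS_TEMPLATE = """###The instruction to evaluate:
{instruction}
###Response to evaluate:
{response}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
{score_rubric}
###Feedback:"""

prompt = PROMETHEUS_TEMPLATE.format(
    instruction="Explain photosynthesis to a 10-year-old.",
    response="Plants eat sunlight and air to make their own food.",
    reference_answer="Photosynthesis is how plants use sunlight, water, and CO2 to make sugar...",
    score_rubric="Is the explanation accurate and age-appropriate? (1 = not at all, 5 = fully)",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# The decoded text ends with: "Feedback: ... [RESULT] <integer 1-5>"
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```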
Ragas
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM’s context.
Ragas Framework
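A minimal sketch following the classic Ragas quickstart; metric names and the expected dataset columns have shifted across versions, so treat it as the pattern rather than the exact current API.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One RAG interaction: the query, the retrieved chunks, the answer, and a reference.
data = {
    "question": ["When was the Eiffel Tower completed?"],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "ground_truth": ["1889"],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```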
Resources
EvalLM
Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. It is a website where you can enter a prompt and evaluate its outputs against predefined as well as user-defined criteria.
ChainForge
ChainForge is an open-source visual programming environment for prompt engineering, LLM evaluation and experimentation
SPADE
System for Prompt Analysis and Delta-Based Evaluation (SPADE): a method for automatically synthesizing data quality assertions that identify bad LLM outputs.
How it Works
- Prompt Tracking: Logs prompt changes over time.
- Prompt Changes Evaluation: Generates responses based on updated prompts.
- Automated Unit Test Generation: Creates unit tests for each prompt variation.
- Delta-Based Analysis: Compares outputs before and after prompt changes.
- Quality Assertion Creation: Forms assertions to detect bad outputs.
Resources
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions here
- Evaluating the Effectiveness of LLM-Evaluators
Giskard
Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers everything from LLM-based applications such as RAG agents to traditional ML models for tabular data.
The Giskard LLM scan comprises two main types of detectors:
- Traditional detectors, which exploit known techniques or heuristics to detect vulnerabilities. Example: LLMCharsInjectionDetector
- LLM-assisted detectors, which use another LLM to probe the model under analysis. Example: LLMBasicSycophancyDetector
Snorkel
Create a custom-trained model with custom data and use it for eval. Steps:
- Create golden dataset
- Encode acceptance criteria into custom quality model
- Slice your prompts to evaluate what matters
- Review fine-grained benchmarks
TruLens
RAG Triad of metrics
- Context Relevance → is the retrieved context relevant to the query?
- Answer Relevance → is the response relevant to the query?
- Groundedness → is the response supported by the context?
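A sketch that scores the RAG triad with a single LLM judge rather than the TruLens feedback-function API; the prompts, the 0-10 scale, and the model name are assumptions.

```python
from openai import OpenAI

client = OpenAI()

TRIAD_PROMPTS = {
    "context_relevance": "Is the retrieved context relevant to the query?\nQuery: {query}\nContext: {context}",
    "answer_relevance": "Is the response relevant to the query?\nQuery: {query}\nResponse: {response}",
    "groundedness": "Is the response supported by the context?\nContext: {context}\nResponse: {response}",
}

def rag_triad(query: str, context: str, response: str) -> dict:
    """Score each leg of the triad from 0 (worst) to 10 (best) with an LLM judge."""
    scores = {}
    for name, template in TRIAD_PROMPTS.items():
        prompt = (
            template.format(query=query, context=context, response=response)
            + "\nAnswer with a single number from 0 (not at all) to 10 (completely)."
        )
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Real code should parse more defensively; the judge may add extra words.
        scores[name] = float(reply.choices[0].message.content.strip().split()[0])
    return scores
```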
Quotient AI
Quotient AI automates manual evaluations starting from real data, incorporating human feedback.
- Context Relevance
- Chunk Relevance
- Faithfulness
- ROUGE-L
- BERT Sentence Similarity
- BERTScore
Eleuther AI
A framework for few-shot evaluation of language models.
Arize Phoenix
Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting.
Braintrust
Braintrust is an end-to-end platform for building AI applications. It makes software development with large language models (LLMs) robust and iterative.
Learning
- A three-step approach to improve evaluations:
  - Align evaluators with domain experts by having them continuously critique the evaluator results and incorporating their feedback into the evaluator prompt.
  - Keep data sets aligned with real-world user queries by logging underperforming queries in production and flowing them back into the test suite.
  - Measure and track alignment over time using metrics like F1 score or correlation coefficients to determine if the evaluator is truly improving.
- Customizing the LM evaluator prompt is crucial. Rather than relying on templated metrics, tailor the evaluation criteria to the specific use case and business context.
- Involve domain experts early in the evaluation process to assess whether the evaluator’s judgments align with their expertise.
- Treat LM evaluator prompts as living documents that need to evolve. Regularly test new versions against the expanding test bank and invest in tools that allow domain experts to iterate on the evaluator prompt.
- Continuous improvement is the goal, and iterative feedback loops should be built into the development process.
- The ultimate measure of effective LM evaluations is their alignment with real-world usage.
- Use an LLM as a judge to review outputs: if the judge gives a high score, send the output directly to the customer; otherwise, route it for human review.
Types of Evals:
- Deterministic Checks: sanity checks and regular expression (regex) checks to ensure the AI output adheres to basic rules and constraints (see the sketch below).
- Assertion/Criteria Checks: use the AI model’s intelligence to assess whether the output meets specific criteria, such as relevance and accuracy.
- Single-Step Evals: focus on evaluating individual steps within an AI application, using a cascading series of evaluations like regex checks and judge LLMs.
- Multi-Step Evals: address the complexities of multi-step workflows, where errors can compound. These evaluations often require tracing to capture each step and identify where failures occur.
- Trajectory Evals: used for agentic systems to evaluate the paths the AI takes to solve problems. This involves running simulations to test the AI in different environments and analyzing the trajectories it follows.
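A sketch of the first two eval types, a deterministic regex check and an LLM-assisted assertion check; the specific rules, criterion, and judge model are assumptions.

```python
import re
from openai import OpenAI

client = OpenAI()

def deterministic_checks(output: str) -> bool:
    """Cheap sanity/regex checks that need no model call."""
    no_placeholder = "TODO" not in output and "lorem ipsum" not in output.lower()
    has_order_id = re.search(r"ORD-\d{6}", output) is not None  # hypothetical ID format
    return no_placeholder and has_order_id

def assertion_check(output: str, criterion: str) -> bool:
    """Ask a judge model whether the output meets a natural-language criterion."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                f"Output:\n{output}\n\nDoes this output satisfy the criterion "
                f"'{criterion}'? Answer YES or NO only."
            ),
        }],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```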
Tools
- Portkey
- TruLens: Website
- Inspect AI: GitHub
- Giskard: GitHub
- https://github.com/EleutherAI/lm-evaluation-harness
- https://gymnasium.farama.org/
Resources
- A Survey on Hallucination in Large Language Models
- Evaluating the Effectiveness of LLM-Evaluators
- A framework for few-shot evaluation of language models.
- LLM Evaluation Skills Are Easy to Pick Up
- How to Cook Good AI Products with What You Already Have in your Data Warehouse
- Optimizing RAG Through an Evaluation-Based Methodology
- Innovations in Evaluating AI Agent Performance
- Your Evals Are Meaningless (And Here’s How to Fix Them)
- Mission-Critical Evals at Scale (Learnings from 100k medical decisions)
- [Evaluating AI Agents via “Trajectory Evals” & “Eval Agents” | w/ Dhruv Singh Co-Founder @ HoneyHive](https://www.youtube.com/watch?v=IWy7towYJDM&t=9s “Evaluating AI Agents via “Trajectory Evals” & “Eval Agents” | w/ Dhruv Singh Co-Founder @ HoneyHive”)
- https://eugeneyan.com/writing/llm-evaluators/
- https://trilogyai.substack.com/p/llm-evaluation-frameworks
- https://www.comet.com/site/blog/llm-juries-for-evaluation/