Eval
Types of LLM Evaluation Metrics
Intrinsic metrics: evaluate the model’s internal workings, such as perplexity and fluency.
- Perplexity: measures how well the model predicts a test dataset; lower perplexity indicates better performance (see the sketch after this list).
- Fluency: measures the coherence and naturalness of the generated text.
- BLEU (Bilingual Evaluation Understudy) Score: measures the similarity between the generated text and a reference text.
Extrinsic metrics: evaluate the model’s performance on specific tasks, such as question-answering and text classification.
- Accuracy: measures the proportion of correct predictions or answers.
- F1 Score: measures the balance between precision and recall.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score: measures the quality of generated summaries.
Hybrid metrics: combine intrinsic and extrinsic metrics to provide a more comprehensive evaluation.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering) Score: measures the similarity between generated and reference translations, taking into account the order of the words.
- G-Eval
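As a quick illustration of the intrinsic metrics above, here is a minimal sketch that computes perplexity for a test sentence with a small causal LM. The model (gpt2) and the sentence are placeholders, and it assumes the torch and transformers packages are installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels supplied, the model returns the mean cross-entropy over tokens;
    # perplexity is exp(loss).
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")  # lower = the model predicts the text better
```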
G-Eval
Uses GPT-4 with a chain-of-thought (CoT) approach to generate detailed evaluation steps for assessing NLG outputs.
How G-Eval Works
- Prompt with Criteria: The judge LLM (e.g., GPT-4) is given a task introduction and the evaluation criteria for the dimension being scored (e.g., coherence).
- Auto Chain-of-Thought: The LLM expands those criteria into detailed evaluation steps, as in the coherence example below.
- Form-Filling Scoring: The LLM then reads the source and the generated text and fills in a numeric score; aggregating over the model's score-token probabilities (or over repeated samples, as in the sketch after the example prompt) yields the final quality score.
Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic."
Evaluation Steps:
1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.
Example:
Source Text:
{{Document}}
Summary:
{{Summary}}
Evaluation Form (scores ONLY):
- Coherence:
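A minimal sketch of a G-Eval-style scorer built around the coherence prompt above. It assumes the openai Python client and an OPENAI_API_KEY, uses a placeholder judge model name, and approximates the paper's probability-weighted scoring by sampling the score several times and averaging.

```python
import re
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """Evaluate the coherence (1-5) of the summary below.
Source Text:
{document}
Summary:
{summary}
Evaluation Form (scores ONLY):
- Coherence:"""

def geval_coherence(document: str, summary: str, n_samples: int = 10) -> float:
    prompt = PROMPT_TEMPLATE.format(document=document, summary=summary)
    scores = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # sampling spread approximates the score distribution
        )
        match = re.search(r"[1-5]", response.choices[0].message.content)
        if match:
            scores.append(int(match.group()))
    # Averaging sampled scores approximates G-Eval's probability-weighted score.
    return sum(scores) / len(scores) if scores else 0.0
```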
Check it out here.
SelfCheckGPT
- BERTScore: Compares the generated text with reference samples using BERT embeddings.
- Question-Answering (QA): Generates questions from the text and checks consistency in answers.
- N-gram Analysis: Uses statistical properties of n-grams for consistency checks.
- Natural Language Inference (NLI): Uses entailment and contradiction probabilities.
- LLM Prompting: Queries LLMs directly to check consistency.
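The variants above all rest on the same idea: sample the LLM several times and check whether the main answer stays consistent with the samples. Below is a simplified sketch of the BERTScore variant, assuming the bert-score package; sample generation is left to the caller.

```python
from bert_score import score  # pip install bert-score

def selfcheck_bertscore(sentences: list[str], sampled_passages: list[str]) -> list[float]:
    """Per-sentence inconsistency scores in [0, 1]; higher = more likely hallucinated."""
    inconsistency = []
    for sentence in sentences:
        # Compare the sentence against every stochastically sampled passage.
        cands = [sentence] * len(sampled_passages)
        _, _, f1 = score(cands, sampled_passages, lang="en", verbose=False)
        # If no sample supports the sentence, the average F1 is low and the score is high.
        inconsistency.append(1.0 - f1.mean().item())
    return inconsistency

# Usage: generate one main answer plus N temperature-sampled answers from the same
# prompt, split the main answer into sentences, and pass both lists in here.
```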
Check out more on SelfCheckGPT.
DeepEval
An open-source LLM evaluation framework that includes:
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- Hallucination
- Toxicity
- Bias
- and more. GitHub
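A minimal usage sketch following DeepEval's documented quickstart pattern; exact class and argument names may vary across versions, and it assumes `pip install deepeval` plus an API key for the default judge model.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case: the user input and the output your application produced.
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="We ship within 3-5 business days.",
)

# The metric uses an LLM judge under the hood and fails below the threshold.
metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(test_cases=[test_case], metrics=[metric])
```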
LLM-as-Judge
- Use pairwise comparisons: Instead of asking the LLM to score a single output on a Likert scale, present it with two options and ask it to select the better one. This tends to lead to more stable results.
- Control for position bias: The order of options presented can bias the LLM’s decision. To mitigate this, do each pairwise comparison twice, swapping the order of pairs each time. Just be sure to attribute wins to the right option after swapping!
- Allow for ties: In some cases, both options may be equally good. Thus, allow the LLM to declare a tie so it doesn’t have to arbitrarily pick a winner.
- Use Chain-of-Thought: Asking the LLM to explain its decision before giving a final answer can increase eval reliability. As a bonus, this lets you use a weaker but faster LLM and still achieve similar results. Because this part of the pipeline is typically run in batch, the extra latency from CoT isn’t a problem.
- Control for response length: LLMs tend to bias toward longer responses. To mitigate this, ensure response pairs are similar in length.
When getting structured output from the LLM judge, prefer YAML: it is less verbose than JSON and hence consumes fewer tokens.
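Here is a sketch that combines the tips above: pairwise comparison, position-bias control via order swapping, an allowance for ties, a brief chain of thought, and YAML output. The judge model name and prompt wording are assumptions.

```python
import yaml  # pip install pyyaml
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two responses to the same question.
Question: {question}
Response A: {a}
Response B: {b}
Think step by step about which response is better, then answer in YAML with keys
`reasoning` (one short paragraph) and `winner` (one of A, B, tie)."""

def judge_once(question: str, a: str, b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
        temperature=0,
    )
    # Real code should strip markdown fences and validate the YAML before parsing.
    return yaml.safe_load(completion.choices[0].message.content)["winner"]

def judge_pair(question: str, first: str, second: str) -> str:
    verdict_1 = judge_once(question, first, second)  # `first` shown as A
    verdict_2 = judge_once(question, second, first)  # swapped order to control position bias
    # Map the swapped verdict back: a "B" win in round 2 is a win for `first`.
    verdict_2 = {"A": "B", "B": "A", "tie": "tie"}[verdict_2]
    return verdict_1 if verdict_1 == verdict_2 else "tie"  # disagreement -> tie
```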
Metrics for N-Gram Matching
- BLEU: Compares the generated text with reference completions, scoring between 0 (no match) and 1 (perfect match).
- ROUGE-N: Measures n-gram overlap between generated text and references.
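A short sketch of both metrics, assuming the sacrebleu and rouge-score packages; the prediction and reference strings are placeholders.

```python
import sacrebleu  # pip install sacrebleu
from rouge_score import rouge_scorer  # pip install rouge-score

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# BLEU: n-gram precision with a brevity penalty; sacrebleu reports it on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-1 / ROUGE-2: unigram and bigram overlap between prediction and reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
for pred, ref in zip(predictions, references):
    scores = scorer.score(ref, pred)  # (target, prediction)
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```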
Avoid asking the LLM judge for a numeric score (e.g., a rating out of 5), because it is unclear what to do with that number afterwards. Instead, ask for critiques and concrete suggestions for how the output can be improved. For more, check Creating a LLM-as-a-Judge That Drives Business Results.
Agent as judge
Using an agent as the judge: because the agent has access to tools (and can take multiple steps), it can perform better than a plain LLM-as-judge.
For more, check here.
Auto-Arena
Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
The Auto-Arena framework consists of three stages: Question Generation, Multi-round Peer Battles, and Committee Discussions. These stages run sequentially and are fully simulated with LLM-powered agents to evaluate the responses. Check it out here.
ChainPoll
A High Efficacy Method for LLM Hallucination Detection
The Correctness and Context Adherence metrics in the Galileo console are powered by ChainPoll; there is a ChainPoll-based metric for each of these cases.
- ChainPoll-Correctness uses ChainPoll to detect open-domain hallucinations.
- ChainPoll-Adherence uses ChainPoll to detect closed-domain hallucinations (responses not grounded in the provided context).
Steps
- Ask gpt-3.5-turbo whether the completion contained hallucination(s), using a detailed and carefully engineered prompt.
- Run step 1 multiple times, typically 5. (We use batch inference here for its speed and cost advantages.)
- Divide the number of “yes” answers from step 2 by the total number of answers to produce a score between 0 and 1.
I need you to verify the following statements for correctness using the ChainPoll method:
1. Break down the response into individual facts.
2. Verify each fact using reliable sources.
3. Identify any inconsistencies or errors.
4. Provide the correct information if any fact is incorrect.
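A sketch of the ChainPoll recipe described in the steps above: poll a judge model several times with a CoT-style hallucination question and report the fraction of "yes" votes. The model name, prompt wording, and answer parsing are assumptions.

```python
from openai import OpenAI

client = OpenAI()

CHAINPOLL_PROMPT = """Does the following completion contain any hallucination,
i.e. claims that are not supported by the context? Think step by step, then end
your answer with a single line that is exactly YES or NO.

Context:
{context}

Completion:
{completion}"""

def chainpoll_score(context: str, completion: str, n_polls: int = 5) -> float:
    """Fraction of polls in which the judge flags a hallucination (0.0-1.0)."""
    yes_votes = 0
    for _ in range(n_polls):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": CHAINPOLL_PROMPT.format(
                context=context, completion=completion)}],
            temperature=1.0,  # diversity across polls is what makes the vote informative
        )
        if response.choices[0].message.content.strip().upper().endswith("YES"):
            yes_votes += 1
    return yes_votes / n_polls
```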
Prometheus
Prometheus is a family of open-source language models specialized in evaluating other language models. By effectively simulating human judgments and proprietary LM-based evaluations, we aim to resolve the following issues:
- Fairness: Not relying on closed-source models for evaluations!
- Controllability: You don’t have to worry about GPT version updates or sending your private data to OpenAI by constructing internal evaluation pipelines.
- Affordability: If you already have GPUs, it is free to use!
You are a fair judge assistant tasked with providing clear, objective feedback
based on specific criteria, ensuring each assessment reflects the absolute
standards set for performance.
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a
reference answer that gets a score of 5, and a score rubric representing a
evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly
based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5.
You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for
criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate:
{instruction}
###Response to evaluate:
{response}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
{score_rubric}
###Feedback:
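A sketch of running the absolute-grading template above with an open Prometheus checkpoint via transformers; the checkpoint name is an assumption (substitute the one you actually use), and the template here is trimmed to its placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Trimmed stand-in for the full ###Task Description prompt shown above.
PROMETHEUS_TEMPLATE = """###The instruction to evaluate:
{instruction}
###Response to evaluate:
{response}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
{score_rubric}
###Feedback:"""

prompt = PROMETHEUS_TEMPLATE.format(
    instruction="Explain photosynthesis to a 10-year-old.",
    response="Plants eat sunlight and air to make their own food.",
    reference_answer="Photosynthesis is how plants use sunlight, water, and CO2 to make sugar...",
    score_rubric="Is the explanation accurate and age-appropriate? (1 = not at all, 5 = fully)",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# The decoded text ends with: "Feedback: ... [RESULT] <integer 1-5>"
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```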
Ragas
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM’s context.
Ragas Framework
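A minimal sketch following the classic Ragas quickstart; metric names and the expected dataset columns have shifted across versions, so treat it as the pattern rather than the exact current API.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One RAG interaction: the query, the retrieved chunks, the answer, and a reference.
data = {
    "question": ["When was the Eiffel Tower completed?"],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "ground_truth": ["1889"],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```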
Resources
EvalLM
Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. It is a website where you can enter a prompt and evaluate its outputs against predefined as well as user-defined criteria.
ChainForge
ChainForge is an open-source visual programming environment for prompt engineering, LLM evaluation and experimentation
SPADE
System for Prompt Analysis and Delta-Based Evaluation (SPADE): a method for automatically synthesizing data quality assertions that identify bad LLM outputs.
How it Works
- Prompt Tracking: Logs prompt changes over time.
- Prompt Changes Evaluation: Generates responses based on updated prompts.
- Automated Unit Test Generation: Creates unit tests for each prompt variation.
- Delta-Based Analysis: Compares outputs before and after prompt changes.
- Quality Assertion Creation: Forms assertions to detect bad outputs.
Resources
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions here
- Evaluating the Effectiveness of LLM-Evaluators
Giskard
Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers everything from LLM-based applications such as RAG agents to traditional ML models for tabular data.
The Giskard LLM scan comprises two main types of detectors:
- Traditional detectors, which exploit known techniques or heuristics to detect vulnerabilities. Example: LLMCharsInjectionDetector
- LLM-assisted detectors, which use another LLM to probe the model under analysis. Example: LLMBasicSycophancyDetector
Snorkel
Create a custom-trained model with custom data and use it for eval. Steps:
- Create golden dataset
- Encode acceptance criteria into custom quality model
- Slice your prompts to evaluate what matters
- Review fine-grained benchmarks
TruLens
RAG Triad of metrics
- Context Relevance → is the retrieved context relevant to the query?
- Answer Relevance → is the response relevant to the query?
- Groundedness → is the response supported by the context?
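A sketch that scores the RAG triad with a single LLM judge rather than the TruLens feedback-function API; the prompts, the 0-10 scale, and the model name are assumptions.

```python
from openai import OpenAI

client = OpenAI()

TRIAD_PROMPTS = {
    "context_relevance": "Is the retrieved context relevant to the query?\nQuery: {query}\nContext: {context}",
    "answer_relevance": "Is the response relevant to the query?\nQuery: {query}\nResponse: {response}",
    "groundedness": "Is the response supported by the context?\nContext: {context}\nResponse: {response}",
}

def rag_triad(query: str, context: str, response: str) -> dict:
    """Score each leg of the triad from 0 (worst) to 10 (best) with an LLM judge."""
    scores = {}
    for name, template in TRIAD_PROMPTS.items():
        prompt = (
            template.format(query=query, context=context, response=response)
            + "\nAnswer with a single number from 0 (not at all) to 10 (completely)."
        )
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Real code should parse more defensively; the judge may add extra words.
        scores[name] = float(reply.choices[0].message.content.strip().split()[0])
    return scores
```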
Quotient AI
Quotient AI automates manual evaluations starting from real data, incorporating human feedback.
- Context Relevance
- Chunk Relevance
- Faithfulness
- ROUGE-L
- BERT Sentence Similarity
- BERTScore
Eleuther AI
A framework for few-shot evaluation of language models.
Arize Phoenix
Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting.
Braintrust
Braintrust is an end-to-end platform for building AI applications. It makes software development with large language models (LLMs) robust and iterative.
Learning
- A three-step approach to improve evaluations:
  - Align evaluators with domain experts by having them continuously critique the evaluator results and incorporating their feedback into the evaluator prompt.
  - Keep data sets aligned with real-world user queries by logging underperforming queries in production and flowing them back into the test suite.
  - Measure and track alignment over time using metrics like F1 score or correlation coefficients to determine if the evaluator is truly improving.
- Customizing the LM evaluator prompt is crucial. Rather than relying on templated metrics, tailor the evaluation criteria to the specific use case and business context.
- Involve domain experts early in the evaluation process to assess whether the evaluator’s judgments align with their expertise.
- Treat LM evaluator prompts as living documents that need to evolve. Regularly test new versions against the expanding test bank and invest in tools that allow domain experts to iterate on the evaluator prompt.
- Continuous improvement is the goal, and iterative feedback loops should be built into the development process.
- The ultimate measure of effective LM evaluations is their alignment with real-world usage.
- Use an LLM as a judge to review outputs: if the judge gives a high score, send the output directly to the customer; otherwise, route it for human review.
Types of Evals:
- Deterministic Checks: sanity checks and regular expression (regex) checks to ensure the AI output adheres to basic rules and constraints (see the sketch below).
- Assertion/Criteria Checks: use the AI model’s intelligence to assess whether the output meets specific criteria, such as relevance and accuracy.
- Single-Step Evals: focus on evaluating individual steps within an AI application, using a cascading series of evaluations like regex checks and judge LLMs.
- Multi-Step Evals: address the complexities of multi-step workflows, where errors can compound. These evaluations often require tracing to capture each step and identify where failures occur.
- Trajectory Evals: used for agentic systems to evaluate the paths the AI takes to solve problems. This involves running simulations to test the AI in different environments and analyzing the trajectories it follows.
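A sketch of the first two eval types, a deterministic regex check and an LLM-assisted assertion check; the specific rules, criterion, and judge model are assumptions.

```python
import re
from openai import OpenAI

client = OpenAI()

def deterministic_checks(output: str) -> bool:
    """Cheap sanity/regex checks that need no model call."""
    no_placeholder = "TODO" not in output and "lorem ipsum" not in output.lower()
    has_order_id = re.search(r"ORD-\d{6}", output) is not None  # hypothetical ID format
    return no_placeholder and has_order_id

def assertion_check(output: str, criterion: str) -> bool:
    """Ask a judge model whether the output meets a natural-language criterion."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                f"Output:\n{output}\n\nDoes this output satisfy the criterion "
                f"'{criterion}'? Answer YES or NO only."
            ),
        }],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```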
Tools
- Portkey
- TruLens: Website
- Inspect AI: GitHub
- Giskard: GitHub
- https://github.com/EleutherAI/lm-evaluation-harness
- https://gymnasium.farama.org/
Resources
- A Survey on Hallucination in Large Language Models
- Evaluating the Effectiveness of LLM-Evaluators
- A framework for few-shot evaluation of language models.
- LLM Evaluation Skills Are Easy to Pick Up
- How to Cook Good AI Products with What You Already Have in your Data Warehouse
- Optimizing RAG Through an Evaluation-Based Methodology
- Innovations in Evaluating AI Agent Performance
- Your Evals Are Meaningless (And Here’s How to Fix Them)
- Mission-Critical Evals at Scale (Learnings from 100k medical decisions)
- [Evaluating AI Agents via “Trajectory Evals” & “Eval Agents” | w/ Dhruv Singh Co-Founder @ HoneyHive](https://www.youtube.com/watch?v=IWy7towYJDM&t=9s “Evaluating AI Agents via “Trajectory Evals” & “Eval Agents” | w/ Dhruv Singh Co-Founder @ HoneyHive”)
- https://eugeneyan.com/writing/llm-evaluators/
- https://trilogyai.substack.com/p/llm-evaluation-frameworks
- https://www.comet.com/site/blog/llm-juries-for-evaluation/