
Ragas Test Set Generation Breakdown

Delving into RAG application testing with Ragas, this article explores manual vs. automated test set creation using LLMs, focusing on prompt design and the generation process for accurate and efficient testing.

Development · 17 min read
Luka Panic
2024-05-22
Large Language Models
AI Blog Series
Retrieval Augmented Generation
Test set generation

Introduction

Testing a RAG application can initially appear complex, particularly when it comes to obtaining a high-quality test set. The integrity of the entire testing process hinges on the quality of that set. While creating a test set manually is an option, leveraging Large Language Models (LLMs) offers a way to automate the process.

After gaining a high-level understanding of working with Ragas from the previous article, it's time to focus on the generation of a test set. This article explores the trade-offs between manual and automated test set creation, highlighting their respective advantages and challenges.

Test set definition

When creating a test set by hand, the first concern is usually the enormous time investment. However, there is much more to it. To understand the issue, it's necessary to first consider the requirements that a test set must meet.

The core idea of a RAG system is to provide relevant context to the LLM that generates the final answer. A test set therefore needs to contain a question and the correct answer, so it can be represented as a set of (question, answer) pairs. These pairs are commonly referred to as samples. But where does the question come from? The answer lies in the context, a term whose meaning is ambiguous here, and here's why. When generating an answer to the user's question in a RAG system, the context is the result of retrieval that is provided to the LLM, as previously stated.

On the other hand, when creating a test set, the context is just a chunk of text from which a question is derived. In the rest of the article, the latter interpretation will be used. With this in mind, the definition of a test set sample can be upgraded to (context, question, answer).

Once we have a context obtained from a given document, it's time to move to the next step: question generation. From the previous discussion, the term question seems loose: it can be literally anything, from true/false and multiple-choice questions all the way to opinion-based, interpretation, comparison, and critical-thinking questions. Therefore, there is a need to introduce different types of questions, which leads to another upgrade of the test set sample: (context, question, question_type, answer).

A test set may be derived from multiple documents, and testers might be interested in identifying the source document for certain questions. This information can be saved within the metadata. The sample is now represented as (context, question, question_type, answer, metadata).
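To make this structure concrete, here is a minimal sketch of what such a sample could look like in code. The field names simply mirror the discussion above; this is an illustration, not Ragas' internal representation.

```python
from dataclasses import dataclass, field


@dataclass
class TestSetSample:
    """One test set entry: (context, question, question_type, answer, metadata)."""
    context: str        # chunk of text the question was derived from
    question: str
    question_type: str  # e.g. "simple", "reasoning", "conditional"
    answer: str         # the expected (correct) answer
    metadata: dict = field(default_factory=dict)  # e.g. source document info


sample = TestSetSample(
    context="The roots of a plant absorb water and nutrients from the soil...",
    question="What is the function of the roots of a plant?",
    question_type="simple",
    answer="They absorb water and nutrients, anchor the plant, and store food.",
    metadata={"source": "plant_biology.pdf"},
)
```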

Test set generation challenges

Manually creating a test set for RAG systems is an intensive and exhausting process that demands significant concentration. In machine learning, manual labeling of samples can be feasible for smaller datasets used in classification models; for example, one might spend an entire day manually classifying food images as fruits or vegetables. However, when dealing with RAG systems based on generative models, one must be fully focused and deeply engaged in the process.

Additionally, there is no guarantee that a single person will be able to create questions of a specific complexity level. In certain cases, question generation might be subjective and affected by bias. To address these challenges, the explodinggradients team decided to implement synthetic test set generation in Ragas.

Motivation

Before diving deep into the Ragas test set generation process, it's important to discuss the ideas behind it. According to the official Ragas documentation, test set generation is based on Evol-Instruct, a method for producing a massive amount of instructions at different difficulty levels, with the aim of improving the performance of LLMs. Evol-Instruct features an interesting concept of instruction evolution: it starts with a simple initial instruction and tries to increase its complexity by incorporating the LLM. Instructions can evolve in two ways: in-depth or in-breadth.

In-breadth evolving creates a completely new instruction based on the existing one to increase diversity. In-depth evolving upgrades the simple instruction to a more complex one and offers the following operations:

  • add constraints
  • deepening
  • concretizing
  • increase reasoning steps
  • complicate input

Let's cut to the chase and see an example for each type:

  • initial instruction - What is the water cycle?
  • in-breadth evolving - What role does the water cycle play in maintaining Earth's ecosystems?
  • add constraints - How does the water cycle vary in different geographical regions (e.g., deserts vs. rainforests)?
  • deepening - What are the stages of the water cycle (evaporation, condensation, precipitation, infiltration, runoff)?
  • increase reasoning steps - How does the water cycle contribute to the global distribution of freshwater?
  • complicate input (table) - Fill in the table with the average annual rainfall for different regions.
  • complicate input (code) - Write a Python program to simulate the water cycle, tracking the movement of water through different phases (evaporation, condensation, precipitation).
  • complicate input (formula) - Calculate the total amount of water evaporated from a lake given its surface area and average evaporation rate.

Test set generation in Ragas

Test set definition

The test set was previously defined as a list of samples with the following properties: (context, question, question_type, answer, metadata). After understanding these at the conceptual level, it's time to align with the terminology used in Ragas for the sake of simplicity. In Ragas, a sample is represented as (question, contexts, ground_truth, evolution_type, metadata, episode_done), where ground_truth represents the correct answer and episode_done tells whether the generation process has finished successfully. The other properties are self-explanatory.

Ragas supports generating the following evolution (question) types (a configuration sketch follows the list):

  • simple
  • reasoning
  • conditional
  • multi context
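When generating a test set, the desired mix of these evolution types is expressed as a distribution over the types. The snippet below is a sketch assuming the ragas 0.1.x API, where the evolution objects are importable from ragas.testset.evolutions:

```python
from ragas.testset.evolutions import simple, reasoning, conditional, multi_context

# Fraction of generated questions per evolution type; the values should sum to 1.
distributions = {
    simple: 0.5,
    reasoning: 0.2,
    conditional: 0.15,
    multi_context: 0.15,
}
```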

Prompt design

Prompt engineering is an important aspect of working with LLMs, especially in the case of long chains where the next action depends on the output of the previous one. Thus, the prompts in Ragas were carefully designed to follow a common schema (sketched in code after the list):

  • name uniquely identifies a prompt
  • instruction describes a task that needs to be done
  • input_keys identifies inputs
  • output_key identifies output
  • output_type specifies the output format, JSON or string
  • examples showcase the expected behavior utilizing the input_keys and output_key
  • language specifies the language of the prompt
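As a rough illustration of this schema, the keyphrase extraction prompt could be declared along the following lines. This assumes the ragas 0.1.x Prompt class; treat the import path and the shortened field values as an approximation rather than the exact definition from the code base.

```python
from ragas.llms.prompt import Prompt

keyphrase_extraction_prompt = Prompt(
    name="keyphrase_extraction",
    instruction=(
        "Extract the top 3 to 5 keyphrases from the provided text, "
        "focusing on the most significant and distinctive aspects."
    ),
    examples=[
        {
            "text": "A black hole is a region of spacetime where gravity is so strong ...",
            "output": {"keyphrases": ["Black hole", "Region of spacetime", "Strong gravity"]},
        }
    ],
    input_keys=["text"],
    output_key="output",
    output_type="json",
    language="english",
)
```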

The prompts in Ragas are originally written in English but can be adapted to other languages. Once translated, the prompts are cached and retrieved when needed to reduce token consumption and cut costs.
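In the 0.1.x releases this adaptation was exposed on the test set generator itself. The sketch below shows the general idea; the exact method names and signatures are an assumption based on the Ragas documentation of that period, so double-check them against your installed version.

```python
from ragas.testset.evolutions import simple, reasoning, conditional, multi_context
from ragas.testset.generator import TestsetGenerator


def adapt_prompts(generator: TestsetGenerator, language: str = "spanish") -> None:
    """Translate the evolution prompts once and cache them for later runs."""
    generator.adapt(language=language, evolutions=[simple, reasoning, conditional, multi_context])
    generator.save(evolutions=[simple, reasoning, conditional, multi_context])
```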

Dual LLM configuration

There are two “types” of LLMs running in Ragas: critic and generator. This offers flexibility in terms of cost reduction because the cheaper model can be used as a generator, while the more expensive model can be used for evaluating, i.e. criticizing.
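Below is a sketch of how this dual configuration is typically wired up, assuming the ragas 0.1.x LangChain integration and OpenAI models. The model names and the inline document are placeholders, and the distributions dict repeats the evolution-type mix sketched earlier so the snippet is self-contained.

```python
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.evolutions import simple, reasoning, conditional, multi_context
from ragas.testset.generator import TestsetGenerator

# Cheaper model produces questions and answers; a stronger model scores and filters them.
generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

documents = [
    Document(
        page_content="The process of evaporation plays a crucial role in the water cycle...",
        metadata={"filename": "water_cycle.md"},
    )
]

distributions = {simple: 0.5, reasoning: 0.2, conditional: 0.15, multi_context: 0.15}
testset = generator.generate_with_langchain_docs(
    documents, test_size=10, distributions=distributions
)
df = testset.to_pandas()  # question, contexts, ground_truth, evolution_type, metadata, episode_done
```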

Prompts for test set generation

After becoming acquainted with the structure of a test set, one might wonder how it was generated step by step and which prompts were sent to the LLMs. The following schema illustrates the generation process resulting in a test set sample. Yellow rectangles represent the prompts used in specific steps of the generation.

Image 1 - Test set generation schema

Before diving into the structure and explanation of each prompt, note that these are shortened versions. The actual task is omitted and replaced with an ellipsis (...) to avoid clutter, as the essence of each prompt is clear from the examples. Additionally, the JSON formatting instructions are skipped to make the prompts more readable, as they consistently follow the same approach:

The output should be a well-formatted JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output JSON schema:
```
{"type": "object", "properties": {"clarity": {"title": "Clarity", "type": "integer"}, "depth": {"title": "Depth", "type": "integer"}, "structure": {"title": "Structure", "type": "integer"}, "relevance": {"title": "Relevance", "type": "integer"}}, "required": ["clarity", "depth", "structure", "relevance"]}
```

Do not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).

Keyphrase extraction prompt extracts keyphrases from a given text; these keyphrases are later used to generate a seed question.

Extract the top 3 to 5 keyphrases from the provided text, focusing on the most significant and distinctive aspects. 

Examples:
text: "A black hole is a region of spacetime where gravity is so strong that nothing, including light and other electromagnetic waves, has enough energy to escape it. The theory of general relativity predicts that a sufficiently compact mass can deform spacetime to form a black hole."

output: ```{"keyphrases": ["Black hole", "Region of spacetime", "Strong gravity", "Light and electromagnetic waves", "Theory of general relativity"]}```


Your actual task: ...

output:

Context scoring prompt is used to obtain a numerical score (1-3) for a given context based on the following criteria: clarity, depth, structure, and relevance. After the output is generated, Ragas calculates a total score as the average of the four criteria:

score = (clarity + depth + structure + relevance) / 4

The context is passed to the next step only if the condition score ≥ threshold is met, where the threshold is set to 1.5 by default. The exact output JSON schema is included in the prompt to reduce the chance of the LLM making a mistake.
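Expressed in a few lines, this filtering step boils down to something like the following (an illustrative sketch of the logic described above, not the exact Ragas implementation):

```python
def passes_context_filter(scores: dict, threshold: float = 1.5) -> bool:
    """Average the four criteria scores and compare against the threshold."""
    return sum(scores.values()) / len(scores) >= threshold


# Using the second example context from the prompt below: (3 + 2 + 3 + 3) / 4 = 2.75, so it passes.
print(passes_context_filter({"clarity": 3, "depth": 2, "structure": 3, "relevance": 3}))  # True
```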

Given a context, perform the following task and output the answer in VALID JSON format: Assess the provided context and assign a numerical score of 1 (Low), 2 (Medium), or 3 (High) for each of the following criteria in your JSON response:

clarity: Evaluate the precision and understandability of the information presented. High scores (3) are reserved for contexts that are both precise in their information and easy to understand. Low scores (1) are for contexts where the information is vague or hard to comprehend.

depth: Determine the level of detailed examination and the inclusion of innovative insights within the context. A high score indicates a comprehensive and insightful analysis, while a low score suggests a superficial treatment of the topic.
structure: Assess how well the content is organized and whether it flows logically. High scores are awarded to contexts that demonstrate coherent organization and logical progression, whereas low scores indicate a lack of structure or clarity in progression.

relevance: Judge the pertinence of the content to the main topic, awarding high scores to contexts tightly focused on the subject without unnecessary digressions, and low scores to those that are cluttered with irrelevant information.

Structure your JSON output to reflect these criteria as keys with their corresponding scores as values

Examples:
context: "The Pythagorean theorem is a fundamental principle in geometry. It states that in a right-angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides. This can be written as a^2 + b^2 = c^2 where c represents the length of the hypotenuse, and a and b represent the lengths of the other two sides."

output: ```{"clarity": 3, "depth": 1, "structure": 3, "relevance": 3}```


context: "Albert Einstein (14 March 1879 - 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time."

output: ```{"clarity": 3, "depth": 2, "structure": 3, "relevance": 3}```


context: "I love chocolate. It's really tasty. Oh, and by the way, the earth orbits the sun, not the other way around. Also, my favorite color is blue."

output: ```{"clarity": 2, "depth": 1, "structure": 1, "relevance": 1}```


Your actual task: ...

output:

Seed question prompt creates a simple (seed) question from a given context and one of the extracted keyphrases.

Generate a question that can be fully answered from given context. The question should be formed using topic

Examples:
context: "The process of evaporation plays a crucial role in the water cycle, converting water from liquid to vapor and allowing it to rise into the atmosphere."

keyphrase: "Evaporation"

question: "Why is evaporation important in the water cycle?"


Your actual task: ...

question:

Filter question prompt assigns a verdict of 1 if the question meets the specified criteria (independence and clear intent), or 0 otherwise. Additionally, it returns detailed feedback. After the output is generated, the question will be accepted only if the verdict is 1, and the evolution process will continue. Otherwise, Ragas will try to rewrite the question using the feedback.

Asses the given question for clarity and answerability given enough domain knowledge, consider the following criteria:

1.Independence: Can the question be understood and answered without needing additional context or access to external references not provided within the question itself? Questions should be self-contained, meaning they do not rely on specific documents, tables, or prior knowledge not shared within the question.

2.Clear Intent: Is it clear what type of answer or information the question seeks? The question should convey its purpose without ambiguity, allowing for a direct and relevant response.

Based on these criteria, assign a verdict of "1" if a question is specific, independent, and has a clear intent, making it understandable and answerable based on the details provided. Assign "0" if it fails to meet one or more of these criteria due to vagueness, reliance on external references, or ambiguity in intent.
Provide feedback and a verdict in JSON format, including suggestions for improvement if the question is deemed unclear. Highlight aspects of the question that contribute to its clarity or lack thereof, and offer advice on how it could be reframed or detailed for better understanding and answerability.

Examples:
question: "What is the discovery about space?"

output: ```{"feedback": "The question is too vague and broad, asking for a 'discovery about space' without specifying any particular aspect, time frame, or context of interest. This could refer to a wide range of topics, from the discovery of new celestial bodies to advancements in space travel technology. To improve clarity and answerability, the question could specify the type of discovery (e.g., astronomical, technological), the time frame (e.g., recent, historical), or the context (e.g., within a specific research study or space mission).", "verdict": 0}```


question: "What is the configuration of UL2 training objective in OpenMoE and why is it a better choice for pre-training?"

output: ```{"feedback": "The question asks for the configuration of the UL2 training objective within the OpenMoE framework and the rationale behind its suitability for pre-training. It is clear in specifying the topic of interest (UL2 training objective, OpenMoE) and seeks detailed information on both the configuration and the reasons for its effectiveness in pre-training. However, the question might be challenging for those unfamiliar with the specific terminology or the context of OpenMoE and UL2. For broader clarity and answerability, it would be helpful if the question included a brief explanation or context about OpenMoE and the UL2 training objective, or clarified the aspects of pre-training effectiveness it refers to (e.g., efficiency, accuracy, generalization).", "verdict": 1}```


Your actual task: ...

output:

Reasoning question prompt evolves the given question into a reasoning question.

Complicate the given question by rewriting question into a multi-hop reasoning question based on the provided context.
    Answering the question should require the reader to make multiple logical connections or inferences using the information available in given context.
    Rules to follow when rewriting question:
    1. Ensure that the rewritten question can be answered entirely from the information present in the contexts.
    2. Do not frame questions that contains more than 15 words. Use abbreviation wherever possible.
    3. Make sure the question is clear and unambiguous.
    4. phrases like 'based on the provided context','according to the context',etc are not allowed to appear in the question.

Examples:
question: "What is the capital of France?"

context: "France is a country in Western Europe. It has several cities, including Paris, Lyon, and Marseille. Paris is not only known for its cultural landmarks like the Eiffel Tower and the Louvre Museum but also as the administrative center."

output: "Linking the Eiffel Tower and administrative center, which city stands as both?"


Your actual task: ...

output:

Conditional question prompt evolves the given question by introducing a conditional element to it.

Rewrite the provided question to increase its complexity by introducing a conditional element.
The goal is to make the question more intricate by incorporating a scenario or condition that affects the context of the question.
Follow the rules given below while rewriting the question.
    1. The rewritten question should not be longer than 25 words. Use abbreviation wherever possible.
    2. The rewritten question must be reasonable and must be understood and responded by humans.
    3. The rewritten question must be fully answerable from information present context.
    4. phrases like 'provided context','according to the context?',etc are not allowed to appear in the question.

Examples:
question: "What is the function of the roots of a plant?"

context: "The roots of a plant absorb water and nutrients from the soil, anchor the plant in the ground, and store food."

output: "What dual purpose do plant roots serve concerning soil nutrients and stability?"


Your actual task: ...

output:

Multi context question prompt evolves the given question so that answering it requires knowledge from multiple chunks.

The task is to rewrite and complicate the given question in a way that answering it requires information derived from both context1 and context2. 
    Follow the rules given below while rewriting the question.
        1. The rewritten question should not be very long. Use abbreviation wherever possible.
        2. The rewritten question must be reasonable and must be understood and responded by humans.
        3. The rewritten question must be fully answerable from information present in context1 and context2. 
        4. Read and understand both contexts and rewrite the question so that answering requires insight from both context1 and context2.
        5. phrases like 'based on the provided context','according to the context?',etc are not allowed to appear in the question.

Examples:
question: "What process turns plants green?"

context1: "Chlorophyll is the pigment that gives plants their green color and helps them photosynthesize."

context2: "Photosynthesis in plants typically occurs in the leaves where chloroplasts are concentrated."

output: "In which plant structures does the pigment responsible for their verdancy facilitate energy production?"


Your actual task: ...

output:

Question rewrite prompt rewrites a question that failed the question filtering step, using the provided feedback.

Given a context, question and feedback, rewrite the question to improve its clarity and answerability based on the feedback provided.

Examples:
context: "The Eiffel Tower was constructed using iron and was originally intended as a temporary exhibit for the 1889 World's Fair held in Paris. Despite its initial temporary purpose, the Eiffel Tower quickly became a symbol of Parisian ingenuity and an iconic landmark of the city, attracting millions of visitors each year. The tower's design, created by Gustave Eiffel, was initially met with criticism from some French artists and intellectuals, but it has since been celebrated as a masterpiece of structural engineering and architectural design."

question: "Who created the design for the Tower?"

feedback: "The question asks about the creator of the design for 'the Tower', but it does not specify which tower it refers to. There are many towers worldwide, and without specifying the exact tower, the question is unclear and unanswerable. To improve the question, it should include the name or a clear description of the specific tower in question."

output: "Who created the design for the Eiffel Tower?"


Your actual task: ...

output:

Compress question prompt not only makes the question shorter but also makes it more indirect.

Rewrite the following question to make it more indirect and shorter while retaining the essence of the original question.
    The goal is to create a question that conveys the same meaning but in a less direct manner. The rewritten question should shorter so use abbreviation wherever possible.

Examples:
question: "What is the distance between the Earth and the Moon?"

output: "How far is the Moon from Earth?"


Your actual task: ...

output:

Evolution elimination prompt checks whether two questions (simple and compressed) are equal based on the provided requirements and provides a reason for the decision. If these questions are equal, it means that Ragas wasn’t able to properly evolve the simple question throughout the evolution process, causing the evolution filter to fail.

Check if the given two questions are equal based on following requirements:
    1. They have same constraints and requirements.
    2. They have same depth and breadth of the inquiry.
    Output verdict as 1 if they are equal and 0 if they are not

Examples:
question1: "What are the primary causes of climate change?"

question2: "What factors contribute to global warming?"

output: ```{"reason": "While both questions deal with environmental issues, 'climate change' encompasses broader changes than 'global warming', leading to different depths of inquiry.", "verdict": 0}```


question1: "How does photosynthesis work in plants?"

question2: "Can you explain the process of photosynthesis in plants?"

output: ```{"reason": "Both questions ask for an explanation of the photosynthesis process in plants, sharing the same depth, breadth, and requirements for the answer.", "verdict": 1}```


question1: "What are the health benefits of regular exercise?"

question2: "Can you list the advantages of exercising regularly for health?"

output: ```{"reason": "Both questions seek information about the positive effects of regular exercise on health. They require a similar level of detail in listing the health benefits.", "verdict": 1}```


Your actual task: ...

output:

Find relevant context prompt selects relevant contexts from the list of provided contexts.

Given a question and set of contexts, find the most relevant contexts to answer the question.

Examples:
question: "What is the capital of France?"

contexts: ```["1. France is a country in Western Europe. It has several cities, including Paris, Lyon, and Marseille. Paris is not only known for its cultural landmarks like the Eiffel Tower and the Louvre Museum but also as the administrative center.", "2. The capital of France is Paris. It is also the most populous city in France, with a population of over 2 million people. Paris is known for its cultural landmarks like the Eiffel Tower and the Louvre Museum.", "3. Paris is the capital of France. It is also the most populous city in France, with a population of over 2 million people. Paris is known for its cultural landmarks like the Eiffel Tower and the Louvre Museum."]```

output: ```{"relevant_contexts": [1, 2]}```


Your actual task: ...

output:

Question answer prompt ends the generation process of a sample by answering the question using the provided context.

Answer the question using the information from the given context. Output verdict as '1' if answer is present '-1' if answer is not present in the context.

Examples:

context: "The concept of artificial intelligence (AI) has evolved over time, but it fundamentally refers to machines designed to mimic human cognitive functions. AI can learn, reason, perceive, and, in some instances, react like humans, making it pivotal in fields ranging from healthcare to autonomous vehicles."

question: "What are the key capabilities of artificial intelligence?"

answer: ```{"answer": "Artificial intelligence is designed to mimic human cognitive functions, with key capabilities including learning, reasoning, perception, and reacting to the environment in a manner similar to humans. These capabilities make AI pivotal in various fields, including healthcare and autonomous driving.", "verdict": "1"}```


context: "The novel \"Pride and Prejudice\" by Jane Austen revolves around the character Elizabeth Bennet and her family. The story is set in the 19th century in rural England and deals with issues of marriage, morality, and misconceptions."

question: "What year was 'Pride and Prejudice' published?"

answer: ```{"answer": "The answer to given question is not present in context", "verdict": "-1"}```


Your actual task: ...

answer:

Conclusion

With a total of 12 different prompts, the Ragas test set generation implementation is more complex than it might initially appear from the official documentation. Managing a significant amount of logic aimed at producing high-quality samples requires additional handling of edge cases and retries. As seen from the generation schema, these retries are costly because they involve rewriting the question, which slows down the process and increases token consumption, ultimately resulting in higher costs. The worst-case scenario occurs when the rewritten question is still unsatisfactory, causing almost the entire process to be repeated, all the way back to the node filtering step. This highlights the importance of preprocessing the documents from which the test set is derived. In some cases, it is impossible to generate a question of a certain complexity (such as a reasoning or conditional question) if the provided chunk does not contain sufficient information. Such insufficient chunks still enter the process, are likely to fail, and increase costs.

If you want to analyze the test set generation process, track which prompts are sent, and monitor costs with each iteration, stick with us. In one of the upcoming articles, we will help you understand the entire generation process without digging through the code base or treating Ragas as a black box.

Summary

Testing a RAG application requires a high-quality test set, which can be created manually or automatically using Large Language Models (LLMs). Ragas, a framework for this purpose, employs 12 different prompts to generate test sets. It balances costs by using a cheaper model for generation and a more expensive one for evaluation. The process includes careful prompt design and handling of edge cases, and it highlights the importance of preprocessing documents to ensure they contain sufficient information for generating complex questions. This meticulous approach helps avoid costly retries. By understanding the generation process in detail instead of treating it as a black box, one can optimize performance and reduce expenses.
