Dive into the practical process of synthetic test set generation with Ragas. This article showcases how AI models are applied to create diverse question types from a U.S. Code corpus, highlighting real-world testing and cost analysis in RAG applications.
The theoretical presentation of test set generation sets the stage for showcasing the results. This article explains the parameters integral to the synthetic test set generation process and concludes with a cost analysis. Building upon earlier acquired concepts, basic Ragas nuances will be revisited. By selecting AI models carefully, and thereby safeguarding the quality of the benchmark, we delve into the practical testing of the scalable RAG architecture leveraging the Ragas framework.
The test set sample in Ragas consists of the following data:
question
contexts
ground_truth
evolution_type
metadata
episode_done
There are four evolution types, i.e. question types:
simple
reasoning
conditional
multi_context
In short, each question is generated by choosing a random chunk of text (context) from the in-memory vector store. Once a simple question is generated, it evolves into more complex forms.
For testing purposes of the RAG pipeline, the legal domain of the U.S. Code was chosen. The reasons supporting this decision are listed here.
The U.S. Code consists of 54 titles. Once parsed, they contain about 15.5 million tokens when tokenized with the cl100k_base tokenizer, which is used by all of the OpenAI models in this experiment.
The question arises: how many questions are needed to conduct a high-quality benchmark? Is it 10, 100, 1,000, or 10,000? What if there’s more to it than sheer quantity? This will be addressed in one of the upcoming articles.
The selection fell on roughly 500 questions. Once the test set size is known, the source of each question remains to be determined. As previously mentioned, the U.S. Code, containing roughly 15 million tokens, serves as the data source. However, there are some drawbacks to generating a relatively small number of questions from such a vast dataset. First, let’s make these 15 million tokens more tangible, since the raw number is abstract and hard to grasp. One token is approximately 3/4 of a word, so there are about 11.25 million words. On average, 250-300 words fit on a single page. Taking the upper bound of 300 words, 15 million tokens scale to roughly 37.5 thousand pages. That is a lot of pages!
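For readers who like to verify such back-of-envelope figures, the conversion can be reproduced with a few lines of Python. The tiktoken usage below mirrors the cl100k_base tokenization mentioned earlier, while the word and page ratios are the rough assumptions stated above:

import tiktoken

# cl100k_base is the tokenizer used by the OpenAI models in this experiment.
enc = tiktoken.get_encoding("cl100k_base")
sample = "The Secretary shall have the authority to assess civil penalties."
print(len(enc.encode(sample)))  # token count of the sample sentence

# Back-of-envelope conversion of the full corpus into pages.
# Assumptions: ~0.75 words per token and ~300 words per page (upper bound).
total_tokens = 15_000_000  # rounded; the exact count is 15,480,094
total_words = total_tokens * 0.75
total_pages = total_words / 300
print(f"~{total_words / 1e6:.2f}M words, ~{total_pages:,.0f} pages")
# ~11.25M words, ~37,500 pages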
Having this order of magnitude in mind, it doesn’t make sense to process 37.5k pages to generate 500 questions. Purely intuitively, one page may contain enough material for multiple questions, while some pages may not contain enough information to form a question at all. Considering the iterative nature of the Ragas test set generator, avoiding unnecessary retries caused by an insufficient amount of information for a question of the desired complexity is beneficial in terms of both cost and time. It boils down to having a sufficient amount of information to generate a question. Interestingly, both excessive and insufficient information drive up cost and time, but they manifest differently. Slightly excessive information is preferable because the overhead falls mostly on the embedding model, while insufficiency is punished by higher LLM usage, which is more expensive.
There is another aspect to choosing the scale of the corpus from which the questions will be generated: the representativeness of the corpus. The experiment is conducted using 5 titles which together contain 2,342,281 tokens, or 15.13% of the 15,480,094 tokens in the complete corpus.
Image 1 - Number of tokens by title number
To provide a clearer understanding, here is a brief list of titles along with their full names:
Back to technical matters, the distributions parameter in Ragas determines how many questions of each type will be produced. This experiment relies on a uniform distribution, meaning that each question type is equally represented in the test set.
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional

distributions = {
    simple: 0.25,
    reasoning: 0.25,
    multi_context: 0.25,
    conditional: 0.25,
}
It’s essential to highlight that questions are generated for each title individually, even though Ragas supports combining contexts from multiple documents to generate a question. This means that the resulting test set will have fewer samples from shorter titles, proportional to their share in the corpus. For example, Title 19, with 0.16M tokens, holds a 7% share in the 2.3M token subset of the U.S. Code. If the goal is to generate 500 questions, 7% of them, or approximately 35 questions, will be from Title 19.
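As a small illustration of this proportional split, here is a sketch of the arithmetic; the rounded token figures come from the article, and the remaining titles are omitted for brevity:

# Sketch of the proportional per-title question budget described above.
# Token counts are the rounded figures from the article (Title 19: ~0.16M of ~2.3M).
TOTAL_QUESTIONS = 500
corpus_tokens = 2_300_000
title_tokens = {"Title 19": 160_000}  # remaining titles omitted for brevity

for title, tokens in title_tokens.items():
    share = tokens / corpus_tokens
    questions = round(TOTAL_QUESTIONS * share)
    print(f"{title}: {share:.0%} of the subset -> {questions} questions")
# Title 19: 7% of the subset -> 35 questions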
Test set generation is powered by AI models deployed through Azure OpenAI Studio. A key feature of Azure is its content filters, which are divided into the following categories:
hate
sexual
violence
self-harm
Each category allows for the adjustment of a severity threshold, which can be set to:
low
medium
high
Content filtering is crucial in testing the RAG application within the legal domain, particularly when handling sensitive information. In this experiment, the content filters for all categories were set to high. However, there's an important caveat: while it may seem that setting the filters to "high" would block even the smallest contentious issues within these categories, this is not the case. In practice, the "high" severity filter only removes highly problematic content, making it relatively permissive and allowing most content to pass through.
For the Ragas roles (generator, critic, and embeddings), the following AI models are employed:
gpt-4o-mini (version 2024-07-18) as the generator model, with temperature set to 0.7
gpt-4o (version 2024-05-13) as the critic model, with temperature set to 0.1
text-embedding-ada-002 (version 2) as the embedding model
The Ragas version used is 0.1.7.
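Putting these pieces together, here is a minimal sketch of how such models could be wired into the Ragas 0.1.x test set generator via LangChain's Azure OpenAI wrappers. The deployment names, API version, and file path are hypothetical, and exact signatures may differ slightly between releases:

from langchain_community.document_loaders import TextLoader
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.testset.evolutions import conditional, multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

# Hypothetical deployment names; credentials and endpoint are read from the
# standard AZURE_OPENAI_API_KEY / AZURE_OPENAI_ENDPOINT environment variables.
generator_llm = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",  # generator model, version 2024-07-18
    api_version="2024-02-01",        # illustrative API version
    temperature=0.7,
)
critic_llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",       # critic model, version 2024-05-13
    api_version="2024-02-01",
    temperature=0.1,
)
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-ada-002",  # version 2
)

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

# Uniform distribution over the four evolution types, as defined earlier.
distributions = {simple: 0.25, reasoning: 0.25, multi_context: 0.25, conditional: 0.25}

documents = TextLoader("title_19.txt").load()  # hypothetical file with one parsed title
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=35,  # Title 19's proportional share of the question budget
    distributions=distributions,
)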
The following is an analysis of the results, covering output examples, contexts, and related issues.
The outputs are structured as (context is shortened for readability):
{
"question": "What authority does the Secretary have regarding civil penalties for violations of railroad safety regulations?",
"contexts": [
"Technology Implementation Plan.—\n\n\n\nf Fatigue Management Plan.—\n\n\n\ng Consensus.—\n\n\n\nh Enforcement.—\n\nThe Secretary shall have the authority to assess civil penalties pursuant to chapter 213 for a violation of this section, including the failure to submit, certify, or comply with a safety risk reduction program, risk mitigation plan, technology implementation plan, or fatigue management plan."
],
"ground_truth": "The Secretary has the authority to assess civil penalties pursuant to chapter 213 for a violation of railroad safety regulations, including the failure to submit, certify, or comply with a safety risk reduction program, risk mitigation plan, technology implementation plan, or fatigue management plan.",
"evolution_type": "simple",
"metadata": {
"source": "00708b01-ae56-11ee-878a-91381eb3fb9c.txt"
},
"episode_done": true
}
The upcoming section presents a single question and ground_truth pair for each question (evolution) type, to provide a sense of the questions' complexity.
Simple
Q: What does the term 'food security' mean in the context of agricultural policy?
GT: The term 'food security' means access by all people at all times to sufficient food and nutrition for a healthy and productive life.
Conditional
Q: What penalties could arise for insider trading after prior warnings?
GT: If a registered entity, director, officer, agent, or employee fails or refuses to obey or comply with a cease and desist order after prior warnings, they shall be guilty of a misdemeanor and, upon conviction, shall be fined not more than $500,000 or imprisoned for not less than six months nor more than one year, or both. If the failure to comply involves an offense under section 13(a)(2), it shall be considered a felony, subjecting them to penalties under that section.
Multi-context
Q: What goals does the National Nutrition Monitoring plan have for dietary assessment and federal nutrition coordination?
GT: The National Nutrition Monitoring plan aims to establish and implement a comprehensive plan to assess the dietary and nutritional status of the people of the United States, improve the quality of national nutritional and health status data, and provide a central Federal focus for the coordination, management, and direction of Federal nutrition monitoring activities.
Reasoning
Q: What should the Secretary know about state measures before plant destruction?
GT: The Secretary may take action under this section only upon finding, after review and consultation with the Governor or other appropriate official of the State affected, that the measures being taken by the State are inadequate to eradicate the plant pest or noxious weed.
It’s important to note that, by default, the context size in Ragas is set to 1024 tokens. The term "context" can be somewhat ambiguous. In previous articles, "context" referred to the relevant chunks of text used by the LLM to generate a question. However, in the context of test set generation in Ragas, it refers to the text from which the question is derived. With that distinction in mind, let’s examine the context-related statistics.
Most comparisons will focus on evolution types, as they offer valuable insights that impact both the cost and performance of the test set generation process. Therefore, the first key metric to consider is the mean number of contexts. As highlighted in the test set recap section, Ragas may employ multiple contexts to generate a single question. According to the following diagram, the multi-context evolution type (unsurprisingly) utilizes the highest number of contexts:
Image 2 - Mean number of contexts by evolution type
Since the chunk size is around 1024 tokens, the ratio of the mean number of contexts across evolution types is preserved in the mean context size as shown below:
Image 3 - Mean context size by evolution type
However, the mean number of contexts alone doesn’t provide much insight into the distribution of different values. For example, how many contexts are typically used by certain evolution types in most cases? The following diagram helps clarify this:
Image 4 - Number of contexts used to generate a question by evolution type
Interestingly, simple questions in edge cases may require up to four contexts. Conversely, as expected, multi-context questions utilize more contexts, though they don’t go to extremes.
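These per-evolution-type statistics can be reproduced directly from the generated test set, for example with pandas on the dataframe returned by Ragas. A minimal sketch, assuming the testset object created earlier and the column names shown in the sample output:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
df = testset.to_pandas()  # columns: question, contexts, ground_truth, evolution_type, ...

# Number of contexts and their combined size (in cl100k_base tokens) per question.
df["num_contexts"] = df["contexts"].apply(len)
df["context_tokens"] = df["contexts"].apply(lambda ctxs: sum(len(enc.encode(c)) for c in ctxs))

print(df.groupby("evolution_type")["num_contexts"].mean())          # cf. Image 2
print(df.groupby("evolution_type")["context_tokens"].mean())        # cf. Image 3
print(df.groupby("evolution_type")["num_contexts"].value_counts())  # cf. Image 4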
As with anything, the test set generation in Ragas isn't flawless. One of its main drawbacks is the occasional failure to return a ground_truth, instead resulting in NaN (not a number) in Python. This issue becomes especially prominent in more complex evolution types:
Image 5 - Generated questions by evolution type
The target was to generate 510 questions in total, of which 87 failed and 423 succeeded, i.e. Ragas failed to generate 17% of the questions, which is a significant number.
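Before using the test set downstream, these failed samples can simply be filtered out; a small sketch, again assuming the dataframe produced by Ragas:

df = testset.to_pandas()

# Samples where Ragas failed to produce a ground truth end up with NaN in that column.
failed = df["ground_truth"].isna().sum()
clean_df = df.dropna(subset=["ground_truth"]).reset_index(drop=True)

print(f"Dropped {failed} of {len(df)} samples without a ground_truth.")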
Another common issue when working with AI models (unrelated to Ragas) is hitting the rate limit (exceeding TPM - Tokens Per Minute), which triggers the following error:
{
"error": {
"code": 429,
"message": "Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2023-07-01-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit."
}
}
Ragas addresses this issue through the RunConfig, which is used to adjust the parameters for exponential backoff. It’s important to note that failed requests with code 429 are not billed.
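A minimal sketch of tuning the retry behaviour is shown below. The parameter names follow the RunConfig found in Ragas 0.1.x, but the values (and the availability of run_config in generate_with_langchain_docs) should be checked against the exact release in use:

from ragas.run_config import RunConfig

# More patient exponential backoff to stay under the Azure TPM limit.
run_config = RunConfig(
    timeout=180,     # seconds before a single LLM call is abandoned
    max_retries=15,  # retries for rate-limited (429) requests
    max_wait=90,     # upper bound on the exponential backoff wait, in seconds
    max_workers=2,   # fewer concurrent requests means fewer 429 responses
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=35,
    distributions=distributions,
    run_config=run_config,
)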
The usage of AI models is billed based on token consumption as shown in the following diagram:
Image 6 - Token usage by AI model
At the time of conducting this experiment, the following costs applied per 1,000 tokens:
gpt-4o input: 0.0047€
gpt-4o output: 0.0139€
gpt-4o-mini input: 0.00014€
gpt-4o-mini output: 0.0006€
text-embedding-ada-002 input: 0.000093€
Applying these prices to the token consumption results in:
Image 7 - Price by AI model
The total price is 23.81€. It is evident that gpt-4o holds the largest share:
Image 8 - Price Distribution of AI Models
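For reference, a per-model cost can be recomputed from token counts with a small helper. The rates below come from the price list above, while the token counts in the usage example are placeholders, not the actual consumption shown in Image 6:

# Prices per 1,000 tokens, in €, as listed above.
PRICES = {
    "gpt-4o": {"input": 0.0047, "output": 0.0139},
    "gpt-4o-mini": {"input": 0.00014, "output": 0.0006},
    "text-embedding-ada-002": {"input": 0.000093, "output": 0.0},
}

def model_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Cost in € for the given model and token consumption."""
    rates = PRICES[model]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]

# Placeholder token counts purely for illustration; the real consumption is in Image 6.
print(f"{model_cost('gpt-4o', input_tokens=1_000_000, output_tokens=200_000):.2f}€")
# 7.48€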
This article explores the generation of synthetic test sets for RAG architectures using the Ragas framework. It focuses on creating diverse question types (simple, reasoning, multi-context, conditional) from U.S. Code titles. From a dataset of roughly 15.5 million tokens, the experiment narrows down to 5 titles, generating around 500 questions.
Challenges include occasional failures in generating ground truths, especially for complex questions. Efficient test set generation balances the right amount of information, avoiding inefficiencies in cost and time.
The total cost of the experiment is 23.81€, with most expenses driven by gpt-4o's usage. The article emphasizes the importance of efficient model selection and data handling for effective test generation.
The success of the synthetic test set generation process in Ragas depends heavily on loading a high-quality corpus of documents from which the questions are generated. Ideally, these documents are packed with valuable information, allowing for seamless question generation while minimizing the time and computational resources spent on retries, which become more prominent during large-scale test set generation.
Another key factor is selecting appropriate AI models for their specific roles, whether as the critic or generator model. Once valuable information is identified, the focus shifts to utilizing it effectively by selecting high-performing LLMs. In practice, while the generator model consumes the most tokens, the critic model accounts for 93.6% of the total cost of 23.81€.
The biggest limitation of Ragas is its occasional inability to produce ground truths, instead returning NaN values. This can lead to significant issues, as Ragas may fail to generate a ground truth from critical contexts, resulting in valuable information being wasted.
Once the test set is generated, it's time to move forward by activating embedding and retrieval pipelines to generate answers. Stay tuned for the next article, where we will explore this in more detail.