Explore RAG evaluation at scale to assess the effectiveness of different strategies. This article analyzes answer quality across various configurations, addressing challenges like answer coverage, comparison methods, and rate limits. It also provides insights into the cost and resource impact of running the evaluation pipeline.
A short while ago, various Ragas evaluation metrics were broken down in detail. Now it's time to apply them to real data rather than simple proof-of-concept examples. The evaluation was performed using the dataset generated and discussed in recent articles:
Evaluating a larger number of samples presents challenges in determining what can and cannot be compared regarding metric values, as the process introduces many edge cases that will be covered in detail. Without further ado, it's time to review the AI model setup and the dataset.
The following AI models are employed: gpt-4o-mini (version 2024-05-13) with temperature set to 0.1, and text-embedding-ada-002 (version 2). The Ragas version is 0.1.7.
The dataset evaluated, also referred to as the synthetic test set, was initially composed of 510 questions (samples). However, this number was reduced to 423 after excluding samples that returned NaN values for the ground_truth. The final test set includes samples structured like this:
from dataclasses import dataclass
from uuid import UUID

@dataclass
class TestSetSample:
    id: UUID
    question: str
    contexts: list[str]
    ground_truth: str
The objective was to test a wide range of configurations for each RAG strategy. In total, there were 89 carefully selected parameter configurations across the different strategies. Initially, 510 × 89 = 45,390 answers were expected for evaluation. However, due to the aforementioned issues with test set generation, this number was reduced by roughly 17%, resulting in 423 × 89 = 37,647 answers.
Each strategy configuration is represented as:
@dataclass
class Strategy:
    id: UUID
    strategy_name: str
    chunk_size: int
    limit: int
    # other parameters
    ...
Answer generation involves applying these strategies to retrieve relevant context and obtain answers from an LLM, thereby producing samples for evaluation.
@dataclass
class EvaluationSample:
    id: UUID
    test_sample_id: UUID
    strategy_id: UUID
    answer: str | None
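As a rough illustration of this step, here is a simplified sketch of the generation loop. The retrieve_context and generate_answer callables are hypothetical helpers standing in for the actual retrieval and LLM calls, not part of the article's codebase.

from uuid import uuid4

def generate_evaluation_samples(
    test_samples: list[TestSetSample],
    strategies: list[Strategy],
    retrieve_context,
    generate_answer,
) -> list[EvaluationSample]:
    samples = []
    for strategy in strategies:
        for test_sample in test_samples:
            # Build the context according to the strategy's parameters.
            contexts = retrieve_context(test_sample.question, strategy)
            # The LLM may return None when the retrieved context isn't useful.
            answer = generate_answer(test_sample.question, contexts)
            samples.append(
                EvaluationSample(
                    id=uuid4(),
                    test_sample_id=test_sample.id,
                    strategy_id=strategy.id,
                    answer=answer,
                )
            )
    return samples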
During runtime, test set sample features are fetched by id and sent to Ragas for evaluation. Recall that answer generation failed in some cases, leaving 37,466 answers. Moreover, not all generated answers are valid: some may be null when the LLM doesn't find the provided context useful for answering a particular question (notice the None in EvaluationSample). In the end, answer generation produced 29,224 answers (such samples are also referred to as answered samples later on) and 8,242 null answers (unanswered samples). Why is this distinction examined so extensively? Because certain evaluation metrics require an answer, making it necessary to differentiate between these two subsets of samples.
Once the evaluation is completed, various metrics are obtained. For answered samples, almost all metrics from this version of Ragas are utilized except for aspect critique. This exclusion is because nuances such as harmfulness and maliciousness are outside the scope of this experiment. Additionally, context relevancy is not considered reliable and has been deprecated in favor of context precision. The final result is represented as:
@dataclass
class EvaluationResult:
    id: UUID
    evaluation_sample_id: UUID
    context_precision: float
    context_recall: float
    context_entity_recall: float
    faithfulness: float | None
    answer_relevancy: float | None
    answer_similarity: float | None
    answer_correctness: float | None
None values are assigned to metrics when evaluation samples with a None answer are provided. Everything is now well-organized, allowing the connections between specific metrics, strategies, and questions to be tracked, which is important for subsequent analysis. This organization is illustrated in the following schema:
Image 1 - Complete RAG evaluation pipeline
The ragas_score is frequently emphasized in Ragas-related literature. Due to its importance, it will be studied here as well. The ragas_score is calculated as the harmonic mean of the following metrics:
context_precision
context_recall
faithfulness
answer_relevancy
Specifically, substituting the listed metrics into the general expression for the harmonic mean of n variables yields:

ragas_score = 4 / (1/context_precision + 1/context_recall + 1/faithfulness + 1/answer_relevancy)
There is more than one way to analyze the evaluation results, so let's take a step back to fully understand the options. Each question from the test set was answered with the help of the context built by various strategies, so there are multiple answers to the same question, and the metrics are calculated for each combination. Now, these strategies need to be compared, for example, by the mean value of each metric. However, there is a problem with this.
To make this concrete, suppose faithfulness is the metric of interest, a metric that requires an answer to be calculated. When comparing two strategies, one of them may yield an answer successfully, while the other may return a null. How does this affect the metric comparison between strategies? There are a few approaches to resolving this issue, such as paired and unpaired comparisons, each with pros and cons.
The paired comparison compares the metrics of various parameter configurations using the same subset of test samples, i.e., questions. Paired comparison is sensitive because if certain strategy parameter configurations fail to retrieve useful context, the LLM may be unable to answer those questions. As a result, after filtering out such samples, only a few remain for analysis, making the paired comparison non-representative.
The unpaired comparison utilizes all available samples. It sounds simple, but it is harder to interpret, and here's why.
Suppose there are two configurations, A and B, of the same strategy, say the basic one. Both have the same context size, which makes them fairly comparable in the first place. Additionally, assume their performance is measured on a subset of three questions: q1, q2, and q3. Using configuration A, the LLM managed to answer q1 and q2, but returned null for q3. On the other hand, the LLM answered all three questions with configuration B. The metric of interest is the ragas_score. Theoretically, the following results could have occurred:
Image 2 - Example of comparison
The mean values are found as:
One could naively conclude that configuration A is better solely based on the mean values, but the success rate of answering questions must be considered as well. What is meant by the term success rate is shown in diagrams like the one below.
One solution to the previous problem is the zero assignment comparison, which, as the name suggests, assigns zeros to the answer-related metrics of the unanswered samples. Now, the previous example transforms into something more informative:
However, several questions should be considered, for instance: is it fair to assign a zero to the answer-related metrics every time the LLM returns a null value?

The following sections present the evaluation results for each strategy and its various parameter configurations. In the accompanying diagrams, parameters are listed in the rows and enclosed in parentheses. Cell values represent the means, and the maximum value in each column is highlighted in bold. When it comes to the paired and unpaired comparisons, metrics for unanswered and answered samples are separated into distinct columns, while they are merged in the zero assignment comparison following the previously established principles.
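To make the three comparison modes concrete, here is a minimal pandas sketch. It assumes a hypothetical DataFrame with one row per (strategy, question) pair, where the answer-based metric columns are NaN for unanswered samples; the column names are illustrative, not the article's actual schema.

import pandas as pd

ANSWER_METRICS = ["faithfulness", "answer_relevancy"]

def compare_strategies(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # Unpaired: average whatever samples each strategy managed to answer.
    unpaired = df.groupby("strategy_id")[ANSWER_METRICS].mean()

    # Paired: keep only questions answered by every strategy, then average.
    answered = df.dropna(subset=ANSWER_METRICS)
    strategies_per_question = answered.groupby("test_sample_id")["strategy_id"].nunique()
    common_questions = strategies_per_question[
        strategies_per_question == df["strategy_id"].nunique()
    ].index
    paired = (
        answered[answered["test_sample_id"].isin(common_questions)]
        .groupby("strategy_id")[ANSWER_METRICS]
        .mean()
    )

    # Zero assignment: unanswered samples count as 0 on answer-based metrics.
    zero_assigned = (
        df.fillna({m: 0.0 for m in ANSWER_METRICS})
        .groupby("strategy_id")[ANSWER_METRICS]
        .mean()
    )
    return {"unpaired": unpaired, "paired": paired, "zero_assigned": zero_assigned}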
Before proceeding, it’s essential to understand the parameters of each strategy.
Starting with the basic strategy, the following number of samples were utilized:
A significant decrease in the number of utilized samples per configuration is immediately evident in the paired comparison, thus making it not truly representative. This data is displayed in the following three diagrams:
The first two diagrams feature two main groups, Unanswered and Answered, which make it possible to see how the context-related metrics change when the LLM is unable to answer a question. Color encoding also proves useful in analyzing such metrics, i.e., context_precision, context_recall, and context_entity_recall, which are noticeably lower for unanswered samples. This clearly indicates that these metrics effectively describe the quality of the context, as lower values correspond to contexts that the LLM didn't find useful for answering the given questions. This is the desired behavior.
Assigning zeros to the answer-dependent metrics of unanswered samples yields trends similar to those shown here. As chunk sizes decrease, the colors consistently become warmer, indicating an increase in metric values, except for context_precision.
Taking a closer look at the Answered group in the unpaired comparison, context_precision increases with chunk size. This raises the question: why does context_precision improve with larger chunk sizes when smaller chunk sizes are more effective overall in the basic strategy? The underlying reason is not immediately apparent, so let's dig a bit deeper. First, remember how context_precision is calculated: all retrieved contexts (relevant chunks) are taken into consideration. Next, consider the edge cases: configurations with chunk sizes of 256 and 2,048 tokens, which scored 0.876 and 0.98 on average, respectively. The key lies in the limit parameter, set to 8 for the 256-token configuration and 1 for the 2,048-token configuration. Simply put, when the RAG pipeline retrieves a single chunk for the latter configuration, it's scored higher than when multiple chunks are retrieved, because some of the top-ranked chunks by relevancy may not actually be relevant.
In addition, take into account that the described comparison is unpaired, meaning that certain questions that the higher chunk size configuration failed to answer are excluded, creating the illusion that larger chunk sizes perform better, although this is not the case.
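To illustrate why a single relevant chunk can outscore a longer, partially relevant list, here is a small sketch of a precision-at-k style calculation in the spirit of context_precision (not Ragas' exact implementation); the relevance flags are made up for the example.

def context_precision_sketch(relevance_flags: list[bool]) -> float:
    # Mean of precision@k taken at the positions of relevant chunks.
    precisions = []
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# limit=1 (2,048-token chunks): the single retrieved chunk is relevant.
print(context_precision_sketch([True]))  # 1.0
# limit=8 (256-token chunks): relevant chunks mixed with irrelevant ones.
print(context_precision_sketch([True, False, True, True, False, True, False, False]))  # ~0.77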
The context_entity_recall is low across all configurations, and concerns about this metric were raised here. This metric compares entities extracted from the retrieved context with those from the ground truth. In addition to the aforementioned issues, the context is much larger than the ground truth, so the LLM may skip many entities, as shown in the example below.
The original context contains 1,576 tokens (1,212 words), but it has been reduced here to 81 tokens (63 words) to avoid unnecessary clutter.
Not later than the date that is 180 days after February 24, 2016, the Commissioner shall establish a program that directs U.S. Customs and Border Protection to adjust bond amounts for importers, including new importers and nonresident importers, based on risk assessments of such importers conducted by U.S. Customs and Border Protection, in order to protect the revenue of the Federal Government.
...
The entities extracted from the context include:
{
"entities": [
"February 24, 2016",
"U.S. Customs and Border Protection",
"Federal Government",
"December 31, 2016",
"Secretary of Homeland Security",
"Import Safety Working Group",
"joint import safety rapid response plan",
"intellectual property rights",
"National Intellectual Property Rights Coordination Center",
"Pub. L. 114–125",
"title II",
"130 Stat. 148",
"Pub. L. 114–125",
"title III",
"130 Stat. 153",
"Pub. L. 99–198",
"title XVI",
"99 Stat. 1626",
"Pub. L. 103–189",
"107 Stat. 2262",
"watermelons"
]
}
The ground truth has 79 tokens (62 words):
The purpose of the importer risk assessment program established by the Commissioner is to direct U.S. Customs and Border Protection to adjust bond amounts for importers, including new importers and nonresident importers, based on risk assessments conducted by U.S. Customs and Border Protection, in order to protect the revenue of the Federal Government.
Using the same prompt template as for the context, the extracted entities are:
{
"entities": [
"importer risk assessment program",
"Commissioner",
"U.S. Customs and Border Protection",
"bond amounts",
"new importers",
"nonresident importers",
"Federal Government"
]
}
All entities except importer risk assessment program are present in the retrieved context, which was derived from the same original context used to generate the ground truth. Although the context contains nearly 20 times more tokens, only 3 times more entities were extracted from it. This imbalance significantly affects the results, as there are only two overlapping entities: U.S. Customs and Border Protection and Federal Government.
The exact context_entity_recall is found as the share of ground truth entities that also appear among the context entities, although it was expected to be very close to 1:

context_entity_recall = |context entities ∩ ground truth entities| / |ground truth entities| = 2 / 7 ≈ 0.29
Overall, raw mean values don’t provide insight into the distributions of metrics. Therefore, histograms are used to visualize them:
Image 5 - Histograms of evaluation metrics for the basic strategy
These histograms are based on the basic strategy with the (256, 0, 8) configuration and include 378 answered samples. In short, they correspond to the (256, 0, 8) row and the Answered column in the unpaired comparison diagram. Histograms are shown for a single configuration only to reduce clutter.
Proceeding with the sentence window strategy, configurations are divided into three distinct subsets based on context size.
For the context size of 2,560 tokens:
For the context size of 3,072 tokens:
For the context size of 3,584 tokens:
The paired comparison is now more representative, as there are multiple groups of configurations based on context size, with fewer items in each group compared to the basic strategy.
The auto-merging strategy utilizes two different maximum context sizes.
For the maximum context size of 8,192 tokens:
For the maximum context size of 16,384 tokens:
The paired comparison is highly applicable to the auto-merging strategy, as it proved successful in answering questions, largely due to its extensive context size.
A thorough analysis of the samples was conducted for the previous RAG strategies, so it will be skipped for the remaining strategies, including the hierarchical. Additionally, the paired comparison is excluded, as this strategy includes 48 configurations, resulting in an exceptionally small common subset of samples across configurations.
There were two flavors of the hypothetical question strategy: one with a static and another with a dynamic number_of_questions. Due to the strategy's overall poor performance, the paired comparison is excluded.
To get a sense of the scale, there were 29,224 answered questions and 8,242 unanswered questions. For a rough estimate of the number of API calls, suppose that each metric requires a single LLM call (although some need more, and purely embedding-based metrics need none). This gives 29,224 × 7 + 8,242 × 3 = 229,294 requests!
This number of requests, combined with prompts filled with contexts and detailed instructions for in-context learning, is definitely going to hit both the RPM (requests per minute) and TPM (tokens per minute) limits of the AI model deployment.
What if the evaluation pipeline crashes in the middle of the process? Since simply retrying everything isn't feasible because the evaluation is costly, it's recommended to evaluate the dataset in batches rather than all at once and to implement progress-tracking logic to know where it stopped. There is one more thing worth being aware of: it's hard to cover all edge cases at once, as some components in the pipeline that seemed reliable may suddenly start raising exceptions under higher workloads. Moving from straightforward code snippets in the docs to a robust pipeline takes time.
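As a minimal sketch of such batching with progress tracking, the snippet below assumes a hypothetical evaluate_batch callable that wraps the Ragas evaluation and a local JSON file used as a checkpoint; the names and paths are illustrative.

import json
from pathlib import Path

CHECKPOINT = Path("evaluation_progress.json")
BATCH_SIZE = 50

def load_done_ids() -> set[str]:
    # Resume from the last successful batch instead of re-evaluating everything.
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def run_evaluation(samples: list[dict], evaluate_batch) -> None:
    done = load_done_ids()
    pending = [s for s in samples if s["id"] not in done]
    for start in range(0, len(pending), BATCH_SIZE):
        batch = pending[start : start + BATCH_SIZE]
        evaluate_batch(batch)  # may raise on rate limits or transient errors
        done.update(s["id"] for s in batch)
        # Persist progress after every batch so a crash loses at most one batch.
        CHECKPOINT.write_text(json.dumps(sorted(done)))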
Another concern was the occurrence of NaN values. While they had a considerable impact during synthetic test set generation, NaN values were less frequent during evaluation, as shown in the following table:
Image 19 - Distribution of NaN values across metrics
One reason for this is that the evaluation metrics pipeline is far less complex than test set generation.
As expected, most tokens are utilized for LLM input.
Image 20 - Token usage by AI model
This translates to the following pricing diagram:
Image 21 - Price by AI model
Image 22 - Price Distribution of AI Models
Bringing all stages together, the evaluation stage emerges as the most expensive. However, there is more to consider, since the stages of the pipeline are executed in a fixed order.
When new test set samples are generated, all subsequent stages must be executed. If the test set remains unchanged and different RAG strategy configurations are to be tested, certain optimizations can be implemented. For instance, the same embeddings can be reused for different retrieval options (e.g., adjusting context size and specific parameters). In this case, test set generation and re-embedding can be skipped. However, evaluation remains inevitable and must be performed for each change in the preceding stages.
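For instance, a simple way to reuse embeddings is to cache them on disk, keyed by the chunk text and the embedding model. The embed_fn callable and file layout below are illustrative assumptions, not the article's actual implementation.

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_embedding(chunk: str, model: str, embed_fn) -> list[float]:
    # The key depends only on the chunk text and the model, so the same
    # embedding can be reused across different retrieval configurations.
    key = hashlib.sha256(f"{model}:{chunk}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    embedding = embed_fn(chunk)  # only call the embedding API on cache misses
    path.write_text(json.dumps(embedding))
    return embedding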
Image 23 - Price by stage
The total price is 203.79€.
This article presents a comprehensive evaluation of various retrieval-augmented generation (RAG) strategies using AI models, specifically analyzing how different configurations impact evaluation metrics. The study employs a synthetic dataset of questions, reduced from 510 to 423 samples after filtering for valid ground_truth entries. Across the 89 configurations, the LLM generated 37,466 answers in total, comprising 29,224 answered and 8,242 unanswered cases.
The evaluation utilized a range of metrics to assess answer quality and context relevance, with particular attention given to cases where no answer was generated. Different comparison methods, including paired, unpaired, and zero assignment, were applied to capture variations in strategy performance. Results showed that configurations with smaller chunks generally yielded better results.
Practical challenges in scaling the evaluation are also discussed, including API rate limits and the handling of NaN values. Cost considerations showed that evaluation was the most resource-intensive stage, incurring a cost of 151.75€.
The article explains three approaches to interpreting evaluation results: paired, unpaired, and zero assignment comparison. The paired approach works best when the strategy configurations being compared consistently produce results and when the number of configurations is small enough to increase the likelihood that each sample receives an answer across all configurations being tested.
An emphasis was placed on the metrics that exhibited unusual patterns, like context precision and context entity recall. Context precision penalized configurations that retrieved more chunks, while context entity recall had problems dealing with large contexts. The approach to entity extraction turned out to be insufficient, as the majority of important entities were simply skipped by the LLM. A more in-depth discussion of the metrics will be the subject of future work.
Ragas evaluation encountered significantly fewer NaN value issues than synthetic test set generation. This improvement is largely due to the evaluation pipeline's relative simplicity, which reduces potential errors.
In conclusion, Ragas provides a solid foundation for evaluating RAG metrics. However, the reliability and robustness of some metrics may be questionable when applied to larger-scale evaluations. Given the diverse range of domains and use cases for LLMs, it’s challenging to create a one-size-fits-all evaluation library. Therefore, customization is crucial to enhance the quality of evaluation metrics. One effective approach is to log LLM calls to gain deeper insights into the factors influencing metric performance.
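A minimal sketch of such logging might wrap an arbitrary llm_call callable and append every prompt/response pair to a JSONL file; the wrapper and file name are illustrative, not part of Ragas.

import json
import time
from pathlib import Path

LOG_FILE = Path("llm_calls.jsonl")

def logged_call(llm_call, prompt: str) -> str:
    start = time.time()
    response = llm_call(prompt)
    record = {
        "timestamp": start,
        "duration_s": round(time.time() - start, 3),
        "prompt": prompt,
        "response": response,
    }
    # One JSON record per call makes it easy to trace metric anomalies later.
    with LOG_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return response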
This article concludes the RAG in practice series, offering valuable insights into effective strategies and future research directions.