Explore RAG evaluation at scale to assess the effectiveness of different strategies. This article analyzes answer quality across various configurations, addressing challenges like answer coverage, comparison methods, and rate limits. It also provides insights into the cost and resource impact of running the evaluation pipeline.
A short while ago, various Ragas evaluation metrics were broken down in detail. Now it's time to apply them to real data rather than simple proof-of-concept examples. The evaluation was performed using the dataset generated and discussed in recent articles:
Evaluating a larger number of samples presents challenges in determining what can and cannot be compared regarding metric values, as the process introduces many edge cases that will be covered in detail. Without further ado, it's time to review the AI model setup and the dataset.
The following AI models are employed: gpt-4o-mini (version 2024-05-13) with temperature set to 0.1, and text-embedding-ada-002 (version 2). The Ragas version is 0.1.7.
The dataset evaluated, also referred to as the synthetic test set, was initially composed of 510 questions (samples). However, this number was reduced to 423 after excluding samples that returned NaN values for the ground_truth. The final test set includes samples structured like this:
from dataclasses import dataclass
from uuid import UUID

@dataclass
class TestSetSample:
    id: UUID
    question: str
    contexts: list[str]
    ground_truth: str
The objective was to test a wide range of configurations for each RAG strategy. In total, there were 89 carefully selected parameter configurations across the different strategies. Initially, 510 × 89 = 45,390 answers were expected for evaluation. However, due to the aforementioned issues with test set generation, this number was reduced by roughly 17%, resulting in 423 × 89 = 37,647 answers.
Each strategy configuration is represented as:
@dataclass
class Strategy:
    id: UUID
    strategy_name: str
    chunk_size: int
    limit: int
    # other parameters
    ...
Answer generation involves applying these strategies to retrieve relevant context and obtain answers from an LLM, thereby producing samples for evaluation.
@dataclass
class EvaluationSample:
    id: UUID
    test_sample_id: UUID
    strategy_id: UUID
    answer: str | None
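As a rough illustration of this step, here is a simplified sketch of the generation loop. The retrieve_context and generate_answer callables are hypothetical helpers standing in for the actual retrieval and LLM calls, not part of the article's codebase.

from uuid import uuid4

def generate_evaluation_samples(
    test_samples: list[TestSetSample],
    strategies: list[Strategy],
    retrieve_context,
    generate_answer,
) -> list[EvaluationSample]:
    samples = []
    for strategy in strategies:
        for test_sample in test_samples:
            # Build the context according to the strategy's parameters.
            contexts = retrieve_context(test_sample.question, strategy)
            # The LLM may return None when the retrieved context isn't useful.
            answer = generate_answer(test_sample.question, contexts)
            samples.append(
                EvaluationSample(
                    id=uuid4(),
                    test_sample_id=test_sample.id,
                    strategy_id=strategy.id,
                    answer=answer,
                )
            )
    return samples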
During runtime, test set sample features are fetched by id and sent to Ragas for evaluation. Recall that answer generation failed in some cases, leaving 37,466 answers. Moreover, not all generated answers are valid: some may be null when the LLM doesn't find the provided context useful for answering a particular question (notice the None in EvaluationSample). In the end, answer generation produced 29,224 answers (such samples are also referred to as answered samples later on) and 8,242 null answers (unanswered samples). Why is this distinction examined so extensively? Because certain evaluation metrics require an answer, making it necessary to differentiate between these two subsets of samples.
Once the evaluation is completed, various metrics are obtained. For answered samples, almost all metrics from this version of Ragas are utilized except for aspect critique. This exclusion is because nuances such as harmfulness and maliciousness are outside the scope of this experiment. Additionally, context relevancy is not considered reliable and has been deprecated in favor of context precision. The final result is represented as:
@dataclass
class EvaluationResult:
    id: UUID
    evaluation_sample_id: UUID
    context_precision: float
    context_recall: float
    context_entity_recall: float
    faithfulness: float | None
    answer_relevancy: float | None
    answer_similarity: float | None
    answer_correctness: float | None
None values are assigned to metrics when evaluation samples with a None answer are provided. Everything is now well-organized, allowing the connections between specific metrics, strategies, and questions to be tracked, which is important for subsequent analysis. This organization is illustrated in the following schema:
Image 1 - Complete RAG evaluation pipeline
The ragas_score is frequently emphasized in Ragas-related literature. Due to its importance, it will be studied here as well. The ragas_score is calculated as the harmonic mean of the following metrics:
context_precision
context_recall
faithfulness
answer_relevancy
Specifically, substituting the listed metrics into the general expression for the harmonic mean of n variables yields:

ragas_score = 4 / (1/context_precision + 1/context_recall + 1/faithfulness + 1/answer_relevancy)
There is more than one way to analyze the evaluation results, so let's take a step back to fully understand the options. Each question from the test set was answered with the help of the context built by various strategies, so there are multiple answers to the same question, and the metrics are calculated for each combination. Now, these strategies need to be compared, for example, by the mean value of each metric. However, there is a problem with this.
To make this concrete, suppose faithfulness is the metric of interest, a metric that requires an answer to be calculated. When comparing two strategies, one of them may yield an answer successfully, while the other may return a null. How does this affect the metric comparison between strategies? There are a few approaches to resolving this issue, such as paired and unpaired comparisons, each with pros and cons.
The paired comparison compares the metrics of various parameter configurations using the same subset of test samples, i.e., questions. Paired comparison is sensitive because if certain strategy parameter configurations fail to retrieve useful context, the LLM may be unable to answer those questions. As a result, after filtering out such samples, only a few remain for analysis, making the paired comparison non-representative.
The unpaired comparison utilizes all available samples. It sounds simple, but it is harder to interpret, and here's why.
Suppose there are two configurations, A and B, of the same strategy, say the basic one. Both have the same context size, which makes them fairly comparable in the first place. Additionally, assume their performance is measured on a subset of three questions: q1, q2, and q3. Using configuration A, the LLM managed to answer q1 and q2, but returned null for q3. On the other hand, the LLM answered all three questions with configuration B. The metric of interest is the ragas_score. Theoretically, the following results could have occurred:
Image 2 - Example of comparison
The mean values are found as:
One could naively conclude that configuration A is better solely based on the mean values, but the success rate of answering questions must be considered as well. What is meant by the term success rate is shown in diagrams like the one below.
One solution to the previous problem is the zero assignment comparison, which, as the name suggests, assigns zeros to the answer-related metrics of the unanswered samples. Now, the previous example transforms into something more informative:
However, several questions should be considered, for instance: is it fair to assign a zero to the answer-related metrics every time the LLM returns a null value?

The following sections present the evaluation results for each strategy and its various parameter configurations. In the accompanying diagrams, parameters are listed in the rows and enclosed in parentheses. Cell values represent the means, and the maximum value in each column is highlighted in bold. When it comes to the paired and unpaired comparisons, metrics for unanswered and answered samples are separated into distinct columns, while they are merged in the zero assignment comparison following the previously established principles.
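To make the three comparison modes concrete, here is a minimal pandas sketch. It assumes a hypothetical DataFrame with one row per (strategy, question) pair, where the answer-based metric columns are NaN for unanswered samples; the column names are illustrative, not the article's actual schema.

import pandas as pd

ANSWER_METRICS = ["faithfulness", "answer_relevancy"]

def compare_strategies(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # Unpaired: average whatever samples each strategy managed to answer.
    unpaired = df.groupby("strategy_id")[ANSWER_METRICS].mean()

    # Paired: keep only questions answered by every strategy, then average.
    answered = df.dropna(subset=ANSWER_METRICS)
    strategies_per_question = answered.groupby("test_sample_id")["strategy_id"].nunique()
    common_questions = strategies_per_question[
        strategies_per_question == df["strategy_id"].nunique()
    ].index
    paired = (
        answered[answered["test_sample_id"].isin(common_questions)]
        .groupby("strategy_id")[ANSWER_METRICS]
        .mean()
    )

    # Zero assignment: unanswered samples count as 0 on answer-based metrics.
    zero_assigned = (
        df.fillna({m: 0.0 for m in ANSWER_METRICS})
        .groupby("strategy_id")[ANSWER_METRICS]
        .mean()
    )
    return {"unpaired": unpaired, "paired": paired, "zero_assigned": zero_assigned}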
Before proceeding, it’s essential to understand the parameters of each strategy.
Starting with the basic strategy, the following number of samples were utilized:
A significant decrease in the number of utilized samples per configuration is immediately evident in the paired comparison, thus making it not truly representative. This data is displayed in the following three diagrams:
The first two diagrams feature two main groups, Unanswered and Answered, which make it possible to see how the context-related metrics change when the LLM is unable to answer a question. Color encoding also proves useful in analyzing such metrics, i.e., context_precision, context_recall, and context_entity_recall, which are noticeably lower for unanswered samples. This clearly indicates that these metrics effectively describe the quality of the context, as lower values correspond to contexts that the LLM didn't find useful for answering the given questions. This is the desired behavior.
Assigning zeros to the answer-dependent metrics of unanswered samples yields trends similar to those shown here. As chunk sizes decrease, the colors consistently become warmer, indicating an increase in metric values, except for context_precision.
Taking a closer look at the Answered group in the unpaired comparison, context_precision increases with chunk size. This raises the question: why does context_precision improve with larger chunk sizes when smaller chunk sizes are more effective overall in the basic strategy? The underlying reason is not immediately apparent, so let's dig a bit deeper. First, remember how context_precision is calculated: all retrieved contexts (relevant chunks) are taken into consideration. Next, consider the edge cases: configurations with chunk sizes of 256 and 2,048 tokens, which scored 0.876 and 0.98 on average, respectively. The key lies in the limit parameter, set to 8 for the 256-token configuration and 1 for the 2,048-token configuration. Simply put, when the RAG pipeline retrieves a single chunk for the latter configuration, it's scored higher than when multiple chunks are retrieved, because some of the top-ranked chunks by relevancy may not actually be relevant.
In addition, take into account that the described comparison is unpaired, meaning that certain questions that the higher chunk size configuration failed to answer are excluded, creating the illusion that larger chunk sizes perform better, although this is not the case.
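To illustrate why a single relevant chunk can outscore a longer, partially relevant list, here is a small sketch of a precision-at-k style calculation in the spirit of context_precision (not Ragas' exact implementation); the relevance flags are made up for the example.

def context_precision_sketch(relevance_flags: list[bool]) -> float:
    # Mean of precision@k taken at the positions of relevant chunks.
    precisions = []
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# limit=1 (2,048-token chunks): the single retrieved chunk is relevant.
print(context_precision_sketch([True]))  # 1.0
# limit=8 (256-token chunks): relevant chunks mixed with irrelevant ones.
print(context_precision_sketch([True, False, True, True, False, True, False, False]))  # ~0.77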
The context_entity_recall is low across all configurations, and concerns about this metric were raised here. This metric compares entities extracted from the retrieved context with those from the ground truth. In addition to the aforementioned issues, the context is much larger than the ground truth, so the LLM may skip many entities, as shown in the example below.
The original context contains 1,576 tokens (1,212 words), but it has been reduced here to 81 tokens (63 words) to avoid unnecessary clutter.
Not later than the date that is 180 days after February 24, 2016, the Commissioner shall establish a program that directs U.S. Customs and Border Protection to adjust bond amounts for importers, including new importers and nonresident importers, based on risk assessments of such importers conducted by U.S. Customs and Border Protection, in order to protect the revenue of the Federal Government.
...
The entities extracted from the context include:
{
"entities": [
"February 24, 2016",
"U.S. Customs and Border Protection",
"Federal Government",
"December 31, 2016",
"Secretary of Homeland Security",
"Import Safety Working Group",
"joint import safety rapid response plan",
"intellectual property rights",
"National Intellectual Property Rights Coordination Center",
"Pub. L. 114–125",
"title II",
"130 Stat. 148",
"Pub. L. 114–125",
"title III",
"130 Stat. 153",
"Pub. L. 99–198",
"title XVI",
"99 Stat. 1626",
"Pub. L. 103–189",
"107 Stat. 2262",
"watermelons"
]
}
The ground truth has 79 tokens (62 words):
The purpose of the importer risk assessment program established by the Commissioner is to direct U.S. Customs and Border Protection to adjust bond amounts for importers, including new importers and nonresident importers, based on risk assessments conducted by U.S. Customs and Border Protection, in order to protect the revenue of the Federal Government.
Using the same prompt template as for the context, the extracted entities are:
{
"entities": [
"importer risk assessment program",
"Commissioner",
"U.S. Customs and Border Protection",
"bond amounts",
"new importers",
"nonresident importers",
"Federal Government"
]
}
All entities except importer risk assessment program are present in the retrieved context, which was derived from the same original context used to generate the ground truth. Although the context contains nearly 20 times more tokens, only 3 times more entities were extracted from it. This imbalance significantly affects the results, as there are only two overlapping entities: U.S. Customs and Border Protection and Federal Government.
The exact context_entity_recall is found as the share of ground truth entities that also appear among the context entities, although it was expected to be very close to 1:

context_entity_recall = |context entities ∩ ground truth entities| / |ground truth entities| = 2 / 7 ≈ 0.29
Overall, raw mean values don’t provide insight into the distributions of metrics. Therefore, histograms are used to visualize them:
Image 5 - Histograms of evaluation metrics for the basic strategy
These histograms are based on the basic strategy with the (256, 0, 8) configuration and include 378 answered samples. In short, they correspond to the (256, 0, 8) row and the Answered column in the unpaired comparison diagram. Histograms are shown for a single configuration only to reduce clutter.
Proceeding with the sentence window strategy, configurations are divided into three distinct subsets based on context size.
For the context size of 2,560 tokens:
For the context size of 3,072 tokens:
For the context size of 3,584 tokens:
The paired comparison is now more representative, as there are multiple groups of configurations based on context size, with fewer items in each group compared to the basic strategy.
The auto-merging strategy utilizes two different maximum context sizes.
For the maximum context size of 8,192 tokens:
For the maximum context size of 16,384 tokens:
The paired comparison is highly applicable to the auto-merging strategy, as it proved successful in answering questions, largely due to its extensive context size.
A thorough analysis of the samples was conducted for the previous RAG strategies, so it will be skipped for the remaining strategies, including the hierarchical. Additionally, the paired comparison is excluded, as this strategy includes 48 configurations, resulting in an exceptionally small common subset of samples across configurations.
There were two flavors of the hypothetical question strategy: one with a static and another with a dynamic number_of_questions. Due to the strategy's overall poor performance, the paired comparison is excluded.
To get a sense of the scale, there were 29,224 answered questions and 8,242 unanswered questions. For a rough estimate of the number of API calls, suppose that each metric requires a single LLM call (although some need more, and purely embedding-based metrics need none). This gives 29,224 × 7 + 8,242 × 3 = 229,294 requests!
This number of requests, combined with prompts filled with contexts and detailed instructions for in-context learning, is definitely going to hit both the RPM (requests per minute) and TPM (tokens per minute) limits of the AI model deployment.
What if the evaluation pipeline crashes in the middle of the process? Since simply retrying everything isn't feasible because the evaluation is costly, it's recommended to evaluate the dataset in batches rather than all at once and to implement progress-tracking logic to know where it stopped. There is one more thing worth being aware of: it's hard to cover all edge cases at once, as some components in the pipeline that seemed reliable may suddenly start raising exceptions under higher workloads. Moving from straightforward code snippets in the docs to a robust pipeline takes time.
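As a minimal sketch of such batching with progress tracking, the snippet below assumes a hypothetical evaluate_batch callable that wraps the Ragas evaluation and a local JSON file used as a checkpoint; the names and paths are illustrative.

import json
from pathlib import Path

CHECKPOINT = Path("evaluation_progress.json")
BATCH_SIZE = 50

def load_done_ids() -> set[str]:
    # Resume from the last successful batch instead of re-evaluating everything.
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def run_evaluation(samples: list[dict], evaluate_batch) -> None:
    done = load_done_ids()
    pending = [s for s in samples if s["id"] not in done]
    for start in range(0, len(pending), BATCH_SIZE):
        batch = pending[start : start + BATCH_SIZE]
        evaluate_batch(batch)  # may raise on rate limits or transient errors
        done.update(s["id"] for s in batch)
        # Persist progress after every batch so a crash loses at most one batch.
        CHECKPOINT.write_text(json.dumps(sorted(done)))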
Another concern was the occurrence of NaN values. While they had a considerable impact during synthetic test set generation, NaN values were less frequent during evaluation, as shown in the following table:
Image 19 - Distribution of NaN values across metrics
One reason for this is that the evaluation metrics pipeline is far less complex than test set generation.
As expected, most tokens are utilized for LLM input.
Image 20 - Token usage by AI model
This translates to the following pricing diagram:
Image 21 - Price by AI model
Image 22 - Price Distribution of AI Models
Bringing all stages together, the evaluation stage emerges as the most expensive. However, there is more to consider, since the stages of the pipeline are executed in a fixed order.
When new test set samples are generated, all subsequent stages must be executed. If the test set remains unchanged and different RAG strategy configurations are to be tested, certain optimizations can be implemented. For instance, the same embeddings can be reused for different retrieval options (e.g., adjusting context size and specific parameters). In this case, test set generation and re-embedding can be skipped. However, evaluation remains inevitable and must be performed for each change in the preceding stages.
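For instance, a simple way to reuse embeddings is to cache them on disk, keyed by the chunk text and the embedding model. The embed_fn callable and file layout below are illustrative assumptions, not the article's actual implementation.

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_embedding(chunk: str, model: str, embed_fn) -> list[float]:
    # The key depends only on the chunk text and the model, so the same
    # embedding can be reused across different retrieval configurations.
    key = hashlib.sha256(f"{model}:{chunk}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    embedding = embed_fn(chunk)  # only call the embedding API on cache misses
    path.write_text(json.dumps(embedding))
    return embedding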
Image 23 - Price by stage
The total price is 203.79€.
This article presents a comprehensive evaluation of various retrieval-augmented generation (RAG) strategies using AI models, specifically analyzing how different configurations impact evaluation metrics. The study employs a synthetic dataset of questions, reduced from 510 to 423 samples after filtering for valid ground_truth entries. Across the 89 configurations, the LLM generated 37,466 answers in total, comprising 29,224 answered and 8,242 unanswered cases.
The evaluation utilized a range of metrics to assess answer quality and context relevance, with particular attention given to cases where no answer was generated. Different comparison methods, including paired, unpaired, and zero assignment, were applied to capture variations in strategy performance. Results showed that configurations with smaller chunks generally yielded better results.
Practical challenges in scaling the evaluation are also discussed, including API rate limits and the handling of NaN values. Cost considerations showed that evaluation was the most resource-intensive stage, incurring a cost of 151.75€.
The article explains three approaches to interpreting evaluation results: paired, unpaired, and zero assignment comparison. The paired approach works best when the strategy configurations being compared consistently produce results and when the number of configurations is small enough to increase the likelihood that each sample receives an answer across all configurations being tested.
An emphasis was placed on the metrics that exhibited unusual patterns, like context precision and context entity recall. Context precision penalized configurations that retrieved more chunks, while context entity recall had problems dealing with large contexts. The approach to entity extraction turned out to be insufficient, as the majority of important entities were simply skipped by the LLM. A more in-depth discussion of the metrics will be the subject of future work.
Ragas evaluation encountered significantly fewer NaN value issues than synthetic test set generation. This improvement is largely due to the evaluation pipeline's relative simplicity, which reduces potential errors.
In conclusion, Ragas provides a solid foundation for evaluating RAG metrics. However, the reliability and robustness of some metrics may be questionable when applied to larger-scale evaluations. Given the diverse range of domains and use cases for LLMs, it’s challenging to create a one-size-fits-all evaluation library. Therefore, customization is crucial to enhance the quality of evaluation metrics. One effective approach is to log LLM calls to gain deeper insights into the factors influencing metric performance.
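A minimal sketch of such logging might wrap an arbitrary llm_call callable and append every prompt/response pair to a JSONL file; the wrapper and file name are illustrative, not part of Ragas.

import json
import time
from pathlib import Path

LOG_FILE = Path("llm_calls.jsonl")

def logged_call(llm_call, prompt: str) -> str:
    start = time.time()
    response = llm_call(prompt)
    record = {
        "timestamp": start,
        "duration_s": round(time.time() - start, 3),
        "prompt": prompt,
        "response": response,
    }
    # One JSON record per call makes it easy to trace metric anomalies later.
    with LOG_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return response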
This article concludes the RAG in practice series, offering valuable insights into effective strategies and future research directions.