
RAG in practice - Answer Generation

Discover how various parameters affect performance in Retrieval-Augmented Generation (RAG). This article explores factors like chunk size, overlap, and context size, examining their impact on the large language model's ability to answer questions. It also delves into error analysis and the cost breakdown of running the full pipeline, including token usage and pricing considerations.

Development · 25 min read
Luka Panic
2024-10-09

Large Language Models
AI Blog Series
Retrieval Strategy
Retrieval Augmented Generation

Introduction

With the synthetic test set in place and the corpus embedded, the next challenge is to answer the questions. This process occurs in three phases: Retrieval, Augmentation, and Generation (RAG), with a detailed explanation of each to follow. The previous article focused on understanding various embedding parameters and their role in different strategies. Now, it’s time to explore how adjusting these parameters impacts the LLM's ability to answer questions using retrieved contexts and how the properties of these contexts influence the reasoning process.

Setup

The following AI models are employed:

  • LLM: gpt-4o-mini version 2024-05-13 with temperature set to 0.1
  • embedding model: text-embedding-ada-002 version 2

Retrieval

To supplement the table of embedding parameters, the retrieval-specific parameters are illustrated in the same format (the embedding parameters are also used in the retrieval):

Image 1 - Retrieval parameters

The top n relevant chunks are retrieved based on the given question. If these chunks belong to the same document, they are ordered by their position, or index, within that document. The number of retrieved chunks is controlled by the limit parameter, but how is this limit determined, and what factors influence it? To answer these questions, it is necessary to consider the outcomes of the answer generation process. The goal is to compare the effectiveness of different strategies, as well as various configurations within the same strategy. A key criterion in this comparison is fairness. It is common to compare the results of two mathematical models by varying a single parameter while keeping the others constant.

However, in RAG strategies, this approach is not directly applicable, and here’s why. The retrieval process, i.e., search for relevant chunks, in a basic strategy involves three parameters: chunk size, overlap, and limit. The standard approach of varying a single parameter works when adjusting chunk overlap while keeping chunk size and limit constant. For example, with a chunk size of 256 and a limit of 4 (forming a context of 1024 tokens), the overlap can be adjusted (e.g., 0%, 10%, 20%). The issue arises when chunk size is varied while the limit remains fixed, or vice versa. In such cases, the context size changes. For instance, comparing chunk sizes of 256 and 1024 with a fixed limit of 1 results in the latter configuration having a context four times larger, and thus four times more expensive. Therefore, the fairness criterion for comparison should be based on context size. Specifically, varying multiple retrieval parameters is acceptable as long as the context size remains constant.

How are context sizes calculated across different strategies? Is it even possible to find the exact context size for each strategy? In the upcoming sections, expressions for calculating context size will be presented for each strategy, serving as an intermediate step in defining the retrieval parameters.
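Before deriving those expressions, here is a minimal sketch of the retrieval step described above: a top-n vector search followed by restoring the reading order of chunks that belong to the same document. The vector store client, its search method, and the field names are assumptions for illustration, not the exact implementation used in this experiment.

from typing import Any

def retrieve_chunks(question_embedding: list[float], vector_store: Any, limit: int) -> list[dict]:
    """Retrieve the top `limit` relevant chunks for an embedded question.

    `vector_store.search` is a hypothetical client method returning dicts with
    `document_id`, `chunk_index`, `text`, and `score` keys.
    """
    hits = vector_store.search(embedding=question_embedding, top_k=limit)
    # Relevance decides which chunks are selected; chunks from the same document
    # are then put back into reading order by their index within that document.
    return sorted(hits, key=lambda hit: (hit["document_id"], hit["chunk_index"]))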

Basic

In the basic strategy, the context size is simply the product of chunk size and limit:

context\ size = chunk\ size × limit

The limit can be derived as:

limit = \frac{context\ size}{chunk\ size}

The context size is a matter of convention; for this study, it is set to 2048 tokens. The chunk size is known from the basic strategy embedding parameters listed in the previous article and reproduced in the following table:

Image 2 - Basic strategy embedding configurations

Applying the expression for the limit yields the following retrieval configurations:

Image 3 - Basic strategy retrieval configurations
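As a quick cross-check, the limits in the table above follow directly from the expression for the limit. The intermediate chunk sizes below are illustrative assumptions; the exact values come from the embedding configurations in the previous article.

CONTEXT_SIZE = 2048  # tokens, fixed for the basic strategy

def basic_limit(chunk_size: int, context_size: int = CONTEXT_SIZE) -> int:
    """Number of retrieved chunks such that chunk_size * limit == context_size."""
    if context_size % chunk_size != 0:
        raise ValueError("context size must be a multiple of chunk size")
    return context_size // chunk_size

# A 256-token chunk size yields a limit of 8, a 2048-token chunk size a limit of 1.
print({size: basic_limit(size) for size in (256, 512, 1024, 2048)})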

Sentence window

Although the strategy is called the sentence window, in this case, neighbor chunks are used instead of individual sentences. These neighboring chunks are selected based on a parameter called range. The range is simply an integer that specifies how many chunks are taken before and after the chunk identified as relevant by the initial vector search.

To illustrate this, if the limit is set to 3 and the range is set to 2, the system will first retrieve the top 3 most relevant chunks. Then, for each of these chunks, 2 chunks before and 2 chunks after will be added to the final selection. This translates into the following expression for the context size:

context\ size = chunk\ size × limit × (1 + 2 × range)

Proceeding to determine the limit: as in the basic strategy and all subsequent strategies, the limit is derived from the context size, in this case as limit = context size / (chunk size × (1 + 2 × range)). However, the range is also unknown and must be set to an integer value such that the limit also comes out as an integer. Since (1 + 2 × range) is always odd, it was decided to use three different context sizes instead of the standard 2048 tokens (a sketch for deriving valid combinations follows the list below):

  • 2560
  • 3072
  • 3584
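Under these context sizes, deriving valid combinations can look as follows: the limit equals the context size divided by chunk size × (1 + 2 × range) and is kept only when it comes out as an integer. The candidate chunk sizes and ranges in this sketch are assumptions for illustration.

def sentence_window_limit(chunk_size: int, range_: int, context_size: int) -> int | None:
    """limit = context_size / (chunk_size * (1 + 2 * range)); None if not an integer."""
    tokens_per_hit = chunk_size * (1 + 2 * range_)
    if context_size % tokens_per_hit != 0:
        return None
    return context_size // tokens_per_hit

for context_size in (2560, 3072, 3584):
    for chunk_size in (128, 256, 512):  # illustrative chunk sizes
        for range_ in (1, 2, 3):        # illustrative ranges
            limit = sentence_window_limit(chunk_size, range_, context_size)
            if limit is not None:
                print(f"context={context_size}, chunk={chunk_size}, range={range_}, limit={limit}")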

The upcoming tables contain the retrieval configurations for each context size respectively:

Image 4 - Sentence window strategy retrieval configurations (context size: 2560)

Image 5 - Sentence window strategy retrieval configurations (context size: 3072)

Image 6 - Sentence window strategy retrieval configurations (context size: 3584)

Auto-merging

Auto-merging is a bit more complex. Within the scope of this experiment, it supports only the minimal requirement of two levels of chunking. Therefore, the following calculations are based on this assumption.

The complexity stems from potential outliers, specifically child chunks that do not get merged. While it is not possible to define the exact context size for auto-merging, its upper and lower boundaries can be determined. Examples can help clarify this, so let’s start with the first one.

Ex. 1: Assuming the following retrieval parameters are provided:

chunk_size = 512
child_chunk_size = 128
merging_threshold = 0.5
limit = 2

and the scenario illustrated below occurs:

Image 7 - Illustration of Ex. 1

The small squares represent child chunks, while the dashed squares surrounding them indicate parent chunks. The filled squares represent chunks retrieved through the vector search, and the hatched squares are those merged into the parent chunk when the merging threshold is met.

The first step is to examine edge cases, specifically when the maximum context is obtained versus the minimum. The example showcases the maximum context scenario, as two relevant chunks were found within the same parent, and the merging threshold of 0.5 (or 50%) is met. This allows the remaining child chunks to be merged, forming a parent chunk. Therefore, the context size can be expressed as a product of the maximum number of parent chunks and the parent chunk size:

context\ size_{max} = parent\ chunks_{max} × chunk\ size

The expression for the upper bound is now in place, but the maximum number of parent chunks is still unknown. It can be calculated using the following ratio:

parent\ chunks_{max} = \frac{child\ chunk\ size × limit}{chunk\ size × merging\ threshold}

where the numerator represents the context size of basic retrieval, before the merging step. The denominator indicates the minimum total size of child chunks within a parent required for merging to occur. Substituting the values yields:

parent\ chunks_{max} = \frac{128}{512} × \frac{2}{0.5} = 1

context\ size_{max} = 1 × 512 = 512

which aligns with the provided sketch.

Ex. 2: This example uses the same parameter values as in Ex. 1, but with the limit set to 3. The theoretical retrieval outcome is shown in the image below:

Image 8 - Illustration of Ex. 2

Substituting into the previously established expressions gives:

parent\ chunks_{max} = \frac{128}{512} × \frac{3}{0.5} = 1.5

context\ size_{max} = 1.5 × 512 = 768

This is incorrect! In this case, there is an outlier: a child chunk that was not merged. Therefore, a mechanism is needed to account for it. The maximum number of parent chunks is 1.5, but it should be an integer. Closer inspection reveals that the information about the outlier is contained in the remainder. Applying the floor function to the existing expression provides the actual maximum number of parent chunks:

parent\ chunks_{max} = \left\lfloor\frac{child\ chunk\ size × limit}{chunk\ size × merging\ threshold}\right\rfloor

The outlier can be addressed through the remainder which, divided by the child chunk size, gives the minimum number of unmerged child chunks:

child\ chunks_{min} = \frac{(child\ chunk\ size × limit) \mod (chunk\ size × merging\ threshold)}{child\ chunk\ size}

Remember, obtaining the maximum context size requires the maximum number of parent chunks and the minimum number of child chunks. Now, these can finally be composed together:

context\ size_{max} = parent\ chunks_{max} × chunk\ size + child\ chunks_{min} × child\ chunk\ size

Substituting the given values results in the correct context size:

parent\ chunks_{max} = \left\lfloor\frac{128}{512} × \frac{3}{0.5}\right\rfloor = \left\lfloor 1.5\right\rfloor = 1

child\ chunks_{min} = \frac{(128 × 3) \mod (512 × 0.5)}{128} = \frac{128}{128} = 1

context\ size_{max} = 1 × 512 + 1 × 128 = 640

Ex. 3: This example uses the same parameter values as in Ex. 1, but the retrieval outcome, represented in the sketch, is slightly different:

Image 9 - Illustration of Ex. 3

In this case, two outliers are indicated by the absence of hatched squares, as the merging threshold was not met. This example illustrates a scenario with the minimum context size. It’s the most straightforward case:

context\ size_{min} = child\ chunk\ size × limit
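With both bounds derived, a small helper can compute them together; plugging in the Ex. 2 values reproduces the 640-token maximum and the 384-token minimum. This is a sketch mirroring the formulas above under the two-level chunking assumption, not the exact implementation.

import math

def auto_merging_context_bounds(
    chunk_size: int,
    child_chunk_size: int,
    merging_threshold: float,
    limit: int,
) -> tuple[int, int]:
    """Return (min, max) context size in tokens for two-level auto-merging."""
    merge_unit = chunk_size * merging_threshold  # child tokens within one parent needed to merge
    retrieved = child_chunk_size * limit         # tokens retrieved before any merging
    parent_chunks_max = math.floor(retrieved / merge_unit)
    child_chunks_min = int(retrieved % merge_unit) // child_chunk_size
    context_max = parent_chunks_max * chunk_size + child_chunks_min * child_chunk_size
    context_min = retrieved                      # nothing gets merged at all
    return context_min, context_max

# Ex. 1: (512, 128, 0.5, 2) -> (256, 512); Ex. 2: (512, 128, 0.5, 3) -> (384, 640)
print(auto_merging_context_bounds(512, 128, 0.5, 3))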

Once the expressions for the maximum and minimum context sizes are derived, a limit can be established. A decision must be made between adhering to the upper or lower bound. In this case, the lower bound is selected, meaning the limit will be based on the minimum context size of 2048 tokens. However, even with the minimum fixed, the maximum context size may vary. To ensure a fair comparison, these cases should be treated separately. The following table displays the retrieval configurations for a maximum context size of 8192 tokens:

Image 10 - Auto-merging strategy retrieval configurations (max context size: 8192)

and for a maximum context size of 16384 tokens:

Image 11 - Auto-merging strategy retrieval configurations (max context size: 16384)

Another key factor to consider is the merging threshold. Note that while the minimum context size doesn’t depend on it, the maximum does. When selecting the merging threshold, the relationship between child and parent chunk sizes must be considered. For instance, if a parent consists of four chunks, setting the threshold to 0.33 wouldn’t make sense. Meaningful values would be 0.25, 0.5, or 0.75. Additionally, the merging threshold affects the limit, which must remain an integer value.

Hierarchical

There are many different flavors of each strategy, meaning nothing is set in stone. However, it’s important to highlight the nuances of a specific implementation. In this experiment, the hierarchical strategy is realized through two levels of hierarchy. At the first level, the chunks are summarized. The limit parameter determines the number of summaries searched at the first level. Then, within each summary, the child limit defines how many child chunks are searched at the second level of the hierarchy. Finally, the context size is defined as:

context\ size = child\ chunk\ size × limit × child\ limit
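Since the context size is fixed, the admissible (limit, child limit) pairs for a given child chunk size can be enumerated directly. The sketch below simply lists every factorization of the required number of child chunks; the concrete values are illustrative.

def hierarchical_limits(child_chunk_size: int, context_size: int) -> list[tuple[int, int]]:
    """All (limit, child_limit) pairs with child_chunk_size * limit * child_limit == context_size."""
    total_child_chunks = context_size // child_chunk_size
    return [
        (limit, total_child_chunks // limit)
        for limit in range(1, total_child_chunks + 1)
        if total_child_chunks % limit == 0
    ]

# With 128-token child chunks and a 2048-token context, limit * child_limit must equal 16:
# (1, 16), (2, 8), (4, 4), (8, 2), (16, 1)
print(hierarchical_limits(128, 2048))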

Proceeding to the retrieval configurations, a total of 48 were selected. To avoid clutter, only a few are shown in the table, but enough to get an idea:

Image 12 - Hierarchical strategy retrieval configurations

Hypothetical question

Similar to the basic strategy, the context size is found as:

context\ size = chunk\ size × limit

This results in the following configurations:

Image 13 - Hypothetical question strategy retrieval configurations

Recap

The journey of exploring RAG strategy configurations began with embedding and progressed through to retrieval. The following table provides a brief summary of the number of different setups used in the scope of this experiment:

Image 14 - Count of embedding and retrieval configurations by strategy

For the basic strategy, these numbers are the same, as the parameters, such as chunk size and overlap, remain unchanged. However, strategies like sentence window introduce additional retrieval-specific parameters, such as range, which open up new possibilities, leading to a greater number of retrieval configurations.

Before moving into the next phase, it's important to note that retrieval spans the entire corpus. Specifically, the corpus consists of five documents, and when searching for relevant chunks, all documents are searched, not just the one from which the question is derived. As a result, the search space is significantly expanded.

Augmentation

After the relevant chunks are retrieved using the specified retrieval parameters, they are combined into contexts and injected into prompts. In most cases, the target context size was set at 2048 tokens. However, due to the nature of document-based chunking, which was explored in detail in the previous article, it became clear that deviations from the expected chunk size can significantly impact the final context size. Now, it’s time to look at the concrete numbers.
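Conceptually, this step is plain string assembly: the retrieved chunks are joined into a single context and substituted into the answer generation prompt listed in the Appendix. Joining the chunks with a blank line is an assumption of this sketch; the template itself is the one used in the experiment.

ANSWER_PROMPT_TEMPLATE = """You are a helpful assistant that answers given question using ONLY PROVIDED CONTEXT.
You are not allowed to use any previous knowledge.

The output should be a well-formatted JSON object that conforms to the example below
("answer" is either string or null):
{{"answer": "some answer"}}

If you don't know the answer, return:
{{"answer": null}}

<context_start>
{context}
<context_end>

<question_start>
{question}
<question_end>"""

def build_prompt(chunks: list[str], question: str) -> str:
    context = "\n\n".join(chunks)  # the separator is an assumption
    return ANSWER_PROMPT_TEMPLATE.format(context=context, question=question)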

Basic

Reduction is defined as the difference between the expected context size and the mean context size obtained while answering the questions from the test set.

Image 15 - Basic strategy retrieval configurations and mean context sizes

The reduction is expected due to the characteristics of the chunking process, as explained in the previous article. While the mean context size is a useful summary, the following histogram gives a closer look at the specific values, illustrating the context sizes for the first row of the previous table:

Image 16 - Histogram of context sizes (expected size: 2048 tokens)

Sentence window

The sentence window strategy yields similar results with regard to the relationship between chunk size and reduction:

Image 17 - Sentence window strategy retrieval configurations and mean context sizes (context size: 2560 tokens)

Image 18 - Sentence window strategy retrieval configurations and mean context sizes (context size: 3072 tokens)

Image 19 - Sentence window strategy retrieval configurations and mean context sizes (context size: 3584 tokens)

Auto-merging

This strategy is particularly interesting due to the wide range between the minimum and maximum context sizes. It’s difficult to grasp the actual context sizes when dealing with ranges like [2048, 16384]. Is the mean skewed toward the upper or lower bound, or does it lie somewhere in the middle? The following table provides the answer:

Image 20 - Auto-merging strategy retrieval configurations and mean context sizes (context size: 8192 tokens)

Image 21 - Auto-merging strategy retrieval configurations and mean context sizes (context size: 16384 tokens)

Several factors influence the context size. Building upon the extensive analysis of boundaries, understanding how often merging occurs is valuable. This measure is referred to as the merge rate, calculated as the ratio of merged chunks to the limit:

merge\ rate = \frac{merged\ chunks}{limit}
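In code, this measure is a one-liner computed per question over the initially retrieved chunks; the boolean merged flag on each chunk is an assumption of this sketch.

def merge_rate(retrieved_chunks: list[dict], limit: int) -> float:
    """Share of the initially retrieved chunks that ended up merged into a parent."""
    merged_chunks = sum(1 for chunk in retrieved_chunks if chunk.get("merged"))
    return merged_chunks / limit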

Note that merged chunks don’t necessarily originate from the same parent. The following tables present the means for merged and non-merged chunks, along with the merge rate:

Image 22 - Auto-merging strategy retrieval configurations and merging information (context size: 8192 tokens)

This time, the parameters are consolidated into a single column for improved readability. The outcomes require little explanation. The mean of merged chunks, and consequently the mean merge rate, is nearly twice as high for configurations with a chunk size of 1024 and a child chunk size of 128 compared to those with a chunk size of 2048 and a child chunk size of 128. Given that the merging threshold is a constant 0.25, with a 1024-token chunk size at least 2 out of 8 child chunks need to appear among the top relevant chunks for merging to occur, whereas with a 2048-token chunk size 4 out of 16 are required. This means the latter configuration demands twice the incidence of relevant chunks within a single parent. As a result, the lower merge rate (not to be confused with the merging threshold) leads to smaller context sizes, as shown in the table.

And the configurations with 16384 tokens:

Image 23 - Auto-merging strategy retrieval configurations and merging information (context size: 16384 tokens)

To conclude the discussion on auto-merging, there is strong evidence that child chunk overlap increases the merge rate. In every case, increasing the overlap led to a higher merge rate. This is unsurprising, as overlap introduces redundant information. If this redundant information proves relevant in the vector search, both overlapping chunks will receive higher relevance scores.

Hierarchical

In terms of the hierarchical strategy, there isn’t much to elaborate on regarding the context size. Although there are 48 configurations in total, the following shortened table is sufficient to observe the pattern of increasing reduction as the overlap increases.

Image 24 - Hierarchical strategy retrieval configurations and mean context sizes

Hypothetical question

The reduction in context size for the hypothetical question strategy decreases as the chunk size increases.

Image 25 - Hypothetical questions strategy retrieval configurations and mean context sizes

Generation

Now moving on to the most thrilling part, answer generation. The test set includes 423 questions, and when combined with 89 retrieval configurations (based on 50 embedding configurations), thoroughly examined in the previous sections, it is expected that 37,647 answers will be generated.

A key element in this process is the answer generation prompt, available in the Appendix. The LLM is tasked with returning a string if it can answer the question based on the provided context, or returning a null value if it cannot. The following analysis outlines the preliminary success rate of the answer generation process. However, it is important to note that the correctness of successfully generated answers (i.e., those with non-null values) is unknown. At this point, the analysis is purely quantitative.
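A minimal sketch of how such a response can be parsed and classified follows: a string counts as answered, null as unanswered, and anything else (a dictionary, a list, or malformed JSON) as an error, which is exactly the failure mode examined later in the Issues section. The function name and return convention are illustrative.

import json

def classify_answer(raw_response: str) -> tuple[str, str | None]:
    """Return ('answered' | 'unanswered' | 'error', answer_or_None) for one LLM response."""
    try:
        payload = json.loads(raw_response)
        answer = payload["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "error", None
    if answer is None:
        return "unanswered", None
    if isinstance(answer, str):
        return "answered", answer
    # Dictionaries or lists violate the expected schema and count as deserialization errors.
    return "error", None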

The upcoming diagrams label the x-axis with Unanswered and Answered. The Unanswered category represents questions with null answers, while the Answered category shows those with string answers. The y-axis reflects the retrieval parameters, grouped for readability.

Basic

In the context of the basic strategy, the y-axis parameters are represented as a tuple:

  • chunk_size
  • chunk_overlap
  • limit

It is important to note that the context size is fixed at 2048 tokens across all configurations. Upon examining the results, finer granularity appears to be more effective, as configurations with smaller chunk sizes (256 tokens) have the highest number of questions answered.

Image 26 - Answered and unanswered questions based on basic strategy parameters

The Augmentation section revealed that smaller chunk sizes result in reduced actual context sizes because of the document-based chunking, meaning they carry less information. So, why did the configurations with smaller chunk sizes perform significantly better? Looking at the edge cases, specifically chunk sizes of 256 and 2048 tokens, the limit provides the answer.

For the configuration with a chunk size of 2048 tokens, a single relevant chunk is retrieved. In contrast, the 256-token configuration retrieves eight relevant chunks. Even though the chunk with 2048 tokens might have the highest cosine similarity score, it may not necessarily contain the answer. Having the limit set to one means no additional chunks can be retrieved if the first one doesn't provide the answer. Why not raise the limit? Doing so would exceed the context size, undermining the basis for a fair comparison.

A subtle inconsistency appears on closer inspection: the sums of answered and unanswered questions don't add up to the test set size. For example, consider the configuration (256, 0, 8), which has 37 unanswered and 379 answered questions, totaling 416, while the test set contains 423 questions. In fact, all the configurations shown in the diagram total 416 questions. To achieve a fair comparison, it is necessary to exclude instances with errors. Specific errors will be examined later on.

When an error occurs while generating an answer for a particular question, such as in the (256, 0, 8) configuration, the answers to that question (even if successful for other configurations) are excluded from the comparison. This ensures that only answers to the same set of questions are compared across all strategies, providing a consistent comparison.
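The exclusion rule can be expressed as a small filtering step: collect every question that failed in at least one configuration and drop it from all of them. The data layout, a mapping from configuration to per-question outcomes, is an assumption of this sketch.

def comparable_results(results: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Keep only questions whose answer generation succeeded in every configuration.

    `results` maps a configuration label to {question_id: 'answered' | 'unanswered' | 'error'}.
    """
    failed_questions = {
        question_id
        for outcomes in results.values()
        for question_id, outcome in outcomes.items()
        if outcome == "error"
    }
    return {
        config: {qid: outcome for qid, outcome in outcomes.items() if qid not in failed_questions}
        for config, outcomes in results.items()
    }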

Sentence window

Speaking of the sentence window strategy, the y-axis represents:

  • chunk_size
  • range
  • limit

Three different context sizes were used:

  • 2560
  • 3072
  • 3584

with their corresponding diagrams to follow respectively:

Image 27 - Answered and unanswered questions based on sentence window strategy parameters (context size: 2560 tokens)

Image 28 - Answered and unanswered questions based on sentence window strategy parameters (context size: 3072 tokens)

Image 29 - Answered and unanswered questions based on sentence window strategy parameters (context size: 3584 tokens)

Following up on the basic strategy, smaller chunk sizes yield better results.

Auto-merging

As part of the auto-merging, the y-axis represents:

  • chunk_size
  • child_chunk_size
  • child_chunk_overlap
  • merging_threshold
  • limit

The minimum context size is fixed at 2048 tokens, with two sets of configurations having different maximum context sizes. The diagram below illustrates configurations that utilize a maximum context size of 8192 tokens:

Image 30 - Answered and unanswered questions based on auto-merging strategy parameters (max context size: 8192 tokens)

Note that there are three groups, each consisting of three subsequent configurations. Within each group, the child_chunk_overlap varies, while all other parameters remain fixed. When observing the numbers of (un)answered questions, it’s difficult to draw a definitive conclusion about which chunk overlap is most effective. For instance, the 10% overlap has the highest number of unanswered questions in the first two groups but shows a drop in the third.

It's important to highlight that when describing the diagram, the first item on the y-axis refers to the value closest to the origin, not the first one viewed from top to bottom. Therefore, when discussing the first configuration, it refers to (1024, 128, 0, 0.25, 16), not (2048, 256, 20, 0.25, 8).

Setting this brief digression aside, the smaller chunk sizes once again show better performance. Specifically, when comparing the second and third groups, both with a chunk_size of 2048 tokens but differing in child_chunk_size (128 and 256 tokens, respectively), the former performs better.

On the other hand, three configurations include a maximum context size of 16384 tokens:

Image 31 - Answered and unanswered questions based on auto-merging strategy parameters (max context size: 16384 tokens)

Compared to the second group in the previous diagram, the merging_threshold has been lowered from 0.25 to 0.125. With a chunk_size of 2048 and a child_chunk_size of 128 tokens, merging now requires only 2 children among the top 16 relevant chunks, as opposed to 4 in the previous case. This adjustment increased the merge rate and, consequently, the noise, as the average context size grew more than twofold.

Moreover, these two setups, differing only in the merging threshold and, as a result, the maximum context size, share the same top 16 relevant chunks. The 16384-token variant includes all the same chunks as the 8192-token variant but with additional content due to the higher merge rate. Although the comparison isn't entirely fair, since the larger maximum context size clearly favors the 16384-token variant, the 8192-token variant consistently outperforms it by 10-20% in terms of answered questions. It also benefits from a significantly smaller context size, thus reducing latency and cost.

Hierarchical

Proceeding to the hierarchical strategy, the y-axis represents:

  • chunk_size
  • child_chunk_size
  • child_chunk_overlap
  • limit
  • child_limit

The upcoming diagram includes 48 retrieval configurations:

Image 32 - Answered and unanswered questions based on hierarchical strategy parameters

Groups with a smaller chunk size perform better, as is common throughout this analysis. However, this is due to more than just the finer granularity and the increased chance of retrieving important information by using a greater number of smaller chunks. Another factor, thoroughly examined in the previous article, relates to the summary size. In short, it was found that summarizing larger chunks produces summaries of insufficient length, leading to information loss.

Image 33 - Answered and unanswered questions based on hierarchical strategy parameters for a group (2048, 128, 0, …)

Focusing on the group (2048, 128, 0, ...), it's immediately evident that the product of limit and child_limit is 16, which corresponds to the exact number of 128-token chunks needed to fill a 2048-token context. However, the challenge lies in selecting the right combination of these parameters to yield the best results. Recall that the limit parameter is used to search through the summaries, while child_limit searches the child chunks within the original chunk that was summarized. The optimal configuration seems to lie in between, as searching 8 summaries and 2 corresponding child chunks produces the best results. Although setting both parameters to 4 may initially appear balanced, it proves insufficient according to the results.

In summary, the limit should be higher than the child_limit to cover a broader area. A lower child_limit appears adequate for identifying critical information within the parent chunk. This holds until an edge case is met, such as searching 16 summaries but only one child chunk within each original chunk. This pattern is consistent across other groups as well.

Hypothetical question

In the hypothetical question strategy, the y-axis represents:

  • chunk_size
  • chunk_overlap
  • limit

The context size is fixed to 2048 tokens. There are two major groups of configurations, the first one featuring a constant number_of_questions parameter set to 3:

Image 34 - Answered and unanswered questions based on hypothetical question strategy parameters (constant number of questions)

Notice that generating such a small number of questions resulted in the first instance within this analysis where the number of unanswered questions exceeded the number of answered ones, as warned in the previous article.

The issue was resolved by adopting a dynamic approach to the number_of_questions parameter, where the LLM was prompted to generate 2 questions for every 128 tokens in the chunk:

Image 35 - Answered and unanswered questions based on hypothetical question strategy parameters (dynamic number of questions)

There is still a decline in answered questions as the chunk size increases, a pattern previously noted across different strategies. However, compared to the basic strategy, the number of unanswered questions remains high. This suggests that the retrieval pipeline struggled to locate relevant chunks using the hypothetical questions. The issue may require a closer examination of the hypothetical question generation prompt, or it could indicate that the entire hypothetical question strategy underperforms compared to other methods and is unreliable.

Issues

Compared to embedding, answer generation encountered more issues, all of which were related to deserialization. In total, there were 181 errors: 178 were caused by the LLM returning a dictionary, and in the remaining 3 cases, the LLM returned a list. This raises a reasonable question: are certain RAG strategies more prone to such errors? The following diagram shows the distribution of errors across different strategies:

Image 36 - Number of errors by RAG strategy

The answer generation backed by the hierarchical strategy stands out with 71 errors, significantly more than some of the remaining strategies. However, this does not imply that the hierarchical or any other strategy is inherently prone to deserialization errors caused by improperly formatted JSON returned by the LLM. The role of retrieval strategies is to provide relevant chunks to build a context, which is then injected into the prompt. A lengthy context may introduce noise, leading the LLM to overlook formatting instructions such as the following (these instructions are part of the answer generation prompt, see the Appendix for the full prompt):

The output should be a well-formatted JSON object that conforms to the example below
("answer" is either string or null):
{"answer": "some answer"}

If you don't know the answer, return:
{"answer": null}

While it's been established that no particular strategy is the direct cause of the deserialization issues, the focus shifts to the length of the context, which at first glance seems like a reasonable explanation. However, there is another factor to consider. In this benchmark, long contexts were not heavily utilized. Given that the LLM used for generating answers is gpt-4o-mini, with an input size of up to 128k tokens, contexts ranging from 2k to 16k tokens (the latter being an extreme case linked exclusively to the auto-merging strategy) do not appear excessively long. Moreover, there were instances where answer generation failed due to deserialization issues despite the context being only 1k tokens long. Thus, there’s more to it than just the length of the context.

Since the error distribution across strategies didn’t reveal much, it’s necessary to explore other factors to better understand the nature of the problem. One of these is determining which set of questions the LLM finds difficult to answer correctly and analyzing their characteristics. Is it a variety of questions that are occasionally challenging, or a limited range that consistently leads to formatting errors? The answer lies in the following diagram:

Image 37 - Number of errors by test set sample identifier

The majority of errors were caused by test set samples (i.e., questions) with the IDs 4a17db87 (responsible for 61 errors) and 7e042592 (responsible for 56 errors), so let’s take a closer look. Recall that two types of deserialization errors occurred: one where the answer was formatted as a dictionary and another where it was formatted as a list. The two identifiers mentioned are related to the dictionary formatting issue. The (question, context, answer) tuples that reveal the actual formatting problems are shown below. Note that the contexts have been shortened for clarity and readability.

Identifier: 4a17db87

Question:

What are the funding provisions and reporting requirements for the Independent Mexico Labor Expert Board?

Context:

Independent Mexico Labor Expert Board

...

4673. Funding
The United States shall provide necessary funding to support the work of the Board, including with respect to translation services and personnel support.
(Pub. L. 116–113, title VII, § 733, Jan. 29, 2020, 134 Stat. 87.)

Answer:

{
  "answer": {
    "funding_provisions": "The United States shall provide necessary funding to support the work of the Board, including with respect to translation services and personnel support.",
    "reporting_requirements": {
      "assessment": [
        "the efforts of Mexico to implement Mexico’s labor reform",
        "the manner and extent to which labor laws are generally enforced in Mexico"
      ],
      "compliance_determination": "The report may include a determination that Mexico is not in compliance with its labor obligations."
    }
  }
}

Identifier: 7e042592

Question:

What are the key dates and conditions for the Nat'l Bank Act in the VI and Guam?

Context:

The Act entitled “An Act to provide a national currency secured by a pledge of United States bonds, and to provide for the circulation and redemption thereof,” approved June 3, 1864, shall be known as “The National Bank Act.”
(June 20, 1874, ch. 343, § 1, 18 Stat. 123.)

...

40. Virgin Islands; extension of National Bank Act
The National Bank Act, as amended [12 U.S.C. 21 et seq.], and all other Acts of Congress relating to national banks, shall, insofar as not locally inapplicable after July 19, 1932, apply to the Virgin Islands of the United States.
(July 19, 1932, ch. 508, 47 Stat. 703.)

41. Guam; extension of National Bank Act
The National Bank Act [12 U.S.C. 21 et seq.], and all other Acts of Congress relating to national banks, shall, insofar as not locally inapplicable after August 1, 1956, apply to Guam.
(Aug. 1, 1956, ch. 852, § 2, 70 Stat. 908.)

Answer:

{
  "answer": {
    "Guam": {
      "date": "August 1, 1956",
      "conditions": "The National Bank Act and all other Acts of Congress relating to national banks shall apply insofar as not locally inapplicable."
    },
    "Virgin_Islands": {
      "date": "July 19, 1932",
      "conditions": "The National Bank Act and all other Acts of Congress relating to national banks shall apply insofar as not locally inapplicable."
    }
  }
}

Finally, here is an example of a question that caused two out of three errors by formatting the answer as a list instead of a string.

Identifier: 2e9de31b

Question:

What countries are classified as CAFTA–DR countries under the Dominican Republic-Central America Free Trade Agreement?

Context:

4001. Purposes

(Pub. L. 109–53, § 2, Aug. 2, 2005, 119 Stat. 463.)

1 to approve and implement the Free Trade Agreement between the United States, Costa Rica, the Dominican Republic, El Salvador, Guatemala, Honduras, and Nicaragua entered into under the authority of section 3803(b) of this title;

Answer:

{
  "answer": [
    "Costa Rica",
    "Dominican Republic",
    "El Salvador",
    "Guatemala",
    "Honduras",
    "Nicaragua"
  ]
}

The listed errors can be addressed through prompt engineering. The answer generation prompt is written in a minimalist style, describing what the output should look like, but not what it shouldn't. Providing examples of both correct and incorrect output formats is one way to go. This technique, which relies on concrete examples, is often referred to as few-shot prompting.
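Another option, in the spirit of the re-evaluation loop mentioned in the Conclusion, is to validate the parsed answer and re-prompt when the schema is violated. This is a sketch under assumptions: call_llm stands in for whatever chat completion wrapper is used, and classify_answer is the parsing sketch from the Generation section.

from typing import Callable

def generate_valid_answer(
    prompt: str,
    call_llm: Callable[[str], str],  # hypothetical wrapper around the chat completion API
    max_retries: int = 2,
) -> str | None:
    """Ask the LLM and retry with an explicit correction when the output schema is violated."""
    for _ in range(max_retries + 1):
        status, answer = classify_answer(call_llm(prompt))
        if status != "error":
            return answer  # a string, or None when the question could not be answered
        prompt += (
            "\n\nYour previous output did not match the required schema. "
            'Return ONLY {"answer": "<string>"} or {"answer": null}.'
        )
    return None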

To conclude this section, it's important to highlight the impact of the described issues. A total of 181 errors occurred, meaning 181 fewer answers were available for further analysis than expected. Considering that 37,466 answers were generated in total, the error rate stands at 0.48%, which is negligible. However, efforts should still be made to resolve these formatting issues and reduce their frequency, especially since there is clear room for improvement.

Price

In the answer generation pipeline, the token consumption of the embedding model is low because it is only used to embed the questions from the synthetic test set. Additionally, there is a significant gap between the input and output token usage of the LLM. This outcome is expected, as the generated answer contains far fewer tokens than the entire prompt sent to the LLM, which includes extensive context, a question, and instructions.
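The cost accounting behind the charts below boils down to multiplying token counts by per-token prices. The sketch takes per-million-token prices as parameters rather than hard-coding rates, since the actual billing depends on the provider's current pricing.

def model_cost(input_tokens: int, output_tokens: int, input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in the billing currency, given token counts and per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000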

Image 38 - Token usage by AI model

The described token usage is reflected in the following prices, amounting to a total of 12.76€:

Image 39 - Price by AI model

As shown in the upcoming pie chart, the LLM accounts for 99.5% of the price share in answer generation.

Image 40 - Price Distribution of AI Models

Summary

The article delves into the process of answering questions from a synthetic test set using Retrieval-Augmented Generation (RAG). After generating a test set and embedding the corpus, the focus shifts to how adjustments in retrieval parameters affect the performance of a language model when answering questions based on relevant retrieved contexts.

The article outlines the setup, specifying the models used and their configurations. It discusses the retrieval process, exploring how varying parameters like chunk size and overlap impact the effectiveness of retrieving relevant information. Different strategies are compared, emphasizing the need for fairness in comparing the results by maintaining a consistent context size.

The augmentation phase involves combining the retrieved chunks into a prompt for the LLM. Variations in context sizes and the deviations from expected values are analyzed across different strategies, and the influence of retrieval configurations on final context size is explored.

In the generation phase, attention is directed toward analyzing the model's ability to provide answers to the questions. The impact of retrieval strategies on the number of answered and unanswered questions is examined, highlighting the importance of adjusting parameters for optimal results. Additionally, the article examines errors in answer generation and addresses their causes, along with considerations for improving performance.

Finally, the discussion concludes with an analysis of the cost involved in running the entire pipeline, breaking down token usage and the associated pricing, particularly emphasizing the cost implications of the generation phase.

Conclusion

The analysis was conducted in three phases. It began with the retrieval phase, where various parameters were explored and the foundation for comparing context sizes was consolidated. This set the stage for the augmentation phase, during which the actual context sizes were the focus of investigation.

A clear pattern emerged across most strategies: the mean context size increased with larger chunk sizes and decreased as chunk overlap grew, even though the expected context size remained constant. However, auto-merging proved to be an exception. It was observed that increasing chunk overlap enhances the merge rate when relevant information is contained within the overlap, resulting in longer contexts. Augmenting the prompt with relevant context leads to answer generation. A thorough examination revealed some interesting findings.

The most resource-intensive strategies, those utilizing LLMs, specifically the hierarchical and hypothetical question strategies, did not meet expectations in terms of the benefit-to-cost ratio. A key issue is the high percentage of unanswered questions in the test set, with the hypothetical question strategy performing the worst, leaving 24% of questions unanswered at best.

To improve the underperformance of LLM-based strategies, the most effective starting point is prompt engineering, focusing on what the current prompt lacks. However, if this approach does not yield sufficiently improved results, the meaningfulness of these strategies is in question.

Furthermore, prompt engineering can enhance answer generation, which is prone to formatting issues in some cases. These are especially problematic because they result in the loss of valuable data needed for evaluation. This can be addressed through few-shot prompting or by using a loop in which the LLM re-evaluates the answer to ensure it conforms to the specified schema.

The three described phases produced the contexts and answers, essential assets for the upcoming analysis. However, their quality remains unknown. Finally, the stage is set for evaluation, which concludes the RAG in practice series.

Appendix

Prompt for Answer Generation

Input keys: context and question

Output format: {"answer": "..."} or {"answer": null}

You are a helpful assistant that answers given question using ONLY PROVIDED CONTEXT.
You are not allowed to use any previous knowledge.

The output should be a well-formatted JSON object that conforms to the example below
("answer" is either string or null):
{"answer": "some answer"}

If you don't know the answer, return:
{"answer": null}

<context_start>
{context}
<context_end>

<question_start>
{question}
<question_end>