Explore the embedding process in Retrieval-Augmented Generation (RAG) strategies. This article provides insights into the chunking process, LLM-generated summaries and hypothetical questions, offering valuable analysis on embedding configurations and their impact on answer generation.
The upcoming analysis explores the outcomes of the embedding process driven by various Retrieval-Augmented Generation (RAG) strategies, applied to titles from the U.S. Code, and builds upon the test set sample generation discussed in a previous article. Investigating a wide range of options, examined in detail in the RAG strategies series of articles, provides valuable insight into how RAG systems perform in practice. These strategies include:
- basic
- sentence window
- auto-merging
- hierarchical
- hypothetical questions
The embedding process is powered by a queue-based architecture whose design and development were detailed in the previous article. Rather than revisiting that here, the focus shifts directly to the technical matters at hand.
The following AI models are employed:
- gpt-4o-mini (version 2024-05-13), with temperature set to 0.1
- text-embedding-ada-002 (version 2)
The same LLM is used for both summary and hypothetical question generation.
The embedding configurations are represented as the Cartesian product of various parameters for each strategy. A few exceptions exist, but they are explicitly noted. The most common parameters across the strategies are chunk size and chunk overlap: the former is measured in tokens, while the latter is expressed as a percentage of the corresponding chunk size.
chunk_sizes = [256, 512, 1024, 2048]
chunk_overlaps = [0, 10, 20]
For the basic strategy, 12 different embedding configurations were chosen: 4 chunk_sizes × 3 chunk_overlaps. For clarity, they are shown in the table below:
Image 1 - Basic strategy embedding configurations
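As a quick illustration, the grid of basic-strategy configurations can be enumerated as a Cartesian product. The sketch below is illustrative (the names are not the pipeline's actual code), and it assumes overlap percentages are converted to tokens with simple integer arithmetic:

using System;
using System.Linq;

int[] chunkSizes = { 256, 512, 1024, 2048 };
int[] chunkOverlaps = { 0, 10, 20 }; // percentage of the corresponding chunk size

// Cartesian product: 4 chunk sizes x 3 overlaps = 12 basic-strategy configurations.
var basicConfigurations =
    (from chunkSize in chunkSizes
     from overlapPercent in chunkOverlaps
     select (ChunkSize: chunkSize, OverlapTokens: chunkSize * overlapPercent / 100)).ToList();

Console.WriteLine(basicConfigurations.Count); // 12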
Likewise, a set of configurations is determined for each of the remaining strategies.
chunk_sizes = [128, 256, 512]
The sentence window strategy has 3 configurations. Chunk overlap is not used, since neighboring chunks will be retrieved.
chunk_sizes = [1024, 2048]
child_chunk_sizes = [128, 256]
child_chunk_overlaps = [0, 10, 20]
A child chunk size of 256 is exclusively paired with a chunk size of 2048, so auto-merging has 9 configurations. When multiple levels of chunking are involved, the overlap is applied only at the final level. Applying chunk overlap to the parent chunk would result in derived child chunks containing redundant information. The same approach will be used in the upcoming hierarchical strategy.
chunk_sizes = [512, 1024, 2048]
child_chunk_sizes = [128, 256, 512]
child_chunk_overlaps = [0, 10, 20]
A child chunk size of 256 is exclusively paired with the chunk sizes of 1024 and 2048, while 512 is exclusively paired with 2048, leaving the hierarchical strategy with 18 configurations.
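To make the pairing rules concrete, here is a hedged sketch (illustrative names, not the pipeline's code) that enumerates the hierarchical configurations and applies the exclusions described above:

using System;
using System.Linq;

int[] parentSizes = { 512, 1024, 2048 };
int[] childSizes = { 128, 256, 512 };
int[] childOverlaps = { 0, 10, 20 }; // applied only at the final (child) level

// Keep only the allowed parent/child pairings:
// 128 pairs with every parent, 256 only with 1024 and 2048, 512 only with 2048.
var hierarchicalConfigurations =
    (from parent in parentSizes
     from child in childSizes
     from overlap in childOverlaps
     where child == 128
        || (child == 256 && parent >= 1024)
        || (child == 512 && parent == 2048)
     select (Parent: parent, Child: child, OverlapPercent: overlap)).ToList();

Console.WriteLine(hierarchicalConfigurations.Count); // 18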
chunk_sizes = [256, 512, 1024, 2048]
chunk_overlaps = [0]
number_of_questions = ...
This strategy introduces a number_of_questions parameter, specifically the number of questions generated per chunk, which must be chosen wisely as it has a significant impact. Two different approaches to determining this parameter will be discussed:
In the first case, the number of questions will be set to a constant of 3 across different chunk sizes. One can already anticipate the consequences of such a choice. In contrast, using the second approach, the number of questions is determined as:
number_of_questions = chunk_size / tokens_per_question
where the tokens per question is chosen so that dividing the chunk size by it results in an integer. Within this experiment, it will be set to 128, meaning that for a chunk size of 256, 2 questions will be generated, for 512, 4 questions will be generated, and so on. In the end, this strategy has 8 configurations.
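Expressed as a short sketch (the names are illustrative), the second approach simply scales the question count with the chunk size:

// Second approach: one question per 128 tokens of chunk size.
const int tokensPerQuestion = 128;
int NumberOfQuestions(int chunkSize) => chunkSize / tokensPerQuestion;

// NumberOfQuestions(256)  -> 2
// NumberOfQuestions(512)  -> 4
// NumberOfQuestions(1024) -> 8
// NumberOfQuestions(2048) -> 16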
In the upcoming sections, the emphasis shifts to the outputs of the embedding process. Most strategies produce chunks and corresponding embeddings; in addition, the hypothetical question strategy generates questions and the hierarchical strategy generates summaries. The properties of these LLM-generated artifacts are hard to predict, which makes the analysis of such nuances particularly interesting.
Special emphasis is placed on understanding the embedding parameters used by each strategy, summarized in the table below, as this knowledge will be essential for comprehending the retrieval parameters later on.
Image 2 - Embedding parameters
Once the embedding configurations are chosen, chunking follows as the first and integral part of the embedding process. Chunking was accomplished with TextChunker, specifically its SplitPlainTextParagraphs method. This type of chunking is often referred to as document-based chunking, according to this article. A sketch of the chunking call is shown below, and the diagram that follows illustrates the consequences of such a choice.
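The sketch shows the rough shape of the chunking call; the file name and parameter values are illustrative, and the real pipeline iterates over every configured size and overlap:

using System.IO;
using Microsoft.SemanticKernel.Text; // TextChunker is marked experimental in recent Semantic Kernel releases

string documentText = File.ReadAllText("usc-title.txt"); // a U.S. Code title as plain text (illustrative path)
int expectedChunkSize = 256;
int overlapTokens = expectedChunkSize * 10 / 100; // 10% overlap, expressed in tokens

// Split into lines first, then into paragraph-oriented chunks of roughly expectedChunkSize tokens.
var lines = TextChunker.SplitPlainTextLines(documentText, expectedChunkSize);
var chunks = TextChunker.SplitPlainTextParagraphs(lines, expectedChunkSize, overlapTokens);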
Image 3 - Chunk sizes
The x-axis represents the chunk size provided as a parameter to the chunking method, and will sometimes also be referred to as the expected chunk size. Once chunking is completed for a given chunk size and overlap, a list of chunks is obtained, and the y-axis shows the mean of these chunk sizes. Values on both axes are measured in tokens.
Shifting to the interpretation of the displayed data, it is observed that the size of the obtained chunks deviates from the expected value. When looking at the value 256 on the x-axis, the corresponding value on the y-axis is clearly below 256. This deviation occurs because the splitting method segments the text by paragraphs to preserve the document’s structure. In contrast, fixed-size chunking would produce chunks of exact sizes but at the expense of awkward splits, cutting through words or sentences.
Another noticeable trend is the decline in mean chunk size as the overlap increases.
However, this chart doesn’t provide much insight into the distribution of different values. Thus, it’s time to involve some descriptive statistics, which, when paired with the upcoming histograms, reveal interesting findings.
Image 4 - Histogram of chunk sizes (expected size: 256 tokens)
expected_chunk_size = 256
mean = 205.17
median = 219
std_dev = 44.45
sample_size = 11418
Image 5 - Histogram of chunk sizes (expected size: 512 tokens)
expected_chunk_size = 512
mean = 441.71
median = 473
std_dev = 79.01
sample_size = 5314
Image 6 - Histogram of chunk sizes (expected size: 1024 tokens)
expected_chunk_size = 1024
mean = 932.39
median = 983
std_dev = 126.26
sample_size = 2520
Image 7 - Histogram of chunk sizes (expected size: 2048 tokens)
expected_chunk_size = 2048
mean = 1939.58
median = 2005
std_dev = 191.15
sample_size = 1212
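For reference, the statistics reported above can be reproduced from the per-chunk token counts with a few lines of LINQ; the Describe helper below is a sketch, not the code used in the experiment:

using System;
using System.Collections.Generic;
using System.Linq;

static (double Mean, double Median, double StdDev, int SampleSize) Describe(IReadOnlyList<int> tokenCounts)
{
    double mean = tokenCounts.Average();
    var sorted = tokenCounts.OrderBy(t => t).ToArray();
    double median = sorted.Length % 2 == 1
        ? sorted[sorted.Length / 2]
        : (sorted[sorted.Length / 2 - 1] + sorted[sorted.Length / 2]) / 2.0;
    double stdDev = Math.Sqrt(tokenCounts.Sum(t => Math.Pow(t - mean, 2)) / tokenCounts.Count);
    return (mean, median, stdDev, tokenCounts.Count);
}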
Deviation is particularly noticeable with smaller chunk sizes, such as 256, as illustrated in the first histogram. The actual chunk sizes for larger expected values tend to be less skewed: the Kernel Density Estimate (KDE) curves are closer to the mean, and more bins with higher frequency fall within one standard deviation of the mean (recall that the interval mean ± std_dev covers roughly 68% of the values for an approximately normal distribution). For example, the range for an expected chunk size of 256 is:
[205.17 - 44.45, 205.17 + 44.45] = [160.72, 249.62]
Meanwhile, for an expected chunk size of 2048, the range is:
[1939.58 - 191.15, 1939.58 + 191.15] = [1748.43, 2130.73]
Notably, the expected chunk size of 256 doesn't even fall within its interval. What causes this discrepancy? Simply put, the chunker performs better with larger chunk sizes because it has more "control" over the content. It's easier to organize and fit paragraphs smaller than the chunk size, leading to more consistent chunking as the chunk size increases.
There is a significant drawback to this chunking issue. For instance, if a strategy is meant to retrieve two chunks with an expected size of 256 tokens each, forming a context of 512 tokens, but instead retrieves the two most relevant chunks with only 205 tokens each, the resulting context size is 410 tokens, which is about 20% smaller than the intended 512. This reduction in context size can lead to a loss of important information, potentially compromising the ability to provide a detailed answer or, in some cases, any answer at all.
The actual sizes of the chunks deviate even more from the expected value when there are multiple levels of chunking. The Configurations section introduced the auto-merging and hierarchical strategies, both based on two levels of chunking. Let's focus on the chunk size of 256, derived from a parent chunk of 1024 tokens (this time, the x-axis represents the obtained chunk sizes):
Image 8 - Histogram of chunk sizes at the second level of chunking (expected size: 256 tokens)
expected_chunk_size = 1024
expected_child_chunk_size = 256
mean = 198.08
median = 212
std_dev = 50.92
sample_size = 11996
as well as from a parent chunk of 2048 tokens:
Image 9 - Histogram of chunk sizes at the second level of chunking (expected size: 256 tokens)
expected_chunk_size = 2048
expected_child_chunk_size = 256
mean = 202.45
median = 216
std_dev = 46.48
sample_size = 11743
Condensing the information into a table results in:
Image 10 - Statistics of chunk sizes (expected size: 256 tokens)
The first row represents the basic case, without parent chunks. There is a noticeable trend of increasing sample size and deviation, while the mean shifts away from the expected 256 tokens, as the chunker has less content to work with.
In the worst-case scenario, a chunk consists of just a few words or letters. Sometimes, the chunks contained content that offered no value to the answer generation, such as the following:
Summarizing the chunks is an integral part of the hierarchical strategy. The prompt used for generating summaries, along with the prompts for hypothetical questions and answers, can be found in the Appendix. Using this prompt with the LLM yields noteworthy results. The diagram below illustrates the relationship between chunk size and the mean summary size:
Image 11 - Mean of summary sizes by chunk size
Clearly, the summary sizes do not scale proportionally with the chunk sizes. As chunk sizes double, the mean summary sizes only increase slightly, resulting in a loss of information, which in turn impacts the quality of retrieval. Intuitively, one would expect that doubling the chunk size would lead to a summary roughly twice as large. Now, let's examine how this affects the total size of all summaries in terms of tokens.
Image 12 - Total size of summaries by chunk size
Considering that the entire corpus contains 2,342,281 tokens, summarizing chunks of 512 tokens resulted in summaries totaling 20.44% of the corpus. For a chunk size of 1024 tokens, this percentage drops to 11.64%, and for 2048 tokens, it decreases further to 6.18%. Ideally, the summary sizes should remain proportionally consistent across different chunk sizes.
The generation of hypothetical questions relies on a dedicated prompt, as listed in the Appendix. The Embedding section discussed two approaches for determining the number of questions per chunk. Elaborating on the prompt, the number of questions is not a fixed quantity that will necessarily be generated; instead, it represents an upper bound. This means that the LLM may generate fewer questions, or possibly none at all.
One particularly interesting aspect of the generated questions is their length. Since AI models operate on tokens, it is measured in tokens. The first approach, with a constant of 3 questions per chunk, yields the following results:
Image 13 - Mean of hypothetical question sizes by chunk size
The chunk size isn’t highly correlated with the length (size) of the questions. One of the main factors is that each question is generated based on specific information within the chunk, rather than the entire chunk. Therefore, the results obtained are not surprising.
The key aspect to consider is the total number of questions, both generated and skipped:
Image 14 - Number of hypothetical questions by chunk size (constant number of questions)
Each bar grouping represents the total number of questions expected to be generated. But what does "expected" mean in this context? Consider the following embedding configuration, which corresponds to the first bar grouping:
chunk_size = 256
chunk_overlap = 0
number_of_questions = 3
As mentioned in the Chunking section, for this chunk size and overlap, the sample_size was 11,418. This means that the expectation was to generate three questions for each of the 11,418 chunks, resulting in a total of 34,254 questions. However, in practice, only 29,406 questions were generated, while 4,848 were skipped.
The term "skipped" refers to cases where the LLM was prompted to generate a specific number of questions (in this instance, 3 questions per chunk), but produced fewer than expected. For example, if the model generated only 2 questions from a given chunk instead of the expected 3, the missing question would be considered "skipped". These skipped questions can occur due to limitations in the model’s ability to extract the questions, or a lack of sufficient information within the chunk to derive meaningful questions.
To stay on point, the repercussions of using a constant number of questions are becoming evident, as significantly fewer questions are generated for larger chunk sizes. This issue is addressed in the second approach, where the number of questions becomes a function of the chunk size, with 1 question generated for every 128 tokens within each chunk:
Image 15 - Number of hypothetical questions by chunk size
The bars are now almost aligned, but the drop in generated questions is noticeable for the chunk size of 2048. Based on the results, it seems the LLM finds it easier to generate more questions from medium-sized chunks. For example, it generates more questions overall when instructed to produce 8 questions from each of two 1024-token chunks than when asked for 16 questions at once from a single 2048-token chunk.
Expressing data as percentages results in the following chart:
Image 16 - Percentage of generated and skipped hypothetical questions by chunk size
Given this narrow range of chunk sizes, it's difficult to draw a firm conclusion, but chunk sizes between 512 and 1024 seem to be a sweet spot.
This analysis of the hypothetical questions focuses on quantity; it doesn't provide any insight into their quality, which will only become apparent once the evaluation is performed.
The embedding generation process, specifically GenerateEmbeddingAsync from SemanticKernel, failed three times due to the TextChunker producing empty chunks. Fortunately, this did not affect the overall embedding process, as the empty chunks carried no information. The following error was triggered:
'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.
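A simple guard avoids this class of failure; the sketch below (the service wiring and persistence step are assumed) skips empty or whitespace-only chunks before requesting embeddings:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SemanticKernel.Embeddings;

async Task EmbedChunksAsync(ITextEmbeddingGenerationService embeddingService, IEnumerable<string> chunks)
{
    // Empty chunks carry no information and the endpoint rejects an empty "$.input", so skip them.
    foreach (var chunk in chunks.Where(c => !string.IsNullOrWhiteSpace(c)))
    {
        ReadOnlyMemory<float> embedding = await embeddingService.GenerateEmbeddingAsync(chunk);
        // ... persist the chunk and its embedding
    }
}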
Another issue pertains to incorrect formatting by the LLM. Instead of returning a string, it returned an object, which caused the Deserialize method from JsonSerializer to fail with the following error:
The JSON value could not be converted to System.String.
Generated summary:
{
"summary": {
"ship": "transfer physical possession of grain for transportation",
"false_terms": "false, incorrect, misleading",
"deceptive_practices": "deceptive loading, handling, weighing, or sampling",
"export_elevator": "grain facility in the U.S. for shipping grain abroad",
"export_port_location": "recognized port in U.S. or Canada for grain export",
"official_weighing": "certification of grain quantity by official personnel",
"supervision_of_weighing": "oversight by official personnel to ensure weighing accuracy",
"intracompany_shipment": "shipment of grain between owned facilities within the U.S."
}
}
This error compromised the quality of the testing. Since the generated summary was incorrectly formatted and, as a result, not properly parsed, it was neither embedded nor stored in the database, leading to the loss of valuable information. If a question from the synthetic test set pertains to the chunk from which the summary should have been generated, an answer to that question won't be found during retrieval. The upside is that only a single error of this type occurred, making its impact negligible compared to the 9,045 summaries generated across the entire corpus.
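One way to reduce the impact of such formatting slips is to parse the output more tolerantly. The sketch below is not the pipeline's actual code; it keeps the raw JSON text whenever "summary" turns out not to be a plain string:

using System.Text.Json;

static string? ExtractSummary(string llmOutput)
{
    using var document = JsonDocument.Parse(llmOutput);
    if (!document.RootElement.TryGetProperty("summary", out var summary))
        return null;

    // If the LLM returned a nested object instead of a string, keep its raw text
    // rather than discarding the summary altogether.
    return summary.ValueKind == JsonValueKind.String
        ? summary.GetString()
        : summary.GetRawText();
}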
After thorough analysis, it's time to present the long-awaited pricing information. The pricing details are derived from token consumption, which becomes useful when considering a model switch. It's important to note that such a conversion is meaningful only when the replacement model uses the same tokenizer. Without further digression, let's outline the components of the pipeline that incur costs.
In the embedding pipeline, the LLM serves two primary roles:
- generating summaries (hierarchical strategy)
- generating hypothetical questions (hypothetical question strategy)
Meanwhile, the embedding model is used to generate embeddings for:
- chunks
- summaries
- hypothetical questions
Given that a significant number of the chosen embedding configurations rely solely on the embedding model, it's expected that the embedding model will dominate over the LLM in terms of token consumption, as shown below:
Image 17 - Token usage by AI model
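For a potential model switch, these token counts translate into cost through a simple conversion. The rate below is a placeholder rather than one of the prices behind the totals reported next, and, as noted earlier, the projection only holds if the replacement model shares the original tokenizer:

static decimal EstimateCost(long tokens, decimal pricePerMillionTokens)
    => tokens / 1_000_000m * pricePerMillionTokens;

// e.g. EstimateCost(2_342_281, 0.10m) prices a single pass over the full corpus
// at a hypothetical rate of 0.10 per million tokens.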
These token usages are accompanied by the corresponding prices, summing to a total of 15.47€:
Image 18 - Price by AI model
The total costs of the embedding process across models are nearly equalized:
Image 19 - Price distribution of AI models
This article provides a detailed analysis of the embedding process using various Retrieval-Augmented Generation (RAG) strategies applied to titles from the U.S. Code. The embedding process relies on a queue-based architecture and uses the AI models gpt-4o-mini and text-embedding-ada-002.
Key parameters such as chunk sizes and overlaps are outlined for each strategy, and the process of chunking text is explained. The chunking method used, based on splitting text by paragraphs, causes deviations from the expected chunk sizes, particularly at smaller sizes. This leads to potential issues with reduced context size, which can hinder retrieval quality later on.
Several issues were identified, including empty chunks and incorrect LLM output formats that caused deserialization errors. Despite these issues, the process was largely successful, with only minor disruptions.
Finally, the article touches on the costs associated with the embedding process, noting that the embedding model consumed more tokens than the LLM due to the nature of the chosen strategies and options. Total costs amounted to 15.47€.
The analysis of the embedding results primarily focuses on examining the properties of the chunks, summaries, and hypothetical questions, as well as their connection to various embedding parameters. This forms the foundation for a deeper understanding of the forthcoming retrieval process.
A significant emphasis was placed on the chunking process, which is the first operation in all strategies and thus critical to the overall workflow. Text chunking was executed using the document-based chunking method, but several caveats were identified. One key issue involved the generation of empty chunks, which caused embedding API calls to fail.
Another important factor was the distribution of chunk sizes, which can significantly impact the overall context size.
The experiment also provided valuable insights into the preferred lengths of LLM-generated summaries. It was observed that the length of the summaries did not scale proportionally with the increase in chunk size.
Regarding the hypothetical question strategy, one of the most challenging decisions was determining a sufficient number of questions to generate for different chunk sizes. After extensive analysis, it was concluded that the number of questions should be a function of the chunk size, ensuring a more balanced approach.
Exploring various embedding configurations opens the door to optimizing the retrieval pipeline. The true effects of these configurations will become apparent when the retrieval pipeline is activated and the system begins answering the questions from the synthetic test set.
All of these findings set the stage for the next phase, which will be covered in the upcoming article, where the retrieval pipeline is put to the test.
Input keys: numberOfQuestions and text
Output format: {"questions": []}
We are creating an app, and you are a professor of law.
Your task is to generate SPECIFIED number of DOMAIN RELATED questions that the USER MAY ASK.
The questions should be generated based only on the provided <text> and not prior knowledge.
The questions should not be connected with each other.
The answers for generated questions must be contained within given <text>.
If you can't generate any meaningful question, return a JSON with empty list.
Here are the examples of a bad question, a reason why it is bad and a good question where its fixed.
Example 1:
BAD: What does the 'uniformed service' refer to in this text?
REASON: Don't use phrases like "refer to in this text".
GOOD: What is the uniformed service?
Example 2:
BAD: When should vessels observe the regulations prescribed by the Surgeon General from the provisions of subsections (a) and (b) of this section?
REASON: Don't use phrases like "subsections (a) and (b) of this section".
GOOD: When should vessels observe the regulations prescribed by the Surgeon General?
<banned_phrases>:
- in this section
- in this text
- in this context
- in given text
AVOID using <banned_phrases> and other phrases similar to them in your questions, i.e. imagine like you are creating questions for an exam and students DON'T have an access to the materials.
You must generate 0-{numberOfQuestions} questions!
The output should be a well-formatted JSON object that conforms to the example below:
{"questions": []}
where "questions" is a list containing 0-{numberOfQuestions} questions without <banned_phrases>.
<text_start>
{text}
<text_end>
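For illustration, this is roughly how the prompt's placeholders could be filled and its declared output format parsed; the file name, sample inputs, and the QuestionsOutput record are assumptions rather than the pipeline's actual code:

using System.Collections.Generic;
using System.IO;
using System.Text.Json;

string promptTemplate = File.ReadAllText("hypothetical-questions-prompt.txt"); // the prompt shown above
string chunkText = "...";                                                      // a chunk produced by the chunking step

string prompt = promptTemplate
    .Replace("{numberOfQuestions}", "3")
    .Replace("{text}", chunkText);

// llmResponse stands in for the raw completion returned by the chat model for `prompt`.
string llmResponse = """{"questions": []}""";

var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
List<string> questions = JsonSerializer.Deserialize<QuestionsOutput>(llmResponse, options)!.Questions;

record QuestionsOutput(List<string> Questions);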
Input key: text
Output format: {"summary": "..."}
You are an assistant trained in creating summaries.
When writing the summary, try to include important keywords.
Create a summary of given <text>.
Just summarize the <text>, don't write phrases like "The text contains..."!
The output should be a well-formatted JSON object that conforms to the example below:
{"summary": "some summary"}
<text_start>
{text}
<text_end>