Are you interested in building a Retrieval-Augmented Generation (RAG) application? In previous articles, we introduced fundamental concepts such as retrieval strategies, database choices, vector search indices, and the Ragas evaluation framework. Now let's tie all this together into a comprehensive solution.
Our primary goal was to evaluate various retrieval strategies and determine which ones yield the best performance in different contexts. We sought to develop a general-purpose RAG application that could be fine-tuned for specific use cases in a production environment. This involved selecting suitable strategies, an evaluation framework, and the right database and index configurations.
To ensure a thorough evaluation, we needed to select a range of retrieval strategies to test. These included basic index retrieval, hierarchical index retrieval, hypothetical questions, and context enrichment strategies such as sentence window retrieval and auto-merging retrieval. We covered each of these in previous articles, so check those out if any of them are unfamiliar. Each strategy has its own strengths and weaknesses, making it essential to evaluate them against each other under controlled conditions.
Our evaluation framework of choice was the Ragas framework. Ragas provided us with a robust set of metrics to measure performance, including context recall, faithfulness, relevance, and context precision. By using Ragas, we could objectively compare the effectiveness of each retrieval strategy and make data-driven decisions. If you want to know more, we have a couple of articles on Ragas as well.
Choosing the right database is extremely important when building any application, but even more so with RAG applications, because the indices used in vector search can consume a lot of memory. We explored both traditional and native vector databases, weighing their pros and cons, and we even wrote a short article about that. Traditional databases offered maturity and reliability, while native vector databases provided specialised environments for handling vectors efficiently.
Postgres stood out to us as a vector database option due to its versatility and extensive indexing capabilities, including Flat, HNSW, and IVFFlat indices. These indices accommodate a range of requirements, from precise searches for smaller datasets to scalable, memory-efficient solutions for large-scale applications. Postgres may not reach the performance of specialised vector databases optimised for high-dimensional search and real-time processing, and certain indices may result in higher memory consumption. For our case study, however, it made perfect sense because it was basically free on Azure for the amount of data in the corpus we focused on.
When implementing a Retrieval-Augmented Generation (RAG) application, thoroughly analysing the given domain is critical for ensuring the application's success and effectiveness. In our case study, we had the flexibility to choose our domain, and we selected the legal domain, specifically focusing on the 54 titles of the US Code of Law. This decision was guided by several key considerations:
After establishing the key considerations around the domain, the next step is to determine the scope of the data you want to use and how you intend to collect and manage it. Domains that experience frequent updates or rapid changes require a robust strategy for continuous data integration. For example, in the financial markets, keeping the application updated with the latest market data, regulatory changes, and economic indicators ensures that the generated content remains relevant and accurate. This ties into the importance of building a sound infrastructure for data ingestion, management, and versioning.
The US Code of Law, our chosen domain, changes at a slower pace, making it an excellent candidate for our case study. This slower rate of change reduces the frequency and, with it, the cost of re-embedding data. Although versioning is crucial in the legal context to track when specific laws came into force, we decided to simplify our case study by not addressing versioning at this stage.
Given our goal to move quickly and not spend excessive time on data collection and dataset maintenance, we chose to mock this part of the solution. In a production environment, having a robust data ingestion pipeline is vital. Although we explored various ETLT (Extract, Transform, Load, and Transform) solutions, we ultimately decided to assume the existence of a solution that would fill our blob storage with relevant documents.
Despite the availability of many good ETLT solutions with free connectors for different data sources, these typically yield unstructured data that requires further refinement and poses challenges for parsing and chunking. The decision to mock the data ingestion process also stemmed from our desire to create a case study that is generic and easily reusable in a production environment. Data ingestion pipelines often end up being highly customised based on specific application needs and the nature of your data.
By focusing on the core aspects of our RAG application rather than the intricacies of data ingestion, we aimed to save time and concentrate on elements that could be effectively reused in future implementations.
We initially experimented with LangChain, a mature and robust framework known for its extensive features and tools that facilitated the development of our RAG application. However, we soon found that Semantic Kernel, although newer, provided several compelling advantages that aligned more closely with our project needs. Our team is highly familiar with the Microsoft ecosystem, and Semantic Kernel's design is specifically tailored for .NET developers. This alignment significantly reduced the learning curve, accelerating our development process and allowing us to implement and test different retrieval strategies more efficiently.
Additionally, the strategic benefits of Microsoft's partnership with OpenAI played a crucial role in our decision. The partnership provides seamless integration capabilities, such as creating an instance of ChatGPT on Azure with just a few clicks. This ease of deployment, along with the robust support infrastructure offered by Microsoft and OpenAI, made Semantic Kernel an attractive choice for our project, ultimately guiding our decision to adopt it over other frameworks.
In our exploration of Retrieval-Augmented Generation (RAG) applications, we also investigated how chunk size and chunk overlap influence performance. Chunking is a critical process in preparing documents for vector search, ensuring that text inputs stay within token limits for embedding models. Proper chunk sizing is essential not only for maintaining the integrity of the content, but also for reducing noise. Larger chunks tend to encapsulate more context, potentially including irrelevant or extraneous information, which can introduce noise and reduce the precision of the search and retrieval processes. Conversely, smaller chunks can minimise noise by focusing more narrowly on specific content, but they may also risk losing valuable contextual information necessary for accurate understanding. Therefore, finding an optimal chunk size is crucial to balancing between preserving context and minimising noise, enhancing the overall performance of the RAG system.
We used the TextChunker from Microsoft.SemanticKernel.Text to create document chunks. To evaluate performance, we experimented with chunk sizes of 256, 512, 1024, 1536 and 2048 tokens, along with different overlap sizes. This testing aimed to find the optimal balance between chunk size and overlap to maintain context and enhance retrieval accuracy.
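To make the interplay between chunk size and overlap concrete, here is a minimal Python sketch of a sliding-window chunker. It is not the algorithm TextChunker uses; the function name chunk_tokens is hypothetical, and the "tokens" are any pre-tokenised list (whitespace-split words would do as a rough stand-in for model tokens).

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens, so context that straddles
    a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

A larger overlap means a smaller step, so the same document produces more chunks and therefore more embedding calls, which is part of why the number of combinations matters for cost.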
Testing multiple strategies with varying chunk sizes and overlaps significantly increased the number of embeddings required, presenting a challenge in managing and processing the data efficiently. This sets the stage for the next section, where we address the challenges and solutions related to embedding.
To embed a document using various strategies, we follow specific steps for each: Basic, Hierarchical, Hypothetical Questions, Sentence Window, and Auto-Merging. When combined with variable chunk sizes and overlaps, we end up with 125 different combinations, meaning a single document could be embedded 125 times. Costs can escalate quickly. For example, the US Code of Law contains approximately 64 million words, with an average words-to-token ratio of 1.365. Using OpenAI's text-embedding-3-large model at $0.00013 per 1,000 tokens, embedding this corpus once costs around $8.43. If we assume each of the 125 combinations costs the same as the cheapest one (a highly optimistic estimate), the total cost would be approximately $1,053.75. However, this is almost certainly an underestimate, as some strategies involve calls to an LLM, which can be even pricier. Now imagine working with high-volume, high-velocity data that needs to be re-embedded frequently.
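The arithmetic behind these figures can be sketched in a few lines of Python. The function name embedding_cost is hypothetical; the price, the single-pass cost, and the number of combinations are the values quoted above. The function takes a token count directly, so how words are converted to tokens is left to the caller.

```python
def embedding_cost(tokens, price_per_1k_tokens):
    """Cost in dollars of embedding `tokens` tokens once."""
    return tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.00013   # text-embedding-3-large, dollars per 1,000 tokens (quoted above)
SINGLE_PASS = 8.43       # quoted cost of embedding the corpus once, in dollars
COMBINATIONS = 125       # strategies x chunk sizes x overlaps

# Optimistic total: every combination costs as much as one plain embedding pass.
total = round(SINGLE_PASS * COMBINATIONS, 2)

# Cost of one pass over a hypothetical one-million-token document.
per_million = embedding_cost(1_000_000, PRICE_PER_1K)
```

Because the total scales linearly with both the token count and the number of combinations, halving either one halves the bill, which is the motivation for removing redundant operations.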
This emphasises the need for robust, fail-safe solutions to avoid redundant processes and highlights the importance of optimising the embedding workflow to manage costs effectively, while also being able to precisely estimate the cost of embedding a corpus of text.
Each strategy involves some of the four distinct operations: chunking (C), summarising (S), question framing (Q), and embedding (E). Let's break down each of the strategies and the operations they require to try and identify optimisation opportunities.
This is an oversimplification that ignores multiple chunk sizes and overlaps, which would result in many more small trees; however, the concept can be demonstrated nonetheless.
Let's now break these trees down into individual branches and select only distinct ones among them. Out of seven individual branches, we are left with just four.
If we now join these back into a tree by going through levels of these branches one by one and joining them into a single node if the operation type matches, we will end up with this:
Right away, we can see that we reduced the number of embed operations from 7 to 4, a 42.86% reduction, which is huge. Furthermore, even though chunking is not charged, since we do it ourselves locally, fewer chunking operations still save time and RAM. Here we reduced the number of chunking operations from 7 to 2, a 71.43% reduction.
Again, this is a simplification, since we actually have 125 small trees like the ones shown in the first picture, but these percentages are still applicable and valid. One other thing worth mentioning: this is useful in the scope of our case study because we wanted to benchmark strategies with multiple chunk sizes and overlaps, but it is not far-fetched to imagine a production solution that lets users pick different strategies, chunk sizes, and overlaps for different types of documents, or even multiple strategies for the same document to serve different types of queries. In scenarios like that, this approach is still very useful.
Now that we have established what operations need to be done in order to embed a document, let's explore ways to do it. Suppose we receive a user request to embed a certain document in all possible combinations of strategies, chunk sizes, and overlaps. What next? We cannot process this request right away, because embedding is a long-running process. The models we use have a maximum throughput, defined in tokens per minute, which acts as a bottleneck that slows down the process. For that reason, we have to use some kind of background service.
So, when a user sends a request, we create an operation tree, store it in the database, and return 200 OK to let the user know the request will be processed. You may notice we used Mongo. The reason, as already stated, is that we wanted to move fast, and storing objects like trees in Mongo is trivial compared with a relational database.
Let's try to figure out what this background service should look like. We have multiple operations we need to be able to execute: chunking, question framing, summarising, and embedding. The question arises: how should we segment our code? Do we create a single background service type that executes all these operations itself, or do we create multiple background services, one for each operation type? It depends on what we want to achieve. As already mentioned, we wanted to move fast in producing this case study, but we also wanted to create a scalable solution that can easily become a production-ready application.
Both of these options can scale horizontally. The first offers simplicity, since it reduces the infrastructure needed for communication between services, while the second offers more granularity in scaling. Because the models act as bottlenecks, you would need quite a few instances of each model before your background services could no longer keep up with them and needed another instance of their own, but it is not an impossible scenario. For that reason, we chose the second approach: as we said, we wanted this case study to be transferable to a production environment, and we wanted to keep our options open. It also separates the code into clearer logical units with clear responsibilities, making it more readable and maintainable. For the time being, we avoided the overhead of running multiple background services and the communication between them by hosting the services within the same service host.
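The merging step can be sketched in Python as building a prefix tree from the distinct branches. The branch set below is hypothetical: it is chosen only so that it reproduces the counts quoted above (seven branches, four distinct, two chunking nodes after merging), and the labels C1 and C2 are assumed names for two different chunking configurations. The real mapping from strategies to branches may differ.

```python
from collections import Counter


def merge_branches(branches):
    """Merge operation branches into a prefix tree of nested dicts.

    Branches that share a leading run of operations share the same nodes.
    """
    tree = {}
    for branch in set(branches):            # keep distinct branches only
        node = tree
        for op in branch:
            node = node.setdefault(op, {})  # reuse the node if it already exists
    return tree


def count_ops(tree):
    """Count tree nodes per operation kind (first letter of the label)."""
    counts = Counter()
    for op, children in tree.items():
        counts[op[0]] += 1
        counts.update(count_ops(children))
    return counts


# Hypothetical branches: C = chunk, S = summarise, Q = question framing, E = embed.
BRANCHES = [
    ("C1", "E"), ("C1", "E"), ("C1", "E"),
    ("C1", "S", "E"), ("C1", "Q", "E"),
    ("C2", "E"), ("C2", "E"),
]

merged = merge_branches(BRANCHES)
after = count_ops(merged)
embed_reduction = 1 - after["E"] / len(BRANCHES)   # each branch had one embed step
chunk_reduction = 1 - after["C"] / len(BRANCHES)   # each branch had one chunk step
```

With this branch set, embed_reduction and chunk_reduction come out at roughly 0.4286 and 0.7143, matching the percentages above.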
So now we know we have multiple background services, each with its own responsibility. We also introduced a queue for each background service to draw tasks from and we ended up with something like this:
In this setup, the chunking background service loads the operation tree from the database, executes the chunking operation, and enqueues messages for summarisation, question framing, and embedding. The summarising and question framing services execute their part of the process, sending requests to the LLM and enqueuing the results for the embedding service to consume. Finally, the embedding background service embeds the chunks by sending them to the embedding model.
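As a rough illustration of this flow, here is a self-contained Python sketch that uses asyncio queues in place of the real services. It makes several assumptions: fake_llm and fake_embed are hypothetical stand-ins for the model calls, the message placed on the chunking queue is the document itself rather than an operation tree loaded from the database, and every chunk is summarised and turned into a question.

```python
import asyncio


def fake_llm(task, text):
    """Hypothetical stand-in for an LLM call."""
    return f"{task}: {text}"


def fake_embed(text):
    """Hypothetical stand-in for an embedding model call."""
    return [float(len(text))]


async def service(inbox, handle):
    """Generic background service: handle messages from its queue until cancelled."""
    while True:
        message = await inbox.get()
        try:
            await handle(message)
        finally:
            inbox.task_done()


async def run_pipeline(document, chunk_size=5):
    chunk_q, summary_q, question_q, embed_q = (asyncio.Queue() for _ in range(4))
    embedded = []

    async def chunk(doc):
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            text = " ".join(words[i:i + chunk_size])
            # Fan out: embed the raw chunk, and also summarise it and frame a question.
            await embed_q.put(("chunk", text))
            await summary_q.put(text)
            await question_q.put(text)

    async def summarise(text):
        await embed_q.put(("summary", fake_llm("summarise", text)))

    async def frame_question(text):
        await embed_q.put(("question", fake_llm("ask", text)))

    async def embed(item):
        kind, text = item
        embedded.append((kind, fake_embed(text)))

    stages = [(chunk_q, chunk), (summary_q, summarise),
              (question_q, frame_question), (embed_q, embed)]
    tasks = [asyncio.create_task(service(q, h)) for q, h in stages]

    await chunk_q.put(document)
    for q, _ in stages:          # drain the queues in pipeline order
        await q.join()

    for t in tasks:              # all work is done, so stop the services
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return embedded
```

Calling asyncio.run(run_pipeline(text)) returns one embedding per chunk, per summary, and per question. In a real deployment each service would call a rate-limited model, which is why every stage gets its own queue.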
But this solution has its problems. Mainly, we have no way of tracking how much progress has been made on a single task. It could be minutes or hours, but we have no way of knowing.
To be able to track the progress of an embedding job optimised with an operations tree, we need to know how many steps each operation tree node has. The issue is you cannot know that until you chunk your document, since the number of chunks will determine how many discrete tasks like question framing or summarising you will have. So we repurposed chunking background service in a way that, besides chunking and enqueuing those chunks, it also calculates the number of steps for the rest of the tree nodes, and we gave it a new name to reflect its newly gained responsibilities - the preprocessing background service.
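A minimal Python sketch of how step counts could be derived once chunking is done is shown below. It assumes the operation tree is a nested dict keyed by operation label, that a chunking node counts as a single step, and that every node beneath it runs once per chunk produced by its nearest chunking ancestor. The function name count_steps and the labels are hypothetical.

```python
def count_steps(tree, chunk_counts, inherited=1, path=()):
    """Return {node path: number of steps} for every node in the tree.

    chunk_counts maps a chunking node's label to the number of chunks it
    produced; that number becomes the step count of its descendants.
    """
    steps = {}
    for op, children in tree.items():
        node_path = path + (op,)
        if op in chunk_counts:
            steps[node_path] = 1            # chunking the document is one step
            below = chunk_counts[op]        # descendants run once per chunk
        else:
            steps[node_path] = inherited
            below = inherited
        steps.update(count_steps(children, chunk_counts, below, node_path))
    return steps


# Hypothetical tree: C1 and C2 are two chunking configurations.
TREE = {"C1": {"E": {}, "S": {"E": {}}}, "C2": {"E": {}}}
STEPS = count_steps(TREE, {"C1": 10, "C2": 40})
TOTAL_STEPS = sum(STEPS.values())
```

The total is only known after chunking, which is why this calculation belongs in the preprocessing service rather than in the request handler.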
Now we can have all other background services update the state of the nodes upon processing a message and thus report how much progress has been made. The only issue is that we now have multiple processes competing for the same resource in the database. We can lock the resource to ensure that the number of steps is incremented correctly, but we risk creating a bottleneck if the load gets too high. For that reason, we created a new status tracking background service with its own queue. It dequeues multiple status messages sent by the other background services and updates the database in batches, which not only eliminates the contention but also reduces the number of round trips to the database. Note that the newly created preprocessing background service now updates the tree nodes with their number of steps.
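The batching idea can be sketched in Python as follows. This is a single-threaded illustration, not the actual service: the store is an in-memory dict standing in for the database, the class and method names are hypothetical, and the writes counter simply records how many simulated round trips were made.

```python
from collections import Counter


class StatusTracker:
    """Collects progress messages and applies them to the store in batches."""

    def __init__(self, store, batch_size=100):
        self.store = store          # node id -> {"done": int, "total": int}
        self.batch_size = batch_size
        self.pending = []           # status messages not yet written
        self.writes = 0             # simulated database round trips

    def report(self, node_id):
        """Called once per completed step; flushes when the batch is full."""
        self.pending.append(node_id)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Apply all pending messages as one grouped update."""
        if not self.pending:
            return
        for node_id, completed in Counter(self.pending).items():
            self.store[node_id]["done"] += completed
        self.pending.clear()
        self.writes += 1

    def progress(self):
        """Fraction of all steps completed, according to the store."""
        done = sum(node["done"] for node in self.store.values())
        total = sum(node["total"] for node in self.store.values())
        return done / total if total else 0.0
```

Because a single consumer owns all writes, no locking is needed, and the number of round trips drops from one per message to one per batch.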
Another problem we had to solve was predetermining the price of embedding a document a certain way. These models can get pricey and if you’re dealing with a domain with high volume and high velocity, you need a way to know the amount of money embedding something will cost you.
The main issue we had to solve here was that you are dealing with some unknowns. Embedding models are usually priced by the number of input tokens, while LLMs are priced by both input and output tokens. When you turn chunks into questions and summaries, you do not know how many output tokens this will generate, and thus you do not know the exact price. For that reason, we turn to heuristics.
The number of input tokens you will send out towards models is something you can calculate, but for the number of output tokens, you have to rely on estimates based on the average number of output tokens each type of operation generates for a certain number of input tokens. Make sure to note that the number of input tokens is larger than the number of tokens in the document due to optimisations we made with the operations tree, since now we are executing embedding for multiple strategies and chunk size combinations at once.
This left us with two paths. One was to precisely calculate the number of input tokens in the operation tree and, based on estimates, calculate the number of output tokens they will produce as well as the price. But this approach is problematic, because to determine the exact number of input tokens you need to execute preprocessing and chunk the entire document you're trying to embed, and that takes time. With a large enough document, it can take tens of seconds, which amounts to a bad user experience. The other approach is to base the estimate only on the number of tokens in the document, information that can be made readily available, and on the types of operations our strategy selection will result in. This gives a slightly less precise estimate, but it is available almost instantly.
To track previous token consumption, we need a solution similar to the one we used for status tracking, since the number of output tokens is only available with the model response. For that reason, we added yet another background service and an accompanying queue.
And there we have it. This solution is what we needed before proceeding to the next step, which is the actual embedding and testing the quality of our RAG implementation. To be fair, there is still room for improvement, as this is not a production-ready solution, but it can easily be made so. The background services can be containerised and hosted individually, the internal queues we use to exchange messages can be replaced with a message broker that offers message persistence, and the Postgres database can be replaced with a native vector database if needed. The core of the RAG application is there: a robust, scalable document processing pipeline.
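The fast estimate and the tracked consumption can be combined along the following lines. This Python sketch rests on stated assumptions: only the embedding price is taken from this article, while the LLM prices and output ratios are placeholder values, and all names are hypothetical. It walks one branch of the operation tree, charges each model call for its input and estimated output, and feeds the output of an LLM step into the next step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_1k: float          # dollars per 1,000 input tokens
    output_per_1k: float = 0.0   # dollars per 1,000 output tokens (0 for embeddings)
    output_ratio: float = 0.0    # estimated output tokens per input token


def estimate_branch_cost(document_tokens, branch, pricing):
    """Estimate the cost of one branch from the document token count alone."""
    tokens = float(document_tokens)
    cost = 0.0
    for op in branch:
        if op == "C":
            continue                      # chunking runs locally and is free
        p = pricing[op]
        produced = tokens * p.output_ratio
        cost += tokens / 1000 * p.input_per_1k + produced / 1000 * p.output_per_1k
        if produced:
            tokens = produced             # the next step consumes this output
    return cost


def observed_ratio(usage):
    """Average output tokens per input token from (input, output) usage records."""
    total_in = sum(i for i, _ in usage)
    total_out = sum(o for _, o in usage)
    return total_out / total_in if total_in else 0.0


PRICING = {
    "E": Pricing(input_per_1k=0.00013),                                         # from the article
    "S": Pricing(input_per_1k=0.0005, output_per_1k=0.0015, output_ratio=0.2),  # placeholder
    "Q": Pricing(input_per_1k=0.0005, output_per_1k=0.0015, output_ratio=0.1),  # placeholder
}
```

In this sketch the output ratios would be refreshed from the usage records gathered by the token tracking service, so the estimate improves as more documents are processed. It ignores the extra input tokens caused by chunk overlap, which is part of why the fast path is less precise.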
In this article, we tried to illustrate all the different choices you will have to make when embarking on a journey of building a RAG application. We highlighted the importance of selecting a proper strategy and setting up an evaluation framework so you can make informed decisions when trying to improve your solution. We also talked about the importance of data quality and availability, issues of privacy, and other issues you will encounter when tackling a certain domain. And finally, we proposed a solution design that will be robust, scalable, and easily portable to real world scenarios outside this case study.
All this was to set the stage for our next article, where we will put this solution to the test along with multiple retrieval strategies to find out how they perform on our dataset of choice, the US Code of Law. Stay tuned for more!