
The Science Behind RAG Testing

Explore the vital step of evaluating Retrieval-Augmented Generation (RAG) applications. Learn key metrics like precision, recall, faithfulness, and relevance, essential for crafting accurate and reliable AI-generated responses.

Development · 9 min read
Teo Jeftimov
2024-05-01
AI Blog Series
Software Development
Retrieval Augmented Generation
Quality Assurance

So you’ve collected your documents, embedded them, stored them in a database, implemented a context retrieval technique, and set up answer generation with an LLM. Congratulations, you’ve built your RAG application. Or have you?

If you have no idea what I’m talking about, you can go check out our other articles on building RAG applications here.

In all seriousness, while you might think that implementing all the aforementioned steps gives you a finished product you can start using to aid you and your users in the domain of your choice, you would be very wrong. There is one step you have missed, and it’s arguably the most important one: evaluation.

Core concepts

So, what is evaluation in the context of RAG applications? It’s a widely discussed topic, and everybody seems to have their own names for what they are evaluating when looking at RAG application results. But while the nomenclature changes depending on which of the myriad links you click from your Google search, the core concepts of evaluation are something everyone seems to agree on (even if they name them differently).

These core concepts are split into two groups. First are the so-called retrieval metrics, which evaluate the quality of the retrieval of domain data in your application: context precision (or context relevance) and context recall (everyone seems to agree on the naming of this one). Then we have the generation metrics, which evaluate the quality of your application’s end result, the answers: faithfulness (also called correctness or factuality) and relevance (also called answer relevancy or similarity).

Retrieval metrics

Context precision

Context precision is, put in basic terms, the ratio of signal to noise within the retrieved context. What does this mean though? Ideally, the retrieved context for a query would only contain information that is essential to providing the answer to the query, but that is not always the case. Context precision therefore looks at how much essential information the retrieved context contains versus how much “useless” information it carries.

Let’s take the question:

“What is a carburetor and what is it made from?”

An example of a high precision context would be:

“Carburetor, device for supplying a spark-ignition engine with a mixture of fuel and air. Components of carburetors usually include a storage chamber for liquid fuel, a choke, an idling (or slow-running) jet, a main jet, a venturi-shaped air-flow restriction, and an accelerator pump.”

It includes a definition of a carburetor and its components, and nothing else.

Now let’s take an example of a low precision context:

“A carburetor (also spelled carburettor or carburetter) is a device used by a gasoline internal combustion engine to control and mix air and fuel entering the engine. The primary method of adding fuel to the intake air is through the Venturi tube in the main metering circuit, though various other components are also used to provide extra fuel or air in specific circumstances. Since the 1990s, carburetors have been largely replaced by fuel injection for cars and trucks, but carburetors are still used by some small engines (e.g. lawnmowers, generators, and concrete mixers) and motorcycles. In addition, they are still widely used on piston engine driven aircraft. Diesel engines have always used fuel injection instead of carburetors, as the compression-based combustion of diesel requires a greater precision and pressure of fuel-injection.”

This context also contains a definition of a carburetor, but it additionally includes information on the inner workings of one and some historical background, both of which drown out the essential information we are most interested in.
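If you want a mental model for how a score like this gets computed, think of it as the number of useful sentences divided by the number of retrieved sentences. Here is a minimal, framework-free sketch in Python; the judging function is a stand-in for whatever you would actually use (in the frameworks mentioned later in this article, that is an LLM), so treat it as an illustration rather than a ready-made implementation.

```python
from typing import Callable, List

def context_precision(
    question: str,
    context_sentences: List[str],
    is_relevant: Callable[[str, str], bool],
) -> float:
    """Fraction of retrieved sentences that actually help answer the question.

    `is_relevant(question, sentence)` is a stand-in for an LLM judge; anything
    from a keyword heuristic to a GPT-4 call could be plugged in here.
    """
    if not context_sentences:
        return 0.0
    useful = sum(1 for s in context_sentences if is_relevant(question, s))
    return useful / len(context_sentences)

# With a competent judge, both sentences of the short carburetor context count
# as useful (precision 1.0), while the longer Wikipedia-style context mixes in
# historical sentences that drag the score down.
```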

Context recall

Context recall measures whether the retrieved context contains all the information required to generate an answer to the query. In order to measure this, we need something called a “ground truth”. A ground truth is simply the correct answer to the query. When we have a ground truth, we can check whether the retrieved context contains all the information provided in it.

If we take the question:

“What is the capital of Poland and where is it located?”

The ground truth for this question is:

“The capital of Poland is Warsaw and it’s located in east-central Poland.”

An example of a high-recall context would be:

“Warsaw is the capital and largest city of Poland. It is located on the Vistula River, in east-central Poland, roughly 260 kilometres from the Baltic Sea and 300 kilometres from the Carpathian Mountains.”

This context contains all of the information from the ground truth and is therefore ideal for answering our question.

A low-recall context would be something like this:

“Warsaw is the capital and largest city of Poland. In 2012 Warsaw was ranked as the 32nd most liveable city in the world by the Economist Intelligence Unit. It was also ranked as one of the most liveable cities in Central and Eastern Europe. Today Warsaw is considered an Alpha– global city, a major international tourist destination and a significant cultural, political and economic hub.”

This context contains the information needed to answer the first part of our question, but fails to provide a concrete source for answering the second part, which ranks it low on the context recall scale.
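Put differently, recall boils down to the fraction of ground-truth statements that the retrieved context can account for. A minimal sketch along those lines, again with the actual support check left as a callback you would implement (in practice an LLM judge):

```python
from typing import Callable, List

def context_recall(
    ground_truth_statements: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of ground-truth statements that the retrieved context supports.

    `is_supported(statement, context)` is a stand-in for an LLM judge.
    """
    if not ground_truth_statements:
        return 0.0
    supported = sum(1 for s in ground_truth_statements if is_supported(s, context))
    return supported / len(ground_truth_statements)

# The Warsaw ground truth splits into two atomic statements:
statements = [
    "The capital of Poland is Warsaw.",
    "Warsaw is located in east-central Poland.",
]
# The "liveable city" context only supports the first statement (recall around
# 0.5), while the Vistula River context supports both (recall 1.0).
```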

Generation metrics

Faithfulness

Faithfulness measures how factually accurate the generated answer is in relation to the provided context. This is determined by taking the claims made in the generated answer and checking whether they can be backed by information in the provided context.

For example, if we have the following question and context pair:

“When and where was Dennis Ritchie born?”

“Dennis MacAlistair Ritchie (born September 9, 1941) was an American computer scientist. He is best known for creating the C programming language and, with long-time colleague Ken Thompson, the Unix operating system and B programming language.”

A high faithfulness answer would be:

“Dennis Ritchie was born in America on 9th September 1941.”

A low faithfulness answer would be:

“Dennis Ritchie was born in America on 15th March 1941.”
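Conceptually, the score is the share of claims in the answer that the context backs up. Here is a minimal sketch of that idea; the claim-verification step is left as a callback standing in for an LLM judge:

```python
from typing import Callable, List

def faithfulness(
    answer_claims: List[str],
    context: str,
    is_backed: Callable[[str, str], bool],
) -> float:
    """Fraction of the answer's claims that can be inferred from the context.

    `is_backed(claim, context)` is a stand-in for an LLM judge that verifies a
    single claim against the retrieved context.
    """
    if not answer_claims:
        return 0.0
    backed = sum(1 for c in answer_claims if is_backed(c, context))
    return backed / len(answer_claims)

# The low-faithfulness answer above breaks into two claims; the birthplace is
# backed by the context, but the 15th March birth date is not, so it scores 0.5.
claims = [
    "Dennis Ritchie was born in America.",
    "Dennis Ritchie was born on 15th March 1941.",
]
```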

Relevance

Relevance measures how well the generated answer addresses the posed question. Basically, it checks that the generated answer completely covers the concepts mentioned in the question and doesn’t provide any redundant information.

For example, if we again take the question:

“What is the capital of Poland and where is it located?”

And then we take answers like:

“The capital of Poland is Warsaw and it’s located in east-central Poland.”

“The capital of Poland is Warsaw.”

The first one covers all the information we asked for in the question (the name of the capital and its location), making it a high-relevance answer. The second one gives us only some of the information (the name of the capital) and leaves out the location, making it a low-relevance answer.
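One common way to automate this check, and roughly how Ragas computes its answer relevancy score, is to have an LLM reconstruct the questions an answer would satisfy and compare them to the original question using embeddings. A rough, framework-free sketch of that idea, with the LLM and embedding model left as callbacks you would supply:

```python
import math
from typing import Callable, List, Sequence

def answer_relevance(
    question: str,
    answer: str,
    generate_questions: Callable[[str], List[str]],
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Average cosine similarity between the original question and the
    questions an LLM reconstructs from the answer alone.

    `generate_questions` and `embed` are stand-ins for an LLM and an embedding
    model. An answer that covers everything the question asked (and nothing
    more) yields reconstructed questions very close to the original.
    """
    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0

    generated = generate_questions(answer)
    if not generated:
        return 0.0
    q_vec = embed(question)
    return sum(cosine(q_vec, embed(g)) for g in generated) / len(generated)
```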

Custom metrics

While the four metrics explained in this article will give you all the general information you need to work on and improve your application so that it always gives the best quality answers, there is no reason you can’t think up and implement measurements of your own.

Each RAG application is its own entity and uses a different knowledge base, so it’s never a bad idea to have some custom metrics the application can be tested against. And who better to think of these metrics than the person most familiar (I sincerely hope so) with the inner workings of the application: you, the creator. So don’t be afraid to get creative and implement the metrics you think would be most beneficial in giving you the best end product.
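As a purely hypothetical illustration (the metric names and fields below are made up for this example), custom checks can be as simple as a couple of functions run over your evaluation samples:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvalSample:
    question: str
    answer: str
    source_ids: List[str]  # IDs of the documents the context was retrieved from

def cites_a_source(sample: EvalSample) -> bool:
    """Does the answer reference at least one retrieved document?
    Useful if your product promises citations."""
    return any(src in sample.answer for src in sample.source_ids)

def within_length_budget(sample: EvalSample, max_words: int = 120) -> bool:
    """Is the answer short enough for your UI or use case?"""
    return len(sample.answer.split()) <= max_words
```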

How to test

Okay, so we’ve covered the whats of RAG evaluation; now it’s time for the hows. How can you implement the evaluation concepts discussed in this article? Well, that’s simple: you just need to take your knowledge base, comb through it, find chunks that are prime examples for asking questions, then ask your application those questions, and voilà, you have everything you need to check against the metrics. Simple. And extremely time- and resource-consuming.

The good thing is, LLMs can come to our rescue in this process. But wait, don’t we need actual people to check all these things like ground truths, relevance, and precision? According to this research paper by the LMSYS group, which compared the verdicts of strong LLM judges like GPT-4 to those of human experts and crowdsourced annotators, not really: they found an agreement of over 80%, the same level of agreement found between humans. So now we know we can safely use LLMs to evaluate our applications, but we still don’t know the exact how of it.
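Under the hood, an LLM judge is just a strong model plus a carefully worded grading prompt. As an illustration only, and not the wording used in the paper or in any particular framework, a faithfulness-style judge prompt might look something like this:

```python
JUDGE_PROMPT = """You are grading a retrieval-augmented generation system.

Question:
{question}

Retrieved context:
{context}

Generated answer:
{answer}

List every factual claim made in the answer and state whether the retrieved
context supports it. Finish with a single faithfulness score between 0 and 1,
where 1 means every claim is supported by the context."""

def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Fill in the template; the result is sent to whichever strong LLM you trust as judge."""
    return JUDGE_PROMPT.format(question=question, context=context, answer=answer)
```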

Luckily, AI is a fast-growing space, and several frameworks dedicated to solving this exact issue have already emerged: DeepEval, MLflow, Ragas, Deepchecks, Arize AI, and TruLens, to name a few. All of these frameworks use the core concepts we went over in this article (although, as I mentioned, the nomenclature can differ), so basically all you have to do is pick one.

Of course, when choosing, you should consider things like the resources you have available, what you want to test your application for (beyond the core concepts), the language your application is written in, and whatever else you think is relevant to your choice.

The best evaluation choice for our application was Ragas, so if you want to know more about that, you can check out our article where we dive into it here[link].
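To give you a feel for what an evaluation run can look like in practice, here is a minimal sketch using Ragas. Treat it as illustrative: the column names and metric imports reflect one Ragas release and may differ in the version you install, and Ragas needs a configured judge LLM (by default an OpenAI API key) before it will actually produce scores.

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation sample built from the Warsaw example; in a real run you would
# have dozens or hundreds of question/answer/context/ground-truth rows.
samples = {
    "question": ["What is the capital of Poland and where is it located?"],
    "answer": ["The capital of Poland is Warsaw and it's located in east-central Poland."],
    "contexts": [[
        "Warsaw is the capital and largest city of Poland. It is located on the "
        "Vistula River, in east-central Poland."
    ]],
    "ground_truth": ["The capital of Poland is Warsaw and it's located in east-central Poland."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # one aggregate score per metric across the samples
```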

Conclusion

In the fast-growing space of AI, LLMs, and RAG, many things can change even from week to week. Frameworks come and go, models get replaced by better versions, new database solutions emerge, and so on. However, one thing we can be almost certain of is that the need for evaluation will remain constant. Evaluation is an unavoidable and crucial step for the success of your RAG application.

Summary

In this article, we went over the core concepts of RAG evaluation, what each of them entails, why you might add additional metrics to your evaluation, and ultimately how to find the best implementation for your RAG evaluation needs. Armed with all this knowledge, you are now ready to optimize your application and build the best product you can. Best of luck in your development!
