Optimize your LLM prompts for Retrieval-Augmented Generation (RAG) with JSON-based and few-shot techniques. The article explores strategies for reducing hallucinations, improving context alignment, and addressing formatting errors. Extensive testing demonstrates the benefits of structuring prompts with clear instructions, separating context chunks, and leveraging in-context learning examples.
Creating an effective Retrieval-Augmented Generation (RAG) solution can be challenging, as developers lean toward various exotic strategies to maximize performance. Indexing and retrieval pipelines have evolved from straightforward experimentation with chunk sizes and hierarchies to complex, Frankenstein-like implementations incorporating everything from sparse embeddings to fine-tuned agents.
The process is designed to extract the relevant context for generating answers. Not using this context effectively would undermine the entire effort. This final step is the essence of this article, with its foundation rooted in prompt engineering.
When designing a prompt, separating static and dynamic content is crucial. This is where the concept of a prompt template comes into play. A prompt template includes static content with placeholders for dynamic content, injected at runtime in the form of input variables. In Retrieval-Augmented Generation, the minimum set of these inputs typically includes the retrieved context and the user’s question.
The foundation of every prompt lies in detailed instructions, including a clearly defined response format. Since LLMs are prone to hallucinations, it’s important to minimize this behavior by specifying how to handle cases where information is incomplete or unavailable. Rather than generating a potentially misleading response, the model can be directed to return a null value, indicating it cannot provide an answer.
The upcoming prompt, originally introduced in Answer Generation, adheres to the outlined requirements.
You are a helpful assistant that answers given question using ONLY PROVIDED CONTEXT.
You are not allowed to use any previous knowledge.
The output should be a well-formatted JSON object that conforms to the example below
("answer" is either string or null):
{"answer": "some answer"}
If you don't know the answer, return:
{"answer": null}
<context_start>
{context}
<context_end>
<question_start>
{question}
<question_end>
This approach is called zero-shot prompting because it relies solely on instructions and doesn’t include input-output examples.
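For illustration, here is a minimal Python sketch of how such a template might be rendered and sent to gpt-4o-mini (the model used in the experiments below). The helper function, the chunk-joining step, and the use of the official openai client are assumptions rather than the article's actual implementation:
import json
from openai import OpenAI  # assumes the official openai>=1.x client

client = OpenAI()

# Static template; {context} and {question} are the dynamic input variables.
PROMPT_TEMPLATE = """You are a helpful assistant that answers given question using ONLY PROVIDED CONTEXT.
You are not allowed to use any previous knowledge.
The output should be a well-formatted JSON object that conforms to the example below
("answer" is either string or null):
{{"answer": "some answer"}}
If you don't know the answer, return:
{{"answer": null}}
<context_start>
{context}
<context_end>
<question_start>
{question}
<question_end>"""

def answer(question: str, chunks: list[str]) -> str | None:
    # Retrieved chunks are concatenated into a single context (discussed below).
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # The model is instructed to reply with {"answer": <string or null>}.
    return json.loads(response.choices[0].message.content)["answer"]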
The prompt was pushed to its limits, with over 37k API calls to OpenAI’s gpt-4o-mini. Such extensive testing revealed interesting findings. To keep things simple, a basic strategy will be examined, specifically one with a context size of 2,048 tokens and the following parameters:
chunk_size = 256
chunk_overlap = 0
limit = 8
Applied to the test set of 423 questions, it produced a noticeable number of null answers. A closer inspection of the logs revealed that, in certain cases, the context was misaligned. Before diving into the details, let’s clarify how the context is constructed.
There is nothing special about it: relevant chunks of text are retrieved and concatenated using a double newline (\n\n). However, this approach introduces a potential issue: the chunks themselves may contain additional newlines, and their exact number is unknown unless regular expressions (which are not ideal) are used to detect them. Sometimes this creates the illusion that content from neighboring chunks is connected, even though it isn’t, leading to hallucinations.
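A tiny sketch with made-up chunks illustrates the ambiguity: once the pieces are joined, the separator is indistinguishable from a blank line that already existed inside a chunk.
chunks = [
    "The warranty covers the battery.\n\nIt does not cover water damage.",  # contains its own blank line
    "The battery lasts up to 10 hours.",
]

# After joining, nothing marks where one chunk ends and the next begins.
context = "\n\n".join(chunks)
print(context)
# The warranty covers the battery.
#
# It does not cover water damage.
#
# The battery lasts up to 10 hours.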
Structuring the prompt as a JSON is one approach to addressing the problem with context. The data from the previous prompt remains intact but is presented in a different structure, as shown below:
{
"instruction": "...",
"contexts": [
"{context_0}",
"{context_1}",
"{context_2}",
...
],
"question": "{question}"
}
The instruction is copied verbatim from the zero-shot prompt:
You are a helpful assistant that answers given question using ONLY PROVIDED CONTEXT.
You are not allowed to use any previous knowledge.
The output should be a well-formatted JSON object that conforms to the example below
("answer" is either string or null):
{"answer": "some answer"}
If you don't know the answer, return:
{"answer": null}
Notice how the chunks are now kept as separate contexts. Although it doesn’t seem like much, the results speak for themselves: with a 48.65% decrease in null answers, it’s evident that improved prompt design facilitates the LLM’s reasoning process.
Now, it’s time to get rid of formatting errors.
Formatting issues were examined in detail here. In short, they fell into two categories: the LLM returned the answer as a list or as a dictionary instead of a string. This makes it an ideal scenario for few-shot prompting, an in-context learning technique that incorporates input-output examples. The upgraded prompt format, shown in the JSON below, now includes an examples field:
{
"instruction": "...",
"examples": [
...
],
"contexts": [
"{context_0}",
"{context_1}",
"{context_2}",
...
],
"question": "{question}"
}
The first example specifies that the LLM should not return the answer as a list, even though lists are often the most intuitive format for enumeration:
{
"input": {
"contexts": [
"A car is a machine designed for transportation. It includes various components that enable its operation. The engine powers the car, while the wheels allow it to move. The transmission transfers power from the engine to the wheels. To ensure safety, the brakes are used to stop or slow down the car, and the steering wheel lets the driver control its direction."
],
"question": "What are the main parts of a car?"
},
"incorrect_output_format": {
"answer": ["engine", "wheels", "transmission", "brakes", "steering wheel"]
},
"correct_output_format": {
"answer": "The main parts of a car are the engine, wheels, transmission, brakes, and steering wheel."
}
}
The second example resolves the issue of answers being returned in dictionary format:
{
"input": {
"contexts": [
"Different activities and industries contribute varying amounts of greenhouse gas emissions. For instance, a coal-fired power plant produces around 2.2 pounds of CO₂ per kilowatt-hour of electricity, while natural gas power plants emit about half that amount at 1.1 pounds per kilowatt-hour. Transportation also contributes significantly, with the average gasoline-powered car emitting approximately 4.6 metric tons of CO₂ annually. Air travel is another major source, with long-haul flights emitting about 0.2 metric tons of CO₂ per 1,000 kilometers per passenger."
],
"question": "How do emissions from coal-fired power plants compare to those from natural gas power plants?"
},
"incorrect_output_format": {
"answer": {
"coal_fired_power_plant_emissions": {
"per_kWh": "2.2 pounds of CO₂"
},
"natural_gas_power_plant_emissions": {
"per_kWh": "1.1 pounds of CO₂"
}
}
},
"correct_output_format": {
"answer": "Coal-fired power plants produce about 2.2 pounds of CO₂ per kilowatt-hour, while natural gas power plants emit approximately 1.1 pounds of CO₂ per kilowatt-hour."
}
}
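Extending the builder from the JSON-based prompt, the few-shot variant only adds the examples field; again, this is an assumed implementation, not the article's code:
import json

INSTRUCTION = "..."  # same instruction text as in the zero-shot prompt
FEW_SHOT_EXAMPLES: list[dict] = []  # fill with the two example objects shown above

def build_few_shot_prompt(question: str, chunks: list[str]) -> str:
    payload = {
        "instruction": INSTRUCTION,
        "examples": FEW_SHOT_EXAMPLES,
        "contexts": chunks,
        "question": question,
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)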
Answering questions using the few-shot prompting technique removes the remaining formatting errors. All errors are gone, which is nice, but there’s a caveat: the examples provided for in-context learning need to be carefully designed, as they may unintentionally introduce undesired side effects. For instance, consider the answer from the first example:
{
"answer": "The main parts of a car are the engine, wheels, transmission, brakes, and steering wheel."
}
While the output format is correct, the LLM may interpret it as a requirement for longer answers. Conversely, for certain scenarios, a brief and concise response may be more suitable:
{
"answer": "Engine, wheels, transmission, brakes, and steering wheel."
}
In the end, nuances like this depend on the use case, but it’s important to be aware of them.
An alternative way of handling formatting issues is a chat-like approach, where previous messages are included in the prompt and the LLM is instructed to fix the formatting, as sketched below.
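A rough sketch of that idea, assuming the openai client from the earlier snippet and a hypothetical repair message (neither is shown in the article):
import json
from openai import OpenAI

client = OpenAI()

FIX_MESSAGE = (
    'Your previous answer was not formatted correctly. '
    'Return only a JSON object of the form {"answer": <string or null>}.'
)

def ask_with_repair(prompt: str, max_retries: int = 1) -> str | None:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries + 1):
        reply = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages
        ).choices[0].message.content
        try:
            answer = json.loads(reply)["answer"]
            if answer is None or isinstance(answer, str):
                return answer  # correctly formatted
        except (json.JSONDecodeError, KeyError, TypeError):
            pass
        # Keep the faulty reply in the chat history and ask the model to fix it.
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": FIX_MESSAGE},
        ]
    return None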
To tie everything together, different prompt optimizations will be compared. The following table summarizes the outcomes of answer generation for the basic strategy with the parameters mentioned earlier:
chunk_size = 256
chunk_overlap = 0
limit = 8
Image 1 - Prompt optimization impact on answer generation backed by basic strategy
Applying these prompt optimizations to other strategy configurations with larger context yields even better results. One such strategy is auto-merging, whose chosen configuration operates with a minimum context size of 2,048 tokens and a maximum context size of 8,192 tokens, using the following parameters:
chunk_size = 1024
child_chunk_size = 128
child_chunk_overlap = 0
merging_threshold = 0.25
limit = 16
Image 2 - Prompt optimization impact on answer generation backed by auto-merging strategy
Two prompt optimization techniques were thoroughly examined. The initial zero-shot prompt struggled with many unanswered, i.e. null, responses. Since the goal is to maximize the number of answered questions, those results were not acceptable.
The shift to a JSON-based prompt format yielded significant improvements, particularly in presenting the context as a list of distinct, relevant chunks (contexts). This organization enhanced the LLM’s reasoning by preventing chunks from being blended, thereby simplifying data extraction.
In-context learning examples were incorporated to fix incorrectly formatted outputs and were easily adapted to the JSON format. Yet, the use of examples can lead to negative side effects, as the LLM may extrapolate unwanted behaviors.
Ultimately, providing well-structured prompts is critical, particularly when dealing with long and noisy contexts.
Optimizing prompts for Large Language Models (LLMs) in Retrieval-Augmented Generation (RAG) systems is crucial for reducing hallucinations and enhancing context alignment. Clear instructions and a structured JSON-based prompt design effectively separate context chunks, while few-shot learning examples improve response quality and eliminate formatting errors. Extensive testing demonstrates that well-structured prompts significantly boost performance, especially when handling lengthy and complex contexts.