
Decoding Ragas Using LangSmith

Exploring the integration of LangSmith, a monitoring platform for AI language systems, with Ragas, a framework for synthetic test set generation and RAG evaluation. LangSmith's detailed tracing capabilities are vital for tracking the dynamic state of a system in production, ensuring seamless performance and insightful analysis.

Development · 13 min read
Luka Panic
2024-06-12
Large Language Models
AI Blog Series
Retrieval Augmented Generation
LangSmith
Monitoring

Introduction

Previous articles explored synthetic test set generation and evaluation in Ragas, relying on the underlying prompts to make these processes easier to grasp. Now it’s time to investigate how to capture those prompts in the first place. Obtaining complete logs of generation/evaluation iterations is beneficial in multiple ways, from cost and performance analysis to hands-on examination and comprehension of the mechanisms running the Ragas framework. Understanding how these mechanisms operate is crucial for reducing failures, especially in test set generation. This is where LangSmith comes in: a platform designed to streamline and enhance the monitoring of AI language systems.

Four pillars of LangSmith

Without further ado, according to the LangChain team, LangSmith is

“…a platform to help developers close the gap between prototype and production”

Although creating a basic RAG pipeline for demo purposes may be straightforward, scaling it to a production level introduces significantly more complex challenges. When initial prompts written for trivial examples start breaking, developers are forced to make adjustments through further experiments with the problematic user input until they achieve satisfactory results. After fixing one issue, a new set of potential problems arises. Will the rest of the system function properly after implementing changes to address the existing problem? This leads us into a loop of adjustments and debugging.

LangSmith comes into play by offering a suite of utilities to speed up and automate certain parts of the described process. The four pillars of LangSmith are:

  • debugging
  • testing
  • evaluating
  • monitoring

Moving forward, the focus shifts to the monitoring aspect.

Basic monitoring concepts

Regardless of which aspect of an LLM-oriented app is being monitored, whether it be test set generation, evaluation, or end-user usage, the basic components of LangSmith’s monitoring system remain the same. Therefore, it’s important to become familiar with these.

We begin with one of the most frequently used terms in LangSmith: the trace. A single trace represents a full execution of the pipeline and may include one or multiple LLM calls. Traces are organized within projects, which are straightforward to create, requiring only a project's name. The UI for each project features four main tabs:

  • Runs: contains a table with traces.
  • Threads: contains a table with threads (this will be discussed later in the article).
  • Monitor: features diagrams with various metrics, such as trace count, token count, latency, etc.
  • Setup: comprises information about the required environment variables and boilerplate code.

Project in LangSmith Image 1 - Project in LangSmith
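For completeness, a project can also be created from code rather than through the UI; the following is a minimal sketch, assuming the langsmith Python SDK is installed and the API key is already exported (the project name here is only an example):

from langsmith import Client

client = Client()

# Creating a project requires only a name, mirroring the UI flow.
client.create_project("ragas-tracing-demo")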

The Runs tab offers a variety of filtering options and information about each run as shown in the following image:

Runs Image 2 - Runs

This image includes traces for the test set generation process, but let's set that aside for now and focus on the concepts instead. The first column represents the status of an LLM call (success, pending or error), followed by the LLM's name. Input and output are key points of interest. Additionally, there is information such as time, latency, token count, and cost. Comparing the cost and latency between different models is useful for cost and performance analysis. Selecting any record in the table opens the following screen:

Run Image 3 - Run

This screen encompasses full input and output messages under the Run tab. We will skip the Feedback section and move on to Metadata, which encapsulates system-oriented information such as:

  • Invocation parameters
  • Metadata
  • Options
  • Runtime

This concise overview of the monitoring concepts provides a foundation for exploring Ragas' intricate generation and evaluation mechanisms in detail.

LangSmith integration with Ragas

Currently, Ragas integrates with LangSmith for tracing the evaluation process. The explodinggradients team plans to implement the use of Ragas metrics in LangChain evaluations in the future. Meanwhile, although not officially supported, tracing the synthetic test set generation process is possible with a little bit of hacking.

Ragas evaluation tracing

Before proceeding to evaluation, it is a good idea to create a dedicated project for tracking evaluations to keep things organized. Then, referring to the Setup tab, add the following variables to the environment:

LANGCHAIN_TRACING_V2 = true
LANGCHAIN_ENDPOINT = "https://api.smith.langchain.com"
LANGCHAIN_API_KEY = "<your-api-key>"
LANGCHAIN_PROJECT = "<your-project-name>"
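If you prefer to keep the configuration in code, the same variables can be set through os.environ before anything that triggers tracing runs; a minimal sketch with placeholder values:

import os

# Same configuration as above, set programmatically; replace the placeholders.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "<your-project-name>"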

The upcoming Python code showcases a boilerplate configuration needed to track the Ragas evaluation with LangSmith:

from datasets import load_dataset
from langchain.callbacks.tracers import LangChainTracer
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import *
from ragas.metrics.critique import *

# Load the evaluation split of the demo dataset.
amnesty_qa = load_dataset('explodinggradients/amnesty_qa', 'english_v2')
dataset = amnesty_qa['eval']

# The tracer sends run data to the project configured via the environment variables.
tracer = LangChainTracer()
llm = ChatOpenAI(...)
embeddings = OpenAIEmbeddings(...)

evaluate(
    dataset=dataset,
    metrics=[faithfulness, ...],
    llm=llm,
    embeddings=embeddings,
    callbacks=[tracer]
)

The explodinggradients/amnesty_qa dataset will be used for this demo, so feel free to explore it. This dataset consists of 20 samples, and here is a JSON representation of a single row to give you a sense of the subject:

{
  "question": "Which private companies in the Americas are the largest GHG emitters according to the Carbon Majors database?",

  "ground_truths": ["The largest private companies in the Americas that are the largest GHG emitters according to the Carbon Majors database are ExxonMobil, Chevron, and Peabody."],

  "answer": "According to the Carbon Majors database, the largest private companies in the Americas that are the largest GHG emitters are:\n\n1. Chevron Corporation (United States)\n2. ExxonMobil Corporation (United States)\n3. ConocoPhillips Company (United States)\n4. BP plc (United Kingdom, but with significant operations in the Americas)\n5. Royal Dutch Shell plc (Netherlands, but with significant operations in the Americas)\n6. Peabody Energy Corporation (United States)\n7. Duke Energy Corporation (United States)\n8. TotalEnergies SE (France, but with significant operations in the Americas)\n9. BHP Group Limited (Australia, but with significant operations in the Americas)\n10. Rio Tinto Group (United Kingdom/Australia, but with significant operations in the Americas)\n\nPlease note that the rankings may change over time as new data becomes available.",

  "contexts": ["The private companies responsible for the most emissions during this period, according to the database, are from the United States: ExxonMobil, Chevron and Peabody.\nThe largest emitter amongst state-owned companies in the Americas is Mexican company Pemex, followed by Venezuelan company Petróleos de Venezuela, S.A."]
}
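If you want to reproduce such a sample yourself, a row can be printed straight from the loaded dataset; a minimal sketch (the row index is arbitrary):

import json
from datasets import load_dataset

# Pretty-print a single row of the evaluation split, similar to the sample above.
amnesty_qa = load_dataset('explodinggradients/amnesty_qa', 'english_v2')
row = amnesty_qa['eval'][0]
print(json.dumps(row, indent=2, ensure_ascii=False))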

Moving on to the integration part, the LangChainTracer is a critical component that sends POST requests to the LangSmith endpoint, thereby delivering data that is displayed in the UI. Once instantiated, it is accepted by the evaluate method within the callbacks list.
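As a side note, if you would rather not rely solely on the LANGCHAIN_PROJECT variable, the tracer can be pointed at a project explicitly; a minimal sketch, assuming the project_name argument of LangChainTracer:

from langchain.callbacks.tracers import LangChainTracer

# Route runs to an explicitly named project instead of the one taken
# from the LANGCHAIN_PROJECT environment variable.
tracer = LangChainTracer(project_name="<your-project-name>")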

After the evaluation is finished, let’s head to the Runs tab:

Ragas evaluation run Image 4 - Ragas evaluation run

The evaluation took 199.89 seconds and cost approximately 0.21 USD with the gpt-3.5-turbo-1106 model. There is only one row in the table because all calls that belong to the same evaluation are grouped. This is particularly useful when there are multiple evaluations, as it prevents hundreds of traces from getting mixed up together. The final result, featuring all available metrics in Ragas, is shown below:

Ragas evaluation results Image 5 - Ragas evaluation results

As previously mentioned, the dataset consists of 20 samples (rows), so the final result for each metric is determined by averaging the values of that metric across all rows. For further analysis, let's select a single row; any row will do. Expanding row 1 reveals the following hierarchy:

 Ragas evaluation run expanded Image 6 - Ragas evaluation run expanded
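Before zooming into row 1, here is a short sketch of how the aggregate values above relate to per-row scores, assuming result holds the return value of the evaluate() call shown earlier:

# Assuming `result` holds the return value of the evaluate() call above.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=llm,
    embeddings=embeddings,
    callbacks=[tracer]
)

df = result.to_pandas()            # one row of metric values per sample
print(df["faithfulness"].mean())   # mean across the 20 rows
print(result["faithfulness"])      # aggregate score reported by Ragas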

Here is row 1 from the dataset:

{
  "question": "Which companies are the main contributors to GHG emissions and their role in global warming according to the Carbon Majors database?",
  
  "ground_truth": "According to the Carbon Majors database, the main contributors to GHG emissions and their role in global warming are fossil fuel companies. These companies, both state-owned and private, have produced almost a trillion tons of GHG emissions in 150 years. The database shows that 100 existing fossil fuel companies, along with eight that no longer exist, are responsible for 71% of all GHG emissions since 1988. In the Americas, the private companies that have contributed the most emissions are ExxonMobil, Chevron, and Peabody, all from the United States. Among state-owned companies in the Americas, the largest emitter is Mexican company Pemex, followed by Venezuelan company Petróleos de Venezuela, S.A. It is important to note that while people with fewer resources, particularly from countries in the global South, do not significantly contribute to climate change, they are the ones most affected by its impacts. Approximately half of the global population lives in areas that are \"very vulnerable\" to climate change, and it is people with limited development opportunities who face the greatest risks. This unequal impact disproportionately affects the human rights of those with fewer resources and greater vulnerability in the context of climate change in the global South. Additionally, between 2010 and 2020, human mortality due to climate disasters was 15 times higher in vulnerable regions and populations.",
  
  "answer": "According to the Carbon Majors database, the main contributors to greenhouse gas (GHG) emissions and their role in global warming are primarily fossil fuel companies. These companies extract, produce, and sell fossil fuels, which when burned release carbon dioxide (CO2) and other GHGs into the atmosphere. The top contributors include:\n\n1. Saudi Arabian Oil Company (Saudi Aramco): As the world's largest oil producer, Saudi Aramco is responsible for a significant portion of global GHG emissions. It is estimated to have contributed around 59.26 gigatons of CO2 equivalent (GtCO2e) between 1965 and 2017.\n\n2. Chevron Corporation: Chevron is one of the largest oil and gas companies globally and has contributed approximately 43.35 GtCO2e between 1965 and 2017.\n\n3. ExxonMobil Corporation: ExxonMobil is another major oil and gas company and has contributed around 41.90 GtCO2e between 1965 and 2017.\n\n4. BP plc: BP is a multinational oil and gas company and has contributed approximately 34.02 GtCO2e between 1965 and 2017.\n\n5. Royal Dutch Shell plc: Shell is one of the largest oil and gas companies globally and has contributed around 31.95 GtCO2e between 1965 and 2017.\n\n6. TotalEnergies SE: TotalEnergies is a French multinational oil and gas company and has contributed approximately 31.91 GtCO2e between 1965 and 2017.\n\n7. ConocoPhillips Company: ConocoPhillips is an American multinational energy corporation and has contributed around 20.67 GtCO2e between 1965 and 2017.\n\n8. PetroChina Company Limited: PetroChina is the largest oil and gas producer in China and has contributed approximately 17.26 GtCO2e between 1965 and 2017.\n\n9. Peabody Energy Corporation: Peabody Energy is the largest private-sector coal company globally and has contributed around 15.39 GtCO2e between 1965 and 2017.\n\n10. Glencore plc: Glencore is a multinational commodity trading and mining company and has contributed approximately 15.11 GtCO2e between 1965 and 2017.\n\nThese companies, along with others in the fossil",
  
  "contexts": [
    "In recent years, there has been increasing pressure on these major contributors to reduce their greenhouse gas emissions and transition to more sustainable practices. Many of these companies have made commitments to reduce their carbon footprint and invest in renewable energy sources. However, the role of these companies in global warming remains a contentious issue, with ongoing debates about their responsibility and accountability for the environmental damage caused by their operations.",
    
    "The Carbon Majors database identifies a small group of companies responsible for the majority of global greenhouse gas emissions. These companies include major oil and gas producers, as well as coal mining companies, whose activities have significantly contributed to climate change. The data from the Carbon Majors database highlights the disproportionate impact of these companies on the environment and the urgent need for action to address their emissions.",
    
    "- Fossil fuel companies, whether state or private, play a key role in the emissions that drive global warming.\n- According to the Carbon Majors database, 100 existing fossil fuel companies (and eight that no longer exist) have produced almost a trillion tons of GHG emissions in 150 years.\n- These 100 companies are responsible for 71% of all GHG emissions since 1988.\n- In the Americas, the private companies responsible for the most emissions during this period, according to the database, are from the United States: ExxonMobil, Chevron, and Peabody.\n- The largest emitter amongst state-owned companies in the Americas is Mexican company Pemex, followed by Venezuelan company Petróleos de Venezuela, S.A.\n- Despite the fact that people with fewer resources, in particular from countries in the global South, do not make significant contributions to climate change, it is these people who are most affected by it.\n- Approximately half of the global population lives in areas that are \"very vulnerable\" to climate change, and it is people with considerable development constraints who are most exposed.\n- This disproportionately affects the human rights of people with fewer resources and greater risk in the context of climate change in the global South.\n- Between 2010 and 2020, human mortality due to climate disasters was 15 times higher in vulnerable regions and populations."
  ]
}

The traces allow us to see the big picture of evaluation in Ragas, since they include an ordered sequence of the prompts used to obtain the result for each metric. For instance, LangSmith captured two traces for the faithfulness metric, which aligns with its explanation from the previous article. In particular, the statement extraction and faithfulness judgment prompts from the shortened version of the faithfulness schema were captured:

Faithfulness schema (shortened) Image 7 - Faithfulness schema (shortened)
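As a brief recap from the previous article, the two traces map to the metric's two steps: statements are first extracted from the answer and then each is judged against the context, with the final score being the supported fraction:

faithfulness = (number of answer statements supported by the retrieved context) / (total number of answer statements)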

Another interesting metric is context precision, which required three prompts and was represented as follows:

Context precision schema (shortened) Image 8 - Context precision schema (shortened)

The multiple LLM calls stem from the multiple contexts (shown in the row 1 JSON above); thus, the number of calls for this metric is not fixed, as it was for faithfulness.

Nothing except the final result was logged for answer similarity because it doesn't rely on the LLM. Answer correctness, in turn, reuses answer similarity and combines it with the F1 score, as sketched below.
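To make that combination concrete, here is a hedged sketch; the 0.75/0.25 weighting reflects the Ragas defaults to the best of my knowledge, so treat it as an assumption rather than a guarantee:

# Hedged sketch of answer correctness: a weighted average of the
# statement-level F1 score and the embedding-based answer similarity.
# The (0.75, 0.25) weights are assumed defaults; verify against the
# answer_correctness implementation in your Ragas version.
def answer_correctness_score(f1: float, similarity: float,
                             weights: tuple[float, float] = (0.75, 0.25)) -> float:
    w_f1, w_sim = weights
    return (w_f1 * f1 + w_sim * similarity) / (w_f1 + w_sim)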

Ragas test set generation tracing

Although not officially supported, tracing test set generation is even more important than tracing evaluation, since its cost is significantly higher. Because the generator methods do not accept the LangChainTracer the way evaluate does, we can pass it to LangChain's OpenAI clients instead:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.callbacks.tracers import LangChainTracer
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional

documents = ...  # load your documents
test_size = ...  # number of rows to be generated

tracer = LangChainTracer()

# Attach the tracer to both LLM clients so every call is logged.
generator_llm = ChatOpenAI(callbacks=[tracer], ...)
critic_llm = ChatOpenAI(callbacks=[tracer], ...)
embeddings = OpenAIEmbeddings(...)

test_generator = TestsetGenerator.from_langchain(
    generator_llm, critic_llm, embeddings
)
distributions = {
    simple: 0.25,
    reasoning: 0.25,
    multi_context: 0.25,
    conditional: 0.25
}
test_set = test_generator.generate_with_langchain_docs(
    documents, test_size, distributions
)

This code organizes the traces in a structure similar to the following examples (note: this is a shortened version, as there are many more traces):

Ragas test set generation runs Image 9 - Ragas test set generation runs

The absence of groups, in comparison to the evaluation tracing, is noteworthy: the evaluation employed a ragas evaluation group that aggregated all traces, whereas test set generation produces a flat list. This becomes unwieldy with a large volume of documents, making it difficult to navigate and analyze individual traces within a never-ending list. However, metadata offers a solution to this problem. It will be utilized within a concept that hasn't been covered yet, so let's proceed with that.

LangSmith leverages threads, akin to conversational messaging systems, to organize traces. Both critic_llm and generator_llm are assigned the same thread_id:

import uuid
# skipped the remaining imports and client instantiation...

for document in documents:
    # One thread per document; the id is stored as a string in the clients' metadata.
    thread_id = str(uuid.uuid4())

    generator_llm.metadata['thread_id'] = thread_id
    critic_llm.metadata['thread_id'] = thread_id

    test_generator = TestsetGenerator.from_langchain(
        generator_llm, critic_llm, embeddings
    )
    distributions = {
        simple: 0.25,
        reasoning: 0.25,
        multi_context: 0.25,
        conditional: 0.25
    }
    test_set = test_generator.generate_with_langchain_docs(
        [document], test_size, distributions
    )

Note that in the provided code, the test set samples are generated for each document one by one because we aim to store traces of each document in a separate thread. Nonetheless, there is a downside to this approach, or more precisely, a limitation. But first, let's dig a little deeper into Ragas.

During test set generation, Ragas splits documents into smaller chunks. When multiple documents are provided, these chunks end up together in the document store. This way, Ragas can generate complex questions based on chunks from different documents, a functionality that is considered very useful. On the other hand, there are cases when documents are not connected, or when one explicitly wants to generate questions from each document separately, without overlapping. In such cases, having a dedicated thread for each document can be useful.

Getting back to tracing, under the Runs tab, there is a filtering option:

Filtering runs Image 10 - Filtering runs

As the traces start appearing, thread_ids will become available, enabling the filtering of traces. If the operator selection seems confusing, remember that LangSmith's filtering UI maps onto a query language. Clicking the copy icon copies the following query to the clipboard:

and(eq(metadata_key, "thread_id"), eq(metadata_value, "24d537de-416c-40ec-96bc-549acc4549f1"))
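The same filter can also be used programmatically when pulling runs with the LangSmith SDK; a minimal sketch, assuming the filter parameter of Client.list_runs accepts this query string (the thread_id reuses the example above):

from langsmith import Client

client = Client()

# Fetch all runs whose metadata carries the thread_id of a single document.
runs = client.list_runs(
    project_name="<your-project-name>",
    filter='and(eq(metadata_key, "thread_id"), '
           'eq(metadata_value, "24d537de-416c-40ec-96bc-549acc4549f1"))',
)
for run in runs:
    print(run.name, run.run_type)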

In short, the is operator is essentially the equals operator. There is more to threads in LangSmith, though: a dedicated Threads tab lists them as shown:

Threads Image 11 - Threads

Expanding one of these threads results in the following screen:

Thread Image 12 - Thread

Each trace represents a standard input-output pair, where the HUMAN label corresponds to the system prompts from Ragas. Finally, we achieved a clean and organized 1:1 document-to-thread mapping. This setup allows us to analyze the complexity of the test set generation and inspect whether the generation fails on certain documents due to ethical concerns or content deemed inappropriate by the LLM, among other factors.

Conclusion

LangSmith offers a variety of features tailored to LLM-oriented applications, but this article primarily focuses on monitoring to provide strong evidence for the test set generation and evaluation schemas discussed in previous articles. The process began by composing formulas and prompts within schemas to ease the learning curve, followed by monitoring to demonstrate that these schemas accurately represent the system's actual behavior.

The setup of LangSmith for tracking Ragas evaluations and organizing traces into threads was demonstrated, enhancing clarity during test set generation. This detailed tracing is crucial for identifying issues and understanding system behavior.

Overall, the combination of LangSmith and Ragas not only enhances the monitoring process but also equips developers with essential tools to refine their AI-powered applications effectively. This approach allows for editing Ragas' native prompts and exploring the effects of various adjustments. For instance, concerns about aspect critique, as discussed in the evaluation article, can be addressed by adjusting and testing critique prompts using LangSmith's UI utilities. These capabilities may be further explored in future articles.

Summary

The article discusses the integration of LangSmith, a platform designed for monitoring AI language systems, with Ragas, a framework built for synthetic test set generation and RAG evaluation. LangSmith aids developers in transitioning from prototypes to production by offering utilities for debugging, testing, evaluating, and monitoring AI systems. The monitoring aspect is crucial for cost and performance analysis and for understanding the underlying mechanisms of Ragas.

The integration with Ragas involves tracing the evaluation process. With some modifications, it’s possible to trace the synthetic test set generation to a certain extent. This article explains how to set up a dedicated project for evaluation tracing using LangSmith and provides a Python code example for tracking Ragas evaluations. Additionally, it discusses organizing traces into threads, each corresponding to a document, to trace the test set generation process that includes multiple documents.

The ultimate goal of this article is to complete the narrative begun with Ragas Test Set Generation Breakdown and Ragas Evaluation: In-Depth Insights, creating a cohesive unit designed to provide a deeper understanding of the advanced underlying mechanisms of Ragas.
