Evaluating a RAG Agent using Langtrace

Karthik Kalyanaraman

Cofounder and CTO

Mar 27, 2025

Introduction

In this post, we’ll walk through how to build a simple Retrieval-Augmented Generation (RAG) agent using Agno and Weaviate, and then trace and evaluate its performance with Langtrace.

Setup

We start by setting up our environment. First, we import the necessary libraries and initialize tracing with Langtrace.

For this, we will use the simple RAG example from Agno's documentation for Weaviate.
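Before running the example, a couple of prerequisites are assumed: a Langtrace API key, an API key for your model provider (OpenAI in Agno's default configuration), and a local Weaviate instance running (for example via Docker). A minimal .env file might look like the following; the variable names shown are the standard ones read by the respective SDKs, but adjust them to your setup.

# .env (placeholder values; replace with your own keys)
LANGTRACE_API_KEY=your-langtrace-api-key
OPENAI_API_KEY=your-openai-api-key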

from agno.agent import Agent
from agno.knowledge.pdf_url import PDFUrlKnowledgeBase
from agno.vectordb.search import SearchType
from agno.vectordb.weaviate import Distance, VectorIndex, Weaviate
from dotenv import find_dotenv, load_dotenv
from langtrace_python_sdk import langtrace

_ = load_dotenv(find_dotenv())

# Initialize Langtrace so that subsequent LLM, tool, and vector DB calls are traced
langtrace.init()

# Connect to a local Weaviate instance with hybrid search over an HNSW index
vector_db = Weaviate(
    collection="recipes",
    search_type=SearchType.hybrid,
    vector_index=VectorIndex.HNSW,
    distance=Distance.COSINE,
    local=True,
)
# Create knowledge base
knowledge_base = PDFUrlKnowledgeBase(
    urls=["https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf"],
    vector_db=vector_db,
)
knowledge_base.load(recreate=False)  # Comment out after first run

# Create and use the agent
agent = Agent(
    knowledge=knowledge_base,
    search_knowledge=True,
    show_tool_calls=True,
)

agent.print_response("How to make Gluai Buat Chi?", markdown=True)

The knowledge base PDF contains Thai recipes. Once the agent is set up, we can run it and retrieve responses.
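If you would rather capture the answer programmatically than print it to the console, Agno agents also expose a run() method that returns a response object. A minimal sketch, assuming the same agent as above:

# Run the agent and capture the response instead of printing it.
# In Agno, run() returns a RunResponse whose .content holds the model's answer.
response = agent.run("How to make Gluai Buat Chi?")
print(response.content)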

Tracing and Evaluation

Once the agent starts running, we can observe traces generated by Langtrace. These traces provide insights into various aspects of the agent's interactions, from tool calls to the execution times of individual components.
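To make traces easier to navigate, you can optionally group everything a single question produces (LLM calls, tool calls, vector DB queries) under one named root span using the Langtrace SDK's with_langtrace_root_span decorator. A minimal sketch, assuming the agent from the setup above:

from langtrace_python_sdk import with_langtrace_root_span

# Group all spans emitted while answering one question under a single named root span
@with_langtrace_root_span("rag_recipe_query")
def ask_recipe_question(question: str):
    return agent.run(question)

ask_recipe_question("How to make Gluai Buat Chi?")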

We’ll use the Langtrace interface to inspect traces and then dive into the Human Evaluations tab to evaluate the agent’s performance across several metrics. These include:

  • LLM Metrics: Model Accuracy, Query Relevance, Tool Suggestion

  • VectorDB Metrics: Retrieval Accuracy

All of these metrics are evaluated on a binary scale (pass/fail) for simplicity.

Categories of Evaluation

Langtrace allows us to categorize the evaluations into three distinct layers: LLM, VectorDB, and Framework. This is especially useful for RAG systems, where performance can be affected by multiple components. By categorizing the evaluations, you can identify and optimize specific areas, ensuring that the system performs well overall.

Manual Evaluation

With the traces in place, we go through each trace and manually evaluate the appropriate layer (span) to get a better understanding of the performance. Each evaluation helps us to fine-tune the system, ensuring high accuracy across the board.

Scoring and Confidence

Langtrace calculates the overall scores for each metric, showing both the median and average scores. The confidence score reflects the reliability of the evaluation. The more traces we evaluate, the higher the confidence in the results.
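Langtrace computes these rollups for you, but the underlying idea is easy to illustrate. The sketch below shows how binary pass/fail evaluations might aggregate into an average, a median, and a sample-size-based confidence signal; it is an illustrative approximation, not Langtrace's exact formula.

from statistics import mean, median

# Hypothetical pass/fail evaluations for one metric (1 = pass, 0 = fail)
retrieval_accuracy_scores = [1, 1, 0, 1, 1, 1, 0, 1]

avg = mean(retrieval_accuracy_scores)    # 0.75
med = median(retrieval_accuracy_scores)  # 1.0
# More evaluated traces means higher confidence; here we cap at 1.0 for 50+ samples
confidence = min(len(retrieval_accuracy_scores) / 50, 1.0)

print(f"avg={avg:.2f}, median={med}, confidence={confidence:.2f}")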

Conclusion

Using Langtrace with Agno and Weaviate to build and evaluate RAG agents provides a robust way to track performance and pinpoint areas for improvement. The detailed trace information, combined with structured human evaluations, offers a comprehensive view of the system’s performance, helping developers optimize their models and vector databases more effectively.

Ready to try Langtrace?

Try out the Langtrace SDK with just 2 lines of code.

Want to learn more?

Check out our documentation to learn more about how Langtrace works.

Join the Community

Join our Discord community to ask questions and connect with other users.