Why You Need Evaluations in Your LLM App and How Langtrace Can Help

Oluwadabira Akinwumi

Software Engineer

Jul 31, 2024

Since the release of ChatGPT by OpenAI, Large Language Models (LLMs) have been all the rage. Numerous LLMs have since been launched, with advancements coming from both major tech companies and open-source communities. Notable examples include Anthropic’s Claude, Mistral, and Google’s Bard.

As more companies integrate Large Language Models into their products and services, it becomes essential to evaluate the accuracy of these models' responses and refine prompts to achieve optimal results. This is where LLM evaluations come in, and Langtrace makes it easy to run them.

But why would I even need to run evaluations?

Let’s set the stage: we are an EdTech company building an AI chatbot to help students with their homework. There is a multitude of large language models we could build our chatbot with, and we need to factor in the cost of each model while also making sure our chatbot is accurate.

How can we choose the appropriate model? By running evaluations.

Running evaluations lets us benchmark our chatbot's responses against other models and test its accuracy.

For the sake of this tutorial we are going to use OpenAI’s GPT-3.5-turbo model to build our chatbot. We will run an evaluation against OpenAI’s GPT-4o model and compare the results. Let’s get started.

Let’s go over how to run evaluations for our AI chatbot in Langtrace using Python.

All the code we use in this tutorial is available in this repository.

Steps

First, if you have not signed up for Langtrace, head over and create an account.

After you have signed up, create a new project and generate an API key. Make sure to keep that key; you will need it later.

Note: If you need to regenerate an API key for an existing project, you can do so in the settings tab of your project. However, your previous API key will no longer work.

Follow along to run evaluations on our AI chatbot.

  • We need to install the Langtrace Python SDK and Inspect AI.

pip install langtrace-python-sdk
pip install inspect-ai==0.3.14

Make sure you add the LANGTRACE_API_KEY to your environment variables as shown below.

export LANGTRACE_API_KEY=<your-api-key>
  • Initialize Langtrace in your project as shown in the snippet below. Make sure to replace the YOUR_LANGTRACE_API_KEY placeholder with the actual API key you generated at the start of this article.
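
A minimal sketch of this step, assuming the langtrace.init() entry point exposed by the langtrace-python-sdk:

from langtrace_python_sdk import langtrace

# Replace the placeholder with the API key you generated earlier,
# or read it from the LANGTRACE_API_KEY environment variable instead.
langtrace.init(api_key='YOUR_LANGTRACE_API_KEY')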

  • Run your AI chatbot

python3 main.py
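
If you are following along without cloning the repository, main.py might look something like the sketch below. It is a rough, illustrative version that assumes the official openai Python package and a simple terminal loop; the chatbot in the linked repository may differ.

import os

# Initialize Langtrace before importing the OpenAI client so its calls are traced.
from langtrace_python_sdk import langtrace

langtrace.init(api_key=os.environ["LANGTRACE_API_KEY"])

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a friendly tutor who helps students with their homework."

def ask(question: str) -> str:
    # Send the student's question to GPT-3.5-turbo and return the answer.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    while True:
        question = input("Ask a homework question (or type 'quit'): ")
        if question.strip().lower() == "quit":
            break
        print(ask(question))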
  • As you ask your bot questions, you will see the traces appear in Langtrace.

  • Create a dataset. After you create the dataset, copy the dataset ID; you will need it later.

  • Head over to the annotations tab in your Langtrace Dashboard. Select your traces and add them to your newly created dataset.

  • Write an evaluation script to run your evaluations. Let’s call it example_eval.py

from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import chain_of_thought, self_critique

@task
def example_eval():
    return Task(
        dataset=csv_dataset("langtracefs://<datasetId>"),  # replace <datasetId> with your own dataset ID
        plan=[
            chain_of_thought(),
            self_critique()
        ],
        scorer=model_graded_qa()
    )
  • Run the evaluation script from your shell.

For the purpose of this tutorial we are running the evaluation script against the GPT-4o model. However, we can run our evaluation against a variety of models; to learn more, check out the Inspect AI documentation.

inspect eval example_eval.py --model openai/gpt-4o --log-dir langtracefs://<datasetId>

Note: In order to run the command above, you will need to add your OpenAI API key to your environment variables like so:

export OPENAI_API_KEY=<your_openai_api_key>
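
The same evaluation can also be pointed at models from other providers. As an illustrative example (assuming you have an Anthropic API key and that your version of Inspect AI supports this model identifier), you could benchmark against a Claude model instead:

export ANTHROPIC_API_KEY=<your_anthropic_api_key>
inspect eval example_eval.py --model anthropic/claude-3-5-sonnet-20240620 --log-dir langtracefs://<datasetId>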
  • View your results

Head over to the evaluations tab on your Langtrace dashboard to view your results.

The target column contains the responses from our chatbot, while the output column contains the responses from the GPT-4o model that we ran our evaluation against.

Conclusion

Our AI chatbot used OpenAI’s GPT-3.5-turbo model, and we ran our evaluation against OpenAI's GPT-4o model. From our evaluation we can see that GPT-3.5-turbo is just as accurate as GPT-4o for this use case and is therefore appropriate for our chatbot.

From a cost standpoint:

  • GPT-3.5-turbo: $3 per 1 million input tokens and $6 per 1 million output tokens

  • GPT-4o: $5 per 1 million input tokens and $15 per 1 million output tokens

Since GPT-3.5-turbo is just as accurate for our purpose but much cheaper, we can go with it and save a significant amount of money. By running evaluations in Langtrace, we were able to make that decision with confidence.
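
To put the difference in concrete terms, here is a quick back-of-the-envelope calculation using the prices above and a purely hypothetical monthly volume of 50 million input tokens and 10 million output tokens:

# Hypothetical monthly usage for the chatbot (illustrative numbers only).
input_tokens_m = 50    # millions of input tokens per month
output_tokens_m = 10   # millions of output tokens per month

# Cost per month, using the per-1M-token prices listed above (USD).
gpt_35_turbo_cost = input_tokens_m * 3 + output_tokens_m * 6    # $210
gpt_4o_cost = input_tokens_m * 5 + output_tokens_m * 15         # $400

print(f"GPT-3.5-turbo: ${gpt_35_turbo_cost}/month")
print(f"GPT-4o:        ${gpt_4o_cost}/month")
print(f"Savings:       ${gpt_4o_cost - gpt_35_turbo_cost}/month")  # $190, nearly half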

Of course, this is just one use case, but in general, when working with Large Language Models at scale, evaluations are invaluable both for ensuring the accuracy of our models and for choosing between them.

Ready to Get the Best Out of Your LLMs?

Sign up for Langtrace today and start running comprehensive evaluations to ensure the accuracy and effectiveness of your models. Don’t ship on just vibes. Make data-driven decisions and be more confident in your LLM applications.

Support Our Open Source Project

If you found this tutorial helpful, please consider giving us a star on GitHub. Your support helps us continue to improve and share valuable resources with the community. Thank you!

Ready to try Langtrace?

Try out the Langtrace SDK with just 2 lines of code.

Want to learn more?

Check out our documentation to learn more about how Langtrace works.

Join the Community

Check out our Discord community to ask questions and meet customers.