
How To Evaluate LLM Powered Applications

Karthik Kalyanaraman

Evaluating LLMs

Evaluating applications powered by Large Language Models (LLMs) is different from ranking models on evaluation leaderboards like the Hugging Face Open LLM Leaderboard.

If you are an enterprise leveraging or looking to leverage LLMs, spend very little time on model evaluation leaderboards. Pick the most powerful model to start with and invest in evaluating LLM responses in the context of your product and use case.

Know your metrics

Ultimately, what matters is how your product performs and whether users are getting a high-quality experience. This means evaluating metrics such as the relevance, accuracy, and fluency of the LLM outputs.

  • Are you building a summarization tool for content creation? Consider using custom datasets to evaluate and fine-tune your model. Manually evaluate the results and form your own thesis of what a good summary looks like, one that solves your user's pain point.
  • Are you building AI-powered chatbots for customer support? Look at a few responses, gather human feedback, and determine the key metrics that matter most for your use case.

It's important to keep things simple and hyper-optimize for your use case before jumping into measuring every metric you find on the internet.

Write unit tests to capture basic assertions

Basic assertions include things like:

  • looking for a specific word in every response.
  • making sure the generated response stays within a specific word count.
  • making sure the generated response costs less than $x and uses fewer than n tokens.

These kinds of unit tests act as a first line of defense and help you catch basic issues quickly. Incorporating them into your evaluation framework ensures your checks are effective from the start. If you are using Python, pytest is enough to write these simple unit tests; there is no need to buy or adopt any fancy tools for this.
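For example, a minimal pytest sketch might look like the following; generate_summary() and all of the thresholds are placeholders for your own application code and budgets, not part of any real library:

```python
# test_basic_assertions.py -- a minimal pytest sketch. generate_summary()
# below is a stand-in for your real LLM call; swap it for your application
# code and adjust the thresholds to your own budgets.


def generate_summary(text: str) -> dict:
    """Placeholder for the real LLM call. Assumed to return the generated
    text plus the token usage and cost reported by your provider."""
    return {"text": "Summary: revenue grew 12% year over year.", "total_tokens": 120, "cost_usd": 0.0004}


def test_contains_required_word():
    result = generate_summary("Quarterly earnings report: revenue grew 12%...")
    # First line of defense: every response must carry the expected marker word.
    assert "summary" in result["text"].lower()


def test_respects_word_count():
    result = generate_summary("Quarterly earnings report: revenue grew 12%...")
    assert len(result["text"].split()) <= 150  # hypothetical word budget


def test_stays_within_cost_and_token_budget():
    result = generate_summary("Quarterly earnings report: revenue grew 12%...")
    assert result["total_tokens"] < 1000  # hypothetical token budget
    assert result["cost_usd"] < 0.01      # hypothetical cost budget
```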

Use LLMs to evaluate the outputs

One popular approach these days is to use a more powerful LLM to evaluate (or grade) the output of the LLM in use. This works well when you clearly know which metrics you care about, which are often somewhat subjective and specific to your use case.

The first step is to write a grading prompt that a more powerful LLM can use to score the outputs.
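As a rough sketch, assuming a summarization use case and a 1-5 rubric (both of which you would adapt to your own metrics), the grading prompt and the glue code around it might look like this; call_grader_llm() is a hypothetical stand-in for whichever provider client you use:

```python
# A minimal LLM-as-judge sketch. The rubric, the 1-5 scale, and
# call_grader_llm() are illustrative assumptions -- adapt the criteria to the
# metrics you actually care about and wire in your own provider client.
import json

GRADER_PROMPT = """You are grading the output of a summarization assistant.

Input document:
{document}

Generated summary:
{summary}

Score the summary from 1 (poor) to 5 (excellent) on each criterion:
- relevance: does it cover the key points of the document?
- accuracy: is every claim supported by the document?
- fluency: is it clear and well written?

Respond with JSON only, e.g. {{"relevance": 4, "accuracy": 5, "fluency": 4}}."""


def call_grader_llm(prompt: str) -> str:
    """Placeholder for a call to a more powerful model through your
    provider's chat completion API. Returns the raw model response text."""
    raise NotImplementedError


def grade(document: str, summary: str) -> dict:
    raw = call_grader_llm(GRADER_PROMPT.format(document=document, summary=summary))
    return json.loads(raw)  # e.g. {"relevance": 4, "accuracy": 5, "fluency": 4}
```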

There are nice open-source tools like Promptfoo and Inspect AI that already have built-in support for model-graded evaluations and unit tests, which are a good place to start.

Collect User Feedback

This is easier said than done, especially for new products that do not yet have enough users to gather quality feedback from. But it's important to make contact with reality as quickly as possible and get creative about collecting this feedback, for example by using the product yourself or asking your network, friends, and family to use it.

The goal here is to set up a system where you can diligently track the feedback and constantly tweak and iterate on the quality of the outputs. Establishing this feedback loop is extremely important.
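One lightweight way to start, assuming nothing more than a local JSONL file (the store and the field names below are assumptions, not a prescription), is to append every piece of feedback to a log you review regularly:

```python
# A minimal feedback-tracking sketch. The JSONL file and the record fields
# (a thumbs up/down plus a free-text comment) are assumptions -- use whatever
# store and schema fit your product.
import json
import time
from pathlib import Path

FEEDBACK_LOG = Path("feedback.jsonl")


def record_feedback(prompt: str, response: str, thumbs_up: bool, comment: str = "") -> None:
    entry = {
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        "thumbs_up": thumbs_up,
        "comment": comment,
    }
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")


# Usage: wire this into your UI's feedback buttons.
record_feedback("Summarize: ...", "Summary: ...", thumbs_up=False, comment="missed the key point")
```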

Look at your data

No matter how many charts and visualizations you can create on top of your data, there is no substitute for looking at your data, both test and production data. In some cases this may not be possible when you are operating in a highly secure or private environment. But you need to figure out a way to collect and look at all the LLM generations closely, especially in the early days. This will not only tell you about the quality of the outputs users are experiencing, but also push you toward identifying which metrics actually make sense for your use case.

Manually Evaluate

LLM-based evaluations are not foolproof, and you need to continuously tweak and improve the prompts and grading scale based on data to mitigate issues like hallucinations and toxicity in outputs. The way to collect this data is by manually evaluating the outputs yourself. This will help you understand how far LLM evals are drifting from the real criteria you want to evaluate against. It's important to measure this drift and make sure LLM evals track closely with manual evals most of the time.
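A simple way to quantify that drift, assuming you keep both the LLM's grade and your own grade for each sample on the same scale, is to track how often and by how much they disagree; the 0.5-point tolerance below is an arbitrary illustrative threshold:

```python
# A minimal drift check between LLM-assigned grades and manual grades,
# assuming both use the same 1-5 scale. The 0.5-point agreement tolerance
# is an arbitrary illustrative threshold.
def eval_drift(llm_grades: list[float], manual_grades: list[float]) -> dict:
    assert llm_grades and len(llm_grades) == len(manual_grades)
    diffs = [abs(a - b) for a, b in zip(llm_grades, manual_grades)]
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "agreement_rate": sum(d <= 0.5 for d in diffs) / len(diffs),
    }


# Example: if agreement keeps dropping over time, revisit the grading prompt.
print(eval_drift([4, 5, 3, 2], [4, 4, 3, 2]))  # {'mean_abs_diff': 0.25, 'agreement_rate': 0.75}
```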

Save your model parameters

Saving the model parameters you are using and tracking responses alongside them helps you measure the quality of your product for that specific configuration. This becomes especially useful when you notice a quality regression after upgrading to a new model version or swapping in a completely different model.
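A sketch of what this could look like in practice is below; the log file, the field names, and the example parameter values are all illustrative assumptions:

```python
# A minimal sketch of logging the model parameters next to every response so
# a quality regression can be traced back to a specific configuration. The
# JSONL file, field names, and example parameter values are assumptions.
import json
import time

GENERATION_LOG = "generations.jsonl"


def log_generation(prompt: str, response: str, model_params: dict) -> None:
    record = {
        "timestamp": time.time(),
        "model_params": model_params,  # e.g. model name, temperature, max_tokens
        "prompt": prompt,
        "response": response,
    }
    with open(GENERATION_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")


# Usage example with assumed parameter names:
log_generation(
    prompt="Summarize: ...",
    response="Summary: ...",
    model_params={"model": "gpt-4o", "temperature": 0.2, "max_tokens": 256},
)
```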

Email us your thoughts at [email protected]. We would love to hear about your experience managing the quality, accuracy, and benchmarking of your LLM-powered product, and about any comparative analysis methods you use.

About Karthik Kalyanaraman

Cofounder and CTO - Langtrace AI, Scale3 Labs