
The business leader's guide to AI evals




In business, you’ve always been told to know your numbers—sales figures, revenue, engagement metrics. As AI integrates deeper into your strategy, it's time to add evaluations (evals) to that list. Without them, you won't have a clear understanding of whether your AI is functioning properly, staying aligned with business goals, and avoiding risks.


Clem Delangue, co-founder & CEO at Hugging Face, sums it up: “Evaluation is one of the most important steps, if not the most important, in AI.” Without a clear eval strategy, your AI application could drift off course, costing you time, money, and credibility.


This is an emerging area of research, and leaders like Andrew Ng have identified evaluation as a barrier to improving AI systems. Say you have an AI agent and you add another agent to keep an eye on it so the results are more accurate. How will you know whether the new agent is working? And how would you improve it?


I’ve spent the last several months interviewing AI and ML experts, researching to learn more about AI, and sharing what I learn (much like the Feynman technique). I found this one pretty tricky, so please let me know how I did and how I could improve it.


What Are Evaluations (Evals)?

According to OpenAI, evaluation means testing the quality of outputs from your AI models and applications. Evaluations ("evals") validate that your LLMs are stable, reliable, and aligned with business needs.


There are two key evaluation types, illustrated in the short sketch after the list:

  1. Model Evaluation – Measures the LLM’s output accuracy.

  2. System Evaluation – Tests the overall system interacting with your LLM (e.g., through APIs).
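
To make the distinction concrete, here’s a minimal Python sketch. The call_model and call_app_endpoint functions are hypothetical stand-ins for a direct LLM call and your application’s end-to-end endpoint; swap in your own.

    # Minimal sketch: model evaluation vs. system evaluation.
    # call_model and call_app_endpoint are hypothetical placeholders; replace
    # them with your provider's SDK call and your application's API.

    def call_model(prompt: str) -> str:
        """Direct LLM call (hypothetical placeholder)."""
        return "Paris"

    def call_app_endpoint(question: str) -> str:
        """End-to-end call through the full application (hypothetical placeholder)."""
        return "The capital of France is Paris."

    # A tiny "golden" set of questions with pre-approved answers.
    golden = [{"prompt": "What is the capital of France?", "expected": "Paris"}]

    # Model evaluation: score the raw model output against the expected answer.
    model_accuracy = sum(
        call_model(item["prompt"]).strip().lower() == item["expected"].lower()
        for item in golden
    ) / len(golden)

    # System evaluation: check the whole pipeline the user actually interacts with.
    system_pass_rate = sum(
        item["expected"].lower() in call_app_endpoint(item["prompt"]).lower()
        for item in golden
    ) / len(golden)

    print(f"Model accuracy: {model_accuracy:.0%} | System pass rate: {system_pass_rate:.0%}")

The same golden set can drive both checks, which is why many teams build that dataset first.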


Types of Evaluations

There are several ways to evaluate your AI models, each with different benefits:


  • LLM as a judge: Using another LLM to assess generated outputs (see the sketch after this list).

  • User Feedback: Having humans evaluate AI responses.

  • Golden Datasets: Comparing LLM results to pre-approved, high-quality answers.

  • Business Metrics: Linking AI performance to KPIs, such as conversion rates or customer satisfaction.
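
To show what the first approach can look like in practice, here’s a minimal “LLM as a judge” sketch in Python. It assumes the OpenAI Python SDK and an OPENAI_API_KEY in your environment; the judge model, rubric, and example question are illustrative only, not recommendations.

    # Sketch of "LLM as a judge": a second model grades the first model's answer.
    # Assumes the OpenAI Python SDK (pip install openai) and OPENAI_API_KEY set.
    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading an AI assistant's answer.
    Question: {question}
    Answer: {answer}
    Rate the answer from 1 (poor) to 5 (excellent) for accuracy and helpfulness.
    Reply with only the number."""

    def judge(question: str, answer: str) -> int:
        """Ask a judge model to score an answer on a 1-5 rubric."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative judge model; use your team's standard
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        )
        # Production code should also handle non-numeric replies; kept simple here.
        return int(response.choices[0].message.content.strip())

    score = judge(
        "What is our refund window?",
        "Refunds are available within 30 days of purchase.",
    )
    print(f"Judge score: {score}/5")

Scores like this can be logged over time or averaged across a golden dataset to track quality between releases.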


Why Do Evals Matter?

Evaluations give you control over how your AI impacts the business:


  • Stability: Regular checks ensure your LLM doesn’t degrade over time or when updated.

  • Risk Management: Catching errors early prevents potential damage to your brand or customer experience.

  • Consistency: Ensures the model delivers the same level of quality across interactions.

  • Alignment: Keeps the AI focused on your strategic priorities.


Where Can They Go Wrong?

Evals aren’t foolproof, and here’s where you might stumble:


  • Cost: Repeated calls to the model to run evals can become expensive.

  • Privacy & Security: Data shared in evaluations needs to be handled with care.

  • Rushed Releases: If you push too fast, you risk introducing errors that could have been caught with more thorough evaluations.


Your Role as a Leader

  • Align Metrics to Business Goals: Make sure the metrics you're evaluating match what matters most to your business and customers.

  • Start Early: Don’t wait until problems arise—integrate evaluations into your development cycle from the start.

  • Build a Toolkit: Your evaluation toolkit should be flexible and customizable, growing with your AI applications.

  • Encourage Learning: Invest in ongoing learning and development, as evaluation methods continue to evolve.

  • Integrate Across Teams: Have your AI team collaborate with product, design, engineering, and customer support so feedback is incorporated into the eval process.


Evaluations aren’t just technical checkboxes; they are an important part of AI success. By investing in a robust evaluation strategy, you’ll safeguard your AI from risk while ensuring it drives real business outcomes.


Image: a watercolor and acrylic marker experiment


Resources 





LLM Evaluation: Everything You Need To Run, Benchmark LLM Evals (Aparna Dhinakaran and Ilya Reznik on Arize)


All about LLM Evals (Christmas Carol on Medium)

