Mastering LLM Evals: Best Practices, Techniques, and Tools for Reliable AI Applications
Applications leveraging LLMs are fundamentally software, and like any quality software, they should be rigorously tested. The kind of testing you employ will depend on the job you are doing with LLMs: working closely with the models themselves requires different testing than building on top of their output as an AI engineer. Either way, it's a good idea to become familiar with the types of evaluation these systems require.
The way you test language models is through evals (short for evaluations), which are like unit tests for your model. Whether you developed the model yourself, work on prompt engineering, or build on top of a model's output, evals are a critical part of the process.
Categories of LLM Evaluation
Evaluation methods for LLMs can be categorized into three broad types, balancing automation with manual insights:
- Deterministic Tests: These pass/fail tests check for factual accuracy, such as verifying that "Paris" is returned as the capital of France. They are ideal for objective tasks like retrieval and logic validation but lack flexibility for nuanced contexts.
- LLM-as-Judge: A second LLM evaluates the output of the primary model. This method is widely used in production for scoring based on criteria such as correctness, fluency, or sentiment. It scales well, but it carries risks such as biases introduced by the evaluating LLM and potential feedback loops. (Both this approach and deterministic checks are sketched in code after this list.)
- Human Feedback (or "Vibes"): Manual evaluation provides context-sensitive judgments on system performance. For example, users might assess a chatbot's ability to maintain engaging conversations. While subjective, this approach captures nuances that automated tools miss.
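To make the first two approaches concrete, here is a minimal Python sketch. The `call_model` helper is hypothetical, standing in for whatever provider client your application actually uses, and the judge prompt and 1–5 scale are illustrative choices rather than a standard.

```python
# Minimal sketch: a deterministic eval and an LLM-as-judge eval.
# `call_model` is a hypothetical stand-in for your actual model client
# (OpenAI, Anthropic, a local model, etc.).

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's API."""
    raise NotImplementedError("Wire this up to your own model client.")

# --- Deterministic test: exact, objective pass/fail ---------------------
def test_capital_of_france() -> bool:
    answer = call_model("What is the capital of France? Answer in one word.")
    return "paris" in answer.strip().lower()  # tolerate punctuation/casing

# --- LLM-as-judge: a second model scores the first model's output -------
JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (poor) to 5 (excellent) for correctness."""

def judge_answer(question: str, answer: str) -> int:
    raw = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(raw.strip())  # in practice, parse and validate defensively

# Example usage (once call_model is implemented):
# question = "Summarize the French Revolution in one sentence."
# print(test_capital_of_france(), judge_answer(question, call_model(question)))
```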
Historical Context: From NLP to LLMs
Before LLMs, natural language processing (NLP) relied on metrics like BLEU (2002), which measured n-gram overlap but struggled with semantic equivalence. Techniques evolved with metrics like BERTScore (2019), which used contextual embeddings to better assess semantic similarity and marked a significant leap in aligning machine scoring with human judgments of meaning.
Modern LLM Evaluation Techniques
Recent advancements have emphasized diverse metrics and testing paradigms. The first wave of LLM applications was mostly focused on question answering and summarization.
- Task-Specific Metrics: For tasks like summarization or code generation, metrics assess task completion, relevance, and quality. For example, code outputs might be evaluated for successful compilation or adherence to functional requirements (a compile-and-test sketch follows below).
- Hallucination and Bias Detection: Benchmarks like TruthfulQA and testing frameworks like Giskard help identify inaccuracies and biases in model outputs, ensuring reliability and fairness in real-world deployments.
- AI-Evaluating-AI: LLMs often evaluate one another to handle massive datasets efficiently. Although this approach reduces human overhead, combining it with manual checks mitigates risks like feedback loops or a lack of contextual understanding.
These metrics can produce numeric scores, categorical labels (e.g., true/false), or even multi-output judgments (e.g., concise/verbose). The OpenAI Cookbook recommends using newer models to evaluate older ones (though that is also the most expensive approach).
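As a concrete instance of a task-specific, categorical metric for code generation, here is a minimal Python sketch that checks whether a candidate program parses and whether it satisfies a small functional test. The `slugify` task and the candidate string are invented for illustration; real suites should execute untrusted model output in a sandbox.

```python
# Minimal sketch of task-specific metrics for code generation:
# (1) does the generated Python parse, and (2) does it pass a functional check?
import ast

def compiles(source: str) -> bool:
    """Categorical (pass/fail) metric: is the output syntactically valid Python?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def passes_functional_check(source: str) -> bool:
    """Functional metric: run the candidate and test the required behavior.
    Only execute untrusted model output inside a sandbox."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumes the task asked for a `slugify` function
        return namespace["slugify"]("Hello World") == "hello-world"
    except Exception:
        return False

# Illustrative model output for the invented task "write slugify(s)":
candidate = 'def slugify(s):\n    return s.lower().replace(" ", "-")\n'
print(compiles(candidate), passes_functional_check(candidate))  # True True
```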
Depending on the use case of the model, the accuracy and precision of your evals can be critical. For a binary classification task there are four possible outcomes: true positive, false positive, false negative, and true negative. Consider an AI system that labels text as toxic or non-toxic: preventing toxic outputs is a noble goal, but a system that produces too many false positives will quickly frustrate users and can drive undesirable churn.
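Here is a minimal sketch of tallying those four outcomes for the toxicity example and deriving precision and recall from the counts; the (predicted, actual) pairs are purely illustrative data, not real results.

```python
# Minimal sketch: counting the four outcomes for a toxic/non-toxic classifier
# over a small hand-labeled eval set, then computing precision and recall.
from collections import Counter

# (predicted_label, true_label) pairs -- illustrative data only
results = [
    ("toxic", "toxic"),          # true positive
    ("toxic", "non-toxic"),      # false positive (frustrates users)
    ("non-toxic", "toxic"),      # false negative (toxic content slips through)
    ("non-toxic", "non-toxic"),  # true negative
]

counts = Counter()
for predicted, actual in results:
    if predicted == "toxic" and actual == "toxic":
        counts["tp"] += 1
    elif predicted == "toxic":
        counts["fp"] += 1
    elif actual == "toxic":
        counts["fn"] += 1
    else:
        counts["tn"] += 1

precision = counts["tp"] / (counts["tp"] + counts["fp"])  # how often a "toxic" flag is correct
recall = counts["tp"] / (counts["tp"] + counts["fn"])     # how much real toxicity gets caught
print(dict(counts), f"precision={precision:.2f}", f"recall={recall:.2f}")
```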
Available Frameworks For Testing
Numerous tools and frameworks simplify evaluation. While most are fairly Python/ML specific, I have included what looks like a promising new TypeScript library at the end of this list:
- LangSmith for tracing and evaluating LLM applications,
- Weights & Biases for integrated experiment tracking,
- NVIDIA NeMo for automated benchmarking, and more.
- EvaLite, an early preview of a TypeScript-first testing framework (built by one of the premier voices in the TypeScript community).
Best Practices for LLM Evaluation
- Combine Methods: Use automated testing for scalability and human evaluation for depth.
- Iterate Continuously: Regularly assess model updates before pushing new changes to users (a simple regression-gate sketch follows this list).
- Focus on Context: Tailor tests to the application’s domain, whether for chatbots, summarization, or retrieval-augmented generation (RAG).
- Guard Against Risks: Mitigate potential failures like hallucination, toxicity, or bias.
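To illustrate the "Iterate Continuously" point above, here is a minimal sketch of a regression gate you might run in CI before shipping a model or prompt change. The `run_eval_suite` function is hypothetical (it would run your eval cases and return a 0–1 score per case), and the 0.02 tolerance is an arbitrary example.

```python
# Minimal sketch of an eval regression gate for continuous iteration:
# block a release if the candidate's aggregate eval score drops too far
# below the current baseline. `run_eval_suite` is a hypothetical helper.
from statistics import mean

def run_eval_suite(model_name: str) -> list[float]:
    """Hypothetical: run every eval case against `model_name`, return 0-1 scores."""
    raise NotImplementedError("Call your own eval harness here.")

def should_ship(candidate: str, baseline: str, tolerance: float = 0.02) -> bool:
    candidate_score = mean(run_eval_suite(candidate))
    baseline_score = mean(run_eval_suite(baseline))
    print(f"baseline={baseline_score:.3f} candidate={candidate_score:.3f}")
    return candidate_score >= baseline_score - tolerance

# Example (in CI): fail the pipeline when the gate does not pass.
# if not should_ship("my-model-v2", "my-model-v1"):
#     raise SystemExit("Eval regression detected: not shipping this update.")
```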
Further Reading: