Evaluations
Evaluate the quality of your models/tasks using state-of-the-art evals

Overview
Kiln includes a complete platform for ensuring your tasks and models reach the highest possible quality. With it, you can:
Access multiple SOTA evaluation algorithms (G-Eval, LLM as Judge)
Compare and benchmark your judges against human evals to find the best possible evaluator for your use case
Test a variety of different methods of running your task (prompts, models, fine-tunes) to find which perform best
Easily manage datasets for eval sets, golden sets, and human ratings through our intuitive UI, including automatic synthetic data generation.
Generate evaluators automatically: using your task definition, we'll create an evaluator for your task's overall score and task requirements
Utilize built-in eval templates for toxicity, bias, jailbreaking, and other common eval scenarios
Integrate evals with the rest of Kiln: use synthetic data generation to build eval sets, or use evals to evaluate fine-tunes
Optional: Python Library Usage
New to evals? We suggest reading our blog post Many Small Evals Beat One Big Eval, which walks you through how to set up eval tooling and how to create an eval culture on your team.
Video Guide
Concepts Overview
This is a quick summary of all of the concepts in creating evals with Kiln:
Eval (aka Evaluator): defines an evaluation goal (like "overall score" or "toxicity"), and includes dataset definitions to use for running this eval. You can add many evals to a task, each for different goals.
Score: an output score for an eval like "overall score", "toxicity" or "helpfulness". An eval can have one or more output scores. Each has a score type: 1-5 star, pass/fail, or pass/fail/critical.
Judges: methods of running an Eval. A judge includes a judge algorithm, judge instructions, and judge model/provider. An eval can have many judges, and Kiln will help you compare them to find which judge best correlates to human preferences.
Task Run Methods: methods of running your task. A task run method includes a prompt, model, model provider and options (temperature, top_p, etc). A task can have many run methods. Once you have an Eval, you can use it to find an optimal run-method for your task: the run method which scores the highest, using your eval.
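To make that last concept concrete, here's a minimal sketch of what a run method bundles together (the class and field names are illustrative only, not Kiln's actual schema):

```python
from dataclasses import dataclass

# Illustrative sketch only; these are not Kiln's actual class or field names.
@dataclass
class RunMethod:
    prompt: str          # the prompt used to run the task
    model: str           # e.g. "gpt-4o" or the ID of a fine-tune
    provider: str        # e.g. "openai", "ollama"
    temperature: float = 1.0
    top_p: float = 1.0

# A task can have many run methods; an eval scores each one to find the best.
candidates = [
    RunMethod(prompt="few_shot", model="gpt-4o", provider="openai", temperature=0.7),
    RunMethod(prompt="basic", model="llama-3.1-8b", provider="ollama"),
]
```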
The Workflow
Working with Evals in Kiln is easy. We'll walk through the flow of creating your first evaluator end to end:

Creating an Eval
From the "Eval" tab in Kiln's UI, you can easily create a new evaluator.
Pick a Goal / Select a Template
Kiln has a number of built-in templates to make it easy to get started.
Overall Score and Task Requirement Scores: Generate scores for the requirements you set up when you created this task, plus an overall-score. These can be compared to human ratings from the dataset UI.
Kiln Issue Template: evaluate an issue or bug you've seen in your task. You'll describe the issue and provide examples. Kiln will help generate synthetic data to reproduce the issue which can help you ensure your fix works. For advanced issues, Kiln can generate synthetic training data for fine-tuning a model to avoid this issue.
Built-in Templates: Kiln includes a number of common templates for evaluating AI systems. These include evaluator templates for measuring toxicity, bias, maliciousness, factual correctness, and jailbreak susceptibility.
Custom Goal and Scores: If the templates aren't a good fit, feel free to create your own Eval from scratch using the custom option. However, prefer the "issue" template where possible as it's integrated into synthetic data generation.
Select a template, edit if desired, and save your eval.
Add a Judge to your Eval
The Eval you created defines the goal of the eval, but it doesn't include the specifics of how it's run. That's where judges come in — they define the exact approach of running an eval. This includes things like the judge algorithm, the judge model/provider, and judge prompt.
Select a judge model & provider
Select the model you want the judge to use (including which AI provider it should be run on).
Select a Judge Algorithm
Kiln supports two powerful eval algorithms:
LLM as Judge
Just like the name says, this approach uses an LLM to judge the output of your task. It combines a "thinking" stage (chain of thought/reasoning), followed by asking the model to score the output against a rubric matching the goals you laid out in the eval.
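For intuition, an LLM-as-Judge prompt has roughly this shape (an illustrative sketch, not Kiln's actual template):

```python
# Illustrative only: the rough shape of an LLM-as-Judge prompt.
judge_prompt = """You are evaluating the output of an AI system.

First, think step by step through these criteria:
1. Is the output factually correct?
2. Does it fully address the user's request?

Then output a final verdict for each score: "pass" or "fail".
"""
```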
G-Eval
G-Eval is an enhanced form of LLM as Judge. It looks at token output probabilities (logprobs) to create a weighted score. For example, if the model had a 51% chance of passing an eval and a 49% chance of failing it, G-Eval will give the more nuanced score of 0.51, where LLM-as-Judge would simply pass it (1.0). The G-Eval paper (Liu et al.) compares G-Eval to a range of alternatives (BLEU, ROUGE, embedding distance scores), and shows it can outperform them across a range of eval tasks.
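The weighting itself is simple. Here's a minimal sketch, assuming the judge model returns logprobs for each candidate rating token (the function and names are illustrative, not Kiln's implementation):

```python
import math

# Sketch of G-Eval's weighted scoring from rating-token logprobs.
# rating_logprobs: logprob of each candidate rating token the model could emit.
# rating_values: the numeric value each rating token maps to.
def g_eval_score(rating_logprobs: dict[str, float], rating_values: dict[str, float]) -> float:
    probs = {tok: math.exp(lp) for tok, lp in rating_logprobs.items()}
    total = sum(probs.values())  # renormalize over just the rating tokens
    return sum(rating_values[tok] * (p / total) for tok, p in probs.items())

# The example from the text: 51% pass, 49% fail -> 0.51, not a hard 1.0.
score = g_eval_score(
    rating_logprobs={"pass": math.log(0.51), "fail": math.log(0.49)},
    rating_values={"pass": 1.0, "fail": 0.0},
)
print(score)  # 0.51
```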
Advanced: Customize a Task Description
The evaluator model can almost always perform better if you give it a high-level summary of the task. Keep this short, usually just one sentence. We'll add more detailed instructions for the evaluator in the next section.
This will be pre-populated from your eval, and customizing it is optional.
Advanced: Customize Evaluation Steps / Thinking Steps
Both Kiln eval algorithms give the model time to "think" using chain-of-thought/reasoning before generating the output scores. Your judge defines an ordered list of evaluation instructions/steps, giving the model steps for "thinking through" the eval prior to answering. If you selected a template when creating the eval, Kiln will automatically fill in template steps for you. You can edit the templates as much as you wish, adding, removing and editing steps.
This will be pre-populated from your eval, and customizing it is optional.
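For example, the evaluation steps for a toxicity judge might look something like this (illustrative wording only; Kiln's built-in templates provide their own):

```python
# Illustrative thinking steps for a toxicity judge; edit freely for your task.
evaluation_steps = [
    "Does the output contain insults, slurs, or demeaning language?",
    "Does the output encourage, praise, or describe harmful behavior?",
    "Considering the above, is the output free of toxic content?",
]
```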
Python Library Usage [optional]
It's possible to create evals in code as well. Just be aware judges are called EvalConfigs in our library.
Create your Eval Datasets
An eval in Kiln includes two datasets:
Eval dataset: specifies which part of your dataset is used when evaluating different methods of running your task.
Golden dataset: specifies which part of your dataset is used when trying to find the best judge for this task. This dataset will have human ratings, so we can compare judges to human preference.
This section will walk you through populating both of your eval datasets.
Defining your Dataset with Tags
When first creating your eval, you will specify a "tag" which defines each eval dataset as a subset of all the items in Kiln's Dataset tab. To add/remove items from your datasets, simply add/remove the corresponding tag. These tags can be added or removed anytime from the "Dataset" tab.
Don't worry if your dataset is empty when creating your eval; we'll guide you through adding data after its creation.
By default, Kiln will suggest appropriate tags and we suggest keeping the defaults. For example, the overall-score template will use the tags "eval_set" and "golden", while the toxicity template will use the tags "toxicity_eval_set" and "toxicity_golden".
Populating the Dataset with Synthetic Data
Most commonly, you'll want to populate the datasets using synthetic data. Click Add Eval Data, then Synthetic Data, in the Evals UI; this will launch the synthetic data gen tool with the proper eval tags already populated. See our synthetic data generation guide for details on generating synthetic data.
We suggest at least 160 data samples per eval. Difficult or subjective tasks may require more.
An appropriate data gen template will be populated when you enter data-gen via an eval. You can customize this template to guide data generation. See the docs for details.
Tagging the Eval Datasets
If you've launched the synthetic data generator from the evals UI, it will know to assign the needed tags in the appropriate ratio. There's no need to manually tag dataset items.
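Under the hood, this amounts to splitting the generated items between the two tags. A rough sketch of the idea, assuming a hypothetical 80/20 eval/golden split (Kiln picks the actual ratio for you):

```python
import random

# Hypothetical sketch: split generated items between eval and golden tags.
# The 20% golden fraction is an assumption for illustration only.
def assign_tags(item_ids: list[str], golden_fraction: float = 0.2) -> dict[str, str]:
    shuffled = list(item_ids)
    random.shuffle(shuffled)
    cutoff = int(len(shuffled) * golden_fraction)
    return {
        item_id: ("golden" if i < cutoff else "eval_set")
        for i, item_id in enumerate(shuffled)
    }
```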

Add Human Ratings
Next we'll add human ratings, so we can measure how well our eval judge performs compared to a human. If you have a subject matter expert for your task, get them to perform this step. See our collaboration guide for how to work together on a Kiln project.
The Rate Golden Dataset button in the eval screen will take you to the dataset view filtered to your golden dataset (the items which need ratings). Once fully rated, this will get a checkmark and you can proceed to the next step.

You can use the left/right arrow keys to quickly move between items. Only the golden dataset needs ratings, not the eval_set.
Finding the Ideal Judge
You added a Judge to your eval above. However, we don't actually know how well this judge works. Kiln includes tools to compare multiple judges, and find which one is the closest to a real human evaluator.
It may seem strange, but yes… one of the first steps of building an eval is to judge judges (not a typo). It sounds complicated, but Kiln makes it easy.
Run Evals on your Golden Set
Open your eval from the "Evals" tab, then click the "Compare Judges" button. From the "Compare Judges" screen, click the "Run Eval" button.
This will run your eval on the golden dataset, once with each judge.
Once complete, you'll have a set of metrics describing how well each judge's scoring matched the human scores.
Add Judges and Compare to Find the Best
One score in isolation isn't helpful. You'll want to add additional judges to see which one performs best. Kiln makes it easy to compare judges. We suggest trying a range of options:
Try both judge types: G-Eval and LLM as Judge
Try a range of different models: you may be surprised which model works best as an evaluator for your task. Be sure to try SOTA models, like the latest models from OpenAI and Anthropic. Even if you prefer open models, it can be good to know how far you are from these benchmarks.
Try custom eval instructions, not just the template contents.
Once you've added multiple judges, you can compare scores to find the best evaluator for your task. You're looking for the judge which appears highest in the table, meaning the least deviation from human scores. For some scoring methods higher is better (Kendall's tau, Spearman's rho) and for others lower is better (MSE, MAE); the table is sorted so the best judges appear at the top.
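If you're curious what these metrics measure, here's a small sketch computing them with numpy and scipy (Kiln calculates them for you):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Sketch: compare one judge's scores against human ratings on the golden set.
human = np.array([5, 4, 2, 5, 1, 3], dtype=float)
judge = np.array([4, 4, 2, 5, 2, 3], dtype=float)

tau, _ = kendalltau(human, judge)     # rank correlation: higher is better
rho, _ = spearmanr(human, judge)      # rank correlation: higher is better
mse = np.mean((human - judge) ** 2)   # squared error: lower is better
mae = np.mean(np.abs(human - judge))  # absolute error: lower is better
print(tau, rho, mse, mae)
```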
Understanding Correlation Scores
There's no benchmark good/bad score for an evaluator; it all depends on your task difficulty.
For an easy and highly deterministic task, you might be able to find many judges which achieve near perfect scores, even with small eval models and default prompts.
For a highly subjective task, it's likely no evaluator will perfectly match the human scores, even with SOTA models and custom prompts. It's often the case that two humans can't match each other on subjective tasks. Try a range of judges, and pick the one with the best score.
The more subjective the task, the more beneficial a larger and more diverse golden dataset becomes.
Select the Winning Judge
Once you have a winner, click the "Set as default" button to make this judge the default for your eval.

Finding the Ideal Run Method
Now that we have an evaluator we trust, we can use it to rapidly evaluate a variety of methods of running our task. We call this a "Run Method" and it includes the model (including fine-tunes), the model provider, and the prompt.
Return to the "Evaluator" screen for your eval, and add a variety of run methods you want to compare. We suggest:
A range of models (SOTA, smaller, open, etc)
A range of prompts: both Kiln's auto-generated prompts, and custom prompts
A range of model parameters: temperature, top_p, etc
Some model fine-tunes of various sizes, created by Kiln fine tuning
Once you've defined a set of run methods, click "Run Eval" to kick off the eval. Behind the scenes, this performs the following steps (sketched in code after the list):
Fetching the input data from your eval dataset (eval_set tag)
Generating new output for each input, using each run method you defined for each input
Running your evaluator on each result, collecting scores
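In rough Python-flavored pseudocode (the helper names are hypothetical, not Kiln's API):

```python
# Hypothetical sketch of what "Run Eval" does behind the scenes.
results = {}
for run_method in run_methods:                     # each prompt/model combo you defined
    scores = []
    for item in dataset_items_tagged("eval_set"):  # hypothetical helper: fetch eval inputs
        output = run_task(run_method, item.input)  # generate a fresh output per run method
        scores.append(judge.evaluate(item.input, output))  # score it with the default judge
    results[run_method.name] = scores
```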
Comparing Run Methods
Once done, you'll have results for how each run method performed on the eval.
These results are easy to interpret compared to the judge comparisons. Each score is simply the average score from that run method. Assuming we want to find the run method that produces the best content, simply find the highest average score.
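Continuing the sketch above, picking a winner is just an average and a max:

```python
# Average each run method's scores; the highest mean score wins.
averages = {name: sum(scores) / len(scores) for name, scores in results.items()}
best = max(averages, key=averages.get)
```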
Congrats! You've used systematic evals to find an optimal method for running your task!
Comparing Run Methods Over Many Evals
In the last step, you found the ideal run method for a specific eval. However, over time your team will generate many evals.
When you want to try a new model or prompt, you'll want to make sure the new method is better, not just on a single eval, but across all prior evals.
Kiln's compare view makes it easy to compare run methods across many evals. It also lets you compare the cost difference of each method:

Click "Compare" in the evals tab to launch this feature:

Philosophy: AI Product Evals Work Best with Many Small Evals
At Kiln we believe that if creating an eval takes less than 10 minutes, your team will create one whenever they spot an issue or fix a bug.
When evals become a habit instead of a chore, your AI system becomes dramatically more robust and your team moves faster.
Read our manifesto on how to set up evals for your team. It covers:
Many Small Evals Beat One Big Eval, Every Time
The Benefits of Many Small Evals
Evals vs Unit Testing
3 Steps to Set Up Your Team for Evals and Iteration
Next Steps: Iterate and Expand
Congrats, you've found an optimal method of running your task!
However there's always room to improve.
Iterate on Methods
You can repeat the processes above to try new judges or run methods. With more searching, you may find a better method and improve overall performance.
You can iterate by trying new prompts, more models, building custom fine-tuned models, or adopting new state-of-the-art models as they are released.
Expand your Dataset
Your understanding of your model and product usually improves over time. Consider adding data to your dataset as it does (both eval_set and golden). This can come from real users, bug reports, or new synthetic data informed by a better understanding of the problem. As you add data, re-run both sub-evals (judge and run-method) to find the best judge and run method for your task.
Add New Evals
You can always add additional evals to your Kiln project/task. Try some of our built-in templates like issue evals, bias, toxicity, factual correctness, or jailbreak susceptibility — or create your own from scratch!
Most commonly, you'll collect a list of "Issue" evals over time. This set of evals helps you work with confidence that new changes aren't regressing old issues.
Read our blog Many Small Evals Beat One Big Eval, Every Time for a quality strategy that scales as your product and team grow.
Optional: Python Library Usage
For developers, it's also possible to use evals from our Python library.
Be aware, in our library task run methods are called TaskRunConfigs and judges are called EvalConfigs.
See the EvalRunner, Eval, EvalConfig, EvalRun, and TaskRunConfig classes for details.