Evaluations
Evaluate the quality of your models/tasks using state of the art evals
Kiln includes a complete platform for ensuring your tasks/models are of the highest possible quality. With it, you can:
Access multiple SOTA evaluation methods (G-Eval, LLM as Judge)
Compare and benchmark your eval methods against human evals to find the best possible evaluator for your use case
Test a variety of different methods of running your task (prompts, models, fine-tunes) to find which perform best
Easily manage datasets for eval sets, golden sets, human ratings through our intuitive UI
Generate evaluators automatically. Using your task definition we'll create an evaluator for your task's overall score and task requirements
Utilize built-in eval templates for toxicity, bias, jailbreaking, and other common eval scenarios
Integrate evals with the rest of Kiln: use synthetic data generation to build eval sets, or use evals to evaluate fine-tunes
This is a quick summary of all of the concepts in creating evals with Kiln:
Eval (aka Evaluator): defines an evaluation goal (like "overall score" or "toxicity"), and includes dataset definitions to use for running this eval. You can add many evals to a task, each for different goals.
Score: an output score for an eval like "overall score", "toxicity" or "helpfulness". An eval can have 1 or more output scores. These have a score type: 1-5 star, pass/fail, or pass/fail/critical.
Evaluation Methods: a method of running an Eval. An eval method includes an eval algorithm, eval instructions, eval model, and model provider. An eval can have many eval-methods, and Kiln will help you compare them to find which eval-method best correlates to human preferences.
Task Run Methods: a method of running your task. A task run method includes a prompt, model and model provider. A task can have many run methods. Once you have an Eval, you can use it to find an optimal run-method for your task: the run method which scores the highest, using your eval.
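If it helps to see how these concepts relate, here's a rough sketch of them as plain Python dataclasses. This is an illustrative mental model only, not the Kiln library's actual schema; all class and field names below are made up for the example.

```python
from dataclasses import dataclass

# Illustrative only -- not Kiln's real classes, just a mental model
# of how the concepts relate to each other.

@dataclass
class Score:
    name: str          # e.g. "overall_score", "toxicity", "helpfulness"
    score_type: str    # "1-5 star", "pass/fail", or "pass/fail/critical"

@dataclass
class EvalMethod:      # called EvalConfig in the Kiln library
    algorithm: str     # "g_eval" or "llm_as_judge"
    model: str
    provider: str
    instructions: list[str]

@dataclass
class TaskRunMethod:   # called TaskRunConfig in the Kiln library
    prompt: str
    model: str
    provider: str

@dataclass
class Eval:
    name: str                       # e.g. "Overall Score", "Toxicity"
    scores: list[Score]             # one or more output scores
    eval_methods: list[EvalMethod]  # compared against human ratings
    eval_set_tag: str = "eval_set"  # dataset for comparing run methods
    golden_tag: str = "golden"      # dataset for comparing eval methods
```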
Working with Evals in Kiln is easy. We'll walk through the flow of creating your first evaluator end to end:
From the "Eval" tab in Kiln's UI, you can easily create a new evaluator.
Kiln has a number of built-in templates to make it easy to get started.
We recommend starting with the "Overall Score and Task Requirements" template.
Overall Score and Task Requirement Scores: Generate scores for the requirements you set up when you created this task, plus an overall score. These can be compared to human ratings from the dataset UI.
Generalized Templates: Kiln includes a number of common templates for evaluating AI systems. These include evaluator templates for measuring toxicity, bias, maliciousness, factual correctness, and jailbreak susceptibility.
Custom Goal and Scores: If the templates aren't a good fit, feel free to create your own Eval from scratch using the custom option.
Select a template, edit if desired, and save your eval.
The Eval you created defines the goal of the eval, but it doesn't include the specifics of how it's run. That's where eval-methods come in — they define the exact approach of running an eval. This includes things like the eval algorithm, the eval model, the model provider, and instructions/prompts.
Kiln supports two powerful eval algorithms:
LLM as Judge
Just like the name says, this approach uses LLMs to judge the output of your task. It combines a "thinking" stage (chain of thought/reasoning), followed by asking the model to produce a score rubric matching the goals you laid out in the eval.
G-Eval
G-Eval is an enhanced form of LLM as Judge. It looks at token output probabilities (logprobs) to create a weighted score. For example, if the model had a 51% chance of passing an eval and a 49% chance of failing it, G-Eval will give the more nuanced score of 0.51, where LLM-as-Judge would simply pass it (1.0). The G-Eval paper (Liu et al.) compares G-Eval to a range of alternatives (BLEU, ROUGE, embedding distance scores), and shows it can outperform them across a variety of eval tasks.
Since G-Eval requires logprobs (token probabilities), only a limited set of models work with G-Eval. Currently it only works with GPT-4o, GPT-4o Mini, Llama 3.1 70B on OpenRouter, and DeepSeek R1 on OpenRouter.
Unfortunately Ollama doesn't support logprobs yet.
Select LLM as Judge if you want to use Ollama or models other than the ones listed above.
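To make the weighted scoring concrete, here's a minimal sketch of computing a G-Eval style score from token probabilities. The `token_probs` input is assumed to come from your provider's logprobs output; this illustrates the idea and is not Kiln's implementation.

```python
def g_eval_score(token_probs: dict[str, float]) -> float:
    """Weight each candidate score token by the probability the model
    assigned to it, instead of taking only the single most likely token."""
    token_values = {"fail": 0.0, "pass": 1.0}  # a simple pass/fail rubric
    recognized = {t: p for t, p in token_probs.items() if t in token_values}
    if not recognized:
        raise ValueError("no recognized score tokens in the logprobs")
    total = sum(recognized.values())
    return sum(token_values[t] * p for t, p in recognized.items()) / total

# 51% "pass" vs 49% "fail" -> 0.51, where plain LLM-as-Judge would score 1.0
print(g_eval_score({"pass": 0.51, "fail": 0.49}))  # 0.51
```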
The evaluator model can almost always perform better if you give it a high level summary of the task. Keep this short, usually just one sentence. We'll add more detailed instructions for the evaluator in the next section.
Both Kiln eval algorithms give the model time to "think" using chain-of-thought/reasoning before generating the output scores. Your eval method defines an ordered list of evaluation instructions/steps, giving the model steps for "thinking through" the eval prior to answering. If you selected a template when creating the eval, Kiln will automatically fill in template steps for you. You can edit the templates as much as you wish, adding, removing and editing steps.
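For intuition, here's a rough sketch of how ordered thinking steps and a score rubric could be assembled into a judge prompt. The wording and structure are illustrative, not Kiln's actual prompt template.

```python
def build_judge_prompt(task_summary: str, steps: list[str],
                       score_names: list[str],
                       task_input: str, task_output: str) -> str:
    """Assemble an LLM-as-Judge style prompt: think through the ordered
    steps, then produce a score for each rubric item. Illustrative only."""
    numbered_steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    rubric = "\n".join(f"- {name}" for name in score_names)
    return (
        f"You are evaluating outputs of this task: {task_summary}\n\n"
        f"Think through these steps before scoring:\n{numbered_steps}\n\n"
        f"Task input:\n{task_input}\n\n"
        f"Output to evaluate:\n{task_output}\n\n"
        f"After thinking, return a score for each of:\n{rubric}"
    )
```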
Finally, select the model you want the eval method to use (including which AI provider it should be run on).
It's possible to create evals in code as well. Just be aware eval methods are called EvalConfigs in our library.
An eval in Kiln includes two datasets:
Eval dataset: specifies which part of your dataset is used when evaluating different methods of running your task.
Eval Method dataset: specifies which part of your dataset is used when trying to find the best evaluation method for this task.
This section will walk you through populating both of your eval datasets.
When first creating your eval, you will specify a "tag" which defines each eval dataset as a subset of all the items in Kiln's Dataset tab. To add/remove items from your datasets, simply add/remove the corresponding tag. These tags can be added or removed anytime from the "Dataset" tab.
Don't worry if your dataset is empty when creating your eval; you can add data after it's created.
By default, Kiln will suggest appropriate tags, and we recommend keeping the defaults. For example, the overall-score template will use the tags "eval_set" and "golden", while the toxicity template will use the tags "toxicity_eval_set" and "toxicity_golden".
"Golden" is a term often used in data science to describe a "gold standard" dataset, used to compare different methods/approaches.
If you're creating multiple evals for a task, it's usually beneficial to maintain separate datasets for each eval. For example, a "toxicity" eval dataset will likely be filled with negative content you wouldn't want in your overall-score eval. Kiln will suggest goal-specific tags by default.
Most commonly, you'll want to populate the datasets using synthetic data. Follow our synthetic data generation guide to generate data for this eval across a range of topics. We suggest at least 100 data samples per eval.
We suggest using the "topic tree" option in our synthetic data gen tool to ensure your eval dataset is diverse.
For the "overall score" eval template, the default data generation UX should work well without any custom guidance. However, for evals like bias, toxicity, and jailbreaking, you'll want to generate data with specific guidance that ensures the dataset includes the necessary content (toxic data, biased data, malicious data, etc). The following templates can be added to the "Human Guidance" option in the synthetic data gen UI to help generate (in)appropriate content.
Golden eval datasets work best if they have a range of ratings (some pass, some fail, some of each star-score).
If your dataset doesn't have enough variation, you may see "N/A" scores when comparing evaluators.
If, after rating, your golden set doesn't have a range of content (for example, one score always passes or always fails), generate some additional content for the missing cases. You can use human guidance to do this; see "Guidance Templates" below for examples.
Once you've generated your data, open the "Dataset" tab in Kiln.
Filter your dataset to only the content you just generated (they will all be tagged with an automatic tag such as synthetic_session_12345).
Use the "Select" UI to select a portion of your dataset for your eval dataset. 80% is a good starting point. Add the tag for your eval dataset, which is "eval_set" if you kept the default tag name. Note: if you generated data using synthetic "topics", make sure to include a mix of each topic in each sub-dataset.
Select only the remaining items, and add the tag for your eval method dataset, which is "golden" if you kept the default tag name (or something like "toxicity_golden" if you used a different template than the default).
Filter the dataset to both tags (eval_set and golden) to double-check you didn't accidentally add any items to both datasets.
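If you'd rather script the split than use the Select UI, here's an illustrative sketch of the 80/20 tagging logic with a topic mix. The item structure and helper are assumptions made for the example; in Kiln itself you apply these tags through the Dataset UI.

```python
import random
from collections import defaultdict

def split_eval_tags(items: list[dict], eval_ratio: float = 0.8, seed: int = 42):
    """Assign ~80% of items the eval dataset tag and the rest the golden
    (eval method) tag, keeping a mix of every topic in each split.
    Purely illustrative -- in the Kiln UI this is done with the Select tool."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for item in items:
        by_topic[item.get("topic", "")].append(item)

    tagged = []
    for topic_items in by_topic.values():
        rng.shuffle(topic_items)
        cutoff = int(len(topic_items) * eval_ratio)
        for i, item in enumerate(topic_items):
            tag = "eval_set" if i < cutoff else "golden"
            tagged.append((item, tag))
    return tagged
```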
Validation Set
For rigorous AI evaluation, you'll want to add a third set as well: a validation set. This set is reserved until the end, so your final assessment isn't contaminated by seeing early results from the test set (eval_set).
You can create this set now, or generate it later.
Next we'll add human ratings, so we can measure how well our evaluator performs compared to a human. If you have a subject matter expert for your task, get them to perform this step. See our collaboration guide for how to work together on a Kiln project.
Assuming you're working on the "overall score" template: filter your Dataset view to the "golden" tag, click the first item, add ratings, and repeat until all items in your golden dataset are rated. You can use the left/right keyboard keys to quickly move between items. Only the golden dataset needs ratings, not the eval_set.
If you're working on another template (toxicity, bias, etc) or a custom eval, you need to add each eval-score to your task, or else the rating UI for it won't appear. Open Settings > Edit Task, then add a task requirement with a matching name and type for each of the output scores in your eval. Then proceed with the instructions above (substituting the correct tags).
You added an "eval method" to your eval above. However, we don't actually know how well this eval method works. Kiln includes tools to compare multiple eval methods, and find which one is the closest to a real human evaluator.
It may seem strange, but yes… one of the first steps of building an eval is to evaluate evaluation methods. It sounds complicated, but Kiln makes it easy.
Open your eval from the "Evals" tab, then click the "Compare Evaluation Methods" button. From the "Compare Evaluation Methods" screen, click the "Run Eval" button.
This will run your evaluator on the golden dataset, once with each eval method.
Once complete, you'll have a set of metrics about how well the eval method's scoring matched human scores.
One score in isolation isn't helpful. You'll want to add additional eval methods to see which one performs best. Kiln makes it easy to compare eval-methods. We suggest trying a range of options:
Try both eval algorithms: G-Eval and LLM as Judge
Try a range of different models: you may be surprised which model works best as an evaluator for your task. Be sure to try SOTA models, like the latest models from OpenAI and Anthropic. Even if you prefer open models, it can be good to know how far you are from these benchmarks.
Try custom eval instructions, not just the template contents.
Once you've added multiple eval methods, you can compare scores to find the best evaluator for your task. On this screen you're looking for the eval method whose scores most closely match the human ratings: the highest score if you're using the default Kendall Tau correlation, or the lowest score if you're comparing with a deviation metric (lower means less deviation from human scores).
There's no benchmark good/bad score for an evaluator; it all depends on your task.
For an easy and highly deterministic task, you might be able to find many eval-methods which achieve near perfect scores, even with small eval models and default prompts.
For a highly subjective task, it's likely no evaluator will perfectly match the human scores, even with SOTA models and custom prompts. It's often the case that two humans can't match each other on subjective tasks. Try a range of eval methods, and pick the one with the best score (which is the highest score if using the default Kendall Tau comparison).
The more subjective the task, the more beneficial a larger and more diverse golden dataset becomes.
If you see "N/A" scores in your correlation table, it means more data is needed. This can be one of two cases:
Simply not enough data: if your eval method dataset is very small (<10 items), it can be impossible to produce confident correlation scores. Add more data to resolve this case.
Not enough variation of human ratings in the eval method dataset: if you have a larger dataset, but still get N/A, it's likely there isn't enough variation in your dataset for the given score. For example, if all of the golden samples of a score pass, the evaluator won't produce a confident correlation score, as it has no failing examples and everything is a tie. Add more content to your eval methods dataset, designing the content to fill out the missing score ranges. You can use synthetic data gen human guidance to generate examples that fail.
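To see why a lack of variation produces "N/A", here's a quick check with Kendall Tau from scipy. It mirrors the default comparison mentioned above, though the exact metrics Kiln reports may differ.

```python
from scipy.stats import kendalltau

# Every golden item was rated "pass" by the human: no variation at all.
human = [1, 1, 1, 1, 1]
evaluator = [1, 1, 0, 1, 1]
tau, _ = kendalltau(human, evaluator)
print(tau)  # nan -- correlation is undefined when one side is all ties

# A spread of human star ratings gives a meaningful correlation.
human = [5, 4, 2, 5, 1, 3]
evaluator = [4, 4, 2, 5, 1, 2]
tau, _ = kendalltau(human, evaluator)
print(round(tau, 2))  # closer to 1.0 means stronger agreement with humans
```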
Once you have a winner, click the "Set as default" button to make this eval-method the default for your eval.
Now that we have an evaluator we trust, we can use it to rapidly evaluate a variety of methods of running our task. We call this a "Run Method" and it includes the model (including fine-tunes), the model provider, and the prompt.
Return to the "Evaluator" screen for your eval, and add a variety of run methods you want to compare. We suggest:
A range of models (SOTA, smaller, open, etc)
A range of prompts: both Kiln's auto-generated prompts, and custom prompts
Some model fine-tunes of various sizes, created by Kiln fine tuning
Once you've defined a set of run methods, click "Run Eval" to kick off the eval. Behind the scenes, this is performing the following steps:
Fetching the input data from your eval dataset (eval_set tag)
Generating new output for each input, using each run method you defined
Running your evaluator on each result, collecting scores
Once done, you'll have results for how each run method performed on the eval.
These results are easy to interpret compared to the eval method comparisons. Each score is simply the average score from that run method. Assuming we want to find the run method that produces the best content, simply find the highest average score.
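Conceptually, this comparison boils down to averaging each run method's eval scores and picking the highest. The numbers below are made up purely for illustration:

```python
from statistics import mean

# Hypothetical eval scores collected for each run method
results = {
    "SOTA model + few-shot prompt":   [4.5, 5.0, 4.0, 4.5],
    "small fine-tune + short prompt": [4.0, 4.5, 4.5, 5.0],
    "small model + basic prompt":     [3.0, 3.5, 4.0, 3.0],
}

averages = {name: mean(scores) for name, scores in results.items()}
best = max(averages, key=averages.get)
print(f"Best run method: {best} ({averages[best]:.2f} average score)")
```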
Congrats! You've used systematic evals to find an optimal method for running your task!
However, there's always room to improve.
You can repeat the processes above to try new eval-methods or run-methods. Through more searching, you may be able to find a better method and improve overall performance.
You can iterate by trying new prompts, more models, building custom fine-tuned models, or trying new state of the art models as they are released.
Your understanding of your model/product usually gets better over time. Consider adding data to your dataset over time (both eval_set and golden). This can come from real users, bug reports, or new synthetic data that comes from a better understanding of the problem. As you add data, re-run both sub-evals (eval-method and run-method) to find the best eval-method and run-method for your task.
You can always add additional evals to your Kiln project/task. Try some of our built in templates like bias, toxicity, factual correctness, or jailbreak susceptibility — or create your own from scratch!
For developers, it's also possible to use evals from our python library.
Be aware, in our library task run methods are called TaskRunConfigs and eval methods are called EvalConfigs.
See the EvalRunner, Eval, EvalConfig, EvalRun, and TaskRunConfig classes for details.