Reviewing and Rating
Ratings help multi-shot prompting, fine-tuning, evals, and more
Kiln includes a rating interface for rating dataset entries. This can be used to score the quality of the generated data, or to evaluate the quality of a model.
There are two methods of defining rating options:
Adding requirements to your task definition in Settings > Edit Task adds a rating option to every sample in your dataset.
After creating an eval, each output score is available as a rating option for every sample in its evaluation method dataset (golden dataset).
Each rating option has a number of parameters:
Name: the name of the requirement, which appears in the rating UI. It is limited in length to fit in the UI, but you can add more detail in the Instructions field below.
Instructions: more details about the requirement. These are available to reviewers in the UI (under the icon).
Priority: how important this criterion is to the task.
Rating Type: one of 5-star, pass/fail, or pass/fail/critical.
5-star: a 1-5 star rating.
Pass/Fail: a binary pass/fail rating.
Pass/Fail/Critical: a ternary pass/fail/critical rating. The "critical" level is useful when some failures are exceptionally important to avoid. For example, a customer service bot could have a "tone" criterion, where casual or slang language would be a failure, but profanity or insulting the user would be critical.
Custom: you can define a custom rating scale when using the Python library. However, you won't be able to use custom ratings in the Kiln UI.
Kiln uses ratings in a variety of ways:
In evals, ratings of your eval method dataset (golden dataset) are used to benchmark and compare methods of evaluating your task. This helps you find the best-performing evaluation method.
Kiln may incorporate highly rated samples into a prompt. For example, multi-shot or few-shot prompts automatically incorporate highly rated samples. These prompts filter to examples rated 4+ stars, and prefer 5-star ratings if available.
When creating a fine-tune, you may optionally filter the training data to highly rated content.
When using the Python library, you can access ratings.
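To make the 4+ star filter for multi-shot prompts concrete, here is a minimal sketch of that kind of selection. It is not Kiln's implementation and does not use the Kiln library: the Sample class, the select_examples function, and the max_examples parameter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical stand-in for a rated dataset entry (not Kiln's data model)."""
    text: str
    stars: Optional[int] = None  # 1-5 star rating, or None if unrated


def select_examples(samples: list[Sample], max_examples: int = 4) -> list[Sample]:
    """Keep samples rated 4+ stars, listing 5-star samples first."""
    eligible = [s for s in samples if s.stars is not None and s.stars >= 4]
    # sorted() is stable, so samples with equal ratings keep their original order.
    ranked = sorted(eligible, key=lambda s: s.stars, reverse=True)
    return ranked[:max_examples]


samples = [
    Sample("draft answer", stars=3),
    Sample("good answer", stars=4),
    Sample("unrated answer"),
    Sample("great answer", stars=5),
]
chosen = select_examples(samples, max_examples=2)
print([s.text for s in chosen])  # ['great answer', 'good answer']
```

Unrated and low-rated samples are dropped before ranking, so a dataset with no 4+ star ratings simply yields an empty list of examples.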