Reviewing and Rating
Ratings help multi-shot prompting, fine-tuning, evals, and more
Kiln includes an interface for rating dataset entries. You can use it to score the quality of generated data or to evaluate the quality of a model.
Your rating requirements are defined as part of your task definition, in the requirements section. You can set them up when you initially create your task, or add/edit them in Settings > Edit Task.
Name: the name of the requirement, which will appear in the rating UI. Limited in length to fit in the UI, but you can add more content in the instructions field below.
Rating Type: one of 5-star, pass/fail, or pass/fail/critical.
Priority: how important this requirement is to the task.
An "Overall" rating is always available, even if your task has zero requirements.
Kiln uses ratings in a variety of ways:
In our automatic multi-shot prompting, only highly rated examples are used. The prompt generator filters to examples rated 4 stars or higher, and prefers 5-star examples if available.
When creating a fine-tuning dataset, you may optionally filter the training data to highly rated content.
When using the Python library, you can access ratings programmatically.
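The selection rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kiln's implementation: `RatedExample`, `select_few_shot_examples`, and `filter_for_fine_tuning` are hypothetical names, and the example data is made up. It keeps entries rated 4 stars or higher, places 5-star entries first, and skips unrated entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RatedExample:
    """Hypothetical stand-in for a dataset entry with an overall star rating."""
    input: str
    output: str
    stars: Optional[int] = None  # None means the entry has not been rated


def select_few_shot_examples(
    examples: List[RatedExample], max_examples: int = 4, min_stars: int = 4
) -> List[RatedExample]:
    """Keep examples rated min_stars or higher, highest-rated first."""
    rated = [e for e in examples if e.stars is not None and e.stars >= min_stars]
    # The sort is stable, so equally rated examples keep their dataset order.
    rated.sort(key=lambda e: e.stars, reverse=True)
    return rated[:max_examples]


def filter_for_fine_tuning(
    examples: List[RatedExample], min_stars: int = 4
) -> List[RatedExample]:
    """Optional filter: keep only highly rated entries for a training set."""
    return [e for e in examples if e.stars is not None and e.stars >= min_stars]


# Illustrative data only.
examples = [
    RatedExample("q1", "a1", stars=3),
    RatedExample("q2", "a2", stars=4),
    RatedExample("q3", "a3", stars=5),
    RatedExample("q4", "a4"),  # unrated
    RatedExample("q5", "a5", stars=5),
]

few_shot = select_few_shot_examples(examples, max_examples=2)
training = filter_for_fine_tuning(examples)
print([e.input for e in few_shot])
print([e.input for e in training])
```

With a limit of two examples, both 5-star entries are chosen ahead of the 4-star one. The fine-tuning filter keeps all three entries rated 4 stars or higher and preserves their original order.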
5-star: a 1-5 star rating.
Pass/Fail: A binary pass/fail rating.
Pass/Fail/Critical: A ternary pass/fail/critical rating. The "critical" level is useful when some failures are exceptionally important to avoid. For example, a customer service bot could have a "tone" requirement, where casual or slang language would be a failure, but profanity or insulting the user would be critical.
Custom: you can define a custom rating scale when using the Python library. However, you won't be able to use custom ratings in the Kiln UI.
Instructions: more details about the requirement. These will be available to reviewers in the UI (under the icon).
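To make the requirement fields and rating types concrete, here is a small self-contained Python sketch. It does not use the Kiln library, and every name in it (`RatingType`, `Requirement`, `is_valid_rating`) is hypothetical, as are the string values and the free-form priority label. It models the "tone" example described above and checks whether a value fits a given rating type, assuming whole-star ratings from 1 to 5.

```python
from dataclasses import dataclass
from enum import Enum


class RatingType(str, Enum):
    """Hypothetical names for the rating types described above."""
    FIVE_STAR = "five_star"
    PASS_FAIL = "pass_fail"
    PASS_FAIL_CRITICAL = "pass_fail_critical"
    CUSTOM = "custom"


@dataclass
class Requirement:
    """Hypothetical model of one rating requirement on a task."""
    name: str
    instructions: str
    rating_type: RatingType
    priority: str  # assumption: free-form label; the real scale is not stated here


def is_valid_rating(rating_type: RatingType, value) -> bool:
    """Return True if value fits the scale implied by rating_type."""
    if rating_type is RatingType.FIVE_STAR:
        # Assumption: whole stars only. bool is excluded because it subclasses int.
        return isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 5
    if rating_type is RatingType.PASS_FAIL:
        return value in ("pass", "fail")
    if rating_type is RatingType.PASS_FAIL_CRITICAL:
        return value in ("pass", "fail", "critical")
    # Custom scales are user-defined, so this sketch accepts any value.
    return True


tone = Requirement(
    name="Tone",
    instructions="Casual or slang language is a fail; profanity or insults are critical.",
    rating_type=RatingType.PASS_FAIL_CRITICAL,
    priority="high",
)

print(is_valid_rating(tone.rating_type, "critical"))
print(is_valid_rating(RatingType.PASS_FAIL, "critical"))
```

The ternary type accepts "critical" while the binary type rejects it, which is the distinction the pass/fail/critical scale exists to capture.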