Reviewing and Rating

Ratings help multi-shot prompting, fine-tuning, evals, and more

Kiln includes an interface for rating dataset entries. You can use it to score the quality of generated data or to evaluate the quality of a model.

Defining Rating Options

There are two methods of defining rating options:

Adding requirements to your task definition in Settings > Edit Task will add a rating option to every sample in your dataset.
After creating an Eval, each output score will be available as a rating option for every sample in its evaluation method dataset (golden dataset).

Rating Option Parameters

Each rating option has a number of parameters:

Name: the name of the requirement, which will appear in the rating UI. The name is limited in length to fit in the UI, but you can add more detail in the Instructions field.
Instructions: more details about the requirement. These will be available to reviewers in the UI (under the icon).
Rating Type: one of 5-star, pass/fail, pass/fail/critical.
Priority: how important this criterion is to the task.

An "Overall" rating is always available, even if your task has zero requirements.

Rating Types:

5-star: A rating from 1 to 5 stars.
Pass/Fail: A binary pass/fail rating.
Pass/Fail/Critical: A ternary pass/fail/critical rating. The "critical" level is useful when some failures of a criterion are especially important to avoid. For example, a customer service bot could have a "tone" criterion, where casual or slang language would be a failure, but profanity or insulting the user would be critical.
Custom: You can define a custom rating scale when using the Python library. However, custom ratings can't be used in the Kiln UI.
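To make the parameters and rating types above concrete, here is a minimal sketch of how a rating option could be modeled and validated. It is illustrative only: `RatingType`, `RatingOption`, `ALLOWED_VALUES`, and `is_valid_rating` are hypothetical names, not the Kiln Python library's API, and the string labels and integer priority are assumed encodings. Custom scales are left out.

```python
from dataclasses import dataclass
from enum import Enum


class RatingType(Enum):
    """The three rating types available in the UI (hypothetical names)."""
    FIVE_STAR = "five_star"
    PASS_FAIL = "pass_fail"
    PASS_FAIL_CRITICAL = "pass_fail_critical"


# Assumed value encodings for this sketch; Kiln may store ratings differently.
ALLOWED_VALUES = {
    RatingType.FIVE_STAR: {1, 2, 3, 4, 5},
    RatingType.PASS_FAIL: {"pass", "fail"},
    RatingType.PASS_FAIL_CRITICAL: {"pass", "fail", "critical"},
}


@dataclass(frozen=True)
class RatingOption:
    name: str            # short label shown in the rating UI
    instructions: str    # longer guidance for reviewers
    rating_type: RatingType
    priority: int        # assumed encoding: lower number = more important


def is_valid_rating(option: RatingOption, value) -> bool:
    """Return True if `value` is allowed for the option's rating type."""
    return value in ALLOWED_VALUES[option.rating_type]


# Example based on the customer service "tone" criterion described above.
tone = RatingOption(
    name="Tone",
    instructions="Casual or slang language fails; profanity or insults are critical.",
    rating_type=RatingType.PASS_FAIL_CRITICAL,
    priority=1,
)
```

With this sketch, `is_valid_rating(tone, "critical")` is accepted, while a star count such as `5` is rejected because the option is not a 5-star rating.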

How Ratings are Used

Kiln uses ratings in a variety of ways:

In evals, ratings of your eval method dataset (golden dataset) are used to benchmark and compare methods of evaluating your task. This helps you find the ideal evaluation method.
Kiln's automatic prompt generators may incorporate highly rated samples into a prompt. For example, multi-shot or few-shot prompts will automatically incorporate highly rated samples. These generators filter to examples rated 4 stars or higher, and prefer 5-star examples when available.
When creating a fine-tuning dataset, you may optionally filter the training data to highly rated content.
When using the Python library, you can access ratings programmatically.
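The few-shot selection rule described above (keep examples rated 4 stars or higher, prefer 5-star ones) can be sketched as follows. This is an approximation under stated assumptions, not Kiln's actual implementation: `Sample`, `select_few_shot_examples`, and the `max_examples` limit are hypothetical, and unrated samples are assumed to be excluded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    input: str
    output: str
    stars: Optional[int] = None  # None means the sample has not been rated


def select_few_shot_examples(samples: List[Sample], max_examples: int = 4) -> List[Sample]:
    """Pick highly rated samples for a few-shot prompt.

    Keeps only samples rated 4 stars or higher, then orders them so that
    5-star samples come first. The sort is stable, so samples with the same
    rating keep their original relative order.
    """
    eligible = [s for s in samples if s.stars is not None and s.stars >= 4]
    eligible.sort(key=lambda s: s.stars, reverse=True)
    return eligible[:max_examples]


# Small illustrative dataset (made-up content).
dataset = [
    Sample("q1", "a1", stars=4),
    Sample("q2", "a2", stars=5),
    Sample("q3", "a3", stars=3),
    Sample("q4", "a4"),           # unrated
    Sample("q5", "a5", stars=5),
    Sample("q6", "a6", stars=4),
]

chosen = select_few_shot_examples(dataset, max_examples=3)
```

Here `chosen` holds the two 5-star samples followed by the first 4-star sample; the 3-star and unrated samples are never considered.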

