Reviewing and Rating

Ratings help multi-shot prompting, fine-tuning, evals, and more

Last updated 15 days ago

Kiln includes a rating interface for rating dataset entries. This can be used to score the quality of the generated data, or to evaluate the quality of a model.

Requirements: Defining Rating Options

Your rating requirements are defined as part of your task definition, in the requirements section. You can add or edit requirements in Settings > Edit Task.

  • Name: the name of the requirement, which will appear in the rating UI. Limited in length to fit in the UI, but you can add more content in the instructions field below.

  • Rating Type: one of 5-star, pass/fail, pass/fail/critical.

  • Priority: how important this requirement is to the task.

An "Overall" rating is always available, even if your task has zero requirements.
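As a rough illustration, a requirement with the fields above could be modeled as follows. This is only a sketch: the `Requirement` class, its field names, the numeric priority scale, and the name length limit are all hypothetical and are not taken from Kiln's Python library.

```python
from dataclasses import dataclass

# Rating types named in this guide (the string labels are an assumption of this sketch).
RATING_TYPES = ("five_star", "pass_fail", "pass_fail_critical")

# Hypothetical limit; the actual UI length limit is not stated in this guide.
MAX_NAME_LENGTH = 32


@dataclass(frozen=True)
class Requirement:
    name: str                       # short label shown in the rating UI
    instructions: str               # longer guidance for reviewers
    rating_type: str = "five_star"
    priority: int = 2               # hypothetical scale: 0 = most important, 3 = least

    def __post_init__(self) -> None:
        # Reject values that the rating UI could not display or interpret.
        if not self.name or len(self.name) > MAX_NAME_LENGTH:
            raise ValueError("name must be non-empty and short enough for the UI")
        if self.rating_type not in RATING_TYPES:
            raise ValueError(f"unknown rating type: {self.rating_type}")
        if not 0 <= self.priority <= 3:
            raise ValueError("priority must be between 0 and 3")


tone = Requirement(
    name="Tone",
    instructions="Responses should be professional; no slang or profanity.",
    rating_type="pass_fail_critical",
    priority=0,
)
```

Keeping the name short and moving detail into the instructions mirrors the split described above.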

How Ratings are Used

Kiln uses ratings in a variety of ways:

  • In our automatic multi-shot prompting, only highly rated examples are used. The generated prompt filters to examples rated 4 stars or higher, and prefers 5-star examples if available.

  • When creating a fine-tuning dataset, you may optionally filter the training data to highly rated content.

  • When using the Python library, you can access ratings programmatically.
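The selection rule for multi-shot examples (keep 4+ stars, prefer 5 stars) can be sketched as below. The `Example` class and `select_examples` function are hypothetical names invented for this illustration; they do not reflect how Kiln implements the filter internally or how its Python library exposes ratings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    text: str
    stars: Optional[int]  # overall 5-star rating, or None if unrated


def select_examples(examples: list[Example], limit: int = 4) -> list[Example]:
    """Keep examples rated 4+ stars, listing 5-star examples first."""
    rated = [e for e in examples if e.stars is not None and e.stars >= 4]
    # list.sort is stable, so examples with equal ratings keep their original order.
    rated.sort(key=lambda e: e.stars, reverse=True)
    return rated[:limit]


dataset = [
    Example("draft A", 3),
    Example("draft B", 4),
    Example("draft C", 5),
    Example("draft D", None),
]
chosen = select_examples(dataset, limit=2)
print([e.text for e in chosen])  # ['draft C', 'draft B']
```

The same threshold idea applies when filtering a fine-tuning dataset to highly rated content: unrated and low-rated entries are dropped before the data is used.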

Rating Types:

  • 5-star: a 1-5 star rating.

  • Pass/Fail: A binary pass/fail rating.

  • Pass/Fail/Critical: A ternary pass/fail/critical rating. It can be useful to add the "critical" level when some failures are exceptionally important to avoid. For example, a customer service bot could have a "tone" criterion, where casual/slang language would be a failure, but profanity or insulting the user would be critical.

  • Custom: you can define a custom rating scale when using the Python library. However, you won't be able to use custom ratings in the Kiln UI.
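One way to picture the built-in rating types is as an enumeration with a set of allowed values per type. The sketch below is illustrative only: the `RatingType` enum, the numeric encoding (for example, -1 for "critical"), and the helper functions are assumptions made for this example, not a description of Kiln's data model.

```python
from enum import Enum


class RatingType(str, Enum):
    FIVE_STAR = "five_star"
    PASS_FAIL = "pass_fail"
    PASS_FAIL_CRITICAL = "pass_fail_critical"


# Allowed values per type. The numeric encoding is an assumption of this sketch.
ALLOWED_VALUES = {
    RatingType.FIVE_STAR: {1, 2, 3, 4, 5},
    RatingType.PASS_FAIL: {0, 1},               # 0 = fail, 1 = pass
    RatingType.PASS_FAIL_CRITICAL: {-1, 0, 1},  # -1 = critical, 0 = fail, 1 = pass
}

PASS_FAIL_LABELS = {-1: "critical", 0: "fail", 1: "pass"}


def is_valid_rating(rating_type: RatingType, value: int) -> bool:
    """Return True if the value is allowed for the given rating type."""
    return value in ALLOWED_VALUES[rating_type]


def label(rating_type: RatingType, value: int) -> str:
    """Render a rating as human-readable text, rejecting invalid values."""
    if not is_valid_rating(rating_type, value):
        raise ValueError(f"{value} is not valid for {rating_type.value}")
    if rating_type is RatingType.FIVE_STAR:
        return f"{value}/5 stars"
    return PASS_FAIL_LABELS[value]
```

With this encoding, the customer service example above would record slang as `label(RatingType.PASS_FAIL_CRITICAL, 0)` ("fail") and profanity as `label(RatingType.PASS_FAIL_CRITICAL, -1)` ("critical").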

Use Ratings to Improve your Quality

Once you have ratings, Kiln offers a number of ways to use human ratings to improve task performance and quality:

  • Create fine-tuned models, filtering the training set to only high quality training data.

  • Optimize an evaluator by finding the eval method with the highest correlation to human preferences. An optimized evaluator is necessary to find the optimal method of running your task.

[Figure: Rating UI in Kiln Desktop]

Each requirement also has an Instructions field with more details about the requirement. These will be available to reviewers in the UI (under the icon).