Synthetic Data Generation
Generate synthetic data for fine-tuning or evaluation
Last updated
Kiln offers a powerful interactive synthetic data generation tool.
Synthetic data is helpful for many reasons:
To generate a dataset for fine-tuning
To generate examples to be used for few-shot or multi-shot prompting
To test your task in a controlled environment
To generate eval datasets
To generate targeted data to reproduce a bug/issue, which can be used for training a fix, evaluating a fix, and backtesting
Once you've created a Kiln task defining your goals, data generation will use the task instructions and requirements to generate synthetic data without any additional configuration.
To generate a breadth of examples, Kiln can generate a topic tree and generate examples for each node. This includes nested topics, which allows you to generate a lot of broad data very quickly.
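A nested topic tree can be modeled as a simple recursive structure. The sketch below is illustrative only (these names are not Kiln's internal API); it shows why nesting lets a small tree fan out into many generation targets:

```python
# Illustrative sketch of a nested topic tree; not Kiln's internal API.
def leaf_topics(tree, path=()):
    """Yield each leaf topic as a path from the root, e.g. ('Travel', 'Visas')."""
    for name, children in tree.items():
        if children:
            yield from leaf_topics(children, path + (name,))
        else:
            yield path + (name,)

topic_tree = {
    "Travel": {"Budget tips": {}, "Visas": {}},
    "Cooking": {"Baking": {}, "Grilling": {}},
}

# Four leaf topics here; generating N examples per leaf scales quickly.
targets = list(leaf_topics(topic_tree))
```

Generating even a handful of examples per leaf multiplies into a broad dataset with little manual effort.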
You can use automatic topic generation, or manually add topics to your topic tree.
Sometimes you may want to guide the generation process to ensure that the data generated matches your needs. You can add human guidance to your data generation task at any time.
Adding a short guidance prompt can quickly improve the quality of the generated data. Some examples:
Generate content for global topics, not only US-centric
Generate examples in Spanish
The model is having trouble classifying the sentiment of sarcastic messages. Generate sarcastic messages.
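Conceptually, guidance is an extra instruction layered on top of your task definition. A hypothetical sketch of how such a string might be folded into a generation prompt (Kiln handles this for you in the UI):

```python
# Hypothetical illustration of combining task instructions with guidance;
# not Kiln's actual prompt construction.
def build_generation_prompt(task_instructions, guidance=None):
    parts = [task_instructions]
    if guidance:
        parts.append(f"Additional guidance: {guidance}")
    return "\n\n".join(parts)

prompt = build_generation_prompt(
    "Classify the sentiment of the message.",
    guidance="The model struggles with sarcasm. Generate sarcastic messages.",
)
```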
Kiln synthetic data generation is designed to be used in our interactive UI.
As you work, delete topics or examples that don't match your goals, and regenerate the data until you're happy with the results. Adding human guidance can help with this process.
If your task requires structured input and/or output, your synthetic data generation will automatically follow the schemas you defined. All values are validated against the schemas you define, and nothing will be saved into your dataset if they don't comply.
You can define the schema with the visual schema builder in our task definition UI. Alternatively, you can set a JSON Schema directly on the task via our Python library or a text editor.
Under the hood we attempt to use tool calling when the model supports it, but will fallback to JSON parsing if not.
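The JSON-fallback path can be sketched roughly as follows. This is a simplified illustration, not Kiln's actual implementation: try parsing the response as plain JSON, and fall back to extracting a fenced code block if that fails.

```python
import json

def parse_structured_output(text):
    """Parse a model response into a dict: try plain JSON first,
    then fall back to extracting a ```json fenced block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    if "```" in text:
        # Take the content between the first pair of fences.
        inner = text.split("```", 2)[1]
        inner = inner.removeprefix("json").strip()
        return json.loads(inner)
    raise ValueError("Response did not contain valid JSON")

result = parse_structured_output('Sure! ```json\n{"tone": "casual"}\n```')
```

The parsed value would then be checked against your task's schema before being saved to the dataset.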
Kiln offers a number of options when generating a dataset:
Model: which model to use for generation. We support a wide range of models (OpenAI, Anthropic, Llama, Google, Mistral, etc.) and a range of hosts including Ollama. Note: each model you see in the UI has been tested with the data generation tasks.
Prompt: after rating a few examples, more powerful prompt options will open up for data generation. These include few-shot, multi-shot, chain-of-thought prompting, and more.
You can use synthetic data generation as many times as you'd like. Data will be appended to your dataset each time you do.
Synthetic data can help resolve issues in your LLM systems.
As an example, let's assume your model often generates text with the wrong tone. In this example, the output is too formal when the use case calls for a more casual tone.
Synthetic data can help resolve this issue, and ensure it doesn't regress.
Open the synthetic dataset tab.
Select a high-quality model - even if it's not one that's fast or cheap enough for production.
Start generating data which shows the issue, but use the human guidance feature and better model to ensure the outputs are high quality.
Manually delete examples that don't have the correct style.
Once the synthetic data tool is reliably generating correct data (with this model and guidance pair), scale up your generation to hundreds of samples.
Save your new synthetic dataset
The new examples will be saved to your dataset, and will include a unique tag to identify them (e.g. synthetic_session_12345). With this new dataset in hand you can resolve the issue:
Simple: Fix the root prompt, and use this new dataset subset in your evaluations to ensure it works (and doesn't regress in the future)
Advanced: Fine-tune a model with this data. Create a smaller, faster model that has learned to emulate your desired style. Withhold a test set to ensure it worked.
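Withholding a test set before fine-tuning can be as simple as a deterministic shuffled split. A minimal sketch (not part of Kiln's API):

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then withhold a fraction for evaluation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Example: 100 synthetic records -> 80 for training, 20 withheld.
records = [{"id": i} for i in range(100)]
train, test = train_test_split(records)
```

Fixing the seed keeps the split reproducible, so the same held-out examples are used every time you re-evaluate the fine-tuned model.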