Synthetic Data Generation
Generate synthetic data for fine-tuning or evaluation
Last updated
Kiln offers a powerful interactive synthetic data generation tool.
Synthetic data is helpful for many reasons:
To generate a dataset for fine-tuning
To generate examples to be used for few-shot or multi-shot prompting
To test your task in a controlled environment
To generate eval datasets
To generate targeted data to reproduce a bug/issue, which can be used for training a fix, evaluating a fix, and backtesting
Once you've created a Kiln task defining your goals, data generation will use the task instructions and requirements to generate synthetic data without any additional configuration.
To generate a breadth of examples, Kiln can generate a topic tree and generate examples for each node. This includes nested topics, which allows you to generate a lot of broad data very quickly.
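A nested topic tree can be modeled as a simple recursive structure. The sketch below is illustrative only (these names are not Kiln's internal API); it shows why nesting lets a small tree fan out into many generation targets:

```python
# Illustrative sketch of a nested topic tree; not Kiln's internal API.
def leaf_topics(tree, path=()):
    """Yield each leaf topic as a path from the root, e.g. ('Travel', 'Visas')."""
    for name, children in tree.items():
        if children:
            yield from leaf_topics(children, path + (name,))
        else:
            yield path + (name,)

topic_tree = {
    "Travel": {"Budget tips": {}, "Visas": {}},
    "Cooking": {"Baking": {}, "Grilling": {}},
}

# Four leaf topics here; generating N examples per leaf scales quickly.
targets = list(leaf_topics(topic_tree))
```

Generating even a handful of examples per leaf multiplies into a broad dataset with little manual effort.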
You can use automatic topic generation, or manually add topics to your topic tree.
Sometimes you may want to guide the generation process to ensure that the data generated matches your needs. You can add human guidance to your data generation task at any time.
Adding a short guidance prompt can quickly improve the quality of the generated data. Some examples:
Generate content for global topics, not only US-centric
Generate examples in Spanish
The model is having trouble classifying the sentiment of sarcastic messages. Generate sarcastic messages.
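Conceptually, guidance is an extra instruction layered on top of your task definition. A hypothetical sketch of how such a string might be folded into a generation prompt (Kiln handles this for you in the UI):

```python
# Hypothetical illustration of combining task instructions with guidance;
# not Kiln's actual prompt construction.
def build_generation_prompt(task_instructions, guidance=None):
    parts = [task_instructions]
    if guidance:
        parts.append(f"Additional guidance: {guidance}")
    return "\n\n".join(parts)

prompt = build_generation_prompt(
    "Classify the sentiment of the message.",
    guidance="The model struggles with sarcasm. Generate sarcastic messages.",
)
```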
Kiln synthetic data generation is designed to be used in our interactive UI.
As you work, delete topics or examples that don't match your goals, and regenerate the data until you're happy with the results. Adding human guidance can help with this process.
If your task requires structured input and/or output, your synthetic data generation will automatically follow the schemas you defined. All values are validated against the schemas you define, and nothing will be saved into your dataset if they don't comply.
You can define the schema with the visual schema builder in our task definition UI. Alternatively, you can set a JSON Schema directly on the task via our Python library or a text editor.
Under the hood we attempt to use tool calling when the model supports it, but will fallback to JSON parsing if not.
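The JSON-fallback path can be sketched roughly as follows. This is a simplified illustration, not Kiln's actual implementation: try parsing the response as plain JSON, and fall back to extracting a fenced code block if that fails.

```python
import json

def parse_structured_output(text):
    """Parse a model response into a dict: try plain JSON first,
    then fall back to extracting a ```json fenced block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    if "```" in text:
        # Take the content between the first pair of fences.
        inner = text.split("```", 2)[1]
        inner = inner.removeprefix("json").strip()
        return json.loads(inner)
    raise ValueError("Response did not contain valid JSON")

result = parse_structured_output('Sure! ```json\n{"tone": "casual"}\n```')
```

The parsed value would then be checked against your task's schema before being saved to the dataset.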
Kiln offers a number of options when generating a dataset:
Model: which model to use for generation. We support a wide range of models (OpenAI, Anthropic, Llama, Google, Mistral, etc.) and a range of hosts including Ollama. Note: each model you see in the UI has been tested with the data generation tasks.
Prompt: after rating a few examples, more powerful prompt options will open up for data generation. These include few-shot, multi-shot, chain-of-thought prompting, and more.
You can use synthetic data generation as many times as you'd like. Data will be appended to your dataset each time you do.
Synthetic data can help resolve issues in your LLM systems.
As an example, let's assume your model often generates text with the wrong tone. In this example, the output is too formal when the use case calls for a more casual tone.
Synthetic data can help resolve this issue, and ensure it doesn't regress.
Open the synthetic dataset tab.
Select a high-quality model - even if it's not one that's fast or cheap enough for production.
Start generating data which shows the issue, but use the human guidance feature and better model to ensure the outputs are high quality.
Manually delete examples that don't have the correct style.
Once the synthetic data tool is reliably generating correct data (with this model and guidance pair), scale up your generation to hundreds of samples.
Save your new synthetic dataset
The new examples will be saved to your dataset, and will include a unique tag to identify them (e.g. synthetic_session_12345). With this new dataset in hand you can resolve the issue:
Simple: Fix the root prompt, and use this new dataset subset in your evaluations to ensure it works (and doesn't regress in the future)
Advanced: Fine-tune a model with this data. Create a smaller, faster model that has learned to emulate your desired style. Withhold a test set to ensure it worked.
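Withholding a test set before fine-tuning can be as simple as a deterministic shuffled split. A minimal sketch (not part of Kiln's API):

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then withhold a fraction for evaluation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Example: 100 synthetic records -> 80 for training, 20 withheld.
records = [{"id": i} for i in range(100)]
train, test = train_test_split(records)
```

Fixing the seed keeps the split reproducible, so the same held-out examples are used every time you re-evaluate the fine-tuned model.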