Datasets enable bulk testing of prompts against a collection of input/output pairs. This allows you to:
- Validate prompt behavior across many test cases
- Ensure consistency of outputs
- Catch regressions when modifying prompts
- Generate performance metrics
Each dataset item contains an input to test, along with its expected output for comparison. You can create and manage datasets through the UI or as JSON files.
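
As a rough illustration of the idea, the sketch below loads a small JSON dataset, runs each input through a prompt, and compares the result with the expected output. The field names `input` and `expected_output`, the `run_prompt` stand-in, and the exact-match comparison are all assumptions made for this example; they are not the product's actual schema or API.

```python
import json

# Hypothetical dataset contents. The field names "input" and
# "expected_output" are assumed for illustration only.
DATASET_JSON = """
[
  {"input": "2 + 2", "expected_output": "4"},
  {"input": "capital of France", "expected_output": "Paris"},
  {"input": "3 * 3", "expected_output": "9"}
]
"""


def run_prompt(text: str) -> str:
    """Stand-in for a real model call; returns canned answers."""
    canned = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return canned.get(text, "")


def evaluate(items, run):
    """Run every dataset item and record whether the output matched."""
    results = []
    for item in items:
        actual = run(item["input"])
        results.append({
            "input": item["input"],
            "expected": item["expected_output"],
            "actual": actual,
            # Exact match after trimming whitespace (an assumed comparison rule).
            "passed": actual.strip() == item["expected_output"].strip(),
        })
    return results


items = json.loads(DATASET_JSON)
results = evaluate(items, run_prompt)

# A simple aggregate metric: the fraction of items that passed.
pass_rate = sum(r["passed"] for r in results) / len(results)
failures = [r["input"] for r in results if not r["passed"]]

print(f"pass rate: {pass_rate:.2f}")
print("failures:", failures)
```

Rerunning the same dataset after each prompt change is what surfaces regressions: an item that passed before and now appears in `failures` points to the change that broke it.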