Testing in Puzzlet

Puzzlet provides robust testing capabilities to help you validate and improve your prompts through:

  • Datasets: Test prompts against diverse inputs with known expected outputs
  • LLM as Judge Evaluations: Automated quality assessment of prompt outputs using language models

Datasets

Datasets enable bulk testing of prompts against a collection of input/output pairs. This allows you to:

  • Validate prompt behavior across many test cases
  • Ensure consistency of outputs
  • Catch regressions when modifying prompts
  • Generate performance metrics

Each dataset item contains an input to test, along with its expected output for comparison. You can create and manage datasets through the UI or as JSON files.
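As an illustration, a dataset stored as a JSON file might look like the sketch below. The exact schema is an assumption for this example (field names like input and expected_output are illustrative); check your Puzzlet workspace for the format it expects.

    [
      {
        "input": { "question": "What is the capital of France?" },
        "expected_output": "Paris"
      },
      {
        "input": { "question": "What is 2 + 2?" },
        "expected_output": "4"
      }
    ]

Running a prompt against a dataset like this produces one actual output per item, which can then be compared with the expected output to flag mismatches and regressions.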

LLM as Judge Evaluations

Coming soon! LLM evaluations will provide automated assessment of your prompt outputs by using language models as judges (a sketch of the general technique follows the feature list below). Key features will include:

  • Real-time evaluation of prompt outputs
  • Batch evaluation of datasets
  • Customizable scoring criteria (numeric, boolean, classification, etc.)
  • Detailed reasoning for each evaluation
  • Aggregated quality metrics across runs
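Since this feature is not yet released, the following is only a minimal sketch of the LLM-as-judge pattern itself, not Puzzlet's eventual API. The callModel helper, the judge prompt, and the score schema are all assumptions for illustration.

    // A minimal sketch of the LLM-as-judge pattern, independent of
    // Puzzlet's eventual API. `callModel` is a hypothetical helper
    // standing in for any LLM client.

    interface JudgeResult {
      score: number;      // e.g. 1 (poor) to 5 (excellent)
      reasoning: string;  // the judge model's explanation
    }

    // Hypothetical LLM call: takes a prompt, returns the model's text.
    declare function callModel(prompt: string): Promise<string>;

    async function judgeOutput(
      input: string,
      expected: string,
      actual: string,
    ): Promise<JudgeResult> {
      const prompt = [
        "You are grading a prompt's output against an expected answer.",
        `Input: ${input}`,
        `Expected output: ${expected}`,
        `Actual output: ${actual}`,
        'Reply with JSON: {"score": 1-5, "reasoning": "..."}',
      ].join("\n");

      const raw = await callModel(prompt);
      return JSON.parse(raw) as JudgeResult; // assumes the judge replies with valid JSON
    }

Batch evaluation of a dataset is then just a loop: run the judge over each item's expected and actual outputs, and aggregate the scores into quality metrics.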

This combination of datasets and LLM evaluations gives you comprehensive tools to test, validate, and improve your prompts systematically.
