Datasets are collections of test cases used to evaluate prompts in bulk. Each item defines input props and an optional expected output for comparison.

Dataset Format

Datasets use JSONL (JSON Lines) format — one JSON object per line. Each object has:
Field            Type                      Required  Description
input            Record<string, unknown>   Yes       The input props to pass to the prompt
expected_output  string                    No        The expected output for evaluation comparison
{"input": {"question": "What is the capital of France?"}, "expected_output": "Paris"}
{"input": {"question": "Translate 'Hello' to French."}, "expected_output": "Bonjour"}
{"input": {"question": "What is 2 + 2?"}, "expected_output": "4"}
The input field is an object whose keys match the prompt’s expected props. For example, if your prompt uses {props.question}, your dataset items need {"input": {"question": "..."}}.
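Because a mismatch between dataset keys and prompt props only surfaces at run time, it can help to validate a dataset file up front. The sketch below is illustrative only and not part of the AgentMark SDK; the `parseDataset` and `missingProps` helpers and the prop list are assumptions:

```typescript
// Minimal JSONL dataset validator -- illustrative, not AgentMark's API.
interface DatasetItem {
  input: Record<string, unknown>;
  expected_output?: string; // optional, used for eval comparison
}

// Parse a JSONL document: one JSON object per non-empty line.
function parseDataset(jsonl: string): DatasetItem[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as DatasetItem);
}

// Report items whose input is missing a prop the prompt expects,
// e.g. ["question"] for a prompt that uses {props.question}.
function missingProps(items: DatasetItem[], props: string[]): string[] {
  const problems: string[] = [];
  items.forEach((item, i) => {
    for (const p of props) {
      if (!(p in item.input)) problems.push(`item ${i}: missing "${p}"`);
    }
  });
  return problems;
}
```

Running a check like this in CI catches malformed lines and missing props before a dataset run does.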

Creating Datasets

In the Platform UI

  1. Navigate to the Testing section in your AgentMark dashboard
  2. Click Create Dataset
  3. Add items with input props and optional expected outputs
  4. Save the dataset
(Screenshot: Dataset items in the AgentMark dashboard)

As Local Files

Create a .jsonl file in your project and reference it in your prompt’s frontmatter:
datasets/sentiment.jsonl
{"input": {"text": "I love this product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience."}, "expected_output": "negative"}
{"input": {"text": "It was okay, nothing special."}, "expected_output": "neutral"}
prompt.mdx
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
---

Running Datasets

From the Platform

Run a dataset against a prompt directly from the dashboard. AgentMark executes each item and displays the results, including inputs, outputs, and traces.
(Screenshot: Dataset runs in the dashboard)

From the CLI

Use the run-experiment command to run a prompt against its configured dataset:
agentmark run-experiment agentmark/my-prompt.prompt.mdx
Results stream to the terminal as they complete. See Experiments for full CLI options.

Viewing Results

After running a dataset, view detailed results for each item:
  • Input and Output — See the input provided and the output generated
  • Expected vs. Actual — Compare the prompt’s output against expected values
  • Traces — View the full execution trace for each item, including token usage and latency
  • Eval Scores — If evaluations are configured, see scores alongside each result
(Screenshot: Dataset item run results)

Webhook Integration

Dataset runs use your configured webhook to execute prompts. The webhook receives each dataset item and returns the prompt’s output, giving you full control over the inference process. To learn more about setting up a webhook, see the webhook documentation.
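The exact request and response schemas are defined in the webhook documentation; as a rough sketch, the core of a webhook handler receives one dataset item's input and returns the generated output. The `WebhookRequest`/`WebhookResponse` shapes and the `callModel` stand-in below are assumptions, not AgentMark's actual contract:

```typescript
// Sketch of a webhook's core logic. The payload/response shapes here are
// assumptions -- consult the AgentMark webhook docs for the real schema.
interface WebhookRequest {
  input: Record<string, unknown>; // one dataset item's input props
}

interface WebhookResponse {
  output: string; // the prompt's generated output for this item
}

// callModel is a stand-in for your own inference call
// (a hosted model API, a local model, etc.).
async function handleDatasetItem(
  req: WebhookRequest,
  callModel: (input: Record<string, unknown>) => Promise<string>
): Promise<WebhookResponse> {
  const output = await callModel(req.input);
  return { output };
}
```

Keeping the inference call injected like this makes the handler easy to unit-test with a fake model.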

Best Practices

  • Start small — Begin with 10–20 test cases covering common scenarios, then expand
  • Include edge cases — Test boundary conditions, empty inputs, and unusual formats
  • Use real data — Base test cases on actual production inputs when possible
  • Version control datasets — Store .jsonl files alongside your prompts in source control
  • One case per line — Keep each JSONL entry on a single line for easy diffing
  • Anonymize sensitive data — Remove PII before adding production data to datasets
For guidance on writing datasets, held-out test sets, and statistical significance, see Datasets in the Development docs.
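To illustrate the anonymization point above, a simple scrubbing pass can run over dataset inputs before they are committed. This is a minimal sketch, not a complete anonymization strategy; the email regex and the string-only handling are illustrative assumptions:

```typescript
// Sketch: scrub email addresses from dataset inputs before committing.
// The regex is a rough illustration, not an exhaustive PII filter.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/g;

function redact(value: string): string {
  return value.replace(EMAIL, "<redacted-email>");
}

// Return a copy of the item with string input values scrubbed.
function redactItem(item: { input: Record<string, unknown> }): typeof item {
  const input: Record<string, unknown> = {};
  for (const [k, v] of Object.entries(item.input)) {
    input[k] = typeof v === "string" ? redact(v) : v;
  }
  return { ...item, input };
}
```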
