AgentMark Datasets

Datasets are JSONL files containing test cases to validate prompt behavior. Each line has an input (required) and an optional expected_output.

Quick Start

1. Create a dataset file (agentmark/datasets/sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": ""}}
2. Link to your prompt (frontmatter):
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
---

<System>
Classify the sentiment
</System>
<User>{props.text}</User>
3. Run experiments:
agentmark run-experiment agentmark/sentiment.prompt.mdx

Dataset Structure

Each line must be valid JSON:
  • input (required) - Props passed to your prompt
  • expected_output (optional) - Expected result for evaluation
With expected output (enables evaluations):
{"input": {"text": "Great!", "category": "electronics"}, "expected_output": "positive"}
Without expected output (output-only mode):
{"input": {"text": "Great!", "category": "electronics"}}

What to Test

Common cases:
{"input": {"query": "What is AI?"}, "expected_output": "explanation"}
{"input": {"query": "Explain ML"}, "expected_output": "explanation"}
Edge cases:
{"input": {"text": ""}, "expected_output": "error"}
{"input": {"text": "a"}, "expected_output": "too_short"}
{"input": {"text": "Lorem ipsum... [5000 chars]"}, "expected_output": "truncated"}
Failure modes:
{"input": {"email": "invalid-email"}, "expected_output": "error: invalid email"}
{"input": {"amount": -100}, "expected_output": "error: amount must be positive"}
Real-world data - Use anonymized production data when possible.
LLM-assisted generation - Use LLMs to generate test cases, but have humans verify outputs before using them.

Expected Output Types

Strings (classification):
{"input": {"text": "sunny day"}, "expected_output": "positive"}
Objects (structured data):
{"input": {"text": "John, john@example.com"}, "expected_output": {"name": "John", "email": "john@example.com"}}
Flexible (patterns, not exact matches):
{"input": {"topic": "AI"}, "expected_output": "explanation containing: artificial intelligence"}
Your evaluation function validates flexible expectations.
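For instance, an evaluation function can treat the "explanation containing: ..." convention above as a substring check rather than an exact match. A small sketch (this expectation format is just the convention used in the example, not something AgentMark enforces):
// Treat "explanation containing: X" as "the output must mention X";
// fall back to exact string comparison otherwise.
function meetsExpectation(output: string, expected: string): boolean {
  const prefix = "explanation containing: ";
  if (expected.startsWith(prefix)) {
    const requiredPhrase = expected.slice(prefix.length);
    return output.toLowerCase().includes(requiredPhrase.toLowerCase());
  }
  return output.trim() === expected.trim();
}

meetsExpectation("AI, or artificial intelligence, is ...", "explanation containing: artificial intelligence"); // true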

Dataset Size

Start small (10-20 cases):
  • 5-7 common scenarios
  • 3-5 edge cases
  • 2-3 failure modes
Scale based on needs:
  • Initial development: 50-100 cases (recommended by Confident AI)
  • Statistical significance: ~250 cases (for 95% confidence, 5% margin of error)
  • Production systems: 100-300 cases minimum
  • High-stakes applications: 300+ cases
Quality > quantity. Start with 50-100 high-quality cases, then grow based on statistical power analysis and real-world findings.

Best Practices

  • One test case per line (valid JSONL)
  • Use descriptive inputs that clearly show what’s being validated
  • Version control datasets alongside prompts
  • Avoid duplicates - each case should validate something unique (see the sketch after this list)
  • Always anonymize data (never leak sensitive information)
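To catch accidental duplicates, compare serialized inputs across the dataset. A rough sketch (only catches exact duplicates, since JSON.stringify is sensitive to key order):
// Flag test cases whose `input` is identical to an earlier case.
function findDuplicates(cases: { input: Record<string, unknown> }[]): number[] {
  const seen = new Set<string>();
  const duplicates: number[] = [];
  cases.forEach((tc, i) => {
    const key = JSON.stringify(tc.input);
    if (seen.has(key)) duplicates.push(i);
    seen.add(key);
  });
  return duplicates;
}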

Advanced: Held-Out Test Sets

Create separate datasets to avoid overfitting:
datasets/
├── development.jsonl       # Use during iteration (60-70%)
├── validation.jsonl        # Check progress periodically (15-20%)
└── held-out.jsonl         # Final test before production (15-20%)
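If you start from a single pool of test cases, a shuffled split along these ratios can be scripted. A minimal sketch (the 70/15/15 split and function name are only an example):
// Fisher–Yates shuffle, then slice into development / validation / held-out sets.
function splitDataset<T>(cases: T[], devFrac = 0.7, valFrac = 0.15) {
  const shuffled = [...cases];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const devEnd = Math.floor(shuffled.length * devFrac);
  const valEnd = devEnd + Math.floor(shuffled.length * valFrac);
  return {
    dev: shuffled.slice(0, devEnd),
    val: shuffled.slice(devEnd, valEnd),
    heldOut: shuffled.slice(valEnd),
  };
}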
Critical rules:
  • Never iterate on held-out data
  • Don’t peek at held-out results during development
  • If you look at held-out results and make changes, create a new held-out set
Example workflow:
Week 1-2: Iterate on development set
  ├─ Test prompt v1 → 75% pass rate
  └─ Test prompt v2 → 82% pass rate

Week 3: Check validation set
  └─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

Before deploy: Test held-out set
  └─ Test prompt v3 → 81% pass rate → Deploy if meets requirements

Advanced: Statistical Significance

Sample size requirements:
  • Quick iteration: 10-20 cases (directional feedback only)
  • Initial development: 50-100 cases (industry standard)
  • Statistical rigor: ~250 cases (95% confidence, 5% margin of error)
  • Production deployment: 100-300 cases minimum
  • High-stakes systems: 300+ cases
Why size matters: with 10 cases, a single failure moves the pass rate by 10 percentage points; with 100 cases, by only 1. Research shows datasets with N ≤ 300 often overestimate performance.
Confidence intervals - Report uncertainty alongside the pass rate:
Pass rate: 85% (85 passed out of 100 tests)
Standard error: √(0.85 × 0.15 / 100) ≈ 0.036
95% confidence interval: 85% ± 1.96 × 3.6% ≈ 85% ± 7% → [78%, 92%]
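The same interval can be computed in code. A small sketch using the normal (Wald) approximation shown above:
// 95% confidence interval for a pass rate, using the normal approximation.
function passRateInterval(passed: number, total: number) {
  const rate = passed / total;
  const standardError = Math.sqrt((rate * (1 - rate)) / total);
  const margin = 1.96 * standardError;
  return { rate, low: Math.max(0, rate - margin), high: Math.min(1, rate + margin) };
}

passRateInterval(85, 100); // { rate: 0.85, low: ≈0.78, high: ≈0.92 }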
✅ “Pass rate: 85% [CI: 78%-92%]” ❌ “Pass rate: 85%”
Comparing prompts - Use paired comparisons on the same dataset:
// For each test case, record if new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});

const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests real improvement
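A rule of thumb like this can be made more principled with a sign test on the discordant pairs (cases where exactly one prompt passed). A sketch using the normal approximation (a standard statistical test, not an AgentMark feature):
// Sign test: among cases where the prompts disagree, does the new prompt
// win more often than a 50/50 split would predict?
function signTestZ(improvements: number[]): number {
  const wins = improvements.filter((x) => x === 1).length;    // new passed, old failed
  const losses = improvements.filter((x) => x === -1).length; // old passed, new failed
  const n = wins + losses;
  if (n === 0) return 0;
  return (wins - losses) / Math.sqrt(n); // |z| > 1.96 ≈ significant at the 5% level
}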
Power analysis - Determine how many samples you need before creating your dataset. It answers the question: “How many test cases do I need to reliably detect a meaningful improvement?” Key parameters:
  • Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
  • Significance level (α): Probability of false positive (typically 0.05 = 5%)
  • Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
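The same calculation can be wrapped in a helper like the calculateSampleSize used in the practical approach below. A minimal sketch (the z-values are hard-coded for α = 0.05 two-sided and 80% power, matching the example; the alpha and power arguments are kept only to mirror that call signature):
// Required cases per group to detect `effectSize` improvement over `baselineRate`.
function calculateSampleSize(alpha: number, power: number, baselineRate: number, effectSize: number): number {
  const zAlpha = 1.96; // assumes alpha = 0.05, two-sided
  const zBeta = 0.84;  // assumes power = 0.80
  const n = ((zAlpha + zBeta) ** 2 * 2 * baselineRate * (1 - baselineRate)) / effectSize ** 2;
  return Math.ceil(n);
}

calculateSampleSize(0.05, 0.80, 0.80, 0.05); // ≈ 1,004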
Practical rules of thumb:
Minimum detectable difference    Required sample size (per group)
10% (e.g., 80% → 90%)            ~100 samples
5% (e.g., 80% → 85%)             ~400 samples
2% (e.g., 80% → 82%)             ~2,500 samples
1% (e.g., 80% → 81%)             ~10,000 samples
Why this matters: If you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size based on the smallest improvement that matters to your application.
Practical approach:
// 1. Define minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate required sample size
const alpha = 0.05;  // 5% false positive rate
const power = 0.80;  // 80% chance to detect real improvement
const baselineRate = 0.80; // Current pass rate

const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments

Programmatic access

You can list and retrieve datasets via the REST API or the agentmark api CLI command. Use this to pull dataset metadata into external tools or automate dataset management workflows.
# List all datasets
agentmark api datasets list

# List datasets from the cloud gateway
agentmark api datasets list --remote

# Get a specific dataset by ID
agentmark api datasets get <datasetId>

# Equivalent curl request for listing datasets
curl -H "Authorization: Bearer <API_KEY>" \
     -H "x-app-id: <APP_ID>" \
     https://api.agentmark.co/v1/datasets?limit=50
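The same request can be made from code. A hypothetical TypeScript sketch mirroring the curl call above (endpoint and headers are taken from that example, not from a full API reference):
// List datasets from the AgentMark API, mirroring the curl request above.
async function listDatasets(apiKey: string, appId: string) {
  const response = await fetch("https://api.agentmark.co/v1/datasets?limit=50", {
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "x-app-id": appId,
    },
  });
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status}`);
  }
  return response.json();
}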
The local dev server and cloud gateway support the same endpoints, so you can develop integrations locally before deploying. Use the capabilities endpoint to check feature availability.

Next Steps

  • Evaluations - Write evaluation functions
  • Running Experiments - Test your datasets
  • Testing Overview - Learn testing concepts