Datasets

Datasets are JSONL files containing test cases to validate prompt behavior. Each line has an input (required) and an optional expected_output.

Quick Start

1. Create a dataset file (agentmark/datasets/sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": ""}}
2. Link to your prompt (frontmatter):
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
---

<System>
Classify the sentiment
</System>
<User>{props.text}</User>
3. Run experiments:
npm run experiment agentmark/sentiment.prompt.mdx

Dataset Structure

Each line must be valid JSON:
  • input (required) - Props passed to your prompt
  • expected_output (optional) - Expected result for evaluation
With expected output (enables evaluations):
{"input": {"text": "Great!", "category": "electronics"}, "expected_output": "positive"}
Without expected output (output-only mode):
{"input": {"text": "Great!", "category": "electronics"}}

What to Test

Common cases:
{"input": {"query": "What is AI?"}, "expected_output": "explanation"}
{"input": {"query": "Explain ML"}, "expected_output": "explanation"}
Edge cases:
{"input": {"text": ""}, "expected_output": "error"}
{"input": {"text": "a"}, "expected_output": "too_short"}
{"input": {"text": "Lorem ipsum... [5000 chars]"}, "expected_output": "truncated"}
Failure modes:
{"input": {"email": "invalid-email"}, "expected_output": "error: invalid email"}
{"input": {"amount": -100}, "expected_output": "error: amount must be positive"}
Real-world data - Use anonymized production data when possible.
LLM-assisted generation - Use LLMs to generate test cases, but have humans verify outputs before using them.

Expected Output Types

Strings (classification):
{"input": {"text": "sunny day"}, "expected_output": "positive"}
Objects (structured data):
{"input": {"text": "John, [email protected]"}, "expected_output": {"name": "John", "email": "[email protected]"}}
Flexible (patterns, not exact matches):
{"input": {"topic": "AI"}, "expected_output": "explanation containing: artificial intelligence"}
Your evaluation function validates flexible expectations.
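For the flexible case above, the evaluation function might check for key phrases instead of an exact string match. A minimal sketch, assuming string outputs (the function shape is illustrative, not an AgentMark API):

// Pass if the output contains the phrase named in a flexible expectation,
// e.g. "explanation containing: artificial intelligence"
function evaluateFlexible(output, expectedOutput) {
  const marker = 'explanation containing: ';
  if (expectedOutput.startsWith(marker)) {
    const phrase = expectedOutput.slice(marker.length);
    return output.toLowerCase().includes(phrase.toLowerCase());
  }
  return output.trim() === expectedOutput.trim(); // otherwise fall back to exact match
}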

Dataset Size

Start small (10-20 cases):
  • 5-7 common scenarios
  • 3-5 edge cases
  • 2-3 failure modes
Scale based on needs:
  • Initial development: 50-100 cases (recommended by Confident AI)
  • Statistical significance: ~250 cases (for 95% confidence, 5% margin of error)
  • Production systems: 100-300 cases minimum
  • High-stakes applications: 300+ cases
Quality > quantity. Begin with a small set of high-quality cases, grow toward 50-100, then scale based on statistical power analysis and real-world findings.

Best Practices

  • One test case per line (valid JSONL) - see the validation sketch after this list
  • Use descriptive inputs that clearly show what’s being validated
  • Version control datasets alongside prompts
  • Avoid duplicates - each case should validate something unique
  • Always anonymize data (never leak sensitive information)
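A minimal sketch of a validation script for the JSONL rule above, assuming a Node.js environment (the file path is just an example):

// Check that every line of a dataset is valid JSON with a required "input" field
const fs = require('fs');

const lines = fs.readFileSync('agentmark/datasets/sentiment.jsonl', 'utf8')
  .split('\n')
  .filter(line => line.trim() !== '');

lines.forEach((line, i) => {
  try {
    const testCase = JSON.parse(line);
    if (testCase.input === undefined) {
      console.error(`Line ${i + 1}: missing required "input" field`);
    }
  } catch (err) {
    console.error(`Line ${i + 1}: invalid JSON (${err.message})`);
  }
});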

Advanced: Held-Out Test Sets

Create separate datasets to avoid overfitting:
datasets/
├── development.jsonl       # Use during iteration (60-70%)
├── validation.jsonl        # Check progress periodically (15-20%)
└── held-out.jsonl         # Final test before production (15-20%)
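One way to produce this split from a single master file, as a sketch (the all-cases.jsonl name and the 70/15/15 proportions are assumptions consistent with the layout above; the script is not part of AgentMark):

// Shuffle a master dataset and split it into development / validation / held-out files
const fs = require('fs');

const lines = fs.readFileSync('datasets/all-cases.jsonl', 'utf8')
  .split('\n')
  .filter(line => line.trim() !== '')
  .sort(() => Math.random() - 0.5); // naive shuffle, good enough for this purpose

const devEnd = Math.floor(lines.length * 0.7);
const valEnd = Math.floor(lines.length * 0.85);

fs.writeFileSync('datasets/development.jsonl', lines.slice(0, devEnd).join('\n') + '\n');
fs.writeFileSync('datasets/validation.jsonl', lines.slice(devEnd, valEnd).join('\n') + '\n');
fs.writeFileSync('datasets/held-out.jsonl', lines.slice(valEnd).join('\n') + '\n');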
Critical rules:
  • Never iterate on held-out data
  • Don’t peek at held-out results during development
  • If you look at held-out results and make changes, create a new held-out set
Example workflow:
Week 1-2: Iterate on development set
  ├─ Test prompt v1 → 75% pass rate
  └─ Test prompt v2 → 82% pass rate

Week 3: Check validation set
  └─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

Before deploy: Test held-out set
  └─ Test prompt v3 → 81% pass rate → Deploy if meets requirements

Advanced: Statistical Significance

Sample size requirements:
  • Quick iteration: 10-20 cases (directional feedback only)
  • Initial development: 50-100 cases (industry standard)
  • Statistical rigor: ~250 cases (95% confidence, 5% margin of error)
  • Production deployment: 100-300 cases minimum
  • High-stakes systems: 300+ cases
Why size matters: With 10 cases, a single failure moves the pass rate by 10 percentage points; with 100 cases, by only 1 point. Research shows datasets with N ≤ 300 often overestimate performance.

Confidence intervals - Report uncertainty, not just a point estimate:
Pass rate: 85% (85 passed out of 100 tests)
Standard error: √(0.85 × 0.15 / 100) = 0.036
95% confidence interval: 85% ± 7% → [78%, 92%]
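A minimal sketch of this calculation, assuming you already have the pass/fail counts (the helper name is illustrative, not an AgentMark API):

// Normal-approximation (Wald) 95% confidence interval for a pass rate
function passRateConfidenceInterval(passed, total) {
  const p = passed / total;
  const standardError = Math.sqrt((p * (1 - p)) / total);
  const margin = 1.96 * standardError; // z-score for 95% confidence
  return { passRate: p, lower: Math.max(0, p - margin), upper: Math.min(1, p + margin) };
}

const ci = passRateConfidenceInterval(85, 100);
console.log(
  `Pass rate: ${(ci.passRate * 100).toFixed(0)}% ` +
  `[CI: ${(ci.lower * 100).toFixed(0)}%-${(ci.upper * 100).toFixed(0)}%]`
);
// → Pass rate: 85% [CI: 78%-92%]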
✅ “Pass rate: 85% [CI: 78%-92%]”
❌ “Pass rate: 85%”

Comparing prompts - Use paired comparisons on the same dataset:
// For each test case, record whether the new prompt performed better.
// evaluateOld / evaluateNew stand in for your own evaluation functions,
// each returning true when that prompt's output passes the test case.
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});

const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests real improvement
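The > 10 threshold in the comment above is only a rule of thumb. For a principled check you can run a two-sided sign test on the non-tied cases; this sketch is standard statistics applied to the improvements array above, not an AgentMark feature:

// Two-sided sign test: under the null hypothesis that neither prompt is better,
// each non-tied case is a "win" for the new prompt with probability 0.5.
function signTestPValue(wins, losses) {
  const n = wins + losses;        // ties are ignored
  const k = Math.max(wins, losses);
  let tail = 0;                   // P(X >= k) for X ~ Binomial(n, 0.5)
  for (let i = k; i <= n; i++) {
    let logC = 0;                 // log of C(n, i)
    for (let j = 0; j < i; j++) logC += Math.log(n - j) - Math.log(j + 1);
    tail += Math.exp(logC + n * Math.log(0.5));
  }
  return Math.min(1, 2 * tail);   // double for a two-sided test
}

// wins/losses come from the improvements array computed above
const wins = improvements.filter(x => x === 1).length;
const losses = improvements.filter(x => x === -1).length;
console.log(`Sign test p-value: ${signTestPValue(wins, losses).toFixed(3)}`);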
Power analysis - Determine how many samples you need before you build the dataset. Power analysis answers the question: “How many test cases do I need to reliably detect a meaningful improvement?”

Key parameters:
  • Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
  • Significance level (α): Probability of false positive (typically 0.05 = 5%)
  • Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05
n7.84 × 0.32 / 0.0025
n1,003 test cases
Practical rules of thumb:
Minimum detectable difference    Required sample size (per group)
10% (e.g., 80% → 90%)            ~100 samples
5% (e.g., 80% → 85%)             ~400 samples
2% (e.g., 80% → 82%)             ~2,500 samples
1% (e.g., 80% → 81%)             ~10,000 samples
Why this matters: With only 50 test cases you can reliably detect only large improvements (>15%); smaller improvements will look like noise. Plan your dataset size around the smallest improvement that matters to your application.

Practical approach:
// One possible implementation of calculateSampleSize, using the simplified
// two-proportion formula above. The z-scores are hardcoded for the defaults
// below (alpha = 0.05 two-sided, power = 0.80).
function calculateSampleSize(alpha, power, baselineRate, minImprovement) {
  const zAlpha = 1.96; // two-sided critical value for alpha = 0.05
  const zBeta = 0.84;  // critical value for power = 0.80
  const p = baselineRate;
  return Math.ceil(((zAlpha + zBeta) ** 2 * 2 * p * (1 - p)) / minImprovement ** 2);
}

// 1. Define the minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate the required sample size
const alpha = 0.05;  // 5% false positive rate
const power = 0.80;  // 80% chance to detect a real improvement
const baselineRate = 0.80; // current pass rate

const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments

Next Steps