Datasets are JSONL files containing test cases that validate prompt behavior. Each line has an input (required) and an optional expected_output.
Quick Start
1. Create a dataset file (agentmark/datasets/sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": ""}}
2. Link to your prompt (frontmatter):
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
---
<System>
Classify the sentiment
</System>
<User>{props.text}</User>
3. Run experiments:
npm run experiment agentmark/sentiment.prompt.mdx
Dataset Structure
Each line must be valid JSON:
- input (required) - Props passed to your prompt
- expected_output (optional) - Expected result for evaluation
With expected output (enables evaluations):
{"input": {"text": "Great!", "category": "electronics"}, "expected_output": "positive"}
Without expected output (output-only mode):
{"input": {"text": "Great!", "category": "electronics"}}
What to Test
Common cases:
{"input": {"query": "What is AI?"}, "expected_output": "explanation"}
{"input": {"query": "Explain ML"}, "expected_output": "explanation"}
Edge cases:
{"input": {"text": ""}, "expected_output": "error"}
{"input": {"text": "a"}, "expected_output": "too_short"}
{"input": {"text": "Lorem ipsum... [5000 chars]"}, "expected_output": "truncated"}
Failure modes:
{"input": {"email": "invalid-email"}, "expected_output": "error: invalid email"}
{"input": {"amount": -100}, "expected_output": "error: amount must be positive"}
Real-world data - Use anonymized production data when possible (see the conversion sketch below).
LLM-assisted generation - Use LLMs to generate test cases, but have humans verify outputs before using them.
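For example, a small Node script can turn anonymized production records into dataset rows (a sketch only; records.json, the field names, and the anonymize regex are placeholders for your own pipeline):

import { readFileSync, writeFileSync } from "node:fs";

// Placeholder anonymization: strip emails; extend for names, IDs, etc.
const anonymize = (text: string) => text.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "<email>");

const records = JSON.parse(readFileSync("records.json", "utf8")); // e.g. [{ text, label }, ...]
const lines = records.map((r) =>
  JSON.stringify({ input: { text: anonymize(r.text) }, expected_output: r.label })
);
writeFileSync("agentmark/datasets/production-sample.jsonl", lines.join("\n") + "\n");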
Expected Output Types
Strings (classification):
{"input": {"text": "sunny day"}, "expected_output": "positive"}
Objects (structured data):
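For example (an illustrative shape; the object should mirror whatever structured output your prompt returns):

{"input": {"text": "Great!"}, "expected_output": {"sentiment": "positive", "confidence": "high"}}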
Flexible (patterns, not exact matches):
{"input": {"topic": "AI"}, "expected_output": "explanation containing: artificial intelligence"}
Flexible expectations like these are checked by your evaluation function rather than by exact string matching.
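A minimal sketch of such a check, assuming the "containing:" convention used above (the function name and the way it is wired into your experiment are placeholders, not AgentMark APIs):

// Returns true when the model output satisfies a flexible "containing:" expectation
function matchesFlexibleExpectation(output: string, expected: string): boolean {
  const marker = "containing:";
  const idx = expected.indexOf(marker);
  if (idx === -1) {
    // No pattern syntax: fall back to an exact, case-insensitive comparison
    return output.trim().toLowerCase() === expected.trim().toLowerCase();
  }
  const needle = expected.slice(idx + marker.length).trim().toLowerCase();
  return output.toLowerCase().includes(needle);
}

// matchesFlexibleExpectation(modelOutput, "explanation containing: artificial intelligence")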
Dataset Size
Start small (10-20 cases):
- 5-7 common scenarios
- 3-5 edge cases
- 2-3 failure modes
Scale based on needs:
- Initial development: 50-100 cases (recommended by Confident AI)
- Statistical significance: ~250 cases (for 95% confidence, 5% margin of error)
- Production systems: 100-300 cases minimum
- High-stakes applications: 300+ cases
Quality > quantity. Start with 50-100 high-quality cases, then grow based on statistical power analysis and real-world findings.
Best Practices
- One test case per line (valid JSONL)
- Use descriptive inputs that clearly show what’s being validated
- Version control datasets alongside prompts
- Avoid duplicates - each case should validate something unique
- Always anonymize data (never leak sensitive information)
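A short script can enforce several of these practices before a run (a sketch; the path is an example, and "duplicate" here means an identical input object):

import { readFileSync } from "node:fs";

const path = "agentmark/datasets/sentiment.jsonl"; // example path
const seen = new Set<string>();

readFileSync(path, "utf8")
  .split("\n")
  .filter((line) => line.trim() !== "")
  .forEach((line, i) => {
    let row;
    try {
      row = JSON.parse(line); // every line must be valid JSON
    } catch {
      throw new Error(`Line ${i + 1}: not valid JSON`);
    }
    if (!row.input) throw new Error(`Line ${i + 1}: missing required "input"`);
    const key = JSON.stringify(row.input);
    if (seen.has(key)) console.warn(`Line ${i + 1}: duplicate input`);
    seen.add(key);
  });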
Advanced: Held-Out Test Sets
Create separate datasets to avoid overfitting:
datasets/
├── development.jsonl # Use during iteration (60-70%)
├── validation.jsonl # Check progress periodically (15-20%)
└── held-out.jsonl # Final test before production (15-20%)
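One way to produce such a split (a sketch; the 70/15/15 ratios and file names are examples, and the Math.random shuffle is crude, so use a seeded shuffle if you need the split to be reproducible):

import { readFileSync, writeFileSync } from "node:fs";

const lines = readFileSync("agentmark/datasets/all.jsonl", "utf8")
  .split("\n")
  .filter((l) => l.trim() !== "")
  .sort(() => Math.random() - 0.5); // crude shuffle; swap in a seeded Fisher-Yates if needed

const devEnd = Math.floor(lines.length * 0.7);
const valEnd = Math.floor(lines.length * 0.85);

writeFileSync("agentmark/datasets/development.jsonl", lines.slice(0, devEnd).join("\n") + "\n");
writeFileSync("agentmark/datasets/validation.jsonl", lines.slice(devEnd, valEnd).join("\n") + "\n");
writeFileSync("agentmark/datasets/held-out.jsonl", lines.slice(valEnd).join("\n") + "\n");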
Critical rules:
- Never iterate on held-out data
- Don’t peek at held-out results during development
- If you look at held-out results and make changes, create a new held-out set
Example workflow:
Week 1-2: Iterate on development set
├─ Test prompt v1 → 75% pass rate
└─ Test prompt v2 → 82% pass rate
Week 3: Check validation set
└─ Test prompt v2 → 79% pass rate (close to dev, good sign!)
Before deploy: Test held-out set
└─ Test prompt v3 → 81% pass rate → Deploy if meets requirements
Advanced: Statistical Significance
Sample size requirements:
- Quick iteration: 10-20 cases (directional feedback only)
- Initial development: 50-100 cases (industry standard)
- Statistical rigor: ~250 cases (95% confidence, 5% margin of error)
- Production deployment: 100-300 cases minimum
- High-stakes systems: 300+ cases
Why size matters: With 10 cases, a single failure moves the pass rate by 10 percentage points; with 100 cases, by only 1 point. Research shows datasets with N ≤ 300 often overestimate performance.
Confidence intervals - Report uncertainty:
Pass rate: 85% (85 passed out of 100 tests)
Standard error: √(0.85 × 0.15 / 100) = 0.036
95% confidence interval: 85% ± 7% → [78%, 92%]
✅ “Pass rate: 85% [95% CI: 78%-92%]”
❌ “Pass rate: 85%”
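The same interval can be computed in code (normal approximation; the 1.96 z-value assumes a 95% interval and a reasonably large sample):

// 95% confidence interval for a pass rate (normal approximation)
function passRateCI(passed: number, total: number) {
  const p = passed / total;
  const se = Math.sqrt((p * (1 - p)) / total);
  const margin = 1.96 * se;
  return { rate: p, low: Math.max(0, p - margin), high: Math.min(1, p + margin) };
}

// Example from above: 85 of 100 passed → roughly [0.78, 0.92]
console.log(passRateCI(85, 100));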
Comparing prompts - Use paired comparisons on the same dataset:
// For each test case, record if new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});
const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests real improvement
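To go beyond that rule of thumb, the discordant counts can feed a simple sign test (a McNemar-style normal approximation; this is one common choice, not something AgentMark computes for you):

// b = cases only the new prompt passed, c = cases only the old prompt passed
const b = improvements.filter((x) => x === 1).length;
const c = improvements.filter((x) => x === -1).length;

// Under "no real difference", b and c should be roughly equal;
// |z| > 1.96 corresponds to p < 0.05 (two-sided).
const z = b + c === 0 ? 0 : (b - c) / Math.sqrt(b + c);
console.log(`net improvement: ${b - c}, z = ${z.toFixed(2)}`);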
Power analysis - Determine how many samples you need before creating your dataset.
Power analysis answers: “How many test cases do I need to reliably detect a meaningful improvement?”
Key parameters:
- Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
- Significance level (α): Probability of false positive (typically 0.05 = 5%)
- Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²
// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
Practical rules of thumb:
| Minimum detectable difference | Required sample size (per group) |
|---|---|
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |
Why this matters:
If you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size based on the smallest improvement that matters to your application.
Practical approach:
// 1. Define minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate
// 2. Calculate required sample size
const alpha = 0.05; // 5% false positive rate
const power = 0.80; // 80% chance to detect real improvement
const baselineRate = 0.80; // Current pass rate
const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);
// 3. Collect that many test cases before running experiments
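calculateSampleSize above is not defined in this guide; a minimal implementation consistent with the two-proportion formula might look like this (z-values are hard-coded for the common α = 0.05 and power = 0.80 settings; use a proper inverse-normal function for other values):

// Approximate test cases needed per group to compare two pass rates
function calculateSampleSize(
  alpha: number,          // false positive rate, e.g. 0.05
  power: number,          // 1 - beta, e.g. 0.80
  baselineRate: number,   // current pass rate p, e.g. 0.80
  minImprovement: number  // smallest difference worth detecting, e.g. 0.05
): number {
  const zAlpha = alpha === 0.05 ? 1.96 : 2.576; // 0.05 → 1.96, 0.01 → 2.576
  const zBeta = power === 0.8 ? 0.84 : 1.28;    // 0.80 → 0.84, 0.90 → 1.28
  const p = baselineRate;
  return Math.ceil(((zAlpha + zBeta) ** 2 * 2 * p * (1 - p)) / minImprovement ** 2);
}

// Matches the worked example (≈1,003 up to rounding): 5% improvement from an 80% baseline
console.log(calculateSampleSize(0.05, 0.8, 0.8, 0.05)); // ~1,004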
Next Steps