```
datasets/
├── development.jsonl   # Use during iteration (60-70%)
├── validation.jsonl    # Check progress periodically (15-20%)
└── held-out.jsonl      # Final test before production (15-20%)
```
Critical rules:
- Never iterate on held-out data
- Don't peek at held-out results during development
- If you look at held-out results and make changes, create a new held-out set
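The three-way split above can be sketched as a deterministic shuffle-and-slice. This is illustrative only: the 70/15/15 ratios, the `splitDataset` name, and the mulberry32 PRNG are assumptions, not part of any framework.

```typescript
// Split labeled test cases into development / validation / held-out sets.
// A seeded PRNG makes the split reproducible, so the held-out set stays fixed
// across runs (important for the "never iterate on held-out data" rule).
function splitDataset<T>(
  cases: T[],
  seed = 42
): { development: T[]; validation: T[]; heldOut: T[] } {
  // mulberry32: a small deterministic PRNG returning values in [0, 1)
  let s = seed >>> 0;
  const rand = () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = Math.imul(s ^ (s >>> 15), 1 | s);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };

  // Fisher-Yates shuffle on a copy, then slice at 70% and 85%
  const shuffled = [...cases];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const devEnd = Math.floor(shuffled.length * 0.7);
  const valEnd = Math.floor(shuffled.length * 0.85);
  return {
    development: shuffled.slice(0, devEnd),
    validation: shuffled.slice(devEnd, valEnd),
    heldOut: shuffled.slice(valEnd),
  };
}
```

Writing each slice out to its own `.jsonl` file (one JSON record per line) then matches the directory layout above.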
Example workflow:
```
Week 1-2: Iterate on development set
├─ Test prompt v1 → 75% pass rate
└─ Test prompt v2 → 82% pass rate

Week 3: Check validation set
└─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

Before deploy: Test held-out set
└─ Test prompt v3 → 81% pass rate → Deploy if meets requirements
```
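The Week 3 sanity check can be expressed as a simple gap test: a large spread between development and validation pass rates suggests the prompt has overfit to the development set. The 5-point threshold here is an illustrative choice, not a rule from any framework.

```typescript
// Compare dev vs. validation pass rates; a small gap is a good sign.
function devValGap(devPassRate: number, valPassRate: number, threshold = 0.05): string {
  const gap = devPassRate - valPassRate;
  const points = (gap * 100).toFixed(0);
  if (gap > threshold) {
    return `Gap of ${points} points: likely overfitting to the dev set`;
  }
  return `Gap of ${points} points: dev and validation agree`;
}

// 82% on dev vs. 79% on validation, as in the workflow above
console.log(devValGap(0.82, 0.79));
```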
- Statistical rigor: ~250 cases gives roughly a 5% margin of error at 95% confidence (assuming a pass rate near 80%)
- Production deployment: 100-300 cases minimum
- High-stakes systems: 300+ cases
Why size matters: with 10 cases, one failure moves the pass rate by 10 percentage points; with 100 cases, by only 1 point. Research suggests datasets with N ≤ 300 often overestimate performance.

Confidence intervals - Report uncertainty:

✅ "Pass rate: 85% [CI: 77%-91%]"
❌ "Pass rate: 85%"

Comparing prompts - Use paired comparisons on the same dataset:
```typescript
// For each test case, record whether the new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : oldPassed && !newPassed ? -1 : 0;
});
const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests a real improvement
```
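A pass-rate interval like the one recommended earlier can be computed with the Wilson score method. This is a sketch independent of any particular eval framework; with 85 passes out of 100 it reproduces the "85% [CI: 77%-91%]" example.

```typescript
// Wilson score interval for a binomial pass rate.
// z = 1.96 corresponds to a 95% confidence level.
function wilsonInterval(passed: number, total: number, z = 1.96): [number, number] {
  const p = passed / total;
  const denom = 1 + (z * z) / total;
  const center = (p + (z * z) / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + (z * z) / (4 * total * total))) / denom;
  return [center - half, center + half];
}

const [lo, hi] = wilsonInterval(85, 100);
console.log(`Pass rate: 85% [CI: ${Math.round(lo * 100)}%-${Math.round(hi * 100)}%]`);
// → Pass rate: 85% [CI: 77%-91%]
```

The Wilson interval behaves better than the naive normal approximation when pass rates are near 0% or 100%, which is common in evals.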
Power analysis - Determine how many samples you need before creating your dataset.

Power analysis answers: "How many test cases do I need to reliably detect a meaningful improvement?"

Key parameters:
- Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
- Significance level (α): Probability of false positive (typically 0.05 = 5%)
- Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
```
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
```
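The formula can be turned into a small runnable helper. The `sampleSize` name is hypothetical, and the z-scores are the usual rounded approximations (1.96 for α = 0.05 two-sided, 0.84 for 80% power), so the result lands within a case or two of the hand calculation above.

```typescript
// n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / d², rounded up to a whole test case
function sampleSize(p: number, effectSize: number, zAlpha = 1.96, zBeta = 0.84): number {
  return Math.ceil(((zAlpha + zBeta) ** 2 * 2 * p * (1 - p)) / effectSize ** 2);
}

// Detect a 5% improvement over an 80% baseline (80% power, 95% confidence)
console.log(sampleSize(0.80, 0.05)); // → 1004
```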
Practical rules of thumb:

| Minimum detectable difference | Required sample size (per group) |
|---|---|
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |
Why this matters: if you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size around the smallest improvement that matters to your application.

Practical approach:
```typescript
// 1. Define the minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate the required sample size
const alpha = 0.05;        // 5% false positive rate
const power = 0.80;        // 80% chance to detect a real improvement
const baselineRate = 0.80; // Current pass rate

const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments
```
You can list and retrieve datasets via the REST API or the agentmark api CLI command. Use this to pull dataset metadata into external tools or automate dataset management workflows.
```shell
# List all datasets
agentmark api datasets list

# List datasets from the cloud gateway
agentmark api datasets list --remote

# Get a specific dataset by ID
agentmark api datasets get <datasetId>
```
The local dev server and cloud gateway support the same endpoints, so you can develop integrations locally before deploying. Use the capabilities endpoint to check feature availability.