Datasets are JSONL files containing test cases to validate prompt behavior. Each line has an input (required) and an optional expected_output. The dataset editor shows each JSONL row on its own line, with syntax highlighting for the input and expected_output fields. Add rows inline or upload a .jsonl file from disk.
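For example, two rows (one with expected_output, one without) can be parsed line by line, since each JSONL line is a standalone JSON object. The row contents below are made up for illustration:

```javascript
// Two hypothetical JSONL rows: input is required, expected_output is optional
const raw = [
  '{"input": {"text": "I love this product"}, "expected_output": "positive"}',
  '{"input": {"text": "Refund my order"}}',
].join("\n");

// Parse each non-empty line as its own JSON object
const rows = raw.split("\n").filter(Boolean).map(line => JSON.parse(line));

console.log(rows.length);             // 2
console.log(rows[0].expected_output); // "positive"
console.log(rows[1].expected_output); // undefined (the field is optional)
```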
```
datasets/
├── development.jsonl   # Use during iteration (60-70%)
├── validation.jsonl    # Check progress periodically (15-20%)
└── held-out.jsonl      # Final test before production (15-20%)
```
Critical rules:
Never iterate on held-out data
Don’t peek at held-out results during development
If you look at held-out results and make changes, create a new held-out set
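The three-way split above should be made once and persisted, so held-out rows never leak into development. A minimal sketch (Fisher–Yates shuffle, 60/20/20; helper and field names are illustrative, not an AgentMark API):

```javascript
// Shuffle once, split 60/20/20, then write each slice to its own .jsonl file.
function splitDataset(rows) {
  const shuffled = [...rows];
  // Fisher–Yates shuffle
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const devEnd = Math.floor(shuffled.length * 0.6);
  const valEnd = Math.floor(shuffled.length * 0.8);
  return {
    development: shuffled.slice(0, devEnd),
    validation: shuffled.slice(devEnd, valEnd),
    heldOut: shuffled.slice(valEnd),
  };
}

const rows = Array.from({ length: 100 }, (_, i) => ({ input: { id: i } }));
const split = splitDataset(rows);
console.log(split.development.length, split.validation.length, split.heldOut.length); // 60 20 20
```

Persisting the split (rather than re-shuffling per run) is what makes the "never iterate on held-out data" rule enforceable.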
Example workflow:
```
Week 1-2: Iterate on development set
├─ Test prompt v1 → 75% pass rate
└─ Test prompt v2 → 82% pass rate

Week 3: Check validation set
└─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

Before deploy: Test held-out set
└─ Test prompt v3 → 81% pass rate → Deploy if meets requirements
```
Statistical rigor: ~250 cases (95% confidence, 5% margin of error)
Production deployment: 100-300 cases minimum
High-stakes systems: 300+ cases
Why size matters: With 10 cases, one failure moves the pass rate by 10 percentage points. With 100 cases, one failure moves it by only 1 point. Research shows datasets with N ≤ 300 often overestimate performance.

Confidence intervals - Report uncertainty:
✅ “Pass rate: 85% [CI: 77%-91%]”
❌ “Pass rate: 85%”

Comparing prompts - Use paired comparisons on the same dataset:
```javascript
// For each test case, record if the new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});
const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests real improvement
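The "netImprovement > 10" rule of thumb can be checked more formally with a paired sign test on the discordant cases (those where exactly one prompt passed). This is a self-contained sketch, not an AgentMark API, and the demo counts are made up:

```javascript
// Paired sign test: under H0 (no real difference), each discordant pair is
// a fair coin flip, so we compute a two-sided exact binomial p-value.
function signTest(improvements) {
  const wins = improvements.filter(v => v === 1).length;
  const losses = improvements.filter(v => v === -1).length;
  const n = wins + losses; // only discordant pairs carry signal
  const binom = (m, k) => {
    let c = 1;
    for (let i = 0; i < k; i++) c = (c * (m - i)) / (i + 1);
    return c;
  };
  // P(X >= k) for X ~ Binomial(n, 0.5)
  const pAtLeast = (k) => {
    let p = 0;
    for (let i = k; i <= n; i++) p += binom(n, i) * Math.pow(0.5, n);
    return p;
  };
  const k = Math.max(wins, losses);
  const pValue = Math.min(1, 2 * pAtLeast(k));
  return { wins, losses, pValue };
}

// Demo: 14 wins vs 4 losses among 100 cases (82 ties)
const demo = Array(14).fill(1).concat(Array(4).fill(-1), Array(82).fill(0));
const { pValue } = signTest(demo);
console.log(pValue < 0.05); // true: significant at α = 0.05
```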
Power analysis - Determine how many samples you need before creating your dataset.

Power analysis answers: “How many test cases do I need to reliably detect a meaningful improvement?”

Key parameters:
Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
Significance level (α): Probability of false positive (typically 0.05 = 5%)
Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
```
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
```
Practical rules of thumb:
| Minimum detectable difference | Required sample size (per group) |
| --- | --- |
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |
Why this matters: If you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size based on the smallest improvement that matters to your application.

Practical approach:
```javascript
// 1. Define minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate required sample size
const alpha = 0.05;        // 5% false positive rate
const power = 0.80;        // 80% chance to detect real improvement
const baselineRate = 0.80; // Current pass rate
const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments
```
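The snippet above leaves calculateSampleSize undefined. One possible implementation (a hypothetical helper, matching the simplified two-proportion formula above and using a small lookup table of common Z values):

```javascript
// Z values for common choices; extend the tables for other α / power levels
const Z_ALPHA = { 0.05: 1.96, 0.01: 2.576 };  // two-sided α → Z_α/2
const Z_BETA = { 0.80: 0.84, 0.90: 1.28 };    // power → Z_β

// n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²
function calculateSampleSize(alpha, power, baselineRate, minImprovement) {
  const z = Z_ALPHA[alpha] + Z_BETA[power];
  return Math.ceil((z ** 2 * 2 * baselineRate * (1 - baselineRate)) / minImprovement ** 2);
}

console.log(calculateSampleSize(0.05, 0.80, 0.80, 0.05)); // 1004 (~1,003 in the worked example above)
```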
You can list datasets and append new rows via the REST API or the agentmark api CLI command. Use this to pull dataset metadata into external tools or automate dataset ingestion.
```shell
# List all datasets (defaults to your local dev server)
npx agentmark api datasets list

# List datasets from AgentMark Cloud instead
npx agentmark api datasets list --remote
```
Datasets are keyed by file path, not UUID. To append a row, POST to /v1/datasets/{datasetName}/rows where datasetName is the dataset path without the .jsonl extension, URL-encoded (for example, evals/sentiment-test.jsonl → evals%2Fsentiment-test):
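The path-to-URL rule above can be applied mechanically; a small sketch (the helper name is illustrative, not part of the AgentMark SDK):

```javascript
// Build the row-append URL from a dataset file path:
// drop the .jsonl extension, then URL-encode the remaining path
function datasetRowsUrl(path, base = "https://api.agentmark.co") {
  const name = path.replace(/\.jsonl$/, "");
  return `${base}/v1/datasets/${encodeURIComponent(name)}/rows`;
}

console.log(datasetRowsUrl("evals/sentiment-test.jsonl"));
// https://api.agentmark.co/v1/datasets/evals%2Fsentiment-test/rows
```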
```shell
# Append a row via curl
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "X-Agentmark-App-Id: <APP_ID>" \
  -H "Content-Type: application/json" \
  -d '{"input": {"text": "Great!"}, "expected_output": "positive"}' \
  https://api.agentmark.co/v1/datasets/evals%2Fsentiment-test/rows
```
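The same request can be issued from Node 18+ with the built-in fetch. This sketch only builds the request (so headers and body can be inspected before sending); <API_KEY> and <APP_ID> remain placeholders for your credentials:

```javascript
// Build the append request matching the curl example above
function buildAppendRequest(row, apiKey, appId) {
  return {
    url: "https://api.agentmark.co/v1/datasets/evals%2Fsentiment-test/rows",
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "X-Agentmark-App-Id": appId,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(row),
    },
  };
}

const { url, options } = buildAppendRequest(
  { input: { text: "Great!" }, expected_output: "positive" },
  "<API_KEY>",
  "<APP_ID>"
);
// Send with: await fetch(url, options)
console.log(options.method); // POST
```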
The local dev server and the AgentMark Cloud gateway both implement the datasets endpoints, so you can develop integrations locally before deploying. Use the capabilities endpoint to check which endpoints a given server supports.