Datasets are JSONL files containing test cases to validate prompt behavior. Each line has an input (required) and an optional expected_output. The same files power both Cloud and Local — in Cloud they sync to the Dashboard through the deployment pipeline, and in Local you run them directly from the CLI.
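As a rough illustration of the row format described above, the sketch below checks that each line of a dataset parses as a JSON object and carries the required input field. The helper name validateDataset and the choice to skip blank lines are assumptions made for this example, not part of AgentMark.

```javascript
// Validate dataset text in JSONL form: one JSON object per line.
// Assumption: `input` is required and `expected_output` is optional,
// as described above. `validateDataset` is a hypothetical helper name.
function validateDataset(jsonlText) {
  const rows = [];
  const errors = [];
  jsonlText.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // this sketch tolerates blank lines
    let row;
    try {
      row = JSON.parse(line);
    } catch (err) {
      errors.push(`line ${index + 1}: invalid JSON`);
      return;
    }
    if (row === null || typeof row !== "object" || Array.isArray(row)) {
      errors.push(`line ${index + 1}: row must be a JSON object`);
    } else if (!("input" in row)) {
      errors.push(`line ${index + 1}: missing required "input"`);
    } else {
      rows.push(row);
    }
  });
  return { rows, errors };
}

// Three sample rows: the last one omits `input` and should be rejected.
const sampleDataset = [
  '{"input": {"text": "Great!"}, "expected_output": "positive"}',
  '{"input": {"text": "It arrived on Tuesday."}}',
  '{"expected_output": "negative"}',
].join("\n");

const checked = validateDataset(sampleDataset);
console.log(checked.rows.length, checked.errors);
```

For the sample above, the check keeps the first two rows and reports a missing input on line 3.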
Datasets live as JSONL files in your repo. The git deployment pipeline syncs them to AgentMark Cloud, where you select them when you create experiments and when you configure review queues.
The New Experiment dialog includes a Dataset field listing the datasets synced to your app. When you select a prompt, the dataset auto-fills from its test_settings frontmatter.
1. Create the dataset alongside your prompts, for example agentmark/datasets/sentiment.jsonl.
2. Reference it from a prompt. Set test_settings.dataset in the prompt frontmatter so the dialog can auto-fill it.
3. Deploy to sync. Push to your connected branch. The deployment pipeline syncs the dataset to AgentMark Cloud, where it appears in the dataset selector.
The dataset structure is identical to Local — see the Local tab for the JSONL schema, what to test, sizing guidance, held-out sets, and statistical significance.
Rows are appended to a synced dataset in two ways:
Save to dataset — during annotation review, save a reviewed trace’s input and output to the queue’s default dataset. Saved items are staged and committed when the queue is marked completed. See Human annotation.
REST API — POST a row to /v1/datasets/{datasetName}/rows. See Programmatic access in the Local tab for the request shape (the same endpoint serves Cloud and Local).
The dataset editor shows each JSONL row on its own line, with syntax highlighting for the input and expected_output fields. Add rows inline or upload a .jsonl file from disk.
datasets/
├── development.jsonl   # Use during iteration (60-70%)
├── validation.jsonl    # Check progress periodically (15-20%)
└── held-out.jsonl      # Final test before production (15-20%)
Critical rules:
Never iterate on held-out data
Don’t peek at held-out results during development
If you look at held-out results and make changes, create a new held-out set
Example workflow:
Week 1-2: Iterate on development set
  ├─ Test prompt v1 → 75% pass rate
  └─ Test prompt v2 → 82% pass rate
Week 3: Check validation set
  └─ Test prompt v2 → 79% pass rate (close to dev, good sign!)
Before deploy: Test held-out set
  └─ Test prompt v3 → 81% pass rate → Deploy if meets requirements
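One way to produce the three files is a seeded shuffle followed by a fixed split, so the held-out rows are chosen once and stay the same across reruns. The sketch below is illustrative only: splitDataset, seededRandom, and the 70/15/15 default proportions are assumptions for this example, and the generator is a small mulberry32-style PRNG rather than anything AgentMark provides.

```javascript
// Small seeded PRNG (mulberry32-style) so the split is reproducible.
function seededRandom(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Shuffle a copy of the rows (Fisher-Yates), then cut it into three parts.
// Hypothetical helper: the proportions default to 70% / 15% / 15%.
function splitDataset(rows, { seed = 42, development = 0.7, validation = 0.15 } = {}) {
  const random = seededRandom(seed);
  const shuffled = rows.slice();
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const devEnd = Math.round(shuffled.length * development);
  const valEnd = devEnd + Math.round(shuffled.length * validation);
  return {
    development: shuffled.slice(0, devEnd),
    validation: shuffled.slice(devEnd, valEnd),
    heldOut: shuffled.slice(valEnd),
  };
}

// Example: 100 placeholder rows split into 70 / 15 / 15.
const exampleRows = Array.from({ length: 100 }, (_, i) => ({ input: { id: i } }));
const splits = splitDataset(exampleRows, { seed: 7 });
console.log(splits.development.length, splits.validation.length, splits.heldOut.length);
```

Writing each part to its own .jsonl file (one JSON.stringify(row) per line) and committing the held-out file without inspecting it keeps the split consistent with the rules above.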
Statistical rigor: ~250 cases (95% confidence, 5% margin of error at a pass rate near 80%)
Production deployment: 100-300 cases minimum
High-stakes systems: 300+ cases
Why size matters: With 10 cases, one failure = 10% change. With 100 cases, one failure = 1% change. Research shows datasets with N ≤ 300 often overestimate performance.
Confidence intervals - Report uncertainty:
✅ “Pass rate: 85% [CI: 77%-91%]”
❌ “Pass rate: 85%”
Comparing prompts - Use paired comparisons on same dataset:
// For each test case, record if new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});
const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests real improvement
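Both ideas above can be made concrete with a few lines of arithmetic. The sketch below is one possible approach, not a prescribed method: wilsonInterval computes a Wilson score interval for a pass rate, and signTestPValue runs an exact two-sided sign test on the cases where the two prompts disagree. Both function names are hypothetical, and the choice of these two techniques is an assumption of this example.

```javascript
// Wilson score interval for a pass rate (z = 1.96 gives roughly 95%).
function wilsonInterval(passes, total, z = 1.96) {
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { low: center - half, high: center + half };
}

// Exact two-sided sign test on discordant pairs.
// newWins: cases only the new prompt passed; oldWins: cases only the old one passed.
// Ties (both pass or both fail) carry no information and are ignored.
// Intended for up to a few hundred discordant pairs.
function signTestPValue(newWins, oldWins) {
  const n = newWins + oldWins;
  if (n === 0) return 1;
  const k = Math.min(newWins, oldWins);
  let coeff = 1; // binomial coefficient C(n, 0)
  let tail = 0;
  for (let i = 0; i <= k; i++) {
    tail += coeff;
    coeff = (coeff * (n - i)) / (i + 1); // advance to C(n, i + 1)
  }
  return Math.min(1, (2 * tail) / Math.pow(2, n));
}

const interval = wilsonInterval(85, 100);
console.log(
  `Pass rate: 85% [CI: ${Math.round(interval.low * 100)}%-${Math.round(interval.high * 100)}%]`
);
console.log(signTestPValue(15, 4).toFixed(3));
```

With 85 passes out of 100 cases, the Wilson interval rounds to 77%-91%, matching the reporting example above. A small p-value from the sign test suggests the net improvement is unlikely to be noise; the threshold you apply (commonly 0.05) is your own choice.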
Power analysis - Determine how many samples you need before creating your dataset.
Power analysis answers: “How many test cases do I need to reliably detect a meaningful improvement?”
Key parameters:
Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
Significance level (α): Probability of false positive (typically 0.05 = 5%)
Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
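Practical rules of thumb:

| Minimum detectable difference | Required sample size (per group) |
| --- | --- |
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |

These round figures follow the quick heuristic n ≈ 1/d², where d is the difference expressed as a proportion. The two-proportion formula above is more conservative: at a baseline pass rate of 0.80 it asks for about 2.5 times as many cases (roughly 1,000 rather than 400 to detect a 5% difference).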
Why this matters:
If you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size based on the smallest improvement that matters to your application.
Practical approach:
// 1. Define minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate required sample size
// (calculateSampleSize is a placeholder for an implementation of the power formula above)
const alpha = 0.05; // 5% false positive rate
const power = 0.80; // 80% chance to detect real improvement
const baselineRate = 0.80; // Current pass rate
const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments
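The calculateSampleSize call above is not defined in this page. A minimal sketch of one possible implementation follows, assuming the simplified two-proportion formula given earlier. The helper names erf, normalCdf, and normalQuantile are hypothetical; the normal quantile is found by bisection over a standard polynomial approximation of the error function (Abramowitz and Stegun 7.1.26), which is accurate enough for sample-size planning.

```javascript
// Error function approximation (Abramowitz & Stegun 7.1.26, abs. error ~1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Inverse of normalCdf, found by bisection on [-10, 10].
function normalQuantile(prob) {
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < prob) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

// n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)², rounded up.
function calculateSampleSize(alpha, power, baselineRate, minImprovement) {
  const zAlpha = normalQuantile(1 - alpha / 2); // about 1.96 for alpha = 0.05
  const zBeta = normalQuantile(power); // about 0.84 for power = 0.80
  const variance = 2 * baselineRate * (1 - baselineRate);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / minImprovement ** 2);
}

console.log(calculateSampleSize(0.05, 0.8, 0.8, 0.05));
```

For the parameters in the walkthrough this returns a value close to the worked example (a little over 1,000); small differences come from using unrounded z-values and rounding up.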
You can list datasets and append new rows through the REST API, or from an IDE agent via the agentmark-mcp MCP server. Use either to pull dataset metadata into external tools or automate dataset ingestion.
# List all datasets from the local dev server
curl "http://localhost:9418/v1/datasets"

# List datasets from AgentMark Cloud
curl "https://api.agentmark.co/v1/datasets" \
  -H "Authorization: Bearer $AGENTMARK_API_KEY" \
  -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"
Datasets are keyed by file path, not UUID. To append a row, POST to /v1/datasets/{datasetName}/rows where datasetName is the dataset path without the .jsonl extension, URL-encoded (for example, evals/sentiment-test.jsonl → evals%2Fsentiment-test):
# Append a row via curl
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "X-Agentmark-App-Id: <APP_ID>" \
  -H "Content-Type: application/json" \
  -d '{"input": {"text": "Great!"}, "expected_output": "positive"}' \
  https://api.agentmark.co/v1/datasets/evals%2Fsentiment-test/rows
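The same request can be assembled from JavaScript. The sketch below only builds the URL and fetch options, so the path encoding described above can be checked without a network call. The helper names datasetNameFromPath and buildAppendRowRequest are hypothetical, and omitting the auth headers for the local dev server is an assumption that mirrors the list example.

```javascript
// Turn a dataset file path into the {datasetName} path segment:
// drop the .jsonl extension, then URL-encode (so "/" becomes "%2F").
function datasetNameFromPath(filePath) {
  return encodeURIComponent(filePath.replace(/\.jsonl$/, ""));
}

// Build the URL and fetch options for appending one row.
// apiKey and appId are optional here so the same helper can target
// the local dev server (assumed to need no auth headers).
function buildAppendRowRequest({ baseUrl, filePath, row, apiKey, appId }) {
  const headers = { "Content-Type": "application/json" };
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`;
  if (appId) headers["X-Agentmark-App-Id"] = appId;
  return {
    url: `${baseUrl}/v1/datasets/${datasetNameFromPath(filePath)}/rows`,
    options: { method: "POST", headers, body: JSON.stringify(row) },
  };
}

const request = buildAppendRowRequest({
  baseUrl: "https://api.agentmark.co",
  filePath: "evals/sentiment-test.jsonl",
  row: { input: { text: "Great!" }, expected_output: "positive" },
  apiKey: "<API_KEY>",
  appId: "<APP_ID>",
});
console.log(request.url);
// To send it: await fetch(request.url, request.options);
```

Keeping request construction separate from the fetch call makes it easy to point the same code at http://localhost:9418 during development and at the Cloud gateway afterwards.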
The local dev server and the AgentMark Cloud gateway both implement the datasets endpoints, so you can develop integrations locally before deploying. Use the capabilities endpoint to check which endpoints a given server supports.