The dataset editor shows each JSONL row on its own line, with syntax highlighting for the input and expected_output fields. Add rows inline or upload a .jsonl file from disk.
Quick start
1. Create a dataset file (agentmark/datasets/sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": ""}}
2. Link to your prompt (frontmatter):
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
---
<System>
Classify the sentiment
</System>
<User>{props.text}</User>
3. Run experiments:
npx @agentmark-ai/cli run-experiment agentmark/sentiment.prompt.mdx
Dataset structure
Each line must be valid JSON:
input (required) - Props passed to your prompt
expected_output (optional) - Expected result for evaluation
With expected output (enables evaluations):
{"input": {"text": "Great!", "category": "electronics"}, "expected_output": "positive"}
Without expected output (output-only mode):
{"input": {"text": "Great!", "category": "electronics"}}
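The row rules above (valid JSON per line, required input, optional expected_output) can be checked before a run. This is a minimal sketch: validateDatasetLine and validateDataset are hypothetical helpers, not part of AgentMark, and they only test for the presence of the fields described here.

```javascript
// Hypothetical helper (not an AgentMark API): checks one JSONL line
// against the row shape described above.
function validateDatasetLine(line) {
  let row;
  try {
    row = JSON.parse(line);
  } catch (err) {
    return { ok: false, reason: "not valid JSON" };
  }
  // The row itself must be a JSON object, not an array, string, or null.
  if (row === null || typeof row !== "object" || Array.isArray(row)) {
    return { ok: false, reason: "row must be a JSON object" };
  }
  // `input` is required.
  if (!("input" in row)) {
    return { ok: false, reason: "missing required field: input" };
  }
  // `expected_output` is optional; rows without it run in output-only mode.
  return { ok: true, hasExpectedOutput: "expected_output" in row };
}

// Validates the text of a whole .jsonl file, skipping blank lines and
// reporting the 1-based line number of every row.
function validateDataset(text) {
  return text
    .split("\n")
    .map((line, index) => ({ lineNumber: index + 1, line: line.trim() }))
    .filter(({ line }) => line.length > 0)
    .map(({ lineNumber, line }) => ({ lineNumber, ...validateDatasetLine(line) }));
}
```

Rows where ok is false can be reported with their lineNumber, which makes a malformed line easy to find in a large file.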
What to test
Common cases:
{"input": {"query": "What is AI?"}, "expected_output": "explanation"}
{"input": {"query": "Explain ML"}, "expected_output": "explanation"}
Edge cases:
{"input": {"text": ""}, "expected_output": "error"}
{"input": {"text": "a"}, "expected_output": "too_short"}
{"input": {"text": "Lorem ipsum... [5000 chars]"}, "expected_output": "truncated"}
Failure modes:
{"input": {"email": "invalid-email"}, "expected_output": "error: invalid email"}
{"input": {"amount": -100}, "expected_output": "error: amount must be positive"}
Real-world data - Use anonymized production data when possible.
LLM-assisted generation - Use LLMs to generate test cases, but have humans verify outputs before using them.
Expected-output types
Strings (classification):
{"input": {"text": "sunny day"}, "expected_output": "positive"}
Objects (structured data):
{"input": {"text": "John, john@example.com"}, "expected_output": {"name": "John", "email": "john@example.com"}}
Flexible (patterns, not exact matches):
{"input": {"topic": "AI"}, "expected_output": "explanation containing: artificial intelligence"}
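The three expectation types above need different comparison logic. The following is a minimal sketch, assuming a hypothetical matchesExpectation helper and treating "containing:" as a convention that your own evaluation code interprets. Neither is an AgentMark API.

```javascript
// Hypothetical evaluation helper. The "containing:" marker mirrors the
// flexible example row above; it is an assumed convention, not a built-in.
function matchesExpectation(output, expected) {
  const marker = "containing:";
  if (typeof expected === "string" && expected.includes(marker)) {
    // Everything after the marker is the phrase the output must mention.
    const phrase = expected.slice(expected.indexOf(marker) + marker.length).trim();
    return String(output).toLowerCase().includes(phrase.toLowerCase());
  }
  if (expected !== null && typeof expected === "object") {
    // Structured expectations: compare serialized forms.
    // Note that this simple check is sensitive to key order.
    return JSON.stringify(output) === JSON.stringify(expected);
  }
  // Plain strings (classification labels): exact match after trimming.
  return String(output).trim() === String(expected).trim();
}
```

A substring check is deliberately loose; stricter applications may prefer a schema check for objects or a scoring model for free text.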
Your evaluation function validates flexible expectations.
Dataset size
Start small (10-20 cases):
- 5-7 common scenarios
- 3-5 edge cases
- 2-3 failure modes
Scale based on needs:
- Initial development: 50-100 cases (recommended by Confident AI)
- Statistical significance: ~250 cases (for 95% confidence and a 5% margin of error at a pass rate near 80%; the worst case, a 50% pass rate, needs about 385)
- Production systems: 100-300 cases minimum
- High-stakes applications: 300+ cases
Quality > quantity. Start with 50-100 high-quality cases, then grow based on statistical power analysis and real-world findings.
Best practices
- One test case per line (valid JSONL)
- Use descriptive inputs that clearly show what’s being validated
- Version control datasets alongside prompts
- Avoid duplicates - each case should validate something unique
- Always anonymize data (never leak sensitive information)
Advanced: held-out test sets
Create separate datasets to avoid overfitting:
datasets/
├── development.jsonl # Use during iteration (60-70%)
├── validation.jsonl # Check progress periodically (15-20%)
└── held-out.jsonl # Final test before production (15-20%)
Critical rules:
- Never iterate on held-out data
- Don’t peek at held-out results during development
- If you look at held-out results and make changes, create a new held-out set
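One way to produce the three files described above is to assign each row to a split by hashing its content, so a row always lands in the same split no matter how often the script is rerun. This is a sketch under stated assumptions: the helper names are hypothetical, and the 70/15/15 cut-offs are one choice within the ranges given above.

```javascript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
// Used only to get a stable pseudo-random bucket per row.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Maps a JSONL line to a split name. Cut-offs are assumptions:
// 70% development, 15% validation, 15% held-out.
function assignSplit(line) {
  const bucket = fnv1a(line) % 100; // 0..99
  if (bucket < 70) return "development";
  if (bucket < 85) return "validation";
  return "held-out";
}

// Groups lines by split; every line ends up in exactly one group.
function splitDataset(lines) {
  const splits = { development: [], validation: [], "held-out": [] };
  for (const line of lines) {
    splits[assignSplit(line)].push(line);
  }
  return splits;
}
```

Because assignment depends only on the row's text, adding new rows later never moves an existing row out of the held-out set. The actual proportions are approximate and vary more on small datasets.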
Example workflow:
Week 1-2: Iterate on development set
├─ Test prompt v1 → 75% pass rate
└─ Test prompt v2 → 82% pass rate
Week 3: Check validation set
└─ Test prompt v2 → 79% pass rate (close to dev, good sign!)
Before deploy: Test held-out set
└─ Test prompt v3 → 81% pass rate → Deploy if meets requirements
Advanced: statistical significance
Sample size requirements (source):
- Quick iteration: 10-20 cases (directional feedback only)
- Initial development: 50-100 cases (industry standard)
- Statistical rigor: ~250 cases (95% confidence, 5% margin of error at a pass rate near 80%; about 385 in the worst case of 50%)
- Production deployment: 100-300 cases minimum
- High-stakes systems: 300+ cases
Why size matters: with 10 cases, one failure shifts the pass rate by 10 percentage points; with 100 cases, by 1 point. Research shows datasets with N ≤ 300 often overestimate performance.
Confidence intervals - Report uncertainty:
Pass rate: 85% (85 passed out of 100 tests)
Standard error: √(0.85 × 0.15 / 100) = 0.036
95% confidence interval: 85% ± 7% → [78%, 92%]
✅ “Pass rate: 85% [CI: 78%-92%]”
❌ “Pass rate: 85%”
Comparing prompts - Use paired comparisons on the same dataset:
// For each test case, record whether the new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
});
const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests real improvement
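The confidence-interval arithmetic and the paired comparison above can both be put into small helpers. This is a sketch with hypothetical function names. The interval uses the same normal approximation as the worked example. For the comparison it swaps in an exact two-sided sign test on the cases where the prompts disagree, a standard alternative to a fixed netImprovement threshold.

```javascript
// Normal-approximation interval for a pass rate, matching the arithmetic above.
// z = 1.96 corresponds to 95% confidence.
function passRateInterval(passed, total, z = 1.96) {
  const rate = passed / total;
  const standardError = Math.sqrt((rate * (1 - rate)) / total);
  return {
    rate,
    standardError,
    lower: Math.max(0, rate - z * standardError),
    upper: Math.min(1, rate + z * standardError),
  };
}

// Counts wins (1), losses (-1), and ties (0) in an improvements array
// shaped like the one built above.
function countOutcomes(improvements) {
  return {
    wins: improvements.filter(x => x === 1).length,
    losses: improvements.filter(x => x === -1).length,
    ties: improvements.filter(x => x === 0).length,
  };
}

// Exact two-sided sign test: if both prompts were equally good, each
// disagreement would be a win or a loss with probability 0.5.
function signTestPValue(wins, losses) {
  const n = wins + losses;
  if (n === 0) return 1; // no disagreements, no evidence either way
  const k = Math.min(wins, losses);
  let term = Math.pow(0.5, n); // P(X = 0) for X ~ Binomial(n, 0.5)
  let cumulative = term;
  for (let i = 1; i <= k; i++) {
    term = (term * (n - i + 1)) / i; // P(X = i) from P(X = i - 1)
    cumulative += term;
  }
  return Math.min(1, 2 * cumulative);
}
```

For the example above, passRateInterval(85, 100) gives a standard error of about 0.036 and bounds that round to 78% and 92%. A small p-value from signTestPValue (conventionally below 0.05) suggests the difference between prompts is unlikely to be noise; ties carry no information and are ignored.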
Power analysis - Determine how many samples you need before creating your dataset. It answers: “How many test cases do I need to reliably detect a meaningful improvement?” The smaller the improvement you want to catch, the more cases you need:
| Minimum detectable difference | Required sample size (per group) |
|---|---|
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |
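The round figures in the table are consistent with the rule of thumb n ≈ 1/d², where d is the difference to detect. The sketch below reproduces that rule and, for comparison, the usual normal-approximation formula for two independent pass rates. The function names are hypothetical, and the defaults (two-sided α = 0.05, 80% power) are assumptions that the table does not state.

```javascript
// Rule of thumb that reproduces the table: n ≈ 1 / d².
function ruleOfThumbSampleSize(difference) {
  return Math.round(1 / (difference * difference));
}

// Normal-approximation sample size per group for comparing two independent
// pass rates p1 and p2. Assumed defaults: zAlpha = 1.96 (two-sided
// alpha = 0.05) and zBeta = 0.8416 (80% power).
function twoProportionSampleSize(p1, p2, zAlpha = 1.96, zBeta = 0.8416) {
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const difference = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (difference * difference));
}
```

With these assumed defaults the formal estimate comes out larger than the rule of thumb (about 200 rather than 100 cases per group for 80% → 90%), so the table is best read as an order-of-magnitude guide. Paired comparisons on the same dataset generally need fewer cases than two independent groups.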
If you only have 50 test cases, you can only reliably detect large improvements (>15%); smaller improvements look like noise. Plan your dataset size around the smallest improvement that matters to your application before you start collecting cases.
Programmatic access
You can list datasets and append new rows through the REST API, or from an IDE agent via the agentmark-mcp MCP server. Use either to pull dataset metadata into external tools or automate dataset ingestion.
# List all datasets from the local dev server
curl "http://localhost:9418/v1/datasets"
# List datasets from AgentMark Cloud
curl "https://api.agentmark.co/v1/datasets" \
  -H "Authorization: Bearer $AGENTMARK_API_KEY" \
  -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"
Datasets are keyed by file path, not UUID. To append a row, POST to /v1/datasets/{datasetName}/rows where datasetName is the dataset path without the .jsonl extension, URL-encoded (for example, evals/sentiment-test.jsonl → evals%2Fsentiment-test):
# Append a row via curl
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "X-Agentmark-App-Id: <APP_ID>" \
  -H "Content-Type: application/json" \
  -d '{"input": {"text": "Great!"}, "expected_output": "positive"}' \
  https://api.agentmark.co/v1/datasets/evals%2Fsentiment-test/rows
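The same request can be made from a script. This is a minimal sketch that mirrors the curl call above: datasetRowsUrl and appendRow are hypothetical helper names, the code assumes a runtime with a global fetch (such as Node.js 18 or later), and it returns the raw response because the response body format is not described here.

```javascript
// Builds the row-append URL from a dataset file path, following the rule
// above: drop the .jsonl extension, then URL-encode the remaining path.
function datasetRowsUrl(baseUrl, datasetPath) {
  const datasetName = datasetPath.replace(/\.jsonl$/, "");
  return `${baseUrl}/v1/datasets/${encodeURIComponent(datasetName)}/rows`;
}

// Appends one row using the same headers and body as the curl request.
async function appendRow(baseUrl, datasetPath, row, apiKey, appId) {
  const response = await fetch(datasetRowsUrl(baseUrl, datasetPath), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "X-Agentmark-App-Id": appId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(row),
  });
  if (!response.ok) {
    throw new Error(`Append failed with HTTP ${response.status}`);
  }
  return response;
}
```

Building the URL in one place keeps the encoding rule from being reimplemented, and possibly gotten wrong, at each call site.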
The local dev server and the AgentMark Cloud gateway both implement the datasets endpoints, so you can develop integrations locally before deploying. Use the capabilities endpoint to check which endpoints a given server supports.