```
datasets/
├── development.jsonl   # Use during iteration (60-70%)
├── validation.jsonl    # Check progress periodically (15-20%)
└── held-out.jsonl      # Final test before production (15-20%)
```
Critical rules:
- Never iterate on held-out data
- Don't peek at held-out results during development
- If you look at held-out results and make changes, create a new held-out set
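The three-way split above can be sketched as a deterministic shuffle-and-slice. This is illustrative only: the 70/15/15 ratios, the `splitDataset` name, and the mulberry32 PRNG are assumptions, not part of any framework.

```typescript
// Split labeled test cases into development / validation / held-out sets.
// A seeded PRNG makes the split reproducible, so the held-out set stays fixed
// across runs (important for the "never iterate on held-out data" rule).
function splitDataset<T>(
  cases: T[],
  seed = 42
): { development: T[]; validation: T[]; heldOut: T[] } {
  // mulberry32: a small deterministic PRNG returning values in [0, 1)
  let s = seed >>> 0;
  const rand = () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = Math.imul(s ^ (s >>> 15), 1 | s);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };

  // Fisher-Yates shuffle on a copy, then slice at 70% and 85%
  const shuffled = [...cases];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const devEnd = Math.floor(shuffled.length * 0.7);
  const valEnd = Math.floor(shuffled.length * 0.85);
  return {
    development: shuffled.slice(0, devEnd),
    validation: shuffled.slice(devEnd, valEnd),
    heldOut: shuffled.slice(valEnd),
  };
}
```

Writing each slice out to its own `.jsonl` file (one JSON record per line) then matches the directory layout above.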
Example workflow:
```
Week 1-2: Iterate on development set
├─ Test prompt v1 → 75% pass rate
└─ Test prompt v2 → 82% pass rate

Week 3: Check validation set
└─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

Before deploy: Test held-out set
└─ Test prompt v3 → 81% pass rate → Deploy if meets requirements
```
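The Week 3 sanity check can be expressed as a simple gap test: a large spread between development and validation pass rates suggests the prompt has overfit to the development set. The 5-point threshold here is an illustrative choice, not a rule from any framework.

```typescript
// Compare dev vs. validation pass rates; a small gap is a good sign.
function devValGap(devPassRate: number, valPassRate: number, threshold = 0.05): string {
  const gap = devPassRate - valPassRate;
  const points = (gap * 100).toFixed(0);
  if (gap > threshold) {
    return `Gap of ${points} points: likely overfitting to the dev set`;
  }
  return `Gap of ${points} points: dev and validation agree`;
}

// 82% on dev vs. 79% on validation, as in the workflow above
console.log(devValGap(0.82, 0.79));
```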
- Statistical rigor: ~250 cases gives roughly a 5% margin of error at 95% confidence (assuming a pass rate near 80%)
- Production deployment: 100-300 cases minimum
- High-stakes systems: 300+ cases
Why size matters: with 10 cases, one failure moves the pass rate by 10 percentage points; with 100 cases, by only 1 point. Research suggests datasets with N ≤ 300 often overestimate performance.

Confidence intervals - Report uncertainty:

✅ "Pass rate: 85% [CI: 77%-91%]"
❌ "Pass rate: 85%"

Comparing prompts - Use paired comparisons on the same dataset:
```typescript
// For each test case, record whether the new prompt performed better
const improvements = testCases.map(tc => {
  const oldPassed = evaluateOld(tc);
  const newPassed = evaluateNew(tc);
  return newPassed && !oldPassed ? 1 : oldPassed && !newPassed ? -1 : 0;
});
const netImprovement = improvements.reduce((a, b) => a + b, 0);
// netImprovement > 10 with 100 cases suggests a real improvement
```
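A pass-rate interval like the one recommended earlier can be computed with the Wilson score method. This is a sketch independent of any particular eval framework; with 85 passes out of 100 it reproduces the "85% [CI: 77%-91%]" example.

```typescript
// Wilson score interval for a binomial pass rate.
// z = 1.96 corresponds to a 95% confidence level.
function wilsonInterval(passed: number, total: number, z = 1.96): [number, number] {
  const p = passed / total;
  const denom = 1 + (z * z) / total;
  const center = (p + (z * z) / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + (z * z) / (4 * total * total))) / denom;
  return [center - half, center + half];
}

const [lo, hi] = wilsonInterval(85, 100);
console.log(`Pass rate: 85% [CI: ${Math.round(lo * 100)}%-${Math.round(hi * 100)}%]`);
// → Pass rate: 85% [CI: 77%-91%]
```

The Wilson interval behaves better than the naive normal approximation when pass rates are near 0% or 100%, which is common in evals.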
Power analysis - Determine how many samples you need before creating your dataset.

Power analysis answers: "How many test cases do I need to reliably detect a meaningful improvement?"

Key parameters:
- Effect size: Minimum improvement you want to detect (e.g., 5% better pass rate)
- Significance level (α): Probability of false positive (typically 0.05 = 5%)
- Statistical power (1-β): Probability of detecting real improvement (typically 0.80 = 80%)
Formula for binary outcomes (pass/fail):
```
// Simplified formula for comparing two proportions
n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

// Example: Detect 5% improvement with 80% power, 95% confidence
// Assuming baseline pass rate p = 0.80
n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
n ≈ 7.84 × 0.32 / 0.0025
n ≈ 1,003 test cases
```
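The formula can be turned into a small runnable helper. The `sampleSize` name is hypothetical, and the z-scores are the usual rounded approximations (1.96 for α = 0.05 two-sided, 0.84 for 80% power), so the result lands within a case or two of the hand calculation above.

```typescript
// n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / d², rounded up to a whole test case
function sampleSize(p: number, effectSize: number, zAlpha = 1.96, zBeta = 0.84): number {
  return Math.ceil(((zAlpha + zBeta) ** 2 * 2 * p * (1 - p)) / effectSize ** 2);
}

// Detect a 5% improvement over an 80% baseline (80% power, 95% confidence)
console.log(sampleSize(0.80, 0.05)); // → 1004
```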
Practical rules of thumb:

| Minimum detectable difference | Required sample size (per group) |
|---|---|
| 10% (e.g., 80% → 90%) | ~100 samples |
| 5% (e.g., 80% → 85%) | ~400 samples |
| 2% (e.g., 80% → 82%) | ~2,500 samples |
| 1% (e.g., 80% → 81%) | ~10,000 samples |
Why this matters: if you only have 50 test cases, you can only reliably detect large improvements (>15%). Smaller improvements will look like noise. Plan your dataset size around the smallest improvement that matters to your application.

Practical approach:
```typescript
// 1. Define the minimum improvement you care about
const minImprovement = 0.05; // 5% better pass rate

// 2. Calculate the required sample size
const alpha = 0.05;        // 5% false positive rate
const power = 0.80;        // 80% chance to detect a real improvement
const baselineRate = 0.80; // Current pass rate

const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

// 3. Collect that many test cases before running experiments
```
You can list and retrieve datasets via the REST API or the agentmark api CLI command. Use this to pull dataset metadata into external tools or automate dataset management workflows.
```shell
# List all datasets
agentmark api datasets list

# List datasets from the cloud gateway
agentmark api datasets list --remote

# Get a specific dataset by ID
agentmark api datasets get <datasetId>
```
The local dev server and cloud gateway support the same endpoints, so you can develop integrations locally before deploying. Use the capabilities endpoint to check feature availability.