Running experiments in the AgentMark Dashboard and CLI

Run prompts against datasets with automatic evaluation to validate quality and consistency. The animation shows npx agentmark run-experiment executing against a dataset: each row is processed, the AI output is scored, and a results table prints to stdout with pass/fail status per evaluator.

CLI usage

Quick start

npx agentmark run-experiment agentmark/classifier.prompt.mdx
Requirements:
  • Dataset configured in prompt frontmatter
  • Development server running (npx agentmark dev)
  • Optional: Evaluation functions defined
Keep npx agentmark dev running in a separate terminal. The run-experiment command talks to it on port 9417.

Full command signature

npx agentmark run-experiment <filepath> [options]

Options:
  --server <url>        Webhook server URL (default: http://localhost:9417)
  --skip-eval           Skip running evals even if they exist
  --format <format>     Output format: table, csv, json, or jsonl (default: table)
  --threshold <percent> Fail if pass percentage is below threshold (0-100)
  --truncate <chars>    Truncate long cells in table output (default 1000; 0 = unlimited)

Dataset sampling (pick at most one):
  --sample <percent>    Run on a random N% of rows (1-100)
  --rows <spec>         Select specific rows by index or range (e.g., 0,3-5,9)
  --split <spec>        Train/test split (e.g., train:80 or test:80)
  --seed <number>       Seed for reproducible sampling/splitting
The --server flag defaults to the AGENTMARK_WEBHOOK_URL environment variable if set, otherwise http://localhost:9417.

Command options

Skip evaluations (output-only mode):
npx agentmark run-experiment agentmark/test.prompt.mdx --skip-eval
Output format:
npx agentmark run-experiment agentmark/test.prompt.mdx --format table   # Default
npx agentmark run-experiment agentmark/test.prompt.mdx --format csv     # Spreadsheets
npx agentmark run-experiment agentmark/test.prompt.mdx --format json    # Structured
npx agentmark run-experiment agentmark/test.prompt.mdx --format jsonl   # Line-delimited
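To make the format choices concrete, here is a small sketch of how the same results render as JSONL versus CSV. The ResultRow shape is hypothetical, chosen for illustration; the CLI's actual column set may differ.

```typescript
// Hypothetical result shape for illustration; the CLI's actual schema may differ.
interface ResultRow {
  input: string;
  output: string;
  passed: boolean;
}

const results: ResultRow[] = [
  { input: "I love it", output: "positive", passed: true },
  { input: "Terrible", output: "negative", passed: true },
];

// jsonl: one JSON object per line, easy to stream and append
const jsonl = results.map((r) => JSON.stringify(r)).join("\n");

// csv: a header row plus one comma-separated line per result
const csv = [
  "input,output,passed",
  ...results.map((r) => `${r.input},${r.output},${r.passed}`),
].join("\n");

console.log(jsonl);
console.log(csv);
```

JSONL suits streaming and log pipelines (each line stands alone); CSV suits spreadsheets but needs escaping once values contain commas or quotes.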
Pass rate threshold (CI/CD):
npx agentmark run-experiment agentmark/test.prompt.mdx --threshold 85
Exits with a non-zero code if the pass rate falls below the threshold. Requires evaluations that return a passed field.
Dataset sampling (see Dataset sampling below):
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20
npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80
Custom server:
npx agentmark run-experiment agentmark/test.prompt.mdx --server http://staging:9417
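The --threshold gate described above can be sketched as follows. This is illustrative, not the CLI's published internals; the EvalResult shape assumes only the documented passed field.

```typescript
// Hypothetical sketch of the --threshold check; EvalResult is an assumed shape.
interface EvalResult {
  passed: boolean;
}

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return (results.filter((r) => r.passed).length / results.length) * 100;
}

// Mirrors the documented behavior: non-zero exit code when below threshold.
function exitCodeForThreshold(results: EvalResult[], threshold: number): number {
  return passRate(results) >= threshold ? 0 : 1;
}

const sample = [{ passed: true }, { passed: true }, { passed: false }];
console.log(passRate(sample).toFixed(1)); // "66.7"
console.log(exitCodeForThreshold(sample, 85)); // 1
```

In CI/CD, that non-zero exit code is what fails the pipeline step.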

Dataset sampling

Run experiments on a subset of your dataset without modifying the dataset file. The three sampling modes are mutually exclusive; use only one per run.
Random sample (--sample <percent>): Run on a random N% of rows. Useful for quick smoke tests against large datasets.
# Run on ~20% of rows (random, non-reproducible)
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20

# Reproducible: same 20% every time
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20 --seed 42
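A seeded random sample can be sketched like this. The CLI's actual PRNG is unspecified; mulberry32 is used here purely to illustrate how a seed makes the selection deterministic.

```typescript
// mulberry32: a tiny deterministic PRNG (illustrative; not the CLI's actual PRNG).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function sampleRows<T>(rows: T[], percent: number, seed: number): T[] {
  const rand = mulberry32(seed);
  const count = Math.max(1, Math.round((rows.length * percent) / 100));
  // Deterministic Fisher-Yates shuffle over indices, then take the first `count`.
  const idx = rows.map((_, i) => i);
  for (let i = idx.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, count).sort((a, b) => a - b).map((i) => rows[i]);
}

const demo = sampleRows([...Array(10).keys()], 20, 42);
console.log(demo); // same two rows on every run with seed 42
```

The key property is that the same seed always selects the same rows, which is what --seed guarantees.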
Specific rows (--rows <spec>): Select individual rows by zero-based index. Supports comma-separated indices and ranges.
# Row 0 only
npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0

# Rows 0, 3, 4, 5, and 9
npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
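One way the rows spec can be interpreted is sketched below. The parser name and behavior on edge cases are assumptions for illustration, not the CLI's actual implementation.

```typescript
// Hypothetical sketch: expanding a rows spec like "0,3-5,9" into indices.
function parseRowsSpec(spec: string): number[] {
  const indices = new Set<number>();
  for (const part of spec.split(",")) {
    const range = part.split("-");
    if (range.length === 2) {
      // "3-5" is an inclusive range
      const start = Number(range[0]);
      const end = Number(range[1]);
      for (let i = start; i <= end; i++) indices.add(i);
    } else {
      // "0" is a single zero-based index
      indices.add(Number(part));
    }
  }
  return [...indices].sort((a, b) => a - b);
}

console.log(parseRowsSpec("0,3-5,9")); // [0, 3, 4, 5, 9]
```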
Train/test split (--split <spec>): Split the dataset into train and test portions. Run only the train portion or only the test portion.
# Run on the first 80% (train portion), positional split
npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80

# Run on the remaining 20% (test portion), positional split
npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80

# Seeded split — random assignment, reproducible across runs
npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80 --seed 42
npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80 --seed 42
Without --seed, --split uses positional assignment: the first N% of rows are “train” and the rest are “test”. With --seed, each row is assigned to train or test by a deterministic hash — the order in the file does not matter.
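The order-independent seeded split described above can be sketched with a content hash. The CLI's actual hash function is not documented; FNV-1a is used here only to show the idea that assignment depends on row content and seed, not position.

```typescript
// Illustrative only: FNV-1a string hash mapped to [0, 1).
function hash01(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 4294967296;
}

function seededSplit<T>(
  rows: T[],
  trainPercent: number,
  seed: number
): { train: T[]; test: T[] } {
  const train: T[] = [];
  const test: T[] = [];
  for (const row of rows) {
    // Assignment depends only on row content and seed, not file order.
    const bucket = hash01(`${seed}:${JSON.stringify(row)}`);
    (bucket < trainPercent / 100 ? train : test).push(row);
  }
  return { train, test };
}
```

Note that a per-row hash yields an expected ratio (roughly 80/20 for train:80), not an exact one on small datasets; that is typical of hash-based splits.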
Reproducibility with --seed: The --seed flag guarantees the same rows are selected every time, across TypeScript and Python. Pass the same seed to get identical results on any machine or language runtime.
# These two runs always process the exact same rows
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
Use --seed in CI/CD pipelines to prevent flaky results from random row selection.

Output example

#  Input                  AI Result  Expected Output  sentiment_check
1  {"text":"I love it"}   positive   positive         PASS (1.00)
2  {"text":"Terrible"}    negative   negative         PASS (1.00)
3  {"text":"It's okay"}   neutral    neutral          PASS (1.00)
Summary:
Pass rate: 100% (3/3 passed)
The CLI supports both .mdx source files and pre-built .json files (from npx agentmark build). Media outputs (images, audio) are saved to .agentmark-outputs/ with clickable file paths.

How it works

The run-experiment command:
  1. Loads your prompt file (.mdx or pre-built .json) and parses the frontmatter
  2. Reads the dataset specified in test_settings.dataset
  3. Sends the prompt and dataset to the dev server (default: http://localhost:9417)
  4. The server runs the prompt against each dataset row
  5. Evaluates results using the evals specified in test_settings.evals
  6. Streams results back to the CLI as they complete
  7. Displays formatted output (table, CSV, JSON, or JSONL)

Configuration

Link dataset and evals in prompt frontmatter:
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
---

<System>Classify the sentiment</System>
<User>{props.text}</User>
You can also provide default props via test_settings.props:
test_settings:
  props:
    language: en
    verbose: false
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
Props from each dataset row override the defaults.
Dataset (sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": "It's okay"}, "expected_output": "neutral"}
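The override rule for default props can be pictured as a shallow merge. This is a sketch of the documented behavior, assuming a shallow (not deep) merge; the values below are hypothetical.

```typescript
// Defaults from test_settings.props; row props come from the dataset line.
const defaultProps = { language: "en", verbose: false };
const rowProps = { text: "I love this!", language: "fr" };

// Row props win on conflicts; defaults fill the gaps.
const effectiveProps = { ...defaultProps, ...rowProps };
console.log(effectiveProps);
// { language: 'fr', verbose: false, text: 'I love this!' }
```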
Learn more about datasets →
Learn more about evals →

Workflow

1. Develop prompts - Iterate on your prompt design
2. Create datasets - Add test cases covering your scenarios
3. Write evaluations - Define success criteria
4. Run experiments - Test against dataset
npx agentmark run-experiment agentmark/my-prompt.prompt.mdx
5. Review results - Identify failures and patterns
6. Iterate - Fix issues, improve prompts, add test cases
7. Deploy with confidence - Pass rate meets your threshold

SDK usage

Run experiments programmatically using formatWithDataset():
import { client } from './agentmark-client';
import { generateText } from 'ai';  // Or your adapter's generation function

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

// Returns a stream of formatted inputs from the dataset
const datasetStream = await prompt.formatWithDataset();

// Process each test case
for await (const item of datasetStream) {
  const { dataset, formatted, evals } = item;

  // Run the prompt with your AI SDK
  const result = await generateText(formatted);

  // Check results
  const passed = result.text === dataset.expected_output;
  console.log(`Input: ${JSON.stringify(dataset.input)}`);
  console.log(`Expected: ${dataset.expected_output}`);
  console.log(`Got: ${result.text}`);
  console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
}
The stream returns objects with:
  • dataset - The test case (input and expected_output)
  • formatted - The formatted prompt ready for your AI SDK
  • evals - List of evaluation names to run
  • type - Always "dataset"
Options (FormatWithDatasetOptions):
  • datasetPath?: string - Override dataset from frontmatter
  • format?: 'ndjson' | 'json' - Buffer all rows ('json') or stream as available ('ndjson', default)
When to use:
  • Custom test logic in your test framework
  • Fine-grained control over test execution
  • Integrating with existing test infrastructure
  • Running experiments in application code

Troubleshooting

CLI issues

Dataset not found:
  • Check dataset path in frontmatter
  • Verify file exists and is valid JSONL
Server connection error:
  • Ensure npx agentmark dev is running
  • Check ports are available (default webhook port: 9417)
  • Verify --server URL if using a custom server
Invalid dataset format:
  • Each line must be valid JSON
  • Required: input field
  • Optional: expected_output field
No evaluations ran:
  • Add evals to test_settings in frontmatter
  • Or use --skip-eval flag for output-only mode
Threshold check failed:
  • The --threshold flag requires evals that return a passed field
  • Verify your eval functions return { passed: true/false, ... }
Sampling options conflict:
  • Only one of --sample, --rows, or --split may be used at a time
  • --seed can be combined with any of them

Programmatic access

You can query experiment results, run traces, and prompt file listings via the REST API or the agentmark api CLI command. Use this to build custom reporting, export results to external tools, or integrate experiment data into CI/CD pipelines.
# List experiments
npx agentmark api experiments list --limit 10

# Get a specific experiment, including its runs and evaluation results
npx agentmark api experiments get <experimentId>

# List traces for a specific experiment run — filter `/v1/traces` by `dataset_run_id`
# (the former `/v1/runs/{runId}/traces` endpoint is deprecated; both paths hit the
# same predicate, but the filter approach works on Cloud + Local without a second
# endpoint).
curl "http://localhost:9418/v1/traces?dataset_run_id=<runId>"

# List prompt files in the current project
npx agentmark api prompts list --limit 10
# Equivalent curl against the local dev server
curl http://localhost:9418/v1/experiments?limit=10
  • experiments ships on Cloud + Local.
  • prompts is Local-only today; Cloud returns 501 not_available_on_cloud.
  • The legacy /v1/runs/{runId}/traces endpoint is deprecated but still works on Local for backwards compatibility; use /v1/traces?dataset_run_id=… in new code.
  • Call GET /v1/capabilities to check which features a server supports at runtime.

Next steps

Datasets

Create test datasets

Evaluations

Write evaluation functions

Testing overview

Learn testing concepts
