Running Experiments

Run prompts against datasets with automatic evaluation to validate quality and consistency.

CLI Usage

Quick Start

agentmark run-experiment agentmark/classifier.prompt.mdx
Requirements:
  • Dataset configured in prompt frontmatter
  • Development server running (agentmark dev)
  • Optional: Evaluation functions defined

Full Command Signature

agentmark run-experiment <filepath> [options]

Options:
  --server <url>        Webhook server URL (default: http://localhost:9417)
  --skip-eval           Skip running evals even if they exist
  --format <format>     Output format: table, csv, json, or jsonl (default: table)
  --threshold <percent> Fail if pass percentage is below threshold (0-100)

Dataset Sampling (pick at most one):
  --sample <percent>    Run on a random N% of rows (1-100)
  --rows <spec>         Select specific rows by index or range (e.g., 0,3-5,9)
  --split <spec>        Train/test split (e.g., train:80 or test:80)
  --seed <number>       Seed for reproducible sampling/splitting
The --server flag defaults to the AGENTMARK_WEBHOOK_URL environment variable if set, otherwise http://localhost:9417.

Command Options

Skip evaluations (output-only mode):
agentmark run-experiment agentmark/test.prompt.mdx --skip-eval
Output format:
agentmark run-experiment agentmark/test.prompt.mdx --format table   # Default
agentmark run-experiment agentmark/test.prompt.mdx --format csv     # Spreadsheets
agentmark run-experiment agentmark/test.prompt.mdx --format json    # Structured
agentmark run-experiment agentmark/test.prompt.mdx --format jsonl   # Line-delimited
Pass rate threshold (CI/CD):
agentmark run-experiment agentmark/test.prompt.mdx --threshold 85
Exits with a non-zero code if the pass rate falls below the threshold. Requires evaluations that return a passed field.
Dataset sampling (see Dataset Sampling below):
agentmark run-experiment agentmark/test.prompt.mdx --sample 20
agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
agentmark run-experiment agentmark/test.prompt.mdx --split train:80
Custom server:
agentmark run-experiment agentmark/test.prompt.mdx --server http://staging:9417

Dataset Sampling

Run experiments on a subset of your dataset without modifying the dataset file. The three sampling modes are mutually exclusive — use only one per run.

Random sample (--sample <percent>): Run on a random N% of rows. Useful for quick smoke tests against large datasets.
# Run on ~20% of rows (random, non-reproducible)
agentmark run-experiment agentmark/test.prompt.mdx --sample 20

# Reproducible: same 20% every time
agentmark run-experiment agentmark/test.prompt.mdx --sample 20 --seed 42
Specific rows (--rows <spec>): Select individual rows by zero-based index. Supports comma-separated indices and ranges.
# Row 0 only
agentmark run-experiment agentmark/test.prompt.mdx --rows 0

# Rows 0, 3, 4, 5, and 9
agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
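To make the spec's semantics concrete, here is a sketch of how a `--rows` value such as `0,3-5,9` expands into zero-based indices. The parsing code itself is an illustration, not the CLI's actual implementation; only the spec format (comma-separated indices and inclusive ranges) comes from this page.

```typescript
// Sketch: expand a --rows spec like "0,3-5,9" into sorted, unique row indices.
// Illustration only — not the CLI's real parser.
function expandRowSpec(spec: string): number[] {
  const indices = new Set<number>();
  for (const part of spec.split(',')) {
    const range = part.split('-');
    if (range.length === 2) {
      // Inclusive range, e.g. "3-5" selects rows 3, 4, and 5
      const start = Number(range[0]);
      const end = Number(range[1]);
      for (let i = start; i <= end; i++) indices.add(i);
    } else {
      indices.add(Number(part));
    }
  }
  return [...indices].sort((a, b) => a - b);
}

// "0,3-5,9" selects rows 0, 3, 4, 5, and 9
console.log(expandRowSpec('0,3-5,9'));
```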
Train/test split (--split <spec>): Split the dataset into train and test portions. Run only the train portion or only the test portion.
# Run on the first 80% (train portion), positional split
agentmark run-experiment agentmark/test.prompt.mdx --split train:80

# Run on the remaining 20% (test portion), positional split
agentmark run-experiment agentmark/test.prompt.mdx --split test:80

# Seeded split — random assignment, reproducible across runs
agentmark run-experiment agentmark/test.prompt.mdx --split train:80 --seed 42
agentmark run-experiment agentmark/test.prompt.mdx --split test:80 --seed 42
Without --seed, --split uses positional assignment: the first N% of rows are “train” and the rest are “test”. With --seed, each row is assigned to train or test by a deterministic hash — the order in the file does not matter.
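The idea behind seeded, order-independent assignment can be sketched as follows. The actual hash the CLI uses is not documented; this sketch substitutes FNV-1a purely to illustrate why row order stops mattering: each row's bucket depends only on its content and the seed.

```typescript
// Sketch: deterministic, order-independent train/test assignment.
// The real CLI's hash function is undocumented; FNV-1a is an assumption.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// A row lands in "train" or "test" based only on its content and the seed,
// so shuffling the dataset file does not change the split.
function assignSplit(row: string, seed: number, trainPercent: number): 'train' | 'test' {
  const bucket = fnv1a(`${seed}:${row}`) % 100; // 0-99
  return bucket < trainPercent ? 'train' : 'test';
}

const rows = ['{"input":{"text":"a"}}', '{"input":{"text":"b"}}'];
rows.forEach((r) => console.log(assignSplit(r, 42, 80)));
```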
Reproducibility with --seed: The --seed flag guarantees the same rows are selected every time, across TypeScript and Python. Pass the same seed to get identical results on any machine or language runtime.
# These two runs always process the exact same rows
agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
Use --seed in CI/CD pipelines to prevent flaky results from random row selection.

Output Example

#  Input                  AI Result  Expected Output  sentiment_check
1  {"text":"I love it"}   positive   positive         PASS (1.00)
2  {"text":"Terrible"}    negative   negative         PASS (1.00)
3  {"text":"It's okay"}   neutral    neutral          PASS (1.00)
Summary:
Pass rate: 100% (3/3 passed)
The CLI supports both .mdx source files and pre-built .json files (from agentmark build). Media outputs (images, audio) are saved to .agentmark-outputs/ with clickable file paths.

How It Works

The run-experiment command:
  1. Loads your prompt file (.mdx or pre-built .json) and parses the frontmatter
  2. Reads the dataset specified in test_settings.dataset
  3. Sends the prompt and dataset to the dev server (default: http://localhost:9417)
  4. The server runs the prompt against each dataset row
  5. Evaluates results using the evals specified in test_settings.evals
  6. Streams results back to the CLI as they complete
  7. Displays formatted output (table, CSV, JSON, or JSONL)
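Step 6 delivers results as a line-delimited stream, so partial lines can arrive split across network chunks. A minimal sketch of consuming such a stream (the `row`/`passed` fields in the sample lines are assumptions, not the server's actual schema):

```typescript
// Sketch: parse a line-delimited (NDJSON) result stream as chunks arrive.
// The result fields below are hypothetical; the server's schema is not
// documented on this page.
function parseNdjsonChunks(chunks: string[]): Array<Record<string, unknown>> {
  const results: Array<Record<string, unknown>> = [];
  let buffer = '';
  for (const chunk of chunks) {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep any partial trailing line for the next chunk
    for (const line of lines) {
      if (line.trim()) results.push(JSON.parse(line));
    }
  }
  if (buffer.trim()) results.push(JSON.parse(buffer));
  return results;
}

// Chunk boundaries need not align with line boundaries.
const parsed = parseNdjsonChunks(['{"row":1,"pa', 'ssed":true}\n{"row"', ':2,"passed":false}\n']);
console.log(parsed.length); // 2
```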

Configuration

Link dataset and evals in prompt frontmatter:
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
---

<System>Classify the sentiment</System>
<User>{props.text}</User>
You can also provide default props via test_settings.props:
test_settings:
  props:
    language: en
    verbose: false
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
Props from each dataset row override the defaults.

Dataset (sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": "It's okay"}, "expected_output": "neutral"}
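Putting the two pieces together, a sketch of how default props and a dataset row's input might combine, with row values winning on conflict. The shallow-spread merge shown here is an assumption about the implementation; only the override behavior is stated by the docs.

```typescript
// Sketch: merge test_settings.props defaults with a dataset row's input.
// Row values override defaults. The shallow-merge strategy is an assumption.
interface DatasetRow {
  input: Record<string, unknown>; // required on every JSONL line
  expected_output?: string;       // optional
}

const defaults = { language: 'en', verbose: false }; // from test_settings.props

function resolveProps(row: DatasetRow): Record<string, unknown> {
  return { ...defaults, ...row.input };
}

const row: DatasetRow = { input: { text: 'I love this!', verbose: true }, expected_output: 'positive' };
// language comes from defaults; verbose is overridden by the row
console.log(resolveProps(row));
```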
Learn more about datasets →
Learn more about evals →

Workflow

1. Develop prompts - Iterate on your prompt design
2. Create datasets - Add test cases covering your scenarios
3. Write evaluations - Define success criteria
4. Run experiments - Test against dataset
agentmark run-experiment agentmark/my-prompt.prompt.mdx
5. Review results - Identify failures and patterns
6. Iterate - Fix issues, improve prompts, add test cases
7. Deploy with confidence - Pass rate meets your threshold

SDK Usage

Run experiments programmatically using formatWithDataset():
import { client } from './agentmark-client';
import { generateText } from 'ai';  // Or your adapter's generation function

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

// Returns a stream of formatted inputs from the dataset
const datasetStream = await prompt.formatWithDataset();

// Process each test case
for await (const item of datasetStream) {
  const { dataset, formatted, evals } = item;

  // Run the prompt with your AI SDK
  const result = await generateText(formatted);

  // Check results
  const passed = result.text === dataset.expected_output;
  console.log(`Input: ${JSON.stringify(dataset.input)}`);
  console.log(`Expected: ${dataset.expected_output}`);
  console.log(`Got: ${result.text}`);
  console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
}
The stream returns objects with:
  • dataset - The test case (input and expected_output)
  • formatted - The formatted prompt ready for your AI SDK
  • evals - List of evaluation names to run
  • type - Always "dataset"
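Building on the loop above, you might tally results into the same pass-rate summary the CLI prints. The aggregation below is our own illustration, not part of the SDK:

```typescript
// Sketch: tally pass/fail outcomes collected inside the for-await loop,
// e.g. results.push(result.text === dataset.expected_output).
function summarize(results: boolean[]): { passed: number; total: number; passRate: number } {
  const passed = results.filter(Boolean).length;
  const total = results.length;
  return { passed, total, passRate: total === 0 ? 0 : (passed / total) * 100 };
}

console.log(summarize([true, true, false]));
```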
Options (FormatWithDatasetOptions):
  • datasetPath?: string - Override dataset from frontmatter
  • format?: 'ndjson' | 'json' - Buffer all rows ('json') or stream as available ('ndjson', default)
When to use:
  • Custom test logic in your test framework
  • Fine-grained control over test execution
  • Integrating with existing test infrastructure
  • Running experiments in application code

Troubleshooting

CLI Issues

Dataset not found:
  • Check dataset path in frontmatter
  • Verify file exists and is valid JSONL
Server connection error:
  • Ensure agentmark dev is running
  • Check ports are available (default webhook port: 9417)
  • Verify --server URL if using a custom server
Invalid dataset format:
  • Each line must be valid JSON
  • Required: input field
  • Optional: expected_output field
No evaluations ran:
  • Add evals to test_settings in frontmatter
  • Or use --skip-eval flag for output-only mode
Threshold check failed:
  • The --threshold flag requires evals that return a passed field
  • Verify your eval functions return { passed: true/false, ... }
Sampling options conflict:
  • Only one of --sample, --rows, or --split may be used at a time
  • --seed can be combined with any of them
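For the threshold check above, an eval satisfying the { passed } contract might look like the following sketch. The function signature and score field are hypothetical; only the requirement that evals return a passed field comes from the docs.

```typescript
// Sketch of an evaluation returning the `passed` field that --threshold
// requires. The signature is hypothetical; extra fields may accompany `passed`.
interface EvalResult {
  passed: boolean;
  score?: number;
}

function sentimentCheck(output: string, expected: string): EvalResult {
  const passed = output.trim().toLowerCase() === expected.trim().toLowerCase();
  return { passed, score: passed ? 1 : 0 };
}

console.log(sentimentCheck('Positive', 'positive')); // { passed: true, score: 1 }
```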

Programmatic access

You can query experiment results, individual runs, and prompt execution logs via the REST API or the agentmark api CLI command. Use this to build custom reporting, export results to external tools, or integrate experiment data into CI/CD pipelines.
# List experiments
agentmark api experiments list --limit 10

# Get a specific experiment with its results
agentmark api experiments get <experimentId>

# List individual runs within an experiment
agentmark api runs list

# Get a specific run with full input/output
agentmark api runs get <runId>

# List prompt execution logs
agentmark api prompts list --limit 10
# Equivalent curl request for experiments
curl -H "Authorization: Bearer <API_KEY>" \
     -H "x-app-id: <APP_ID>" \
     https://api.agentmark.co/v1/experiments?limit=10
The local dev server and cloud gateway support the same endpoints, so you can develop integrations locally before deploying. Use the capabilities endpoint to check feature availability.

Next Steps

Datasets

Create test datasets

Evaluations

Write evaluation functions

Testing Overview

Learn testing concepts