Running Experiments

Run prompts against datasets with automatic evaluation to validate quality and consistency.

CLI Usage

Quick Start

npm run experiment agentmark/classifier.prompt.mdx
Requirements:
  • Dataset configured in prompt frontmatter
  • Development server running (npm run dev)
  • Optional: Evaluation functions defined

Command Options

Skip evaluations (output-only mode):
npm run experiment agentmark/test.prompt.mdx --skip-eval
Output format:
npm run experiment agentmark/test.prompt.mdx --format table   # Default
npm run experiment agentmark/test.prompt.mdx --format csv     # Spreadsheets
npm run experiment agentmark/test.prompt.mdx --format json    # Structured
npm run experiment agentmark/test.prompt.mdx --format jsonl   # Line-delimited
Pass rate threshold:
npm run experiment agentmark/test.prompt.mdx --threshold-percent 85
Exits with a non-zero code if the pass rate falls below the threshold. Requires evaluations that return a passed field.
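This makes the command easy to use as a quality gate in CI, for example by chaining it before a release step (the deploy script name here is illustrative):
npm run experiment agentmark/classifier.prompt.mdx --threshold-percent 85 && npm run deploy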

Output Example

#   Input                   AI Result   Expected Output   sentiment_check
1   {"text":"I love it"}    positive    positive          PASS (1.00)
2   {"text":"Terrible"}     negative    negative          PASS (1.00)
3   {"text":"It's okay"}    neutral     neutral           PASS (1.00)
Summary:
Pass rate: 100% (3/3 passed)

How It Works

The run-experiment command:
  1. Loads your prompt file and parses the frontmatter
  2. Reads the dataset specified in test_settings.dataset
  3. Sends the prompt and dataset to the dev server (http://localhost:9417)
  4. The server runs the prompt against each dataset row
  5. Evaluates results using the evals specified in test_settings.evals
  6. Streams results back to the CLI as they complete
  7. Displays formatted output (table, CSV, JSON, or JSONL)

Configuration

Link dataset and evals in prompt frontmatter:
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
---

<System>Classify the sentiment</System>
<User>{props.text}</User>
Dataset (sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": "It's okay"}, "expected_output": "neutral"}
Learn more about datasets →
Learn more about evals →

Workflow

1. Develop prompts - Iterate on your prompt design
2. Create datasets - Add test cases covering your scenarios
3. Write evaluations - Define success criteria
4. Run experiments - Test against your dataset
npm run experiment agentmark/my-prompt.prompt.mdx
5. Review results - Identify failures and patterns
6. Iterate - Fix issues, improve prompts, add test cases
7. Deploy with confidence - Pass rate meets your threshold

SDK Usage

Run experiments programmatically using formatWithDataset():
import { client } from './agentmark.client';
import { generateText } from 'ai';  // Or your adapter's generation function

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

// Returns a stream of formatted inputs from the dataset
const datasetStream = await prompt.formatWithDataset();

// Process each test case
for await (const item of datasetStream) {
  const { dataset, formatted, evals } = item;

  // Run the prompt with your AI SDK
  const result = await generateText(formatted);

  // Check results
  const passed = result.text === dataset.expected_output;
  console.log(`Input: ${JSON.stringify(dataset.input)}`);
  console.log(`Expected: ${dataset.expected_output}`);
  console.log(`Got: ${result.text}`);
  console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
}
The stream returns objects with:
  • dataset - The test case (input and expected_output)
  • formatted - The formatted prompt ready for your AI SDK
  • evals - List of evaluation names to run
  • type - Always "dataset"
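Building on the same loop, you can aggregate a pass rate across rows to mirror the CLI's --threshold-percent behavior. A minimal sketch, assuming the same client and prompt path as above (the 85% threshold and exit handling are illustrative, and the simple string comparison stands in for real evals):

import { client } from './agentmark.client';
import { generateText } from 'ai';

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');
const datasetStream = await prompt.formatWithDataset();

let total = 0;
let passed = 0;

for await (const item of datasetStream) {
  const result = await generateText(item.formatted);
  total += 1;
  if (result.text === item.dataset.expected_output) passed += 1;
}

const passRate = (passed / total) * 100;
console.log(`Pass rate: ${passRate.toFixed(0)}% (${passed}/${total} passed)`);

// Fail the process if the pass rate is below the threshold (mirrors --threshold-percent)
if (passRate < 85) process.exit(1);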
Options (FormatWithDatasetOptions):
  • datasetPath?: string - Override dataset from frontmatter
  • format?: 'ndjson' | 'json' - Buffer all rows ('json') or stream as available ('ndjson', default)
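For example, to point the same prompt at a different dataset and buffer all rows instead of streaming (the dataset path here is illustrative):

const datasetStream = await prompt.formatWithDataset({
  datasetPath: './datasets/sentiment-edge-cases.jsonl', // overrides the dataset from frontmatter
  format: 'json',                                       // buffer all rows rather than stream
});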
When to use:
  • Custom test logic in your test framework
  • Fine-grained control over test execution
  • Integrating with existing test infrastructure
  • Running experiments in application code
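As one sketch of a test-framework integration, the dataset stream can drive an ordinary test case. Vitest is assumed here; any runner with async test support works similarly, and the timeout value is illustrative:

import { describe, it, expect } from 'vitest';
import { generateText } from 'ai';
import { client } from './agentmark.client';

describe('sentiment classifier', () => {
  it('matches the expected output for every dataset row', async () => {
    const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');
    const datasetStream = await prompt.formatWithDataset();

    // Run each dataset row through the model and assert on the expected output
    for await (const item of datasetStream) {
      const result = await generateText(item.formatted);
      expect(result.text).toBe(item.dataset.expected_output);
    }
  }, 120_000); // generous timeout, since each row makes a model call
});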

Troubleshooting

CLI Issues

Dataset not found:
  • Check dataset path in frontmatter
  • Verify file exists and is valid JSONL
Server connection error:
  • Ensure npm run dev is running
  • Check that port 9417 is available for the dev server
Invalid dataset format:
  • Each line must be valid JSON
  • Required: input field
  • Optional: expected_output field
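A minimal valid row needs only the input field, for example:
{"input": {"text": "Arrived two weeks late"}}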
No evaluations ran:
  • Add evals to test_settings in frontmatter
  • Or use --skip-eval flag for output-only mode

Next Steps