Running experiments in the AgentMark Dashboard and CLI

Run prompts against datasets with automatic evaluation to validate quality and consistency. The animation shows npx agentmark run-experiment executing against a dataset: each row is processed, the AI output is scored, and a results table prints to stdout with pass/fail status per evaluator.

CLI usage

Quick start

npx agentmark run-experiment agentmark/classifier.prompt.mdx
Requirements:
  • Dataset configured in prompt frontmatter
  • Development server running (npx agentmark dev)
  • Optional: Evaluation functions defined
Keep npx agentmark dev running in a separate terminal. The run-experiment command talks to it on port 9417.

Full command signature

npx agentmark run-experiment <filepath> [options]

Options:
  --server <url>        Webhook server URL (default: http://localhost:9417)
  --skip-eval           Skip running evals even if they exist
  --format <format>     Output format: table, csv, json, or jsonl (default: table)
  --threshold <percent> Fail if pass percentage is below threshold (0-100)
  --truncate <chars>    Truncate long cells in table output (default 1000; 0 = unlimited)

Dataset sampling (pick at most one):
  --sample <percent>    Run on a random N% of rows (1-100)
  --rows <spec>         Select specific rows by index or range (e.g., 0,3-5,9)
  --split <spec>        Train/test split (e.g., train:80 or test:80)
  --seed <number>       Seed for reproducible sampling/splitting
The --server flag defaults to the AGENTMARK_WEBHOOK_URL environment variable if set, otherwise http://localhost:9417.

Command options

Skip evaluations (output-only mode):
npx agentmark run-experiment agentmark/test.prompt.mdx --skip-eval
Output format:
npx agentmark run-experiment agentmark/test.prompt.mdx --format table   # Default
npx agentmark run-experiment agentmark/test.prompt.mdx --format csv     # Spreadsheets
npx agentmark run-experiment agentmark/test.prompt.mdx --format json    # Structured
npx agentmark run-experiment agentmark/test.prompt.mdx --format jsonl   # Line-delimited
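To make the format choices concrete, here is a small sketch of how the same results render as JSONL versus CSV. The ResultRow shape is hypothetical, chosen for illustration; the CLI's actual column set may differ.

```typescript
// Hypothetical result shape for illustration; the CLI's actual schema may differ.
interface ResultRow {
  input: string;
  output: string;
  passed: boolean;
}

const results: ResultRow[] = [
  { input: "I love it", output: "positive", passed: true },
  { input: "Terrible", output: "negative", passed: true },
];

// jsonl: one JSON object per line, easy to stream and append
const jsonl = results.map((r) => JSON.stringify(r)).join("\n");

// csv: a header row plus one comma-separated line per result
const csv = [
  "input,output,passed",
  ...results.map((r) => `${r.input},${r.output},${r.passed}`),
].join("\n");

console.log(jsonl);
console.log(csv);
```

JSONL suits streaming and log pipelines (each line stands alone); CSV suits spreadsheets but needs escaping once values contain commas or quotes.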
Pass rate threshold (CI/CD):
npx agentmark run-experiment agentmark/test.prompt.mdx --threshold 85
Exits with a non-zero code if the pass rate falls below the threshold. Requires evaluations that return a passed field.
Dataset sampling (see Dataset sampling below):
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20
npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80
Custom server:
npx agentmark run-experiment agentmark/test.prompt.mdx --server http://staging:9417
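The --threshold gate described above can be sketched as follows. This is illustrative, not the CLI's published internals; the EvalResult shape assumes only the documented passed field.

```typescript
// Hypothetical sketch of the --threshold check; EvalResult is an assumed shape.
interface EvalResult {
  passed: boolean;
}

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return (results.filter((r) => r.passed).length / results.length) * 100;
}

// Mirrors the documented behavior: non-zero exit code when below threshold.
function exitCodeForThreshold(results: EvalResult[], threshold: number): number {
  return passRate(results) >= threshold ? 0 : 1;
}

const sample = [{ passed: true }, { passed: true }, { passed: false }];
console.log(passRate(sample).toFixed(1)); // "66.7"
console.log(exitCodeForThreshold(sample, 85)); // 1
```

In CI/CD, that non-zero exit code is what fails the pipeline step.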

Dataset sampling

Run experiments on a subset of your dataset without modifying the dataset file. The three sampling modes are mutually exclusive; use only one per run.
Random sample (--sample <percent>): Run on a random N% of rows. Useful for quick smoke tests against large datasets.
# Run on ~20% of rows (random, non-reproducible)
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20

# Reproducible: same 20% every time
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20 --seed 42
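A seeded random sample can be sketched like this. The CLI's actual PRNG is unspecified; mulberry32 is used here purely to illustrate how a seed makes the selection deterministic.

```typescript
// mulberry32: a tiny deterministic PRNG (illustrative; not the CLI's actual PRNG).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function sampleRows<T>(rows: T[], percent: number, seed: number): T[] {
  const rand = mulberry32(seed);
  const count = Math.max(1, Math.round((rows.length * percent) / 100));
  // Deterministic Fisher-Yates shuffle over indices, then take the first `count`.
  const idx = rows.map((_, i) => i);
  for (let i = idx.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, count).sort((a, b) => a - b).map((i) => rows[i]);
}

const demo = sampleRows([...Array(10).keys()], 20, 42);
console.log(demo); // same two rows on every run with seed 42
```

The key property is that the same seed always selects the same rows, which is what --seed guarantees.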
Specific rows (--rows <spec>): Select individual rows by zero-based index. Supports comma-separated indices and ranges.
# Row 0 only
npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0

# Rows 0, 3, 4, 5, and 9
npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
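One way the rows spec can be interpreted is sketched below. The parser name and behavior on edge cases are assumptions for illustration, not the CLI's actual implementation.

```typescript
// Hypothetical sketch: expanding a rows spec like "0,3-5,9" into indices.
function parseRowsSpec(spec: string): number[] {
  const indices = new Set<number>();
  for (const part of spec.split(",")) {
    const range = part.split("-");
    if (range.length === 2) {
      // "3-5" is an inclusive range
      const start = Number(range[0]);
      const end = Number(range[1]);
      for (let i = start; i <= end; i++) indices.add(i);
    } else {
      // "0" is a single zero-based index
      indices.add(Number(part));
    }
  }
  return [...indices].sort((a, b) => a - b);
}

console.log(parseRowsSpec("0,3-5,9")); // [0, 3, 4, 5, 9]
```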
Train/test split (--split <spec>): Split the dataset into train and test portions. Run only the train portion or only the test portion.
# Run on the first 80% (train portion), positional split
npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80

# Run on the remaining 20% (test portion), positional split
npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80

# Seeded split — random assignment, reproducible across runs
npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80 --seed 42
npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80 --seed 42
Without --seed, --split uses positional assignment: the first N% of rows are “train” and the rest are “test”. With --seed, each row is assigned to train or test by a deterministic hash — the order in the file does not matter.
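The order-independent seeded split described above can be sketched with a content hash. The CLI's actual hash function is not documented; FNV-1a is used here only to show the idea that assignment depends on row content and seed, not position.

```typescript
// Illustrative only: FNV-1a string hash mapped to [0, 1).
function hash01(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 4294967296;
}

function seededSplit<T>(
  rows: T[],
  trainPercent: number,
  seed: number
): { train: T[]; test: T[] } {
  const train: T[] = [];
  const test: T[] = [];
  for (const row of rows) {
    // Assignment depends only on row content and seed, not file order.
    const bucket = hash01(`${seed}:${JSON.stringify(row)}`);
    (bucket < trainPercent / 100 ? train : test).push(row);
  }
  return { train, test };
}
```

Note that a per-row hash yields an expected ratio (roughly 80/20 for train:80), not an exact one on small datasets; that is typical of hash-based splits.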
Reproducibility with --seed: The --seed flag guarantees the same rows are selected every time, across TypeScript and Python. Pass the same seed to get identical results on any machine or language runtime.
# These two runs always process the exact same rows
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
Use --seed in CI/CD pipelines to prevent flaky results from random row selection.

Output example

#  Input                  AI Result  Expected Output  sentiment_check
1  {"text":"I love it"}   positive   positive         PASS (1.00)
2  {"text":"Terrible"}    negative   negative         PASS (1.00)
3  {"text":"It's okay"}   neutral    neutral          PASS (1.00)
Summary:
Pass rate: 100% (3/3 passed)
The CLI supports both .mdx source files and pre-built .json files (from npx agentmark build). Media outputs (images, audio) are saved to .agentmark-outputs/ with clickable file paths.

How it works

The run-experiment command:
  1. Loads your prompt file (.mdx or pre-built .json) and parses the frontmatter
  2. Reads the dataset specified in test_settings.dataset
  3. Sends the prompt and dataset to the dev server (default: http://localhost:9417)
  4. The server runs the prompt against each dataset row
  5. Evaluates results using the evals specified in test_settings.evals
  6. Streams results back to the CLI as they complete
  7. Displays formatted output (table, CSV, JSON, or JSONL)

Configuration

Link dataset and evals in prompt frontmatter:
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
---

<System>Classify the sentiment</System>
<User>{props.text}</User>
You can also provide default props via test_settings.props:
test_settings:
  props:
    language: en
    verbose: false
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
Props from each dataset row override the defaults.
Dataset (sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": "It's okay"}, "expected_output": "neutral"}
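The override rule for default props can be pictured as a shallow merge. This is a sketch of the documented behavior, assuming a shallow (not deep) merge; the values below are hypothetical.

```typescript
// Defaults from test_settings.props; row props come from the dataset line.
const defaultProps = { language: "en", verbose: false };
const rowProps = { text: "I love this!", language: "fr" };

// Row props win on conflicts; defaults fill the gaps.
const effectiveProps = { ...defaultProps, ...rowProps };
console.log(effectiveProps);
// { language: 'fr', verbose: false, text: 'I love this!' }
```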
Learn more about datasets →
Learn more about evals →

Workflow

1. Develop prompts - Iterate on your prompt design
2. Create datasets - Add test cases covering your scenarios
3. Write evaluations - Define success criteria
4. Run experiments - Test against dataset
npx agentmark run-experiment agentmark/my-prompt.prompt.mdx
5. Review results - Identify failures and patterns
6. Iterate - Fix issues, improve prompts, add test cases
7. Deploy with confidence - Pass rate meets your threshold

SDK usage

Run experiments programmatically using formatWithDataset():
import { client } from './agentmark-client';
import { generateText } from 'ai';  // Or your adapter's generation function

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

// Returns a stream of formatted inputs from the dataset
const datasetStream = await prompt.formatWithDataset();

// Process each test case
for await (const item of datasetStream) {
  const { dataset, formatted, evals } = item;

  // Run the prompt with your AI SDK
  const result = await generateText(formatted);

  // Check results
  const passed = result.text === dataset.expected_output;
  console.log(`Input: ${JSON.stringify(dataset.input)}`);
  console.log(`Expected: ${dataset.expected_output}`);
  console.log(`Got: ${result.text}`);
  console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
}
The stream returns objects with:
  • dataset - The test case (input and expected_output)
  • formatted - The formatted prompt ready for your AI SDK
  • evals - List of evaluation names to run
  • type - Always "dataset"
Options (FormatWithDatasetOptions):
  • datasetPath?: string - Override dataset from frontmatter
  • format?: 'ndjson' | 'json' - Buffer all rows ('json') or stream as available ('ndjson', default)
When to use:
  • Custom test logic in your test framework
  • Fine-grained control over test execution
  • Integrating with existing test infrastructure
  • Running experiments in application code

Troubleshooting

CLI issues

Dataset not found:
  • Check dataset path in frontmatter
  • Verify file exists and is valid JSONL
Server connection error:
  • Ensure npx agentmark dev is running
  • Check ports are available (default webhook port: 9417)
  • Verify --server URL if using a custom server
Invalid dataset format:
  • Each line must be valid JSON
  • Required: input field
  • Optional: expected_output field
No evaluations ran:
  • Add evals to test_settings in frontmatter
  • Or use --skip-eval flag for output-only mode
Threshold check failed:
  • The --threshold flag requires evals that return a passed field
  • Verify your eval functions return { passed: true/false, ... }
Sampling options conflict:
  • Only one of --sample, --rows, or --split may be used at a time
  • --seed can be combined with any of them

Programmatic access

You can query experiment results, run traces, and prompt file listings via the REST API or the agentmark api CLI command. Use this to build custom reporting, export results to external tools, or integrate experiment data into CI/CD pipelines.
# List experiments
npx agentmark api experiments list --limit 10

# Get a specific experiment, including its runs and evaluation results
npx agentmark api experiments get <experimentId>

# List traces for a specific experiment run — filter `/v1/traces` by `dataset_run_id`
# (the former `/v1/runs/{runId}/traces` endpoint is deprecated; both paths hit the
# same predicate, but the filter approach works on Cloud + Local without a second
# endpoint).
curl "http://localhost:9418/v1/traces?dataset_run_id=<runId>"

# List prompt files in the current project
npx agentmark api prompts list --limit 10
# Equivalent curl against the local dev server
curl http://localhost:9418/v1/experiments?limit=10
  • experiments ships on Cloud + Local.
  • prompts is Local-only today; Cloud returns 501 not_available_on_cloud.
  • The legacy /v1/runs/{runId}/traces endpoint is deprecated but still works on Local for backwards compatibility; use /v1/traces?dataset_run_id=… in new code.
  • Call GET /v1/capabilities to check which features a server supports at runtime.

Next steps

Datasets

Create test datasets

Evaluations

Write evaluation functions

Testing overview

Learn testing concepts
