Experiments run a prompt against every item in a dataset, optionally applying evaluation functions to score each result. Use experiments to validate prompt changes, compare models, and enforce quality thresholds.

Running Experiments

From the CLI

agentmark run-experiment agentmark/classifier.prompt.mdx
Requirements:
  • A dataset configured in the prompt’s test_settings.dataset
  • Development server running (agentmark dev)
  • Optional: evaluation functions registered in your EvalRegistry
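An evaluation function scores one output against the dataset item's expected value and returns the eval shape shown under Result Structure below. The exact `EvalRegistry` API is not shown here, so treat the following as an illustrative sketch of the function itself, with the `accuracy` name and its signature assumed rather than prescribed:

```typescript
// Sketch of an evaluation function. The shape of the returned object
// follows the `evals` entries in the Result Structure section; the
// function name and parameters are illustrative assumptions.
interface EvalResult {
  name: string;
  score?: number;
  passed?: boolean;
  reason?: string;
}

function accuracy(actualOutput: string, expectedOutput?: string): EvalResult {
  // Case-insensitive exact match between model output and expected output.
  const match =
    expectedOutput !== undefined &&
    actualOutput.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return {
    name: "accuracy",
    score: match ? 1 : 0,
    passed: match, // `passed` is the field --threshold aggregates over
    reason: match
      ? "exact match"
      : `expected "${expectedOutput}", got "${actualOutput}"`,
  };
}
```

Register a function like this under the name referenced in `test_settings.evals` (here, `accuracy`).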

From the Platform

Run experiments directly from the AgentMark dashboard by selecting a prompt and its associated dataset. Results appear in real time as each item completes.

Command Options

agentmark run-experiment <filepath> [options]
| Option | Description |
|---|---|
| `--server <url>` | Webhook server URL (default: `http://localhost:9417`) |
| `--skip-eval` | Skip evaluations and output results only |
| `--format <format>` | Output format: `table` (default), `csv`, `json`, or `jsonl` |
| `--threshold <percent>` | Fail with a non-zero exit code if the pass rate is below the threshold (0–100) |

Output Formats

The default `table` format renders a human-readable table in the terminal:
| # | Input               | AI Result | Expected Output | accuracy    |
|---|---------------------|-----------|-----------------|-------------|
| 1 | {"text":"I love it"}| positive  | positive        | PASS (1.00) |
| 2 | {"text":"Terrible"} | negative  | negative        | PASS (1.00) |
| 3 | {"text":"It's okay"}| neutral   | neutral         | PASS (1.00) |

Pass rate: 100% (3/3 passed)
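With `--format jsonl`, each completed item is emitted as a single JSON object per line, following the shape described under Result Structure below. The concrete values here are illustrative, not captured output:

```jsonl
{"type":"dataset","runId":"run_123","runName":"sentiment-classifier","result":{"input":{"text":"I love it"},"expectedOutput":"positive","actualOutput":"positive","evals":[{"name":"accuracy","score":1,"passed":true}]}}
```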

Threshold Enforcement

Set a minimum pass rate to gate deployments in CI/CD:
agentmark run-experiment agentmark/classifier.prompt.mdx --threshold 90
Exits with a non-zero code if the pass rate falls below the threshold. Requires evaluations that return a passed field.

Configuration

Link a dataset and evals in your prompt’s frontmatter:
classifier.prompt.mdx
---
name: sentiment-classifier
text_config:
  model_name: gpt-4o
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
    - format_check
---

<System>Classify the sentiment as positive, negative, or neutral.</System>
<User>{props.text}</User>
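The file referenced by `test_settings.dataset` is JSONL: one test case per line. Based on the run flow (each item's `input` becomes the prompt props, and `expectedOutput` is what evals score against), a matching dataset might look like this; the field names mirror the Result Structure section, but check the dataset documentation for the authoritative schema:

```jsonl
{"input": {"text": "I love it"}, "expectedOutput": "positive"}
{"input": {"text": "Terrible"}, "expectedOutput": "negative"}
{"input": {"text": "It's okay"}, "expectedOutput": "neutral"}
```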

test_settings Fields

| Field | Type | Description |
|---|---|---|
| `dataset` | `string` | Path to the JSONL dataset file |
| `evals` | `string[]` | Names of registered evaluation functions to run |
| `props` | `Record<string, any>` | Default props for `run-prompt` (overridden by dataset input during experiments) |

How It Works

  1. Load the prompt file and parse the frontmatter
  2. Read the dataset from test_settings.dataset
  3. Send the prompt and dataset to the dev server webhook
  4. Execute each dataset item — the server runs the prompt with the item’s input as props
  5. Evaluate — registered evals score each output against expectedOutput
  6. Stream results back to the CLI as they complete
  7. Display formatted output and pass rate summary
Each experiment run generates individual traces viewable in the dashboard, with token usage, latency, and eval scores attached.

Result Structure

Each result in the experiment stream contains:
{
  type: "dataset",
  runId: string,
  runName: string,
  result: {
    input: Record<string, unknown>,
    expectedOutput?: string,
    actualOutput: any,
    tokens?: number,
    evals: Array<{
      name: string,
      score?: number,
      label?: string,
      reason?: string,
      passed?: boolean,
    }>
  }
}
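The pass-rate summary the CLI prints can be understood as an aggregation over these streamed results. The sketch below is an illustrative reimplementation, not AgentMark's actual code; it assumes an item passes when every eval that reports a `passed` field reports `true`:

```typescript
// Aggregate streamed experiment results into a pass rate (0-100),
// mirroring the summary line the CLI prints. Illustrative only.
interface StreamedEval {
  name: string;
  score?: number;
  passed?: boolean;
}

interface StreamedResult {
  result: { evals: StreamedEval[] };
}

function passRate(results: StreamedResult[]): number {
  if (results.length === 0) return 0;
  // An item fails only if some eval explicitly reports passed: false.
  const passed = results.filter((r) =>
    r.result.evals.every((e) => e.passed !== false)
  ).length;
  return (passed / results.length) * 100;
}

const results: StreamedResult[] = [
  { result: { evals: [{ name: "accuracy", passed: true }] } },
  { result: { evals: [{ name: "accuracy", passed: true }] } },
  { result: { evals: [{ name: "accuracy", passed: false }] } },
];

const rate = passRate(results); // 2 of 3 items pass
const failsThreshold = rate < 90; // --threshold 90 would exit non-zero here
```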

Workflow

  1. Develop your prompt
  2. Create a dataset with test cases covering your scenarios
  3. Write evaluations that define success criteria
  4. Run experiments to validate:
    agentmark run-experiment agentmark/my-prompt.prompt.mdx
    
  5. Review results — identify failures, inspect traces in the dashboard
  6. Iterate — fix issues, add test cases, rerun
  7. Deploy when pass rate meets your threshold

Integration with CI/CD

Use the --threshold flag and machine-readable output formats to integrate experiments into your deployment pipeline:
# In CI: fail the build if accuracy drops below 85%
agentmark run-experiment agentmark/classifier.prompt.mdx \
  --threshold 85 \
  --format jsonl
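In GitHub Actions, for example, the gate might look like the step below. This is a hypothetical sketch: it assumes the CLI is invoked via `npx`, that the dev server is reachable from the runner, and it uses a crude `sleep` in place of a proper readiness check. Adapt it to your pipeline:

```yaml
# Hypothetical CI step; adjust invocation and server startup to your setup.
- name: Run prompt experiments
  run: |
    npx agentmark dev &   # start the webhook server in the background
    sleep 5               # crude wait; prefer polling the server until ready
    npx agentmark run-experiment agentmark/classifier.prompt.mdx \
      --threshold 85 \
      --format jsonl
```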
