# Running Experiments
## From the CLI

- A dataset configured in the prompt's `test_settings.dataset`
- Development server running (`agentmark dev`)
- Optional: evaluation functions registered in your `EvalRegistry`
## From the Platform
Run experiments directly from the AgentMark dashboard by selecting a prompt and its associated dataset. Results appear in real time as each item completes.

## Command Options
| Option | Description |
|---|---|
| `--server <url>` | Webhook server URL (default: `http://localhost:9417`) |
| `--skip-eval` | Skip evaluations, output results only |
| `--format <format>` | Output format: `table` (default), `csv`, `json`, or `jsonl` |
| `--threshold <percent>` | Fail with a non-zero exit code if the pass rate is below the threshold (0–100) |
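As a usage sketch (only the flags come from the table above; the subcommand name and prompt path are placeholders, not confirmed by this page):

```
# Hypothetical invocation; "run-experiment" and the file name are illustrative
agentmark run-experiment classifier.prompt.mdx \
  --server http://localhost:9417 \
  --format json \
  --threshold 90
```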
## Output Formats

- `table` (default): human-readable table rendered in the terminal
- `csv`: comma-separated values, convenient for spreadsheets
- `json`: a single JSON document of results
- `jsonl`: one JSON object per line, suited to streaming and scripting
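The `jsonl` output lends itself to scripting with standard tools. A minimal sketch, assuming each line is a JSON object with a boolean `passed` field (the real result shape may carry more fields):

```shell
# Compute a pass rate from a JSONL results file (hypothetical data)
cat > results.jsonl <<'EOF'
{"passed": true}
{"passed": true}
{"passed": false}
{"passed": true}
EOF

passed=$(grep -c '"passed": true' results.jsonl)
total=$(wc -l < results.jsonl)
echo "pass rate: $((100 * passed / total))%"
```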
## Threshold Enforcement

Set a minimum pass rate to gate deployments in CI/CD. The pass rate is computed from each result's `passed` field.
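The gate can be sketched in plain shell; the CLI applies this comparison internally (the 87/90 numbers here are invented for illustration):

```shell
# Illustrative only: how a threshold gate maps a pass rate to an exit status
pass_rate=87
threshold=90
if [ "$pass_rate" -lt "$threshold" ]; then
  echo "pass rate ${pass_rate}% is below threshold ${threshold}%"
  status=1
else
  status=0
fi
echo "exit status: $status"
```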
## Configuration

Link a dataset and evals in your prompt's frontmatter, for example in `classifier.prompt.mdx`:
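A minimal frontmatter sketch, using the `test_settings` fields documented on this page (the dataset path, eval name, and props values are placeholders):

```yaml
---
# classifier.prompt.mdx frontmatter (illustrative values)
test_settings:
  dataset: ./datasets/classifier.jsonl
  evals:
    - exact-match
  props:
    tone: neutral
---
```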
### test_settings Fields
| Field | Type | Description |
|---|---|---|
| `dataset` | `string` | Path to the JSONL dataset file |
| `evals` | `string[]` | Names of registered evaluation functions to run |
| `props` | `Record<string, any>` | Default props for `run-prompt` (overridden by dataset `input` during experiments) |
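The referenced dataset is a JSONL file; each item supplies an `input` (used as props) and an `expectedOutput` for evals to score against. An illustrative file (the keys are grounded in this page; the values are invented):

```jsonl
{"input": {"text": "Love this product!"}, "expectedOutput": "positive"}
{"input": {"text": "Terrible experience."}, "expectedOutput": "negative"}
```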
## How It Works

- Load the prompt file and parse the frontmatter
- Read the dataset from `test_settings.dataset`
- Send the prompt and dataset to the dev server webhook
- Execute each dataset item: the server runs the prompt with the item's `input` as props
- Evaluate: registered evals score each output against `expectedOutput`
- Stream results back to the CLI as they complete
- Display formatted output and pass rate summary
## Result Structure

Each result in the experiment stream contains the dataset item's `input`, the generated output, the `expectedOutput`, and the evaluation scores, including a `passed` flag.

## Workflow
- Develop your prompt
- Create a dataset with test cases covering your scenarios
- Write evaluations that define success criteria
- Run experiments to validate your changes
- Review results — identify failures, inspect traces in the dashboard
- Iterate — fix issues, add test cases, rerun
- Deploy when pass rate meets your threshold
## Integration with CI/CD
Use the `--threshold` flag and machine-readable output formats to integrate experiments into your deployment pipeline:
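A CI step might look like this sketch (the flags come from the options table above; the subcommand name and prompt path are assumptions). Because the CLI exits non-zero when the pass rate falls below the threshold, the pipeline step fails automatically:

```
# Hypothetical CI step: fail the build if the pass rate drops below 90%
agentmark run-experiment classifier.prompt.mdx \
  --format jsonl \
  --threshold 90 > results.jsonl
```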
## Learn More
- Datasets — Create test datasets
- Evaluations — Set up evaluation functions
- Running Experiments (Development) — SDK usage and programmatic experiment execution
## Have Questions?
We’re here to help! Choose the best way to reach us:
- Join our Discord community for quick answers and discussions
- Email us at hello@agentmark.co for support
- Schedule an Enterprise Demo to learn about our business solutions