Run prompts against datasets with automatic evaluation to validate quality and consistency. In Cloud, you create and review experiments in the AgentMark Dashboard against git-synced datasets. In Local, you run experiments from the CLI against JSONL files on disk.
The Dashboard runs experiments against datasets and score configs that are synced from your repo through the deployment pipeline, using the evals registered on your deployed handler. Open Experiments (flask icon) in the sidebar to get started.
Running an experiment from the Dashboard requires the app to be connected to a deployed handler. If it isn’t, the dialog returns an “app not connected” error. See Deployment for connecting a handler.
The Experiments page is a paginated list of every run in your app. Filter it by prompt name and dataset path to find a specific run.
The Experiments list shows each run as a row, with filters for prompt name and dataset path and a New Experiment button in the top-right. Comparison charts (average latency, total cost, and average score across the runs) sit above the list. Select 2 to 3 runs to enable Compare.
Running an experiment requires the experiment.run permission.
Click New Experiment to open the dialog.
The New Experiment dialog has four fields: Name, Prompt, Dataset, and Evaluations (a multi-select populated from the evals your deployed handler registers). Selecting a prompt auto-fills the dataset and evaluations from its test_settings frontmatter.
The Name must start with a letter and may contain letters, numbers, hyphens, and underscores, up to 100 characters.
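If you generate experiment names in scripts, it can help to check them against the Name rule before submitting. The sketch below is one way to express that rule as a regular expression. The function name `isValidExperimentName` is hypothetical, and it assumes "letters" means ASCII letters; the Dashboard's own validation may differ.

```typescript
// Hypothetical pre-check mirroring the documented Name rule:
// starts with a letter, then letters, digits, hyphens, or underscores,
// with at most 100 characters in total.
// Assumption: "letters" means ASCII A-Z / a-z.
const EXPERIMENT_NAME_PATTERN = /^[A-Za-z][A-Za-z0-9_-]{0,99}$/;

function isValidExperimentName(name: string): boolean {
  return EXPERIMENT_NAME_PATTERN.test(name);
}
```

For example, `isValidExperimentName("baseline-v2")` passes, while a name that starts with a digit or contains a space does not.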
1. Name the experiment - Enter a Name that starts with a letter.
2. Choose a prompt - Pick the Prompt to test. The Dataset and Evaluations auto-fill from its test_settings.
3. Confirm dataset and evaluations - Adjust Dataset and Evaluations if you want to run against a different dataset or eval set.
4. Run - Click Run Experiment. Results stream in live, then open in the experiment detail view.
As the run executes, results stream in row by row, and a summary reports the item count and total tokens when it finishes. Open the experiment to review the full results.
Click any experiment to open its detail view.
The experiment detail view lists each dataset row in a table: Item, Input, Output, Expected Output, Model, latency, cost, tokens, Scores, and a Trace link. Above the table, aggregate metrics summarize the run (items, average score, total cost, average latency, total tokens) alongside charts.
Use Send to Review Queue on the detail page to send the experiment’s items to an annotation queue for human review. See Human annotation.
Select 2 to 3 experiments in the list, then click Compare to view them side by side.
The comparison view places 2 to 3 runs side by side and tags each item as Improved, Regressed, or Unchanged, so you can see exactly which cases a prompt change fixed or broke.
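To make the three tags concrete, here is a minimal sketch of how such a comparison could be computed for two runs. This is not the Dashboard's implementation. It assumes each item has a single numeric score, that higher is better, and that items are matched by a shared id. `ItemScores`, `tagItems`, and the `epsilon` tolerance are hypothetical names.

```typescript
// Hypothetical shape: item id -> score (assumed: higher is better).
type ItemScores = Record<string, number>;
type Tag = "Improved" | "Regressed" | "Unchanged";

// Tag every item that appears in both runs by comparing its scores.
function tagItems(
  baseline: ItemScores,
  candidate: ItemScores,
  epsilon = 1e-9, // differences smaller than this count as Unchanged
): Record<string, Tag> {
  const tags: Record<string, Tag> = {};
  for (const [id, before] of Object.entries(baseline)) {
    const after = candidate[id];
    if (after === undefined) continue; // skip items missing from the candidate run
    const delta = after - before;
    if (delta > epsilon) tags[id] = "Improved";
    else if (delta < -epsilon) tags[id] = "Regressed";
    else tags[id] = "Unchanged";
  }
  return tags;
}
```

Items that exist in only one run are skipped here; a real comparison would need a policy for them.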
The animation shows npx agentmark run-experiment executing against a dataset: each row is processed, the AI output is scored, and a results table prints to stdout with pass/fail status per evaluator.

    npx agentmark run-experiment <filepath> [options]

    Options:
      --server <url>          Webhook server URL (default: http://localhost:9417)
      --skip-eval             Skip running evals even if they exist
      --format <format>       Output format: table, csv, json, jsonl, or junit (default: table)
      --threshold <percent>   Fail if pass percentage is below threshold (0-100)
      --truncate <chars>      Truncate long cells in table output (default 1000; 0 = unlimited)

    Dataset sampling (pick at most one):
      --sample <percent>      Run on a random N% of rows (1-100)
      --rows <spec>           Select specific rows by index or range (e.g., 0,3-5,9)
      --split <spec>          Train/test split (e.g., train:80 or test:80)
      --seed <number>         Seed for reproducible sampling/splitting

The --server flag defaults to the AGENTMARK_WEBHOOK_URL environment variable if set, otherwise http://localhost:9417.
With --threshold, the CLI exits with a non-zero code if the pass rate falls below the threshold. This requires evaluations that return a passed field.
--threshold is an absolute pass-rate gate on a single run. To gate CI on per-case regressions against a baseline — failing a PR when a case scores worse than it did before — see Regression gates.
JUnit XML for CI gating:
--format junit emits a JUnit XML document that every major CI system already parses natively: GitHub Actions (via marketplace parsers), GitLab CI (via artifacts.reports.junit), Jenkins, CircleCI, and others. Each (row × scorer) pair becomes one <testcase>; failing scorers emit <failure> with input/actual/expected payload in CDATA.
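The mapping from results to XML can be pictured with a small sketch. The code below shows the general shape only: one testcase per (row × scorer) pair, with a CDATA failure payload for failing scorers. It is not the CLI's actual serializer. The `ScorerResult` type, the `toJUnit` function, the testcase naming scheme, and the attribute set are all assumptions.

```typescript
// Hypothetical flattened result: one entry per (row × scorer) pair.
interface ScorerResult {
  row: number;
  scorer: string;
  passed: boolean;
  input: string;
  actual: string;
  expected: string;
}

// Escape characters that are not allowed inside XML attribute values.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// CDATA cannot contain "]]>", so split any occurrence across two sections.
function cdata(value: string): string {
  return `<![CDATA[${value.replace(/]]>/g, "]]]]><![CDATA[>")}]]>`;
}

function toJUnit(suiteName: string, results: ScorerResult[]): string {
  const failures = results.filter((r) => !r.passed).length;
  const cases = results.map((r) => {
    const name = escapeAttr(`row ${r.row} / ${r.scorer}`); // assumed naming scheme
    if (r.passed) return `  <testcase name="${name}"/>`;
    const payload = `input: ${r.input}\nactual: ${r.actual}\nexpected: ${r.expected}`;
    return [
      `  <testcase name="${name}">`,
      `    <failure message="scorer failed">${cdata(payload)}</failure>`,
      `  </testcase>`,
    ].join("\n");
  });
  return [
    `<testsuite name="${escapeAttr(suiteName)}" tests="${results.length}" failures="${failures}">`,
    ...cases,
    `</testsuite>`,
  ].join("\n");
}
```

Because every pair is its own testcase, a CI report built this way can show which scorer failed on which row instead of a single pass/fail for the whole run.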
The XML can be combined with --threshold for a suite-level gate on top of the per-row failures already surfaced in the report.
GitHub Actions: use the agentmark-ai/eval-action composite, which diffs the PR, runs --format junit per changed prompt, and pipes results to mikepenz/action-junit-report:
GitLab CI: use the agentmark-ai/eval-component Catalog component, which diffs the MR, runs --format junit per changed prompt, and surfaces results in the MR widget via artifacts:reports:junit:
Other CI systems (Jenkins, CircleCI, Buildkite) consume the same XML via their native JUnit-report plugins.
Dataset sampling (see Dataset sampling below):
Run experiments on a subset of your dataset without modifying the dataset file. The three sampling modes are mutually exclusive; use only one per run.
Random sample (--sample <percent>):
Run on a random N% of rows. Useful for quick smoke tests against large datasets.

    # Run on ~20% of rows (random, non-reproducible)
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20

    # Reproducible: same 20% every time
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20 --seed 42

Specific rows (--rows <spec>):
Select individual rows by zero-based index. Supports comma-separated indices and ranges.
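To illustrate how a spec such as 0,3-5,9 expands, here is a minimal parsing sketch. It is not the CLI's parser. It assumes ranges are inclusive on both ends and that duplicates collapse, and `parseRowSpec` is a hypothetical name.

```typescript
// Expand a row spec like "0,3-5,9" into sorted, de-duplicated zero-based indices.
// Assumptions: ranges are inclusive on both ends; malformed parts are errors.
function parseRowSpec(spec: string): number[] {
  const rows = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    const range = /^(\d+)-(\d+)$/.exec(token);
    if (range) {
      const start = Number(range[1]);
      const end = Number(range[2]);
      if (start > end) throw new Error(`Invalid range: ${token}`);
      for (let i = start; i <= end; i++) rows.add(i);
    } else if (/^\d+$/.test(token)) {
      rows.add(Number(token));
    } else {
      throw new Error(`Invalid row spec part: "${token}"`);
    }
  }
  return Array.from(rows).sort((a, b) => a - b);
}
```

Under these assumptions, `parseRowSpec("0,3-5,9")` selects rows 0, 3, 4, 5, and 9.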
Train/test split (--split <spec>):
Split the dataset into train and test portions. Run only the train portion or only the test portion.

    # Run on the first 80% (train portion), positional split
    npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80

    # Run on the remaining 20% (test portion), positional split
    npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80

    # Seeded split: random assignment, reproducible across runs
    npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80 --seed 42
    npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80 --seed 42

Without --seed, --split uses positional assignment: the first N% of rows are “train” and the rest are “test”. With --seed, each row is assigned to train or test by a deterministic hash — the order in the file does not matter.
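The difference between the two modes can be sketched as follows. This is an illustration only. The hash (an FNV-1a variant), the choice to hash the seed together with the serialized row content, and the floor-based rounding of the positional cut are all assumptions, so this code will not select the same rows as the CLI. `splitRows` and `fnv1a` are hypothetical names.

```typescript
type Portion = "train" | "test";

// FNV-1a style 32-bit hash over UTF-16 code units.
// A stand-in for whatever hash the CLI really uses.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function splitRows<T>(rows: T[], portion: Portion, trainPercent: number, seed?: number): T[] {
  if (seed === undefined) {
    // Positional: the first N% of rows are train, the rest are test.
    const cut = Math.floor((rows.length * trainPercent) / 100); // rounding is an assumption
    return portion === "train" ? rows.slice(0, cut) : rows.slice(cut);
  }
  // Seeded: a row's bucket depends only on the seed and the row's content,
  // so reordering the file does not change which portion a row lands in.
  return rows.filter((row) => {
    const bucket = fnv1a(`${seed}:${JSON.stringify(row)}`) % 100;
    const isTrain = bucket < trainPercent;
    return portion === "train" ? isTrain : !isTrain;
  });
}
```

With hash-based assignment the train portion is only approximately N% of the rows, which is the usual trade-off for order independence.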
Reproducibility with --seed:
The --seed flag guarantees the same rows are selected every time, across TypeScript and Python. Pass the same seed to get identical results on any machine or language runtime.

    # These two runs always process the exact same rows
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99

Use --seed in CI/CD pipelines to prevent flaky results from random row selection.
On a normal run the table (or --format output) is the only result. Pass --threshold <0-100> to also print a pass-rate summary and gate the exit code — the pass rate is counted over evaluations (row × evaluator pairs), not rows:
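Counting over evaluations rather than rows changes the number you gate on, so a small sketch may help. The code below assumes each evaluation result carries a boolean passed field, as the threshold check requires. The `EvalResult` type and both function names are hypothetical, and treating an empty result set as 0% is an assumption.

```typescript
// One entry per (row × evaluator) pair.
interface EvalResult {
  row: number;
  evaluator: string;
  passed: boolean;
}

// Pass rate as a percentage of evaluations, not of rows.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0; // assumption: nothing evaluated counts as 0%
  const passed = results.filter((r) => r.passed).length;
  return (passed / results.length) * 100;
}

// The gate fails only when the pass rate falls below the threshold.
function meetsThreshold(results: EvalResult[], thresholdPercent: number): boolean {
  return passRate(results) >= thresholdPercent;
}
```

For two rows scored by two evaluators with three passes in total, the pass rate is 75%, even though only one of the two rows passed every evaluator.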
If the pass rate falls below the threshold, the CLI prints ❌ Experiment failed threshold check and exits non-zero; wire that into CI for regression gating.
The CLI supports both .mdx source files and pre-built .json files (from npx agentmark build). Media outputs (images, audio) are saved to .agentmark-outputs/ with clickable file paths.
1. Develop prompts - Iterate on your prompt design
2. Create datasets - Add test cases covering your scenarios
3. Write evaluations - Define success criteria
4. Run experiments - Test against dataset
Run experiments programmatically using formatWithDataset():

    import { client } from './agentmark-client';
    import { generateText } from 'ai'; // Or your adapter's generation function

    const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

    // Returns a stream of formatted inputs from the dataset
    const datasetStream = await prompt.formatWithDataset();

    // Process each test case
    for await (const item of datasetStream) {
      const { dataset, formatted, evals } = item;

      // Run the prompt with your AI SDK
      const result = await generateText(formatted);

      // Check results
      const passed = result.text === dataset.expected_output;
      console.log(`Input: ${JSON.stringify(dataset.input)}`);
      console.log(`Expected: ${dataset.expected_output}`);
      console.log(`Got: ${result.text}`);
      console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
    }

The stream returns objects with:
dataset - The test case (input and expected_output)
formatted - The formatted prompt ready for your AI SDK
evals - List of evaluation names to run
type - Always "dataset"
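Since each item carries a list of evaluation names, one way to use it in a custom loop is to look the names up in your own table of evaluator functions. The sketch below is hypothetical throughout: `DatasetCase`, `Evaluator`, the `evaluators` table, and the two sample evaluators are illustrative names, not part of the SDK. It only assumes that an evaluation returns an object with a passed field.

```typescript
// Hypothetical shapes for a test case and an evaluator function.
interface DatasetCase {
  input: unknown;
  expected_output?: string;
}
type Evaluator = (output: string, testCase: DatasetCase) => { passed: boolean };

// Illustrative table of evaluators keyed by name.
const evaluators: Record<string, Evaluator> = {
  exact_match: (output, testCase) => ({ passed: output === testCase.expected_output }),
  non_empty: (output) => ({ passed: output.trim().length > 0 }),
};

// Run every named evaluation against one model output.
function runEvals(evalNames: string[], output: string, testCase: DatasetCase) {
  return evalNames.map((name) => {
    const evaluator = evaluators[name];
    if (!evaluator) throw new Error(`Unknown eval: ${name}`);
    return { evaluator: name, passed: evaluator(output, testCase).passed };
  });
}
```

Inside the loop you would call something like `runEvals(evals, result.text, dataset)` and collect the results per row.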
Options (FormatWithDatasetOptions):
datasetPath?: string - Override dataset from frontmatter
format?: 'ndjson' | 'json' - Buffer all rows ('json') or stream as available ('ndjson', default)
You can query experiment results, run traces, and prompt file listings through the REST API, or from an IDE agent via the agentmark-mcp MCP server. Use either to build custom reporting, export results to external tools, or integrate experiment data into CI/CD pipelines.

    # List experiments
    curl "http://localhost:9418/v1/experiments?limit=10"

    # Get a specific experiment, including its runs and evaluation results
    curl "http://localhost:9418/v1/experiments/<experimentId>"

    # List traces for a specific experiment run: filter `/v1/traces` by `dataset_run_id`
    # (the former `/v1/runs/{runId}/traces` endpoint is deprecated; both paths hit the
    # same predicate, but the filter approach works on Cloud + Local without a second
    # endpoint).
    curl "http://localhost:9418/v1/traces?dataset_run_id=<runId>"

    # List prompt files registered with the local dev server
    curl "http://localhost:9418/v1/prompts?limit=10"


    # Same call against AgentMark Cloud: set auth + app headers
    curl "https://api.agentmark.co/v1/experiments?limit=10" \
      -H "Authorization: Bearer $AGENTMARK_API_KEY" \
      -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"

experiments ships on Cloud + Local. prompts is Local-only today — Cloud returns 501 not_available_on_cloud. The legacy /v1/runs/{runId}/traces endpoint is deprecated but still works on Local for backwards compatibility; use /v1/traces?dataset_run_id=… in new code. Call GET /v1/capabilities to check which features a server supports at runtime.
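For scripts that target both Local and Cloud, the same calls can be wrapped in a few small helpers. The sketch below uses only the endpoints and headers shown above. The helper names, the `ApiTarget` type, and the `FetchLike` type are hypothetical, and the response body is left as unknown because its shape is not described here.

```typescript
// Where to send requests. apiKey and appId are only needed for Cloud.
interface ApiTarget {
  baseUrl: string; // e.g. "http://localhost:9418" or "https://api.agentmark.co"
  apiKey?: string;
  appId?: string;
}

// Build the auth headers; Local needs none.
function buildHeaders(target: ApiTarget): Record<string, string> {
  const headers: Record<string, string> = {};
  if (target.apiKey) headers["Authorization"] = `Bearer ${target.apiKey}`;
  if (target.appId) headers["X-Agentmark-App-Id"] = target.appId;
  return headers;
}

function experimentsUrl(target: ApiTarget, limit: number): string {
  return `${target.baseUrl}/v1/experiments?limit=${limit}`;
}

// Uses the dataset_run_id filter rather than the deprecated per-run endpoint.
function tracesForRunUrl(target: ApiTarget, runId: string): string {
  return `${target.baseUrl}/v1/traces?dataset_run_id=${encodeURIComponent(runId)}`;
}

// Minimal fetch-compatible signature so the global fetch can be passed in.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function getJson(fetchImpl: FetchLike, url: string, target: ApiTarget): Promise<unknown> {
  const response = await fetchImpl(url, { headers: buildHeaders(target) });
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
  return response.json();
}
```

A call such as `getJson(fetch, experimentsUrl(target, 10), target)` then works against either server, and a 501 from Cloud surfaces as a thrown error.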