Testing in AgentMark

AgentMark provides a comprehensive testing workflow to validate, measure, and improve your prompts:
  • Datasets — Test prompts against collections of input/output pairs to catch regressions and validate behavior
  • Evaluations — Score prompt outputs automatically using custom evaluation functions
  • Experiments — Run prompts against datasets with evals, compare versions, and track performance
  • Annotations — Manually label and score traces for human-in-the-loop evaluation
[Diagram: Testing overview]

How It Works

  1. Create a dataset with test inputs and expected outputs (JSONL format)
  2. Write evaluation functions that score outputs (pass/fail, numeric scores, labels)
  3. Connect them to prompts via test_settings in the prompt frontmatter
  4. Run experiments from the platform or CLI to test all dataset items
  5. Review results — scores, pass rates, and individual outputs in the dashboard
  6. Annotate traces with human judgment for additional insight
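As a rough sketch of steps 1–3, a dataset item and the frontmatter that wires it to an eval might look like the following. The field names here (input, expected_output, dataset, evals) are illustrative, not the canonical AgentMark schema — see the Datasets and Evaluations pages for the exact keys.

```jsonl
{"input": {"question": "What is 2 + 2?"}, "expected_output": "4"}
```

```yaml
---
name: math-helper
test_settings:
  dataset: ./math-questions.jsonl
  evals:
    - exact-match
---
```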

Datasets

Datasets are collections of input/output pairs in JSONL format that define your test cases. Each item specifies the input props for a prompt and an optional expected output for comparison.
  • Create and manage datasets through the platform UI or as local JSONL files
  • Run datasets against prompts to generate bulk results
  • View detailed traces for each run item
Learn more →
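Locally, a dataset is just newline-delimited JSON: one test case per line. A minimal loader sketch — plain Node, not the AgentMark SDK, with assumed field names — shows the shape:

```typescript
import { readFileSync } from "node:fs";

interface DatasetItem {
  input: Record<string, unknown>; // props passed to the prompt
  expected_output?: string;       // optional reference answer for comparison
}

// Parse a JSONL file: one JSON object per non-empty line.
function loadDataset(path: string): DatasetItem[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as DatasetItem);
}
```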

Evaluations

Evaluations are functions that automatically score prompt outputs. Register eval functions in your client config, reference them in prompt frontmatter, and they run automatically during experiments.
  • Score outputs with numeric values, pass/fail status, labels, and reasons
  • Use reference-based, heuristic, or LLM-as-judge approaches
  • View eval results alongside traces in the dashboard
Learn more →
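Conceptually, an eval is a function from an output (and optionally a reference) to a score. A minimal reference-based sketch — the result shape mirrors the fields listed above (score, pass, label, reason), but the exact interface your registered functions must implement comes from the client config docs:

```typescript
interface EvalResult {
  score: number;  // numeric score, 0–1 here
  pass: boolean;  // pass/fail status
  label: string;  // short category label
  reason: string; // human-readable explanation
}

// Reference-based eval: exact string match against the expected output.
function exactMatch(output: string, expected: string): EvalResult {
  const pass = output.trim() === expected.trim();
  return {
    score: pass ? 1 : 0,
    pass,
    label: pass ? "match" : "mismatch",
    reason: pass
      ? "Output matches the expected value exactly."
      : `Expected "${expected}" but got "${output}".`,
  };
}
```

Heuristic and LLM-as-judge evals have the same outward shape; only the scoring logic inside changes.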

Experiments

Experiments run a prompt against a dataset with evaluations. Use them to validate prompt changes, compare model configurations, and enforce quality thresholds before deploying.
  • Run from the platform UI or with the agentmark run-experiment CLI command
  • Output results as tables, CSV, JSON, or JSONL
  • Set pass-rate thresholds to gate deployments
Learn more →
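The pass-rate gate in the last bullet can be reproduced in CI by post-processing the experiment's JSON output. A sketch, assuming each result row carries a boolean pass field (the actual output schema may differ — inspect a real run first):

```typescript
interface ResultRow {
  pass: boolean; // assumed field name; check the real experiment output
}

// Fraction of rows that passed their evals.
function passRate(rows: ResultRow[]): number {
  if (rows.length === 0) return 0;
  return rows.filter((r) => r.pass).length / rows.length;
}

// True when the run meets the minimum pass-rate threshold.
function gate(rows: ResultRow[], threshold: number): boolean {
  return passRate(rows) >= threshold;
}
```

In a pipeline, a failing gate would map to a non-zero exit code (process.exit(1)) after the experiment run, blocking the deploy.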

Annotations

Annotations let team members manually score and label individual traces in the dashboard. Use them for human-in-the-loop evaluation, edge case documentation, and creating training datasets from production data.
  • Add scores, labels, and detailed reasoning to any span
  • Complement automated evals with human judgment
  • Review annotations alongside automated scores in the Evaluation tab
Learn more →

Have Questions?

We’re here to help! Choose the best way to reach us: