Evaluations (evals) are functions that score prompt outputs and determine pass/fail status. You declare what each eval scores on in agentmark.json, write the function in code, and connect the two by name. In Cloud, score configs sync to the Dashboard and you pick evals to run from the New Experiment dialog. In Local, you register eval functions in your client and run them through the CLI.
Start with evals: build your evaluation framework before writing prompts. Evals give you a way to measure how well a prompt performs and a baseline to iterate against.

Evals in the Dashboard

Cloud evals come from two pieces that you maintain in your repo:
  • Score configs declared in agentmark.json under scores. These define what each eval scores on (boolean, numeric, or categorical) and sync to AgentMark Cloud through the deployment pipeline.
  • Eval functions on your deployed handler. These run during experiments and produce the scores.
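
To illustrate how an eval function and a score config can be connected by name, here is a minimal sketch. It keeps eval functions in a plain dictionary whose keys match the names declared under scores. The function signature, the result shape (score, passed), and the EVALS registry are hypothetical and chosen only for illustration; they are not AgentMark's SDK API, so see the Local tab for the actual registration call.

```python
from typing import Callable, Dict, List, TypedDict


class EvalResult(TypedDict):
    score: float   # numeric score for one dataset row
    passed: bool   # pass/fail status for that row


# Hypothetical eval signature: (model output, expected output) -> EvalResult.
EvalFn = Callable[[str, str], EvalResult]


def accuracy(output: str, expected: str) -> EvalResult:
    """Boolean-style eval: pass when the output matches the expected answer."""
    ok = output.strip().lower() == expected.strip().lower()
    return {"score": 1.0 if ok else 0.0, "passed": ok}


# Keys mirror the eval names declared under "scores" in agentmark.json.
EVALS: Dict[str, EvalFn] = {"accuracy": accuracy}


def run_evals(names: List[str], output: str, expected: str) -> Dict[str, EvalResult]:
    """Run the selected evals by name and collect their results."""
    return {name: EVALS[name](output, expected) for name in names}
```

Looking functions up by name means a typo in either place fails loudly with a KeyError instead of silently skipping an eval.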

Declare score configs

Add a scores block to agentmark.json. Each key is an eval name that your eval functions return scores for.
agentmark.json
{
  "scores": {
    "accuracy": {
      "type": "boolean",
      "description": "Was the response factually correct?"
    },
    "tone": {
      "type": "categorical",
      "description": "Response tone classification",
      "categories": [
        { "label": "professional", "value": 1 },
        { "label": "casual", "value": 0.5 },
        { "label": "inappropriate", "value": 0 }
      ]
    }
  }
}
Push to your connected branch so the deployment pipeline syncs the score configs to AgentMark Cloud. Once synced, they stay available in the Dashboard across deployments. See Project configuration for the full scores schema, and the Local tab for writing the eval functions themselves.
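
As a rough sketch of what the declared types imply, the snippet below parses the same scores block and converts a raw eval result into a number according to its declared type. The to_numeric helper is hypothetical, and mapping a boolean to 1.0 or 0.0 is an assumption made for illustration; the source does not specify how AgentMark consumes these values internally.

```python
import json

# Same "scores" block as the agentmark.json example above.
CONFIG = json.loads("""
{
  "scores": {
    "accuracy": {
      "type": "boolean",
      "description": "Was the response factually correct?"
    },
    "tone": {
      "type": "categorical",
      "description": "Response tone classification",
      "categories": [
        { "label": "professional", "value": 1 },
        { "label": "casual", "value": 0.5 },
        { "label": "inappropriate", "value": 0 }
      ]
    }
  }
}
""")


def to_numeric(name, raw, config=CONFIG):
    """Convert a raw eval result to a number based on its declared score type."""
    spec = config["scores"][name]  # KeyError if the eval name was never declared
    kind = spec["type"]
    if kind == "boolean":
        if not isinstance(raw, bool):
            raise TypeError(f"{name}: expected bool, got {type(raw).__name__}")
        return 1.0 if raw else 0.0  # assumption: True maps to 1.0, False to 0.0
    if kind == "categorical":
        values = {c["label"]: c["value"] for c in spec["categories"]}
        if raw not in values:
            raise ValueError(f"{name}: unknown label {raw!r}")
        return float(values[raw])
    if kind == "numeric":
        return float(raw)
    raise ValueError(f"{name}: unsupported score type {kind!r}")
```

A check like this can catch an eval that returns a label or type the config never declared, before the results reach an experiment.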

Pick evals when you run an experiment

In the New Experiment dialog, the Evaluations field is a multi-select populated from the evals your deployed handler registers. Selecting a prompt auto-fills the evaluations from its test_settings frontmatter.

Figure: The New Experiment dialog in the AgentMark Dashboard, showing the Evaluations multi-select.

See results as scores

Eval results appear as per-row scores and run-level aggregates in the experiment detail view. The view lists each dataset row with its evaluator scores, plus aggregate metrics for the run such as average score. See Running experiments for the full detail view.

Figure: Experiment detail view showing per-row evaluator scores and aggregate metrics.
The same score configs power human annotation. Reviewers score traces against the configs you declare in agentmark.json. See Human annotation.
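
To make the relationship between per-row scores and run-level aggregates concrete, the sketch below computes an average score per eval from a list of row results. The row shape and the sample values are made up for illustration, and the pass rate is an extra metric added here as an assumption; the source only names average score as an example aggregate.

```python
from statistics import mean

# Made-up sample: one dict per dataset row, keyed by eval name.
ROWS = [
    {"accuracy": {"score": 1.0, "passed": True}},
    {"accuracy": {"score": 0.0, "passed": False}},
    {"accuracy": {"score": 1.0, "passed": True}},
    {"accuracy": {"score": 1.0, "passed": True}},
]


def aggregate(rows):
    """Return the average score and pass rate for each eval name."""
    by_name = {}
    for row in rows:
        for name, result in row.items():
            by_name.setdefault(name, []).append(result)
    return {
        name: {
            "average_score": mean(r["score"] for r in results),
            # Assumption: pass rate is the fraction of rows with passed == True.
            "pass_rate": sum(r["passed"] for r in results) / len(results),
        }
        for name, results in by_name.items()
    }
```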

Next steps

  • Datasets: Create test datasets
  • Running Experiments: Run your evaluations
  • Testing Overview: Learn testing concepts
