AgentMark Experiments

Testing ensures your prompts and agents work reliably across different inputs before reaching production.

Why Test Prompts?

LLM outputs are non-deterministic—the same prompt can produce different results. Without testing, you can't know with confidence whether your prompts work correctly, or whether a change improves or breaks behavior. Testing helps you:
  • Catch regressions - Know immediately when prompt changes break existing functionality
  • Validate quality - Ensure outputs meet your standards across diverse scenarios
  • Measure improvements - Quantify whether prompt iterations actually perform better
  • Build confidence - Deploy changes backed by data, not guesswork

Testing Workflow

Follow this order when setting up tests:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  1. Create  │────▶│  2. Write   │────▶│  3. Connect │────▶│  4. Run     │
│  Datasets   │     │  Evals      │     │  to Prompts │     │  Experiment │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
     │                    │                    │                    │
     ▼                    ▼                    ▼                    ▼
  JSONL files       TypeScript/         test_settings         CLI command
  with test         Python eval         in frontmatter        or platform
  inputs            functions
1. Create a Dataset

Define test inputs in a JSONL file. Each line is one test case.
2. Write Evaluations

Create eval functions that score outputs. Register them in your client.
3. Connect to Prompts

Add test_settings.dataset and test_settings.evals to your prompt frontmatter.
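As a sketch, the frontmatter might look like the following. Only the test_settings keys (dataset, evals) come from this page; the prompt name, file path, and eval name are hypothetical placeholders.

```yaml
---
# Hypothetical example; replace the name, path, and eval
# with your own. Only test_settings.dataset and
# test_settings.evals are the documented keys.
name: sentiment
test_settings:
  dataset: ./sentiment.jsonl
  evals:
    - accuracy
---
```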
4. Run Experiments

Execute agentmark run-experiment to test your prompt against the dataset.
Prerequisites: You must have agentmark dev running in a separate terminal before running experiments. The CLI connects to the webhook server at port 9417.

Core Concepts

AgentMark testing has two components that work together:

Datasets

What they are: Collections of test inputs (and optionally expected outputs) stored as JSONL files.

What they do: Define the scenarios your prompt should handle, including common cases, edge cases, and failure modes.

Example:
{"input": {"text": "Great product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
Learn more about datasets →
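Because JSONL is just one JSON object per line, a dataset like the one above can be parsed with a few lines of plain TypeScript. This is an illustrative sketch, not AgentMark's own loader.

```typescript
// Minimal JSONL parser: one JSON object per non-empty line.
// Illustrative only — AgentMark loads datasets for you.
interface TestCase {
  input: { text: string };
  expected_output: string;
}

function parseJsonl(raw: string): TestCase[] {
  return raw
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as TestCase);
}

const raw = [
  '{"input": {"text": "Great product!"}, "expected_output": "positive"}',
  '{"input": {"text": "Terrible experience"}, "expected_output": "negative"}',
].join("\n");

const cases = parseJsonl(raw);
console.log(cases.length);             // 2
console.log(cases[0].expected_output); // positive
```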

Evaluations

What they are: Functions that score prompt outputs and determine pass/fail status.

What they do: Define your success criteria, i.e. what makes an output correct, high-quality, or acceptable.

Example:
export const accuracy = async ({ output, expectedOutput }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};
Learn more about evaluations →
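The accuracy eval above can be exercised directly to sanity-check its logic. A quick sketch (the `{ passed, score }` return shape follows the example above; the inputs here are made up):

```typescript
// Same eval as above: case- and whitespace-insensitive string match.
const accuracy = async ({ output, expectedOutput }: { output: string; expectedOutput: string }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};

// Leading whitespace and casing differences still pass.
const result = await accuracy({ output: " Positive ", expectedOutput: "positive" });
console.log(result.passed); // true
```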

How It Works

1. Create a dataset - Add test cases covering your use cases
2. Write evaluations - Define what "correct" means for your prompt
3. Connect your prompts - Reference your evals + datasets in prompt frontmatter
4. Run experiments - Test your prompt against the dataset
agentmark run-experiment agentmark/sentiment.prompt.mdx
5. Review results - See which test cases passed and why others failed
#  Input        AI Result  Expected  Eval
1  "Great!"     positive   positive  ✅ PASS
2  "Terrible"   negative   negative  ✅ PASS
3  ""           positive   neutral   ❌ FAIL
6. Iterate - Fix failures, improve prompts, add new test cases

Testing Strategies

Start small (5-10 cases), then grow:
  • Common inputs your prompt will handle
  • Edge cases (empty strings, extreme lengths, ambiguous inputs)
  • Known failure modes
Use real data when possible:
  • Anonymized production data
  • Realistic synthetic examples
  • Avoid overly simple test cases that don’t reflect real usage
Test multiple dimensions:
  • Accuracy (is the output correct?)
  • Completeness (does it include all required information?)
  • Tone (is it professional/friendly/appropriate?)
  • Format (does it follow structural requirements?)
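Each dimension can be its own eval. For example, a format check might verify that the output is exactly one of the allowed labels. A hedged sketch in the same style as the accuracy example earlier (the eval name and allowed label set are assumptions, not part of AgentMark):

```typescript
// Format eval sketch: passes only if the output is one of the
// allowed labels. The label set here is a made-up example.
const ALLOWED_LABELS = ["positive", "negative", "neutral"];

const labelFormat = async ({ output }: { output: string }) => {
  const passed = ALLOWED_LABELS.includes(output.trim().toLowerCase());
  return { passed, score: passed ? 1 : 0 };
};

console.log((await labelFormat({ output: "Positive" })).passed); // true
```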
Version control everything:
  • Datasets live alongside prompts in your repo
  • Track changes to test cases over time
  • Reproduce results from any point in history

Types of Testing

Unit testing - Test individual prompts in isolation
Integration testing - Test prompt chains and multi-step workflows
Regression testing - Maintain a suite that must pass before deploying
Continuous testing - Run tests automatically in CI/CD pipelines

Measuring Success

Pass rate: Percentage of test cases that pass all evaluations
Pass rate: 85% (17/20 passed)
Per-evaluation scores: Identify specific weaknesses
accuracy: ✅ PASS (0.95)
completeness: ❌ FAIL (0.60)
tone: ✅ PASS (0.88)
Trends over time: Track whether changes improve or degrade performance
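The pass rate above is a straightforward aggregation over per-case results. A small sketch, assuming each case yields a `{ passed, score }` result like the eval example earlier:

```typescript
// Compute overall pass rate from per-case eval results.
interface CaseResult {
  passed: boolean;
  score: number;
}

function passRate(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// 17 of 20 cases passing → 85%
const results: CaseResult[] = [
  ...Array(17).fill({ passed: true, score: 1 }),
  ...Array(3).fill({ passed: false, score: 0 }),
];
console.log(`Pass rate: ${(passRate(results) * 100).toFixed(0)}%`); // Pass rate: 85%
```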

Best Practices

  • Focus datasets - Create separate datasets for different scenarios, not one massive file
  • Be specific - Clear expected outputs lead to reliable tests; vague expectations create noise
  • Avoid overfitting - Tests should validate general behavior, not memorize specific outputs
  • Test edge cases - Empty inputs, special characters, extreme lengths, ambiguous cases
  • Use meaningful names - sentiment_accuracy is clearer than eval1
  • Keep tests deterministic - If a test randomly passes/fails, fix the evaluation logic

Next Steps