
Write automated evaluators in code and run them from the CLI. Use the Dashboard for human annotations, shared experiment results, and visual score comparisons.

[Image: Experiment detail view in the AgentMark Dashboard]

The experiment detail view shows each dataset row’s input, the AI output, the expected output, and evaluator scores, alongside aggregate metrics for the run (average score, average latency, total cost, total tokens).

Why test prompts?

LLM outputs are non-deterministic — the same prompt can produce different results. Testing helps you:
  • Catch regressions — Know when prompt changes break existing functionality
  • Validate quality — Ensure outputs meet standards across diverse scenarios
  • Measure improvements — Quantify whether prompt iterations actually perform better
  • Build confidence — Deploy changes backed by data, not guesswork

Testing workflow

1. Create a dataset
   Define test inputs in a JSONL file. Each line is one test case.
2. Write evaluations
   Create eval functions that score outputs. Register them in your client.
3. Connect to prompts
   Add test_settings.dataset and test_settings.evals to your prompt frontmatter.
4. Run experiments
   Execute npx agentmark run-experiment to test your prompt against the dataset.

Prerequisites: You must have npx agentmark dev running in a separate terminal before running experiments.
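Tying steps 1–4 together, a prompt's frontmatter might reference its dataset and evals roughly like this. This is a minimal sketch: the prompt name, file path, and eval name are illustrative assumptions, not values taken from this page.

```yaml
---
name: sentiment-classifier              # hypothetical prompt name
test_settings:
  dataset: datasets/sentiment.jsonl     # assumed path to the JSONL dataset
  evals:
    - accuracy                          # eval function registered in your client
---
```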

Core concepts

Datasets

Collections of test inputs (and optionally expected outputs) stored as JSONL files. Define the scenarios your prompt should handle — common cases, edge cases, failure modes.
{"input": {"text": "Great product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
Create and manage datasets through the Dashboard UI, or use local JSONL files that sync when connected.
Learn more about datasets →
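Because each JSONL line is an independent JSON object, loading a dataset amounts to splitting on newlines and parsing each non-empty line. A minimal TypeScript sketch; the DatasetRow shape mirrors the example rows above and is not an official AgentMark type:

```typescript
// Parse a JSONL dataset string into typed rows.
// DatasetRow is a hypothetical shape based on the example above.
interface DatasetRow {
  input: { text: string };
  expected_output?: string;
}

function parseDataset(jsonl: string): DatasetRow[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank lines
    .map((line) => JSON.parse(line) as DatasetRow);
}

const raw = [
  '{"input": {"text": "Great product!"}, "expected_output": "positive"}',
  '{"input": {"text": "Terrible experience"}, "expected_output": "negative"}',
].join("\n");

const rows = parseDataset(raw);
console.log(rows.length); // 2
console.log(rows[0].expected_output); // "positive"
```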

Evaluations

Functions that score prompt outputs and determine pass/fail status. Define your success criteria — what makes an output correct, high-quality, or acceptable.
export const accuracy = async ({ output, expectedOutput }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};
Learn more about evaluations →
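Evals need not be strict exact-match checks. As a sketch, here is a second evaluator that returns partial credit; the { passed, score } return shape follows the accuracy example above, while the evaluator name and word budget are arbitrary assumptions for illustration:

```typescript
// Hypothetical length evaluator, modeled on the accuracy example above.
// Returns a fractional score rather than strict pass/fail.
export const conciseness = async ({ output }: { output: string }) => {
  const words = output.trim().split(/\s+/).filter(Boolean).length;
  const limit = 50; // arbitrary word budget, assumed for illustration
  const score = Math.min(1, limit / Math.max(words, 1));
  return { passed: words <= limit, score };
};

conciseness({ output: "Short and to the point." }).then((r) =>
  console.log(r.passed, r.score) // true 1
);
```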

Experiments

Run a prompt against a dataset with evaluations. Use them to validate prompt changes, compare model configurations, and enforce quality thresholds.
Run experiments from the Dashboard and review results with visual score comparisons, charts, and per-item drill-down.
Learn more about running experiments →

Annotations

Cloud feature. Annotations are available in the AgentMark Dashboard.
Manually label and score traces for human-in-the-loop evaluation. Add scores, labels, and detailed reasoning to any span. Complement automated evals with human judgment.
Learn more about annotations →

Testing strategies

  • Start small (5-10 cases), then grow with real data
  • Test multiple dimensions — accuracy, completeness, tone, format
  • Version control everything — datasets live alongside prompts in your repo
  • Run in CI/CD — gate deployments on pass-rate thresholds
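The CI/CD gating idea above can be sketched as a small script that computes a pass rate over evaluator results and fails the job below a threshold. The EvalResult shape and the threshold are stand-ins, not the actual experiment output format:

```typescript
// Gate a deploy on eval pass rate. The `results` shape is hypothetical;
// adapt it to whatever your experiment export actually contains.
interface EvalResult {
  passed: boolean;
}

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

const results: EvalResult[] = [
  { passed: true },
  { passed: true },
  { passed: true },
];

const threshold = 0.9; // assumed quality bar
const rate = passRate(results);
console.log(`pass rate: ${(rate * 100).toFixed(0)}%`); // pass rate: 100%
if (rate < threshold) {
  process.exitCode = 1; // non-zero exit fails the CI job
}
```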

Programmatic access

Query datasets, experiments, runs, and prompt execution logs via the REST API or the agentmark api CLI command. Use this to build custom reporting, export evaluation results to external tools, or integrate experiment data into CI/CD pipelines.
# List datasets and experiments from the CLI
npx agentmark api datasets list
npx agentmark api experiments list --limit 10

# Get detailed results for a specific experiment
npx agentmark api experiments get <experimentId>

# List traces produced by a specific experiment run
# (filter /v1/traces by dataset_run_id — the former /v1/runs/{runId}/traces
# endpoint is deprecated on Local and returns 501 on Cloud)
curl "http://localhost:9418/v1/traces?dataset_run_id=<runId>"
The local dev server and the AgentMark Cloud gateway share the same /v1/* wire contract. A small number of routes are environment-specific — see API reference → Available endpoints for the Where column.
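From code, the same traces query can be assembled with the standard URL API and issued with fetch. The base URL and dataset_run_id parameter come from the curl example above; the run id value is a placeholder:

```typescript
// Build the traces query shown in the curl example above.
// http://localhost:9418 is the local dev server address from that example.
function tracesUrl(baseUrl: string, runId: string): string {
  const url = new URL("/v1/traces", baseUrl);
  url.searchParams.set("dataset_run_id", runId);
  return url.toString();
}

const endpoint = tracesUrl("http://localhost:9418", "run_123"); // placeholder id
console.log(endpoint);
// → http://localhost:9418/v1/traces?dataset_run_id=run_123
// fetch(endpoint).then((res) => res.json()).then(console.log);
```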

Next steps

Datasets

Create test datasets

Writing Evals

Write evaluation functions

Running Experiments

Execute tests with the CLI or Dashboard

Annotations

Human-in-the-loop scoring
