
AgentMark gives you two ways to test prompts, and they share the same building blocks.

In Cloud, you run and review experiments in the AgentMark Dashboard. Datasets and score configs are synced from your repo through the git deployment pipeline, and your deployed handler runs the evals.

In Local, you keep datasets as JSONL files alongside your prompts, write eval functions in code, and run experiments from the CLI.

The Dashboard experiment views are the same shared UI components the local dev server renders, so Cloud and self-hosted Local look the same. The difference is the data source (git-synced versus local files) and who runs the eval handler.

The experiment detail view in the AgentMark Dashboard shows each dataset row’s input, the AI output, the expected output, and evaluator scores, alongside aggregate metrics for the run (average score, average latency, total cost, total tokens).

Why test prompts?

LLM outputs are non-deterministic — the same prompt can produce different results. Testing helps you:
  • Catch regressions — Know when prompt changes break existing functionality
  • Validate quality — Ensure outputs meet standards across diverse scenarios
  • Measure improvements — Quantify whether prompt iterations actually perform better
  • Build confidence — Deploy changes backed by data, not guesswork

Testing workflow

1. Define a dataset in your repo

Add a JSONL file to your agentmark/ directory. Each line is one test case.

2. Declare score configs and write evals

Add score configs to agentmark.json under scores, and write the eval functions on your handler.

3. Deploy to sync

Push to your connected branch. The deployment pipeline syncs your datasets and score configs to AgentMark Cloud.

4. Run an experiment

Open Experiments in the Dashboard, click New Experiment, choose the prompt, dataset, and evaluations, and run. Results stream in live, then open in the experiment detail view.

Core concepts

Datasets

Collections of test inputs (and optionally expected outputs) that define the scenarios your prompt should handle — common cases, edge cases, and failure modes.
{"input": {"text": "Great product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
Datasets live as JSONL files in your repo and sync to AgentMark Cloud through the deployment pipeline. In the Dashboard you pick a synced dataset when you create an experiment or configure a review queue. Rows are appended through the “Save to dataset” flow during annotation review and through the REST API.
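Because each line of the file is an independent JSON object, a malformed row is easy to catch before it reaches an experiment. The following is a minimal sketch, not part of AgentMark: the DatasetRow shape is inferred from the sample rows above, and parseDataset is a hypothetical helper name.

```typescript
// Hypothetical row shape, inferred from the sample rows above (an assumption,
// not a published AgentMark type).
interface DatasetRow {
  input: Record<string, unknown>;
  expected_output?: unknown;
}

// Parse JSONL text into rows. Blank lines are skipped; any other line must be
// a JSON object with an "input" object. Errors carry the 1-based line number.
function parseDataset(jsonl: string): DatasetRow[] {
  const rows: DatasetRow[] = [];
  jsonl.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`Line ${index + 1}: invalid JSON`);
    }
    const row = parsed as Partial<DatasetRow> | null;
    if (
      row === null ||
      typeof row !== "object" ||
      typeof row.input !== "object" ||
      row.input === null
    ) {
      throw new Error(`Line ${index + 1}: missing "input" object`);
    }
    rows.push(row as DatasetRow);
  });
  return rows;
}
```

Running a check like this in a pre-commit hook or CI step is one way to keep a broken line from being synced with the rest of the dataset.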
Learn more about datasets →

Evaluations

Functions that score prompt outputs and determine pass/fail status. Define your success criteria — what makes an output correct, high-quality, or acceptable.
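// Exact-match eval: passes when the output equals the expected output,
// ignoring case and surrounding whitespace.
export const accuracy = async ({ output, expectedOutput }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};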
Score configs are declared in agentmark.json under scores and synced to AgentMark Cloud through the deployment pipeline. Eval functions run during experiments on your deployed handler. In the New Experiment dialog you select which registered evals to run, and results appear as per-row scores and aggregates in the experiment detail view.
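Exact match is strict: an output such as "The sentiment is positive." fails against the label "positive". A looser criterion can be sketched as below. This is an illustration only; the argument and result shapes are copied from the accuracy example above, and scoreContains and normalize are hypothetical names.

```typescript
// Shapes copied from the accuracy example above (assumed to be plain strings).
interface EvalArgs {
  output: string;
  expectedOutput: string;
}

interface EvalResult {
  passed: boolean;
  score: number;
}

// Lowercase, trim, and collapse runs of whitespace so that formatting
// differences are not scored as errors.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Passes when the normalized expected label appears anywhere in the output.
// An empty expected label never passes, since every string contains "".
function scoreContains({ output, expectedOutput }: EvalArgs): EvalResult {
  const expected = normalize(expectedOutput);
  const match = expected !== "" && normalize(output).indexOf(expected) !== -1;
  return { passed: match, score: match ? 1 : 0 };
}
```

To match the async shape of the accuracy eval, the function can be wrapped in an async arrow function. Note the trade-off: a substring check also passes "not positive" against "positive", so it suits constrained label outputs better than free-form text.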
Learn more about evaluations →

Experiments

An experiment runs a prompt against a dataset with evaluations. Use experiments to validate prompt changes, compare model configurations, and enforce quality thresholds.
Run experiments from the Experiments page in the Dashboard. Review results with per-row score drill-down, aggregate metrics, and charts, and compare runs side by side.
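The aggregate metrics are simple reductions over the per-row results. The sketch below shows how average score, average latency, total cost, and total tokens could be derived from rows. The RowResult field names are assumptions made for illustration, not the Dashboard's actual schema.

```typescript
// Hypothetical per-row result; field names are assumptions for illustration.
interface RowResult {
  score: number;
  passed: boolean;
  latencyMs: number;
  cost: number;
  tokens: number;
}

interface RunSummary {
  rows: number;
  passRate: number;
  averageScore: number;
  averageLatencyMs: number;
  totalCost: number;
  totalTokens: number;
}

// Reduce per-row results to run-level aggregates. An empty run yields zeros
// rather than NaN so the summary is always safe to display.
function summarize(results: RowResult[]): RunSummary {
  const n = results.length;
  const sum = (pick: (r: RowResult) => number): number =>
    results.reduce((total, r) => total + pick(r), 0);
  const mean = (total: number): number => (n === 0 ? 0 : total / n);
  return {
    rows: n,
    passRate: mean(sum((r) => (r.passed ? 1 : 0))),
    averageScore: mean(sum((r) => r.score)),
    averageLatencyMs: mean(sum((r) => r.latencyMs)),
    totalCost: sum((r) => r.cost),
    totalTokens: sum((r) => r.tokens),
  };
}
```

Computing the same summary for two runs gives a quick side-by-side comparison outside the Dashboard.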
Learn more about running experiments →

Annotations

Cloud feature. Annotations are available in the AgentMark Dashboard.
Manually label and score traces for human-in-the-loop evaluation. Add scores, labels, and detailed reasoning to any span. Complement automated evals with human judgment.
Learn more about annotations →

Testing strategies

  • Start small (5-10 cases), then grow with real data
  • Test multiple dimensions — accuracy, completeness, tone, format
  • Version control everything — datasets live alongside prompts in your repo
  • Run in CI/CD — gate deployments on pass-rate thresholds
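Gating on a pass-rate threshold comes down to one comparison once the per-row pass/fail flags are available. The sketch below is a hypothetical helper, not an AgentMark CLI feature; how the flags are obtained (CLI output, REST API) is left open.

```typescript
interface GateResult {
  ok: boolean;
  passRate: number;
  message: string;
}

// Decide whether a run clears the threshold (a fraction between 0 and 1).
// An empty run fails the gate: no evidence is not treated as a pass.
function checkPassRate(passedFlags: boolean[], threshold: number): GateResult {
  if (threshold < 0 || threshold > 1) {
    throw new Error("threshold must be between 0 and 1");
  }
  const total = passedFlags.length;
  const passed = passedFlags.filter((flag) => flag).length;
  const passRate = total === 0 ? 0 : passed / total;
  const ok = total > 0 && passRate >= threshold;
  const message =
    `${passed}/${total} rows passed (${(passRate * 100).toFixed(1)}%), ` +
    `threshold ${(threshold * 100).toFixed(1)}%`;
  return { ok, passRate, message };
}
```

In a CI script, printing the message and exiting with a non-zero status when ok is false is enough to block the deployment step.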

Programmatic access

Query datasets, experiments, runs, and prompt execution logs through the REST API, or from an IDE agent via the agentmark-mcp MCP server. Use either to build custom reporting, export evaluation results to external tools, or integrate experiment data into CI/CD pipelines.
# List datasets and experiments from the local dev server
curl "http://localhost:9418/v1/datasets"
curl "http://localhost:9418/v1/experiments?limit=10"

# Get detailed results for a specific experiment
curl "http://localhost:9418/v1/experiments/<experimentId>"

# List traces produced by a specific experiment run
# (filter /v1/traces by dataset_run_id — the former /v1/runs/{runId}/traces
# endpoint is deprecated on Local and returns 501 on Cloud)
curl "http://localhost:9418/v1/traces?dataset_run_id=<runId>"

# Against Cloud, set the auth + app headers:
curl "https://api.agentmark.co/v1/experiments?limit=10" \
  -H "Authorization: Bearer $AGENTMARK_API_KEY" \
  -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"
The local dev server and the AgentMark Cloud gateway share the same /v1/* wire contract. A small number of routes are environment-specific — see API reference → Available endpoints for the Where column.
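Since both environments expose the same /v1/* routes, a client only needs to vary the base URL and headers. The following sketch builds the request for either target. The URLs and header names come from the curl examples above; ApiTarget and buildRequest are hypothetical names, and the response shape is not assumed.

```typescript
// Where to send requests. apiKey and appId are only needed against Cloud.
interface ApiTarget {
  baseUrl: string;
  apiKey?: string;
  appId?: string;
}

interface ApiRequest {
  url: string;
  headers: Record<string, string>;
}

// Build the URL and headers for a /v1/* route on either Local or Cloud.
function buildRequest(
  target: ApiTarget,
  path: string,
  query: Record<string, string | number> = {}
): ApiRequest {
  const pairs = Object.keys(query).map(
    (key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(query[key]))}`
  );
  // Strip trailing slashes from the base URL so the path joins cleanly.
  const url =
    target.baseUrl.replace(/\/+$/, "") +
    path +
    (pairs.length > 0 ? `?${pairs.join("&")}` : "");
  const headers: Record<string, string> = {};
  if (target.apiKey !== undefined) {
    headers["Authorization"] = `Bearer ${target.apiKey}`;
  }
  if (target.appId !== undefined) {
    headers["X-Agentmark-App-Id"] = target.appId;
  }
  return { url, headers };
}
```

The returned url and headers can be passed to fetch or any other HTTP client; switching from the local dev server to Cloud then means changing only the ApiTarget value.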

Next steps

  • Datasets: Create test datasets
  • Writing Evals: Write evaluation functions
  • Running Experiments: Execute tests with the CLI or Dashboard
  • Annotations: Human-in-the-loop scoring
