AgentMark Experiments

Write automated evaluators in code or via the CLI. Use the Dashboard for human annotations, shared experiment results, and visual score comparisons.

Why Test Prompts?

LLM outputs are non-deterministic — the same prompt can produce different results. Testing helps you:
  • Catch regressions — Know when prompt changes break existing functionality
  • Validate quality — Ensure outputs meet standards across diverse scenarios
  • Measure improvements — Quantify whether prompt iterations actually perform better
  • Build confidence — Deploy changes backed by data, not guesswork

Testing Workflow

1. Create a Dataset: Define test inputs in a JSONL file. Each line is one test case.
2. Write Evaluations: Create eval functions that score outputs. Register them in your client.
3. Connect to Prompts: Add test_settings.dataset and test_settings.evals to your prompt frontmatter.
4. Run Experiments: Execute agentmark run-experiment to test your prompt against the dataset.
Prerequisites: You must have agentmark dev running in a separate terminal before running experiments.
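The connection step lives in the prompt frontmatter. A minimal sketch, assuming a JSONL dataset file and a registered eval named `accuracy` (the file name and eval name here are illustrative):

```yaml
---
# Hypothetical prompt frontmatter; the keys follow the
# test_settings.dataset / test_settings.evals fields described above.
test_settings:
  dataset: sentiment.jsonl
  evals:
    - accuracy
---
```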

Core Concepts

Datasets

Collections of test inputs (and optionally expected outputs) stored as JSONL files. Define the scenarios your prompt should handle — common cases, edge cases, failure modes.
{"input": {"text": "Great product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
Create and manage datasets through the Dashboard UI, or use local JSONL files that sync when connected.
Learn more about datasets →

Evaluations

Functions that score prompt outputs and determine pass/fail status. Define your success criteria — what makes an output correct, high-quality, or acceptable.
export const accuracy = async ({ output, expectedOutput }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};
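Evals can score dimensions beyond exact-match accuracy. A sketch of a second eval in the same `({ output }) => { passed, score }` shape; the label set and function name are illustrative, not part of AgentMark's API:

```typescript
// Checks that the output is one of a fixed label set.
// VALID_LABELS and `validLabel` are illustrative names, not AgentMark API.
type EvalResult = { passed: boolean; score: number };

const VALID_LABELS = new Set(["positive", "negative", "neutral"]);

export const validLabel = async ({ output }: { output: string }): Promise<EvalResult> => {
  const ok = VALID_LABELS.has(output.trim().toLowerCase());
  return { passed: ok, score: ok ? 1 : 0 };
};
```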
Learn more about evaluations →

Experiments

Run a prompt against a dataset with evaluations. Use them to validate prompt changes, compare model configurations, and enforce quality thresholds.
Run experiments from the Dashboard and review results with visual score comparisons, charts, and per-item drill-down.
Learn more about running experiments →

Annotations

Cloud feature. Annotations are available in the AgentMark Dashboard.
Manually label and score traces for human-in-the-loop evaluation. Add scores, labels, and detailed reasoning to any span. Complement automated evals with human judgment. Learn more about annotations →

Testing Strategies

  • Start small (5-10 cases), then grow with real data
  • Test multiple dimensions — accuracy, completeness, tone, format
  • Version control everything — datasets live alongside prompts in your repo
  • Run in CI/CD — gate deployments on pass-rate thresholds

Programmatic access

Query datasets, experiments, runs, and prompt execution logs via the REST API or the agentmark api CLI command. Use this to build custom reporting, export evaluation results to external tools, or integrate experiment data into CI/CD pipelines.
# List datasets and experiments from the CLI
agentmark api datasets list
agentmark api experiments list --limit 10

# Get detailed results for a specific experiment
agentmark api experiments get <experimentId>

# List individual runs within an experiment
agentmark api runs list
The local dev server and cloud gateway support the same API endpoints, so you can develop and test integrations locally before deploying.
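One way to wire experiment data into a CI/CD gate is a small script over the run results. A sketch under stated assumptions: the `Run` shape and the 0.9 threshold are hypothetical, so adapt them to the JSON your `agentmark api runs list` call actually returns.

```typescript
// Gate a deployment on the pass rate of an experiment's runs.
// The Run shape here is an assumption, not AgentMark's documented schema.
type Run = { passed: boolean };

export function passRate(runs: Run[]): number {
  if (runs.length === 0) return 0;
  return runs.filter((r) => r.passed).length / runs.length;
}

export function shouldDeploy(runs: Run[], threshold = 0.9): boolean {
  return passRate(runs) >= threshold;
}
```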

Next Steps

  • Datasets: Create test datasets
  • Writing Evals: Write evaluation functions
  • Running Experiments: Execute tests with the CLI or Dashboard
  • Annotations: Human-in-the-loop scoring
