AgentMark gives you two ways to test prompts, and they share the same building blocks. In Cloud, you run and review experiments in the AgentMark Dashboard: datasets and score configs are synced from your repo through the git deployment pipeline, and your deployed handler runs the evals. In Local, you keep datasets as JSONL files alongside your prompts, write eval functions in code, and run experiments from the CLI. The Dashboard experiment views are the same shared UI components the local dev server renders, so Cloud and self-hosted Local look the same. The difference is the data source (git-synced versus local files) and who runs the eval handler.

Why test prompts?
LLM outputs are non-deterministic: the same prompt can produce different results. Testing helps you:
- Catch regressions: know when prompt changes break existing functionality
- Validate quality: ensure outputs meet standards across diverse scenarios
- Measure improvements: quantify whether prompt iterations actually perform better
- Build confidence: deploy changes backed by data, not guesswork
Testing workflow
Cloud:
- Define a dataset in your repo. Add a JSONL file to your agentmark/ directory. Each line is one test case.
- Declare score configs and write evals. Add score configs to agentmark.json under scores, and write the eval functions on your handler.
- Deploy to sync. Push to your connected branch. The deployment pipeline syncs your datasets and score configs to AgentMark Cloud.
Core concepts
Datasets
Collections of test inputs (and optionally expected outputs) that define the scenarios your prompt should handle: common cases, edge cases, and failure modes.
Cloud: Datasets live as JSONL files in your repo and sync to AgentMark Cloud through the deployment pipeline. In the Dashboard you pick a synced dataset when you create an experiment or configure a review queue. Rows are appended through the “Save to dataset” flow during annotation review and through the REST API.
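To make the "one test case per line" layout concrete, here is a minimal Python sketch that writes a small JSONL dataset, appends a row, and reads it back. The `input` and `expected_output` keys and the `qa.jsonl` file name are assumptions chosen for illustration, not a confirmed AgentMark schema; see the Datasets page for the actual row format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: the `input` / `expected_output` keys are assumed
# for illustration and may not match the real dataset schema.
ROWS = [
    {"input": {"question": "What is 2 + 2?"}, "expected_output": "4"},
    {"input": {"question": "Capital of France?"}, "expected_output": "Paris"},
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def append_row(path: Path, row: dict) -> None:
    """Append a single test case, for example one found during review."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Parse each non-empty line as one test case."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def demo() -> list[dict]:
    """Round-trip the sample rows through a temporary JSONL file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "qa.jsonl"  # hypothetical file name
        write_jsonl(path, ROWS)
        append_row(
            path,
            {"input": {"question": "Largest planet?"}, "expected_output": "Jupiter"},
        )
        return read_jsonl(path)
```

Because each line is an independent JSON object, appending a row never requires rewriting the rest of the file, which keeps diffs small when the dataset is version-controlled next to your prompts.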
Evaluations
Functions that score prompt outputs and determine pass/fail status. Define your success criteria: what makes an output correct, high-quality, or acceptable.
Cloud: Score configs are declared in agentmark.json under scores and synced to AgentMark Cloud through the deployment pipeline. Eval functions run during experiments on your deployed handler. In the New Experiment dialog you select which registered evals to run, and results appear as per-row scores and aggregates in the experiment detail view.
Experiments
Experiments run a prompt against a dataset with evaluations. Use them to validate prompt changes, compare model configurations, and enforce quality thresholds.
Cloud: Run experiments from the Experiments page in the Dashboard. Review results with per-row score drill-down, aggregate metrics, and charts, and compare runs side by side.
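Conceptually, an experiment is a loop: run the prompt on each dataset row, score the output with each eval, then aggregate. The Python sketch below illustrates that idea only. `EvalResult`, `exact_match`, `contains_expected`, `fake_prompt`, and `run_experiment` are hypothetical names, the row keys are assumed, and the model call is replaced by a canned stand-in; none of this is the AgentMark SDK or handler API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    """Hypothetical eval result: a numeric score plus a pass/fail flag."""
    score: float
    passed: bool


EvalFn = Callable[[str, str], EvalResult]


def exact_match(output: str, expected: str) -> EvalResult:
    """Pass only when output equals expected (case-insensitive, trimmed)."""
    ok = output.strip().lower() == expected.strip().lower()
    return EvalResult(1.0 if ok else 0.0, ok)


def contains_expected(output: str, expected: str) -> EvalResult:
    """Pass when the expected text appears anywhere in the output."""
    ok = expected.strip().lower() in output.lower()
    return EvalResult(1.0 if ok else 0.0, ok)


def fake_prompt(inputs: dict) -> str:
    """Stand-in for a real model call so the sketch is deterministic."""
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "The capital is Paris.",
    }
    return canned.get(inputs["question"], "")


def run_experiment(
    dataset: list[dict],
    prompt: Callable[[dict], str],
    evals: dict[str, EvalFn],
) -> dict:
    """Score every row with every eval, then compute a pass rate per eval."""
    rows = []
    for case in dataset:
        output = prompt(case["input"])
        scores = {
            name: fn(output, case["expected_output"]) for name, fn in evals.items()
        }
        rows.append({"input": case["input"], "output": output, "scores": scores})
    pass_rate = {
        name: sum(r["scores"][name].passed for r in rows) / len(rows)
        for name in evals
    } if rows else {}
    return {"rows": rows, "pass_rate": pass_rate}


DATASET = [
    {"input": {"question": "What is 2 + 2?"}, "expected_output": "4"},
    {"input": {"question": "Capital of France?"}, "expected_output": "Paris"},
]
EVALS = {"exact_match": exact_match, "contains_expected": contains_expected}
```

Calling `run_experiment(DATASET, fake_prompt, EVALS)` yields per-row scores plus one aggregate per eval. In this toy data the second answer is a full sentence, so it fails `exact_match` but passes `contains_expected`, which is the kind of difference per-row drill-down is meant to surface.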
Annotations
Cloud feature. Annotations are available in the AgentMark Dashboard.
Testing strategies
- Start small (5-10 cases), then grow with real data
- Test multiple dimensions — accuracy, completeness, tone, format
- Version control everything — datasets live alongside prompts in your repo
- Run in CI/CD — gate deployments on pass-rate thresholds
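Gating a deployment on a pass-rate threshold comes down to turning eval outcomes into a process exit code. The sketch below shows one way to do that in plain Python. `pass_rate`, `gate`, and the 0.9 default threshold are hypothetical choices for illustration; how you obtain the per-row pass/fail values depends on how you run your experiments.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing rows; an empty run counts as 0.0 so it fails the gate."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def gate(results: list[bool], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    rate = pass_rate(results)
    print(f"pass rate {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1
```

In a CI script you would pass the result to `sys.exit`, for example `sys.exit(gate(outcomes))`, so that a non-zero code fails the pipeline step. Treating an empty result set as a failure is a deliberate choice here: a run that produced no rows should not silently pass.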
Programmatic access
Query datasets, experiments, runs, and prompt execution logs through the REST API, or from an IDE agent via the agentmark-mcp MCP server. Use either to build custom reporting, export evaluation results to external tools, or integrate experiment data into CI/CD pipelines.
Cloud and Local share the same /v1/* wire contract. A small number of routes are environment-specific; see API reference → Available endpoints for the Where column.
Next steps
- Datasets: Create test datasets
- Writing Evals: Write evaluation functions
- Running Experiments: Execute tests with the CLI or Dashboard
- Annotations: Human-in-the-loop scoring