Run prompts against datasets with automatic evaluation to validate quality and consistency. In Cloud, you create and review experiments in the AgentMark Dashboard against git-synced datasets. In Local, you run experiments from the CLI against JSONL files on disk.

Experiments in the Dashboard

The Dashboard runs experiments against datasets and score configs that are synced from your repo through the deployment pipeline, using the evals registered on your deployed handler. Open Experiments (flask icon) in the sidebar to get started.
Running an experiment from the Dashboard requires the app to be connected to a deployed handler. If it isn’t, the dialog returns an “app not connected” error. See Deployment for connecting a handler.

Browse the experiments list

The Experiments page is a paginated list of every run in your app. Filter it by prompt name and dataset path to find a specific run.

The Experiments list shows each run as a row, with filters for prompt name and dataset path and a New Experiment button in the top-right. Comparison charts (average latency, total cost, and average score across the runs) sit above the list. Select 2 to 3 runs to enable Compare.

Running an experiment requires the experiment.run permission.

Create and run an experiment

Click New Experiment to open the dialog.

The New Experiment dialog has four fields: Name, Prompt, Dataset, and Evaluations (a multi-select populated from the evals your deployed handler registers). Selecting a prompt auto-fills the dataset and evaluations from its test_settings frontmatter.

The Name must start with a letter and may contain letters, numbers, hyphens, and underscores, up to 100 characters.
1. Name the experiment

Enter a Name that starts with a letter.

2. Choose a prompt

Pick the Prompt to test. The Dataset and Evaluations auto-fill from its test_settings.

3. Confirm dataset and evaluations

Adjust Dataset and Evaluations if you want to run against a different dataset or eval set.

4. Run

Click Run Experiment. Results stream in live, then open in the experiment detail view.
As the run executes, results stream in row by row, and a summary reports the item count and total tokens when it finishes. Open the experiment to review the full results.
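
The Name rule described for the New Experiment dialog can be checked before you type a name into the form. The following is a minimal sketch, assuming the rule refers to ASCII letters and digits; the pattern and the function name `is_valid_experiment_name` are illustrative and are not part of AgentMark.

```python
import re

# Assumed pattern derived from the rule stated for the dialog: a leading
# letter, then letters, numbers, hyphens, or underscores, 100 characters max.
# The Dashboard's own validation may differ (for example, on non-ASCII letters).
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,99}")


def is_valid_experiment_name(name: str) -> bool:
    """Return True if the name satisfies the assumed naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["baseline-v2", "2nd-run", "has space", "a" * 101]:
        print(repr(candidate[:20]), is_valid_experiment_name(candidate))
```

A check like this is useful mainly when experiment names are generated by a script or a naming convention, so that invalid names are caught before they reach the dialog.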

Read the experiment detail

Click any experiment to open its detail view.

The experiment detail view lists each dataset row in a table with these columns: Item, Input, Output, Expected Output, Model, latency, cost, tokens, Scores, and a Trace link. Above the table, aggregate metrics summarize the run (items, average score, total cost, average latency, total tokens) alongside charts.

Use Send to Review Queue on the detail page to send the experiment’s items to an annotation queue for human review. See Human annotation.
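
To make the aggregate metrics concrete, here is a rough sketch of how such a summary could be derived from per-row values. It assumes one numeric score per row and uses a hypothetical `Row` type with made-up field names; the Dashboard's actual data model and its handling of multiple scores per item are not described here and may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Row:
    """Hypothetical per-item result; field names are illustrative only."""
    score: float       # assumed: a single numeric score per row
    cost: float        # cost of the row, in the currency your provider reports
    latency_ms: float  # latency of the row, in milliseconds
    tokens: int        # total tokens used by the row


def summarize(rows: list) -> dict:
    """Compute items, average score, total cost, average latency, total tokens."""
    if not rows:
        # Averages are undefined for an empty run, so report None for them.
        return {"items": 0, "average_score": None, "total_cost": 0.0,
                "average_latency_ms": None, "total_tokens": 0}
    return {
        "items": len(rows),
        "average_score": mean(r.score for r in rows),
        "total_cost": sum(r.cost for r in rows),
        "average_latency_ms": mean(r.latency_ms for r in rows),
        "total_tokens": sum(r.tokens for r in rows),
    }


if __name__ == "__main__":
    sample = [Row(1.0, 0.002, 100.0, 150), Row(0.5, 0.004, 300.0, 250)]
    print(summarize(sample))
```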

Compare runs

Select 2 to 3 experiments in the list, then click Compare to view them side by side.

The comparison view places 2 to 3 runs side by side and tags each item as Improved, Regressed, or Unchanged, so you can see exactly which cases a prompt change fixed or broke.
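
One way to think about the Improved, Regressed, and Unchanged tags is as a per-item comparison of scores between a baseline run and a candidate run. The sketch below assumes exactly that: a single numeric score per item, keyed by a shared item identifier, where higher is better. How the Dashboard actually decides each tag is not documented here, so treat `tag_item`, `compare_runs`, and the `tolerance` parameter as hypothetical.

```python
def tag_item(baseline_score: float, candidate_score: float,
             tolerance: float = 1e-9) -> str:
    """Tag one item by the sign of its score change (assumed: higher is better)."""
    delta = candidate_score - baseline_score
    if delta > tolerance:
        return "Improved"
    if delta < -tolerance:
        return "Regressed"
    return "Unchanged"


def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Tag every item that appears in both runs; items in only one run are skipped."""
    shared = baseline.keys() & candidate.keys()
    return {item: tag_item(baseline[item], candidate[item])
            for item in sorted(shared)}


if __name__ == "__main__":
    before = {"item-1": 0.5, "item-2": 1.0, "item-3": 0.7}
    after = {"item-1": 0.9, "item-2": 0.4, "item-3": 0.7, "item-4": 1.0}
    for item, tag in compare_runs(before, after).items():
        print(item, tag)
```

Restricting the comparison to shared items keeps the tags meaningful when the two runs used datasets of different sizes.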

Next steps

Datasets: Create test datasets

Evaluations: Write evaluation functions

Testing overview: Learn testing concepts
