Cloud feature. Annotations are available in the AgentMark Dashboard.
Human annotation adds manual scores, labels, and feedback to your traces. Use it to evaluate subjective quality, flag edge cases, curate training datasets, and calibrate your automated evals. AgentMark supports two annotation workflows:
  • Inline annotation — score a single trace directly from the trace drawer
  • Annotation queues — batch traces into structured review queues with assignment, progress tracking, and multi-reviewer support
Figure: animated walkthrough of the annotation queue review flow (queue list, detail view, and review panel).
The animation shows a reviewer moving through a queue: list view → queue detail with assigned items → side-by-side review panel (trace on the left, score controls on the right) → next item.

When to use human annotation

| Use case | Example | Workflow |
| --- | --- | --- |
| Quality audits | Review a sample of production traces for correctness and tone | Create a queue, add traces, assign to domain experts |
| Edge case triage | Flag and investigate unexpected model behavior | Inline annotation from the trace drawer |
| Dataset curation | Build high-quality test datasets from real production data | Review in queue, save passing traces to a dataset |
| Calibrate automated evals | Align your LLM-as-judge scorers with human judgment | Score the same traces manually that your evals score, compare results |
| Multi-reviewer consensus | Get independent assessments from multiple team members | Set reviewers required > 1 on a queue |

Score types

Score configs define what reviewers score on. They are declared as JSON in your agentmark.json under the top-level scores field and synced to AgentMark Cloud through the deployment pipeline. When creating a queue, you select which score configs to include.
Score configs must be synced to AgentMark Cloud before you can create a queue. Push your changes to the connected branch so the deployment pipeline picks up your agentmark.json scores. Once synced, score configs are always available in the Dashboard — no worker dependency required. See Project configuration for the full scores schema and Evaluations for adding automated eval functions.
Boolean: a binary judgment. The reviewer clicks Pass or Fail.
agentmark.json
{
  "scores": {
    "factual_accuracy": {
      "type": "boolean",
      "description": "Was the response factually correct?"
    }
  }
}
Saved as score 1 (pass) or 0 (fail). Best for clear-cut criteria.
Every score type includes an optional reason field where the reviewer can explain their judgment.
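As a rough illustration of how a boolean judgment maps to a stored score, the sketch below converts a Pass/Fail click into 1 or 0 and carries the optional reason along. The `Annotation` class, its field names, and the `boolean_annotation` helper are hypothetical and exist only for this example; they are not part of AgentMark's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical record shape; field names are assumptions for illustration."""
    name: str
    score: int
    label: str
    reason: Optional[str] = None


def boolean_annotation(name: str, passed: bool, reason: Optional[str] = None) -> Annotation:
    # A boolean score is saved as 1 for pass and 0 for fail.
    return Annotation(
        name=name,
        score=1 if passed else 0,
        label="Pass" if passed else "Fail",
        reason=reason,
    )
```

The reason stays optional so that a reviewer can record a quick judgment without being forced to justify it.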

Inline annotation

Add a score to any trace directly from the trace drawer; no queue is required.
Figure: Evaluations tab in the trace drawer showing inline annotation scores alongside automated eval results.
The Evaluations tab lists every score attached to the selected span, both automated eval results and human annotations. Each row shows the score name, label/value, and reason; annotations carry an annotation badge to distinguish them from automated scores.
1. Open a trace
   Navigate to Traces and click any trace to open the detail drawer.
2. Select a span
   Choose the span you want to annotate from the trace tree.
3. Go to the Evaluations tab
   Click the Evaluations tab in the drawer.
4. Add annotation
   Click Add annotation, fill in the name, label, score, and reason, then click Save.
Inline annotations appear alongside automated eval scores, distinguished by an “annotation” badge.

Annotation queues

For batch review, use annotation queues. Queues let you organize items, assign reviewers, track progress, and require multiple independent reviews.
Figure: Review Queues list showing active queues with progress bars and a pending badge in the sidebar.
The Review Queues page lists every queue in the app with its name, status, progress, and creation time. Filter tabs narrow by status (All, Active, Completed, Archived), and an Assigned to me toggle restricts the list to queues with items assigned to you. A sidebar badge surfaces the total number of pending items across all active queues.

Create a queue

Navigate to Review Queues in the sidebar and click Create Queue.
Figure: Create Review Queue dialog with name, instructions, reviewers required, and score config fields.
The Create Review Queue dialog takes a name, optional description and annotator instructions, the number of independent reviews required per item, and the set of score configs to show to reviewers. A default dataset can be selected to pre-fill the “Save to dataset” action during review.
| Field | Required | Description |
| --- | --- | --- |
| Name | Yes | Descriptive name for the review batch |
| Description | No | Context for what this queue covers |
| Instructions for annotators | No | Guidance shown during review (e.g., “Mark PASS if factually correct and professional”) |
| Reviewers required | Yes | Independent reviews needed per item (default: 1) |
| Score configs | Yes | Which scoring dimensions to show during review |
| Default dataset | No | Pre-selects a dataset for the “Save to dataset” action |
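The required and optional fields can be modeled as a small data class, which may help when preparing queue definitions in scripts before entering them in the dialog. This is a sketch only: the `QueueConfig` class and its attribute names are hypothetical and do not reflect a confirmed AgentMark schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueConfig:
    """Hypothetical model of the Create Review Queue dialog fields."""
    name: str                              # required
    score_configs: List[str]               # required: scoring dimensions to show
    reviewers_required: int = 1            # required, defaults to 1
    description: Optional[str] = None      # optional
    instructions: Optional[str] = None     # optional guidance for annotators
    default_dataset: Optional[str] = None  # optional "Save to dataset" target

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the config looks valid."""
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if not self.score_configs:
            problems.append("at least one score config is required")
        if self.reviewers_required < 1:
            problems.append("reviewers_required must be at least 1")
        return problems
```

For example, `QueueConfig(name="Weekly audit", score_configs=["factual_accuracy"]).validate()` returns an empty list.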

Add items

1. Select traces
   Go to the Traces page and select traces using the checkboxes.
2. Add to queue
   Click Add to Queue in the bulk actions bar, choose a queue, and confirm.

Queue detail

Click any queue to see its items, progress, and reviewer assignments.
Figure: queue detail view showing the items table with status, type, assignment, filter tabs, and multi-reviewer badge.
The queue detail view lists every item with status (pending, completed, skipped), resource type (trace, span, or session), and assignee. Filter tabs across the top narrow by status, and a multi-reviewer badge shows how many independent reviews remain per item.
Filter items using the tabs:
| Tab | Shows |
| --- | --- |
| All | Every item in the queue |
| Pending | Items waiting for review |
| Completed | Reviewed or skipped items |
| Assigned to me | Items assigned to you |
Click the assign icon on any row to assign it to a team member. Use the three-dot menu to archive a queue when review is complete.
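The filter tabs described above can be expressed as simple predicates over item status and assignee. The sketch below assumes each item is a dictionary with hypothetical `status` and `assignee` keys; the Dashboard's internal representation is not documented here.

```python
from typing import Dict, List, Optional


def filter_items(items: List[Dict], tab: str, current_user: Optional[str] = None) -> List[Dict]:
    """Return the items a given tab would show (illustrative model only)."""
    if tab == "all":
        return list(items)
    if tab == "pending":
        return [i for i in items if i["status"] == "pending"]
    if tab == "completed":
        # The Completed tab covers both reviewed and skipped items.
        return [i for i in items if i["status"] in ("completed", "skipped")]
    if tab == "assigned_to_me":
        if current_user is None:
            return []
        return [i for i in items if i.get("assignee") == current_user]
    raise ValueError(f"unknown tab: {tab}")
```

Note that skipped items land under Completed, so the Pending count reflects only work that nobody has touched yet.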

Review workflow

Click Start Review to begin. The review view splits into two panels.
Left panel (trace content):
  • Metadata bar with trace name, latency, cost, tokens, and model
  • Root span input/output formatted as JSON
  • Expandable spans tree; click any span to see its I/O
  • For session items, a conversation timeline showing all turns
Right panel (annotation):
  • Annotator instructions (collapsible, from queue config)
  • Score controls for each configured dimension
  • Prior annotations on this resource (read-only)
  • Save to dataset section with auto-extracted I/O
| Action | Shortcut | What it does |
| --- | --- | --- |
| Complete + Next | Enter | Save scores, mark complete, advance |
| Skip | | Mark as skipped, advance |
| Back | | Return to queue detail |
Dataset items added through the Save to dataset section are staged on the queue while review is in progress. They are committed to the target dataset in a single batch when the queue is marked completed, so saved items will not appear in the dataset until queue completion. This keeps the dataset clean if a review is paused, abandoned, or reverted.
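The staging behavior described above can be pictured as a buffer that is flushed once. The following sketch is a simplified in-memory model, not AgentMark's implementation; the `StagedQueue` class and its method names are hypothetical.

```python
from typing import Dict, List


class StagedQueue:
    """Illustrative model: saved items are held on the queue until completion."""

    def __init__(self, target_dataset: List[Dict]):
        self.target_dataset = target_dataset
        self.staged: List[Dict] = []
        self.completed = False

    def save_to_dataset(self, item: Dict) -> None:
        if self.completed:
            raise RuntimeError("queue is already completed")
        # Staged only; the target dataset is not touched yet.
        self.staged.append(item)

    def complete(self) -> None:
        # Commit every staged item in a single batch, then clear the buffer.
        self.target_dataset.extend(self.staged)
        self.staged.clear()
        self.completed = True
```

Until `complete()` runs, the target dataset is unchanged, which is why saved items do not show up in the dataset mid-review.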

Multi-reviewer

When reviewers required is set above 1, each reviewer annotates independently:
  • The review header shows a progress badge (e.g., “0/2 reviewed”) tracking how many reviewers have completed their assessment
  • Each reviewer sees their own fresh annotation form — they don’t see other reviewers’ scores while annotating
  • An item is only marked complete when the required number of independent reviews is reached
  • The /next endpoint automatically skips items the current reviewer has already reviewed, so each reviewer only sees items they haven’t scored yet
Use multi-reviewer for high-stakes evaluations like safety reviews or fine-tuning dataset curation where a single reviewer’s judgment isn’t sufficient.
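The completion and skipping rules above can be sketched in a few lines. This is a simplified model under stated assumptions: items are dictionaries with a hypothetical `reviewed_by` set, and the function names are invented for illustration. It also assumes that items that have already reached the threshold are no longer offered, which the text above does not state explicitly.

```python
from typing import Dict, List, Optional, Set


def is_complete(reviewed_by: Set[str], reviewers_required: int) -> bool:
    # An item completes once enough distinct reviewers have scored it.
    return len(reviewed_by) >= reviewers_required


def next_item(items: List[Dict], reviewer: str, reviewers_required: int) -> Optional[Dict]:
    """Return the first item this reviewer still needs to score, or None."""
    for item in items:
        reviewed_by = item["reviewed_by"]
        if is_complete(reviewed_by, reviewers_required):
            continue  # assumption: fully reviewed items are not served again
        if reviewer in reviewed_by:
            continue  # this reviewer has already scored the item
        return item
    return None
```

Tracking reviewers as a set means a reviewer who submits twice still counts once toward the threshold.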

Resource types

Queues support three item types:
| Type | When to use | What the reviewer sees |
| --- | --- | --- |
| Trace | Review a complete request | Full trace with expandable per-span I/O |
| Span | Review a single LLM call or tool invocation | Individual span content |
| Session | Review a multi-turn conversation | Conversation timeline across traces |

Programmatic queue management

Annotation queues are fully exposed on the public REST API at /v1/annotation-queues (Cloud only — the local dev server returns 404). CI pipelines can create queues, enqueue traces, and — via the /reviews endpoint — submit annotations through the same path a human reviewer clicks in the Dashboard.
| Method | Path | Purpose |
| --- | --- | --- |
| GET · POST | /v1/annotation-queues | List / create queues |
| GET · PATCH · DELETE | /v1/annotation-queues/{queueId} | Read / update / delete a queue |
| GET · POST | /v1/annotation-queues/{queueId}/items | List or add traces, spans, or sessions |
| GET · PATCH · DELETE | /v1/annotation-queues/{queueId}/items/{itemId} | Read, update, or remove an item |
| POST | /v1/annotation-queues/{queueId}/items/{itemId}/reviews | Submit a review (LLM-as-judge entry point) |
The review-submission endpoint is what takes this beyond queue CRUD. Posting { "status": "completed" } records the authenticated user as a reviewer, and once the queue’s reviewers_required threshold is met, the item automatically advances to completed. An LLM-as-judge pipeline can therefore submit annotations that count toward the same threshold as human reviewers.
API keys need the annotation_queue.read, annotation_queue.write, annotation_queue.delete, and annotation_queue.review permissions. The permissions are split so that CI pipelines can be granted review access without queue-CRUD access. See the API reference for full endpoint schemas.
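A review submission from a CI job might be assembled as shown below. The path and the { "status": "completed" } body come from the text above; the base URL, the Bearer authorization scheme, and the `build_review_request` helper are assumptions made for illustration, so check the API reference for the actual host and authentication details. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_review_request(base_url: str, api_key: str, queue_id: str, item_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the review-submission endpoint."""
    url = f"{base_url.rstrip('/')}/v1/annotation-queues/{queue_id}/items/{item_id}/reviews"
    body = json.dumps({"status": "completed"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumed auth scheme; confirm against the API reference.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. A key used this way would need the annotation_queue.review permission.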

End-to-end example: dataset curation

A common workflow is using annotation queues to curate high-quality datasets from production traces.
1. Create a queue
   Create a queue with a boolean score config (e.g., dataset_quality) and set the default dataset to your target dataset.
2. Add production traces
   Go to Traces, filter to interesting traces (errors, low automated scores, specific prompts), select them, and add to the queue.
3. Review and score
   Click Start Review. For each trace, read the I/O, mark Pass or Fail, and optionally edit the input/output before saving to the dataset.
4. Save to dataset
   Expand the Save to dataset section, verify the auto-extracted fields, and click Save. The default dataset is pre-selected. Saved items are staged on the queue and remain pending until the queue is completed.
5. Complete the queue
   Once every item has been reviewed, mark the queue as completed. All staged dataset items are committed to the target dataset in a single batch at this point.
6. Use in experiments
   Run experiments against the curated dataset to validate prompt changes against human-verified examples.

Human annotation vs automated evals

Use both together. They serve different purposes.
| | Human annotation | Automated evals |
| --- | --- | --- |
| Created by | Team members in the Dashboard | Eval functions during experiments |
| Best for | Subjective quality, edge cases, nuance | Regression testing, scale, consistency |
| Scale | Tens to hundreds of items | Entire datasets |
| When | Anytime, on any trace | During experiment runs |
Automated evals catch regressions at scale. Human annotations handle the cases machines can’t judge — and provide the ground truth to calibrate your automated scorers against.
Score the same set of traces with both human reviewers and your LLM-as-judge eval. Compare the results to identify where your automated scorer disagrees with human judgment, then tune your eval prompt accordingly.
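One simple way to run that comparison is to compute an agreement rate and list the traces where the two sources differ. The sketch below assumes scores have already been exported into dictionaries keyed by trace ID; the `compare_scores` function and that data shape are hypothetical, not an AgentMark export format.

```python
from typing import Dict, List, Tuple


def compare_scores(human: Dict[str, int], automated: Dict[str, int]) -> Tuple[float, List[str]]:
    """Return (agreement rate, trace IDs where human and automated scores differ)."""
    # Only traces scored by both sources can be compared.
    shared = sorted(set(human) & set(automated))
    if not shared:
        raise ValueError("no traces were scored by both sources")
    disagreements = [t for t in shared if human[t] != automated[t]]
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, disagreements
```

The disagreement list is usually more useful than the rate itself, since those are the traces to read when revising the eval prompt.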

  • Evaluations: Automate scoring with eval functions
  • Datasets: Create and manage test datasets
  • Experiments: Run prompts against datasets to validate quality
  • Traces: View and explore trace data
