Cloud feature. Annotations are available in the AgentMark Dashboard.
Human annotation adds manual scores, labels, and feedback to your traces. Use it to evaluate subjective quality, flag edge cases, curate training datasets, and calibrate your automated evals. AgentMark supports two annotation workflows:
  • Inline annotation — score a single trace directly from the trace drawer
  • Annotation queues — batch traces into structured review queues with assignment, progress tracking, and multi-reviewer support
Animated walkthrough of the annotation queue review flow: queue list, detail view, and review panel

When to use human annotation

| Use case | Example | Workflow |
|---|---|---|
| Quality audits | Review a sample of production traces for correctness and tone | Create a queue, add traces, assign to domain experts |
| Edge case triage | Flag and investigate unexpected model behavior | Inline annotation from the trace drawer |
| Dataset curation | Build high-quality test datasets from real production data | Review in queue, save passing traces to a dataset |
| Calibrate automated evals | Align your LLM-as-judge scorers with human judgment | Score the same traces manually that your evals score, then compare results |
| Multi-reviewer consensus | Get independent assessments from multiple team members | Set reviewers required > 1 on a queue |

Score types

Score configs define what reviewers score on. They are declared in your agentmark.json file under the scores field and persisted to the platform database on agentmark deploy. When creating a queue, you select which score configs to include.
Score configs must be deployed before you can create a queue. Run agentmark deploy after adding scores to your agentmark.json. Once deployed, score configs are always available in the dashboard — no worker dependency required. See Project configuration for the scores schema and Evaluations for adding automated eval functions.
**Boolean**: a binary judgment. The reviewer clicks **Pass** or **Fail**.

- Name: `factual_accuracy`
- Type: `boolean`

Saved as a score of 1 (pass) or 0 (fail). Best for clear-cut criteria: “Is the response factually correct?”
Every score type includes an optional reason field where the reviewer can explain their judgment.
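As a sketch, a boolean score config like the one above might be declared in your agentmark.json under the scores field. The exact schema is documented in Project configuration; the field names below are illustrative, not authoritative:

```json
{
  "scores": [
    {
      "name": "factual_accuracy",
      "type": "boolean",
      "description": "Is the response factually correct?"
    }
  ]
}
```

After editing, run `agentmark deploy` so the config is persisted and becomes selectable when creating a queue.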

Inline annotation

Add a score to any trace directly from the trace drawer — no queue required.

Evaluations tab in the trace drawer showing inline annotation scores alongside automated eval results
1. **Open a trace**: Navigate to **Traces** and click any trace to open the detail drawer.
2. **Select a span**: Choose the span you want to annotate from the trace tree.
3. **Go to the Evaluations tab**: Click the **Evaluations** tab in the drawer.
4. **Add annotation**: Click **Add annotation**, fill in the name, label, score, and reason, then click **Save**.
Inline annotations appear alongside automated eval scores, distinguished by an “annotation” badge.

Annotation queues

For batch review, use annotation queues. Queues let you organize items, assign reviewers, track progress, and require multiple independent reviews.

Review queues list showing active queues with progress bars and pending badge in sidebar

Create a queue

Navigate to **Review Queues** in the sidebar and click **Create Queue**.

Create review queue dialog with name, instructions, reviewers required, and score config fields
| Field | Required | Description |
|---|---|---|
| Name | Yes | Descriptive name for the review batch |
| Description | No | Context for what this queue covers |
| Instructions for annotators | No | Guidance shown during review (e.g., “Mark PASS if factually correct and professional”) |
| Reviewers required | Yes | Independent reviews needed per item (default: 1) |
| Score configs | Yes | Which scoring dimensions to show during review |
| Default dataset | No | Pre-selects a dataset for the “Save to dataset” action |
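These dialog fields correspond to the REST payload used in Programmatic queue management. A minimal request body, using only field names that appear in that section (other fields may have different API names), might look like:

```json
{
  "name": "Weekly quality review",
  "instructions": "Mark PASS if factually correct and professional",
  "reviewers_required": 1,
  "score_config_names": ["factual_accuracy"]
}
```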

Add items

1. **Select traces**: Go to the Traces page and select traces using the checkboxes.
2. **Add to queue**: Click **Add to Queue** in the bulk actions bar, choose a queue, and confirm.

Queue detail

Click any queue to see its items, progress, and reviewer assignments.

Queue detail view showing items table with status, type, assignment, filter tabs, and multi-reviewer badge

Filter items using the tabs:
| Tab | Shows |
|---|---|
| All | Every item in the queue |
| Pending | Items waiting for review |
| Completed | Reviewed or skipped items |
| Assigned to me | Items assigned to you |
Click the assign icon on any row to assign it to a team member. Use the three-dot menu to archive a queue when review is complete.

Review workflow

Click Start Review to begin. The review view splits into two panels. Left panel — trace content:
  • Metadata bar with trace name, latency, cost, tokens, and model
  • Root span input/output formatted as JSON
  • Expandable spans tree — click any span to see its I/O
  • For session items, a conversation timeline showing all turns
Right panel — annotation:
  • Annotator instructions (collapsible, from queue config)
  • Score controls for each configured dimension
  • Prior annotations on this resource (read-only)
  • Save to dataset section with auto-extracted I/O
| Action | Shortcut | What it does |
|---|---|---|
| Complete + Next | Enter | Save scores, mark complete, advance |
| Skip | | Mark as skipped, advance |
| Back | | Return to queue detail |

Multi-reviewer

When reviewers required is set above 1, each reviewer annotates independently:
  • The review header shows a progress badge (e.g., “0/2 reviewed”) tracking how many reviewers have completed their assessment
  • Each reviewer sees their own fresh annotation form — they don’t see other reviewers’ scores while annotating
  • An item is only marked complete when the required number of independent reviews is reached
  • The /next endpoint automatically skips items the current reviewer has already reviewed, so each reviewer only sees items they haven’t scored yet
Use multi-reviewer for high-stakes evaluations like safety reviews or fine-tuning dataset curation where a single reviewer’s judgment isn’t sufficient.
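The per-reviewer behavior of the /next endpoint can be sketched as a small client-side loop. `QueueItem`, `fetchNextItem`, and `reviewItem` are hypothetical stand-ins, not part of the AgentMark SDK: `fetchNextItem` wraps the GET /next call (returning null on a 204), and `reviewItem` represents whatever submits the reviewer's scores.

```typescript
// Hypothetical item shape; the real API response may differ.
interface QueueItem {
  resourceId: string;
  resourceType: 'trace' | 'span' | 'session';
}

// Drain the queue for one reviewer: keep pulling from /next until it
// reports nothing left (HTTP 204), reviewing each item as it arrives.
async function drainQueue(
  fetchNextItem: () => Promise<QueueItem | null>, // wraps GET .../next
  reviewItem: (item: QueueItem) => Promise<void>, // submits scores
): Promise<number> {
  let reviewed = 0;
  for (;;) {
    const item = await fetchNextItem();
    if (item === null) break; // 204: nothing left for this reviewer
    await reviewItem(item);
    reviewed++;
  }
  return reviewed;
}
```

Because /next skips items the current reviewer has already scored, running this loop once per reviewer yields the independent reviews a multi-reviewer queue requires.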

Resource types

Queues support three item types:
| Type | When to use | What the reviewer sees |
|---|---|---|
| Trace | Review a complete request | Full trace with expandable per-span I/O |
| Span | Review a single LLM call or tool invocation | Individual span content |
| Session | Review a multi-turn conversation | Conversation timeline across traces |

Programmatic queue management

All queue operations are available via the REST API. Use these to integrate annotation into your CI pipelines or automated workflows.
**Create a queue**

cURL:

```shell
curl -X POST "/api/annotation-queues?appId=YOUR_APP_ID" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Weekly quality review",
    "score_config_names": ["accuracy", "tone"],
    "reviewers_required": 2,
    "instructions": "Review for factual accuracy and professional tone."
  }'
```

TypeScript:

```typescript
const response = await fetch(`/api/annotation-queues?appId=${appId}`, {
  method: 'POST',
  headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'Weekly quality review',
    score_config_names: ['accuracy', 'tone'],
    reviewers_required: 2,
    instructions: 'Review for factual accuracy and professional tone.',
  }),
});
const { queue } = await response.json();
```
**Add items to a queue**

cURL:

```shell
curl -X POST "/api/annotation-queues/QUEUE_ID/items?appId=YOUR_APP_ID" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {"resource_id": "trace-abc-123"},
      {"resource_id": "span-def-456", "resource_type": "span"}
    ]
  }'
```

TypeScript:

```typescript
await fetch(`/api/annotation-queues/${queueId}/items?appId=${appId}`, {
  method: 'POST',
  headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    items: traceIds.map(id => ({ resource_id: id })),
  }),
});
```
**Fetch the next item to review**

cURL:

```shell
# Returns 200 with an item, or 204 if the queue is fully reviewed
curl "/api/annotation-queues/QUEUE_ID/next?appId=YOUR_APP_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

TypeScript:

```typescript
const res = await fetch(`/api/annotation-queues/${queueId}/next?appId=${appId}`, {
  headers: { 'Authorization': `Bearer ${apiKey}` },
});

if (res.status === 204) {
  console.log('All items reviewed');
} else {
  const { item } = await res.json();
  console.log(`Next item: ${item.resourceId} (${item.resourceType})`);
}
```
**Assign and complete items**

```shell
# Assign an item to a reviewer
curl -X PATCH "/api/annotation-queues/QUEUE_ID/items/ITEM_ID?appId=YOUR_APP_ID" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"assigned_to": "USER_UUID"}'

# Mark an item as completed
curl -X PATCH "/api/annotation-queues/QUEUE_ID/items/ITEM_ID?appId=YOUR_APP_ID" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"status": "completed"}'
```

End-to-end example: dataset curation

A common workflow is using annotation queues to curate high-quality datasets from production traces.
1. **Create a queue**: Create a queue with a boolean score config (e.g., `dataset_quality`) and set the default dataset to your target dataset.
2. **Add production traces**: Go to Traces, filter to interesting traces (errors, low automated scores, specific prompts), select them, and add them to the queue.
3. **Review and score**: Click **Start Review**. For each trace, read the I/O, mark Pass or Fail, and optionally edit the input/output before saving to the dataset.
4. **Save to dataset**: Expand the **Save to dataset** section, verify the auto-extracted fields, and click **Save**. The default dataset is pre-selected.
5. **Use in experiments**: Run experiments against the curated dataset to validate prompt changes against human-verified examples.

Human annotation vs automated evals

Use both together. They serve different purposes.
| | Human annotation | Automated evals |
|---|---|---|
| Created by | Team members in the dashboard | Eval functions during experiments |
| Best for | Subjective quality, edge cases, nuance | Regression testing, scale, consistency |
| Scale | Tens to hundreds of items | Entire datasets |
| When | Anytime, on any trace | During experiment runs |
Automated evals catch regressions at scale. Human annotations handle the cases machines can’t judge — and provide the ground truth to calibrate your automated scorers against.
Score the same set of traces with both human reviewers and your LLM-as-judge eval. Compare the results to identify where your automated scorer disagrees with human judgment, then tune your eval prompt accordingly.
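That comparison can be sketched as a small helper. `computeAgreement` is not part of AgentMark; it simply treats each score as the 1-or-0 value that boolean configs save, keyed by trace id:

```typescript
// Fraction of shared traces where the human annotation and the
// LLM-as-judge eval agree. Boolean scores are saved as 1 (pass)
// or 0 (fail), as described above.
function computeAgreement(
  human: Record<string, number>,
  judge: Record<string, number>,
): { agreement: number; disagreements: string[] } {
  const shared = Object.keys(human).filter((id) => id in judge);
  const disagreements = shared.filter((id) => human[id] !== judge[id]);
  return {
    agreement:
      shared.length === 0 ? 0 : (shared.length - disagreements.length) / shared.length,
    disagreements, // trace ids to inspect when tuning the eval prompt
  };
}
```

A low agreement rate, or disagreements clustered on one kind of trace, points at exactly where your eval prompt needs tuning.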

- **Evaluations**: Automate scoring with eval functions
- **Datasets**: Create and manage test datasets
- **Experiments**: Run prompts against datasets to validate quality
- **Traces**: View and explore trace data
