Annotations

Annotations allow you to manually add scores, labels, and contextual information to traces and spans. Use them for human-in-the-loop evaluation, debugging, and creating training datasets from production data.

What Are Annotations?

Annotations are manual evaluations added to spans in your traces. Unlike automated evaluations that run during experiments, annotations are created by team members directly in the AgentMark dashboard. Each annotation contains:
  • Name — A short title describing what you’re evaluating
  • Label — A categorical assessment (e.g., “correct”, “incorrect”, “regression”)
  • Score — A numeric value representing quality or performance
  • Reason — A detailed explanation of why you assigned this score and label
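Conceptually, the four fields above form a simple record. The sketch below models that record as a Python dataclass purely for illustration; the `Annotation` class and its field names are assumptions for this example, not part of an AgentMark SDK.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Illustrative model of an annotation's four fields (not an AgentMark API)."""
    name: str     # short title for what is being evaluated
    label: str    # categorical assessment, e.g. "correct" or "regression"
    score: float  # numeric quality value; decimals are allowed
    reason: str   # detailed explanation for the score and label

quality = Annotation(
    name="Response Quality",
    label="good",
    score=0.85,
    reason="Accurate and well-formatted, but could be more concise.",
)
```

Modeling annotations this way also makes it easy to validate them (for example, checking that scores stay within an agreed range) before your team standardizes on labels.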

Use Cases

Quality Assessment

Review production traces to identify issues and track improvements:
Name: Response Quality
Label: good
Score: 0.85
Reason: The response was accurate and well-formatted, but could have been more concise.

Edge Case Documentation

Flag unusual inputs or unexpected behavior for follow-up:
Name: Edge Case
Label: unexpected_behavior
Score: 0.3
Reason: Model hallucinated when given empty input. Should add input validation.

Training Data Curation

Label production traces to build high-quality datasets from real usage:
Name: Training Suitability
Label: include
Score: 1.0
Reason: Clean input/output pair suitable for fine-tuning dataset.
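Once traces carry “include”/“exclude” labels like the example above, curating a dataset is a filter over the annotated traces. The snippet below is a minimal sketch assuming a hypothetical exported-trace shape; the field names (`input`, `output`, `annotation`) are illustrative, not the AgentMark export format.

```python
# Hypothetical shape for exported traces with annotations attached.
traces = [
    {"input": "What is 2+2?", "output": "4",
     "annotation": {"label": "include", "score": 1.0}},
    {"input": "", "output": "The answer is blue.",
     "annotation": {"label": "exclude", "score": 0.2}},
]

# Keep only traces a reviewer labeled "include" for the fine-tuning set.
dataset = [
    {"prompt": t["input"], "completion": t["output"]}
    for t in traces
    if t["annotation"]["label"] == "include"
]
```

Here only the clean input/output pair survives the filter, while the flagged edge case is excluded from the training set.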

Adding Annotations

From the Traces View

  1. Navigate to the Traces page in your AgentMark dashboard
  2. Click on any trace to open the trace details drawer
  3. Select a span from the trace tree
  4. Click on the Evaluation tab
  5. Click the Add annotation button
  6. Fill in the annotation fields:
    • Name — Short identifier for this annotation
    • Label — Category or classification
    • Score — Numeric value (can be decimal)
    • Reason — Detailed explanation
  7. Click Save

Viewing Annotations

Annotations appear in the Evaluation tab alongside automated evaluation scores. They are distinguished by a filled badge labeled “annotation” (vs. “eval” for automated scores).

Annotations vs. Automated Evals

  • Created by — Annotations: team members in the dashboard. Automated evals: eval functions during experiments.
  • When — Annotations: anytime, on any trace. Automated evals: during experiment runs.
  • Best for — Annotations: subjective quality, edge cases, training data. Automated evals: automated regression testing.
  • Scale — Annotations: individual review. Automated evals: bulk dataset evaluation.
Use both together: automated evals catch regressions at scale, while annotations add human judgment on individual cases.
