Annotations allow you to manually add scores, labels, and contextual information to traces and spans. Use them for human-in-the-loop evaluation, debugging, and creating training datasets from production data.
What Are Annotations?
Annotations are manual evaluations added to spans in your traces. Unlike automated evaluations that run during experiments, annotations are created by team members directly in the AgentMark dashboard.
Each annotation contains:
| Field | Description |
|---|---|
| Name | A short title describing what you’re evaluating |
| Label | A categorical assessment (e.g., “correct”, “incorrect”, “regression”) |
| Score | A numeric value representing quality or performance |
| Reason | Detailed explanation of why you assigned this score and label |
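To make the structure concrete, here is a minimal sketch of an annotation modeled as a typed record. The field names mirror the table above; the `Annotation` type name and the `traceId`/`spanId` identifiers are illustrative assumptions, not AgentMark's actual schema.

```typescript
// Hypothetical shape of an annotation record, mirroring the table above.
// The traceId/spanId fields are assumptions about how an annotation
// attaches to a span; they are not a documented AgentMark type.
interface Annotation {
  traceId: string; // assumed: the trace this annotation belongs to
  spanId: string;  // assumed: the span being annotated
  name: string;    // short title describing what you're evaluating
  label: string;   // categorical assessment, e.g. "correct"
  score: number;   // numeric quality value; decimals allowed
  reason: string;  // detailed explanation for the score and label
}

const example: Annotation = {
  traceId: "trace_123",
  spanId: "span_456",
  name: "Response Quality",
  label: "good",
  score: 0.85,
  reason: "Accurate and well-formatted, but could be more concise.",
};
```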
Use Cases
Quality Assessment
Review production traces to identify issues and track improvements:
Name: Response Quality
Label: good
Score: 0.85
Reason: The response was accurate and well-formatted, but could have been more concise.
Edge Case Documentation
Flag unusual inputs or unexpected behavior for follow-up:
Name: Edge Case
Label: unexpected_behavior
Score: 0.3
Reason: Model hallucinated when given empty input. Should add input validation.
Training Data Curation
Label production traces to build high-quality datasets from real usage:
Name: Training Suitability
Label: include
Score: 1.0
Reason: Clean input/output pair suitable for fine-tuning dataset.
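As a rough sketch of what curation could look like downstream, suppose you have exported annotated spans as JSON. The export file name, the `input`/`output` fields, and the record shape below are all assumptions for illustration, not a documented AgentMark export format.

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// Assumed export shape: each record pairs a span's input/output with
// its annotations. Illustrative only, not AgentMark's export schema.
interface AnnotatedSpan {
  input: string;
  output: string;
  annotations: { name: string; label: string; score: number }[];
}

const spans: AnnotatedSpan[] = JSON.parse(
  readFileSync("annotated_traces.json", "utf8") // hypothetical export file
);

// Keep only spans a reviewer marked suitable for training.
const lines = spans
  .filter((s) =>
    s.annotations.some(
      (a) => a.name === "Training Suitability" && a.label === "include"
    )
  )
  .map((s) => JSON.stringify({ prompt: s.input, completion: s.output }));

// One JSON object per line: a common fine-tuning dataset format.
writeFileSync("finetune.jsonl", lines.join("\n"));
```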
Adding Annotations
From the Traces View
- Navigate to the Traces page in your AgentMark dashboard
- Click on any trace to open the trace details drawer
- Select a span from the trace tree
- Click on the Evaluation tab
- Click the Add annotation button
- Fill in the annotation fields:
- Name — Short identifier for this annotation
- Label — Category or classification
- Score — Numeric value (can be decimal)
- Reason — Detailed explanation
- Click Save
Viewing Annotations
Annotations appear in the Evaluation tab alongside automated evaluation scores. They are distinguished by a filled badge labeled “annotation” (vs. “eval” for automated scores).
Annotations vs. Automated Evals
| | Annotations | Automated Evals |
|---|---|---|
| Created by | Team members in the dashboard | Eval functions during experiments |
| When | Anytime, on any trace | During experiment runs |
| Best for | Subjective quality, edge cases, training data | Automated regression testing |
| Scale | Individual review | Bulk dataset evaluation |
Use both together: automated evals catch regressions at scale, while annotations add human judgment on individual cases.