AgentMark Dashboards give you a unified view of your application’s health — operational metrics (cost, latency, tokens, errors), evaluation scores (distributions, trends, cross-score comparison), and custom widgets — all on one page.
Observability is set up by developers in your application code. See the Development documentation for setup instructions.
[Screenshot: dashboard showing operational widgets and the score analytics section]

Operational metrics

The dashboard automatically tracks key metrics from your prompt executions:
| Category | Metrics |
| --- | --- |
| Cost | Total cost, average cost per request, cost by model |
| Latency | Average latency, P50/P95/P99 percentiles, latency trends |
| Tokens | Input tokens, output tokens, total tokens, tokens by model |
| Volume | Request count, error count, error rate, unique users |
| Models | Request count per model, cost per model, top models ranking |
These appear as widgets on your dashboard — stat cards for at-a-glance numbers, line/bar/area charts for trends.
[Screenshot: operational metrics dashboard]

Score analytics

Score analytics are available as dashboard widgets — add them to any dashboard through the “Add Widget” dialog, or start from the Score Analytics template in the template gallery. Four score widget types are available:

Summary cards

Aggregated statistics for each score name:
  • Avg — Mean score value
  • Count — Total number of scores recorded
  • Min / Max — Range of observed values
[Screenshot: summary cards showing avg, count, min, and max per score name]

Score distribution

The histogram shows how score values are distributed. AgentMark auto-detects the score type:
  • Numeric scores — 10 equal-width bins between min and max
  • Categorical scores — Bar chart by category label
  • Boolean scores — Two bars for true/false
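The equal-width binning for numeric scores can be sketched as follows. This is an illustrative helper, not AgentMark's actual implementation:

```python
def numeric_histogram(values, num_bins=10):
    """Bucket numeric score values into equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1  # guard: all values equal -> width 0
    counts = [0] * num_bins
    for v in values:
        # clamp so the maximum value lands in the last bin, not one past it
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

numeric_histogram([1, 2, 3, 10], num_bins=3)  # → [3, 0, 1]
```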
[Screenshot: score distribution histogram]

Trend over time

Average score values over configurable intervals — Hourly, Daily, Weekly, or Monthly.
[Screenshot: score trend chart]
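The per-interval averaging behind the trend widget amounts to bucketing timestamped scores and taking the mean of each bucket. A minimal sketch for daily intervals (`daily_trend` is a hypothetical helper):

```python
from collections import defaultdict
from datetime import datetime

def daily_trend(scores):
    """Average score value per calendar day; scores are (iso_timestamp, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in scores:
        day = datetime.fromisoformat(ts).date().isoformat()
        buckets[day].append(value)
    # one averaged point per day, in chronological order
    return {day: sum(vs) / len(vs) for day, vs in sorted(buckets.items())}
```

Hourly, weekly, or monthly intervals follow the same shape with a different bucket key.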

Score comparison

Compare two scores of the same type to see how they align across shared traces:
  • Categorical / Boolean — Confusion matrix (N×M heatmap)
  • Numeric — Scatter plot with paired values
[Screenshot: confusion matrix comparing two boolean scores]
Both scores must be the same type. Mixing numeric with categorical will show an error. The scatter plot is capped at 10,000 data points for performance.
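For categorical or boolean pairs, the N×M confusion matrix can be built by counting label co-occurrences across shared traces. An illustrative sketch, not AgentMark's implementation:

```python
def confusion_matrix(pairs):
    """Count co-occurrences of two categorical scores over shared traces.

    pairs: (label_a, label_b) tuples, one per trace that has both scores.
    Returns (row_labels, col_labels, matrix) where matrix[i][j] is the count
    of traces scored row_labels[i] by score A and col_labels[j] by score B.
    """
    rows = sorted({a for a, _ in pairs})
    cols = sorted({b for _, b in pairs})
    matrix = [[0] * len(cols) for _ in rows]
    for a, b in pairs:
        matrix[rows.index(a)][cols.index(b)] += 1
    return rows, cols, matrix
```

For two boolean scores this naturally yields the 2×2 case shown above.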

Score types

| Score type | Detection rule | Distribution | Comparison |
| --- | --- | --- | --- |
| Numeric | Float values, no labels | 10-bin histogram | Scatter plot |
| Categorical | String labels (not just true/false) | Category bar chart | N×M confusion matrix |
| Boolean | Labels are only “true” and/or “false” | Two-bar chart | 2×2 confusion matrix |
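The detection rules in the table can be sketched as a simple classification of a score's recorded values (`detect_score_type` is a hypothetical helper, not part of AgentMark's API):

```python
def detect_score_type(values):
    """Classify a score by its recorded values, per the detection rules above."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"       # float values, no labels
    labels = {str(v).strip().lower() for v in values}
    if labels <= {"true", "false"}:
        return "boolean"       # labels are only "true" and/or "false"
    return "categorical"       # any other string labels
```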

Widgets

Dashboards are fully configurable with drag-and-drop widgets. Add any mix of operational and score widgets to create the view you need.
Operational widgets (stat card, line, bar, or area chart):
  • Request count, error rate, cost, latency, tokens, unique users, model rankings
  • Derived metrics: cost/request, tokens/request, success rate, and more
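Derived metrics are simple ratios of the built-in counters. A sketch of the arithmetic, with hypothetical field names matching the metric names listed later in this page:

```python
def derived_metrics(m):
    """Compute derived widget metrics from built-in counters (illustrative).

    m is a dict with request_count, error_count, total_cost, and total_tokens.
    """
    n = m["request_count"] or 1  # avoid division by zero on empty time ranges
    return {
        "cost_per_request": m["total_cost"] / n,
        "tokens_per_request": m["total_tokens"] / n,
        "success_rate": 1 - m["error_count"] / n,
    }
```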
Score widgets:
  • Score Summary — aggregated stats for all scores
  • Score Distribution — histogram or category chart for a selected score
  • Score Trend — trend line over time for a selected score
  • Score Comparison — confusion matrix or scatter plot comparing two scores
[Screenshot: custom dashboard with widgets]

Available metrics

  • Volume: request_count, unique_users, total_tokens, avg_tokens
  • Cost: total_cost, avg_cost
  • Errors: error_count, error_rate
  • Latency: avg_latency, p50_latency, p95_latency, p99_latency
  • Rankings: top_models

Adding widgets

  1. Click + Add Widget in the dashboard header
  2. Choose a title and metric — operational metrics are under “Built-in” and “Derived”, score metrics are under “Scores”
  3. For score widgets, enter the score name(s) to track
  4. Choose a visualization type and optional group-by dimension
  5. The widget appears on the grid — drag to rearrange
Operational widgets support group-by dimensions (model, user, metadata key), time granularity (hour, day, auto), and filters (model, user ID, status).

Templates

Start from a pre-built template or create a blank dashboard.
[Screenshot: dashboard template gallery]
| Template | What it includes |
| --- | --- |
| Overview | Request volume, cost, errors, latency — stat cards + time series |
| Cost Analysis | Total cost, avg cost/request, cost over time, top models by cost, tokens |
| Performance | P50/P95/P99 latency, error count, error rate |
| Score Analytics | Score summary, distribution, trend, and comparison widgets |

Dashboard settings

  • Default dashboard — mark any dashboard as default to load it when you visit the Dashboard page
  • Time range — global selector (24h, 7d, 30d, 90d) applies to all widgets and the score analytics section
  • Limits — up to 10 dashboards per app, maximum 20 widgets per dashboard

Score analytics API

Score data is available via four API endpoints. All require authentication and an appId parameter.

GET /api/analytics/scores/histogram

Returns bucketed distribution data for a score.
| Parameter | Required | Description |
| --- | --- | --- |
| appId | Yes | Application ID |
| name | Yes | Score name |
| range | No | Date range preset (default: 7d) |
| startDate | If custom | Start date for custom range |
| endDate | If custom | End date for custom range |
| source | No | Filter by source (eval or annotation) |
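Building a request URL from these parameters is straightforward. A minimal sketch: the base URL is a placeholder for your deployment, and the authentication mechanism (header, token, or cookie) is not specified in this page, so it is omitted here:

```python
from urllib.parse import urlencode

BASE_URL = "https://app.agentmark.example"  # placeholder; use your deployment's URL

def histogram_url(app_id, name, date_range="7d", source=None):
    """Build the histogram endpoint URL; auth credentials are sent separately."""
    params = {"appId": app_id, "name": name, "range": date_range}
    if source is not None:
        params["source"] = source  # "eval" or "annotation"
    return f"{BASE_URL}/api/analytics/scores/histogram?{urlencode(params)}"
```

The trend, comparison, and scatter endpoints take the same shape with their own paths and name parameters.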

GET /api/analytics/scores/trend

Returns time-series trend data for a score.
| Parameter | Required | Description |
| --- | --- | --- |
| appId | Yes | Application ID |
| name | Yes | Score name |
| interval | No | hour, day, week, month (default: day) |
| range | No | Date range preset (default: 7d) |
| startDate | If custom | Start date for custom range |
| endDate | If custom | End date for custom range |
| source | No | Filter by source (eval or annotation) |

GET /api/analytics/scores/comparison

Returns confusion matrix data comparing two categorical/boolean scores.
| Parameter | Required | Description |
| --- | --- | --- |
| appId | Yes | Application ID |
| nameA | Yes | First score name |
| nameB | Yes | Second score name |
| range | No | Date range preset (default: 7d) |
| startDate | If custom | Start date for custom range |
| endDate | If custom | End date for custom range |
| source | No | Filter by source (eval or annotation) |

GET /api/analytics/scores/scatter

Returns paired numeric score values for scatter plot. Both scores must be numeric.
| Parameter | Required | Description |
| --- | --- | --- |
| appId | Yes | Application ID |
| nameA | Yes | First score name (numeric) |
| nameB | Yes | Second score name (numeric) |
| range | No | Date range preset (default: 7d) |
| startDate | If custom | Start date for custom range |
| endDate | If custom | End date for custom range |
| source | No | Filter by source (eval or annotation) |
The scatter endpoint returns a maximum of 10,000 paired data points. A “Sampled” indicator appears in the UI when the cap is reached.
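A client reproducing this cap locally might look like the following sketch. Whether AgentMark samples randomly or truncates at the cap is not stated here, so this example assumes random sampling; the helper name and seed are illustrative:

```python
import random

MAX_POINTS = 10_000  # documented cap for the scatter endpoint

def sample_pairs(pairs, seed=0):
    """Return (points, sampled): sampled is True when the cap was applied."""
    if len(pairs) <= MAX_POINTS:
        return pairs, False
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(pairs, MAX_POINTS), True
```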

Have Questions?

We’re here to help! Choose the best way to reach us.