Evaluations (evals) automatically score prompt outputs during experiment runs. Register eval functions in your client config, reference them in prompt frontmatter, and they run on every experiment without further wiring.

[Image: Evaluations in the AgentMark dashboard]

How Evals Work

  1. Register eval functions in your EvalRegistry
  2. Reference eval names in prompt test_settings.evals
  3. List eval names in agentmark.json so they appear in the platform editor
  4. Run experiments — evals execute automatically on each dataset item
  5. View scores in the dashboard alongside traces

Setting Up Evals

1. Register Eval Functions

In your client configuration, create an EvalRegistry and register your evaluation functions:
agentmark.client.ts
import {
  createAgentMarkClient,
  VercelAIModelRegistry,
  EvalRegistry,
} from "@agentmark-ai/ai-sdk-v5-adapter";

const evalRegistry = new EvalRegistry();

evalRegistry.register("accuracy", ({ output, expectedOutput }) => {
  const match = output?.toString().trim() === expectedOutput?.trim();
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected "${expectedOutput}", got "${output}"`,
  };
});

// `sdk` and `modelRegistry` are assumed to be configured earlier in this file
export const client = createAgentMarkClient({
  loader: sdk.getApiLoader(),
  modelRegistry,
  evalRegistry,
});

2. Reference Evals in Prompt Frontmatter

Add eval names to the test_settings.evals array in your prompt:
classifier.prompt.mdx
---
name: sentiment-classifier
text_config:
  model_name: gpt-4o
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
---

<System>Classify the sentiment of the following text as positive, negative, or neutral.</System>
<User>{props.text}</User>

3. List Evals in agentmark.json

Add eval names to agentmark.json so they appear as options in the platform prompt editor:
agentmark.json
{
  "evals": ["accuracy", "relevance", "format_check"]
}

Eval Function Signature

type EvalFunction = (params: EvalParams) => EvalResult | Promise<EvalResult>;

interface EvalParams {
  input: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
  output: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
  expectedOutput?: string;
}

interface EvalResult {
  score?: number;    // Numeric score (0–1 recommended)
  passed?: boolean;  // Pass/fail status
  label?: string;    // Categorical label (e.g., "correct", "incorrect")
  reason?: string;   // Explanation for the result
}
All fields in EvalResult are optional. At minimum, return either passed or score so that experiment results are meaningful.
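The signature above can be exercised standalone. The sketch below uses local copies of the documented types (so it runs without the SDK installed) plus an illustrative keyword-coverage eval that returns a fractional score; the eval name and scoring logic are examples, not part of the SDK.

```typescript
// Local mirrors of the documented types (assumed to match the SDK's exports).
type EvalParams = {
  input: unknown;
  output: unknown;
  expectedOutput?: string;
};

type EvalResult = {
  score?: number;
  passed?: boolean;
  label?: string;
  reason?: string;
};

type EvalFunction = (params: EvalParams) => EvalResult | Promise<EvalResult>;

// Illustrative eval: fraction of expected keywords found in the output.
const keywordCoverage: EvalFunction = ({ output, expectedOutput }) => {
  const keywords = (expectedOutput ?? "").split(/\s+/).filter(Boolean);
  const text = String(output ?? "").toLowerCase();
  const missing = keywords.filter((k) => !text.includes(k.toLowerCase()));
  const score =
    keywords.length > 0 ? (keywords.length - missing.length) / keywords.length : 0;
  return {
    score,
    passed: score === 1,
    reason: missing.length ? `Missing keywords: ${missing.join(", ")}` : undefined,
  };
};
```

Because the return type allows a `Promise<EvalResult>`, the same shape works for async evals (for example, ones that call an external service).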

Eval Types

Reference-Based

Compare outputs against known correct answers from expectedOutput:
evalRegistry.register("exact_match", ({ output, expectedOutput }) => {
  const match = output === expectedOutput;
  return { passed: match, score: match ? 1 : 0 };
});
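Strict equality can be brittle against incidental whitespace or casing differences. A more forgiving variant might normalize both sides before comparing — a sketch below, written as a standalone function (with the SDK you would pass it to `evalRegistry.register`); the normalization rules are an assumption, not SDK behavior.

```typescript
// Normalize before comparing: lowercase, trim, collapse internal whitespace.
const normalize = (s: unknown) =>
  String(s ?? "").toLowerCase().trim().replace(/\s+/g, " ");

function normalizedMatch({
  output,
  expectedOutput,
}: {
  output: unknown;
  expectedOutput?: string;
}) {
  const match = normalize(output) === normalize(expectedOutput);
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected "${expectedOutput}", got "${output}"`,
  };
}
```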

Reference-Free

Check structural requirements without needing expected output:
evalRegistry.register("has_required_fields", ({ output }) => {
  // output may be a string or a structured value; only index it when it is an object
  const record =
    typeof output === "object" && output !== null && !Array.isArray(output)
      ? (output as Record<string, unknown>)
      : undefined;
  const required = ["name", "email", "summary"];
  const missing = record
    ? required.filter((field) => record[field] == null)
    : required;
  return {
    passed: missing.length === 0,
    score: missing.length === 0 ? 1 : 0,
    reason:
      missing.length === 0
        ? undefined
        : `Missing required fields: ${missing.join(", ")}`,
  };
});
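Another common reference-free check is output parseability — for example, whether the model returned valid JSON. A standalone sketch (the eval name and structure are illustrative; with the SDK you would register it on the EvalRegistry):

```typescript
// Reference-free structural check: does the output parse as JSON?
function validJson({ output }: { output: unknown }) {
  try {
    JSON.parse(String(output));
    return { passed: true, score: 1, label: "valid" };
  } catch (err) {
    return {
      passed: false,
      score: 0,
      label: "invalid",
      reason: `Not valid JSON: ${(err as Error).message}`,
    };
  }
}
```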

Batch Registration

Register the same function under multiple names:
evalRegistry.register(["exact_match", "em"], ({ output, expectedOutput }) => {
  const match = output === expectedOutput;
  return { passed: match, score: match ? 1 : 0 };
});

Manual Scoring via SDK

For custom workflows outside the eval registry (e.g., scoring production traces), use the SDK score() method:
import { AgentMarkSDK } from "@agentmark-ai/sdk";

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
});

await sdk.score({
  resourceId: traceId,   // The traceId of the execution to score
  name: "correctness",
  score: 0.95,
  label: "correct",
  reason: "Output matches expected result",
});

Score Parameters

Parameter    Type     Required   Description
resourceId   string   Yes        Trace ID or span ID to attach the score to
name         string   Yes        Name of the evaluation metric
score        number   Yes        Numeric score
label        string   Yes        Categorical label
reason       string   Yes        Explanation for the score
type         string   No         Optional type classifier

Viewing Evaluations

Eval scores appear in the dashboard:
  • Experiment results — Each dataset item row shows eval pass/fail and scores
  • Traces — Navigate to a trace and open the Evaluation tab to see all scores
  • Aggregated metrics — Pass rates and score averages across an experiment run

Best Practices

  • Test one thing per eval — Separate functions for different criteria (accuracy, format, tone)
  • Keep scores in 0–1 range — Consistent scale across all evals
  • Provide failure reasons — Makes debugging failed items faster
  • Use deterministic evals — Avoid flaky tests that produce inconsistent results
  • Start with reference-based evals — Exact match and keyword checks are simple and reliable
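Several of these practices can show up in a single small eval: one criterion, a 0–1 score, a deterministic check, and a specific failure reason. A sketch below; the length-limit criterion and the 280-character cap are arbitrary example values, not anything the platform prescribes.

```typescript
// One criterion (length), deterministic, 0-1 score, specific failure reason.
function underLengthLimit({ output }: { output: unknown }, limit = 280) {
  const len = String(output ?? "").length;
  const passed = len <= limit;
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : `Output is ${len} chars; limit is ${limit}`,
  };
}
```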
For advanced eval patterns including LLM-as-judge, domain-specific evals, and graduated scoring, see Evaluations in the Development docs.
