Evaluations (evals) automatically score prompt outputs during experiment runs. Register eval functions in your client config, reference them in prompt frontmatter, and they execute automatically when you run experiments.
How Evals Work
- Register eval functions in your `EvalRegistry`
- Reference eval names in prompt `test_settings.evals`
- List eval names in `agentmark.json` so they appear in the platform editor
- Run experiments; evals execute automatically on each dataset item
- View scores in the dashboard alongside traces
Setting Up Evals
1. Register Eval Functions
In your client configuration, create an EvalRegistry and register your evaluation functions:
AI SDK (Vercel)

```typescript
import {
  createAgentMarkClient,
  VercelAIModelRegistry,
  EvalRegistry,
} from "@agentmark-ai/ai-sdk-v5-adapter";

// `modelRegistry` and `sdk` come from the rest of your client config setup.
const evalRegistry = new EvalRegistry();
evalRegistry.register("accuracy", ({ output, expectedOutput }) => {
  const match = output?.toString().trim() === expectedOutput?.trim();
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected "${expectedOutput}", got "${output}"`,
  };
});

export const client = createAgentMarkClient({
  loader: sdk.getApiLoader(),
  modelRegistry,
  evalRegistry,
});
```

Claude Agent SDK

```typescript
import {
  createAgentMarkClient,
  ClaudeAgentModelRegistry,
  EvalRegistry,
} from "@agentmark-ai/claude-agent-sdk-adapter";

// `modelRegistry` and `sdk` come from the rest of your client config setup.
const evalRegistry = new EvalRegistry();
evalRegistry.register("accuracy", ({ output, expectedOutput }) => {
  const match = output?.toString().trim() === expectedOutput?.trim();
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected "${expectedOutput}", got "${output}"`,
  };
});

export const client = createAgentMarkClient({
  loader: sdk.getApiLoader(),
  modelRegistry,
  evalRegistry,
});
```
2. Reference Evals in Prompt Frontmatter
Add eval names to the test_settings.evals array in your prompt:
```mdx
---
name: sentiment-classifier
text_config:
  model_name: gpt-4o
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
---
<System>Classify the sentiment of the following text as positive, negative, or neutral.</System>
<User>{props.text}</User>
```
3. List Evals in agentmark.json
Add eval names to agentmark.json so they appear as options in the platform prompt editor:
```json
{
  "evals": ["accuracy", "relevance", "format_check"]
}
```
Eval Function Signature
```typescript
type EvalFunction = (params: EvalParams) => EvalResult | Promise<EvalResult>;

interface EvalParams {
  input: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
  output: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
  expectedOutput?: string;
}

interface EvalResult {
  score?: number;   // Numeric score (0–1 recommended)
  passed?: boolean; // Pass/fail status
  label?: string;   // Categorical label (e.g., "correct", "incorrect")
  reason?: string;  // Explanation for the result
}
```
All fields in `EvalResult` are optional. At minimum, return either `passed` or `score` so that experiment results are meaningful.
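As a sketch of this signature in use, here is a hypothetical async eval that returns `passed`, `score`, and a `label`. The `jsonParses` name and its logic are illustrative assumptions, not part of the SDK; the inline `EvalResult` type mirrors the interface above so the snippet stands alone:

```typescript
// Inline copy of the EvalResult shape so this snippet is self-contained.
type EvalResult = {
  score?: number;
  passed?: boolean;
  label?: string;
  reason?: string;
};

// Hypothetical async eval: checks that the model's output is valid JSON.
async function jsonParses({ output }: { output: string }): Promise<EvalResult> {
  try {
    JSON.parse(output);
    return { passed: true, score: 1, label: "valid_json" };
  } catch (err) {
    return { passed: false, score: 0, label: "invalid_json", reason: String(err) };
  }
}
```

Because `EvalFunction` may return `Promise<EvalResult>`, an eval like this could also await an external service before scoring.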
Eval Types
Reference-Based
Compare outputs against known correct answers from expectedOutput:
```typescript
evalRegistry.register("exact_match", ({ output, expectedOutput }) => {
  const match = output === expectedOutput;
  return { passed: match, score: match ? 1 : 0 };
});
```
Reference-Free
Check structural requirements without needing expected output:
```typescript
evalRegistry.register("has_required_fields", ({ output }) => {
  const required = ["name", "email", "summary"];
  // Structured outputs arrive as objects; narrow the union type before indexing.
  const record = output as Record<string, unknown>;
  const hasAll = required.every((field) => record[field]);
  return {
    passed: hasAll,
    score: hasAll ? 1 : 0,
    reason: hasAll ? undefined : "Missing required fields",
  };
});
```
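The `format_check` name listed in the `agentmark.json` example above could be another reference-free eval. A minimal sketch, assuming the allowed labels from the sentiment-classifier prompt earlier (the function body is illustrative, not SDK-provided):

```typescript
// Sketch of a reference-free "format_check" eval: verifies the model
// replied with exactly one of the allowed sentiment labels. The label
// set is an assumption tied to the sentiment-classifier prompt above.
const formatCheck = ({ output }: { output: unknown }) => {
  const allowed = ["positive", "negative", "neutral"];
  const answer = String(output).trim().toLowerCase();
  const ok = allowed.includes(answer);
  return {
    passed: ok,
    score: ok ? 1 : 0,
    reason: ok ? undefined : `Unexpected label: "${answer}"`,
  };
};

// Then register it under the name used in agentmark.json:
// evalRegistry.register("format_check", formatCheck);
```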
Batch Registration
Register the same function under multiple names:
```typescript
evalRegistry.register(["exact_match", "em"], ({ output, expectedOutput }) => {
  const match = output === expectedOutput;
  return { passed: match, score: match ? 1 : 0 };
});
```
Manual Scoring via SDK
For custom workflows outside the eval registry (e.g., scoring production traces), use the SDK score() method:
```typescript
import { AgentMarkSDK } from "@agentmark-ai/sdk";

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
});

await sdk.score({
  resourceId: traceId, // The traceId of the execution to score
  name: "correctness",
  score: 0.95,
  label: "correct",
  reason: "Output matches expected result",
});
```
Score Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| resourceId | string | Yes | Trace ID or span ID to attach the score to |
| name | string | Yes | Name of the evaluation metric |
| score | number | Yes | Numeric score |
| label | string | Yes | Categorical label |
| reason | string | Yes | Explanation for the score |
| type | string | No | Optional type classifier |
Viewing Evaluations
Eval scores appear in the dashboard:
- Experiment results — Each dataset item row shows eval pass/fail and scores
- Traces — Navigate to a trace and open the Evaluation tab to see all scores
- Aggregated metrics — Pass rates and score averages across an experiment run
Best Practices
- Test one thing per eval — Separate functions for different criteria (accuracy, format, tone)
- Keep scores in 0–1 range — Consistent scale across all evals
- Provide failure reasons — Makes debugging failed items faster
- Use deterministic evals — Avoid flaky tests that produce inconsistent results
- Start with reference-based evals — Exact match and keyword checks are simple and reliable
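To illustrate keeping scores in the 0–1 range with graduated credit rather than all-or-nothing, here is a hypothetical keyword-coverage eval; the `keywordCoverage` name and keyword list are illustrative assumptions, not part of the SDK:

```typescript
// Hypothetical graduated eval: the score is the fraction of expected
// keywords present in the output, so it always lands in [0, 1].
const keywordCoverage = ({ output }: { output: string }) => {
  const keywords = ["refund", "apology", "timeline"]; // illustrative list
  const text = output.toLowerCase();
  const hits = keywords.filter((k) => text.includes(k)).length;
  const score = hits / keywords.length;
  return {
    score,
    passed: score === 1,
    reason: `${hits}/${keywords.length} expected keywords found`,
  };
};
```

A fractional score like this aggregates more informatively across an experiment run than a plain pass/fail, while staying deterministic.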
For advanced eval patterns including LLM-as-judge, domain-specific evals, and graduated scoring, see Evaluations in the Development docs.