Evaluations (evals) are functions that score prompt outputs and determine pass/fail status.
Start with evals first - Build your evaluation framework before writing prompts. Evals provide the foundation for measuring effectiveness and iterating.
Quick Start
Create an eval function that returns `{ passed, score, reason }`:
```typescript
export const accuracy = async ({ output, expected_output, input }) => {
  const match = output.trim() === expected_output.trim();
  return {
    passed: match,
    score: match ? 1.0 : 0.0,
    reason: match ? undefined : `Expected "${expected_output}", got "${output}"`
  };
};
```
Reference in your prompt’s frontmatter:
```mdx
---
title: Sentiment Classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
---
```
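The dataset is a JSONL file where each row supplies the props for one test case along with the expected result. Assuming the row fields mirror the eval context (`input` and `expected_output`; the exact rows below are illustrative), a minimal `sentiment.jsonl` might look like:

```jsonl
{"input": {"text": "I love this product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible support experience."}, "expected_output": "negative"}
```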
Function Signature
```typescript
type EvalFunction = (context: {
  output: any;           // AI-generated result
  expected_output: any;  // Expected result from dataset
  input: any;            // Input props from dataset
}) => Promise<EvalResult>;

type EvalResult = {
  passed?: boolean;  // Pass/fail status
  score: number;     // Numeric score (0-1)
  reason?: string;   // Explanation for failure
  label?: string;    // Custom label for categorization
};
```
Evaluation Types
Reference-Based (Ground Truth)
Compare outputs against known correct answers:
```typescript
export const exact_match = async ({ output, expected_output }) => {
  return {
    passed: output === expected_output,
    score: output === expected_output ? 1 : 0
  };
};
```
Use for: Classification, extraction, math problems, multiple choice
Reference-Free (Heuristic)
Check structural requirements without ground truth:
```typescript
export const has_required_fields = async ({ output }) => {
  const required = ['name', 'email', 'summary'];
  const hasAll = required.every(field => output[field]);
  return {
    passed: hasAll,
    score: hasAll ? 1 : 0,
    reason: hasAll ? undefined : 'Missing required fields'
  };
};
```
Use for: Format validation, length checks, required content
Model-Graded (LLM-as-Judge)
Use an LLM to evaluate subjective criteria:
```typescript
export const tone_eval = async ({ output, expected_output }, { agentmark }) => {
  const judge = await agentmark.loadObjectPrompt('evals/tone-judge.prompt.mdx');
  const input = await judge.format({
    props: { output, expected_output }
  });
  const result = await ...; // Run with your SDK
  return {
    passed: result.passed,
    score: result.passed ? 1 : 0,
    reason: result.reasoning
  };
};
```
Use for: Tone, creativity, helpfulness, semantic similarity
Combine approaches - Use reference-based for correctness, reference-free for structure, and model-graded for subjective quality.
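For example, a single prompt's `test_settings` can list several evals so every dataset row is scored on each dimension (the prompt title, dataset path, and eval names below are illustrative):

```mdx
---
title: Support Reply
test_settings:
  dataset: ./datasets/support.jsonl
  evals:
    - exact_match          # reference-based: correct label
    - has_required_fields  # reference-free: structure
    - tone_eval            # model-graded: subjective quality
---
```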
Common Patterns
Classification
```typescript
export const classification_accuracy = async ({ output, expected_output }) => {
  const match = output.trim().toLowerCase() === expected_output.trim().toLowerCase();
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected ${expected_output}, got ${output}`
  };
};
```
Contains Keyword
```typescript
export const contains_keyword = async ({ output, expected_output }) => {
  const contains = output.includes(expected_output);
  return {
    passed: contains,
    score: contains ? 1 : 0,
    reason: contains ? undefined : `Output missing "${expected_output}"`
  };
};
```
Field Presence
```typescript
export const required_fields = async ({ output }) => {
  const required = ['name', 'email', 'message'];
  const missing = required.filter(field => !(field in output));
  return {
    passed: missing.length === 0,
    score: (required.length - missing.length) / required.length,
    reason: missing.length > 0 ? `Missing: ${missing.join(', ')}` : undefined
  };
};
```
Length Check
```typescript
export const length_check = async ({ output }) => {
  const length = output.length;
  const passed = length >= 10 && length <= 500;
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : `Length ${length} outside range [10, 500]`
  };
};
```
Format Validation
```typescript
export const email_format = async ({ output }) => {
  const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
  const passed = emailRegex.test(output);
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : 'Invalid email format'
  };
};
```
Graduated Scoring
Return intermediate scores for partial credit, and use the `label` field to categorize results:
```typescript
export const sentiment_gradual = async ({ output, expected_output }) => {
  if (output === expected_output) {
    return { passed: true, score: 1.0, label: 'exact_match' };
  }
  const partialMatches = {
    'positive': ['very positive', 'somewhat positive'],
    'negative': ['very negative', 'somewhat negative']
  };
  if (partialMatches[expected_output]?.includes(output)) {
    return {
      passed: true,
      score: 0.7,
      label: 'partial_match',
      reason: 'Close semantic match'
    };
  }
  return {
    passed: false,
    score: 0,
    label: 'no_match',
    reason: `Expected ${expected_output}, got ${output}`
  };
};
```
Filter results by label (`exact_match`, `partial_match`, `no_match`) to understand failure patterns.
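If you collect results yourself (for example in a custom test runner), a quick tally shows where failures cluster. A minimal sketch, assuming you have an array of `EvalResult` objects:

```typescript
type EvalResult = { passed?: boolean; score: number; reason?: string; label?: string };

// Count how many results fall under each label,
// e.g. { exact_match: 42, partial_match: 7, no_match: 3 }
const countByLabel = (results: EvalResult[]): Record<string, number> =>
  results.reduce<Record<string, number>>((counts, r) => {
    const label = r.label ?? 'unlabeled';
    counts[label] = (counts[label] ?? 0) + 1;
    return counts;
  }, {});
```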
LLM-as-Judge
Using AgentMark Prompts (Recommended)
1. Create an eval prompt (`agentmark/evals/tone-judge.prompt.mdx`):
```mdx
---
object_config:
  model_name: gpt-4o-mini
  temperature: 0.1
  schema:
    type: object
    properties:
      passed:
        type: boolean
      reasoning:
        type: string
---
<System>
You are evaluating whether an AI response has appropriate professional tone.
First explain your reasoning step-by-step, then provide your final judgment.
</System>
<User>
**Output to evaluate:**
{props.output}

**Expected tone:**
{props.expected_output}
</User>
```
2. Use it in an eval function:
```typescript
export const tone_check = async ({ output, expected_output }, { agentmark }) => {
  const evalPrompt = await agentmark.loadObjectPrompt('evals/tone-judge.prompt.mdx');
  const input = await evalPrompt.format({
    props: { output, expected_output }
  });
  const result = await ...; // Run with your SDK
  return {
    passed: result.passed,
    score: result.passed ? 1 : 0,
    reason: result.reasoning
  };
};
```
Benefits: Version control eval logic, iterate independently, reuse prompts, leverage templating.
Best Practices
Configuration:
- Use low temperature (0.1-0.3) for consistency
- Ask for reasoning before judgment (chain-of-thought)
- Use binary scoring (PASS/FAIL) not scales (1-10)
- Test one dimension at a time
Model selection:
- Use a stronger model to grade a weaker one (e.g., GPT-4 grading GPT-3.5)
- Avoid grading a model with itself
- Validate with human evaluation before scaling
Usage:
- Use sparingly - slower and more expensive
- Reserve for subjective criteria
- Watch for position bias, verbosity bias, self-enhancement bias
Avoid exact-match for open-ended outputs - Use only for classification or short outputs. For longer text, use semantic similarity or LLM-based evaluation.
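A minimal sketch of the semantic-similarity route, assuming an `embed` helper that wraps whichever embedding API you use (not part of AgentMark; the 0.8 threshold is illustrative):

```typescript
// Hypothetical helper: returns an embedding vector for a piece of text.
declare function embed(text: string): Promise<number[]>;

// Cosine similarity between two vectors of equal length.
const cosine = (a: number[], b: number[]) => {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
};

export const semantic_similarity = async ({ output, expected_output }) => {
  const [a, b] = await Promise.all([embed(output), embed(expected_output)]);
  const score = cosine(a, b);
  const passed = score >= 0.8; // tune the threshold per task
  return {
    passed,
    score,
    reason: passed ? undefined : `Similarity ${score.toFixed(2)} below 0.8`
  };
};
```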
Domain-Specific Evals
RAG (Retrieval-Augmented Generation)
```typescript
// Faithfulness: every claim in the output must be grounded in the retrieved context.
// extractClaims and isSupported are placeholders for your own logic (or an LLM judge).
export const faithfulness = async ({ output, input }) => {
  const context = input.retrieved_context;
  const claims = extractClaims(output);
  const supported = claims.every(claim => isSupported(claim, context));
  return {
    passed: supported,
    score: supported ? 1 : 0,
    reason: supported ? undefined : 'Output contains unsupported claims'
  };
};
```
```typescript
// Answer relevancy: a simple keyword heuristic; swap in semantic similarity
// or an LLM judge for more robust coverage.
export const answer_relevancy = async ({ output, input }) => {
  const isRelevant = output.toLowerCase().includes(input.query.toLowerCase());
  return {
    passed: isRelevant,
    score: isRelevant ? 1 : 0,
    reason: isRelevant ? undefined : 'Answer not relevant to query'
  };
};
```
Agents (Tool Calling)
```typescript
export const tool_correctness = async ({ output, expected_output }) => {
  const correctTool = output.tool === expected_output.tool;
  const correctParams = JSON.stringify(output.parameters) ===
    JSON.stringify(expected_output.parameters);
  return {
    passed: correctTool && correctParams,
    score: correctTool && correctParams ? 1 : correctTool ? 0.5 : 0,
    reason: !correctTool ? 'Wrong tool selected' :
      !correctParams ? 'Incorrect parameters' : undefined
  };
};
```
Best Practices
- Test one thing per eval - separate functions for different criteria
- Provide helpful failure reasons for debugging
- Use meaningful names (`sentiment_accuracy`, not `eval1`)
- Keep scores in 0-1 range
- Make evals deterministic and consistent (avoid flaky tests)
- Validate general behavior, not specific outputs (avoid overfitting)
Next Steps