Evaluations (evals) are functions that score prompt outputs and determine pass/fail status.
Start with evals first - Build your evaluation framework before writing prompts. Evals provide the foundation for measuring effectiveness and iterating.

Quick Start

Create an eval function that returns {passed, score, reason}:
export const accuracy = async ({ output, expected_output, input }) => {
  const match = output.trim() === expected_output.trim();
  return {
    passed: match,
    score: match ? 1.0 : 0.0,
    reason: match ? undefined : `Expected "${expected_output}", got "${output}"`
  };
};
Reference it in your prompt’s frontmatter:
---
title: Sentiment Classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
---
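Each dataset row supplies the input props and the expected_output that your evals receive. As an illustrative sketch only (the exact field names depend on your dataset schema), a sentiment dataset row might look like:
{"input": {"text": "I love this product!"}, "expected_output": "positive"}
{"input": {"text": "Worst purchase I have made."}, "expected_output": "negative"}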

Function Signature

type EvalFunction = (context: {
  output: any;           // AI-generated result
  expected_output: any;  // Expected result from dataset
  input: any;            // Input props from dataset
}) => Promise<EvalResult>;
// Model-graded evals also receive a second argument with helpers
// such as agentmark (see LLM-as-Judge below).

type EvalResult = {
  passed?: boolean;  // Pass/fail status
  score: number;     // Numeric score (0-1)
  reason?: string;   // Explanation for failure
  label?: string;    // Custom label for categorization
};

Evaluation Types

Reference-Based (Ground Truth)

Compare outputs against known correct answers:
export const exact_match = async ({ output, expected_output }) => {
  return {
    passed: output === expected_output,
    score: output === expected_output ? 1 : 0
  };
};
Use for: Classification, extraction, math problems, multiple choice

Reference-Free (Heuristic)

Check structural requirements without ground truth:
export const has_required_fields = async ({ output }) => {
  const required = ['name', 'email', 'summary'];
  const hasAll = required.every(field => output[field]);
  return {
    passed: hasAll,
    score: hasAll ? 1 : 0,
    reason: hasAll ? undefined : 'Missing required fields'
  };
};
Use for: Format validation, length checks, required content

Model-Graded (LLM-as-Judge)

Use an LLM to evaluate subjective criteria:
export const tone_eval = async ({ output, expected_output }, { agentmark }) => {
  const judge = await agentmark.loadObjectPrompt('evals/tone-judge.prompt.mdx');
  const input = await judge.format({
    props: { output, expected_output }
  });
  const result = await ...; // Run with your SDK

  return {
    passed: result.passed,
    score: result.passed ? 1 : 0,
    reason: result.reasoning
  };
};
Use for: Tone, creativity, helpfulness, semantic similarity
Combine approaches - Use reference-based for correctness, reference-free for structure, and model-graded for subjective quality.
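For example, a single prompt can list one eval of each type in its frontmatter (the eval names below are illustrative):
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - exact_match          # reference-based: correctness
    - has_required_fields  # reference-free: structure
    - tone_eval            # model-graded: subjective quality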

Common Patterns

Classification

export const classification_accuracy = async ({ output, expected_output }) => {
  const match = output.trim().toLowerCase() === expected_output.trim().toLowerCase();
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected ${expected_output}, got ${output}`
  };
};

Contains Keyword

export const contains_keyword = async ({ output, expected_output }) => {
  const contains = output.includes(expected_output);
  return {
    passed: contains,
    score: contains ? 1 : 0,
    reason: contains ? undefined : `Output missing "${expected_output}"`
  };
};

Field Presence

export const required_fields = async ({ output }) => {
  const required = ['name', 'email', 'message'];
  const missing = required.filter(field => !(field in output));

  return {
    passed: missing.length === 0,
    score: (required.length - missing.length) / required.length,
    reason: missing.length > 0 ? `Missing: ${missing.join(', ')}` : undefined
  };
};

Length Check

export const length_check = async ({ output }) => {
  const length = output.length;
  const passed = length >= 10 && length <= 500;
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : `Length ${length} outside range [10, 500]`
  };
};

Format Validation

export const email_format = async ({ output }) => {
  const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
  const passed = emailRegex.test(output);
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : 'Invalid email format'
  };
};

Graduated Scoring

Use the label field to categorize results:
export const sentiment_gradual = async ({ output, expected_output }) => {
  if (output === expected_output) {
    return { passed: true, score: 1.0, label: 'exact_match' };
  }

  const partialMatches = {
    'positive': ['very positive', 'somewhat positive'],
    'negative': ['very negative', 'somewhat negative']
  };

  if (partialMatches[expected_output]?.includes(output)) {
    return {
      passed: true,
      score: 0.7,
      label: 'partial_match',
      reason: 'Close semantic match'
    };
  }

  return {
    passed: false,
    score: 0,
    label: 'no_match',
    reason: `Expected ${expected_output}, got ${output}`
  };
};
Filter results by label (exact_match, partial_match, no_match) to understand failure patterns.

LLM-as-Judge

1. Create an eval prompt (agentmark/evals/tone-judge.prompt.mdx):
---
object_config:
  model_name: gpt-4o-mini
  temperature: 0.1
  schema:
    type: object
    properties:
      passed:
        type: boolean
      reasoning:
        type: string
---

<System>
You are evaluating whether an AI response has appropriate professional tone.

First explain your reasoning step-by-step, then provide your final judgment.
</System>

<User>
**Output to evaluate:**
{props.output}

**Expected tone:**
{props.expected_output}
</User>
2. Use it in an eval function:
export const tone_check = async ({ output, expected_output }, { agentmark }) => {
  const evalPrompt = await agentmark.loadObjectPrompt('evals/tone-judge.prompt.mdx');
  const input = await evalPrompt.format({
    props: { output, expected_output }
  });
  const result = await ...; // Run with your SDK

  return {
    passed: result.passed,
    score: result.passed ? 1 : 0,
    reason: result.reasoning
  };
};
Benefits: Version control eval logic, iterate independently, reuse prompts, leverage templating.

LLM-as-Judge Best Practices

Configuration:
  • Use low temperature (0.1-0.3) for consistency
  • Ask for reasoning before judgment (chain-of-thought)
  • Use binary scoring (PASS/FAIL) not scales (1-10)
  • Test one dimension at a time
Model selection:
  • Use a stronger model to grade weaker models (GPT-4 → GPT-3.5)
  • Avoid grading a model with itself
  • Validate with human evaluation before scaling
Usage:
  • Use sparingly - model-graded evals are slower and more expensive than programmatic checks
  • Reserve for subjective criteria
  • Watch for position bias, verbosity bias, self-enhancement bias
Avoid exact-match for open-ended outputs - Use only for classification or short outputs. For longer text, use semantic similarity or LLM-based evaluation.
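When exact match is too strict but an LLM judge is overkill, a token-overlap (Jaccard) score is a crude, model-free stand-in for semantic similarity. This is a minimal sketch; for real paraphrase detection prefer embeddings or an LLM judge:
export const token_overlap = async ({ output, expected_output }) => {
  const tokens = (text) => new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  const a = tokens(output);
  const b = tokens(expected_output);
  const intersection = [...a].filter(token => b.has(token)).length;
  const union = new Set([...a, ...b]).size;
  const score = union === 0 ? 0 : intersection / union;

  return {
    passed: score >= 0.5,  // Threshold is arbitrary; tune it per task
    score,
    reason: score >= 0.5 ? undefined : `Token overlap ${score.toFixed(2)} below threshold`
  };
};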

Domain-Specific Evals

RAG (Retrieval-Augmented Generation)

export const faithfulness = async ({ output, input }) => {
  const context = input.retrieved_context;
  // extractClaims and isSupported are placeholders for your own claim
  // extraction and verification logic (a naive sketch follows the next example).
  const claims = extractClaims(output);
  const supported = claims.every(claim => isSupported(claim, context));

  return {
    passed: supported,
    score: supported ? 1 : 0,
    reason: supported ? undefined : 'Output contains unsupported claims'
  };
};

export const answer_relevancy = async ({ output, input }) => {
  // Crude heuristic: passes only if the query text appears verbatim in the answer.
  const isRelevant = output.toLowerCase().includes(input.query.toLowerCase());
  return {
    passed: isRelevant,
    score: isRelevant ? 1 : 0,
    reason: isRelevant ? undefined : 'Answer not relevant to query'
  };
};
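The faithfulness example above calls extractClaims and isSupported, which you implement yourself. A deliberately naive sketch (sentence splitting plus keyword containment) is shown here; production RAG evals typically replace this with an NLI model or an LLM judge:
// Naive placeholder: treat each sentence of the output as one claim.
const extractClaims = (text) =>
  text.split(/(?<=[.!?])\s+/).map(sentence => sentence.trim()).filter(Boolean);

// Naive placeholder: a claim counts as supported if most of its
// significant words appear somewhere in the retrieved context.
const isSupported = (claim, context) => {
  const contextLower = context.toLowerCase();
  const words = claim.toLowerCase().split(/\W+/).filter(word => word.length > 3);
  if (words.length === 0) return true;
  const found = words.filter(word => contextLower.includes(word)).length;
  return found / words.length >= 0.6;  // Arbitrary threshold
};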

Agent/Tool Calling

export const tool_correctness = async ({ output, expected_output }) => {
  const correctTool = output.tool === expected_output.tool;
  const correctParams = JSON.stringify(output.parameters) ===
                        JSON.stringify(expected_output.parameters);

  return {
    passed: correctTool && correctParams,
    // Partial credit when the right tool is called with the wrong parameters.
    score: correctTool && correctParams ? 1 : correctTool ? 0.5 : 0,
    reason: !correctTool ? 'Wrong tool selected' :
            !correctParams ? 'Incorrect parameters' : undefined
  };
};

Best Practices

  • Test one thing per eval - separate functions for different criteria
  • Provide helpful failure reasons for debugging
  • Use meaningful names (sentiment_accuracy not eval1)
  • Keep scores in 0-1 range
  • Make evals deterministic and consistent (avoid flaky tests)
  • Validate general behavior, not specific outputs (avoid overfitting)

Next Steps