Evaluations (evals) are functions that score prompt outputs and determine pass/fail status.
Start with evals first - Build your evaluation framework before writing prompts. Evals provide the foundation for measuring effectiveness and iterating.

Quick Start

Create an eval function that returns {passed, score, reason}:
export const accuracy = async ({ output, expected_output, input }) => {
  const match = output.trim() === expected_output.trim();
  return {
    passed: match,
    score: match ? 1.0 : 0.0,
    reason: match ? undefined : `Expected "${expected_output}", got "${output}"`
  };
};
Reference it in your prompt’s frontmatter:
---
title: Sentiment Classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
---
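Each dataset row supplies the input props and the expected_output that your evals receive. As an illustrative sketch only (the exact field names depend on your dataset schema), a sentiment dataset row might look like:
{"input": {"text": "I love this product!"}, "expected_output": "positive"}
{"input": {"text": "Worst purchase I have made."}, "expected_output": "negative"}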

Function Signature

type EvalFunction = (context: {
  output: any;           // AI-generated result
  expected_output: any;  // Expected result from dataset
  input: any;            // Input props from dataset
}) => Promise<EvalResult>;
// Model-graded evals also receive a second argument with helpers
// such as agentmark (see LLM-as-Judge below).

type EvalResult = {
  passed?: boolean;  // Pass/fail status
  score: number;     // Numeric score (0-1)
  reason?: string;   // Explanation for failure
  label?: string;    // Custom label for categorization
};

Evaluation Types

Reference-Based (Ground Truth)

Compare outputs against known correct answers:
export const exact_match = async ({ output, expected_output }) => {
  return {
    passed: output === expected_output,
    score: output === expected_output ? 1 : 0
  };
};
Use for: Classification, extraction, math problems, multiple choice

Reference-Free (Heuristic)

Check structural requirements without ground truth:
export const has_required_fields = async ({ output }) => {
  const required = ['name', 'email', 'summary'];
  const hasAll = required.every(field => output[field]);
  return {
    passed: hasAll,
    score: hasAll ? 1 : 0,
    reason: hasAll ? undefined : 'Missing required fields'
  };
};
Use for: Format validation, length checks, required content

Model-Graded (LLM-as-Judge)

Use an LLM to evaluate subjective criteria:
export const tone_eval = async ({ output, expected_output }, { agentmark }) => {
  const judge = await agentmark.loadObjectPrompt('evals/tone-judge.prompt.mdx');
  const input = await judge.format({
    props: { output, expected_output }
  });
  const result = await ...; // Run with your SDK

  return {
    passed: result.passed,
    score: result.passed ? 1 : 0,
    reason: result.reasoning
  };
};
Use for: Tone, creativity, helpfulness, semantic similarity
Combine approaches - Use reference-based for correctness, reference-free for structure, and model-graded for subjective quality.
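For example, a single prompt can list one eval of each type in its frontmatter (the eval names below are illustrative):
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - exact_match          # reference-based: correctness
    - has_required_fields  # reference-free: structure
    - tone_eval            # model-graded: subjective quality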

Common Patterns

Classification

export const classification_accuracy = async ({ output, expected_output }) => {
  const match = output.trim().toLowerCase() === expected_output.trim().toLowerCase();
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected ${expected_output}, got ${output}`
  };
};

Contains Keyword

export const contains_keyword = async ({ output, expected_output }) => {
  const contains = output.includes(expected_output);
  return {
    passed: contains,
    score: contains ? 1 : 0,
    reason: contains ? undefined : `Output missing "${expected_output}"`
  };
};

Field Presence

export const required_fields = async ({ output }) => {
  const required = ['name', 'email', 'message'];
  const missing = required.filter(field => !(field in output));

  return {
    passed: missing.length === 0,
    score: (required.length - missing.length) / required.length,
    reason: missing.length > 0 ? `Missing: ${missing.join(', ')}` : undefined
  };
};

Length Check

export const length_check = async ({ output }) => {
  const length = output.length;
  const passed = length >= 10 && length <= 500;
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : `Length ${length} outside range [10, 500]`
  };
};

Format Validation

export const email_format = async ({ output }) => {
  const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
  const passed = emailRegex.test(output);
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : 'Invalid email format'
  };
};

Graduated Scoring

Use the label field to categorize results:
export const sentiment_gradual = async ({ output, expected_output }) => {
  if (output === expected_output) {
    return { passed: true, score: 1.0, label: 'exact_match' };
  }

  const partialMatches = {
    'positive': ['very positive', 'somewhat positive'],
    'negative': ['very negative', 'somewhat negative']
  };

  if (partialMatches[expected_output]?.includes(output)) {
    return {
      passed: true,
      score: 0.7,
      label: 'partial_match',
      reason: 'Close semantic match'
    };
  }

  return {
    passed: false,
    score: 0,
    label: 'no_match',
    reason: `Expected ${expected_output}, got ${output}`
  };
};
Filter results by label (exact_match, partial_match, no_match) to understand failure patterns.

LLM-as-Judge

1. Create an eval prompt (agentmark/evals/tone-judge.prompt.mdx):
---
object_config:
  model_name: gpt-4o-mini
  temperature: 0.1
  schema:
    type: object
    properties:
      passed:
        type: boolean
      reasoning:
        type: string
---

<System>
You are evaluating whether an AI response has appropriate professional tone.

First explain your reasoning step-by-step, then provide your final judgment.
</System>

<User>
**Output to evaluate:**
{props.output}

**Expected tone:**
{props.expected_output}
</User>
2. Use it in an eval function:
export const tone_check = async ({ output, expected_output }, { agentmark }) => {
  const evalPrompt = await agentmark.loadObjectPrompt('evals/tone-judge.prompt.mdx');
  const input = await evalPrompt.format({
    props: { output, expected_output }
  });
  const result = await ...; // Run with your SDK

  return {
    passed: result.passed,
    score: result.passed ? 1 : 0,
    reason: result.reasoning
  };
};
Benefits: Version control eval logic, iterate independently, reuse prompts, leverage templating.

LLM-as-Judge Best Practices

Configuration:
  • Use low temperature (0.1-0.3) for consistency
  • Ask for reasoning before judgment (chain-of-thought)
  • Use binary scoring (PASS/FAIL) not scales (1-10)
  • Test one dimension at a time
Model selection:
  • Use a stronger model to grade weaker models (GPT-4 → GPT-3.5)
  • Avoid grading a model with itself
  • Validate with human evaluation before scaling
Usage:
  • Use sparingly - model-graded evals are slower and more expensive than programmatic checks
  • Reserve for subjective criteria
  • Watch for position bias, verbosity bias, self-enhancement bias
Avoid exact-match for open-ended outputs - Use only for classification or short outputs. For longer text, use semantic similarity or LLM-based evaluation.
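When exact match is too strict but an LLM judge is overkill, a token-overlap (Jaccard) score is a crude, model-free stand-in for semantic similarity. This is a minimal sketch; for real paraphrase detection prefer embeddings or an LLM judge:
export const token_overlap = async ({ output, expected_output }) => {
  const tokens = (text) => new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  const a = tokens(output);
  const b = tokens(expected_output);
  const intersection = [...a].filter(token => b.has(token)).length;
  const union = new Set([...a, ...b]).size;
  const score = union === 0 ? 0 : intersection / union;

  return {
    passed: score >= 0.5,  // Threshold is arbitrary; tune it per task
    score,
    reason: score >= 0.5 ? undefined : `Token overlap ${score.toFixed(2)} below threshold`
  };
};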

Domain-Specific Evals

RAG (Retrieval-Augmented Generation)

export const faithfulness = async ({ output, input }) => {
  const context = input.retrieved_context;
  // extractClaims and isSupported are placeholders for your own claim
  // extraction and verification logic (a naive sketch follows the next example).
  const claims = extractClaims(output);
  const supported = claims.every(claim => isSupported(claim, context));

  return {
    passed: supported,
    score: supported ? 1 : 0,
    reason: supported ? undefined : 'Output contains unsupported claims'
  };
};

export const answer_relevancy = async ({ output, input }) => {
  // Crude heuristic: passes only if the query text appears verbatim in the answer.
  const isRelevant = output.toLowerCase().includes(input.query.toLowerCase());
  return {
    passed: isRelevant,
    score: isRelevant ? 1 : 0,
    reason: isRelevant ? undefined : 'Answer not relevant to query'
  };
};
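The faithfulness example above calls extractClaims and isSupported, which you implement yourself. A deliberately naive sketch (sentence splitting plus keyword containment) is shown here; production RAG evals typically replace this with an NLI model or an LLM judge:
// Naive placeholder: treat each sentence of the output as one claim.
const extractClaims = (text) =>
  text.split(/(?<=[.!?])\s+/).map(sentence => sentence.trim()).filter(Boolean);

// Naive placeholder: a claim counts as supported if most of its
// significant words appear somewhere in the retrieved context.
const isSupported = (claim, context) => {
  const contextLower = context.toLowerCase();
  const words = claim.toLowerCase().split(/\W+/).filter(word => word.length > 3);
  if (words.length === 0) return true;
  const found = words.filter(word => contextLower.includes(word)).length;
  return found / words.length >= 0.6;  // Arbitrary threshold
};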

Agent/Tool Calling

export const tool_correctness = async ({ output, expected_output }) => {
  const correctTool = output.tool === expected_output.tool;
  const correctParams = JSON.stringify(output.parameters) ===
                        JSON.stringify(expected_output.parameters);

  return {
    passed: correctTool && correctParams,
    // Partial credit when the right tool is called with the wrong parameters.
    score: correctTool && correctParams ? 1 : correctTool ? 0.5 : 0,
    reason: !correctTool ? 'Wrong tool selected' :
            !correctParams ? 'Incorrect parameters' : undefined
  };
};

Best Practices

  • Test one thing per eval - separate functions for different criteria
  • Provide helpful failure reasons for debugging
  • Use meaningful names (sentiment_accuracy not eval1)
  • Keep scores in 0-1 range
  • Make evals deterministic and consistent (avoid flaky tests)
  • Validate general behavior, not specific outputs (avoid overfitting)

Next Steps