Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.agentmark.co/llms.txt

Use this file to discover all available pages before exploring further.

Evaluations (evals) are functions that score prompt outputs and determine pass/fail status.
Start with evals first - Build your evaluation framework before writing prompts. Evals provide the foundation for measuring effectiveness and iterating.

Quick start

1. Define score schemas in agentmark.json:
agentmark.json
{
  "scores": {
    "accuracy": {
      "type": "boolean",
      "description": "Was the response factually correct?"
    }
  }
}
2. Create an eval function that returns {passed, score, reason}:
export const accuracy = async ({ output, expectedOutput, input }) => {
  const match = output.trim() === expectedOutput.trim();
  return {
    passed: match,
    score: match ? 1.0 : 0.0,
    reason: match ? undefined : `Expected "${expectedOutput}", got "${output}"`
  };
};
3. Reference in your prompt’s frontmatter:
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - accuracy
---

Writing eval functions

Eval functions are plain functions that score prompt outputs. Score schemas are defined separately in agentmark.json — the two are connected by name.
const evals = {
  accuracy: async ({ output, expectedOutput }) => {
    const match = output.trim() === expectedOutput?.trim();
    return { passed: match, score: match ? 1 : 0 };
  },
  tone: async ({ output }) => {
    const isProfessional = !output.includes("lol") && !output.includes("!!!");
    return {
      passed: isProfessional,
      label: isProfessional ? "professional" : "casual",
    };
  },
};
Define the corresponding score schemas in agentmark.json so they are available in the Dashboard:
agentmark.json
{
  "scores": {
    "accuracy": {
      "type": "boolean",
      "description": "Was the response factually correct?"
    },
    "tone": {
      "type": "categorical",
      "description": "Response tone classification",
      "categories": [
        { "label": "professional", "value": 1 },
        { "label": "casual", "value": 0.5 },
        { "label": "inappropriate", "value": 0 }
      ]
    }
  }
}
Score schemas defined in agentmark.json are synced to AgentMark Cloud through the deployment pipeline and remain available in the Dashboard across deployments.

Function signature

interface EvalParams {
  input: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
  output: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
  expectedOutput?: string;  // Maps from dataset's expected_output field
  metadata?: Record<string, unknown> | null;
}

interface EvalResult {
  score?: number;    // Numeric score (0-1 recommended)
  passed?: boolean;  // Pass/fail status (used by --threshold)
  label?: string;    // Classification label for categorization
  reason?: string;   // Explanation for the result
}

type EvalFunction = (params: EvalParams) => Promise<EvalResult> | EvalResult;

Registering evals

Pass your eval functions to the AgentMark client using the evals option:
const client = createAgentMarkClient({
  loader,
  modelRegistry,
  evals: {
    accuracy: accuracyFn,
    contains_keyword: containsKeywordFn,
  },
});

Eval function registry

Eval functions are plain functions mapped by name. In TypeScript, use Record<string, EvalFunction>. In Python, use dict[str, EvalFunction]. Score schemas are defined separately in agentmark.json and deployed to AgentMark Cloud. The eval functions run during experiments and are connected to scores by name.
import type { EvalFunction } from "@agentmark-ai/prompt-core";

const evals: Record<string, EvalFunction> = {
  accuracy: accuracyFn,
  relevance: relevanceFn,
};

// Standard object operations
evals["new_eval"] = newEvalFn;
const fn = evals["accuracy"];
const exists = "accuracy" in evals;
delete evals["accuracy"];
const names = Object.keys(evals);
All fields in EvalResult are optional. Return whichever fields are relevant to your eval. The passed field is used by the CLI --threshold flag to calculate pass rates.

Evaluation types

Reference-based (ground truth)

Compare outputs against known correct answers:
export const exact_match = async ({ output, expectedOutput }) => {
  return {
    passed: output === expectedOutput,
    score: output === expectedOutput ? 1 : 0
  };
};
Use for: Classification, extraction, math problems, multiple choice

Reference-free (heuristic)

Check structural requirements without ground truth:
export const has_required_fields = async ({ output }) => {
  const required = ['name', 'email', 'summary'];
  const hasAll = required.every(field => output[field]);
  return {
    passed: hasAll,
    score: hasAll ? 1 : 0,
    reason: hasAll ? undefined : 'Missing required fields'
  };
};
Use for: Format validation, length checks, required content

Model-graded (LLM-as-judge)

Use an LLM to evaluate subjective criteria:
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

export const tone_eval = async ({ output, expectedOutput }) => {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'),
    schema: z.object({ passed: z.boolean(), reasoning: z.string() }),
    prompt: `Evaluate if this response has appropriate ${expectedOutput} tone:\n\n${output}`,
    temperature: 0.1,
  });

  return {
    passed: object.passed,
    score: object.passed ? 1 : 0,
    reason: object.reasoning,
  };
};
Use for: Tone, creativity, helpfulness, semantic similarity
Combine approaches - Use reference-based for correctness, reference-free for structure, and model-graded for subjective quality.

Common patterns

Classification

export const classification_accuracy = async ({ output, expectedOutput }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return {
    passed: match,
    score: match ? 1 : 0,
    reason: match ? undefined : `Expected ${expectedOutput}, got ${output}`
  };
};

Contains keyword

export const contains_keyword = async ({ output, expectedOutput }) => {
  const contains = output.includes(expectedOutput);
  return {
    passed: contains,
    score: contains ? 1 : 0,
    reason: contains ? undefined : `Output missing "${expectedOutput}"`
  };
};

Field presence

export const required_fields = async ({ output }) => {
  const required = ['name', 'email', 'message'];
  const missing = required.filter(field => !(field in output));

  return {
    passed: missing.length === 0,
    score: (required.length - missing.length) / required.length,
    reason: missing.length > 0 ? `Missing: ${missing.join(', ')}` : undefined
  };
};

Length check

export const length_check = async ({ output }) => {
  const length = output.length;
  const passed = length >= 10 && length <= 500;
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : `Length ${length} outside range [10, 500]`
  };
};

Format validation

export const email_format = async ({ output }) => {
  const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
  const passed = emailRegex.test(output);
  return {
    passed,
    score: passed ? 1 : 0,
    reason: passed ? undefined : 'Invalid email format'
  };
};

Graduated scoring

Use label field to categorize results:
export const sentiment_gradual = async ({ output, expectedOutput }) => {
  if (output === expectedOutput) {
    return { passed: true, score: 1.0, label: 'exact_match' };
  }

  const partialMatches = {
    'positive': ['very positive', 'somewhat positive'],
    'negative': ['very negative', 'somewhat negative']
  };

  if (partialMatches[expectedOutput]?.includes(output)) {
    return {
      passed: true,
      score: 0.7,
      label: 'partial_match',
      reason: 'Close semantic match'
    };
  }

  return {
    passed: false,
    score: 0,
    label: 'no_match',
    reason: `Expected ${expectedOutput}, got ${output}`
  };
};
Filter by label (exact_match, partial_match, no_match) to understand patterns.

LLM-as-judge

1. Create eval prompt (agentmark/evals/tone-judge.prompt.mdx):
---
name: tone-judge
object_config:
  model_name: openai/gpt-4o-mini
  temperature: 0.1
  schema:
    type: object
    properties:
      passed:
        type: boolean
      reasoning:
        type: string
---

<System>
You are evaluating whether an AI response has appropriate professional tone.

First explain your reasoning step-by-step, then provide your final judgment.
</System>

<User>
**Output to evaluate:**
{props.output}

**Expected tone:**
{props.expectedOutput}
</User>
2. Use in eval function:
import { client } from './agentmark-client';
import { generateObject } from 'ai';

export const tone_check = async ({ output, expectedOutput }) => {
  const evalPrompt = await client.loadObjectPrompt('evals/tone-judge.prompt.mdx');
  const formatted = await evalPrompt.format({
    props: { output, expectedOutput }
  });
  const { object } = await generateObject(formatted);

  return {
    passed: object.passed,
    score: object.passed ? 1 : 0,
    reason: object.reasoning,
  };
};
Benefits: Version control eval logic, iterate independently, reuse prompts, leverage templating.

LLM-as-judge best practices

Configuration:
  • Use low temperature (0.1-0.3) for consistency
  • Ask for reasoning before judgment (chain-of-thought)
  • Use binary scoring (PASS/FAIL) not scales (1-10)
  • Test one dimension at a time
Model selection:
  • Use stronger model to grade weaker models (GPT-4 → GPT-3.5)
  • Avoid grading a model with itself
  • Validate with human evaluation before scaling
Usage:
  • Use sparingly - slower and more expensive
  • Reserve for subjective criteria
  • Watch for position bias, verbosity bias, self-enhancement bias
Avoid exact-match for open-ended outputs - Use only for classification or short outputs. For longer text, use semantic similarity or LLM-based evaluation.

Domain-specific evals

RAG (retrieval-augmented generation)

export const faithfulness = async ({ output, input }) => {
  const context = input.retrieved_context;
  const claims = extractClaims(output);
  const supported = claims.every(claim => isSupported(claim, context));

  return {
    passed: supported,
    score: supported ? 1 : 0,
    reason: supported ? undefined : 'Output contains unsupported claims'
  };
};

export const answer_relevancy = async ({ output, input }) => {
  const isRelevant = output.toLowerCase().includes(input.query.toLowerCase());
  return {
    passed: isRelevant,
    score: isRelevant ? 1 : 0,
    reason: isRelevant ? undefined : 'Answer not relevant to query'
  };
};

Agent / tool calling

export const tool_correctness = async ({ output, expectedOutput }) => {
  const correctTool = output.tool === expectedOutput.tool;
  const correctParams = JSON.stringify(output.parameters) ===
                       JSON.stringify(expectedOutput.parameters);

  return {
    passed: correctTool && correctParams,
    score: correctTool && correctParams ? 1 : 0.5,
    reason: !correctTool ? 'Wrong tool selected' :
            !correctParams ? 'Incorrect parameters' : undefined
  };
};

Best practices

  • Test one thing per eval - separate functions for different criteria
  • Provide helpful failure reasons for debugging
  • Use meaningful names (sentiment_accuracy not eval1)
  • Keep scores in 0-1 range
  • Make evals deterministic and consistent (avoid flaky tests)
  • Validate general behavior, not specific outputs (avoid overfitting)

Next steps

Datasets

Create test datasets

Running Experiments

Run your evaluations

Testing Overview

Learn testing concepts

Have Questions?

We’re here to help! Choose the best way to reach us: