Evaluating AI Responses

AgentMark provides a system to assess the quality and correctness of AI model responses. This guide explains how to set up and use evaluations.

Overview

Evaluations in AgentMark enable you to:

  • Assess model outputs against expected answers
  • Score responses based on custom criteria
  • Track performance metrics

Getting Started with Evaluations

Prerequisites

Before using evaluations, ensure you have:

  1. Configured your AgentMark environment (an example .env follows this list)
  2. Created prompt templates you want to evaluate
  3. Set up telemetry for tracing
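
The implementation example below loads these credentials with dotenv. A minimal .env sketch using the same variable names as that example (all values are placeholders):

AGENTMARK_API_KEY=<your-api-key>
AGENTMARK_APP_ID=<your-app-id>
AGENTMARK_BASE_URL=<your-agentmark-base-url>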

Basic Evaluation Flow

The evaluation workflow:

  1. Run a prompt to get a model response
  2. Evaluate the response using an evaluation prompt
  3. Score the results

Required: Trace and Span IDs

While traceId and spanId are optional for general telemetry, they are required for evaluations:

  1. Generate unique IDs for each evaluation
  2. Include them in telemetry metadata when running the prompt
  3. Use the same traceId OR spanId as the resourceId when scoring

Implementation Example

import "dotenv/config";
import { AgentMarkSDK } from "@agentmark/sdk";
import {
  createAgentMarkClient,
  VercelAIModelRegistry,
} from "@agentmark/vercel-ai-v4-adapter";
import { openai } from "@ai-sdk/openai";
import { generateObject, generateText } from "ai";

// Initialize the AgentMark SDK
const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
  baseUrl: process.env.AGENTMARK_BASE_URL!,
});

// Enable telemetry tracing on the SDK instance
const tracer = sdk.initTracing({
  disableBatch: true,
});

// Configure AgentMark with the Vercel AI v4 adapter
const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerModels("gpt-4o-mini", (name: string) => {
  return openai(name);
});

const agentmark = createAgentMarkClient({
  loader: sdk.getFileLoader(),
  modelRegistry,
});

async function evaluateMathProblem() {
  // Generate IDs - REQUIRED for evaluations
  const traceId = crypto.randomUUID();
  const spanId = crypto.randomUUID();

  // 1. Fetch and run the prompt
  const mathPrompt = await agentmark.loadTextPrompt("math_problem.prompt.mdx");

  const vercelInput = await mathPrompt.format({
    props: {
      problem: "If x² + 10x + 25 = 0, what is the value of x?"
    },
    telemetry: {
      isEnabled: true,
      metadata: {
        traceName: "math-eval-test",
        traceId,  // Required for evaluation scoring
        spanId    // Either ID can later serve as the resourceId
      }
    }
  });

  const result = await generateText(vercelInput);

  // 2. Run the evaluation prompt
  const evalPrompt = await agentmark.loadObjectPrompt("evals/correctness.prompt.mdx");
  
  const vercelInputEval = await evalPrompt.format({
    props: {
      input: "If x² + 10x + 25 = 0, what is the value of x?",
      output: result.text,
      expected_output: "x = -5",
    },
  });

  const evalResult = await generateObject(vercelInputEval);

  // 3. Score the results
  await sdk.score({
    resourceId: traceId,  // You could use spanId instead
    label: evalResult.object.correctness.label,
    reason: evalResult.object.correctness.reason,
    score: evalResult.object.correctness.score,
    name: "correctness",
  });
  
  await tracer.shutdown();
  return evalResult;
}
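
You can then run the function like any other async routine. A minimal usage sketch (the console logging is illustrative and not part of the AgentMark API):

evaluateMathProblem()
  .then((evalResult) => {
    // The structured evaluation produced by generateObject
    console.log("Correctness evaluation:", evalResult.object.correctness);
  })
  .catch((error) => {
    console.error("Evaluation failed:", error);
  });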

Creating Evaluation Templates

Evaluation templates define how responses are assessed:

---
name: correctness
object_config:
  model_name: gpt-4o-mini
  max_tokens: 4096
  temperature: 0.7
  schema:
    type: object
    properties:
      correctness:
        type: object
        properties:
          label:
            type: string
            description: label of the answer
          score:
            type: number
            description: score of the answer (0-1)
          reason:
            type: string
            description: reason for the score
        required:
          - label
          - score
          - reason
    required:
      - correctness
---

<System>
You are an evaluation assistant focused on assessing the correctness of responses based on:
1. Accuracy - factually correct information
2. Completeness - all required elements present
3. Relevance - addresses the question

Assign a label ("correct", "partially correct", or "incorrect"), score (0-1), and provide reasoning.
</System>
<User>
Please evaluate the following response:

**Input:**
{props.input}
--------------------------
**Expected Output:**
{props.expected_output}
--------------------------
**Output:**
{props.output}
--------------------------

Analyze based on accuracy, completeness, and relevance.
</User>
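
For reference, the structured output this template produces matches the following TypeScript shape (a sketch derived from the schema above, not a type exported by AgentMark):

interface CorrectnessEvaluation {
  correctness: {
    label: string;   // "correct", "partially correct", or "incorrect"
    score: number;   // 0-1
    reason: string;  // explanation for the score
  };
}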

Key Components

1. Trace and Span IDs

  • Either traceId OR spanId can be used as the resourceId when scoring
  • The chosen ID must be included in the telemetry metadata

2. Evaluation Template

  • MDX files with metadata, system prompt, and user prompt
  • Schema defines the structured output format

3. Scoring API Parameters

  • resourceId: Must match either traceId or spanId from the original prompt run
  • name: Evaluation metric name
  • label: Categorical assessment
  • score: Numerical value (0-1)
  • reason: Explanation for the score
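
For example, if you used spanId rather than traceId as the resourceId, the score call would look like this (a sketch assuming the sdk instance and spanId from the implementation example, with illustrative values for the result fields):

await sdk.score({
  resourceId: spanId,   // must match the spanId sent in the telemetry metadata
  name: "correctness",
  label: "correct",
  score: 1,
  reason: "The response matches the expected output.",
});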

Viewing Evaluations

Evaluations can only be viewed through the Traces tab:

  • Navigate to the Traces tab in the dashboard
  • Find the trace used for scoring
  • The trace details will show all associated evaluations

Best Practices

  • Use consistent IDs: Either traceId or spanId for all related evaluations
  • Generate unique IDs: Use crypto.randomUUID() for each evaluation
  • Add descriptive metadata: Makes traces easier to find
  • Use consistent metrics: Same evaluation criteria across similar prompts
  • Multiple dimensions: Evaluate on different aspects (correctness, clarity, etc.); see the sketch below
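
For instance, several dimensions can be scored against the same resourceId by calling score once per metric. A sketch assuming the sdk instance and traceId from the implementation example, where clarity is a hypothetical second evaluation result:

// Score two dimensions against the same trace
await sdk.score({
  resourceId: traceId,
  name: "correctness",
  label: correctness.label,
  score: correctness.score,
  reason: correctness.reason,
});

await sdk.score({
  resourceId: traceId,
  name: "clarity",  // hypothetical second metric
  label: clarity.label,
  score: clarity.score,
  reason: clarity.reason,
});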
