Evaluating AI Responses

AgentMark provides a system to assess the quality and correctness of AI model responses. This guide explains how to set up and use evaluations.

Overview

Evaluations in AgentMark enable you to:

  • Assess model outputs against expected answers
  • Score responses based on custom criteria
  • Track performance metrics

Getting Started with Evaluations

Prerequisites

Before using evaluations, ensure you have:

  1. Configured your AgentMark environment (an example .env follows this list)
  2. Created prompt templates you want to evaluate
  3. Set up telemetry for tracing
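
The implementation example below loads these credentials with dotenv. A minimal .env sketch using the same variable names as that example (all values are placeholders):

AGENTMARK_API_KEY=<your-api-key>
AGENTMARK_APP_ID=<your-app-id>
AGENTMARK_BASE_URL=<your-agentmark-base-url>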

Basic Evaluation Flow

The evaluation workflow:

  1. Run a prompt to get a model response
  2. Evaluate the response using an evaluation prompt
  3. Score the results

Required: Trace and Span IDs

While traceId and spanId are optional for general telemetry, they are required for evaluations:

  1. Generate unique IDs for each evaluation
  2. Include them in telemetry metadata when running the prompt
  3. Use the same traceId OR spanId as the resourceId when scoring

Implementation Example

import "dotenv/config";
import { AgentMarkSDK } from "@agentmark/sdk";
import {
  createAgentMarkClient,
  VercelAIModelRegistry,
} from "@agentmark/vercel-ai-v4-adapter";
import { openai } from "@ai-sdk/openai";
import { generateObject, generateText } from "ai";

// Initialize the AgentMark SDK
const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
  baseUrl: process.env.AGENTMARK_BASE_URL!,
});

// Enable telemetry tracing on the SDK instance
const tracer = sdk.initTracing({
  disableBatch: true,
});

// Configure AgentMark with the Vercel AI v4 adapter
const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerModels("gpt-4o-mini", (name: string) => {
  return openai(name);
});

const agentmark = createAgentMarkClient({
  loader: sdk.getFileLoader(),
  modelRegistry,
});

async function evaluateMathProblem() {
  // Generate IDs - REQUIRED for evaluations
  const traceId = crypto.randomUUID();
  const spanId = crypto.randomUUID();

  // 1. Fetch and run the prompt
  const mathPrompt = await agentmark.loadTextPrompt("math_problem.prompt.mdx");

  const vercelInput = await mathPrompt.format({
    props: {
      problem: "If x² + 10x + 25 = 0, what is the value of x?"
    },
    telemetry: {
      isEnabled: true,
      metadata: {
        traceName: "math-eval-test",
        traceId,  // Required for evaluation scoring
        spanId    // Either ID can later serve as the resourceId
      }
    }
  });

  const result = await generateText(vercelInput);

  // 2. Run the evaluation prompt
  const evalPrompt = await agentmark.loadObjectPrompt("evals/correctness.prompt.mdx");
  
  const vercelInputEval = await evalPrompt.format({
    props: {
      input: "If x² + 10x + 25 = 0, what is the value of x?",
      output: result.text,
      expected_output: "x = -5",
    },
  });

  const evalResult = await generateObject(vercelInputEval);

  // 3. Score the results
  await sdk.score({
    resourceId: traceId,  // You could use spanId instead
    label: evalResult.object.correctness.label,
    reason: evalResult.object.correctness.reason,
    score: evalResult.object.correctness.score,
    name: "correctness",
  });
  
  await tracer.shutdown();
  return evalResult;
}
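
You can then run the function like any other async routine. A minimal usage sketch (the console logging is illustrative and not part of the AgentMark API):

evaluateMathProblem()
  .then((evalResult) => {
    // The structured evaluation produced by generateObject
    console.log("Correctness evaluation:", evalResult.object.correctness);
  })
  .catch((error) => {
    console.error("Evaluation failed:", error);
  });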

Creating Evaluation Templates

Evaluation templates define how responses are assessed:

---
name: correctness
object_config:
  model_name: gpt-4o-mini
  max_tokens: 4096
  temperature: 0.7
  schema:
    type: object
    properties:
      correctness:
        type: object
        properties:
          label:
            type: string
            description: label of the answer
          score:
            type: number
            description: score of the answer (0-1)
          reason:
            type: string
            description: reason for the score
        required:
          - label
          - score
          - reason
    required:
      - correctness
---

<System>
You are an evaluation assistant focused on assessing the correctness of responses based on:
1. Accuracy - factually correct information
2. Completeness - all required elements present
3. Relevance - addresses the question

Assign a label ("correct", "partially correct", or "incorrect"), score (0-1), and provide reasoning.
</System>
<User>
Please evaluate the following response:

**Input:**
{props.input}
--------------------------
**Expected Output:**
{props.expected_output}
--------------------------
**Output:**
{props.output}
--------------------------

Analyze based on accuracy, completeness, and relevance.
</User>
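
For reference, the structured output this template produces matches the following TypeScript shape (a sketch derived from the schema above, not a type exported by AgentMark):

interface CorrectnessEvaluation {
  correctness: {
    label: string;   // "correct", "partially correct", or "incorrect"
    score: number;   // 0-1
    reason: string;  // explanation for the score
  };
}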

Key Components

1. Trace and Span IDs

  • Either traceId OR spanId can be used as the resourceId when scoring
  • The chosen ID must be included in the telemetry metadata

2. Evaluation Template

  • MDX files with metadata, system prompt, and user prompt
  • Schema defines the structured output format

3. Scoring API Parameters

  • resourceId: Must match either traceId or spanId from the original prompt run
  • name: Evaluation metric name
  • label: Categorical assessment
  • score: Numerical value (0-1)
  • reason: Explanation for the score
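
For example, if you used spanId rather than traceId as the resourceId, the score call would look like this (a sketch assuming the sdk instance and spanId from the implementation example, with illustrative values for the result fields):

await sdk.score({
  resourceId: spanId,   // must match the spanId sent in the telemetry metadata
  name: "correctness",
  label: "correct",
  score: 1,
  reason: "The response matches the expected output.",
});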

Viewing Evaluations

Evaluations can only be viewed through the Traces tab:

  • Navigate to the Traces tab in the dashboard
  • Find the trace used for scoring
  • The trace details will show all associated evaluations

Best Practices

  • Use consistent IDs: Either traceId or spanId for all related evaluations
  • Generate unique IDs: Use crypto.randomUUID() for each evaluation
  • Add descriptive metadata: Makes traces easier to find
  • Use consistent metrics: Same evaluation criteria across similar prompts
  • Multiple dimensions: Evaluate on different aspects (correctness, clarity, etc.); see the sketch below
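
For instance, several dimensions can be scored against the same resourceId by calling score once per metric. A sketch assuming the sdk instance and traceId from the implementation example, where clarity is a hypothetical second evaluation result:

// Score two dimensions against the same trace
await sdk.score({
  resourceId: traceId,
  name: "correctness",
  label: correctness.label,
  score: correctness.score,
  reason: correctness.reason,
});

await sdk.score({
  resourceId: traceId,
  name: "clarity",  // hypothetical second metric
  label: clarity.label,
  score: clarity.score,
  reason: clarity.reason,
});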
