# Evaluating AI Responses
AgentMark provides a system to assess the quality and correctness of AI model responses. This guide explains how to set up and use evaluations.
## Overview
Evaluations in AgentMark enable you to:
- Assess model outputs against expected answers
- Score responses based on custom criteria
- Track performance metrics
## Getting Started with Evaluations

### Prerequisites
Before using evaluations, ensure you have:
- Configured your AgentMark environment (a quick check is sketched after this list)
- Created prompt templates you want to evaluate
- Set up telemetry for tracing
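
If you load configuration from a `.env` file (as the implementation example below does with dotenv), a small sanity check can catch missing values early. This is a minimal sketch; the variable names match the ones used later in this guide.

```typescript
// Minimal sketch: verify the environment variables used later in this guide are set.
const required = ["AGENTMARK_API_KEY", "AGENTMARK_APP_ID", "AGENTMARK_BASE_URL"];
const missing = required.filter((key) => !process.env[key]);

if (missing.length > 0) {
  throw new Error(`Missing AgentMark configuration: ${missing.join(", ")}`);
}
```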
### Basic Evaluation Flow

The evaluation workflow:

1. Run a prompt to get a model response
2. Evaluate the response using an evaluation prompt
3. Score the results
### Required: Trace and Span IDs

While traceId and spanId are optional for general telemetry, they are required for evaluations:

- Generate unique IDs for each evaluation
- Include them in the telemetry metadata when running the prompt
- Use the same traceId or spanId as the resourceId when scoring
## Implementation Example
import "dotenv/config";
import { AgentMarkSDK } from "@agentmark/sdk";
import {
createAgentMarkClient,
VercelAIModelRegistry,
VercelAIToolRegistry
} from "@agentmark/vercel-ai-v4-adapter";
import { openai } from "@ai-sdk/openai";
import { generateObject, generateText } from "ai";
// Initialize AgentMark client
const sdk = new AgentMarkSDK(
{
apiKey: process.env.AGENTMARK_API_KEY!,
appId: process.env.AGENTMARK_APP_ID!,
baseUrl: process.env.AGENTMARK_BASE_URL!,
}
);
// Enable telemetry
const tracer = agentmarkClient.initTracing({
disableBatch: true,
});
// Configure agentmark with vercel ai v4 adapter
const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerModels("gpt-4o-mini", (name: string) => {
return openai(name);
});
const agentmark = createAgentMarkClient({
loader: agentmarkClient.getFileLoader(),
modelRegistry,
});
async function evaluateMathProblem() {
// Generate IDs - REQUIRED for evaluations
const traceId = crypto.randomUUID();
const spanId = crypto.randomUUID();
// 1. Fetch and run the prompt
const mathPrompt = await agentmark.loadTextPrompt("math_problem.prompt.mdx");
const vercelInput = await mathPrompt.format({
props: {
problem: "If x² + 10x + 25 = 0, what is the value of x?"
},
telemetry: {
isEnabled: true,
metadata: {
traceName: "math-eval-test",
traceId, // Required for evaluation
spanId // Alternative for evaluation
}
}
});
const result = await generateText(vercelInput);
// 2. Run the evaluation prompt
const evalPrompt = await agentmark.loadObjectPrompt("evals/correctness.prompt.mdx");
const vercelInputEval = await evalPrompt.format({
props: {
input: "If x² + 10x + 25 = 0, what is the value of x?",
output: result.result,
expected_output: "x = -5",
},
});
const evalResult = await generateObject(vercelInputEval);
// 3. Score the results
await agentmarkClient.score({
resourceId: traceId, // You could use spanId instead
label: evalResult.object.correctness.label,
reason: evalResult.object.correctness.reason,
score: evalResult.object.correctness.score,
name: "correctness",
});
await tracer.shutdown();
return evalResult;
}
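
To run the example end to end, call the function and inspect the structured verdict it returns. This is a minimal sketch assuming the code above lives in a standard Node/TypeScript entry point.

```typescript
// Run the evaluation and log the structured correctness verdict.
evaluateMathProblem()
  .then((evalResult) => {
    console.log("Evaluation:", evalResult.object.correctness);
  })
  .catch((error) => {
    console.error("Evaluation failed:", error);
  });
```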
## Creating Evaluation Templates
Evaluation templates define how responses are assessed:
```mdx
---
name: correctness
object_config:
  model_name: gpt-4o-mini
  max_tokens: 4096
  temperature: 0.7
  schema:
    type: object
    properties:
      correctness:
        type: object
        properties:
          label:
            type: string
            description: label of the answer
          score:
            type: number
            description: score of the answer (0-1)
          reason:
            type: string
            description: reason for the score
        required:
          - label
          - score
          - reason
    required:
      - correctness
---

<System>
You are an evaluation assistant focused on assessing the correctness of responses based on:

1. Accuracy - factually correct information
2. Completeness - all required elements present
3. Relevance - addresses the question

Assign a label ("correct", "partially correct", or "incorrect"), a score (0-1), and your reasoning.
</System>

<User>
Please evaluate the following response:

**Input:**
{props.input}

--------------------------

**Expected Output:**
{props.expected_output}

--------------------------

**Output:**
{props.output}

--------------------------

Analyze based on accuracy, completeness, and relevance.
</User>
```
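
Given this schema, generateObject returns a parsed object shaped like the sketch below. The values are illustrative only; the actual label, score, and reason come from the model's judgment.

```typescript
// Illustrative shape of evalResult.object for the correctness template.
// These values are examples, not fixed outputs.
const exampleEvalOutput = {
  correctness: {
    label: "correct",
    score: 1,
    reason: "The factored form (x + 5)² = 0 gives x = -5, matching the expected output.",
  },
};
```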
## Key Components

### 1. Trace and Span IDs

- Either traceId or spanId can be used as the resourceId when scoring
- The chosen ID must be included in the telemetry metadata

### 2. Evaluation Template

- MDX files with metadata, a system prompt, and a user prompt
- The schema defines the structured output format

### 3. Scoring API Parameters

- resourceId: Must match either the traceId or spanId from the original prompt run
- name: Evaluation metric name
- label: Categorical assessment
- score: Numerical value (0-1)
- reason: Explanation for the score
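
Putting these parameters together, a scoring call against a spanId might look like the following sketch. It reuses the score fields from the implementation example; "clarity" is a hypothetical metric name used for illustration.

```typescript
// Minimal sketch: score a run against its spanId instead of its traceId.
// The spanId must match the one sent in the telemetry metadata.
await sdk.score({
  resourceId: spanId,
  name: "clarity", // hypothetical metric name
  label: "clear",
  score: 0.9,
  reason: "The explanation walks through each algebraic step.",
});
```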
## Viewing Evaluations

Evaluations can only be viewed through the Traces tab:

1. Navigate to the Traces tab in the dashboard
2. Find the trace used for scoring
3. The trace details will show all associated evaluations
## Best Practices
- Use consistent IDs: Either traceId or spanId for all related evaluations
- Generate unique IDs: Use crypto.randomUUID() for each evaluation
- Add descriptive metadata: Makes traces easier to find
- Use consistent metrics: Same evaluation criteria across similar prompts
- Multiple dimensions: Evaluate on different aspects (correctness, clarity, etc.), as shown in the sketch after this list
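
For the last point, you can record several metrics against the same resourceId. This sketch assumes the sdk and traceId from the implementation example are in scope; "clarity" is a hypothetical second dimension.

```typescript
// Minimal sketch: attach multiple evaluation dimensions to one trace.
const dimensions = [
  { name: "correctness", label: "correct", score: 1, reason: "Solves the equation correctly." },
  { name: "clarity", label: "clear", score: 0.9, reason: "Explains each algebraic step." },
];

for (const dimension of dimensions) {
  await sdk.score({
    resourceId: traceId, // same ID for all related evaluations
    ...dimension,
  });
}
```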