Evaluating AI Responses

Puzzlet provides a system to assess the quality and correctness of AI model responses. This guide explains how to set up and use evaluations.

Overview

Evaluations in Puzzlet enable you to:

  • Assess model outputs against expected answers
  • Score responses based on custom criteria
  • Track performance metrics

Getting Started with Evaluations

Prerequisites

Before using evaluations, ensure you have:

  1. Configured your Puzzlet environment (see the sample .env after this list)
  2. Created prompt templates you want to evaluate
  3. Set up telemetry for tracing
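
The implementation example below reads these credentials from environment variables via dotenv. A .env along the following lines is assumed; the values are placeholders, not real credentials or endpoints:

# .env (placeholder values, replace with your own Puzzlet credentials)
PUZZLET_API_KEY=your-api-key
PUZZLET_APP_ID=your-app-id
PUZZLET_BASE_URL=https://your-puzzlet-instance.example.com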

Basic Evaluation Flow

The evaluation workflow:

  1. Run a prompt to get a model response
  2. Evaluate the response using an evaluation prompt
  3. Score the results

Required: Trace and Span IDs

While traceId and spanId are optional for general telemetry, they are required for evaluations:

  1. Generate unique IDs for each evaluation
  2. Include them in telemetry metadata when running the prompt
  3. Use the same traceId OR spanId as the resourceId when scoring
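
In code, that looks like the short sketch below; the full setup appears in the next section:

// Each evaluation gets its own IDs; either one can later serve as the resourceId
const traceId = crypto.randomUUID();
const spanId = crypto.randomUUID();

// Both are passed through the prompt's run options, as in the full example below:
// { telemetry: { isEnabled: true, metadata: { traceName, traceId, spanId } } }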

Implementation Example

import "dotenv/config";
import AllModels from "@puzzlet/all-models";
import { Puzzlet } from "@puzzlet/sdk";
import {
  ModelPluginRegistry,
  createTemplateRunner,
} from "@puzzlet/agentmark";

// Register model plugins
ModelPluginRegistry.registerAll(AllModels);

// Initialize Puzzlet client
const puzzletClient = new Puzzlet(
  {
    apiKey: process.env.PUZZLET_API_KEY!,
    appId: process.env.PUZZLET_APP_ID!,
    baseUrl: process.env.PUZZLET_BASE_URL!,
  },
  createTemplateRunner
);

// Enable telemetry
const tracer = puzzletClient.initTracing({
  disableBatch: true,
});

async function evaluateMathProblem() {
  // Generate IDs - REQUIRED for evaluations
  const traceId = crypto.randomUUID();
  const spanId = crypto.randomUUID();

  // 1. Fetch and run the prompt
  const mathPrompt = await puzzletClient.fetchPrompt("math_problem.prompt.mdx");
  
  const result = await mathPrompt.run(
    {
      problem: "If x² + 10x + 25 = 0, what is the value of x?"
    },
    {
      telemetry: {
        isEnabled: true,
        metadata: { 
          traceName: "math-eval-test", 
          traceId,  // used as the resourceId when scoring below
          spanId    // could be used as the resourceId instead
        },
      },
    }
  );

  // 2. Run the evaluation prompt
  const evalPrompt = await puzzletClient.fetchPrompt("evals/correctness.prompt.mdx");
  
  const evalResult = await evalPrompt.run(
    {
      input: "If x² + 10x + 25 = 0, what is the value of x?",
      output: result.result,
      expected_output: "x = -5",
    },
    {}
  );

  // 3. Score the results
  puzzletClient.score({
    resourceId: traceId, // you could use spanId instead
    label: evalResult.result.correctness.label,
    reason: evalResult.result.correctness.reason,
    score: evalResult.result.correctness.score,
    name: "correctness",
  });
  
  await tracer.shutdown();
  return evalResult;
}
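
To try the example end to end, invoke the function from your entry point, for instance:

evaluateMathProblem()
  .then((evalResult) => console.log(evalResult.result))
  .catch(console.error);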

Creating Evaluation Templates

Evaluation templates define how responses are assessed:

---
name: correctness
metadata:
  model:
    name: gpt-4
    settings:
      max_tokens: 4096
      temperature: 0.7
      schema:
        type: object
        properties:
          correctness:
            type: object
            properties:
              label:
                type: string
                description: label of the answer
              score:
                type: number
                description: score of the answer (0-1)
              reason:
                type: string
                description: reason for the score
            required:
              - label
              - score
              - reason
        required:
          - correctness
---

<System>
You are an evaluation assistant focused on assessing the correctness of responses based on:
1. Accuracy - factually correct information
2. Completeness - all required elements present
3. Relevance - addresses the question

Assign a label ("correct", "partially correct", or "incorrect"), score (0-1), and provide reasoning.
</System>
<User>
Please evaluate the following response:

**Input:**
{props.input}
--------------------------
**Expected Output:**
{props.expected_output}
--------------------------
**Output:**
{props.output}
--------------------------

Analyze based on accuracy, completeness, and relevance.
</User>
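
Because the metadata schema above requests structured output, the evaluation run's result should come back in roughly the following shape. This TypeScript interface is inferred from the schema for illustration; it is not an exported SDK type:

// Inferred from the schema above for illustration; not an exported SDK type
interface CorrectnessEvaluation {
  correctness: {
    label: string;  // "correct", "partially correct", or "incorrect"
    score: number;  // 0-1
    reason: string; // explanation for the score
  };
}

// This is the shape the implementation example reads:
// evalResult.result.correctness.label / .score / .reason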

Key Components

1. Trace and Span IDs

  • Either traceId OR spanId can be used as the resourceId when scoring
  • The chosen ID must be included in the telemetry metadata

2. Evaluation Template

  • MDX files with metadata, system prompt, and user prompt
  • Schema defines the structured output format

3. Scoring API Parameters

  • resourceId: Must match either traceId or spanId from the original prompt run
  • name: Evaluation metric name
  • label: Categorical assessment
  • score: Numerical value (0-1)
  • reason: Explanation for the score
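
For example, if you prefer to key the score off the spanId sent in the telemetry metadata, the call from the implementation example could reference it instead (assuming the same puzzletClient, spanId, and evalResult):

// Same call as in the implementation example, keyed on the spanId instead
puzzletClient.score({
  resourceId: spanId, // matches the spanId sent in the telemetry metadata
  name: "correctness",
  label: evalResult.result.correctness.label,
  score: evalResult.result.correctness.score,
  reason: evalResult.result.correctness.reason,
});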

Viewing Evaluations

Evaluations can only be viewed through the Traces tab:

  • Navigate to the Traces tab in the dashboard
  • Find the trace used for scoring
  • The trace details will show all associated evaluations

Best Practices

  • Use consistent IDs: Pick either traceId or spanId as the resourceId and use it consistently across related evaluations
  • Generate unique IDs: Use crypto.randomUUID() for each evaluation
  • Add descriptive metadata: Makes traces easier to find
  • Use consistent metrics: Same evaluation criteria across similar prompts
  • Multiple dimensions: Evaluate on different aspects (correctness, clarity, etc.)
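
As a sketch of the last point, several dimensions can be scored against the same resourceId by calling score once per metric. The clarity template below is hypothetical, and the snippet assumes the puzzletClient, traceId, and prompt inputs from the implementation example, plus a clarity schema analogous to the correctness one:

// Hypothetical second dimension: a separate clarity evaluation prompt
const clarityPrompt = await puzzletClient.fetchPrompt("evals/clarity.prompt.mdx");
const clarityResult = await clarityPrompt.run(
  { input: "...", output: "...", expected_output: "..." },
  {}
);

puzzletClient.score({
  resourceId: traceId, // same trace as the correctness score
  name: "clarity",
  label: clarityResult.result.clarity.label,
  score: clarityResult.result.clarity.score,
  reason: clarityResult.result.clarity.reason,
});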

Have Questions?

We’re here to help! Choose the best way to reach us: