Evaluations
Guide to evaluating AI responses with Puzzlet
Evaluating AI Responses
Puzzlet provides a system to assess the quality and correctness of AI model responses. This guide explains how to set up and use evaluations.
Overview
Evaluations in Puzzlet enable you to:
- Assess model outputs against expected answers
- Score responses based on custom criteria
- Track performance metrics
Getting Started with Evaluations
Prerequisites
Before using evaluations, ensure you have:
- Configured your Puzzlet environment
- Created prompt templates you want to evaluate
- Set up telemetry for tracing
Basic Evaluation Flow
The evaluation workflow:
1. Run a prompt to get a model response
2. Evaluate the response using an evaluation prompt
3. Score the results
Required: Trace and Span IDs
While traceId and spanId are optional for general telemetry, they are required for evaluations:
- Generate unique IDs for each evaluation
- Include them in telemetry metadata when running the prompt
- Use the same traceId OR spanId as the resourceId when scoring
Implementation Example
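The code below is a minimal sketch rather than the exact Puzzlet SDK surface: the client import, the runPrompt method, the file names, and the telemetry shape are assumptions used to illustrate the ID handling described above.

```typescript
// Hypothetical client setup; replace with your actual Puzzlet SDK import.
import { puzzlet } from "./puzzlet-client";

// 1. Generate unique IDs for this evaluation run.
//    crypto.randomUUID() is globally available in modern Node.js and browsers.
const traceId = crypto.randomUUID();
const spanId = crypto.randomUUID();

// 2. Run the prompt you want to evaluate, including the IDs in telemetry metadata.
const response = await puzzlet.runPrompt("customer-support.prompt.mdx", {
  props: { question: "How do I reset my password?" },
  telemetry: {
    isEnabled: true,
    metadata: { traceId, spanId, promptName: "customer-support" },
  },
});

// 3. Run the evaluation prompt against the model's answer.
const evaluation = await puzzlet.runPrompt("eval-correctness.prompt.mdx", {
  props: {
    question: "How do I reset my password?",
    answer: response.result,
  },
});
```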
Creating Evaluation Templates
Evaluation templates define how responses are assessed:
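A minimal evaluation template might look roughly like the following. The frontmatter keys, schema layout, and component names shown here are illustrative assumptions rather than the exact Puzzlet format; the point is the combination of metadata, a structured output schema, a system prompt, and a user prompt.

```mdx
---
name: eval-correctness
metadata:
  model:
    name: gpt-4o
    settings:
      schema:
        type: object
        properties:
          label: { type: string, description: "correct | partially-correct | incorrect" }
          score: { type: number, description: "Quality score between 0 and 1" }
          reason: { type: string, description: "Explanation for the score" }
        required: [label, score, reason]
---

<System>
  You are an evaluator. Judge whether the answer correctly and clearly
  addresses the question, and respond using the structured output schema.
</System>

<User>
  Question: {props.question}
  Answer: {props.answer}
</User>
```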
Key Components
1. Trace and Span IDs
- Either traceId or spanId can be used as the resourceId when scoring
- The chosen ID must be included in the telemetry metadata
2. Evaluation Template
- MDX files with metadata, system prompt, and user prompt
- Schema defines the structured output format
3. Scoring API Parameters
- resourceId: Must match either traceId or spanId from the original prompt run
- name: Evaluation metric name
- label: Categorical assessment
- score: Numerical value (0-1)
- reason: Explanation for the score
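Continuing the sketch from the implementation example above, a scoring call might look like this; the score method and the response field names are assumptions, while the parameter shape follows the list above.

```typescript
// Hypothetical scoring call; method and field names are illustrative.
await puzzlet.score({
  resourceId: traceId,              // must match the traceId (or spanId) from the prompt run
  name: "correctness",              // evaluation metric name
  label: evaluation.result.label,   // categorical assessment, e.g. "correct"
  score: evaluation.result.score,   // numerical value between 0 and 1
  reason: evaluation.result.reason, // explanation for the score
});
```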
Viewing Evaluations
Evaluations can only be viewed through the Traces tab:
- Navigate to the Traces tab in the dashboard
- Find the trace used for scoring
- The trace details will show all associated evaluations
Best Practices
- Use consistent IDs: Either traceId or spanId for all related evaluations
- Generate unique IDs: Use crypto.randomUUID() for each evaluation
- Add descriptive metadata: Makes traces easier to find
- Use consistent metrics: Same evaluation criteria across similar prompts
- Multiple dimensions: Evaluate on different aspects (correctness, clarity, etc.)
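For instance, to evaluate one response along several dimensions, you could reuse the same resourceId for each metric, again using the hypothetical score call from the sketch above.

```typescript
// Score one response on multiple dimensions, reusing the same resourceId.
const dimensions = [
  { name: "correctness", label: "correct", score: 0.9, reason: "Matches the expected answer." },
  { name: "clarity", label: "clear", score: 0.8, reason: "Concise and well structured." },
];

for (const dimension of dimensions) {
  await puzzlet.score({ resourceId: traceId, ...dimension });
}
```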
Have Questions?
We’re here to help! Choose the best way to reach us:
Join our Discord community for quick answers and discussions
Email us at hello@puzzlet.ai for support
Schedule an Enterprise Demo to learn about our business solutions