When you run a dataset in the AgentMark platform, it sends a dataset-run event to your webhook endpoint. This event contains the prompt AST, dataset path, and experiment metadata.
Most users should use the webhook handler from their adapter’s /runner export, which handles dataset runs automatically via the runExperiment() method. This page documents the manual approach for advanced use cases.
The webhook receives a JSON body with the following shape:

{
  "event": {
    "type": "dataset-run",
    "data": {
      "ast": { ... },
      "experimentId": "string",
      "datasetPath": "string"
    }
  }
}
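If you handle the request manually, it can help to describe this payload with a type. A minimal sketch based on the shape above (field types are assumptions, not an official SDK type):

// Hypothetical type for the incoming webhook body; names mirror the JSON above.
interface DatasetRunEvent {
  event: {
    type: "dataset-run";
    data: {
      ast: Record<string, unknown>; // prompt AST
      experimentId: string;
      datasetPath: string;
    };
  };
}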
Processing Dataset Runs
The example below uses the Vercel AI SDK (v5) to manually process dataset runs. Each dataset item is executed sequentially, with results streamed back to the platform.
import { NextResponse } from "next/server";
import { AgentMarkSDK } from "@agentmark-ai/sdk";
import {
  createAgentMarkClient,
  VercelAIModelRegistry,
  EvalRegistry,
} from "@agentmark-ai/ai-sdk-v5-adapter";
import { getFrontMatter } from "@agentmark-ai/templatedx";
import { generateText, generateObject } from "ai";
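The handler bodies below reference a few values (frontmatter, runId, agentmark, sdk, and evalRegistry) that come from setup code omitted here. A rough, illustrative sketch of that setup; the getFrontMatter call is an assumption about how the front matter is read from the AST, so adapt this to your own wiring:

// Assumed setup, shown for context only.
// `data` is the parsed `event.data` from the webhook request body.
const frontmatter: any = getFrontMatter(data.ast); // text_config / object_config / test_settings
const runId = crypto.randomUUID(); // one id shared by every item in this run
// `agentmark`, `sdk`, and `evalRegistry` are created once when the route module loads,
// using createAgentMarkClient, AgentMarkSDK, and EvalRegistry from the imports above.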
Text Dataset Run
if (frontmatter.text_config) {
  const prompt = await agentmark.loadTextPrompt(data.ast);
  const dataset = await prompt.formatWithDataset({
    datasetPath: frontmatter?.test_settings?.dataset,
    telemetry: { isEnabled: true },
  });

  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();
      let index = 0;

      for await (const item of dataset) {
        if (item.type === "error") {
          controller.enqueue(
            encoder.encode(
              JSON.stringify({ error: item.error, type: "error" }) + "\n"
            )
          );
          controller.close();
          return;
        }

        const traceId = crypto.randomUUID();
        const result = await generateText({
          ...item.formatted,
          experimental_telemetry: {
            ...item.formatted.experimental_telemetry,
            metadata: {
              ...item.formatted.experimental_telemetry?.metadata,
              dataset_run_id: runId,
              dataset_path: frontmatter?.test_settings?.dataset,
              dataset_run_name: data.experimentId,
              dataset_item_name: index,
              traceName: `ds-run-${data.experimentId}-${index}`,
              traceId,
              dataset_expected_output: item.dataset.expected_output,
            },
          },
        });

        // Run evaluations if configured
        let evalResults: any[] = [];
        if (evalRegistry && item.evals) {
          const evaluators = item.evals
            .map((name: string) => {
              const fn = evalRegistry.get(name);
              return fn ? { name, fn } : undefined;
            })
            .filter(Boolean);

          evalResults = await Promise.all(
            evaluators.map(async (evaluator) => {
              const evalResult = await evaluator.fn({
                input: item.formatted.messages,
                output: result.text,
                expectedOutput: item.dataset.expected_output,
              });

              // Post scores to AgentMark
              sdk?.score({
                resourceId: traceId,
                label: evalResult.label,
                reason: evalResult.reason,
                score: evalResult.score,
                name: evaluator.name,
              });

              return { name: evaluator.name, ...evalResult };
            })
          );
        }

        controller.enqueue(
          encoder.encode(
            JSON.stringify({
              type: "dataset",
              result: {
                input: item.dataset.input,
                expectedOutput: item.dataset.expected_output,
                actualOutput: result.text,
                tokens: result.usage?.totalTokens,
                evals: evalResults,
              },
              traceId,
              runId,
              runName: data.experimentId,
            }) + "\n"
          )
        );

        index++;
      }

      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "AgentMark-Streaming": "true" },
  });
}
Object Dataset Run
The pattern is the same as for text runs, but it uses generateObject and reports result.object as the actualOutput:
if (frontmatter.object_config) {
  const prompt = await agentmark.loadObjectPrompt(data.ast);
  const dataset = await prompt.formatWithDataset({
    datasetPath: frontmatter?.test_settings?.dataset,
    telemetry: { isEnabled: true },
  });

  // Same streaming pattern as text, but with:
  const result = await generateObject({ ...item.formatted, /* telemetry */ });
  // Use result.object instead of result.text for actualOutput
}
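Expanded, the per-item body of the object loop looks roughly like this (a sketch reusing the variables and telemetry metadata from the text example above):

// Inside the same ReadableStream start() loop as in the text example:
const traceId = crypto.randomUUID();
const result = await generateObject({
  ...item.formatted,
  experimental_telemetry: {
    ...item.formatted.experimental_telemetry,
    metadata: {
      ...item.formatted.experimental_telemetry?.metadata,
      dataset_run_id: runId,
      dataset_path: frontmatter?.test_settings?.dataset,
      dataset_run_name: data.experimentId,
      dataset_item_name: index,
      traceName: `ds-run-${data.experimentId}-${index}`,
      traceId,
      dataset_expected_output: item.dataset.expected_output,
    },
  },
});

// ...run evaluators exactly as in the text example to produce evalResults...

controller.enqueue(
  encoder.encode(
    JSON.stringify({
      type: "dataset",
      result: {
        input: item.dataset.input,
        expectedOutput: item.dataset.expected_output,
        actualOutput: result.object, // object output instead of result.text
        tokens: result.usage?.totalTokens,
        evals: evalResults,
      },
      traceId,
      runId,
      runName: data.experimentId,
    }) + "\n"
  )
);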
Each item in the dataset stream is a newline-delimited JSON object:
{
  "type": "dataset",
  "result": {
    "input": "What is 2+2?",
    "expectedOutput": "4",
    "actualOutput": "The answer is 4",
    "tokens": 15,
    "evals": [
      {
        "name": "accuracy",
        "score": 1.0,
        "label": "PASS",
        "reason": "Output matches expected result"
      }
    ]
  },
  "traceId": "trace-uuid",
  "runId": "run-uuid",
  "runName": "test-dataset-run"
}
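For reference, each streamed line can be described with a type like the following (a sketch based on the example above, not an official SDK type):

// Hypothetical shape of one newline-delimited line in the response stream.
type DatasetStreamLine =
  | {
      type: "dataset";
      result: {
        input: unknown;
        expectedOutput: unknown;
        actualOutput: unknown; // result.text for text runs, result.object for object runs
        tokens?: number;
        evals: Array<{ name: string; score: number; label?: string; reason?: string }>;
      };
      traceId: string;
      runId: string;
      runName: string;
    }
  | { type: "error"; error: unknown };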
Telemetry
Each dataset item includes telemetry metadata for tracing:
| Field | Description |
|---|---|
| dataset_run_id | Unique identifier for the entire run |
| dataset_path | Path to the dataset file |
| dataset_run_name | Name of the experiment |
| dataset_item_name | Index of the current item |
| traceName | Human-readable trace name |
| traceId | Unique trace identifier |
| dataset_expected_output | Expected output for comparison |
Evaluations
Register evaluators with the EvalRegistry to automatically score outputs during dataset runs:
import { EvalRegistry } from "@agentmark-ai/prompt-core";
const evalRegistry = new EvalRegistry();
evalRegistry.register("accuracy", ({ input, output, expectedOutput }) => {
  const isCorrect = output === expectedOutput;
  return {
    score: isCorrect ? 1.0 : 0.0,
    passed: isCorrect,
    label: isCorrect ? "PASS" : "FAIL",
    reason: isCorrect
      ? "Output matches expected result"
      : "Output does not match expected result",
  };
});
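Because the run loop awaits each evaluator, async evaluators should work the same way. A hypothetical example that scores simple word overlap against the expected output:

// Hypothetical async evaluator: fraction of expected words found in the output.
evalRegistry.register("word-overlap", async ({ output, expectedOutput }) => {
  const expectedWords = String(expectedOutput).toLowerCase().split(/\s+/).filter(Boolean);
  const outputText = String(output).toLowerCase();
  const hits = expectedWords.filter((word) => outputText.includes(word)).length;
  const score = expectedWords.length ? hits / expectedWords.length : 0;
  return {
    score,
    passed: score >= 0.8,
    label: score >= 0.8 ? "PASS" : "FAIL",
    reason: `${hits}/${expectedWords.length} expected words found in the output`,
  };
});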
Eval results are included in each dataset item’s response and can be posted as scores to the AgentMark platform using sdk.score().
Best Practices
- Always stream — dataset runs must return streaming responses with the AgentMark-Streaming: true header
- Include telemetry — use a unique traceId for each item and a unique runId for each run
- Handle item errors — if a single item fails, continue processing the remaining items when possible (see the sketch after this list)
- Process sequentially — execute items one at a time to avoid rate limits
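For instance, one way to keep a run going after a per-item failure is to wrap the generation call in a try/catch inside the loop, emit an error line, and move on (a sketch; the full example above stops the stream on the first error instead):

// Inside the for-await loop of the examples above:
try {
  const result = await generateText({ ...item.formatted /* telemetry as shown earlier */ });
  // ...run evals and enqueue the "dataset" line as in the full example...
} catch (error) {
  // Report the failed item, but keep processing the rest of the dataset
  controller.enqueue(
    encoder.encode(
      JSON.stringify({
        type: "error",
        error: error instanceof Error ? error.message : String(error),
      }) + "\n"
    )
  );
}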