When you run a dataset in the AgentMark platform, it sends a dataset-run event to your webhook endpoint. This event contains the prompt AST, dataset path, and experiment metadata.
Most users should use the webhook handler from their adapter’s /runner export, which handles dataset runs automatically via the runExperiment() method. This page documents the manual approach for advanced use cases.
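For reference, the automatic path looks roughly like this. The export name and options below are hypothetical, so treat this as a sketch of the idea rather than the adapter's documented API:

// Hypothetical sketch: the actual /runner export name and options may differ.
import { createWebhookHandler } from "@agentmark-ai/ai-sdk-v5-adapter/runner";

// The handler calls runExperiment() internally when a dataset-run event arrives.
export const POST = createWebhookHandler({ agentmark, evalRegistry });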
[Screenshot: the Dataset Run button highlighted in the AgentMark platform]

Event Format

{
  "event": {
    "type": "dataset-run",
    "data": {
      "ast": { ... },
      "experimentId": "string",
      "datasetPath": "string"
    }
  }
}

Processing Dataset Runs

The example below uses the Vercel AI SDK (v5) to manually process dataset runs. Each dataset item is executed sequentially, with results streamed back to the platform.
import { NextResponse } from "next/server";
import { AgentMarkSDK } from "@agentmark-ai/sdk";
import {
  createAgentMarkClient,
  VercelAIModelRegistry,
  EvalRegistry,
} from "@agentmark-ai/ai-sdk-v5-adapter";
import { getFrontMatter } from "@agentmark-ai/templatedx";
import { generateText, generateObject } from "ai";
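
The snippets below reference several values (data, frontmatter, runId, agentmark, sdk, evalRegistry) that come from the surrounding route handler. A minimal scaffold is sketched here; the constructor options are assumptions, so check your adapter's documentation for the exact shapes:

// Sketch of the surrounding route handler; option shapes are assumptions.
const sdk = new AgentMarkSDK({ apiKey: process.env.AGENTMARK_API_KEY! });
const modelRegistry = new VercelAIModelRegistry();
const evalRegistry = new EvalRegistry();
const agentmark = createAgentMarkClient({ modelRegistry });

export async function POST(req: Request) {
  const { event } = await req.json();
  if (event.type !== "dataset-run") {
    return NextResponse.json({ error: "Unsupported event type" }, { status: 400 });
  }
  const { data } = event;
  const frontmatter = getFrontMatter(data.ast); // prompt front matter from the AST
  const runId = crypto.randomUUID(); // unique identifier for this dataset run

  // ...text or object handling from the sections below...
}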

Text Dataset Run

if (frontmatter.text_config) {
  const prompt = await agentmark.loadTextPrompt(data.ast);
  const dataset = await prompt.formatWithDataset({
    datasetPath: frontmatter?.test_settings?.dataset,
    telemetry: { isEnabled: true },
  });

  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();
      let index = 0;

      for await (const item of dataset) {
        if (item.type === "error") {
          controller.enqueue(
            encoder.encode(
              JSON.stringify({ error: item.error, type: "error" }) + "\n"
            )
          );
          controller.close();
          return;
        }

        const traceId = crypto.randomUUID();
        const result = await generateText({
          ...item.formatted,
          experimental_telemetry: {
            ...item.formatted.experimental_telemetry,
            metadata: {
              ...item.formatted.experimental_telemetry?.metadata,
              dataset_run_id: runId,
              dataset_path: frontmatter?.test_settings?.dataset,
              dataset_run_name: data.experimentId,
              dataset_item_name: index,
              traceName: `ds-run-${data.experimentId}-${index}`,
              traceId,
              dataset_expected_output: item.dataset.expected_output,
            },
          },
        });

        // Run evaluations if configured
        let evalResults: any[] = [];
        if (evalRegistry && item.evals) {
          const evaluators = item.evals
            .map((name: string) => {
              const fn = evalRegistry.get(name);
              return fn ? { name, fn } : undefined;
            })
            .filter(Boolean);

          evalResults = await Promise.all(
            evaluators.map(async (evaluator) => {
              const evalResult = await evaluator.fn({
                input: item.formatted.messages,
                output: result.text,
                expectedOutput: item.dataset.expected_output,
              });

              // Post scores to AgentMark
              sdk?.score({
                resourceId: traceId,
                label: evalResult.label,
                reason: evalResult.reason,
                score: evalResult.score,
                name: evaluator.name,
              });

              return { name: evaluator.name, ...evalResult };
            })
          );
        }

        controller.enqueue(
          encoder.encode(
            JSON.stringify({
              type: "dataset",
              result: {
                input: item.dataset.input,
                expectedOutput: item.dataset.expected_output,
                actualOutput: result.text,
                tokens: result.usage?.totalTokens,
                evals: evalResults,
              },
              traceId,
              runId,
              runName: data.experimentId,
            }) + "\n"
          )
        );
        index++;
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "AgentMark-Streaming": "true" },
  });
}

Object Dataset Run

The pattern is the same as text, but uses generateObject and returns result.object as the actualOutput:
if (frontmatter.object_config) {
  const prompt = await agentmark.loadObjectPrompt(data.ast);
  const dataset = await prompt.formatWithDataset({
    datasetPath: frontmatter?.test_settings?.dataset,
    telemetry: { isEnabled: true },
  });

  // Same streaming pattern as text, but with:
  const result = await generateObject({ ...item.formatted, /* telemetry */ });

  // Use result.object instead of result.text for actualOutput
}
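
For reference, the enqueued per-item payload is the same as in the text case except for actualOutput; under the same assumptions as above:

// Inside the same streaming loop as the text example:
const result = await generateObject({
  ...item.formatted,
  // ...same experimental_telemetry metadata as the text example...
});

controller.enqueue(
  encoder.encode(
    JSON.stringify({
      type: "dataset",
      result: {
        input: item.dataset.input,
        expectedOutput: item.dataset.expected_output,
        actualOutput: result.object, // structured output instead of result.text
        tokens: result.usage?.totalTokens,
        evals: evalResults,
      },
      traceId,
      runId,
      runName: data.experimentId,
    }) + "\n"
  )
);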

Stream Response Format

Each item in the dataset stream is a newline-delimited JSON object:
{
  "type": "dataset",
  "result": {
    "input": "What is 2+2?",
    "expectedOutput": "4",
    "actualOutput": "The answer is 4",
    "tokens": 15,
    "evals": [
      {
        "name": "accuracy",
        "score": 1.0,
        "label": "PASS",
        "reason": "Output matches expected result"
      }
    ]
  },
  "traceId": "trace-uuid",
  "runId": "run-uuid",
  "runName": "test-dataset-run"
}
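
When testing your endpoint locally, you can parse this stream yourself. A minimal sketch (the route path and request body are placeholders):

const res = await fetch("/api/agentmark-webhook", {
  method: "POST",
  body: JSON.stringify({ event: { type: "dataset-run", data } }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop()!; // keep any partial trailing line in the buffer
  for (const line of lines) {
    if (line.trim()) console.log(JSON.parse(line)); // one result object per line
  }
}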

Telemetry

Each dataset item includes telemetry metadata for tracing:
Field                     Description
dataset_run_id            Unique identifier for the entire run
dataset_path              Path to the dataset file
dataset_run_name          Name of the experiment
dataset_item_name         Index of the current item
traceName                 Human-readable trace name
traceId                   Unique trace identifier
dataset_expected_output   Expected output for comparison

Evaluations

Register evaluators with the EvalRegistry to automatically score outputs during dataset runs:
import { EvalRegistry } from "@agentmark-ai/prompt-core";

const evalRegistry = new EvalRegistry();

evalRegistry.register("accuracy", ({ input, output, expectedOutput }) => {
  const isCorrect = output === expectedOutput;
  return {
    score: isCorrect ? 1.0 : 0.0,
    passed: isCorrect,
    label: isCorrect ? "PASS" : "FAIL",
    reason: isCorrect
      ? "Output matches expected result"
      : "Output does not match expected result",
  };
});
Eval results are included in each dataset item’s response and can be posted as scores to the AgentMark platform using sdk.score(). The name passed to register() must match the eval names attached to the dataset item (the item.evals array in the handler above).

Best Practices

  1. Always stream — dataset runs must return streaming responses with the AgentMark-Streaming: true header
  2. Include telemetry — use unique traceId and runId values for each execution
  3. Handle item errors — if a single item fails, continue processing remaining items when possible (see the sketch after this list)
  4. Process sequentially — execute items one at a time to avoid rate limits
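
A minimal sketch of per-item error handling that reports the failure but keeps iterating; the error payload shape mirrors the type: "error" message used earlier and is an assumption, not a documented contract:

// Wrap each item's generation so one failure doesn't end the whole run.
try {
  const result = await generateText({ ...item.formatted /* ...telemetry... */ });
  // ...enqueue the dataset result as shown above...
} catch (error) {
  controller.enqueue(
    encoder.encode(
      JSON.stringify({ type: "error", error: String(error) }) + "\n"
    )
  );
  // Fall through to the next item instead of closing the stream.
}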
