Running Experiments

Run prompts against datasets with automatic evaluation to validate quality and consistency.

CLI Usage

Quick Start

npm run experiment agentmark/classifier.prompt.mdx
Requirements:
  • Dataset configured in prompt frontmatter
  • Development server running (npm run dev)
  • Optional: Evaluation functions defined

Command Options

Skip evaluations (output-only mode):
npm run experiment agentmark/test.prompt.mdx --skip-eval
Output format:
npm run experiment agentmark/test.prompt.mdx --format table   # Default
npm run experiment agentmark/test.prompt.mdx --format csv     # Spreadsheets
npm run experiment agentmark/test.prompt.mdx --format json    # Structured
npm run experiment agentmark/test.prompt.mdx --format jsonl   # Line-delimited
Pass rate threshold:
npm run experiment agentmark/test.prompt.mdx --threshold-percent 85
Exits with a non-zero code if the pass rate falls below the threshold. Requires evaluations that return a passed field.
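This makes the command easy to use as a quality gate in CI, for example by chaining it before a release step (the deploy script name here is illustrative):
npm run experiment agentmark/classifier.prompt.mdx --threshold-percent 85 && npm run deploy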

Output Example

#   Input                   AI Result   Expected Output   sentiment_check
1   {"text":"I love it"}    positive    positive          PASS (1.00)
2   {"text":"Terrible"}     negative    negative          PASS (1.00)
3   {"text":"It's okay"}    neutral     neutral           PASS (1.00)
Summary:
Pass rate: 100% (3/3 passed)

How It Works

The run-experiment command:
  1. Loads your prompt file and parses the frontmatter
  2. Reads the dataset specified in test_settings.dataset
  3. Sends the prompt and dataset to the dev server (http://localhost:9417)
  4. The server runs the prompt against each dataset row
  5. Evaluates results using the evals specified in test_settings.evals
  6. Streams results back to the CLI as they complete
  7. Displays formatted output (table, CSV, JSON, or JSONL)

Configuration

Link dataset and evals in prompt frontmatter:
---
name: sentiment-classifier
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - sentiment_check
---

<System>Classify the sentiment</System>
<User>{props.text}</User>
Dataset (sentiment.jsonl):
{"input": {"text": "I love this!"}, "expected_output": "positive"}
{"input": {"text": "Terrible product"}, "expected_output": "negative"}
{"input": {"text": "It's okay"}, "expected_output": "neutral"}
Learn more about datasets →
Learn more about evals →

Workflow

1. Develop prompts - Iterate on your prompt design
2. Create datasets - Add test cases covering your scenarios
3. Write evaluations - Define success criteria
4. Run experiments - Test against your dataset
npm run experiment agentmark/my-prompt.prompt.mdx
5. Review results - Identify failures and patterns
6. Iterate - Fix issues, improve prompts, add test cases
7. Deploy with confidence - Pass rate meets your threshold

SDK Usage

Run experiments programmatically using formatWithDataset():
import { client } from './agentmark.client';
import { generateText } from 'ai';  // Or your adapter's generation function

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

// Returns a stream of formatted inputs from the dataset
const datasetStream = await prompt.formatWithDataset();

// Process each test case
for await (const item of datasetStream) {
  const { dataset, formatted, evals } = item;

  // Run the prompt with your AI SDK
  const result = await generateText(formatted);

  // Check results
  const passed = result.text === dataset.expected_output;
  console.log(`Input: ${JSON.stringify(dataset.input)}`);
  console.log(`Expected: ${dataset.expected_output}`);
  console.log(`Got: ${result.text}`);
  console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
}
The stream returns objects with:
  • dataset - The test case (input and expected_output)
  • formatted - The formatted prompt ready for your AI SDK
  • evals - List of evaluation names to run
  • type - Always "dataset"
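Building on the same loop, you can aggregate a pass rate across rows to mirror the CLI's --threshold-percent behavior. A minimal sketch, assuming the same client and prompt path as above (the 85% threshold and exit handling are illustrative, and the simple string comparison stands in for real evals):

import { client } from './agentmark.client';
import { generateText } from 'ai';

const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');
const datasetStream = await prompt.formatWithDataset();

let total = 0;
let passed = 0;

for await (const item of datasetStream) {
  const result = await generateText(item.formatted);
  total += 1;
  if (result.text === item.dataset.expected_output) passed += 1;
}

const passRate = (passed / total) * 100;
console.log(`Pass rate: ${passRate.toFixed(0)}% (${passed}/${total} passed)`);

// Fail the process if the pass rate is below the threshold (mirrors --threshold-percent)
if (passRate < 85) process.exit(1);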
Options (FormatWithDatasetOptions):
  • datasetPath?: string - Override dataset from frontmatter
  • format?: 'ndjson' | 'json' - Buffer all rows ('json') or stream as available ('ndjson', default)
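For example, to point the same prompt at a different dataset and buffer all rows instead of streaming (the dataset path here is illustrative):

const datasetStream = await prompt.formatWithDataset({
  datasetPath: './datasets/sentiment-edge-cases.jsonl', // overrides the dataset from frontmatter
  format: 'json',                                       // buffer all rows rather than stream
});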
When to use:
  • Custom test logic in your test framework
  • Fine-grained control over test execution
  • Integrating with existing test infrastructure
  • Running experiments in application code
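As one sketch of a test-framework integration, the dataset stream can drive an ordinary test case. Vitest is assumed here; any runner with async test support works similarly, and the timeout value is illustrative:

import { describe, it, expect } from 'vitest';
import { generateText } from 'ai';
import { client } from './agentmark.client';

describe('sentiment classifier', () => {
  it('matches the expected output for every dataset row', async () => {
    const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');
    const datasetStream = await prompt.formatWithDataset();

    // Run each dataset row through the model and assert on the expected output
    for await (const item of datasetStream) {
      const result = await generateText(item.formatted);
      expect(result.text).toBe(item.dataset.expected_output);
    }
  }, 120_000); // generous timeout, since each row makes a model call
});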

Troubleshooting

CLI Issues

Dataset not found:
  • Check dataset path in frontmatter
  • Verify file exists and is valid JSONL
Server connection error:
  • Ensure npm run dev is running
  • Check that port 9417 is available for the dev server
Invalid dataset format:
  • Each line must be valid JSON
  • Required: input field
  • Optional: expected_output field
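A minimal valid row needs only the input field, for example:
{"input": {"text": "Arrived two weeks late"}}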
No evaluations ran:
  • Add evals to test_settings in frontmatter
  • Or use --skip-eval flag for output-only mode

Next Steps