AgentMark Experiments

Testing ensures your prompts and agents work reliably across different inputs before reaching production.

Why Test Prompts?

LLM outputs are non-deterministic—the same prompt can produce different results. Without testing, you can't know with confidence whether your prompts work correctly, or whether a change improves or breaks behavior. Testing helps you:
  • Catch regressions - Know immediately when prompt changes break existing functionality
  • Validate quality - Ensure outputs meet your standards across diverse scenarios
  • Measure improvements - Quantify whether prompt iterations actually perform better
  • Build confidence - Deploy changes backed by data, not guesswork

Testing Workflow

Follow this order when setting up tests:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  1. Create  │────▶│  2. Write   │────▶│  3. Connect │────▶│  4. Run     │
│  Datasets   │     │  Evals      │     │  to Prompts │     │  Experiment │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
     │                    │                    │                    │
     ▼                    ▼                    ▼                    ▼
  JSONL files       TypeScript/         test_settings         CLI command
  with test         Python eval         in frontmatter        or platform
  inputs            functions
1. Create a Dataset

Define test inputs in a JSONL file. Each line is one test case.
2. Write Evaluations

Create eval functions that score outputs. Register them in your client.
3. Connect to Prompts

Add test_settings.dataset and test_settings.evals to your prompt frontmatter.
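As a sketch, the frontmatter might look like the following. Only the test_settings keys (dataset, evals) come from this page; the prompt name, file path, and eval name are hypothetical placeholders.

```yaml
---
# Hypothetical example; replace the name, path, and eval
# with your own. Only test_settings.dataset and
# test_settings.evals are the documented keys.
name: sentiment
test_settings:
  dataset: ./sentiment.jsonl
  evals:
    - accuracy
---
```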
4. Run Experiments

Execute agentmark run-experiment to test your prompt against the dataset.
Prerequisites: You must have agentmark dev running in a separate terminal before running experiments. The CLI connects to the webhook server at port 9417.

Core Concepts

AgentMark testing has two components that work together:

Datasets

What they are: Collections of test inputs (and optionally expected outputs) stored as JSONL files.

What they do: Define the scenarios your prompt should handle, including common cases, edge cases, and failure modes.

Example:
{"input": {"text": "Great product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
Learn more about datasets →
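Because JSONL is just one JSON object per line, a dataset like the one above can be parsed with a few lines of plain TypeScript. This is an illustrative sketch, not AgentMark's own loader.

```typescript
// Minimal JSONL parser: one JSON object per non-empty line.
// Illustrative only — AgentMark loads datasets for you.
interface TestCase {
  input: { text: string };
  expected_output: string;
}

function parseJsonl(raw: string): TestCase[] {
  return raw
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as TestCase);
}

const raw = [
  '{"input": {"text": "Great product!"}, "expected_output": "positive"}',
  '{"input": {"text": "Terrible experience"}, "expected_output": "negative"}',
].join("\n");

const cases = parseJsonl(raw);
console.log(cases.length);             // 2
console.log(cases[0].expected_output); // positive
```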

Evaluations

What they are: Functions that score prompt outputs and determine pass/fail status.

What they do: Define your success criteria, i.e. what makes an output correct, high-quality, or acceptable.

Example:
export const accuracy = async ({ output, expectedOutput }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};
Learn more about evaluations →
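The accuracy eval above can be exercised directly to sanity-check its logic. A quick sketch (the `{ passed, score }` return shape follows the example above; the inputs here are made up):

```typescript
// Same eval as above: case- and whitespace-insensitive string match.
const accuracy = async ({ output, expectedOutput }: { output: string; expectedOutput: string }) => {
  const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};

// Leading whitespace and casing differences still pass.
const result = await accuracy({ output: " Positive ", expectedOutput: "positive" });
console.log(result.passed); // true
```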

How It Works

1. Create a dataset - Add test cases covering your use cases
2. Write evaluations - Define what "correct" means for your prompt
3. Connect your prompts - Reference your evals + datasets in prompt frontmatter
4. Run experiments - Test your prompt against the dataset
agentmark run-experiment agentmark/sentiment.prompt.mdx
5. Review results - See which test cases passed and why others failed
#  Input        AI Result  Expected  Eval
1  "Great!"     positive   positive  ✅ PASS
2  "Terrible"   negative   negative  ✅ PASS
3  ""           positive   neutral   ❌ FAIL
6. Iterate - Fix failures, improve prompts, add new test cases

Testing Strategies

Start small (5-10 cases), then grow:
  • Common inputs your prompt will handle
  • Edge cases (empty strings, extreme lengths, ambiguous inputs)
  • Known failure modes
Use real data when possible:
  • Anonymized production data
  • Realistic synthetic examples
  • Avoid overly simple test cases that don’t reflect real usage
Test multiple dimensions:
  • Accuracy (is the output correct?)
  • Completeness (does it include all required information?)
  • Tone (is it professional/friendly/appropriate?)
  • Format (does it follow structural requirements?)
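Each dimension can be its own eval. For example, a format check might verify that the output is exactly one of the allowed labels. A hedged sketch in the same style as the accuracy example earlier (the eval name and allowed label set are assumptions, not part of AgentMark):

```typescript
// Format eval sketch: passes only if the output is one of the
// allowed labels. The label set here is a made-up example.
const ALLOWED_LABELS = ["positive", "negative", "neutral"];

const labelFormat = async ({ output }: { output: string }) => {
  const passed = ALLOWED_LABELS.includes(output.trim().toLowerCase());
  return { passed, score: passed ? 1 : 0 };
};

console.log((await labelFormat({ output: "Positive" })).passed); // true
```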
Version control everything:
  • Datasets live alongside prompts in your repo
  • Track changes to test cases over time
  • Reproduce results from any point in history

Types of Testing

Unit testing - Test individual prompts in isolation
Integration testing - Test prompt chains and multi-step workflows
Regression testing - Maintain a suite that must pass before deploying
Continuous testing - Run tests automatically in CI/CD pipelines

Measuring Success

Pass rate: Percentage of test cases that pass all evaluations
Pass rate: 85% (17/20 passed)
Per-evaluation scores: Identify specific weaknesses
accuracy: ✅ PASS (0.95)
completeness: ❌ FAIL (0.60)
tone: ✅ PASS (0.88)
Trends over time: Track whether changes improve or degrade performance
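The pass rate above is a straightforward aggregation over per-case results. A small sketch, assuming each case yields a `{ passed, score }` result like the eval example earlier:

```typescript
// Compute overall pass rate from per-case eval results.
interface CaseResult {
  passed: boolean;
  score: number;
}

function passRate(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// 17 of 20 cases passing → 85%
const results: CaseResult[] = [
  ...Array(17).fill({ passed: true, score: 1 }),
  ...Array(3).fill({ passed: false, score: 0 }),
];
console.log(`Pass rate: ${(passRate(results) * 100).toFixed(0)}%`); // Pass rate: 85%
```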

Best Practices

  • Focus datasets - Create separate datasets for different scenarios, not one massive file
  • Be specific - Clear expected outputs lead to reliable tests; vague expectations create noise
  • Avoid overfitting - Tests should validate general behavior, not memorize specific outputs
  • Test edge cases - Empty inputs, special characters, extreme lengths, ambiguous cases
  • Use meaningful names - sentiment_accuracy is clearer than eval1
  • Keep tests deterministic - If a test randomly passes/fails, fix the evaluation logic

Next Steps