AgentMark Experiments

Testing ensures your prompts and agents work reliably across different inputs before reaching production.

Why Test Prompts?

LLM outputs are non-deterministic—the same prompt can produce different results. Without testing, you can’t confidently know if your prompts work correctly or if changes improve or break behavior. Testing helps you:
  • Catch regressions - Know immediately when prompt changes break existing functionality
  • Validate quality - Ensure outputs meet your standards across diverse scenarios
  • Measure improvements - Quantify whether prompt iterations actually perform better
  • Build confidence - Deploy changes backed by data, not guesswork

Core Concepts

AgentMark testing has two components that work together:

Datasets

What they are: Collections of test inputs (and optionally expected outputs) stored as JSONL files.
What they do: Define the scenarios your prompt should handle—common cases, edge cases, failure modes.
Example:
{"input": {"text": "Great product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
Learn more about datasets →

Evaluations

What they are: Functions that score prompt outputs and determine pass/fail status.
What they do: Define your success criteria—what makes an output correct, high-quality, or acceptable.
Example:
// Exact-match accuracy: compare the model output to the expected label,
// ignoring case and surrounding whitespace, and return pass/fail with a 0/1 score.
export const accuracy = async ({ output, expected_output }) => {
  const match = output.trim().toLowerCase() === expected_output.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};
Learn more about evaluations →

How It Works

1. Create a dataset - Add test cases covering your use cases
2. Write evaluations - Define what “correct” means for your prompt
3. Connect your prompts - Reference your evals + datasets in prompt frontmatter
4. Run experiments - Test your prompt against the dataset
npm run experiment agentmark/sentiment.prompt.mdx
5. Review results - See which test cases passed and why others failed
#   Input         AI Result   Expected   Eval
1   "Great!"      positive    positive   ✅ PASS
2   "Terrible"    negative    negative   ✅ PASS
3   ""            positive    neutral    ❌ FAIL
6. Iterate - Fix failures, improve prompts, add new test cases

Testing Strategies

Start small (5-10 cases), then grow:
  • Common inputs your prompt will handle
  • Edge cases (empty strings, extreme lengths, ambiguous inputs)
  • Known failure modes
Use real data when possible:
  • Anonymized production data
  • Realistic synthetic examples
  • Avoid overly simple test cases that don’t reflect real usage
Test multiple dimensions (sample format and completeness evals are sketched after this list):
  • Accuracy (is the output correct?)
  • Completeness (does it include all required information?)
  • Tone (is it professional/friendly/appropriate?)
  • Format (does it follow structural requirements?)
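Each dimension can be its own evaluation, reusing the { passed, score } return shape from the accuracy example above. A sketch of a format check and a completeness check; the required JSON fields and the 0.8 threshold are illustrative assumptions, not AgentMark requirements:

// Format: does the output parse as JSON and contain the fields we expect?
export const jsonFormat = async ({ output }) => {
  try {
    const parsed = JSON.parse(output);
    const required = ["sentiment", "confidence"]; // illustrative field names
    const missing = required.filter((field) => !(field in parsed));
    return { passed: missing.length === 0, score: missing.length === 0 ? 1 : 0 };
  } catch {
    return { passed: false, score: 0 }; // output wasn't valid JSON
  }
};

// Completeness: what fraction of the expected key phrases appear in the output?
export const completeness = async ({ output, expected_output }) => {
  const phrases = expected_output.split(",").map((p) => p.trim().toLowerCase());
  const found = phrases.filter((p) => output.toLowerCase().includes(p));
  const score = phrases.length ? found.length / phrases.length : 1;
  return { passed: score >= 0.8, score }; // illustrative pass threshold
};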
Version control everything:
  • Datasets live alongside prompts in your repo
  • Track changes to test cases over time
  • Reproduce results from any point in history

Types of Testing

Unit testing - Test individual prompts in isolation
Integration testing - Test prompt chains and multi-step workflows
Regression testing - Maintain a suite that must pass before deploying
Continuous testing - Run tests automatically in CI/CD pipelines

Measuring Success

Pass rate: Percentage of test cases that pass all evaluations
Pass rate: 85% (17/20 passed)
Per-evaluation scores: Identify specific weaknesses
accuracy: ✅ PASS (0.95)
completeness: ❌ FAIL (0.60)
tone: ✅ PASS (0.88)
Trends over time: Track whether changes improve or degrade performance
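If you export raw results (for example as JSON), both numbers are easy to compute yourself. A hypothetical sketch; the results shape below is illustrative, not AgentMark's actual output format:

type EvalResult = { passed: boolean; score: number };
type CaseResult = { evals: Record<string, EvalResult> };

// Illustrative results: one entry per test case, one score per evaluation.
const results: CaseResult[] = [
  { evals: { accuracy: { passed: true, score: 1 }, tone: { passed: true, score: 0.9 } } },
  { evals: { accuracy: { passed: false, score: 0 }, tone: { passed: true, score: 0.85 } } },
];

// Pass rate: fraction of cases where every evaluation passed.
const passRate =
  results.filter((r) => Object.values(r.evals).every((e) => e.passed)).length / results.length;

// Per-evaluation average score, to spot specific weaknesses.
const evalNames = Object.keys(results[0].evals);
const averages = Object.fromEntries(
  evalNames.map((name) => [name, results.reduce((sum, r) => sum + r.evals[name].score, 0) / results.length])
);

console.log(`Pass rate: ${(passRate * 100).toFixed(0)}%`); // 50%
console.log(averages); // { accuracy: 0.5, tone: 0.875 }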

Best Practices

  • Focus datasets - Create separate datasets for different scenarios, not one massive file
  • Be specific - Clear expected outputs lead to reliable tests; vague expectations create noise
  • Avoid overfitting - Tests should validate general behavior, not memorize specific outputs
  • Test edge cases - Empty inputs, special characters, extreme lengths, ambiguous cases
  • Use meaningful names - sentiment_accuracy is clearer than eval1
  • Keep tests deterministic - If a test randomly passes/fails, fix the evaluation logic (see the normalization sketch below)
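For the last point, most flakiness in string-comparison evals comes from incidental formatting differences (case, whitespace, trailing punctuation) rather than real model behavior. A small sketch of normalizing both sides before comparing, in the same style as the accuracy eval above:

// Strip incidental formatting so the comparison itself is deterministic.
const normalize = (text: string) => text.trim().toLowerCase().replace(/[.!?]+$/, "");

export const exactMatch = async ({ output, expected_output }) => {
  const match = normalize(output) === normalize(expected_output);
  return { passed: match, score: match ? 1 : 0 };
};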

Next Steps