
## Why Test Prompts?
LLM outputs are non-deterministic: the same prompt can produce different results. Without testing, you can't confidently know if your prompts work correctly or if changes improve or break behavior. Testing helps you:

- **Catch regressions** - Know immediately when prompt changes break existing functionality
- **Validate quality** - Ensure outputs meet your standards across diverse scenarios
- **Measure improvements** - Quantify whether prompt iterations actually perform better
- **Build confidence** - Deploy changes backed by data, not guesswork
## Core Concepts
AgentMark testing has two components that work together: datasets and evaluations.

### Datasets
**What they are:** Collections of test inputs (and optionally expected outputs) stored as JSONL files.

**What they do:** Define the scenarios your prompt should handle: common cases, edge cases, failure modes.

**Example:**
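A minimal sketch of what a dataset file can look like for the sentiment-classification example used in the results table below. The field names (`input`, `expected_output`) are illustrative assumptions, not necessarily AgentMark's exact dataset schema:

```jsonl
{"input": {"text": "Great!"}, "expected_output": "positive"}
{"input": {"text": "Terrible"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
```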
### Evaluations

**What they are:** Functions that score prompt outputs and determine pass/fail status.

**What they do:** Define your success criteria: what makes an output correct, high-quality, or acceptable.

**Example:**
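A sketch of an evaluation written as a plain TypeScript scoring function. The result shape and signature here are assumptions for illustration, not AgentMark's exact evaluation API; see the AgentMark reference for the real interface:

```typescript
// Exact-match scoring for a sentiment classifier.
// The result shape below is an illustrative assumption.
type EvalResult = {
  score: number;   // 0..1 quality score
  passed: boolean; // pass/fail decision for the test case
  reason?: string; // optional explanation shown in reports
};

function sentimentAccuracy(output: string, expected: string): EvalResult {
  // Normalize so trivial formatting differences don't cause flaky failures.
  const normalize = (s: string) => s.trim().toLowerCase();
  const passed = normalize(output) === normalize(expected);
  return {
    score: passed ? 1 : 0,
    passed,
    reason: passed ? "exact match" : `expected "${expected}", got "${output}"`,
  };
}
```

For example, `sentimentAccuracy("Positive", "positive")` returns `{ score: 1, passed: true, reason: "exact match" }`.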
## How It Works

1. **Create a dataset** - Add test cases covering your use cases
2. **Write evaluations** - Define what "correct" means for your prompt
3. **Connect your prompts** - Reference your evals and datasets in prompt frontmatter (see the sketch after the results table below)
4. **Run experiments** - Test your prompt against the dataset

| # | Input | AI Result | Expected | Eval |
|---|---|---|---|---|
| 1 | "Great!" | positive | positive | ✅ PASS |
| 2 | "Terrible" | negative | negative | ✅ PASS |
| 3 | "" | positive | neutral | ❌ FAIL |
## Testing Strategies
Start small (5-10 cases), then grow:

- Common inputs your prompt will handle
- Edge cases (empty strings, extreme lengths, ambiguous inputs)
- Known failure modes

Base test cases on realistic data:

- Anonymized production data
- Realistic synthetic examples
- Avoid overly simple test cases that don't reflect real usage

Evaluate the dimensions that matter for your prompt (see the sketch after this list):

- Accuracy (is the output correct?)
- Completeness (does it include all required information?)
- Tone (is it professional/friendly/appropriate?)
- Format (does it follow structural requirements?)
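For instance, a format evaluation can check structure rather than exact wording. This is a generic TypeScript sketch; the required keys and return shape are illustrative assumptions, not an AgentMark built-in:

```typescript
// Format check: does the output parse as JSON and include the required keys?
function jsonFormatEval(
  output: string,
  requiredKeys: string[],
): { score: number; passed: boolean; reason: string } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return { score: 0, passed: false, reason: "output is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { score: 0, passed: false, reason: "output is not a JSON object" };
  }
  const missing = requiredKeys.filter((key) => !(key in parsed));
  return {
    score: missing.length === 0 ? 1 : 0,
    passed: missing.length === 0,
    reason: missing.length ? `missing keys: ${missing.join(", ")}` : "all required keys present",
  };
}
```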

Keep test data in version control:

- Datasets live alongside prompts in your repo
- Track changes to test cases over time
- Reproduce results from any point in history
## Types of Testing
- **Unit testing** - Test individual prompts in isolation
- **Integration testing** - Test prompt chains and multi-step workflows
- **Regression testing** - Maintain a suite that must pass before deploying
- **Continuous testing** - Run tests automatically in CI/CD pipelines

## Measuring Success
**Pass rate**: Percentage of test cases that pass all evaluations. For example, if 8 of 10 cases pass every evaluation, the pass rate is 80%.

## Best Practices
- **Focus datasets** - Create separate datasets for different scenarios, not one massive file
- **Be specific** - Clear expected outputs lead to reliable tests; vague expectations create noise
- **Avoid overfitting** - Tests should validate general behavior, not memorize specific outputs
- **Test edge cases** - Empty inputs, special characters, extreme lengths, ambiguous cases
- **Use meaningful names** - `sentiment_accuracy` is clearer than `eval1`
- **Keep tests deterministic** - If a test randomly passes/fails, fix the evaluation logic