
Experiments in AgentMark

Experiments allow you to systematically test prompts against datasets and compare results across different prompt versions, configurations, or models.
This feature is coming soon. In the meantime, you can run experiments programmatically with the AgentMark SDK. See Running Experiments in the Development documentation.
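
For reference, a programmatic run boils down to iterating a dataset through your prompt and collecting outputs. The sketch below is illustrative only: `runPrompt`, the dataset shape, and the field names are placeholders, not AgentMark SDK APIs; swap in your actual prompt invocation from the SDK.

```typescript
// Minimal sketch: run a prompt over a dataset and collect the results.
type DatasetItem = { input: Record<string, unknown>; expected?: string };
type RunResult = { input: Record<string, unknown>; output: string; expected?: string };

// Placeholder only -- replace with your real prompt call (e.g. via the AgentMark SDK).
async function runPrompt(input: Record<string, unknown>): Promise<string> {
  return `stub output for ${JSON.stringify(input)}`;
}

async function runExperiment(dataset: DatasetItem[]): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const item of dataset) {
    const output = await runPrompt(item.input);
    results.push({ input: item.input, output, expected: item.expected });
  }
  return results;
}

// Example usage with a tiny inline dataset.
const dataset: DatasetItem[] = [
  { input: { question: "What is 2 + 2?" }, expected: "4" },
  { input: { question: "Capital of France?" }, expected: "Paris" },
];

runExperiment(dataset).then((results) => console.table(results));
```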

What You’ll Be Able to Do

  • Compare Prompt Versions - Test multiple versions of a prompt side-by-side to see which performs better
  • A/B Testing - Compare different models, temperature settings, or prompt strategies
  • Track Performance - View aggregated metrics across dataset runs to identify improvements or regressions
  • Historical Analysis - Compare current results against previous experiment runs
  • Visual Dashboards - See experiment results in easy-to-understand charts and tables

Typical Workflow

  1. Create Experiment - Select a prompt and dataset to test
  2. Configure Variants - Set up different versions or configurations to compare
  3. Run Experiment - Execute all variants against the dataset
  4. Analyze Results - Compare metrics, scores, and outputs
  5. Deploy Winner - Merge the best-performing version to production
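
Until the dashboard ships, the same workflow can be approximated in code. The sketch below is a simplified stand-in, not the AgentMark API: variants are plain functions, the evaluation is a basic exact-match check (in practice you would plug in real evaluation functions), and the "winner" is simply the highest-scoring variant.

```typescript
// Hypothetical sketch of the workflow above: configure variants, run each
// against the same dataset, score the outputs, and surface the winner.
type Variant = { name: string; run: (input: string) => Promise<string> };
type Example = { input: string; expected: string };

async function scoreVariant(variant: Variant, dataset: Example[]): Promise<number> {
  let correct = 0;
  for (const example of dataset) {
    const output = await variant.run(example.input);
    if (output.trim() === example.expected) correct++; // exact-match stand-in for a real eval
  }
  return correct / dataset.length; // accuracy in [0, 1]
}

async function compareVariants(variants: Variant[], dataset: Example[]) {
  // Run Experiment: execute every variant against the same dataset.
  const scores = await Promise.all(
    variants.map(async (v) => ({ name: v.name, score: await scoreVariant(v, dataset) })),
  );
  // Analyze Results: rank variants by score.
  scores.sort((a, b) => b.score - a.score);
  console.table(scores);
  // Deploy Winner: the top-ranked variant is the candidate to merge.
  return scores[0];
}

// Example usage with stubbed variants and a tiny dataset.
const examples: Example[] = [{ input: "2 + 2", expected: "4" }];
const candidates: Variant[] = [
  { name: "baseline", run: async () => "4" },
  { name: "candidate", run: async () => "four" },
];

compareVariants(candidates, examples).then((winner) => console.log("winner:", winner.name));
```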

Experiment Types

  • Version Comparison - Test different versions of the same prompt (e.g., comparing a branch to main)
  • Model Comparison - Compare performance across different LLM models (e.g., GPT-4 vs Claude)
  • Configuration Testing - Test different parameter settings (temperature, max_tokens, etc.)
  • Evaluation Testing - Run multiple evaluation functions to assess different quality dimensions
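
One illustrative way to think about these types is as fields on a variant configuration: each experiment fixes some fields and varies others. The interface below is hypothetical (these are not AgentMark schema fields); it only shows how version, model, and parameter comparisons can be expressed side by side.

```typescript
// Illustrative variant configurations -- field names are assumptions, not AgentMark schema.
interface VariantConfig {
  name: string;
  promptVersion?: string; // Version Comparison: e.g. a branch vs. main
  model?: string;         // Model Comparison: e.g. GPT-4 vs. Claude
  temperature?: number;   // Configuration Testing
  maxTokens?: number;     // Configuration Testing
}

const variants: VariantConfig[] = [
  { name: "baseline",  promptVersion: "main",                 model: "gpt-4",             temperature: 0.2 },
  { name: "candidate", promptVersion: "feature/tone-rewrite", model: "gpt-4",             temperature: 0.2 },
  { name: "claude",    promptVersion: "main",                 model: "claude-3-5-sonnet", temperature: 0.2 },
  { name: "high-temp", promptVersion: "main",                 model: "gpt-4",             temperature: 0.9 },
];
```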

Integration with Other Features

Experiments work seamlessly with other AgentMark platform features:
  • Datasets - Use your existing test datasets
  • Evaluations - Apply evaluation functions to measure quality
  • Annotations - Manually review experiment outputs
  • Webhooks - Trigger custom workflows when experiments complete

Have Questions?

We’re here to help! Choose the best way to reach us: