When to use this versus --threshold
AgentMark has two complementary CI gates. They answer different questions, and you can run both at once.
- Absolute pass-rate gate (--threshold <percent>): fails when the share of passing rows in this run falls below a fixed floor. It needs no baseline and answers “is this run good enough on its own?”. See Running experiments for the --threshold flag and JUnit output.
- Regression gate (this page): fails when a case scored worse than its own baseline, or when a scorer’s mean across the run drops below a configured floor. It needs a baseline run and answers “did this change make anything worse than before?”.
How it works
A regression gate compares this run’s per-(row × scorer) scores against a baseline run and applies two independent checks. Either one failing fails the build.
The two checks
Per-case regression (test_settings.regression_tolerance): a single row × scorer pair fails when its score drops more than regression_tolerance below that same case’s baseline score. The tolerance is a fraction, so 0.05 means “fail if the score fell more than 5% below baseline.” This is relative and per-case: a score of 0.80 against a baseline of 0.90 is an 11% drop and fails at 0.05; the same 0.80 with no baseline does not fire this check.
Run-level threshold (test_settings.score_thresholds): a scorer fails when its mean score across the whole run falls below a configured floor. You write it as a { scorerName: minMeanScore } map, for example { groundedness: 0.9 }. This is absolute and run-level: it does not need a baseline, so it stays in force even on the first run.
The per-case regression check only fires when a baseline score exists for that row × scorer pair and the baseline score is greater than zero. It never fires on a missing baseline, a non-numeric score, or a zero baseline, so it cannot fail a build spuriously.
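The two checks can be sketched in TypeScript as follows. This illustrates the rules described above and is not AgentMark's implementation: caseRegressed, scorersBelowFloor, and the Score type are hypothetical names, and the run-level mean is assumed to be taken over numeric scores only.

```typescript
// Hypothetical names throughout; a sketch of the documented rules, not AgentMark's code.
type Score = number | null | undefined;

/** Per-case check: true when `score` fell more than `tolerance` (a fraction) below `baseline`. */
function caseRegressed(score: Score, baseline: Score, tolerance: number): boolean {
  // Inert on a non-numeric score, a missing baseline, or a non-positive baseline.
  if (typeof score !== "number" || !Number.isFinite(score)) return false;
  if (typeof baseline !== "number" || !Number.isFinite(baseline) || baseline <= 0) return false;
  const relativeDrop = (baseline - score) / baseline;
  return relativeDrop > tolerance;
}

/** Run-level check: names of scorers whose mean score is below the configured floor. */
function scorersBelowFloor(
  scoresByScorer: Record<string, number[]>,
  floors: Record<string, number>,
): string[] {
  const failed: string[] = [];
  for (const [scorer, floor] of Object.entries(floors)) {
    const scores = scoresByScorer[scorer] ?? [];
    if (scores.length === 0) continue; // assumption: a scorer with no scores is skipped
    const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
    if (mean < floor) failed.push(scorer);
  }
  return failed;
}

// The example from the text: 0.80 against a baseline of 0.90 is roughly an 11% drop.
console.log(caseRegressed(0.8, 0.9, 0.05)); // true
console.log(caseRegressed(0.8, undefined, 0.05)); // false: no baseline, so the check is inert
// groundedness has a mean of about 0.875, which is below its 0.9 floor.
console.log(scorersBelowFloor({ groundedness: [0.95, 0.8] }, { groundedness: 0.9 }));
```

Keeping the two checks as separate functions mirrors the text: the first needs a baseline per case, while the second only needs this run's scores.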
How the baseline is resolved
Each run is identified by a stable experiment_key. It defaults to the prompt’s repo-relative entrypoint path (for example ./prompts/qa.prompt.mdx), so two distinct evaluations never collide even when they share a dataset. Set it explicitly when your subject has no single entrypoint file (a code-assembled agent or workflow), or to keep the identity stable across file renames.
AgentMark resolves the baseline by experiment_key, environment, and the git tree hash of the code at the base commit. It prefers the run recorded at that exact tree hash. If none exists, it falls back to the most recent prior run of the same experiment_key, and reports which one it used, so the comparison is never silent.
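The lookup order can be pictured with a small TypeScript sketch. The RecordedRun shape and resolveBaseline function are hypothetical, and the sketch assumes the recency fallback stays within the same environment; only the matchedExactCommit flag name comes from this page.

```typescript
// Hypothetical shapes; a sketch of the documented lookup order, not AgentMark's code.
interface RecordedRun {
  experimentKey: string;
  environment: string;
  treeHash: string;
  recordedAt: number; // epoch milliseconds
}

interface ResolvedBaseline {
  run: RecordedRun;
  matchedExactCommit: boolean; // false when the recency fallback was used
}

function resolveBaseline(
  runs: RecordedRun[],
  experimentKey: string,
  environment: string,
  baseTreeHash: string,
): ResolvedBaseline | undefined {
  // Assumption: the fallback is scoped to the same environment as well as the same key.
  const candidates = runs
    .filter((r) => r.experimentKey === experimentKey && r.environment === environment)
    .sort((a, b) => b.recordedAt - a.recordedAt); // newest first

  // Prefer the run recorded at the exact tree hash of the base commit.
  const exact = candidates.find((r) => r.treeHash === baseTreeHash);
  if (exact) return { run: exact, matchedExactCommit: true };

  // Otherwise fall back to the most recent prior run, and say so.
  const latest = candidates[0];
  return latest ? { run: latest, matchedExactCommit: false } : undefined;
}
```

Returning the flag alongside the run is what lets a caller tell an exact match from a fallback instead of guessing.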
Rows are matched between runs by a content hash of the dataset input, not by position or ID. Reordering your dataset or regenerating row IDs does not break the comparison.
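To illustrate why reordering or re-identifying rows does not matter, here is a minimal sketch of joining rows by a hash of their input. The hashing scheme (SHA-256 over key-sorted JSON) and the Row, inputHash, and matchRows names are assumptions; this page does not say which hash or canonical form AgentMark uses.

```typescript
import { createHash } from "node:crypto";

// Assumption: key-sorted JSON hashed with SHA-256. The real scheme is not documented here.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) sorted[key] = canonicalize(source[key]);
    return sorted;
  }
  return value;
}

function inputHash(input: unknown): string {
  const json = JSON.stringify(canonicalize(input)) ?? "null";
  return createHash("sha256").update(json).digest("hex");
}

interface Row {
  id: string; // ignored when matching
  input: unknown;
  score: number;
}

type RowPair = { current: Row; baseline: Row };

/** Join current rows to baseline rows by input hash, ignoring position and ID. */
function matchRows(current: Row[], baseline: Row[]): RowPair[] {
  const byHash = new Map<string, Row>();
  for (const row of baseline) byHash.set(inputHash(row.input), row);

  const pairs: RowPair[] = [];
  for (const row of current) {
    const match = byHash.get(inputHash(row.input));
    if (match) pairs.push({ current: row, baseline: match }); // unmatched rows are left out
  }
  return pairs;
}
```

The same property cuts the other way: any change to the input itself produces a different hash, so that row finds no baseline partner.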
Prerequisites
A regression gate compares against a baseline run, so a baseline has to exist first.
- Baselines are stored in AgentMark Cloud. The local dev server’s run storage is ephemeral, so it cannot serve as a durable baseline across CI runs. Both setup paths below require an AGENTMARK_API_KEY.
- Bootstrap by recording a baseline on your default branch. Run the experiment once on main (through the same CLI command or SDK call you use in PRs). From then on, each PR gates against the run recorded on its base commit.
- No prior run means the gate is inert, not failing. If AgentMark finds no baseline for the experiment_key, the per-case regression check is skipped, since there’s nothing to compare against yet. The run-level score_thresholds gate still applies.
Set it up for prompts (CLI)
For prompt-based evals, run the AgentMark CLI in your CI pipeline. Running run-experiment with --baseline-commit compares each case to the baseline run and exits non-zero when a gate fires, and --format junit writes per-case results on stdout for your CI reporter. The gate thresholds live in the prompt’s frontmatter, so the CI job only supplies the baseline ref.
Add the gate config to the prompt frontmatter
Set regression_tolerance and score_thresholds in the prompt’s test_settings.
Run the CLI against the PR base
Check out with full history so the base ref resolves to a tree hash, then run run-experiment per prompt with --baseline-commit. The CLI resolves the ref to a tree hash itself, fails the job when a gate fires, and keeps stdout as clean JUnit XML so you can redirect it to a file. On GitHub Actions, this goes in .github/workflows/evals.yml.
On GitLab, pass --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" with GIT_DEPTH: "0" and register the redirected XML as a junit artifact; the full YAML is in the raw-CLI setup.
Set it up for agents and workflows (SDK)
When the thing under test is an agent or a multi-step workflow rather than a single prompt, gate it from inside your existing test suite with the TypeScript SDK. There are no separate eval files and no CLI: your task function is the execution, so it works with any framework and needs no adapter.
The trade-off: the CLI derives the two git tree hashes automatically, but the SDK does not. You pass them yourself from your CI environment.
apiKey and appId. initTracing() registers the run with AgentMark Cloud so a later PR can use it as a baseline. Without it, the run executes and gates, but it won’t be stored as a baseline for next time.
Setting junitPath writes the run as JUnit XML, the same shape the CLI’s --format junit produces for prompts, so a code experiment surfaces in the PR check exactly like a prompt one. See Surface both in one check.
To compute the tree hashes in CI:
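One way to do that from a Node-based CI step is sketched below. It shells out to git rev-parse <ref>^{tree}, which requires the base commit to be present in the checkout. The treeHashOf helper, the GitRunner type, and the BASE_SHA variable name are illustrative and not part of the SDK.

```typescript
import { execFileSync } from "node:child_process";

// A runner takes git arguments and returns stdout; injectable so the helper can be tested.
type GitRunner = (args: string[]) => string;

// Default runner: invoke the git binary found on PATH in the current working directory.
const runGit: GitRunner = (args) => execFileSync("git", args, { encoding: "utf8" });

/** Convert any commit ref to its tree hash via `git rev-parse <ref>^{tree}`. */
function treeHashOf(ref: string, git: GitRunner = runGit): string {
  return git(["rev-parse", `${ref}^{tree}`]).trim();
}

// Usage in CI (BASE_SHA is a placeholder for whatever your CI exposes as the PR base):
// const sourceTreeHash = treeHashOf("HEAD");
// const baselineTreeHash = treeHashOf(process.env.BASE_SHA ?? "origin/main");
```

Using execFileSync with an argument array avoids passing the ref through a shell, so a ref containing unusual characters is not interpreted as shell syntax.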
Pass a git tree hash, not a commit SHA, for both sourceTreeHash and baselineTreeHash. Tree hashes are content-addressed, so two commits with identical file contents resolve to the same baseline. git rev-parse <ref>^{tree} converts any commit ref to its tree hash.
Packaged CI integrations
The GitHub and GitLab integrations both wrap the CLI command above, adding changed-prompt detection (running only the .prompt.mdx files the PR/MR touches) and automatic baseline-ref resolution (baseline-ref defaults to the PR base SHA on GitHub and $CI_MERGE_REQUEST_DIFF_BASE_SHA on GitLab). GitLab CI/CD documents the component’s inputs and how it will work once published.
Surface prompt and code experiments in one check
JUnit is the shared contract. The CLI emits it for prompt experiments (--format junit), and runExperiment({ junitPath }) emits the identical shape for code experiments: same per-(row × scorer) testcases, same regression <failure>s, same run-level threshold cases. Write both to the same glob and point one reporter at it, and a single PR check covers everything, regardless of origin.
Read the results
Both setup paths surface the same gate outcome: overall pass/fail plus the exact cases that regressed. In CI, every row × scorer pair is a JUnit <testcase>. A regressed case emits a <failure>, so the PR check panel and the Checks tab point at the specific inputs that got worse. The run-level score_thresholds failures appear as their own testcases.
In the SDK, the return value pinpoints each regression. result.passed is the gate verdict: false if any case regressed or a score_thresholds floor was breached (it does not consider each row’s absolute pass/fail; assert on row.evals[].passed yourself if you want that too). result.regressionFailures counts the regressed pairs, and each row carries per-eval detail so you can list exactly what dropped:
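As a rough sketch of that listing: the types below are declared locally and only mirror field names mentioned on this page (passed, regressionFailures, evals, regressed, baselineScore). The rows, name, and score properties are assumptions made for illustration, so check the SDK's exported types for the real shape.

```typescript
// Local, hypothetical types; they are not imported from the SDK.
interface EvalDetail {
  name: string; // assumed field: the scorer's name
  score: number; // assumed field: this run's score
  passed: boolean;
  regressed: boolean;
  baselineScore?: number;
}

interface GateResult {
  passed: boolean;
  regressionFailures: number;
  rows: Array<{ evals: EvalDetail[] }>; // assumed property name
}

/** Build one human-readable line per regressed row × scorer pair. */
function describeRegressions(result: GateResult): string[] {
  const lines: string[] = [];
  result.rows.forEach((row, index) => {
    for (const ev of row.evals) {
      if (!ev.regressed) continue;
      lines.push(`row ${index} ${ev.name}: ${ev.baselineScore ?? "n/a"} -> ${ev.score}`);
    }
  });
  return lines;
}

/** Absolute per-row failures are not part of result.passed, so collect them separately. */
function absoluteFailures(result: GateResult): number {
  return result.rows.reduce(
    (count, row) => count + row.evals.filter((ev) => !ev.passed).length,
    0,
  );
}
```

In a test suite you might assert that result.passed is true and print describeRegressions(result) in the failure message, and add a separate assertion on absoluteFailures(result) if absolute pass/fail should also gate the build.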
Each eval entry carries regressed (whether this specific score fell beyond tolerance) and baselineScore (what the matched baseline scored), alongside the run’s failedScoreThresholds and the resolved baseline descriptor.
See --baseline-commit in the CLI reference for the full flag semantics.
Caveats
- No baseline disables only the regression check. When no prior run exists for the experiment_key, the per-case regression check is skipped; score_thresholds still runs. Absolute per-row pass/fail is gated only through the JUnit reporter: a passed: false scorer becomes a JUnit <failure> that the reporter fails on. The CLI’s own exit code does not gate it (use --threshold for that), and the SDK’s result.passed covers only the regression and score_thresholds gates. The CLI prints ⚠️ No baseline run found for "<experiment_key>" — regression gate inactive. to stderr; stdout stays clean for redirecting to a results file.
- Exact-match versus recency fallback is reported, never silent. If there’s no run at the base commit’s exact tree hash, AgentMark compares against the most recent prior run of the same key instead. The CLI prints ⚠️ No run at <tree-hash> for "<experiment_key>"; comparing against the most recent prior run instead. to stderr, and the SDK returns resolved.matchedExactCommit: false. A recency fallback can compare against a different code state than the PR base, so treat its results as advisory.
- Row matching is by input hash, so masking or input drift can leave it matching nothing. Rows are joined to the baseline by a content hash of the dataset input. If you redact inputs (the SDK tracing hideInputs option, or a mask function that rewrites the stored agentmark.dataset_input the gate hashes), or the dataset input otherwise differs from the baseline run, the live rows won’t match and the per-case check compares nothing. A baseline that matched nothing is treated as inert (like no baseline), not a failure, but it is reported, never silent: the CLI prints ⚠️ Baseline resolved but 0/<N> rows matched it by input hash — regression gate compared nothing. to stderr, and the SDK returns baselineRowsMatched: 0 (with a console.warn). Assert on result.baselineRowsMatched > 0 in CI if a silently inert gate would be worse than a hard failure.
- experiment_key must be stable across runs to match. The CLI defaults experiment_key to the repo-relative entrypoint path, derived from the git top level. A run recorded where git is unavailable falls back to the prompt name or file basename, which won’t match a git-derived key, so a baseline and a candidate computed in different environments can silently fail to resolve. Set test_settings.experiment_key explicitly to pin the identity when your runs span environments.
- A non-positive baseline score is skipped. The per-case check needs a baseline score greater than zero to compute a fractional drop, so a baseline of 0 never fires a regression for that pair.
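If an inert or fallback comparison should fail your build rather than pass quietly, a guard along these lines can sit next to your assertions. It is a sketch under assumptions: the GateSummary type is declared locally, the placement of resolved on the result object is a guess, and gateProblems is a hypothetical helper. Only the baselineRowsMatched and matchedExactCommit names come from the caveats above.

```typescript
// Local, hypothetical type; only the field names are taken from this page.
interface GateSummary {
  passed: boolean;
  baselineRowsMatched: number;
  resolved?: { matchedExactCommit: boolean }; // assumed to live on the result object
}

/** Collect reasons to fail the build, including the "reported but inert" cases. */
function gateProblems(result: GateSummary, requireExactBaseline = false): string[] {
  const problems: string[] = [];
  if (!result.passed) {
    problems.push("regression or score_thresholds gate failed");
  }
  if (result.baselineRowsMatched === 0) {
    problems.push("no rows matched the baseline by input hash, so nothing was compared");
  }
  if (requireExactBaseline && result.resolved?.matchedExactCommit === false) {
    problems.push("baseline came from the recency fallback, not the base commit");
  }
  return problems;
}
```

Treating the recency fallback as a failure is opt-in here because the text describes fallback results as advisory rather than wrong.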
Have questions?
Reach out any time:
- Email us at hello@agentmark.co for support
- Schedule an Enterprise Demo to learn about our business solutions