A regression gate fails a CI build when an experiment’s scores drop against a baseline run. Unlike the absoluteDocumentation Index
Fetch the complete documentation index at: https://docs.agentmark.co/llms.txt
Use this file to discover all available pages before exploring further.
--threshold pass-rate gate — which fails when too few rows pass in this run alone — a regression gate compares each case to how that same case scored before, so it catches silent quality drops even when a scorer still passes in absolute terms.
Use it to keep a prompt or agent from getting quietly worse as you iterate: you record a baseline once on your default branch, and every PR after that is gated against it.
When to use this versus --threshold
AgentMark has two complementary CI gates. They answer different questions, and you can run both at once.
- Absolute pass-rate gate (
--threshold <percent>): fails when the share of passing rows in this run falls below a fixed floor. It needs no baseline and answers “is this run good enough on its own?”. See Running experiments for the--thresholdflag and JUnit output. - Regression gate (this page): fails when a case scored worse than its own baseline, or when a scorer’s mean across the run drops below a configured floor. It needs a baseline run and answers “did this change make anything worse than before?”.
How it works
A regression gate compares this run’s per-(row × scorer) scores against a baseline run and applies two independent checks. Either one failing fails the build.The two checks
Per-case regression (test_settings.regression_tolerance): a single row × scorer pair fails when its score drops more than regression_tolerance below that same case’s baseline score. The tolerance is a fraction — 0.05 means “fail if the score fell more than 5% below baseline.” This is relative and per-case: a score of 0.80 against a baseline of 0.90 is an 11% drop and fails at 0.05; the same 0.80 with no baseline does not fire this check.
Run-level threshold (test_settings.score_thresholds): a scorer fails when its mean score across the whole run falls below a configured floor. You write it as a { scorerName: minMeanScore } map, for example { groundedness: 0.9 }. This is absolute and run-level — it does not need a baseline, so it stays in force even on the first run.
The per-case regression check only fires when a baseline score exists for that row × scorer pair and the baseline score is greater than zero. It never fires on a missing baseline, a non-numeric score, or a zero baseline — so it cannot fail a build spuriously.
How the baseline is resolved
Each run is identified by a stableexperiment_key. It defaults to the prompt’s repo-relative entrypoint path (for example ./prompts/qa.prompt.mdx), so two distinct evaluations never collide even when they share a dataset. Set it explicitly when your subject has no single entrypoint file (a code-assembled agent or workflow), or to keep the identity stable across file renames.
AgentMark resolves the baseline by experiment_key, environment, and the git tree hash of the code at the base commit. It prefers the run recorded at that exact tree hash. If none exists, it falls back to the most recent prior run of the same experiment_key — and reports which one it used, so the comparison is never silent.
Rows are matched between runs by a content hash of the dataset input, not by position or ID. Reordering your dataset or regenerating row IDs does not break the comparison.
Prerequisites
A regression gate compares against a baseline run, so a baseline has to exist first.- Baselines are stored in AgentMark Cloud. The local dev server’s run storage is ephemeral, so it cannot serve as a durable baseline across CI runs. Both setup paths below require an
AGENTMARK_API_KEY. (The eval-action’sapi-keyinput is optional for plain evals, but the regression gate needs Cloud-stored baselines, so a key is required here.) - Bootstrap by recording a baseline on your default branch. Run the experiment once on
main(through the same eval-action or SDK call you use in PRs). From then on, each PR gates against the run recorded on its base commit. - No prior run means the gate is inert, not failing. If AgentMark finds no baseline for the
experiment_key, the per-case regression check is skipped — there’s nothing to compare against yet. The run-levelscore_thresholdsgate still applies.
Set it up for prompts (eval-action)
For prompt-based evals, theagentmark-ai/eval-action GitHub Action runs the changed .prompt.mdx files on each PR, compares each case to the base commit’s run, and fails the check with per-case annotations. The gate thresholds live in the prompt’s frontmatter, so the workflow only needs to point the action at a baseline ref.
The
agentmark-ai/eval-action repo publishes alongside the first regression-gate release. If agentmark-ai/eval-action@v1 resolves to a 404 for you, the action hasn’t been published yet — use the SDK setup below in the meantime (it runs the same gate from your existing test suite).Add the gate config to the prompt frontmatter
Set
regression_tolerance and score_thresholds in the prompt’s test_settings.agentmark-ai/eval-action@v1 — a standalone action, so the uses: path is just the org and repo. It writes JUnit XML, so failures appear in the same PR check panel as your existing pytest, jest, or vitest runs.
Set it up for agents and workflows (SDK)
When the thing under test is an agent or a multi-step workflow rather than a single prompt, gate it from inside your existing test suite with the TypeScript SDK. There are no separate eval files and no CLI — yourtask function is the execution, so it works with any framework and needs no adapter.
The trade-off: the CLI derives the two git tree hashes automatically, but the SDK does not. You pass them yourself from your CI environment.
apiKey and appId. initTracing() registers the run with AgentMark Cloud so a later PR can use it as a baseline — without it, the run executes and gates, but it won’t be stored as a baseline for next time.
Setting junitPath writes the run as JUnit XML — the same shape the eval-action produces for prompts — so a code experiment surfaces in the PR check exactly like a prompt one. See Surface both in one check.
To compute the tree hashes in CI:
Pass a git tree hash, not a commit SHA, for both
sourceTreeHash and baselineTreeHash. Tree hashes are content-addressed, so two commits with identical file contents resolve to the same baseline. git rev-parse <ref>^{tree} converts any commit ref to its tree hash.Surface prompt and code experiments in one check
JUnit is the shared contract. The eval-action emits it for prompt experiments, andrunExperiment({ junitPath }) emits the identical shape for code experiments — same per-(row × scorer) testcases, same regression <failure>s, same run-level threshold cases. Point one reporter at both and a single PR check covers everything, regardless of origin.
The action’s report: false makes it produce the XML without opening its own check, so a downstream reporter can combine both producers:
Read the results
Both setup paths surface the same gate outcome — overall pass/fail plus the exact cases that regressed. In CI with eval-action, every row × scorer pair is a JUnit<testcase>. A regressed case emits a <failure>, so the PR check panel and the Checks tab point at the specific inputs that got worse. The run-level score_thresholds failures appear as their own testcases.
In the SDK, the return value pinpoints each regression. result.passed is the gate verdict — false if any case regressed or a score_thresholds floor was breached (it does not consider each row’s absolute pass/fail; assert on row.evals[].passed yourself if you want that too). result.regressionFailures counts the regressed pairs; and each row carries per-eval detail so you can list exactly what dropped:
regressed (whether this specific score fell beyond tolerance) and baselineScore (what the matched baseline scored), alongside the run’s failedScoreThresholds and the resolved baseline descriptor.
For the underlying flag that drives the CLI path, see --baseline-commit in the CLI reference. The eval-action resolves baseline-ref to a tree hash and passes it as --baseline-commit for you.
Caveats
- No baseline disables only the regression check. When no prior run exists for the
experiment_key, the per-case regression check is skipped;score_thresholdsstill runs. Absolute per-row pass/fail is gated only in eval-action — apassed: falsescorer becomes a JUnit<failure>the reporter fails on; the CLI’s own exit code does not gate it, and the SDK’sresult.passedcovers only the regression andscore_thresholdsgates. The CLI prints⚠️ No baseline run found for "<experiment_key>" — regression gate inactive.to stderr; stdout stays clean for redirecting to a results file. - Exact-match versus recency fallback is reported, never silent. If there’s no run at the base commit’s exact tree hash, AgentMark compares against the most recent prior run of the same key instead. The CLI prints
⚠️ No run at <tree-hash> for "<experiment_key>"; comparing against the most recent prior run instead.to stderr, and the SDK returnsresolved.matchedExactCommit: false. A recency fallback can compare against a different code state than the PR base, so treat its results as advisory. - Row matching is by input hash, so masking or input drift can leave it matching nothing. Rows are joined to the baseline by a content hash of the dataset input. If you redact inputs — the SDK tracing
hideInputsoption, or amaskfunction that rewrites the storedagentmark.dataset_inputthe gate hashes — or the dataset input otherwise differs from the baseline run, the live rows won’t match and the per-case check compares nothing. A baseline that matched nothing is treated as inert (like no baseline), not a failure, but it is reported, never silent: the CLI prints⚠️ Baseline resolved but 0/<N> rows matched it by input hash — regression gate compared nothing.to stderr, and the SDK returnsbaselineRowsMatched: 0(with aconsole.warn). Assert onresult.baselineRowsMatched > 0in CI if a silently inert gate would be worse than a hard failure. experiment_keymust be stable across runs to match. The CLI defaultsexperiment_keyto the repo-relative entrypoint path, derived from the git top level. A run recorded where git is unavailable falls back to the prompt name or file basename, which won’t match a git-derived key — so a baseline and a candidate computed in different environments can silently fail to resolve. Settest_settings.experiment_keyexplicitly to pin the identity when your runs span environments.- A non-positive baseline score is skipped. The per-case check needs a baseline score greater than zero to compute a fractional drop, so a baseline of
0never fires a regression for that pair.
Have Questions?
We’re here to help! Choose the best way to reach us:
- Email us at hello@agentmark.co for support
- Schedule an Enterprise Demo to learn about our business solutions