A regression gate fails a CI build when a change scores worse than it did before. It compares each test case in this run against the score that same case got on an earlier saved run (the baseline), so it catches quality drops even when the overall pass rate still looks fine. Use it to keep a prompt or agent from getting quietly worse as you iterate: you record a baseline once on your default branch, and every PR after that is gated against it.

When to use this versus --threshold

AgentMark has two complementary CI gates. They answer different questions, and you can run both at once.
  • Absolute pass-rate gate (--threshold <percent>): fails when the share of passing rows in this run falls below a fixed floor. It needs no baseline and answers “is this run good enough on its own?”. See Running experiments for the --threshold flag and JUnit output.
  • Regression gate (this page): fails when a case scored worse than its own baseline, or when a scorer’s mean across the run drops below a configured floor. It needs a baseline run and answers “did this change make anything worse than before?”.
A prompt change can keep the overall pass rate at 90% while quietly degrading ten specific cases that used to score higher. The absolute gate misses that; the regression gate catches it.
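To make that scenario concrete, here is a minimal sketch of two runs with the same 90% pass rate where nine cases nonetheless score lower than before. The `CaseResult` shape, the `key` field (standing in for whatever identifies the same case across runs), and both helper functions are hypothetical illustrations, not AgentMark types or APIs.

```typescript
// Hypothetical row shape for illustration; not AgentMark's result type.
interface CaseResult {
  key: string; // stands in for the identity of a case across runs
  passed: boolean;
  score: number;
}

/** Share of passing rows as a percentage: what an absolute gate looks at. */
function passRate(rows: CaseResult[]): number {
  if (rows.length === 0) return 0;
  return (rows.filter((r) => r.passed).length * 100) / rows.length;
}

/** Keys of cases whose score is lower than the same case's baseline score. */
function droppedCases(baseline: CaseResult[], current: CaseResult[]): string[] {
  const baselineByKey = new Map<string, number>();
  for (const r of baseline) baselineByKey.set(r.key, r.score);
  return current
    .filter((r) => {
      const before = baselineByKey.get(r.key);
      return before !== undefined && r.score < before;
    })
    .map((r) => r.key);
}

// Ten cases: the last one fails in both runs, the other nine pass in both.
const keys = Array.from({ length: 10 }, (_, i) => `case-${i}`);
const baselineRun: CaseResult[] = keys.map((key, i) => ({
  key,
  passed: i < 9,
  score: i < 9 ? 0.95 : 0.4,
}));
const currentRun: CaseResult[] = keys.map((key, i) => ({
  key,
  passed: i < 9,
  score: i < 9 ? 0.75 : 0.4, // still passing, but scoring lower than before
}));

console.log(passRate(baselineRun), passRate(currentRun)); // 90 90
console.log(droppedCases(baselineRun, currentRun).length); // 9
```

A fixed floor of, say, 85% would pass both runs here, while a per-case comparison surfaces all nine drops.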

How it works

A regression gate compares this run’s per-(row × scorer) scores against a baseline run and applies two independent checks. Either one failing fails the build.

The two checks

  • Per-case regression (test_settings.regression_tolerance): a single row × scorer pair fails when its score drops more than regression_tolerance below that same case’s baseline score. The tolerance is a fraction, so 0.05 means “fail if the score fell more than 5% below baseline.” This is relative and per-case: a score of 0.80 against a baseline of 0.90 is an 11% drop and fails at 0.05; the same 0.80 with no baseline does not fire this check.
  • Run-level threshold (test_settings.score_thresholds): a scorer fails when its mean score across the whole run falls below a configured floor. You write it as a { scorerName: minMeanScore } map, for example { groundedness: 0.9 }. This is absolute and run-level: it does not need a baseline, so it stays in force even on the first run.
The per-case regression check only fires when a baseline score exists for that row × scorer pair and the baseline score is greater than zero. It never fires on a missing baseline, a non-numeric score, or a zero baseline, so it cannot fail a build spuriously.
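The two checks can be sketched as a pair of small functions. This is an illustration of the rules described above, not AgentMark's implementation: the function names, parameter shapes, and the handling of a scorer with no scores are assumptions.

```typescript
// Illustrative sketch only: names and shapes here are hypothetical.

/** True when `score` fell more than `tolerance` (a fraction) below `baseline`. */
function hasRegressed(
  score: number | undefined,
  baseline: number | undefined,
  tolerance: number,
): boolean {
  // Skip a missing or non-numeric score or baseline, and any baseline
  // that is not greater than zero.
  if (typeof score !== "number" || Number.isNaN(score)) return false;
  if (typeof baseline !== "number" || Number.isNaN(baseline) || baseline <= 0) return false;

  const relativeDrop = (baseline - score) / baseline;
  return relativeDrop > tolerance;
}

/** Names of scorers whose mean across the run is below the configured floor. */
function failedThresholds(
  scoresByScorer: Record<string, number[]>,
  thresholds: Record<string, number>,
): string[] {
  return Object.entries(thresholds)
    .filter(([scorer, floor]) => {
      const scores = scoresByScorer[scorer] ?? [];
      if (scores.length === 0) return false; // assumption: nothing to gate
      const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
      return mean < floor;
    })
    .map(([scorer]) => scorer);
}

// 0.80 against a baseline of 0.90 is roughly an 11% drop, beyond 0.05.
console.log(hasRegressed(0.8, 0.9, 0.05)); // true
console.log(hasRegressed(0.8, undefined, 0.05)); // false: no baseline
console.log(hasRegressed(0.5, 0, 0.05)); // false: zero baseline
// Mean of 0.95 and 0.8 is 0.875, below the 0.9 floor.
console.log(failedThresholds({ groundedness: [0.95, 0.8] }, { groundedness: 0.9 }));
```

Dividing by the baseline is what makes the check relative, and it is also why a zero baseline has to be skipped rather than compared.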

How the baseline is resolved

Each run is identified by a stable experiment_key. It defaults to the prompt’s repo-relative entrypoint path (for example ./prompts/qa.prompt.mdx), so two distinct evaluations never collide even when they share a dataset. Set it explicitly when your subject has no single entrypoint file (a code-assembled agent or workflow), or to keep the identity stable across file renames.
AgentMark resolves the baseline by experiment_key, environment, and the git tree hash of the code at the base commit. It prefers the run recorded at that exact tree hash. If none exists, it falls back to the most recent prior run of the same experiment_key, and reports which one it used, so the comparison is never silent.
Rows are matched between runs by a content hash of the dataset input, not by position or ID. Reordering your dataset or regenerating row IDs does not break the comparison.
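The resolution order and the row matching can be sketched as follows. Everything here is a hypothetical illustration: the `StoredRun` shape, the choice to keep the recency fallback within the same environment, and the hashing scheme (SHA-256 over JSON with sorted keys) are assumptions, since this page does not specify the stored format or the hash AgentMark uses.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration; not AgentMark's stored run format.
interface StoredRun {
  experimentKey: string;
  environment: string;
  treeHash: string;
  recordedAt: number; // epoch milliseconds
}

interface ResolvedBaseline {
  run: StoredRun;
  matchedExactCommit: boolean;
}

/** Prefer the run at the exact tree hash, else the most recent prior run. */
function resolveBaseline(
  runs: StoredRun[],
  experimentKey: string,
  environment: string,
  baseTreeHash: string,
): ResolvedBaseline | undefined {
  // Assumption: the fallback stays within the same environment.
  const candidates = runs
    .filter((r) => r.experimentKey === experimentKey && r.environment === environment)
    .sort((a, b) => b.recordedAt - a.recordedAt); // newest first

  const exact = candidates.find((r) => r.treeHash === baseTreeHash);
  if (exact) return { run: exact, matchedExactCommit: true };

  const latest = candidates[0];
  return latest ? { run: latest, matchedExactCommit: false } : undefined;
}

/** Serialize with sorted object keys so key order does not change the hash. */
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

/** Content hash of a dataset input (assumed scheme, see lead-in). */
function hashInput(input: unknown): string {
  return createHash("sha256").update(stableStringify(input)).digest("hex");
}

const runs: StoredRun[] = [
  { experimentKey: "qa", environment: "ci", treeHash: "aaa", recordedAt: 1 },
  { experimentKey: "qa", environment: "ci", treeHash: "bbb", recordedAt: 2 },
];
console.log(resolveBaseline(runs, "qa", "ci", "aaa")?.matchedExactCommit); // true
console.log(resolveBaseline(runs, "qa", "ci", "zzz")?.run.treeHash); // "bbb"
```

Because rows are joined on the hash of their input rather than on position, two datasets with the same inputs in a different order produce the same set of keys.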

Prerequisites

A regression gate compares against a baseline run, so a baseline has to exist first.
  • Baselines are stored in AgentMark Cloud. The local dev server’s run storage is ephemeral, so it cannot serve as a durable baseline across CI runs. Both setup paths below require an AGENTMARK_API_KEY.
  • Bootstrap by recording a baseline on your default branch. Run the experiment once on main (through the same CLI command or SDK call you use in PRs). From then on, each PR gates against the run recorded on its base commit.
  • No prior run means the gate is inert, not failing. If AgentMark finds no baseline for the experiment_key, the per-case regression check is skipped, since there’s nothing to compare against yet. The run-level score_thresholds gate still applies.

Set it up for prompts (CLI)

For prompt-based evals, run the AgentMark CLI in your CI pipeline. run-experiment with --baseline-commit compares each case to the baseline run and exits non-zero when a gate fires, and --format junit writes per-case results to stdout for your CI reporter. The gate thresholds live in the prompt’s frontmatter, so the CI job only supplies the baseline ref.
1. Add the gate config to the prompt frontmatter

Set regression_tolerance and score_thresholds in the prompt’s test_settings.
test_settings:
  dataset: ./data/qa.jsonl
  regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
  score_thresholds:
    groundedness: 0.9                   # fail the run if mean groundedness < 0.9
2. Run the CLI against the PR base

Check out with full history so the base ref resolves to a tree hash, then run run-experiment once per prompt with --baseline-commit. The CLI resolves the ref to a tree hash itself, fails the job when a gate fires, and writes only JUnit XML to stdout, so the output can be redirected to a file.
.github/workflows/evals.yml
name: Evals
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                # required: the gate resolves the base ref to a tree hash
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: >
          npx --yes @agentmark-ai/cli run-experiment agentmark/qa.prompt.mdx
          --baseline-commit "${{ github.event.pull_request.base.sha }}"
          --format junit > agentmark-results-qa.xml
        env:
          AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
          AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}
      - uses: mikepenz/action-junit-report@v5
        if: always()
        with:
          report_paths: 'agentmark-results-*.xml'
          check_name: 'AgentMark Evals'
          fail_on_failure: true
The same command works on any CI platform. On GitLab, pass --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" with GIT_DEPTH: "0" and register the redirected XML as a junit artifact; the full YAML is in the raw-CLI setup.
A full-history checkout is required (fetch-depth: 0 on GitHub, GIT_DEPTH: "0" on GitLab). With a shallow checkout the base ref can’t resolve to a tree hash, so the comparison degrades to the most recent prior run (reported on stderr) or the gate goes inert when no prior run matches.

Set it up for agents and workflows (SDK)

When the thing under test is an agent or a multi-step workflow rather than a single prompt, gate it from inside your existing test suite with the TypeScript SDK. There are no separate eval files and no CLI: your task function is the execution, so it works with any framework and needs no adapter. The trade-off: the CLI derives the two git tree hashes automatically, but the SDK does not. You pass them yourself from your CI environment.
import { AgentMarkSDK } from "@agentmark-ai/sdk";

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
});

// Record this run as a trace so it can serve as a future baseline.
sdk.initTracing();

const result = await sdk.runExperiment({
  experimentKey: "support-agent",
  dataset,                                    // [{ input, expectedOutput? }, ...]
  task: (input) => supportAgent.run(input),   // any callable — your agent or workflow
  evaluators: [groundedness],
  sourceTreeHash: process.env.TREE_SHA,       // `git rev-parse HEAD^{tree}`
  baselineTreeHash: process.env.BASE_SHA,     // base commit's tree hash; omit to skip the gate
  regressionTolerance: 0.05,
  scoreThresholds: { groundedness: 0.9 },
  junitPath: "agentmark-results-support-agent.xml", // emit the same JUnit the CLI does
});

// Fail the test when either gate fired (the JUnit file also lands for CI to report).
expect(result.passed).toBe(true);
The SDK constructor needs both apiKey and appId. initTracing() registers the run with AgentMark Cloud so a later PR can use it as a baseline. Without it, the run executes and gates, but it won’t be stored as a baseline for next time.
Setting junitPath writes the run as JUnit XML, the same shape the CLI’s --format junit produces for prompts, so a code experiment surfaces in the PR check exactly like a prompt one. See Surface both in one check.
To compute the tree hashes in CI:
TREE_SHA=$(git rev-parse "HEAD^{tree}")
# After actions/checkout@v4, the base branch only exists as a remote-tracking
# ref (origin/<branch>), so prefix with `origin/`. GITHUB_BASE_REF is set only
# on pull request events, where ${{ github.event.pull_request.base.sha }} also
# works; for other events, substitute the appropriate base SHA.
BASE_SHA=$(git rev-parse "origin/$GITHUB_BASE_REF^{tree}")
Pass a git tree hash, not a commit SHA, for both sourceTreeHash and baselineTreeHash. Tree hashes are content-addressed, so two commits with identical file contents resolve to the same baseline. git rev-parse <ref>^{tree} converts any commit ref to its tree hash.

Packaged CI integrations

The agentmark-ai/eval-action GitHub Action and the agentmark-ai/eval-component GitLab Catalog component are not yet published: uses: agentmark-ai/eval-action@v1 and the include: component: reference do not resolve yet. Use the CLI or SDK setup on this page instead; both run the same gate.
Both integrations wrap the CLI command above, adding changed-prompt detection (running only the .prompt.mdx files the PR/MR touches) and automatic baseline-ref resolution (baseline-ref defaults to the PR base SHA on GitHub and $CI_MERGE_REQUEST_DIFF_BASE_SHA on GitLab). GitLab CI/CD documents the component’s inputs and how it will work once published.

Surface prompt and code experiments in one check

JUnit is the shared contract. The CLI emits it for prompt experiments (--format junit), and runExperiment({ junitPath }) emits the identical shape for code experiments: same per-(row × scorer) testcases, same regression <failure>s, same run-level threshold cases. Write both to the same glob and point one reporter at it, and a single PR check covers everything, regardless of origin:
- uses: actions/checkout@v4
  with:
    fetch-depth: 0

# Prompt experiments → agentmark-results-*.xml (no check yet)
- run: >
    npx --yes @agentmark-ai/cli run-experiment agentmark/qa.prompt.mdx
    --baseline-commit "${{ github.event.pull_request.base.sha }}"
    --format junit > agentmark-results-qa.xml
  env:
    AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
    AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}

# Code experiments → the same glob, from your test suite
- run: npm test            # runExperiment({ junitPath: 'agentmark-results-<key>.xml' })
  if: always()
  env:
    AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
    AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}

# One reporter for both → a single "AgentMark Evals" check
- uses: mikepenz/action-junit-report@v5
  if: always()
  with:
    report_paths: 'agentmark-results-*.xml'
    check_name: 'AgentMark Evals'
    fail_on_failure: true
Because the format is shared, this also works outside GitHub. Any JUnit consumer (GitLab, Jenkins, CircleCI) renders both the same way.

Read the results

Both setup paths surface the same gate outcome: overall pass/fail plus the exact cases that regressed.
In CI, every row × scorer pair is a JUnit <testcase>. A regressed case emits a <failure>, so the PR check panel and the Checks tab point at the specific inputs that got worse. The run-level score_thresholds failures appear as their own testcases.
In the SDK, the return value pinpoints each regression. result.passed is the gate verdict: false if any case regressed or a score_thresholds floor was breached (it does not consider each row’s absolute pass/fail; assert on row.evals[].passed yourself if you want that too). result.regressionFailures counts the regressed pairs, and each row carries per-eval detail so you can list exactly what dropped:
const regressed = result.rows.flatMap((row) =>
  row.evals
    .filter((e) => e.regressed)
    .map((e) => ({ input: row.input, scorer: e.name, score: e.score, baseline: e.baselineScore })),
);

for (const r of regressed) {
  console.log(`${r.scorer}: ${r.score} (baseline ${r.baseline}) — ${JSON.stringify(r.input)}`);
}
Each eval result carries regressed (whether this specific score fell beyond tolerance) and baselineScore (what the matched baseline scored), alongside the run’s failedScoreThresholds and the resolved baseline descriptor. See --baseline-commit in the CLI reference for the full flag semantics.

Caveats

  • No baseline disables only the regression check. When no prior run exists for the experiment_key, the per-case regression check is skipped; score_thresholds still runs. The CLI prints ⚠️ No baseline run found for "<experiment_key>" — regression gate inactive. to stderr; stdout stays clean for redirecting to a results file. Separately, absolute per-row pass/fail is gated only through the JUnit reporter: a scorer result with passed: false becomes a JUnit <failure> that the reporter fails on. The CLI’s own exit code does not gate it (use --threshold for that), and the SDK’s result.passed covers only the regression and score_thresholds gates.
  • Exact-match versus recency fallback is reported, never silent. If there’s no run at the base commit’s exact tree hash, AgentMark compares against the most recent prior run of the same key instead. The CLI prints ⚠️ No run at <tree-hash> for "<experiment_key>"; comparing against the most recent prior run instead. to stderr, and the SDK returns resolved.matchedExactCommit: false. A recency fallback can compare against a different code state than the PR base, so treat its results as advisory.
  • Row matching is by input hash, so masking or input drift can leave it matching nothing. Rows are joined to the baseline by a content hash of the dataset input. If you redact inputs (the SDK tracing hideInputs option, or a mask function that rewrites the stored agentmark.dataset_input the gate hashes), or the dataset input otherwise differs from the baseline run, the live rows won’t match and the per-case check compares nothing. A baseline that matched nothing is treated as inert (like no baseline), not a failure, but it is reported, never silent: the CLI prints ⚠️ Baseline resolved but 0/<N> rows matched it by input hash — regression gate compared nothing. to stderr, and the SDK returns baselineRowsMatched: 0 (with a console.warn). Assert on result.baselineRowsMatched > 0 in CI if a silently inert gate would be worse than a hard failure.
  • experiment_key must be stable across runs to match. The CLI defaults experiment_key to the repo-relative entrypoint path, derived from the git top level. A run recorded where git is unavailable falls back to the prompt name or file basename, which won’t match a git-derived key, so a baseline and a candidate computed in different environments can silently fail to resolve. Set test_settings.experiment_key explicitly to pin the identity when your runs span environments.
  • A non-positive baseline score is skipped. The per-case check needs a baseline score greater than zero to compute a fractional drop, so a baseline of 0 never fires a regression for that pair.
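Several of these caveats describe a gate that goes quiet instead of failing. If you would rather fail the build in those cases, one option is a small helper in your test suite like the sketch below. The `GateResult` interface is a minimal local type covering only the fields this page names; the real SDK result type may differ, and placing `resolved` directly on the result object and treating it as absent when no baseline was found are assumptions. `gateProblems` is a hypothetical helper, not part of the SDK.

```typescript
// Minimal local type mirroring fields named on this page (assumed layout).
interface GateResult {
  passed: boolean;
  baselineRowsMatched?: number;
  resolved?: { matchedExactCommit: boolean };
}

/** Collect reasons to fail the build, including a silently inert gate. */
function gateProblems(result: GateResult, requireExactBaseline = false): string[] {
  const problems: string[] = [];

  if (!result.passed) {
    problems.push("a regression or score threshold gate fired");
  }
  if (!result.resolved) {
    problems.push("no baseline resolved, so the regression gate was inert");
  } else if ((result.baselineRowsMatched ?? 0) === 0) {
    problems.push("baseline matched 0 rows by input hash, so nothing was compared");
  } else if (requireExactBaseline && !result.resolved.matchedExactCommit) {
    problems.push("baseline came from the recency fallback, not the base commit");
  }
  return problems;
}

// Example: the gate "passed", but only because no row matched the baseline.
const inert: GateResult = {
  passed: true,
  baselineRowsMatched: 0,
  resolved: { matchedExactCommit: true },
};
console.log(gateProblems(inert));
```

In a test you could then assert that the returned list is empty, for example `expect(gateProblems(result)).toEqual([])`. Whether to pass `requireExactBaseline` depends on whether you treat recency-fallback comparisons as advisory or as blocking.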
