Regression gates - AgentMark Docs

A regression gate fails a CI build when a change scores worse than it did before. It compares each test case in this run against the score that same case got on an earlier saved run (the baseline), so it catches quality drops even when the overall pass rate still looks fine. Use it to keep a prompt or agent from getting quietly worse as you iterate: you record a baseline once on your default branch, and every PR after that’s gated against it.

When to use this versus `--threshold`

AgentMark has two complementary CI gates. They answer different questions, and you can run both at once.

Absolute pass-rate gate (--threshold <percent>): fails when the share of passing rows in this run falls below a fixed floor. It needs no baseline and answers “is this run good enough on its own?”. See Running experiments for the --threshold flag and JUnit output.
Regression gate (this page): fails when a case scored worse than its own baseline, or when a scorer’s mean across the run drops below a configured floor. It needs a baseline run and answers “did this change make anything worse than before?”.

A prompt change can keep the overall pass rate at 90% while quietly degrading ten specific cases that used to score higher. The absolute gate misses that; the regression gate catches it.

How it works

A regression gate compares this run’s per-(row × scorer) scores against a baseline run and applies two independent checks. Either one failing fails the build.

The two checks

Per-case regression (test_settings.regression_tolerance): a single row × scorer pair fails when its score drops more than regression_tolerance below that same case’s baseline score. The tolerance is a fraction, so 0.05 means “fail if the score fell more than 5% below baseline.” This is relative and per-case: a score of 0.80 against a baseline of 0.90 is an 11% drop and fails at 0.05; the same 0.80 with no baseline doesn’t fire this check. Run-level threshold (test_settings.score_thresholds): a scorer fails when its mean score across the whole run falls below a configured floor. You write it as a { scorerName: minMeanScore } map, for example { groundedness: 0.9 }. This is absolute and run-level: it doesn’t need a baseline, so it stays in force even on the first run.

The per-case regression check only fires when a baseline score exists for that row × scorer pair and the baseline score is greater than zero. It never fires on a missing baseline, a non-numeric score, or a zero baseline, so it can’t fail a build spuriously.

How AgentMark resolves the baseline

AgentMark resolves the baseline by experiment_key, environment, and the git tree hash of the code at the base commit, preferring the run at that exact tree hash and falling back (reported, never silent) to the most recent prior run of the same key. AgentMark matches rows between runs by a content hash of the dataset input, so reordering your dataset or regenerating row IDs doesn’t break the comparison. The experiment_key defaults to the prompt’s repo-relative entrypoint path (for example ./prompts/qa.prompt.mdx); set it explicitly when your subject has no single entrypoint file (a code-assembled agent or workflow) or to keep the identity stable across file renames. For the full model behind these three axes (stable key, content-addressed tree hash, and the separate deployed-version commit), see Experiment identity and baselines.

Prerequisites

A regression gate compares against a baseline run, so a baseline has to exist first.

AgentMark Cloud stores baselines. The local dev server’s run storage is ephemeral, so it can’t serve as a durable baseline across CI runs. Both setup paths below require an AGENTMARK_API_KEY.
Bootstrap by recording a baseline on your default branch. Run the experiment once on main (through the same CLI command or SDK call you use in PRs). From then on, each PR gates against the run recorded on its base commit.
No prior run means the gate is inert, not failing. If AgentMark finds no baseline for the experiment_key, it skips the per-case regression check, since there’s nothing to compare against yet. The run-level score_thresholds gate still applies.

Set it up for prompts (CLI)

For prompt-based evals, run the AgentMark CLI in your CI pipeline: run-experiment with --baseline-commit compares each case to the baseline run and exits non-zero when a gate fires, and --format junit writes per-case results on stdout for your CI reporter. The gate thresholds live in the prompt’s frontmatter, so the CI job only supplies the baseline ref.

Add the gate config to the prompt frontmatter

Set regression_tolerance and score_thresholds in the prompt’s test_settings.

test_settings:
  dataset: ./data/qa.jsonl
  regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
  score_thresholds:
    groundedness: 0.9                   # fail the run if mean groundedness < 0.9

Run the CLI against the PR base

Check out with full history so the base ref resolves to a tree hash. run-experiment executes each prompt through a webhook server, so the job boots agentmark dev headless and waits for it, then runs the experiment per prompt with --baseline-commit. The CLI resolves the ref to a tree hash itself, fails the job when a gate fires, and keeps stdout clean JUnit for redirecting to a file. The API key here points the baseline lookup at AgentMark Cloud (the durable store CI gates against); the provider key is what the dev server uses to actually run the prompts.

.github/workflows/evals.yml

name: Evals
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                # required: the gate resolves the base ref to a tree hash
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci                     # the dev server runs your project's client code
      - name: Run evals
        run: |
          # run-experiment executes prompts through a webhook server, so boot
          # one headless (webhook 9417, API 9418) and wait before running.
          npx --yes @agentmark-ai/cli dev --no-ui --no-forward &
          timeout 60 bash -c 'until (echo > /dev/tcp/127.0.0.1/9417) 2>/dev/null; do sleep 1; done'
          npx --yes @agentmark-ai/cli run-experiment agentmark/qa.prompt.mdx \
            --baseline-commit "${{ github.event.pull_request.base.sha }}" \
            --format junit > agentmark-results-qa.xml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}        # the dev server runs your prompts
          AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
          AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}
      - uses: mikepenz/action-junit-report@v5
        if: always()
        with:
          report_paths: 'agentmark-results-*.xml'
          check_name: 'AgentMark Evals'
          fail_on_failure: true

The same command works on any CI platform. On GitLab, pass --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" with GIT_DEPTH: "0" and register the redirected XML as a junit artifact; the full YAML is in the CI/CD raw-CLI setup.

The gate requires a full-history checkout (fetch-depth: 0 on GitHub, GIT_DEPTH: "0" on GitLab). With a shallow checkout the base ref can’t resolve to a tree hash, so the comparison degrades to the most recent prior run (reported on stderr) or the gate goes inert when no prior run matches.

Set it up for agents and workflows (SDK)

When the thing under test is an agent or a multi-step workflow rather than a single prompt, gate it from inside your existing test suite with the TypeScript SDK. There are no separate eval files and no CLI: your task function is the execution, so it works with any framework and needs no adapter. The trade-off: the CLI derives the two git tree hashes automatically, but the SDK doesn’t. You pass them yourself from your CI environment.

import { AgentMarkSDK } from "@agentmark-ai/sdk";

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
});

// Record this run as a trace so it can serve as a future baseline.
sdk.initTracing();

const result = await sdk.runExperiment({
  experimentKey: "support-agent",
  dataset,                                    // [{ input, expectedOutput? }, ...]
  task: (input) => supportAgent.run(input),   // any callable — your agent or workflow
  evaluators: [groundedness],
  sourceTreeHash: process.env.TREE_SHA,       // `git rev-parse HEAD^{tree}`
  baselineTreeHash: process.env.BASE_SHA,     // base commit's tree hash; omit to skip the gate
  regressionTolerance: 0.05,
  scoreThresholds: { groundedness: 0.9 },
  junitPath: "agentmark-results-support-agent.xml", // emit the same JUnit the CLI does
});

// Fail the test when either gate fired (the JUnit file also lands for CI to report).
expect(result.passed).toBe(true);

The SDK constructor needs both apiKey and appId. initTracing() registers the run with AgentMark Cloud so a later PR can use it as a baseline. Without it, the run executes and gates, but AgentMark won’t store it as a baseline for next time. Setting junitPath writes the run as JUnit XML, the same shape the CLI’s --format junit produces for prompts, so a code experiment surfaces in the PR check exactly like a prompt one. See Surface both in one check. To compute the tree hashes in CI:

TREE_SHA=$(git rev-parse "HEAD^{tree}")
# After actions/checkout@v4, the base branch only exists as a remote-tracking
# ref (origin/<branch>), so prefix with `origin/`. For non-PR events, swap in
# ${{ github.event.pull_request.base.sha }} or the appropriate base SHA.
BASE_SHA=$(git rev-parse "origin/$GITHUB_BASE_REF^{tree}")

Pass a git tree hash, not a commit SHA, for both sourceTreeHash and baselineTreeHash. Tree hashes are content-addressed, so two commits with identical file contents resolve to the same baseline. git rev-parse <ref>^{tree} converts any commit ref to its tree hash.

Packaged CI integrations

The agentmark-ai/eval-action GitHub Action and the agentmark-ai/eval-component GitLab Catalog component are not yet published: uses: agentmark-ai/eval-action@v1 and the include: component: reference don’t resolve yet. Use the CLI or SDK setup on this page instead; both run the same gate.

Both integrations wrap the CLI command above, adding changed-prompt detection (running only the .prompt.mdx files the PR/MR touches) and automatic baseline-ref resolution (baseline-ref defaults to the PR base SHA on GitHub and $CI_MERGE_REQUEST_DIFF_BASE_SHA on GitLab). CI/CD documents the component’s inputs and how it works once published.

Surface prompt and code experiments in one check

JUnit is the shared contract. The CLI emits it for prompt experiments (--format junit), and runExperiment({ junitPath }) emits the identical shape for code experiments: same per-(row × scorer) testcases, same regression <failure>s, same run-level threshold cases. Write both to the same glob and point one reporter at it, and a single PR check covers everything, regardless of origin:

- uses: actions/checkout@v4
  with:
    fetch-depth: 0

# Prompt experiments → agentmark-results-*.xml. run-experiment needs a running
# webhook server: boot `agentmark dev` first (the boot-and-wait step from "Set
# it up for prompts (CLI)" above), omitted here to keep the focus on the glob.
- run: >
    npx --yes @agentmark-ai/cli run-experiment agentmark/qa.prompt.mdx
    --baseline-commit "${{ github.event.pull_request.base.sha }}"
    --format junit > agentmark-results-qa.xml
  env:
    AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
    AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}

# Code experiments → the same glob, from your test suite
- run: npm test            # runExperiment({ junitPath: 'agentmark-results-<key>.xml' })
  if: always()
  env:
    AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
    AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}

# One reporter for both → a single "AgentMark Evals" check
- uses: mikepenz/action-junit-report@v5
  if: always()
  with:
    report_paths: 'agentmark-results-*.xml'
    check_name: 'AgentMark Evals'
    fail_on_failure: true

Because both consumers share the format, this also works outside GitHub. Any JUnit consumer (GitLab, Jenkins, CircleCI) renders both the same way.

Read the results

Both setup paths surface the same gate outcome: overall pass/fail plus the exact cases that regressed. In CI, every row × scorer pair is a JUnit <testcase>. A regressed case emits a <failure>, so the PR check panel and the Checks tab point at the specific inputs that got worse. The run-level score_thresholds failures appear as their own testcases. In the SDK, the return value pinpoints each regression. result.passed is the gate verdict: false if any case regressed or this run breached a score_thresholds floor (it doesn’t consider each row’s absolute pass/fail; assert on row.evals[].passed yourself if you want that too). result.regressionFailures counts the regressed pairs, and each row carries per-eval detail so you can list exactly what dropped:

const regressed = result.rows.flatMap((row) =>
  row.evals
    .filter((e) => e.regressed)
    .map((e) => ({ input: row.input, scorer: e.name, score: e.score, baseline: e.baselineScore })),
);

for (const r of regressed) {
  console.log(`${r.scorer}: ${r.score} (baseline ${r.baseline}) — ${JSON.stringify(r.input)}`);
}

Each eval result carries regressed (whether this specific score fell beyond tolerance) and baselineScore (what the matched baseline scored), alongside the run’s failedScoreThresholds and the resolved baseline descriptor. See --baseline-commit in the CLI reference for the full flag semantics.

Caveats

No baseline disables only the regression check. When no prior run exists for the experiment_key, AgentMark skips the per-case regression check; score_thresholds still runs. Only the JUnit reporter gates absolute per-row pass/fail: a passed: false scorer becomes a JUnit <failure> the reporter fails on; the CLI’s own exit code doesn’t gate it (use --threshold for that), and the SDK’s result.passed covers only the regression and score_thresholds gates. The CLI prints ⚠️ No baseline run found for "<experiment_key>" — regression gate inactive. to stderr; stdout stays clean for redirecting to a results file.
AgentMark reports exact-match versus recency fallback, never silently. If there’s no run at the base commit’s exact tree hash, AgentMark compares against the most recent prior run of the same key instead. The CLI prints ⚠️ No run at <tree-hash> for "<experiment_key>"; comparing against the most recent prior run instead. to stderr, and the SDK returns resolved.matchedExactCommit: false. A recency fallback can compare against a different code state than the PR base, so treat its results as advisory.
Row matching is by input hash, so masking or input drift can leave it matching nothing. AgentMark joins rows to the baseline by a content hash of the dataset input. If you redact inputs (the SDK tracing hideInputs option, or a mask function that rewrites the stored agentmark.dataset_input the gate hashes), or the dataset input otherwise differs from the baseline run, the live rows won’t match and the per-case check compares nothing. AgentMark treats a baseline that matched nothing as inert (like no baseline), not a failure, but reports it, never silently: the CLI prints ⚠️ Baseline resolved but 0/<N> rows matched it by input hash — regression gate compared nothing. to stderr, and the SDK returns baselineRowsMatched: 0 (with a console.warn). Assert on result.baselineRowsMatched > 0 in CI if a silently inert gate would be worse than a hard failure.
experiment_key must be stable across runs to match. The CLI defaults experiment_key to the repo-relative entrypoint path, derived from the git top level. A run recorded where git is unavailable falls back to the prompt name or file basename, which won’t match a git-derived key, so a baseline and a candidate computed in different environments can silently fail to resolve. Set test_settings.experiment_key explicitly to pin the identity when your runs span environments.
AgentMark skips a non-positive baseline score. The per-case check needs a baseline score greater than zero to compute a fractional drop, so a baseline of 0 never fires a regression for that pair.

Have questions?

Reach out any time:

Email the team at hello@agentmark.co for support
Schedule an Enterprise Demo to learn about AgentMark’s business solutions

​When to use this versus --threshold

​How it works

​The two checks

​How AgentMark resolves the baseline

​Prerequisites

​Set it up for prompts (CLI)

​Set it up for agents and workflows (SDK)

​Packaged CI integrations

​Surface prompt and code experiments in one check

​Read the results

​Caveats

​Have questions?

When to use this versus `--threshold`

How it works

The two checks

How AgentMark resolves the baseline

Prerequisites

Set it up for prompts (CLI)

Set it up for agents and workflows (SDK)

Packaged CI integrations

Surface prompt and code experiments in one check

Read the results

Caveats

Have questions?