Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.agentmark.co/llms.txt

Use this file to discover all available pages before exploring further.

A regression gate fails a CI build when an experiment’s scores drop against a baseline run. Unlike the absolute --threshold pass-rate gate — which fails when too few rows pass in this run alone — a regression gate compares each case to how that same case scored before, so it catches silent quality drops even when a scorer still passes in absolute terms. Use it to keep a prompt or agent from getting quietly worse as you iterate: you record a baseline once on your default branch, and every PR after that is gated against it.

When to use this versus --threshold

AgentMark has two complementary CI gates. They answer different questions, and you can run both at once.
  • Absolute pass-rate gate (--threshold <percent>): fails when the share of passing rows in this run falls below a fixed floor. It needs no baseline and answers “is this run good enough on its own?”. See Running experiments for the --threshold flag and JUnit output.
  • Regression gate (this page): fails when a case scored worse than its own baseline, or when a scorer’s mean across the run drops below a configured floor. It needs a baseline run and answers “did this change make anything worse than before?”.
A prompt change can keep the overall pass rate at 90% while quietly degrading ten specific cases that used to score higher. The absolute gate misses that; the regression gate catches it.

How it works

A regression gate compares this run’s per-(row × scorer) scores against a baseline run and applies two independent checks. Either one failing fails the build.

The two checks

Per-case regression (test_settings.regression_tolerance): a single row × scorer pair fails when its score drops more than regression_tolerance below that same case’s baseline score. The tolerance is a fraction — 0.05 means “fail if the score fell more than 5% below baseline.” This is relative and per-case: a score of 0.80 against a baseline of 0.90 is an 11% drop and fails at 0.05; the same 0.80 with no baseline does not fire this check. Run-level threshold (test_settings.score_thresholds): a scorer fails when its mean score across the whole run falls below a configured floor. You write it as a { scorerName: minMeanScore } map, for example { groundedness: 0.9 }. This is absolute and run-level — it does not need a baseline, so it stays in force even on the first run.
The per-case regression check only fires when a baseline score exists for that row × scorer pair and the baseline score is greater than zero. It never fires on a missing baseline, a non-numeric score, or a zero baseline — so it cannot fail a build spuriously.

How the baseline is resolved

Each run is identified by a stable experiment_key. It defaults to the prompt’s repo-relative entrypoint path (for example ./prompts/qa.prompt.mdx), so two distinct evaluations never collide even when they share a dataset. Set it explicitly when your subject has no single entrypoint file (a code-assembled agent or workflow), or to keep the identity stable across file renames. AgentMark resolves the baseline by experiment_key, environment, and the git tree hash of the code at the base commit. It prefers the run recorded at that exact tree hash. If none exists, it falls back to the most recent prior run of the same experiment_key — and reports which one it used, so the comparison is never silent. Rows are matched between runs by a content hash of the dataset input, not by position or ID. Reordering your dataset or regenerating row IDs does not break the comparison.

Prerequisites

A regression gate compares against a baseline run, so a baseline has to exist first.
  • Baselines are stored in AgentMark Cloud. The local dev server’s run storage is ephemeral, so it cannot serve as a durable baseline across CI runs. Both setup paths below require an AGENTMARK_API_KEY. (The eval-action’s api-key input is optional for plain evals, but the regression gate needs Cloud-stored baselines, so a key is required here.)
  • Bootstrap by recording a baseline on your default branch. Run the experiment once on main (through the same eval-action or SDK call you use in PRs). From then on, each PR gates against the run recorded on its base commit.
  • No prior run means the gate is inert, not failing. If AgentMark finds no baseline for the experiment_key, the per-case regression check is skipped — there’s nothing to compare against yet. The run-level score_thresholds gate still applies.

Set it up for prompts (eval-action)

For prompt-based evals, the agentmark-ai/eval-action GitHub Action runs the changed .prompt.mdx files on each PR, compares each case to the base commit’s run, and fails the check with per-case annotations. The gate thresholds live in the prompt’s frontmatter, so the workflow only needs to point the action at a baseline ref.
The agentmark-ai/eval-action repo publishes alongside the first regression-gate release. If agentmark-ai/eval-action@v1 resolves to a 404 for you, the action hasn’t been published yet — use the SDK setup below in the meantime (it runs the same gate from your existing test suite).
1

Add the gate config to the prompt frontmatter

Set regression_tolerance and score_thresholds in the prompt’s test_settings.
test_settings:
  dataset: ./data/qa.jsonl
  regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
  score_thresholds:
    groundedness: 0.9                   # fail the run if mean groundedness < 0.9
2

Add the action to your PR workflow

Check out with full history so the action can resolve the base ref to a tree hash, then add the action. baseline-ref defaults to the PR base SHA, so you can omit it.
name: Evals
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                # required: the gate resolves the base ref to a tree hash
      - uses: agentmark-ai/eval-action@v1
        with:
          api-key: ${{ secrets.AGENTMARK_API_KEY }}
          # baseline-ref defaults to ${{ github.event.pull_request.base.sha }}
fetch-depth: 0 is required. The default shallow checkout does not contain the base commit, so the action cannot resolve it to a tree hash. When that happens the action prints a warning and disables the regression gate for that run rather than failing.
The action reference is agentmark-ai/eval-action@v1 — a standalone action, so the uses: path is just the org and repo. It writes JUnit XML, so failures appear in the same PR check panel as your existing pytest, jest, or vitest runs.

Set it up for agents and workflows (SDK)

When the thing under test is an agent or a multi-step workflow rather than a single prompt, gate it from inside your existing test suite with the TypeScript SDK. There are no separate eval files and no CLI — your task function is the execution, so it works with any framework and needs no adapter. The trade-off: the CLI derives the two git tree hashes automatically, but the SDK does not. You pass them yourself from your CI environment.
import { AgentMarkSDK } from "@agentmark-ai/sdk";

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
});

// Record this run as a trace so it can serve as a future baseline.
sdk.initTracing();

const result = await sdk.runExperiment({
  experimentKey: "support-agent",
  dataset,                                    // [{ input, expectedOutput? }, ...]
  task: (input) => supportAgent.run(input),   // any callable — your agent or workflow
  evaluators: [groundedness],
  sourceTreeHash: process.env.TREE_SHA,       // `git rev-parse HEAD^{tree}`
  baselineTreeHash: process.env.BASE_SHA,     // base commit's tree hash; omit to skip the gate
  regressionTolerance: 0.05,
  scoreThresholds: { groundedness: 0.9 },
  junitPath: "agentmark-results-support-agent.xml", // emit the same JUnit the CLI does
});

// Fail the test when either gate fired (the JUnit file also lands for CI to report).
expect(result.passed).toBe(true);
The SDK constructor needs both apiKey and appId. initTracing() registers the run with AgentMark Cloud so a later PR can use it as a baseline — without it, the run executes and gates, but it won’t be stored as a baseline for next time. Setting junitPath writes the run as JUnit XML — the same shape the eval-action produces for prompts — so a code experiment surfaces in the PR check exactly like a prompt one. See Surface both in one check. To compute the tree hashes in CI:
TREE_SHA=$(git rev-parse "HEAD^{tree}")
# After actions/checkout@v4, the base branch only exists as a remote-tracking
# ref (origin/<branch>), so prefix with `origin/`. For non-PR events, swap in
# ${{ github.event.pull_request.base.sha }} or the appropriate base SHA.
BASE_SHA=$(git rev-parse "origin/$GITHUB_BASE_REF^{tree}")
Pass a git tree hash, not a commit SHA, for both sourceTreeHash and baselineTreeHash. Tree hashes are content-addressed, so two commits with identical file contents resolve to the same baseline. git rev-parse <ref>^{tree} converts any commit ref to its tree hash.

Surface prompt and code experiments in one check

JUnit is the shared contract. The eval-action emits it for prompt experiments, and runExperiment({ junitPath }) emits the identical shape for code experiments — same per-(row × scorer) testcases, same regression <failure>s, same run-level threshold cases. Point one reporter at both and a single PR check covers everything, regardless of origin. The action’s report: false makes it produce the XML without opening its own check, so a downstream reporter can combine both producers:
- uses: actions/checkout@v4
  with:
    fetch-depth: 0

# Prompt experiments → agentmark-results-*.xml (no check yet)
- uses: agentmark-ai/eval-action@v1
  with:
    api-key: ${{ secrets.AGENTMARK_API_KEY }}
    report: false

# Code experiments → the same glob, from your test suite
- run: npm test            # runExperiment({ junitPath: 'agentmark-results-<key>.xml' })
  env:
    AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
    AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}

# One reporter for both → a single "AgentMark Evals" check
- uses: mikepenz/action-junit-report@v5
  if: always()
  with:
    report_paths: 'agentmark-results-*.xml'
    check_name: 'AgentMark Evals'
    fail_on_failure: true
Because the format is shared, this also works outside GitHub — any JUnit consumer (GitLab, Jenkins, CircleCI) renders both the same way.

Read the results

Both setup paths surface the same gate outcome — overall pass/fail plus the exact cases that regressed. In CI with eval-action, every row × scorer pair is a JUnit <testcase>. A regressed case emits a <failure>, so the PR check panel and the Checks tab point at the specific inputs that got worse. The run-level score_thresholds failures appear as their own testcases. In the SDK, the return value pinpoints each regression. result.passed is the gate verdict — false if any case regressed or a score_thresholds floor was breached (it does not consider each row’s absolute pass/fail; assert on row.evals[].passed yourself if you want that too). result.regressionFailures counts the regressed pairs; and each row carries per-eval detail so you can list exactly what dropped:
const regressed = result.rows.flatMap((row) =>
  row.evals
    .filter((e) => e.regressed)
    .map((e) => ({ input: row.input, scorer: e.name, score: e.score, baseline: e.baselineScore })),
);

for (const r of regressed) {
  console.log(`${r.scorer}: ${r.score} (baseline ${r.baseline}) — ${JSON.stringify(r.input)}`);
}
Each eval result carries regressed (whether this specific score fell beyond tolerance) and baselineScore (what the matched baseline scored), alongside the run’s failedScoreThresholds and the resolved baseline descriptor. For the underlying flag that drives the CLI path, see --baseline-commit in the CLI reference. The eval-action resolves baseline-ref to a tree hash and passes it as --baseline-commit for you.

Caveats

  • No baseline disables only the regression check. When no prior run exists for the experiment_key, the per-case regression check is skipped; score_thresholds still runs. Absolute per-row pass/fail is gated only in eval-action — a passed: false scorer becomes a JUnit <failure> the reporter fails on; the CLI’s own exit code does not gate it, and the SDK’s result.passed covers only the regression and score_thresholds gates. The CLI prints ⚠️ No baseline run found for "<experiment_key>" — regression gate inactive. to stderr; stdout stays clean for redirecting to a results file.
  • Exact-match versus recency fallback is reported, never silent. If there’s no run at the base commit’s exact tree hash, AgentMark compares against the most recent prior run of the same key instead. The CLI prints ⚠️ No run at <tree-hash> for "<experiment_key>"; comparing against the most recent prior run instead. to stderr, and the SDK returns resolved.matchedExactCommit: false. A recency fallback can compare against a different code state than the PR base, so treat its results as advisory.
  • Row matching is by input hash, so masking or input drift can leave it matching nothing. Rows are joined to the baseline by a content hash of the dataset input. If you redact inputs — the SDK tracing hideInputs option, or a mask function that rewrites the stored agentmark.dataset_input the gate hashes — or the dataset input otherwise differs from the baseline run, the live rows won’t match and the per-case check compares nothing. A baseline that matched nothing is treated as inert (like no baseline), not a failure, but it is reported, never silent: the CLI prints ⚠️ Baseline resolved but 0/<N> rows matched it by input hash — regression gate compared nothing. to stderr, and the SDK returns baselineRowsMatched: 0 (with a console.warn). Assert on result.baselineRowsMatched > 0 in CI if a silently inert gate would be worse than a hard failure.
  • experiment_key must be stable across runs to match. The CLI defaults experiment_key to the repo-relative entrypoint path, derived from the git top level. A run recorded where git is unavailable falls back to the prompt name or file basename, which won’t match a git-derived key — so a baseline and a candidate computed in different environments can silently fail to resolve. Set test_settings.experiment_key explicitly to pin the identity when your runs span environments.
  • A non-positive baseline score is skipped. The per-case check needs a baseline score greater than zero to compute a fractional drop, so a baseline of 0 never fires a regression for that pair.

Have Questions?

We’re here to help! Choose the best way to reach us: