Run AgentMark evals in CI and gate pull requests and merge requests on the results. The CLI’s --format junit output is JUnit XML, the same format pytest, jest, and vitest emit, so every major CI system parses it natively: GitHub Actions (via marketplace parsers), GitLab CI (via artifacts:reports:junit:), Jenkins, and CircleCI. Failures show up alongside your other tests, with no third-party reporter to install.

The gates

Three independent gates can fire in CI. They answer different questions and can all run at once:
  • Validation: does every prompt still compile? Runs agentmark build. See Status checks, which covers both the Cloud-managed check and the self-hosted CI job.
  • Absolute pass rate (--threshold <percent>): is this run good enough on its own? Fails when the share of passing rows falls below a fixed floor. Needs no baseline. See Running experiments.
  • Regression (--baseline-commit <ref>): did this change make anything worse than before? Fails when a case scores below its own baseline, or a scorer’s mean drops below a floor. See Regression gates for the full mechanics.
This page covers wiring the eval run itself into your pipeline. The two eval gates (--threshold and --baseline-commit) are flags on the same run-experiment command.

Run evals in CI (raw CLI)

run-experiment sends each prompt and dataset to a running AgentMark dev server, so a CI job boots one headless and waits for it before running the experiment. This is the platform-agnostic pattern: install dependencies, boot agentmark dev --no-ui --no-forward, wait for port 9417, run the experiment, and point your CI at the JUnit output. The GitLab job below is the fully worked example; the same boot-run-report shape ports to GitHub Actions, Jenkins, and CircleCI.
.gitlab-ci.yml
agentmark_eval:
  image: node:20-bookworm-slim
  variables:
    GIT_DEPTH: "0"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # Project dependencies: the dev server runs your project's client code
    - npm ci
    # Boot the dev server headless (webhook server on 9417, API server on 9418).
    # `npx` resolves the @agentmark-ai/cli version installed by `npm ci`. In a raw
    # CI shell node_modules/.bin isn't on PATH, so a bare `agentmark` won't resolve.
    - npx @agentmark-ai/cli dev --no-ui --no-forward &
    # Wait until the webhook server accepts connections
    - timeout 60 bash -c 'until (echo > /dev/tcp/127.0.0.1/9417) 2>/dev/null; do sleep 1; done'
    # Run the experiment and emit JUnit XML
    - npx @agentmark-ai/cli run-experiment agentmark/qa.prompt.mdx --format junit > results.xml
  artifacts:
    when: always
    reports:
      junit: results.xml
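The wait step in the job above is a plain TCP readiness poll. For runners without bash, the same idea can be sketched in Python. This is a minimal sketch, not part of AgentMark: the function name `wait_for_port` and its parameters are illustrative, and only the port number 9417 comes from the job above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is accepting connections.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval_s)
    return False
```

A CI step could call `wait_for_port("127.0.0.1", 9417)` and exit non-zero when it returns False, which mirrors what `timeout 60` does in the bash version.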
The dev server is what executes your prompts, so it needs your model provider keys (for example OPENAI_API_KEY). Add them as masked CI/CD variables in Settings → CI/CD → Variables; the job environment passes through to the dev server. Without a running server, run-experiment exits with ❌ Could not connect to AgentMark server.
List each prompt you want to gate as its own run-experiment line, each writing to its own XML file. Add --threshold <percent> for a pass-rate gate, or --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" for the regression gate. The baseline lookup resolves from AgentMark Cloud when you set an AGENTMARK_API_KEY variable (see Set up the API key), and GIT_DEPTH: "0" keeps the diff base resolvable. See Regression gates for the gate mechanics.
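When several prompts are gated, the one-line-per-prompt pattern can be generated instead of written by hand. The sketch below builds one argument list per prompt using only the flags named on this page (`--format junit`, `--threshold`, `--baseline-commit`). The helper name `build_eval_commands` and the `results-<name>.xml` naming scheme are assumptions made for illustration, not CLI behavior.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def build_eval_commands(
    prompts: List[str],
    threshold: Optional[int] = None,
    baseline_commit: Optional[str] = None,
) -> List[Tuple[List[str], str]]:
    """Return (argv, output_file) pairs, one run-experiment call per prompt."""
    jobs = []
    for prompt in prompts:
        argv = ["npx", "@agentmark-ai/cli", "run-experiment", prompt,
                "--format", "junit"]
        if threshold is not None:
            argv += ["--threshold", str(threshold)]
        if baseline_commit:
            argv += ["--baseline-commit", baseline_commit]
        # Hypothetical naming scheme: qa.prompt.mdx -> results-qa.xml
        name = Path(prompt).name.split(".")[0]
        jobs.append((argv, f"results-{name}.xml"))
    return jobs
```

Each pair can then be run with `subprocess.run(argv, stdout=open(output_file, "w"))`. Two prompts with the same file name in different directories would collide under this naming scheme, so a real pipeline would need a more distinctive name.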
run-experiment always executes through a webhook server (the boot-and-wait step above), on every platform; AGENTMARK_API_KEY / AGENTMARK_APP_ID don’t change that. They point the regression baseline lookup at AgentMark Cloud (durable across CI runs) instead of the ephemeral local store.
The regression gate setup has the complete GitHub Actions .github/workflows/evals.yml.
To gate agents or workflows from inside your own test suite (no CLI and no dev server, since your task function is the execution), use the SDK setup.

Set up the API key

Add AGENTMARK_API_KEY as a masked, protected CI/CD variable in your project’s settings (Settings → CI/CD → Variables on GitLab, Settings → Secrets and variables → Actions on GitHub):
1. Get the key from AgentMark Cloud

In the AgentMark Dashboard, open Settings → API Keys and create a key scoped to the app whose prompts you’re gating.
2. Store it as a masked variable

On GitLab, Settings → CI/CD → Variables → Add variable:
  • Key: AGENTMARK_API_KEY
  • Value: the key from step 1
  • Type: Variable (not File)
  • Flags: Masked, Protected
On GitHub, add it as a repository secret and reference it as ${{ secrets.AGENTMARK_API_KEY }}.
3. Reference it from the job

The CLI reads AGENTMARK_API_KEY directly from the job environment. Once the GitLab component ships, pass it via inputs.api-key: $AGENTMARK_API_KEY. Don’t hard-code the key in your pipeline config.
Cloud-backed runs (regression-gate baselines, dataset sync) need the key. For fully local evals with no Cloud features, you can skip the variable and run without a key.
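A job that depends on Cloud features can check for the key before it boots the dev server, so a missing variable produces a clear message. The following is a minimal sketch: `check_api_key` and its `needs_cloud` flag are hypothetical names, and only the variable name AGENTMARK_API_KEY comes from this page.

```python
import os
from typing import Mapping, Optional


def check_api_key(needs_cloud: bool,
                  env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return an error message if a Cloud-backed run has no API key, else None."""
    env = os.environ if env is None else env
    key = env.get("AGENTMARK_API_KEY", "").strip()
    if needs_cloud and not key:
        return ("AGENTMARK_API_KEY is not set; Cloud-backed features such as "
                "regression baselines need it")
    # Fully local evals can run without a key.
    return None
```

This check is worth having on GitLab in particular, because protected variables are only exposed to pipelines that run on protected branches or tags. A merge request pipeline from an unprotected branch will therefore see no key.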

What gets gated

Up to four independent gate predicates are evaluated on every run; if any one of them fails, the job fails.
  1. Per-row gate: every (row × scorer) pair is a <testcase> in the JUnit XML. If the scorer’s passed flag is false, the run emits <failure> and your CI reports it inline (in the MR widget on GitLab, the Checks tab on GitHub).
  2. Threshold gate (optional): when you set --threshold, the job fails if the overall pass rate is below the threshold.
  3. Regression gate (optional): when a baseline run resolves and the prompt sets test_settings.regression_tolerance, a row fails if a scorer’s score dropped more than the tolerance below its baseline. This catches silent quality drops even when the scorer still “passes” in absolute terms.
  4. Per-scorer threshold gate (optional): when the prompt sets test_settings.score_thresholds (a { scorer: minMeanScore } map), the run fails if a scorer’s mean score across the run falls below the configured minimum.
Regression gates documents the full mechanics: how the gate resolves the baseline by tree hash, matches rows by input content, and keeps missing baselines inert.
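The four predicates can be read as one function over the run's (row × scorer) results. The sketch below illustrates that logic and is not the CLI's implementation. The result dict shape, the SHA-256 input key, the absolute (subtractive) reading of the tolerance, and a pass rate computed over (row × scorer) pairs are all assumptions.

```python
import hashlib
import json
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


def input_key(row_input: Any) -> str:
    """Stable key for matching rows by input content (assumed: canonical JSON hash)."""
    canonical = json.dumps(row_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def evaluate_gates(
    results: List[Dict[str, Any]],
    threshold: Optional[float] = None,
    baseline: Optional[Dict[Tuple[str, str], float]] = None,
    regression_tolerance: Optional[float] = None,
    score_thresholds: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return one message per failed gate; an empty list means the job passes."""
    failures: List[str] = []
    # 1. Per-row gate: any scorer result with passed == False fails.
    for i, r in enumerate(results):
        if not r["passed"]:
            failures.append(f"per-row: row {i} failed scorer {r['scorer']}")
    # 2. Threshold gate: overall pass rate (percent) must reach the floor.
    if threshold is not None and results:
        rate = 100.0 * sum(1 for r in results if r["passed"]) / len(results)
        if rate < threshold:
            failures.append(f"threshold: pass rate {rate:.1f}% < {threshold}%")
    # 3. Regression gate: inert when no baseline entry matches the row.
    if baseline and regression_tolerance is not None:
        for i, r in enumerate(results):
            base = baseline.get((input_key(r["input"]), r["scorer"]))
            if base is not None and r["score"] < base - regression_tolerance:
                failures.append(
                    f"regression: row {i} {r['scorer']} {r['score']} vs {base}")
    # 4. Per-scorer threshold gate: mean score per scorer must reach its minimum.
    if score_thresholds:
        by_scorer: Dict[str, List[float]] = defaultdict(list)
        for r in results:
            by_scorer[r["scorer"]].append(r["score"])
        for scorer, minimum in score_thresholds.items():
            scores = by_scorer.get(scorer)
            if scores and sum(scores) / len(scores) < minimum:
                failures.append(f"score_threshold: mean {scorer} < {minimum}")
    return failures
```

Note that a row can pass its scorer and still fail the regression predicate, which is the "silent quality drop" case described in item 3.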

Packaged integrations

The agentmark-ai/eval-component GitLab CI/CD Catalog component and the agentmark-ai/eval-action GitHub Action automate the wiring: they diff each PR/MR, run @agentmark-ai/cli against the changed .prompt.mdx files, and emit the JUnit XML for you. They share a contract: both wrap the same CLI command (run-experiment --format junit), accept the same threshold / baseline-ref semantics, and emit the same JUnit XML schema.
Neither the agentmark-ai/eval-component Catalog component nor the agentmark-ai/eval-action GitHub Action is published yet, so include: component: gitlab.com/agentmark-ai/eval-component/eval@v1 and uses: agentmark-ai/eval-action@v1 won’t resolve. Use the raw-CLI setup above; it produces identical JUnit output and gate behavior. The sections below document how the GitLab component works once it’s published.

GitLab component quick start

Once the component ships, the include replaces the hand-rolled job:
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY    # masked, protected CI variable

variables:
  GIT_DEPTH: "0"                     # required so the diff base resolves
On every MR, the component evaluates the .prompt.mdx files changed in the diff and surfaces results inline in the MR widget.
The component needs GIT_DEPTH: "0". GitLab’s default shallow checkout doesn’t contain the diff base, so the component can’t resolve $CI_MERGE_REQUEST_DIFF_BASE_SHA to a tree hash. When that happens, the component disables the regression gate for the run rather than failing the job.
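The fallback described above can be sketched as a small resolver: ask git for the ref's tree hash and treat a failure as "no baseline" instead of an error. This is an illustration under assumptions, not the component's code. The helper names are hypothetical, while `git rev-parse --verify --quiet <ref>^{tree}` is standard git.

```python
import subprocess
from typing import Callable, List, Optional


def _git(args: List[str]) -> Optional[str]:
    """Run a git command; return stripped stdout, or None if it fails."""
    try:
        proc = subprocess.run(["git", *args], capture_output=True,
                              text=True, check=False)
    except OSError:
        return None  # git is not installed or not runnable
    return proc.stdout.strip() if proc.returncode == 0 else None


def resolve_baseline_tree(
    ref: str,
    git: Callable[[List[str]], Optional[str]] = _git,
) -> Optional[str]:
    """Return the tree hash for ref, or None when the checkout lacks that commit."""
    if not ref:
        return None  # an empty baseline ref disables the regression gate
    # In a shallow clone the diff base is often absent, so rev-parse fails.
    return git(["rev-parse", "--verify", "--quiet", f"{ref}^{{tree}}"])
```

A caller would pass the result on as the baseline only when it is not None, and otherwise run the experiment without the regression gate.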

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| api-key | optional | None | AgentMark API key. Required for Cloud-backed runs; omit for fully local evals. |
| prompts | optional | changed .prompt.mdx files | Newline- or space-separated list of prompt files to evaluate. |
| threshold | optional | None | Pass-rate threshold (0–100). Fails the job if overall pass rate is below this number. |
| baseline-ref | optional | MR diff base | Git ref to compare scores against for the regression gate. Resolved to a tree hash and passed to the CLI as --baseline-commit. Requires GIT_DEPTH=0. Set empty ('') to disable. |
| working-directory | optional | . | Directory to run from. |
| results-glob | optional | agentmark-results-*.xml | Pattern for per-prompt JUnit XML output files. Must contain exactly one * wildcard; the prefix and suffix around it become the per-prompt filename template. |
| cli-version | optional | latest | npm version specifier for @agentmark-ai/cli. Pin for reproducible CI. |
| image | optional | node:20-bookworm-slim | Docker image. Must include npm, git, bash. |
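The results-glob rule (exactly one `*`, with the text around it used as a filename template) can be sketched as follows. How the component derives the per-prompt part of the name is not documented here, so the slug used below (the prompt path with non-alphanumeric runs flattened to hyphens) is an assumption, as is the function name.

```python
import re


def results_filename(results_glob: str, prompt_path: str) -> str:
    """Expand a one-wildcard glob into a per-prompt JUnit XML filename."""
    if results_glob.count("*") != 1:
        raise ValueError("results-glob must contain exactly one '*' wildcard")
    prefix, suffix = results_glob.split("*")
    # Hypothetical slug: drop the .prompt.mdx extension, flatten the rest.
    stem = prompt_path
    if stem.endswith(".prompt.mdx"):
        stem = stem[: -len(".prompt.mdx")]
    slug = re.sub(r"[^A-Za-z0-9]+", "-", stem).strip("-")
    return f"{prefix}{slug}{suffix}"
```

With the default pattern, `results_filename("agentmark-results-*.xml", "agentmark/qa.prompt.mdx")` yields `agentmark-results-agentmark-qa.xml` under this slug scheme.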

When the job runs

By default, the component’s rules: run the job on:
  • every merge request pipeline (the primary gate), and
  • pushes to the default branch (so the run records a fresh baseline after merge).
Override in your .gitlab-ci.yml to change the cadence:
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

# Run only on MRs — skip the default-branch baseline write.
agentmark_eval:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
The default-branch run is what populates the baseline that subsequent MRs gate against. Skip it only if you’re recording baselines through a separate process (a scheduled job, a manual trigger, or the SDK).

With a regression-tolerance threshold

Set the per-case tolerance and run-level floors in the prompt’s frontmatter. The component reads them automatically from test_settings, so it doesn’t need any extra inputs.
# agentmark/qa.prompt.mdx (frontmatter)
test_settings:
  dataset: ./data/qa.jsonl
  regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
  score_thresholds:
    groundedness: 0.9                   # fail the run if mean groundedness < 0.9
# .gitlab-ci.yml
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

variables:
  GIT_DEPTH: "0"
baseline-ref defaults to $CI_MERGE_REQUEST_DIFF_BASE_SHA, so MR pipelines pick up the right comparison automatically. The first run on the default branch records the baseline; from then on every MR gates against the run captured at its base commit’s tree hash. See Regression gates for the full gate semantics.

Coexists with your existing tests

The CLI job and the component both emit JUnit XML, the same format pytest, jest, and vitest already emit. Failures appear alongside any other failing test: in the MR widget and the pipeline Tests tab on GitLab, in the Checks tab on GitHub. No new UI to learn, no additional reporter to install.
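Because the output is ordinary JUnit XML, any script can read it with a standard XML parser. The sketch below counts `<testcase>` elements and collects the ones that carry a `<failure>` or `<error>` child. The function name and the summary dict shape are illustrative, and the exact attributes AgentMark writes on each element are not assumed beyond `name`.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def summarize_junit(xml_text: str) -> Dict[str, Any]:
    """Count test cases and failures in a JUnit XML document."""
    root = ET.fromstring(xml_text)
    total = 0
    failing: List[str] = []
    # iter() walks nested elements, so <testsuites> and <testsuite> roots both work.
    for case in root.iter("testcase"):
        total += 1
        if case.find("failure") is not None or case.find("error") is not None:
            failing.append(case.get("name", ""))
    passed = total - len(failing)
    return {
        "total": total,
        "failed": len(failing),
        "failing": failing,
        "pass_rate": 100.0 * passed / total if total else 0.0,
    }
```

This can be handy for posting a short summary comment on a pull request, while the CI system's own JUnit parser remains the source of the pass/fail status.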

See also

  • Status checks: the validation gate (does every prompt compile?), Cloud-managed and self-hosted.
  • Regression gates: full mechanics of the per-case and run-level gates, plus the GitHub Actions workflow and SDK setup.
  • Running experiments: CLI reference and JUnit output details.
