--format junit output is JUnit XML, the same format pytest, jest, and vitest emit, so every major CI system parses it natively: GitHub Actions (via marketplace parsers), GitLab CI (via artifacts:reports:junit:), Jenkins, and CircleCI. Failures show up alongside your other tests, with no third-party reporter to install.
The gates
Three independent gates can fire in CI. They answer different questions and can all run at once:
- Validation: does every prompt still compile? Runs agentmark build. See Status checks, which covers both the Cloud-managed check and the self-hosted CI job.
- Absolute pass rate (--threshold <percent>): is this run good enough on its own? Fails when the share of passing rows falls below a fixed floor. Needs no baseline. See Running experiments.
- Regression (--baseline-commit <ref>): did this change make anything worse than before? Fails when a case scores below its own baseline, or a scorer’s mean drops below a floor. See Regression gates for the full mechanics.
The last two (--threshold and --baseline-commit) are flags on the same run-experiment command.
Run evals in CI (raw CLI)
run-experiment sends each prompt and dataset to a running AgentMark dev server, so a CI job boots one headless and waits for it before running the experiment. This is the platform-agnostic pattern: install dependencies, boot agentmark dev --no-ui --no-forward, wait for port 9417, run the experiment, and point your CI at the JUnit output. The GitLab job below is the fully worked example; the same boot-run-report shape ports to GitHub Actions, Jenkins, and CircleCI.
.gitlab-ci.yml
The job needs the model-provider credentials your prompts use (for example, OPENAI_API_KEY). Add them as masked CI/CD variables in Settings → CI/CD → Variables; the job environment passes through to the dev server. Without a running server, run-experiment exits with ❌ Could not connect to AgentMark server.
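The wait-for-port step can be sketched as a small Node script. This is a minimal illustration, not part of the AgentMark CLI: the function names, default host, timeout, and polling interval are all assumptions, and only the port number (9417) comes from the text above.

```typescript
import * as net from "node:net";

// Try one TCP connection; resolve true if it connects, false on error or timeout.
function probePort(port: number, host: string): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.createConnection({ port, host });
    const finish = (ok: boolean) => {
      socket.destroy();
      resolve(ok);
    };
    socket.once("connect", () => finish(true));
    socket.once("error", () => finish(false));
    socket.setTimeout(1000, () => finish(false));
  });
}

// Poll until the port accepts connections or the deadline passes.
// Hypothetical helper: the defaults below are illustrative, not CLI settings.
async function waitForPort(
  port: number,
  host = "127.0.0.1",
  timeoutMs = 60_000,
  intervalMs = 500,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probePort(port, host)) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

A CI step could call waitForPort(9417) after starting the dev server in the background and exit non-zero when it resolves to false, so the job fails early rather than at the first run-experiment call.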
List each prompt you want to gate as its own run-experiment line writing to its own XML file. Add --threshold <percent> for a pass-rate gate, or --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" for the regression gate. The baseline lookup resolves from AgentMark Cloud when you set an AGENTMARK_API_KEY variable (see Set up the API key), and GIT_DEPTH: "0" keeps the diff base resolvable. See Regression gates for the gate mechanics.
run-experiment always executes through a webhook server (the boot-and-wait step above), on every platform; AGENTMARK_API_KEY / AGENTMARK_APP_ID don’t change that. They point the regression baseline lookup at AgentMark Cloud (durable across CI runs) instead of the ephemeral local store. The regression gate setup has the complete GitHub Actions .github/workflows/evals.yml. To gate agents or workflows from inside your own test suite (no CLI and no dev server, since your task function is the execution), use the SDK setup.
Set up the API key
Add AGENTMARK_API_KEY as a masked, protected CI/CD variable in your project’s settings (Settings → CI/CD → Variables on GitLab, Settings → Secrets and variables → Actions on GitHub):
Get the key from AgentMark Cloud
In the AgentMark Dashboard, open Settings → API Keys and create a key scoped to the app whose prompts you’re gating.
Store it as a masked variable
On GitLab, Settings → CI/CD → Variables → Add variable:
- Key: AGENTMARK_API_KEY
- Value: the key from step 1
- Type: Variable (not File)
- Flags: Masked, Protected
On GitHub, store the key as a repository secret and reference it in the workflow as ${{ secrets.AGENTMARK_API_KEY }}.
What gets gated
Up to four independent gate predicates fire on every run; any failure fails the job.
- Per-row gate: every (row × scorer) pair is a <testcase> in the JUnit XML. If the scorer’s passed flag is false, the run emits <failure> and your CI reports it inline (in the MR widget on GitLab, the Checks tab on GitHub).
- Threshold gate (optional): when you set --threshold, the job fails if the overall pass rate is below the threshold.
- Regression gate (optional): when a baseline run resolves and the prompt sets test_settings.regression_tolerance, a row fails if a scorer’s score dropped more than the tolerance below its baseline. This catches silent quality drops even when the scorer still “passes” in absolute terms.
- Per-scorer threshold gate (optional): when the prompt sets test_settings.score_thresholds (a { scorer: minMeanScore } map), the run fails if a scorer’s mean score across the run falls below the configured minimum.
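The four predicates above can be sketched as one function. This is an illustration of the logic as described, not the CLI's implementation: the PairResult and GateConfig shapes are hypothetical, the pass rate is assumed to be computed over (row × scorer) pairs, and the tolerance is assumed to be an absolute score difference.

```typescript
// Hypothetical shapes for illustration only; these are not the CLI's real types.
interface PairResult {
  rowId: string;
  scorer: string;
  score: number;          // numeric score reported by the scorer
  passed: boolean;        // the scorer's own pass/fail flag
  baselineScore?: number; // same row and scorer in the baseline run, when one resolved
}

interface GateConfig {
  threshold?: number;                       // --threshold, as a percent (0-100)
  regressionTolerance?: number;             // test_settings.regression_tolerance
  scoreThresholds?: Record<string, number>; // test_settings.score_thresholds
}

// Returns one message per failed check; an empty array means the job passes.
function evaluateGates(results: PairResult[], config: GateConfig): string[] {
  const failures: string[] = [];

  // 1. Per-row gate: any pair whose scorer flagged it as not passed.
  for (const r of results) {
    if (!r.passed) failures.push(`per-row: ${r.rowId} / ${r.scorer}`);
  }

  // 2. Threshold gate: overall pass rate against a fixed floor.
  // Assumption: the rate is computed over (row x scorer) pairs.
  if (config.threshold !== undefined && results.length > 0) {
    const passRate = (100 * results.filter((r) => r.passed).length) / results.length;
    if (passRate < config.threshold) {
      failures.push(`threshold: pass rate ${passRate.toFixed(1)}% < ${config.threshold}%`);
    }
  }

  // 3. Regression gate: score dropped more than the tolerance below baseline.
  // Assumption: the tolerance is an absolute score difference.
  if (config.regressionTolerance !== undefined) {
    for (const r of results) {
      if (r.baselineScore === undefined) continue; // no baseline, nothing to compare
      if (r.baselineScore - r.score > config.regressionTolerance) {
        failures.push(`regression: ${r.rowId} / ${r.scorer}`);
      }
    }
  }

  // 4. Per-scorer threshold gate: mean score per scorer against its minimum.
  for (const [scorer, minMean] of Object.entries(config.scoreThresholds ?? {})) {
    const scores = results.filter((r) => r.scorer === scorer).map((r) => r.score);
    if (scores.length === 0) continue;
    const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
    if (mean < minMean) {
      failures.push(`score-threshold: ${scorer} mean ${mean.toFixed(2)} < ${minMean}`);
    }
  }

  return failures;
}
```

Note how a pair can pass its scorer's own check and still trip the regression gate: a score of 0.6 against a baseline of 0.9 with a tolerance of 0.1 fails gate 3 even when passed is true.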
Packaged integrations
The agentmark-ai/eval-component GitLab CI/CD Catalog component and the agentmark-ai/eval-action GitHub Action automate the wiring: they diff each PR/MR, run @agentmark-ai/cli against the changed .prompt.mdx files, and emit the JUnit XML for you. They share a contract: both wrap the same CLI command (run-experiment --format junit), accept the same threshold / baseline-ref semantics, and emit the same JUnit XML schema.
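The "diff, then pick the changed prompts" step can be sketched as a filter over a list of changed paths. This is an assumption about how such a selection might work, not the integrations' actual code; changedPromptFiles is a hypothetical helper, and its input is taken to be newline-separated paths such as the output of git diff --name-only.

```typescript
// Keep only prompt files from a newline-separated list of changed paths,
// such as the output of `git diff --name-only <base> <head>`.
// Hypothetical helper for illustration.
function changedPromptFiles(diffNameOnly: string): string[] {
  return diffNameOnly
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.endsWith(".prompt.mdx"));
}
```

A name-only diff also lists deleted files, so a real job would likely exclude them first (for example with git's --diff-filter=d) before passing the paths to run-experiment.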
GitLab component quick start
Once the component ships, the include replaces the hand-rolled job. By default, the component evaluates the .prompt.mdx files changed in the diff and surfaces results inline in the MR widget.
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| api-key | optional | None | AgentMark API key. Required for Cloud-backed runs; omit for fully local evals. |
| prompts | optional | changed .prompt.mdx files | Newline- or space-separated list of prompt files to evaluate. |
| threshold | optional | None | Pass-rate threshold (0–100). Fails the job if overall pass rate is below this number. |
| baseline-ref | optional | MR diff base | Git ref to compare scores against for the regression gate. Resolved to a tree hash and passed to the CLI as --baseline-commit. Requires GIT_DEPTH=0. Set empty ('') to disable. |
| working-directory | optional | . | Directory to run from. |
| results-glob | optional | agentmark-results-*.xml | Pattern for per-prompt JUnit XML output files. Must contain exactly one * wildcard; the prefix and suffix around it become the per-prompt filename template. |
| cli-version | optional | latest | npm version specifier for @agentmark-ai/cli. Pin for reproducible CI. |
| image | optional | node:20-bookworm-slim | Docker image. Must include npm, git, bash. |
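The results-glob rule (exactly one * wildcard, with the prefix and suffix forming the filename template) can be sketched as follows. The splitting logic follows the table's description; how the component actually derives the per-prompt part of the name is not documented here, so the slug step is a labeled assumption and resultsPathFor is a hypothetical helper.

```typescript
// Build a per-prompt output filename from a results-glob pattern.
// Hypothetical helper for illustration.
function resultsPathFor(resultsGlob: string, promptPath: string): string {
  const parts = resultsGlob.split("*");
  if (parts.length !== 2) {
    throw new Error(`results-glob must contain exactly one "*": ${resultsGlob}`);
  }
  const [prefix, suffix] = parts;

  // Assumption: the wildcard is replaced by a slug of the prompt path,
  // with the .prompt.mdx extension dropped and separators turned into dashes.
  const slug = promptPath
    .replace(/\.prompt\.mdx$/, "")
    .replace(/[^A-Za-z0-9._-]+/g, "-");

  return `${prefix}${slug}${suffix}`;
}
```

Validating the wildcard count up front means a misconfigured pattern fails before any experiment runs, rather than having several prompts overwrite the same XML file.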
When the job runs
The component’s default rules: block runs the job on:
- every merge request pipeline (the primary gate), and
- pushes to the default branch (so the run records a fresh baseline after merge).
Override the job’s rules: in your .gitlab-ci.yml to change the cadence.
With a regression-tolerance threshold
Set the per-case tolerance and run-level floors in the prompt’s frontmatter. The component reads them automatically from test_settings, so it doesn’t need any extra inputs.
baseline-ref defaults to $CI_MERGE_REQUEST_DIFF_BASE_SHA, so MR pipelines pick up the right comparison automatically. The first run on the default branch records the baseline; from then on every MR gates against the run captured at its base commit’s tree hash. See Regression gates for the full gate semantics.
Coexists with your existing tests
The CLI job and the component both emit JUnit XML, the same format pytest, jest, and vitest already emit. Failures appear alongside any other failing test: in the MR widget and the pipeline Tests tab on GitLab, in the Checks tab on GitHub. No new UI to learn, no additional reporter to install.
See also
- Status checks: the validation gate (does every prompt compile?), Cloud-managed and self-hosted.
- Regression gates: full mechanics of the per-case and run-level gates, plus the GitHub Actions workflow and SDK setup.
- Running experiments: CLI reference and JUnit output details.