--format junit output is JUnit XML, which GitLab parses natively via artifacts:reports:junit:, so failures show up in the MR widget and the pipeline Tests tab with no third-party reporter.
The agentmark-ai/eval-component GitLab CI/CD Catalog component automates the wiring: it diffs each merge request, runs @agentmark-ai/cli against the changed .prompt.mdx files, and emits the JUnit XML for you. It shares its contract with the agentmark-ai/eval-action GitHub Actions sibling: both wrap the same CLI command (run-experiment --format junit), accept the same threshold / baseline-ref semantics, and emit the same JUnit XML schema.
Raw-CLI fallback
Until the component is published, run the CLI directly in a hand-rolled job. This is also the right choice when you want the full YAML pinned in your repo.

run-experiment sends each prompt and dataset to a running AgentMark dev server, so the job boots one headless and waits for it before running the experiment.
The dev server needs your model provider credentials (for example, OPENAI_API_KEY). Add them as masked CI/CD variables in Settings → CI/CD → Variables; the job environment passes through to the dev server. Without a running server, run-experiment exits with ❌ Could not connect to AgentMark server.
List each prompt you want to gate as its own run-experiment line writing to its own XML file. Add --threshold <percent> for a pass-rate gate, or --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" for the regression gate. The baseline lookup resolves from AgentMark Cloud when an AGENTMARK_API_KEY variable is set (see Set up the API key), and GIT_DEPTH: "0" keeps the diff base resolvable. See Regression gates for the gate mechanics.
Compared to the component, this loses the automatic prompt-diff scoping (you list each prompt manually) and the baseline-ref resolution helper, but the JUnit output and the gate semantics are identical.
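The hand-rolled job described above might look like the following sketch. Several details are assumptions rather than documented behavior: the dev subcommand used to start the server, the DEV_SERVER_PORT variable and its value, the job name, the prompt path, the threshold value, and the use of a stdout redirect to produce the XML file. Check the CLI's help output for the real server command and port before relying on it.

```yaml
# Sketch only. Assumptions (not confirmed by this page): the `dev` subcommand,
# the DEV_SERVER_PORT variable and its value, the prompt path, and that the
# JUnit XML is written to stdout.
agentmark-evals:
  image: node:20-bookworm-slim
  variables:
    GIT_DEPTH: "0"            # full history keeps the MR diff base resolvable
    DEV_SERVER_PORT: "9418"   # hypothetical port; use the one your dev server listens on
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - npm ci
    # Boot the dev server in the background (command name is an assumption).
    - npx @agentmark-ai/cli dev &
    # Wait up to 60 seconds for the port to accept connections.
    - timeout 60 bash -c 'until (echo > /dev/tcp/127.0.0.1/$DEV_SERVER_PORT) 2>/dev/null; do sleep 1; done'
    # One run-experiment line per gated prompt, each writing its own XML file.
    - npx @agentmark-ai/cli run-experiment prompts/example.prompt.mdx --format junit --threshold 80 > agentmark-results-example.xml
  artifacts:
    when: always              # upload reports even when a gate fails the job
    reports:
      junit: agentmark-results-*.xml
```

For the regression gate, replace --threshold 80 with --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" on the run-experiment line. Setting when: always on the artifacts matters because a failed gate exits non-zero, and GitLab would otherwise skip the report upload.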
Set up the API key
Add AGENTMARK_API_KEY as a masked, protected CI/CD variable in your project’s Settings → CI/CD → Variables:
Get the key from AgentMark Cloud
In the AgentMark Dashboard, open Settings → API Keys and create a key scoped to the app whose prompts you’re gating.
Store it as a masked variable
In GitLab, Settings → CI/CD → Variables → Add variable:
- Key: AGENTMARK_API_KEY
- Value: the key from step 1
- Type: Variable (not File)
- Flags: Masked, Protected
Component quick start
Once the component is published, the include replaces the hand-rolled job. It runs the CLI against the .prompt.mdx files changed in the diff and surfaces results inline in the MR widget.
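A minimal sketch of such an include follows. The component name (eval) and the version placeholder are assumptions, since the component is not yet published; copy the exact path from its Catalog page once it is. The input names come from the Inputs table, and the threshold value is illustrative.

```yaml
# Sketch only. The component name (`eval`) and the `<version>` placeholder are
# assumptions; copy the exact include path from the component's Catalog page.
include:
  - component: $CI_SERVER_FQDN/agentmark-ai/eval-component/eval@<version>
    inputs:
      api-key: $AGENTMARK_API_KEY   # masked CI/CD variable; omit for fully local evals
      threshold: 80                 # illustrative pass-rate gate; check the expected input type
```

Pinning the version segment to a released tag, rather than a moving ref, keeps the gate behavior stable between pipeline runs.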
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| api-key | optional | None | AgentMark API key. Required for Cloud-backed runs; omit for fully local evals. |
| prompts | optional | changed .prompt.mdx files | Newline- or space-separated list of prompt files to evaluate. |
| threshold | optional | None | Pass-rate threshold (0–100). Fails the job if overall pass rate is below this number. |
| baseline-ref | optional | MR diff base | Git ref to compare scores against for the regression gate. Resolved to a tree hash and passed to the CLI as --baseline-commit. Requires GIT_DEPTH=0. Set empty ('') to disable. |
| working-directory | optional | . | Directory to run from. |
| results-glob | optional | agentmark-results-*.xml | Pattern for per-prompt JUnit XML output files. Must contain exactly one * wildcard; the prefix and suffix around it become the per-prompt filename template. |
| cli-version | optional | latest | npm version specifier for @agentmark-ai/cli. Pin for reproducible CI. |
| image | optional | node:20-bookworm-slim | Docker image. Must include npm, git, bash. |
What gets gated
Up to four independent gate predicates fire on every run; any failing fails the job.
- Per-row gate: every (row × scorer) pair is a <testcase> in the JUnit XML. If the scorer’s passed flag is false, the component emits <failure> and GitLab reports it inline in the MR widget.
- Threshold gate (optional): when threshold: is set, the job fails if the overall pass rate is below the threshold.
- Regression gate (optional): when baseline-ref: resolves to a prior run and the prompt sets test_settings.regression_tolerance, a row fails if a scorer’s score dropped more than the tolerance below its baseline. This catches silent quality drops even when the scorer still “passes” in absolute terms.
- Per-scorer threshold gate (optional): when the prompt sets test_settings.score_thresholds (a { scorer: minMeanScore } map), the run fails if a scorer’s mean score across the run falls below the configured minimum.
When the job runs
The component’s default rules: runs on:
- every merge request pipeline (the primary gate), and
- pushes to the default branch (so a fresh baseline is recorded after merge).
Override rules: in your .gitlab-ci.yml to change the cadence.
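One way to do this is to redeclare the component's job in your own .gitlab-ci.yml, since GitLab merges local job keys over included ones. The sketch below restricts the job to merge request pipelines only. The job name agentmark-eval is hypothetical; use the name the component actually defines.

```yaml
# Sketch only. `agentmark-eval` is a hypothetical job name; replace it with
# the job name the component defines.
agentmark-eval:
  rules:
    # Run on merge request pipelines only, skipping default-branch pushes.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
```

Dropping the default-branch run means no fresh baseline is recorded after merge, so this variant suits setups that rely on the threshold gates rather than the regression gate.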
With a regression-tolerance threshold
Set the per-case tolerance and run-level floors in the prompt’s frontmatter. The component reads them automatically from test_settings, so it doesn’t need any extra inputs.
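As a rough illustration, the frontmatter of a .prompt.mdx file might carry the two settings like this. The key names test_settings, regression_tolerance, and score_thresholds come from this page; the scorer name exact_match, both numeric values, and the 0 to 1 score scale are assumptions, and the prompt's other frontmatter keys are omitted.

```yaml
---
# Sketch only: the scorer name and both numeric values are illustrative,
# and a 0-1 score scale is assumed.
test_settings:
  regression_tolerance: 0.05   # per-case: allowed drop below the baseline score
  score_thresholds:
    exact_match: 0.8           # run-level: minimum mean score for this scorer
---
```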
baseline-ref defaults to $CI_MERGE_REQUEST_DIFF_BASE_SHA, so MR pipelines pick up the right comparison automatically. The first run on the default branch records the baseline; from then on every MR gates against the run captured at its base commit’s tree hash.
See Regression gates for the full gate semantics. Both the per-case tolerance check and the run-level score_thresholds apply identically here.
Coexists with your existing tests
The CLI job and the component both emit JUnit XML, the same format pytest, jest, and vitest already emit. Failures appear in the MR widget alongside any other failing test, and in the Tests tab of the pipeline view. No new UI to learn, no additional reporter to install.
See also
- Regression gates: full mechanics of the per-case and run-level gates.
- Running experiments: CLI reference and JUnit output details.