GitLab CI/CD

The agentmark-ai/eval-component GitLab CI/CD Catalog component diffs each merge request, runs @agentmark-ai/cli against the changed .prompt.mdx files, and emits JUnit XML. Failures show up natively in the MR widget and the pipeline Tests tab: GitLab parses the JUnit XML via artifacts:reports:junit, so no third-party reporter is required.

This is the GitLab counterpart of agentmark-ai/eval-action. Both wrap the same CLI command (agentmark run-experiment --format junit), accept the same threshold / baseline-ref semantics, and emit the same JUnit XML schema, so switching CI platforms doesn’t require relearning the gates.

The agentmark-ai/eval-component Catalog project publishes alongside the first GitLab-parity release. If gitlab.com/agentmark-ai/eval-component/eval@v1 resolves to a 404 for you, the component hasn’t been published yet — use the raw-CLI fallback at the bottom of this page in the meantime (it runs the same gate from a hand-rolled .gitlab-ci.yml).

Quick start

Paste this into your repo’s .gitlab-ci.yml:

include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY    # masked, protected CI variable

variables:
  GIT_DEPTH: "0"                     # required so the diff base resolves

That’s it. On every MR, the component evaluates the .prompt.mdx files changed in the diff and surfaces results inline in the MR widget.

GIT_DEPTH: "0" is required. GitLab’s default shallow checkout does not contain the diff base, so the component cannot resolve $CI_MERGE_REQUEST_DIFF_BASE_SHA to a tree hash. When that happens the regression gate is disabled for the run rather than failing the job.
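To confirm that the full-history checkout took effect, a throwaway job along these lines can help. It is only a sketch: the job name check_diff_base is hypothetical, the full node:20-bookworm image is assumed because it ships git, and the component's own resolution logic is not shown on this page and may differ. The job uses plain git to resolve the diff base to a tree hash, which is the kind of lookup the regression gate depends on.

```yaml
# Hypothetical pre-flight job; not part of the component.
check_diff_base:
  image: node:20-bookworm            # full image, assumed here because it includes git
  variables:
    GIT_DEPTH: "0"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # Prints the tree hash, or exits non-zero if the diff base is missing from the checkout.
    - git rev-parse --verify "${CI_MERGE_REQUEST_DIFF_BASE_SHA}^{tree}"
```

If this job fails while GIT_DEPTH is set, check whether a project-level or runner-level clone setting is overriding the variable.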

Set up the API key

Add AGENTMARK_API_KEY as a masked, protected CI/CD variable in your project’s Settings → CI/CD → Variables:

1. Get the key from AgentMark Cloud

In the AgentMark Dashboard, open Settings → API Keys and create a key scoped to the app whose prompts you’re gating.

2. Store it as a masked variable

In GitLab, Settings → CI/CD → Variables → Add variable:

Key: AGENTMARK_API_KEY
Value: the key from step 1
Type: Variable (not File)
Flags: Masked, Protected

3. Reference it in inputs

The component reads it via inputs.api-key: $AGENTMARK_API_KEY. Don’t hard-code the key in .gitlab-ci.yml.

The key is required for cloud-backed runs (regression-gate baselines, dataset sync). For fully local evals — no Cloud features — you can omit the input and run without a key.

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| `api-key` | optional | none | AgentMark API key. Required for cloud-backed runs; omit for fully local evals. |
| `prompts` | optional | changed `.prompt.mdx` files | Newline- or space-separated list of prompt files to evaluate. |
| `threshold` | optional | none | Pass-rate threshold (0–100). Fails the job if the overall pass rate is below this number. |
| `baseline-ref` | optional | MR diff base | Git ref to compare scores against for the regression gate. Resolved to a tree hash and passed to the CLI as `--baseline-commit`. Requires `GIT_DEPTH=0`. Set empty (`''`) to disable. |
| `working-directory` | optional | `.` | Directory to run from. |
| `results-glob` | optional | `agentmark-results-*.xml` | Pattern for per-prompt JUnit XML output files. Must contain exactly one `*` wildcard; the prefix and suffix around it become the per-prompt filename template. |
| `cli-version` | optional | `latest` | npm version specifier for `@agentmark-ai/cli`. Pin for reproducible CI. |
| `image` | optional | `node:20-bookworm-slim` | Docker image. Must include npm, git, bash. |
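
As an illustration of how several inputs combine, the sketch below sets a pass-rate floor, pins the CLI, and runs from a subdirectory. The input names come from the table above; every value is a placeholder (the version number and directory in particular are hypothetical), and whether threshold needs quoting depends on how the component declares that input's type, which this page doesn't state.

```yaml
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY
      threshold: 80                          # fail the job if overall pass rate < 80 (numeric type assumed)
      cli-version: "1.2.3"                   # hypothetical version; substitute a published release
      working-directory: apps/support-bot    # hypothetical monorepo path
      # baseline-ref: ''                     # uncomment to disable the regression gate

variables:
  GIT_DEPTH: "0"
```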

What gets gated

Up to four independent gates are evaluated on each run; if any one of them fails, the job fails.

Per-row gate — every (row × scorer) pair is a <testcase> in the JUnit XML. If the scorer’s passed flag is false, the component emits <failure> and GitLab reports it inline in the MR widget.
Threshold gate (optional) — when threshold: is set, the job fails if the overall pass rate is below the threshold.
Regression gate (optional) — when baseline-ref: resolves to a prior run and the prompt sets test_settings.regression_tolerance, a row fails if a scorer’s score dropped more than the tolerance below its baseline. Catches silent quality drops even when the scorer still “passes” in absolute terms.
Per-scorer threshold gate (optional) — when the prompt sets test_settings.score_thresholds (a { scorer: minMeanScore } map), the run fails if a scorer’s mean score across the run falls below the configured minimum.

The contract is identical to the GitHub Action because both wrap the same CLI. The full mechanics — how the baseline is resolved by tree hash, how rows are matched by input content, how missing baselines stay inert — are documented in Regression gates.

When the job runs

By default, the component’s rules: run the job on:

every merge request pipeline (the primary gate), and
pushes to the default branch (so a fresh baseline is recorded after merge).

Override in your .gitlab-ci.yml to change the cadence:

include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

# Run only on MRs — skip the default-branch baseline write.
agentmark_eval:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

The default-branch run is what populates the baseline that subsequent MRs gate against. Skip it only if you’re recording baselines through a separate process (a scheduled job, a manual trigger, or the SDK).
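
Because a job-level rules: in your .gitlab-ci.yml replaces the included job's rules rather than merging with them, keeping both default triggers while adding one of your own means restating them. A sketch, assuming the defaults are equivalent to the two conditions described above (the component's actual rule expressions aren't shown on this page) and using a scheduled pipeline as the illustrative extra trigger:

```yaml
agentmark_eval:
  rules:
    # MR gate
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Baseline write after merge to the default branch
    - if: $CI_PIPELINE_SOURCE == "push" && $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Illustrative addition: also run on scheduled pipelines
    - if: $CI_PIPELINE_SOURCE == "schedule"
```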

With a regression-tolerance threshold

Set the per-case tolerance and run-level floors in the prompt’s frontmatter — the component reads them automatically from test_settings. The component itself doesn’t need any extra inputs.

# agentmark/qa.prompt.mdx (frontmatter)
test_settings:
  dataset: ./data/qa.jsonl
  regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
  score_thresholds:
    groundedness: 0.9                   # fail the run if mean groundedness < 0.9

# .gitlab-ci.yml
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

variables:
  GIT_DEPTH: "0"

baseline-ref defaults to $CI_MERGE_REQUEST_DIFF_BASE_SHA, so MR pipelines pick up the right comparison automatically. The first run on the default branch records the baseline; from then on every MR gates against the run captured at its base commit’s tree hash. See Regression gates for the full gate semantics — both the per-case tolerance check and the run-level score_thresholds apply identically here.

Coexists with your existing tests

The component emits JUnit XML — the same format pytest, jest, and vitest already emit. Failures appear in the MR widget alongside any other failing test, and in the Tests tab of the pipeline view. No new dashboard to learn, no additional reporter to install.

Raw-CLI fallback

If you can’t yet consume the Catalog component — it hasn’t been published yet (see the note at the top of this page), or you’re pinning the CLI to a specific version and want the YAML in your repo — drop down to the raw CLI:

agentmark_eval:
  image: node:20-bookworm-slim
  variables:
    GIT_DEPTH: "0"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - npm install -g @agentmark-ai/cli@latest
    - npx agentmark run-experiment agentmark/qa.prompt.mdx --format junit > results.xml
  artifacts:
    when: always
    reports:
      junit: results.xml

This loses the automatic prompt-diff scoping (you list each prompt manually) and the baseline-ref resolution helper, but the JUnit output and the gate semantics are identical.
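
For more than one prompt, the manual listing can be reduced to a loop. The sketch below assumes that prompts live under agentmark/ (as in the example above), that the CLI exits non-zero when a gate fails, and that one XML file per prompt is acceptable; the agentmark-results-*.xml naming simply mirrors the component's default results-glob. None of this is confirmed as the component's internal behavior.

```yaml
agentmark_eval:
  image: node:20-bookworm-slim
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - npm install -g @agentmark-ai/cli@latest
    - |
      status=0
      for prompt in agentmark/*.prompt.mdx; do
        name="$(basename "$prompt" .prompt.mdx)"
        # Keep going after a failure so every prompt still produces a report.
        npx agentmark run-experiment "$prompt" --format junit > "agentmark-results-${name}.xml" || status=1
      done
      exit "$status"
  artifacts:
    when: always
    reports:
      junit: agentmark-results-*.xml
```

Deferring the exit until after the loop means a failing prompt does not hide the results of the prompts that come after it.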
