Run AgentMark evals in GitLab CI and gate merge requests on the results. The CLI’s --format junit output is JUnit XML, which GitLab parses natively via artifacts:reports:junit:, so failures show up in the MR widget and the pipeline Tests tab with no third-party reporter. The agentmark-ai/eval-component GitLab CI/CD Catalog component automates the wiring: it diffs each merge request, runs @agentmark-ai/cli against the changed .prompt.mdx files, and emits the JUnit XML for you. It shares its contract with the agentmark-ai/eval-action GitHub Actions sibling: both wrap the same CLI command (run-experiment --format junit), accept the same threshold / baseline-ref semantics, and emit the same JUnit XML schema.
Neither the agentmark-ai/eval-component Catalog component nor the agentmark-ai/eval-action GitHub Action is published yet, so include: component: gitlab.com/agentmark-ai/eval-component/eval@v1 will not resolve. Use the raw-CLI setup below; it produces identical JUnit output and gate behavior. The component sections on this page document how the component will work once it’s published.

Raw-CLI fallback

Until the component is published, run the CLI directly in a hand-rolled job. This is also the right choice when you want the full YAML pinned in your repo. run-experiment sends each prompt and dataset to a running AgentMark dev server, so the job boots one headless and waits for it before running the experiment:
agentmark_eval:
  image: node:20-bookworm-slim
  variables:
    GIT_DEPTH: "0"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # Project dependencies: the dev server runs your project's client code
    - npm ci
    # Boot the dev server headless (webhook server on 9417, API server on 9418)
    - npx @agentmark-ai/cli dev --no-ui --no-forward &
    # Wait until the webhook server accepts connections
    - timeout 60 bash -c 'until (echo > /dev/tcp/127.0.0.1/9417) 2>/dev/null; do sleep 1; done'
    # Run the experiment and emit JUnit XML
    - npx @agentmark-ai/cli run-experiment agentmark/qa.prompt.mdx --format junit > results.xml
  artifacts:
    when: always
    reports:
      junit: results.xml
The dev server is what executes your prompts, so it needs your model provider keys (for example OPENAI_API_KEY). Add them as masked CI/CD variables in Settings → CI/CD → Variables; the job environment passes through to the dev server. Without a running server, run-experiment exits with ❌ Could not connect to AgentMark server.

List each prompt you want to gate as its own run-experiment line writing to its own XML file. Add --threshold <percent> for a pass-rate gate, or --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" for the regression gate. The baseline lookup resolves from AgentMark Cloud when an AGENTMARK_API_KEY variable is set (see Set up the API key), and GIT_DEPTH: "0" keeps the diff base resolvable. See Regression gates for the gate mechanics.

Compared to the component, this loses the automatic prompt-diff scoping (you list each prompt manually) and the baseline-ref resolution helper, but the JUnit output and the gate semantics are identical.
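As a rough illustration of gating more than one prompt with the raw CLI, the job below extends the one above with a second run-experiment line. It is a sketch under assumptions: the second prompt path (agentmark/summarize.prompt.mdx), the 90 percent threshold, and the results-*.xml file names are hypothetical, and the flag placement is assumed rather than confirmed against the CLI.

```yaml
# Sketch only: second prompt path, threshold value, and XML file names are hypothetical.
agentmark_eval:
  image: node:20-bookworm-slim
  variables:
    GIT_DEPTH: "0"                      # keeps the MR diff base resolvable
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - npm ci
    - npx @agentmark-ai/cli dev --no-ui --no-forward &
    - timeout 60 bash -c 'until (echo > /dev/tcp/127.0.0.1/9417) 2>/dev/null; do sleep 1; done'
    # One run-experiment line and one XML file per gated prompt
    - npx @agentmark-ai/cli run-experiment agentmark/qa.prompt.mdx --format junit --threshold 90 > results-qa.xml
    - npx @agentmark-ai/cli run-experiment agentmark/summarize.prompt.mdx --format junit --baseline-commit "$CI_MERGE_REQUEST_DIFF_BASE_SHA" > results-summarize.xml
  artifacts:
    when: always
    reports:
      junit:                            # GitLab accepts a list of report files here
        - results-qa.xml
        - results-summarize.xml
```

GitLab stops a script at the first command that exits non-zero, so if an earlier gate fails the job, later prompts are not evaluated in that run. Splitting prompts across separate jobs is one way around that if you want every prompt reported on every pipeline.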

Set up the API key

Add AGENTMARK_API_KEY as a masked, protected CI/CD variable in your project’s Settings → CI/CD → Variables:
1. Get the key from AgentMark Cloud

In the AgentMark Dashboard, open Settings → API Keys and create a key scoped to the app whose prompts you’re gating.

2. Store it as a masked variable

In GitLab, Settings → CI/CD → Variables → Add variable:
  • Key: AGENTMARK_API_KEY
  • Value: the key from step 1
  • Type: Variable (not File)
  • Flags: Masked, Protected

3. Reference it from the job

GitLab injects CI/CD variables into the job environment, where the CLI reads AGENTMARK_API_KEY directly. Once the component ships, pass it via inputs.api-key: $AGENTMARK_API_KEY. Don’t hard-code the key in .gitlab-ci.yml.
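The key is required for Cloud-backed runs (regression-gate baselines, dataset sync). For fully local evals with no Cloud features, you can skip the variable and run without a key. Note that GitLab exposes protected variables only to pipelines running on protected branches or tags, so a merge request pipeline from an unprotected source branch will not see a key flagged Protected; in that case either protect the source branches or leave the Protected flag off.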

Component quick start

Once the component is published, the include replaces the hand-rolled job:
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY    # masked, protected CI variable

variables:
  GIT_DEPTH: "0"                     # required so the diff base resolves
On every MR, the component evaluates the .prompt.mdx files changed in the diff and surfaces results inline in the MR widget.
GIT_DEPTH: "0" is required. GitLab’s default shallow checkout does not contain the diff base, so the component cannot resolve $CI_MERGE_REQUEST_DIFF_BASE_SHA to a tree hash. When that happens the regression gate is disabled for the run rather than failing the job.

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| api-key | optional | None | AgentMark API key. Required for Cloud-backed runs; omit for fully local evals. |
| prompts | optional | changed .prompt.mdx files | Newline- or space-separated list of prompt files to evaluate. |
| threshold | optional | None | Pass-rate threshold (0–100). Fails the job if overall pass rate is below this number. |
| baseline-ref | optional | MR diff base | Git ref to compare scores against for the regression gate. Resolved to a tree hash and passed to the CLI as --baseline-commit. Requires GIT_DEPTH=0. Set empty ('') to disable. |
| working-directory | optional | . | Directory to run from. |
| results-glob | optional | agentmark-results-*.xml | Pattern for per-prompt JUnit XML output files. Must contain exactly one * wildcard; the prefix and suffix around it become the per-prompt filename template. |
| cli-version | optional | latest | npm version specifier for @agentmark-ai/cli. Pin for reproducible CI. |
| image | optional | node:20-bookworm-slim | Docker image. Must include npm, git, bash. |
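To make the inputs concrete, here is a sketch of an include that sets several of them at once. It assumes the component is published and accepts the values as written; the second prompt path, the threshold of 90, and the pinned CLI version 1.2.3 are hypothetical, and whether threshold expects a quoted string or a bare number depends on how the component declares its input types.

```yaml
# Sketch only: assumes the published component accepts these inputs as written.
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY       # masked CI/CD variable
      # Space-separated list; the second path is hypothetical
      prompts: "agentmark/qa.prompt.mdx agentmark/summarize.prompt.mdx"
      threshold: "90"                   # hypothetical value; quoting is an assumption
      working-directory: "."
      results-glob: "agentmark-results-*.xml"   # exactly one * wildcard
      cli-version: "1.2.3"              # hypothetical; pin a real release for reproducible CI

variables:
  GIT_DEPTH: "0"                        # required so baseline-ref resolves
```

Leaving prompts unset keeps the default behavior of evaluating only the .prompt.mdx files changed in the diff; setting it explicitly trades that scoping for a fixed list.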

What gets gated

Up to four independent gates are evaluated on every run; if any one of them fails, the job fails.
  1. Per-row gate: every (row × scorer) pair is a <testcase> in the JUnit XML. If the scorer’s passed flag is false, the component emits <failure> and GitLab reports it inline in the MR widget.
  2. Threshold gate (optional): when threshold: is set, the job fails if the overall pass rate is below the threshold.
  3. Regression gate (optional): when baseline-ref: resolves to a prior run and the prompt sets test_settings.regression_tolerance, a row fails if a scorer’s score dropped more than the tolerance below its baseline. This catches silent quality drops even when the scorer still “passes” in absolute terms.
  4. Per-scorer threshold gate (optional): when the prompt sets test_settings.score_thresholds (a { scorer: minMeanScore } map), the run fails if a scorer’s mean score across the run falls below the configured minimum.
The contract is identical to the GitHub Action because both wrap the same CLI. Regression gates documents the full mechanics: how the baseline is resolved by tree hash, how rows are matched by input content, and how missing baselines stay inert.

When the job runs

By default, the component’s rules: configuration runs the job on:
  • every merge request pipeline (the primary gate), and
  • pushes to the default branch (so a fresh baseline is recorded after merge).
Override in your .gitlab-ci.yml to change the cadence:
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

# Run only on MRs — skip the default-branch baseline write.
agentmark_eval:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
The default-branch run is what populates the baseline that subsequent MRs gate against. Skip it only if you’re recording baselines through a separate process (a scheduled job, a manual trigger, or the SDK).
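If you want to keep both default triggers and add a scheduled run, one option is to restate the rules explicitly. The following is a sketch, not the component's actual definition: the first two rules approximate the defaults described above using standard GitLab predefined variables, and the schedule rule is an assumed addition that only takes effect if you create a pipeline schedule for the project.

```yaml
# Sketch only: approximates the described defaults; the schedule rule is an assumed extra.
agentmark_eval:
  rules:
    # Primary gate: merge request pipelines
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Baseline refresh: pushes to the default branch
    - if: $CI_PIPELINE_SOURCE == "push" && $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Optional: pipelines started by a pipeline schedule
    - if: $CI_PIPELINE_SOURCE == "schedule"
```

Because a job-level rules: list replaces the included one rather than merging with it, every trigger you still want has to appear in the override.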

With a regression-tolerance threshold

Set the per-case tolerance and run-level floors in the prompt’s frontmatter. The component reads them automatically from test_settings, so it doesn’t need any extra inputs.
# agentmark/qa.prompt.mdx (frontmatter)
test_settings:
  dataset: ./data/qa.jsonl
  regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
  score_thresholds:
    groundedness: 0.9                   # fail the run if mean groundedness < 0.9
# .gitlab-ci.yml
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

variables:
  GIT_DEPTH: "0"
baseline-ref defaults to $CI_MERGE_REQUEST_DIFF_BASE_SHA, so MR pipelines pick up the right comparison automatically. The first run on the default branch records the baseline; from then on every MR gates against the run captured at its base commit’s tree hash. See Regression gates for the full gate semantics. Both the per-case tolerance check and the run-level score_thresholds apply identically here.

Coexists with your existing tests

The CLI job and the component both emit JUnit XML, the same format pytest, jest, and vitest already emit. Failures appear in the MR widget alongside any other failing test, and in the Tests tab of the pipeline view. No new UI to learn, no additional reporter to install.
