The agentmark-ai/eval-component GitLab CI/CD Catalog component diffs each merge request, runs @agentmark-ai/cli against the changed .prompt.mdx files, and emits JUnit XML. Failures show up natively in the MR widget and the pipeline Tests tab: GitLab parses the JUnit via artifacts:reports:junit, so no third-party reporter is required.
This is the GitLab counterpart of agentmark-ai/eval-action. Both wrap the same CLI command (agentmark run-experiment --format junit), accept the same threshold / baseline-ref semantics, and emit the same JUnit XML schema. Switching CI platforms doesn’t require relearning the gates.
Note: The agentmark-ai/eval-component Catalog project publishes alongside the first GitLab-parity release. If gitlab.com/agentmark-ai/eval-component/eval@v1 resolves to a 404 for you, the component hasn’t been published yet; use the raw-CLI fallback at the bottom of this page in the meantime (it runs the same gate from a hand-rolled .gitlab-ci.yml).

Quick start
Paste this into your repo’s `.gitlab-ci.yml`:
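A minimal sketch of such an include, assuming the component path `gitlab.com/agentmark-ai/eval-component/eval@v1` named in the note above and the `api-key` input from the Inputs table. Passing the key as the `$AGENTMARK_API_KEY` CI/CD variable is an assumption based on the setup section that follows; check the component README for the exact form it expects.

```yaml
# Sketch only: the component path and input name come from this page;
# how the component consumes the key is an assumption.
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY   # masked CI/CD variable, defined in project settings
```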
The component evaluates the `.prompt.mdx` files changed in the diff and surfaces results inline in the MR widget.
Set up the API key
Add `AGENTMARK_API_KEY` as a masked, protected CI/CD variable in your project’s Settings → CI/CD → Variables:
1. Get the key from AgentMark Cloud
   In the AgentMark Dashboard, open Settings → API Keys and create a key scoped to the app whose prompts you’re gating.
2. Store it as a masked variable
   In GitLab, Settings → CI/CD → Variables → Add variable:
   - Key: `AGENTMARK_API_KEY`
   - Value: the key from step 1
   - Type: Variable (not File)
   - Flags: Masked, Protected
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| `api-key` | optional | (none) | AgentMark API key. Required for cloud-backed runs; omit for fully local evals. |
| `prompts` | optional | changed `.prompt.mdx` files | Newline- or space-separated list of prompt files to evaluate. |
| `threshold` | optional | (none) | Pass-rate threshold (0–100). Fails the job if the overall pass rate is below this number. |
| `baseline-ref` | optional | MR diff base | Git ref to compare scores against for the regression gate. Resolved to a tree hash and passed to the CLI as `--baseline-commit`. Requires `GIT_DEPTH=0`. Set to an empty string (`''`) to disable. |
| `working-directory` | optional | `.` | Directory to run from. |
| `results-glob` | optional | `agentmark-results-*.xml` | Pattern for per-prompt JUnit XML output files. Must contain exactly one `*` wildcard; the prefix and suffix around it become the per-prompt filename template. |
| `cli-version` | optional | `latest` | npm version specifier for `@agentmark-ai/cli`. Pin it for reproducible CI. |
| `image` | optional | `node:20-bookworm-slim` | Docker image. Must include npm, git, and bash. |
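As an illustration of how several of these inputs combine, here is a hedged sketch of an include that sets a pass-rate threshold, a working directory, and a pinned CLI version. The input names come from the table above; the values (`90`, `apps/assistant`, `1.0.0`) are hypothetical placeholders, and whether `threshold` is declared as a number or a string in the component spec is not stated on this page.

```yaml
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY
      threshold: 90                       # hypothetical value; quote it if the input is declared as a string
      working-directory: apps/assistant   # hypothetical path inside your repo
      cli-version: "1.0.0"                # hypothetical version; pin to one you have tested
```

Pinning `cli-version` trades automatic upgrades for reproducibility, which is usually the better default for a merge gate.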
What gets gated
Up to four independent gate predicates fire on every run; any failing predicate fails the job.
- Per-row gate: every `(row × scorer)` pair is a `<testcase>` in the JUnit XML. If the scorer’s `passed` flag is `false`, the component emits `<failure>` and GitLab reports it inline in the MR widget.
- Threshold gate (optional): when `threshold:` is set, the job fails if the overall pass rate is below the threshold.
- Regression gate (optional): when `baseline-ref:` resolves to a prior run and the prompt sets `test_settings.regression_tolerance`, a row fails if a scorer’s score dropped more than the tolerance below its baseline. This catches silent quality drops even when the scorer still “passes” in absolute terms.
- Per-scorer threshold gate (optional): when the prompt sets `test_settings.score_thresholds` (a `{ scorer: minMeanScore }` map), the run fails if a scorer’s mean score across the run falls below the configured minimum.
When the job runs
The component’s default `rules:` run the job on:
- every merge request pipeline (the primary gate), and
- pushes to the default branch (so a fresh baseline is recorded after merge).
Override `rules:` in your `.gitlab-ci.yml` to change the cadence:
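One way to do this is GitLab’s standard merge behaviour for included configuration: a job defined locally with the same name as an included job overrides the matching keys. The sketch below assumes the component’s job is called `agentmark-eval`, which is a hypothetical name not confirmed by this page; substitute the job name the component actually defines.

```yaml
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

# "agentmark-eval" is a hypothetical job name used for illustration.
agentmark-eval:
  rules:
    # Run only in merge request pipelines, and only when a prompt file changed.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      changes:
        - "**/*.prompt.mdx"
```

Dropping the default-branch rule this way also means no fresh baseline is recorded after merge, so keep that rule if you rely on the regression gate.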
With a regression-tolerance threshold
Set the per-case tolerance and run-level floors in the prompt’s frontmatter; the component reads them automatically from `test_settings`. The component itself doesn’t need any extra inputs.
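A sketch of what that frontmatter could look like. The keys `test_settings.regression_tolerance` and `test_settings.score_thresholds` are the ones named on this page; the scorer names, the numeric values, and the score scale are hypothetical, so check Regression gates for the real units.

```yaml
# Goes inside the frontmatter block of a .prompt.mdx file.
test_settings:
  regression_tolerance: 0.05   # hypothetical: allowed per-case drop below the baseline score
  score_thresholds:            # { scorer: minMeanScore } map
    accuracy: 0.8              # hypothetical scorer name and run-level floor
    tone: 0.7                  # hypothetical scorer name and run-level floor
```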
baseline-ref defaults to $CI_MERGE_REQUEST_DIFF_BASE_SHA, so MR pipelines pick up the right comparison automatically. The first run on the default branch records the baseline; from then on every MR gates against the run captured at its base commit’s tree hash.
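The baseline behaviour can also be spelled out explicitly. The sketch below sets `GIT_DEPTH` as a top-level variable (the Inputs table says the regression gate requires `GIT_DEPTH=0`) and passes the documented default for `baseline-ref`; setting the variable at the top level rather than on the job is an assumption about how you lay out your pipeline.

```yaml
variables:
  GIT_DEPTH: "0"   # full clone, so the baseline ref can be resolved to a tree hash

include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY
      baseline-ref: $CI_MERGE_REQUEST_DIFF_BASE_SHA   # the documented default, written out
      # baseline-ref: ''                              # alternative: disable the regression gate
```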
See Regression gates for the full gate semantics — both the per-case tolerance check and the run-level score_thresholds apply identically here.
Coexists with your existing tests
The component emits JUnit XML, the same format `pytest`, `jest`, and `vitest` already emit. Failures appear in the MR widget alongside any other failing test, and in the Tests tab of the pipeline view. No new dashboard to learn, no additional reporter to install.
Raw-CLI fallback
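A hand-rolled job can approximate the same gate without the Catalog component. The sketch below is built from details given on this page (the `agentmark run-experiment --format junit` command, the `@agentmark-ai/cli` package, the `node:20-bookworm-slim` image, the `agentmark-results-*.xml` pattern, and `artifacts:reports:junit`). Everything else is an assumption: the job name, that the CLI takes the prompt path as a positional argument, that it writes JUnit XML to stdout, that it exits non-zero on a failed gate, and that it reads `AGENTMARK_API_KEY` from the environment.

```yaml
agentmark-eval:                        # hypothetical job name
  image: node:20-bookworm-slim
  variables:
    GIT_DEPTH: "0"                     # full history, so the diff base is available
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  before_script:
    # The slim Node image does not ship git, which the diff step needs.
    - apt-get update && apt-get install -y --no-install-recommends git ca-certificates
    - npm install --global @agentmark-ai/cli
  script:
    - |
      # List added or modified prompt files relative to the MR diff base.
      changed=$(git diff --name-only --diff-filter=d "$CI_MERGE_REQUEST_DIFF_BASE_SHA" HEAD -- '*.prompt.mdx')
      status=0
      i=0
      # Note: this loop splits on whitespace, so it assumes paths without spaces.
      for prompt in $changed; do
        i=$((i + 1))
        # Assumption: positional prompt path, JUnit XML on stdout, non-zero exit on failure.
        agentmark run-experiment "$prompt" --format junit > "agentmark-results-$i.xml" || status=1
      done
      exit $status
  artifacts:
    when: always                       # upload reports even when the gate fails
    reports:
      junit: agentmark-results-*.xml
```

Running every changed prompt before exiting, rather than stopping at the first failure, keeps all results visible in the Tests tab for a single pipeline run.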
If you can’t yet consume the Catalog component (it hasn’t been published yet, as described in the note at the top of this page, or you’re pinning the CLI to a specific version and want the YAML in your repo), drop down to the raw CLI in a hand-rolled .gitlab-ci.yml job.

See also
- agentmark-ai/eval-component README: the source, examples, and changelog.
- agentmark-ai/eval-action: GitHub Actions sibling that wraps the same CLI.
- Regression gates: full mechanics of the per-case and run-level gates the component applies.
- Running experiments: CLI reference and JUnit output details.