When you run the same evaluation repeatedly, across PRs, machines, and months, AgentMark needs a way to know that two runs are the same experiment at different points in time. That identity powers regression gates: a run can be compared against an earlier “baseline” run only if AgentMark can line the two up.
This page explains the model behind that lineup. Three independent axes are at work, and keeping them separate is the whole idea:
  1. Which evaluation is this? The experiment_key.
  2. Which code state did it run against? The git tree hash.
  3. Which deployed version served the prompt? The commit SHA.
For the steps to wire a regression gate into CI, see Regression gates. This page is the “why”.

The experiment key: stable identity

An experiment_key is the stable name of one evaluation. Two runs share a baseline relationship only if they share an experiment_key, so this is the identity that has to stay constant as everything else (the code, the dataset order, the machine) changes around it.
By default, the key is the prompt’s repo-relative entrypoint path, for example ./prompts/qa.prompt.mdx. That default does two useful things: two distinct evaluations never collide even when they happen to share a dataset, and the identity is derived from something already stable in your repo.
You set the key explicitly in two situations:
  • The subject has no single entrypoint file. A code-assembled agent or a multi-step workflow isn’t one .prompt.mdx, so there’s no path to default to. You give it a name like support-agent.
  • You want the identity to survive a rename. If you move or rename the prompt file, a path-derived key changes and the baseline chain breaks. Pinning test_settings.experiment_key keeps it stable across the rename.
The key must resolve to the same string in every environment that runs it. The CLI derives the default from the git top level, but a run recorded where git is unavailable falls back to the prompt name or file basename, which won’t match a git-derived key. If your baseline and candidate run in different environments, set experiment_key explicitly so they can’t drift apart.
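The resolution order described above can be sketched in a few lines. This is a minimal illustration, not AgentMark’s implementation: the function name resolve_experiment_key and its parameters are hypothetical, repo_root stands in for whatever the git top level resolves to, and whether the basename fallback keeps the file extension is an assumption.

```python
from __future__ import annotations

from pathlib import Path


def resolve_experiment_key(
    prompt_file: Path,
    explicit_key: str | None = None,
    repo_root: Path | None = None,   # git top level, or None when git is unavailable
    prompt_name: str | None = None,
) -> str:
    """Hypothetical sketch of the key resolution order described in the text."""
    # 1. An explicitly pinned key always wins, so it survives renames
    #    and is identical in every environment.
    if explicit_key:
        return explicit_key

    # 2. Default: the prompt's repo-relative entrypoint path.
    if repo_root is not None:
        try:
            relative = prompt_file.relative_to(repo_root)
            return "./" + relative.as_posix()
        except ValueError:
            pass  # file is outside the repo; fall through to the fallback

    # 3. No git: fall back to the prompt name, then the file basename.
    #    (Keeping the extension here is an assumption.)
    return prompt_name or prompt_file.name
```

Calling this for the same file with and without repo_root yields ./prompts/qa.prompt.mdx in one case and qa.prompt.mdx (or the prompt name) in the other, which is the drift that an explicit key avoids.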

Tree hashes: why baselines are content-addressed

Once AgentMark knows which evaluation a run belongs to, it needs to find the right baseline run to compare against. It addresses runs by the git tree hash of the code at the run’s commit, not by the commit SHA.
The reason is content-addressing. A tree hash is a fingerprint of file contents, so two commits with identical file contents produce the same tree hash. Two commits with different histories but the same files are, for evaluation purposes, the same code state, and they should share a baseline. Addressing by tree hash gets that for free:
  • A rebase, a squash, or a cherry-pick that doesn’t change the files keeps the tree hash, so the baseline still matches.
  • A merge commit that changes nothing in your files doesn’t invalidate the comparison.
  • Only an actual change to the evaluated content produces a new tree hash, which is exactly when you’d want a fresh baseline.
When you gate a PR, AgentMark resolves the baseline by experiment_key, environment, and the tree hash at the PR’s base commit. It prefers the run recorded at that exact tree hash. If none exists, it falls back to the most recent prior run of the same key and reports that it did so, so a fallback comparison is never silent.
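The exact-match-then-recency rule can be illustrated with a small in-memory sketch. The Run and BaselineChoice types, the resolve_baseline function, and the before cutoff are all hypothetical names for illustration. Scoping the fallback to the same environment, and picking the most recent run when several share the exact tree hash, are assumptions rather than documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Run:
    run_id: str
    experiment_key: str
    environment: str
    tree_hash: str
    recorded_at: datetime


@dataclass(frozen=True)
class BaselineChoice:
    run: Run | None        # None when the lineage has no prior run at all
    used_fallback: bool    # True when no run matched the exact tree hash


def resolve_baseline(
    runs: list[Run],
    experiment_key: str,
    environment: str,
    base_tree_hash: str,
    before: datetime,
) -> BaselineChoice:
    # Only prior runs of the same key (and, by assumption, environment) qualify.
    lineage = [
        r for r in runs
        if r.experiment_key == experiment_key
        and r.environment == environment
        and r.recorded_at < before
    ]
    if not lineage:
        return BaselineChoice(run=None, used_fallback=False)

    # Prefer runs recorded at the exact tree hash of the PR's base commit.
    exact = [r for r in lineage if r.tree_hash == base_tree_hash]
    pool = exact or lineage
    newest = max(pool, key=lambda r: r.recorded_at)
    return BaselineChoice(run=newest, used_fallback=not exact)
```

Returning used_fallback alongside the run is what lets a caller report the fallback instead of comparing silently.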
Run git rev-parse <ref>^{tree} to convert any commit ref to its tree hash. The CLI does this for you from --baseline-commit; the SDK takes tree hashes directly because it can’t assume a git checkout. Because the base ref has to be resolved from the local checkout, a regression gate needs full git history in CI (fetch-depth: 0): a shallow checkout can’t resolve the base ref to a tree hash, so the comparison degrades to the recency fallback.
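If you call the SDK from a script that does have a checkout, you can compute the tree hash yourself. The sketch below only wraps the git rev-parse command shown above; the helper names tree_hash_argv and tree_hash are hypothetical and not part of AgentMark.

```python
from __future__ import annotations

import subprocess


def tree_hash_argv(ref: str) -> list[str]:
    """Build the git command that peels a commit ref to its tree object."""
    return ["git", "rev-parse", f"{ref}^{{tree}}"]


def tree_hash(ref: str, cwd: str = ".") -> str:
    """Return the tree hash for `ref`, raising if git cannot resolve it.

    In a shallow checkout the base ref may be missing, in which case
    git exits non-zero and subprocess raises CalledProcessError.
    """
    result = subprocess.run(
        tree_hash_argv(ref),
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Two refs whose commits contain identical files return the same string from tree_hash, even when their commit SHAs differ, which is the property baseline matching relies on.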

Commit SHA: the deployed-version axis

The commit SHA answers a different question from the tree hash: which deployed version of the prompt actually served this run? It’s the axis that links a trace back to a point in your deployment history, and it’s recorded separately from the tree hash used for baseline matching.
AgentMark stamps it automatically. When a prompt is loaded from AgentMark Cloud, the gateway records the commit the content was served at into the prompt’s metadata: the pinned environment’s commit for a key bound to a pinned environment, or the latest synced commit otherwise. The local dev server stamps your repo’s HEAD the same way. The SDK echoes that commit onto the trace with no code change on your part.
What makes this axis trustworthy is server-side verification. On ingest, the gateway checks the SDK-supplied commit against the environment’s own deployment pointer (trust, but verify). For an API key bound to an environment, the recorded commit always reflects the server’s deployment record, so a client can’t claim it ran an arbitrary version. That’s what lets you answer “what version did prod run last Tuesday?” from trace data and trust the answer.
Keeping the two axes distinct matters: the tree hash decides what to compare against, while the commit SHA records what was deployed. A run can match a baseline by tree hash even though it carries a different commit SHA, because identical content can ship from different commits.

Row matching: input hashes, not positions

Identity at the run level isn’t enough. To report that a specific test case regressed, AgentMark also has to line up rows between this run and the baseline. It does that with a content hash of each dataset row’s input, not by row position or by an ID. The payoff is that your dataset is free to move:
  • Reordering rows doesn’t break the comparison, because matching is by content, not position.
  • Regenerating row IDs doesn’t break it either, because IDs aren’t used for matching.
  • Adding or removing rows only affects the rows that changed; the rest still line up.
The trade-off is that the input has to stay byte-stable to match. If you redact or rewrite the stored input (a masking function, or the tracing option that hides inputs), the live rows hash differently from the baseline and match nothing. A baseline that matched zero rows is treated as inert, the same as having no baseline, and AgentMark reports it rather than failing silently.
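To make the matching rule concrete, here is a minimal sketch of content-hash row matching. It is an illustration only: the source does not specify AgentMark’s hash function or serialization, so SHA-256 over canonical JSON is an assumption, and input_hash, match_rows, and the row shape (dicts with "id" and "input" keys) are hypothetical.

```python
import hashlib
import json


def input_hash(row_input) -> str:
    """Fingerprint a row's input. SHA-256 over canonical JSON is an assumption."""
    canonical = json.dumps(
        row_input, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def match_rows(baseline_rows, candidate_rows):
    """Pair each candidate row with the baseline row that has the same input.

    Position and row IDs are ignored; unmatched candidate rows are skipped.
    """
    baseline_by_hash = {input_hash(row["input"]): row for row in baseline_rows}
    pairs = []
    for row in candidate_rows:
        counterpart = baseline_by_hash.get(input_hash(row["input"]))
        if counterpart is not None:
            pairs.append((counterpart, row))
    return pairs
```

An empty result from match_rows corresponds to the inert-baseline case: if the stored inputs were masked or rewritten, no hash lines up and there is nothing to compare.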

How the pieces fit

A regression gate combines three matching steps:
  1. experiment_key selects the lineage of runs to consider.
  2. The tree hash picks the baseline run within that lineage (exact match preferred, recency as a reported fallback).
  3. Input-hash row matching aligns each case in this run with its counterpart in that baseline, so a per-case score drop can be detected.
The commit SHA rides alongside as the deployed-version record. It doesn’t decide the comparison, but it’s what makes any run, baseline or candidate, traceable back to a specific deployment.
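Once rows are aligned by input hash, detecting a per-case drop is a simple comparison. The sketch below assumes each run’s scores are already keyed by input hash; find_regressions and the tolerance parameter are hypothetical, and the actual thresholds a gate applies are not specified here.

```python
from __future__ import annotations


def find_regressions(
    baseline_scores: dict[str, float],
    candidate_scores: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return the input hashes whose score dropped by more than `tolerance`.

    Rows present in only one run are ignored: they have no counterpart
    to compare against.
    """
    regressed = []
    for row_hash, new_score in candidate_scores.items():
        old_score = baseline_scores.get(row_hash)
        if old_score is not None and new_score < old_score - tolerance:
            regressed.append(row_hash)
    return sorted(regressed)
```

A gate built on this would fail the build when the returned list is non-empty, and would report separately when the two runs share no rows at all.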

Where to go next

  • Regression gates: Wire the baseline comparison into a PR or CI build.
  • Running experiments: Run an experiment from the CLI, SDK, or Dashboard.
  • Span reference: The exact attributes that carry experiment key, tree hash, and commit SHA.
  • Environments: How a pinned environment’s commit becomes the served-at version.
