- Which evaluation is this? The experiment_key.
- Which code state did it run against? The git tree hash.
- Which deployed version served the prompt? The commit SHA.
The experiment key: stable identity
An experiment_key is the stable name of one evaluation. Two runs share a baseline relationship only if they share an experiment_key, so this is the identity that has to stay constant as everything else (the code, the dataset order, the machine) changes around it.
By default, the key is the prompt’s repo-relative entrypoint path, for example ./prompts/qa.prompt.mdx. That default does two useful things: two distinct evaluations never collide even when they happen to share a dataset, and the identity is derived from something already stable in your repo.
You set the key explicitly in two situations:
- The subject has no single entrypoint file. A code-assembled agent or a multi-step workflow isn’t one .prompt.mdx, so there’s no path to default to. You give it a name like support-agent.
- You want the identity to survive a rename. If you move or rename the prompt file, a path-derived key changes and the baseline chain breaks. Pinning test_settings.experiment_key keeps it stable across the rename.
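The default-versus-pinned behavior can be illustrated with a minimal sketch. The function name resolve_experiment_key and its parameters are hypothetical, chosen for illustration only; they are not part of the AgentMark SDK.

```python
from typing import Optional


def resolve_experiment_key(entrypoint: Optional[str],
                           explicit_key: Optional[str] = None) -> str:
    """Hypothetical helper: a pinned key wins, else fall back to the entrypoint path."""
    if explicit_key:
        return explicit_key
    if entrypoint is None:
        # A code-assembled agent has no file path to default to.
        raise ValueError("no entrypoint file: set an explicit experiment_key")
    return entrypoint


# Path-derived key: renaming the file changes the identity.
before = resolve_experiment_key("./prompts/qa.prompt.mdx")
after = resolve_experiment_key("./prompts/question-answering.prompt.mdx")

# Pinned key: the identity survives the rename.
pinned_before = resolve_experiment_key("./prompts/qa.prompt.mdx", explicit_key="qa")
pinned_after = resolve_experiment_key(
    "./prompts/question-answering.prompt.mdx", explicit_key="qa"
)
```

Here before and after differ, so the baseline chain would break across the rename, while pinned_before and pinned_after are the same key.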
Tree hashes: why baselines are content-addressed
Once AgentMark knows which evaluation a run belongs to, it needs to find the right baseline run to compare against. It addresses runs by the git tree hash of the code at the run’s commit, not by the commit SHA.
The reason is content-addressing. A tree hash is a fingerprint of file contents, so two commits with identical file contents produce the same tree hash. Two commits with different histories but the same files are, for evaluation purposes, the same experiment, and they should share a baseline. Addressing by tree hash gets that for free:
- A rebase, a squash, or a cherry-pick that doesn’t change the files keeps the tree hash, so the baseline still matches.
- A merge commit that changes nothing in your files doesn’t invalidate the comparison.
- Only an actual change to the evaluated content produces a new tree hash, which is exactly when you’d want a fresh baseline.
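To make the content-addressing idea concrete, here is a simplified sketch. It is not git's actual tree-object algorithm (git hashes nested tree objects that also include file modes); the function content_fingerprint is a hypothetical stand-in that only shows why a fingerprint built from paths and contents ignores commit history.

```python
import hashlib


def content_fingerprint(files: dict[str, bytes]) -> str:
    """Hash file paths and contents only; commit history never enters the digest."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sorted so insertion order does not matter
        name = path.encode("utf-8")
        data = files[path]
        # Length-prefix each field so adjacent fields cannot blur together.
        digest.update(f"{len(name)}:".encode() + name)
        digest.update(f"{len(data)}:".encode() + data)
    return digest.hexdigest()


original = {"prompts/qa.prompt.mdx": b"Answer the question."}
rebased = dict(original)  # same files, reached through a different history
edited = {"prompts/qa.prompt.mdx": b"Answer the question briefly."}

same_after_rebase = content_fingerprint(original) == content_fingerprint(rebased)
changed_after_edit = content_fingerprint(original) != content_fingerprint(edited)
```

Both flags come out true: the rebased snapshot keeps its fingerprint, and only the edited content gets a new one.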
A regression gate looks up the baseline by experiment_key, environment, and the tree hash at the PR’s base commit. It prefers the run recorded at that exact tree hash. If none exists, it falls back to the most recent prior run of the same key and reports that it did so, so a fallback comparison is never silent.
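The lookup order described here (exact tree hash first, then the most recent prior run, with the fallback reported) can be sketched as follows. The Run record, its field names, and select_baseline are hypothetical. Two details are assumptions: the fallback is scoped to the same environment, and the most recent run wins when several share a tree hash.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    experiment_key: str
    environment: str
    tree_hash: str
    created_at: int  # e.g. a Unix timestamp


def select_baseline(prior_runs: list[Run], key: str, environment: str,
                    base_tree: str) -> tuple[Optional[Run], bool]:
    """Return (baseline, used_fallback) from runs recorded before the current one."""
    lineage = [r for r in prior_runs
               if r.experiment_key == key and r.environment == environment]
    exact = [r for r in lineage if r.tree_hash == base_tree]
    if exact:
        return max(exact, key=lambda r: r.created_at), False
    if lineage:
        # No run at the base tree hash: use recency and flag it for the report.
        return max(lineage, key=lambda r: r.created_at), True
    return None, False


runs = [
    Run("qa", "prod", "aaa", 1),
    Run("qa", "prod", "bbb", 2),
    Run("other", "prod", "ccc", 3),
]
exact_match = select_baseline(runs, "qa", "prod", "aaa")
fallback_match = select_baseline(runs, "qa", "prod", "zzz")
```

Returning the used_fallback flag alongside the run is what keeps a recency-based comparison from being silent: the caller has to decide how to surface it.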
Run git rev-parse <ref>^{tree} to convert any commit ref to its tree hash. The CLI does this for you from --baseline-commit; the SDK takes the tree hashes directly because it can’t assume a git checkout. This is why a regression gate needs full git history in CI (fetch-depth: 0): a shallow checkout can’t resolve the base ref to a tree hash, so the comparison degrades to the recency fallback.
Commit SHA: the deployed-version axis
The commit SHA answers a different question from the tree hash: which deployed version of the prompt actually served this run? It’s the axis that links a trace back to a point in your deployment history, and it’s recorded separately from the tree hash used for baseline matching.
AgentMark stamps it automatically. When a prompt is loaded from AgentMark Cloud, the gateway records the commit the content was served at into the prompt’s metadata: the pinned environment’s commit for a key bound to a pinned environment, or the latest synced commit otherwise. The local dev server stamps your repo’s HEAD the same way. The SDK echoes that commit onto the trace with no code change on your part.
What makes this axis trustworthy is server-side verification. On ingest, the gateway checks the SDK-supplied commit against the environment’s own deployment pointer (trust, but verify). For an API key bound to an environment, the recorded commit always reflects the server’s deployment record, so a client can’t claim it ran an arbitrary version. That’s what lets you answer “what version did prod run last Tuesday?” from trace data and trust the answer.
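As a rough illustration of that rule, the sketch below picks which commit to store at ingest time. The function recorded_commit and its parameters are hypothetical. The behavior for a key with no environment binding is an assumption, since the text above only specifies the bound case.

```python
from typing import Optional


def recorded_commit(client_commit: Optional[str],
                    deployment_pointer: Optional[str]) -> Optional[str]:
    """Pick the commit to store on a trace at ingest time (illustrative only)."""
    if deployment_pointer is not None:
        # Key bound to an environment: the server's record wins over the client's claim.
        return deployment_pointer
    # Assumption: with no environment binding, keep whatever the SDK reported.
    return client_commit


honest = recorded_commit("abc123", "abc123")
spoofed = recorded_commit("deadbeef", "abc123")
```

In both calls the stored value is the server's pointer, so a client that reports an arbitrary commit cannot change what the trace records.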
Keeping the two axes distinct matters: the tree hash decides what to compare against, while the commit SHA records what was deployed. A run can match a baseline by tree hash even though it carries a different commit SHA, because identical content can ship from different commits.
Row matching: input hashes, not positions
Identity at the run level isn’t enough. To report that a specific test case regressed, AgentMark also has to line up rows between this run and the baseline. It does that with a content hash of each dataset row’s input, not by row position or by an ID. The payoff is that your dataset is free to move:
- Reordering rows doesn’t break the comparison, because matching is by content, not position.
- Regenerating row IDs doesn’t break it either, because IDs aren’t used for matching.
- Adding or removing rows only affects the rows that changed; the rest still line up.
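A minimal sketch of matching by input hash follows. The canonicalization (JSON with sorted keys, hashed with SHA-256) and the row shape with id and input fields are assumptions made for the example; the hashing scheme AgentMark actually uses is not specified here.

```python
import hashlib
import json
from typing import Any


def input_hash(row_input: Any) -> str:
    """Hash a canonical JSON form so key order in the input does not matter."""
    canonical = json.dumps(row_input, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def match_rows(baseline: list[dict], current: list[dict]) -> list[tuple[dict, dict]]:
    """Pair each current row with the baseline row that has the same input content."""
    by_hash = {input_hash(row["input"]): row for row in baseline}
    pairs = []
    for row in current:
        counterpart = by_hash.get(input_hash(row["input"]))
        if counterpart is not None:
            pairs.append((counterpart, row))
    return pairs


baseline_rows = [
    {"id": "a", "input": {"question": "What is 2 + 2?"}},
    {"id": "b", "input": {"question": "Capital of France?"}},
]
# Reordered, with regenerated IDs and one new row.
current_rows = [
    {"id": "x9", "input": {"question": "Capital of France?"}},
    {"id": "x7", "input": {"question": "What is 2 + 2?"}},
    {"id": "x8", "input": {"question": "Largest planet?"}},
]
pairs = match_rows(baseline_rows, current_rows)
```

The two rows that exist in both runs pair up despite the new order and IDs, and the added row simply has no counterpart.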
How the pieces fit
A regression gate uses all three axes together:
- experiment_key selects the lineage of runs to consider.
- The tree hash picks the baseline run within that lineage (exact match preferred, recency as a reported fallback).
- Input-hash row matching aligns each case in this run with its counterpart in that baseline, so a per-case score drop can be detected.
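Once rows are aligned, detecting a per-case drop is a lookup and a subtraction. The sketch below assumes scores are already keyed by each row's input hash; find_regressions and its tolerance parameter are hypothetical names, not part of any AgentMark API.

```python
def find_regressions(baseline_scores: dict[str, float],
                     current_scores: dict[str, float],
                     tolerance: float = 0.0) -> dict[str, float]:
    """Return {input_hash: score_drop} for cases in both runs whose score fell."""
    drops = {}
    for row_hash, current in current_scores.items():
        previous = baseline_scores.get(row_hash)
        # Rows with no baseline counterpart are new cases, not regressions.
        if previous is not None and previous - current > tolerance:
            drops[row_hash] = previous - current
    return drops


baseline_scores = {"h1": 1.0, "h2": 0.5, "h3": 1.0}
current_scores = {"h1": 1.0, "h2": 0.25, "h4": 0.0}
regressions = find_regressions(baseline_scores, current_scores)
```

Only h2 is reported: h1 is unchanged, h3 was removed from the dataset, and h4 is new, so neither of the last two has a pair to compare.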
Where to go next
Regression gates
Wire the baseline comparison into a PR or CI build.
Running experiments
Run an experiment from the CLI, SDK, or Dashboard.
Span reference
The exact attributes that carry experiment key, tree hash, and commit SHA.
Environments
How a pinned environment’s commit becomes the served-at version.