# Create alert
Source: https://docs.agentmark.co/api-reference/alerts/create-alert

/openapi.yaml post /v1/alerts
Creates a threshold alert.

Threshold units depend on `metric`:
- `error_rate` — percent (0-100)
- `latency` — milliseconds (positive integer)
- `cost` — dollars (positive number)
- `evaluation_score` — score (0-1); also requires `evaluation_name`, `evaluation_aggregation` (`avg` or `individual`), and `evaluation_threshold_direction` (`above` or `below`).

Returns 400 with `field` on the error envelope when the metric/evaluation field coupling is wrong, so a calling agent can self-correct.

Returns 409 `duplicate_alert_name` if an alert with the same `name` already exists for this app.
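
The coupling rules above can be checked client-side before the request is sent. The following is a minimal sketch, not the gateway's validator: the `threshold` field name and the `build_alert_body` helper are assumptions made for illustration, and rejecting `evaluation_*` fields on non-evaluation metrics is a guess at what "wrong coupling" covers.

```python
EVALUATION_FIELDS = (
    "evaluation_name",
    "evaluation_aggregation",
    "evaluation_threshold_direction",
)

# Range checks taken from the metric list above.
THRESHOLD_CHECKS = {
    "error_rate": lambda t: 0 <= t <= 100,                # percent
    "latency": lambda t: isinstance(t, int) and t > 0,    # milliseconds
    "cost": lambda t: t > 0,                              # dollars
    "evaluation_score": lambda t: 0 <= t <= 1,            # score
}


def build_alert_body(name, metric, threshold, **evaluation):
    """Return a create-alert body, raising ValueError on bad field coupling.

    `threshold` is an assumed field name; the evaluation_* names come from
    the endpoint description.
    """
    if metric not in THRESHOLD_CHECKS:
        raise ValueError(f"unknown metric: {metric!r}")
    if not THRESHOLD_CHECKS[metric](threshold):
        raise ValueError(f"threshold {threshold!r} is out of range for {metric}")

    unknown = set(evaluation) - set(EVALUATION_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")

    if metric == "evaluation_score":
        missing = [f for f in EVALUATION_FIELDS if f not in evaluation]
        if missing:
            raise ValueError(f"evaluation_score requires: {missing}")
        if evaluation["evaluation_aggregation"] not in ("avg", "individual"):
            raise ValueError("evaluation_aggregation must be 'avg' or 'individual'")
        if evaluation["evaluation_threshold_direction"] not in ("above", "below"):
            raise ValueError("evaluation_threshold_direction must be 'above' or 'below'")
    elif evaluation:
        # Assumption: evaluation fields are only accepted with evaluation_score.
        raise ValueError("evaluation_* fields are only valid with evaluation_score")

    return {"name": name, "metric": metric, "threshold": threshold, **evaluation}
```

A local check like this only saves a round trip; the 400 response with `field` remains the authoritative signal.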


# Delete alert
Source: https://docs.agentmark.co/api-reference/alerts/delete-alert

/openapi.yaml delete /v1/alerts/{alertId}
Deletes an alert. Cascades to its alert_history rows.


# Get alert
Source: https://docs.agentmark.co/api-reference/alerts/get-alert

/openapi.yaml get /v1/alerts/{alertId}
Returns a single alert by ID.


# List alert trigger history
Source: https://docs.agentmark.co/api-reference/alerts/list-alert-trigger-history

/openapi.yaml get /v1/alerts/{alertId}/history
Returns the trigger/resolve history for an alert in reverse chronological order (newest first). Each record captures the value that crossed the threshold and the commit SHA at the time, so agents can correlate alerts with deploys.


# List alerts
Source: https://docs.agentmark.co/api-reference/alerts/list-alerts

/openapi.yaml get /v1/alerts
Returns alerts for the authenticated application, newest first.

Filters:
- `status=triggered|resolved` — useful for an agent checking "what is currently firing?"
- `metric=error_rate|latency|cost|evaluation_score` — filter to one metric type.
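
As an illustration of the filters above, the sketch below builds the list URL and rejects values the endpoint does not document. The `list_alerts_url` helper is hypothetical; only the parameter names and allowed values come from this page.

```python
from urllib.parse import urlencode

# Allowed values as documented for the list-alerts filters.
ALLOWED_FILTERS = {
    "status": {"triggered", "resolved"},
    "metric": {"error_rate", "latency", "cost", "evaluation_score"},
}


def list_alerts_url(base_url, **filters):
    """Build a GET /v1/alerts URL with validated query filters."""
    for key, value in filters.items():
        if value not in ALLOWED_FILTERS.get(key, ()):
            raise ValueError(f"unsupported filter {key}={value!r}")
    # Sort so the resulting URL is deterministic.
    query = urlencode(sorted(filters.items()))
    return f"{base_url}/v1/alerts" + (f"?{query}" if query else "")
```

For example, `list_alerts_url("https://api.agentmark.co", status="triggered")` yields the "what is currently firing?" query.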


# List Slack channels available for alert notifications
Source: https://docs.agentmark.co/api-reference/alerts/list-slack-channels-available-for-alert-notifications

/openapi.yaml get /v1/alerts/slack-channels
Returns every channel the Slack bot installed for this app is a member of, in the order Slack returns them. Use this to discover a `channel_id` to wire into an alert with `use_slack: true`.

Returns 404 `slack_integration_not_found` when no Slack workspace has been connected to the app — surface that to the user/agent so they know to connect Slack first via the dashboard.


# Update alert
Source: https://docs.agentmark.co/api-reference/alerts/update-alert

/openapi.yaml put /v1/alerts/{alertId}
Replaces an alert with the supplied body. The full alert state must be sent (PUT semantics) so the metric/evaluation field-coupling rule can be enforced in one validation pass. To toggle a single field, GET the alert, mutate, and PUT.
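
The GET, mutate, PUT cycle might look like the sketch below. The `get_alert` and `put_alert` callables are hypothetical wrappers around the two HTTP calls, and the set of read-only fields stripped before the PUT is an assumption; check the schema for the fields the server actually rejects.

```python
def update_alert_fields(get_alert, put_alert, alert_id, changes,
                        read_only=("id", "created_at", "updated_at")):
    """Change a few fields on an alert while still sending the full state.

    get_alert(alert_id) -> dict and put_alert(alert_id, body) -> dict are
    caller-supplied (hypothetical) wrappers around GET and PUT
    /v1/alerts/{alertId}. `read_only` is an assumed list of server-managed
    fields that should not be echoed back.
    """
    current = get_alert(alert_id)
    body = {k: v for k, v in current.items() if k not in read_only}
    body.update(changes)
    return put_alert(alert_id, body)
```

Because the whole body is resent, a concurrent edit made between the GET and the PUT would be overwritten; callers that care should re-read after writing.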


# Add items to queue
Source: https://docs.agentmark.co/api-reference/annotation-queue-items/add-items-to-queue

/openapi.yaml post /v1/annotation-queues/{queueId}/items
Adds one or more traces/spans/sessions to the queue for review. Duplicate `(queue_id, resource_id)` pairs are ignored silently.


# Delete queue item
Source: https://docs.agentmark.co/api-reference/annotation-queue-items/delete-queue-item

/openapi.yaml delete /v1/annotation-queues/{queueId}/items/{itemId}
Removes an item from a queue. Cascades to reviewer records.


# Get queue item
Source: https://docs.agentmark.co/api-reference/annotation-queue-items/get-queue-item

/openapi.yaml get /v1/annotation-queues/{queueId}/items/{itemId}
Returns a single queue item by ID.


# List queue items
Source: https://docs.agentmark.co/api-reference/annotation-queue-items/list-queue-items

/openapi.yaml get /v1/annotation-queues/{queueId}/items
Returns every item enqueued for review, in the order they were added.


# Submit review
Source: https://docs.agentmark.co/api-reference/annotation-queue-items/submit-review

/openapi.yaml post /v1/annotation-queues/{queueId}/items/{itemId}/reviews
Submit a review (`completed` or `skipped`) on behalf of the authenticated user. When the queue's `reviewers_required` threshold is met, the item auto-advances to `completed`.

This is the endpoint that enables LLM-as-judge pipelines to submit annotations through the same path human reviewers use in the dashboard.


# Update queue item
Source: https://docs.agentmark.co/api-reference/annotation-queue-items/update-queue-item

/openapi.yaml patch /v1/annotation-queues/{queueId}/items/{itemId}
Updates item status or assigned reviewer. Setting status to `completed` auto-records `completed_by` / `completed_at`.


# Create annotation queue
Source: https://docs.agentmark.co/api-reference/annotation-queues/create-annotation-queue

/openapi.yaml post /v1/annotation-queues
Creates a new queue for collecting human review on traces, spans, or sessions.


# Delete annotation queue
Source: https://docs.agentmark.co/api-reference/annotation-queues/delete-annotation-queue

/openapi.yaml delete /v1/annotation-queues/{queueId}
Deletes a queue. Cascades to its items and reviewer records.


# Get annotation queue
Source: https://docs.agentmark.co/api-reference/annotation-queues/get-annotation-queue

/openapi.yaml get /v1/annotation-queues/{queueId}
Returns metadata for a single queue by ID.


# List annotation queues
Source: https://docs.agentmark.co/api-reference/annotation-queues/list-annotation-queues

/openapi.yaml get /v1/annotation-queues
Returns every annotation queue for the authenticated application, with per-queue progress counters (pending / in_progress / completed / skipped / total).

When `assigned_to_me=true`, the result is restricted to queues with at least one item assigned to the authenticated reviewer. Under API-key authentication this filter yields an empty result set since API keys are not associated with a user identity.


# Update annotation queue
Source: https://docs.agentmark.co/api-reference/annotation-queues/update-annotation-queue

/openapi.yaml patch /v1/annotation-queues/{queueId}
Updates mutable queue metadata (`name`, `description`, `status`, `instructions`, `reviewers_required`, `score_config_names`). Fields not provided are left unchanged.


# Create API key
Source: https://docs.agentmark.co/api-reference/api-keys/create-api-key

/openapi.yaml post /v1/api-keys
Creates a new API key. The plaintext key is returned EXACTLY ONCE in `data.plaintext_key` — record it immediately. Subsequent reads expose only metadata.


# List API keys
Source: https://docs.agentmark.co/api-reference/api-keys/list-api-keys

/openapi.yaml get /v1/api-keys
Returns API keys for the authenticated tenant. Plaintext is never returned on this endpoint — record it at creation time.


# Revoke API key
Source: https://docs.agentmark.co/api-reference/api-keys/revoke-api-key

/openapi.yaml delete /v1/api-keys/{apiKeyId}
Revokes the API key and removes its local metadata row. Note: a brief revocation lag may occur — the gateway caches verified credentials for a short TTL, so a freshly-revoked key may still verify until the cache expires.


# Clear the linked repository and branch
Source: https://docs.agentmark.co/api-reference/apps/clear-the-linked-repository-and-branch

/openapi.yaml delete /v1/apps/{appId}/git/link
Clears `git_connection.repository` and removes the `git_branch` row. The OAuth install is preserved so the user can re-link without re-clicking the install URL.

Does NOT delete templates/datasets/storage objects today — that cleanup needs an admin Supabase client and will land in a follow-up endpoint. Until then, callers that want a full reset should delete the app and recreate it.


# Create app
Source: https://docs.agentmark.co/api-reference/apps/create-app

/openapi.yaml post /v1/apps
Creates a new app in the authenticated tenant. `runtime` defaults to `nodejs` if omitted.

Returns 409 `duplicate_app_name` when the tenant already has an app with the supplied `name`.

Returns 402 `entitlement_required` with `entitlement: "max_apps"`, `limit`, and `current` when the tenant has hit their app quota — agents can branch on this to prompt a tier upgrade.
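
An agent can branch on these two responses roughly as follows. This is a sketch under assumptions: it presumes `entitlement`, `limit`, and `current` sit alongside `code` inside the `error` object, and the action labels it returns are invented for illustration.

```python
def next_step_for_create_app(status, body):
    """Map a failed POST /v1/apps response to a coarse (action, context) pair."""
    error = body.get("error", {}) if isinstance(body, dict) else {}
    code = error.get("code")

    if status == 409 and code == "duplicate_app_name":
        # The name is taken; look the app up with GET /v1/apps?name=<exact>.
        return ("lookup_existing", None)

    if status == 402 and code == "entitlement_required":
        # Assumed placement: quota context as flat siblings of `code`.
        quota = {"limit": error.get("limit"), "current": error.get("current")}
        return ("upgrade_plan", quota)

    return ("fail", None)
```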


# Delete app
Source: https://docs.agentmark.co/api-reference/apps/delete-app

/openapi.yaml delete /v1/apps/{appId}
Deletes an app. Cascades to all child rows via ON DELETE CASCADE (git_connection, git_branch, app_connection, llm_api_url, alerts, deployments, etc.).

**Headless callers note:** API key rows are NOT cleaned up by this endpoint — the dashboard wraps DELETE + key-revoke separately. Headless agents that delete apps should either accept some orphan API key rows or call the API-key revoke endpoint directly with the keys they minted.


# Get app
Source: https://docs.agentmark.co/api-reference/apps/get-app

/openapi.yaml get /v1/apps/{appId}
Returns a single app by ID.


# Get git connection status for an app
Source: https://docs.agentmark.co/api-reference/apps/get-git-connection-status-for-an-app

/openapi.yaml get /v1/apps/{appId}/git
Returns the current git connection state. `connected: true` means a `git_connection` row exists (OAuth handshake completed). `repository` is null between OAuth-done and repo-picked.

Works for both `github` and `gitlab` providers. The `installation_id` field is GitHub-specific and is null for GitLab connections.

Headless flow: after sending a human to the install URL, poll this endpoint until `connected: true`.


# Link a repository and branch to an app
Source: https://docs.agentmark.co/api-reference/apps/link-a-repository-and-branch-to-an-app

/openapi.yaml post /v1/apps/{appId}/git/link
Persists the chosen repository + branch onto the app's existing git_connection. The next push to `branch` triggers the deploy webhook which materializes templates and datasets.

Requires an existing OAuth install (returns 409 `git_connection_missing` otherwise). Idempotent — re-linking the same repo+branch is a no-op aside from a commit_sha refresh.


# List apps
Source: https://docs.agentmark.co/api-reference/apps/list-apps

/openapi.yaml get /v1/apps
Returns apps for the authenticated tenant, newest first. Use `?name=<exact>` to look up a specific app by name without paginating.

**Auth headers:** `X-Agentmark-App-Id` is OPTIONAL for bearer (session JWT) auth — the listing is tenant-scoped, so headless agents on a cold tenant can call this without supplying any app id. API-key callers still send the header (API keys are app-scoped), but its value is ignored by this handler.


# List branches in a repository
Source: https://docs.agentmark.co/api-reference/apps/list-branches-in-a-repository

/openapi.yaml get /v1/apps/{appId}/git/branches
Returns the list of branch names for `repository` (passed as a query param because the value contains a slash). Use after `list-app-git-repositories` to populate a branch picker before linking.

Returns 409 `git_connection_missing` when no OAuth install exists for this app.


# List repositories accessible to the app's git installation
Source: https://docs.agentmark.co/api-reference/apps/list-repositories-accessible-to-the-apps-git-installation

/openapi.yaml get /v1/apps/{appId}/git/repositories
Returns repositories the linked git installation can see. For GitHub, this is the App's `/installation/repositories`. For GitLab, the OAuth user's `membership=true` projects.

Returns 409 `git_connection_missing` when the OAuth handshake hasn't completed yet. Call `POST /v1/apps/{appId}/git/connect` first and have a human click through.


# Mint an OAuth authorization URL for git-provider connect
Source: https://docs.agentmark.co/api-reference/apps/mint-an-oauth-authorization-url-for-git-provider-connect

/openapi.yaml post /v1/apps/{appId}/git/connect
Returns a per-provider OAuth authorization URL plus a signed state token. Headless flow:

1. Headless agent POSTs `{ provider }`.
2. Gateway returns `{ authorization_url, state, expires_at }`.
3. Agent surfaces the URL to a human (Slack DM, console output, etc.).
4. Human clicks; provider redirects to the dashboard OAuth callback.
5. Callback validates the state token and writes the `git_connection` row.
6. Agent polls `GET /v1/apps/{appId}/git` until `{ connected: true }`.

The `state` token is HMAC-signed (10-minute TTL) and binds the OAuth callback to the originating app + tenant + provider. Each call mints a fresh nonce — no replay across attempts.

Returns 503 `git_connect_not_configured` when the relevant provider OAuth env vars are missing on this environment (e.g. `GITHUB_APP_SLUG` not set).
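
The polling step of the headless flow can be sketched as a small loop. `fetch_status` is a hypothetical caller-supplied function that performs the git status GET and returns the parsed JSON; the default timeout mirrors the 10-minute state TTL, and the 5-second interval is an arbitrary choice.

```python
import time


def wait_for_git_connection(fetch_status, timeout_s=600.0, interval_s=5.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll until the status payload reports `connected: true`.

    fetch_status() -> dict is supplied by the caller (hypothetical wrapper
    around the git status endpoint). Raises TimeoutError if the human never
    completes the OAuth flow within `timeout_s`.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("connected") is True:
            return status
        if clock() >= deadline:
            raise TimeoutError("git connection was not completed in time")
        sleep(interval_s)
```

Once the state token has expired, further polling cannot succeed for that attempt, so a timed-out caller would need to mint a fresh authorization URL.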


# Update app
Source: https://docs.agentmark.co/api-reference/apps/update-app

/openapi.yaml patch /v1/apps/{appId}
Updates writable fields on an app. PATCH semantics — any subset of `name`, `runtime`, `entry_point` may be sent. An empty body is rejected with 400.

Returns 409 `duplicate_app_name` if renaming would collide with another app in the same tenant.


# Authentication
Source: https://docs.agentmark.co/api-reference/authentication

How to authenticate with the AgentMark Gateway API.

All API endpoints, except those listed under **Unauthenticated** below, require two headers:

| Header               | Description                              |
| -------------------- | ---------------------------------------- |
| `Authorization`      | Your API key — `Bearer sk_agentmark_...` |
| `X-Agentmark-App-Id` | Your application ID — `app_...`          |

The `Authorization` value can be either an API key (`sk_agentmark_...`) or a Supabase session JWT (used by the dashboard and `agentmark login`).

## Creating an API key

1. Open the [AgentMark Dashboard](https://app.agentmark.co)
2. Switch to the app you want to scope the key to (shown in the breadcrumb)
3. Navigate to the app's **Settings → API keys** page (app-level, not org-level)
4. Click **Create API key**
5. Select a **role** (SDK, Read-Only, or Full Access) or choose **Custom** to toggle individual permissions
6. Copy the key — it is only shown once

API keys enforce per-endpoint permissions. A key with **SDK** access can ingest traces and read templates but cannot delete anything. See [Users and access control](/deploy/users-and-access-control#api-keys) for the permission catalog and role presets.

## Endpoint permissions

Every API endpoint requires a specific permission. If your API key lacks the required permission, the request returns `403 Forbidden`.

### Traces and spans

| Endpoint                                  | Permission    |
| ----------------------------------------- | ------------- |
| `POST /v1/traces`                         | `trace.write` |
| `GET /v1/traces`                          | `trace.read`  |
| `GET /v1/traces/{traceId}`                | `trace.read`  |
| `GET /v1/traces/{traceId}/graph`          | `trace.read`  |
| `GET /v1/traces/{traceId}/spans`          | `span.read`   |
| `GET /v1/traces/{traceId}/spans/{spanId}` | `span.read`   |
| `GET /v1/spans`                           | `span.read`   |

### Sessions

| Endpoint                              | Permission     |
| ------------------------------------- | -------------- |
| `GET /v1/sessions`                    | `session.read` |
| `GET /v1/sessions/{sessionId}/traces` | `session.read` |

### Scores

| Endpoint                       | Permission          |
| ------------------------------ | ------------------- |
| `POST /v1/scores`              | `score.write`       |
| `POST /v1/scores/batch`        | `score.write`       |
| `GET /v1/scores`               | `score.read`        |
| `GET /v1/scores/{scoreId}`     | `score.read`        |
| `GET /v1/scores/aggregations`  | `score.read`        |
| `GET /v1/scores/names`         | `score.read`        |
| `DELETE /v1/scores/{scoreId}`  | `score.delete`      |
| `GET /v1/score-configs`        | `score_config.read` |
| `GET /v1/score-configs/{name}` | `score_config.read` |

### Metrics, datasets, experiments, prompts

| Endpoint                                           | Permission        |
| -------------------------------------------------- | ----------------- |
| `GET /v1/metrics`                                  | `metrics.read`    |
| `GET /v1/config`                                   | `template.read`   |
| `GET /v1/datasets`                                 | `dataset.read`    |
| `POST /v1/datasets/{datasetName}/rows`             | `dataset.write`   |
| `POST /v1/datasets/{datasetName}/rows/from-traces` | `dataset.write`   |
| `POST /v1/datasets/{datasetName}/rows/from-spans`  | `dataset.write`   |
| `GET /v1/experiments`                              | `experiment.read` |
| `GET /v1/experiments/{experimentId}`               | `experiment.read` |
| `GET /v1/prompts`                                  | `template.read`   |
| `GET /v1/runs/{runId}/traces`                      | `trace.read`      |
| `GET /v1/templates/{templatePath}`                 | `template.read`   |

### Annotation queues

| Endpoint                                                      | Permission                |
| ------------------------------------------------------------- | ------------------------- |
| `GET /v1/annotation-queues`                                   | `annotation_queue.read`   |
| `POST /v1/annotation-queues`                                  | `annotation_queue.write`  |
| `GET /v1/annotation-queues/{queueId}`                         | `annotation_queue.read`   |
| `PATCH /v1/annotation-queues/{queueId}`                       | `annotation_queue.write`  |
| `DELETE /v1/annotation-queues/{queueId}`                      | `annotation_queue.delete` |
| `GET /v1/annotation-queues/{queueId}/items`                   | `annotation_queue.read`   |
| `POST /v1/annotation-queues/{queueId}/items`                  | `annotation_queue.write`  |
| `GET /v1/annotation-queues/{queueId}/items/{itemId}`          | `annotation_queue.read`   |
| `PATCH /v1/annotation-queues/{queueId}/items/{itemId}`        | `annotation_queue.write`  |
| `DELETE /v1/annotation-queues/{queueId}/items/{itemId}`       | `annotation_queue.delete` |
| `POST /v1/annotation-queues/{queueId}/items/{itemId}/reviews` | `annotation_queue.review` |

### Unauthenticated

| Endpoint                   | Notes            |
| -------------------------- | ---------------- |
| `GET /v1/capabilities`     | No auth required |
| `GET /v1/pricing`          | No auth required |
| `GET /health`              | No auth required |
| `GET /v1/health/ingestion` | No auth required |
| `GET /v1/health/files`     | No auth required |

## Making requests

<Tabs>
  <Tab title="curl">
    ```bash theme={null}
    curl https://api.agentmark.co/v1/traces \
      -H "Authorization: Bearer sk_agentmark_your_key_here" \
      -H "X-Agentmark-App-Id: app_your_app_id_here"
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript theme={null}
    const response = await fetch("https://api.agentmark.co/v1/traces", {
      headers: {
        "Authorization": `Bearer ${process.env.AGENTMARK_API_KEY}`,
        "X-Agentmark-App-Id": process.env.AGENTMARK_APP_ID!,
      },
    });
    const data = await response.json();
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    import os
    import requests

    response = requests.get(
        "https://api.agentmark.co/v1/traces",
        headers={
            "Authorization": f"Bearer {os.environ['AGENTMARK_API_KEY']}",
            "X-Agentmark-App-Id": os.environ["AGENTMARK_APP_ID"],
        },
    )
    data = response.json()
    ```
  </Tab>
</Tabs>

## Rate limiting

API requests are rate-limited per tenant. Limits vary by plan — see [Billing and usage](/deploy/billing-and-usage#rate-limits). When you exceed your rate limit, the API returns `429 Too Many Requests` with a `Retry-After` header indicating how long to wait before retrying.
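
A client can turn the `Retry-After` header into a sleep duration as sketched below. The page does not say whether the gateway sends delta-seconds or an HTTP date, so this hypothetical helper accepts both forms allowed by HTTP and falls back to a default when the header is absent or unparseable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0):
    """Convert a Retry-After header value into seconds to wait."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "30".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```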

## Span and storage limits

Trace ingestion (`POST /v1/traces`) enforces monthly quotas depending on your plan:

* **Span limit** — maximum number of spans per month (Hobby/Growth tiers)
* **Storage cap** — maximum storage used (certain plans only)

When exceeded, the API returns `429` with a `span_limit_exceeded` or `storage_cap_exceeded` error code and a `Retry-After` header. Upgrade your plan at [Settings → Billing](https://app.agentmark.co/settings/billing).

## Error responses

Every `/v1` error body follows the same shape — `{ error: { code, message } }`:

```json theme={null}
{
  "error": {
    "code": "unauthorized",
    "message": "Not authorized"
  }
}
```

Some errors add machine-readable context **as fields alongside `code` and `message`**. Quota errors, for example, spread their context as flat siblings:

```json theme={null}
{
  "error": {
    "code": "span_limit_exceeded",
    "message": "Monthly unit limit exceeded. Upgrade your plan for unlimited units.",
    "currentCount": 20000,
    "limit": 20000,
    "upgradeUrl": "https://app.agentmark.co/settings/billing"
  }
}
```

`400` validation errors are the one case that nests: `error.message` is a generic summary (`Invalid request body` / `Invalid query parameters` / `Invalid path parameters`), and the per-field messages are collected in an `error.details` map keyed by field name:

```json theme={null}
{
  "error": {
    "code": "invalid_request_body",
    "message": "Invalid request body",
    "details": {
      "score": "Expected number, received string"
    }
  }
}
```

| Status | Meaning                                                                                                 |
| ------ | ------------------------------------------------------------------------------------------------------- |
| `400`  | Invalid request — `error.message` is a generic summary; per-field errors are in the `error.details` map |
| `401`  | Missing or invalid `Authorization` / `X-Agentmark-App-Id` header                                        |
| `403`  | API key is valid but does not have the required permission for this endpoint                            |
| `429`  | Rate limited or plan quota exceeded                                                                     |
| `503`  | Service temporarily overloaded — retry after the interval in the `Retry-After` header                   |
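
Since the envelope has one flat shape plus the nested `details` case, a single helper can normalize it. The sketch below is illustrative only; `describe_error` and its return keys are made-up names, while the envelope fields are those shown in the examples above.

```python
def describe_error(body):
    """Split a /v1 error envelope into code, message, field errors, and context."""
    error = body["error"]
    reserved = ("code", "message", "details")
    return {
        "code": error["code"],
        "message": error["message"],
        # Present only on 400 validation errors: field name -> message.
        "fields": error.get("details", {}),
        # Any flat sibling fields, such as quota counters.
        "context": {k: v for k, v in error.items() if k not in reserved},
    }
```

Applied to the quota example, `context` would hold `currentCount`, `limit`, and `upgradeUrl`; applied to the validation example, `fields` would hold the `score` message.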


# Get capabilities
Source: https://docs.agentmark.co/api-reference/capabilities/get-capabilities

/openapi.yaml get /v1/capabilities
Returns a map of available API endpoints for the current target (cloud or local). Use this to discover which features are supported before calling other endpoints.

This endpoint does not require authentication.


# Get config
Source: https://docs.agentmark.co/api-reference/config/get-config

/openapi.yaml get /v1/config
Returns the effective project configuration synced from `agentmark.json` for the authenticated application, plus the current synced commit SHA when available.


# Append dataset row
Source: https://docs.agentmark.co/api-reference/datasets/append-dataset-row

/openapi.yaml post /v1/datasets/{datasetName}/rows
Appends a single row to the specified dataset. The `datasetName` parameter is the dataset path without the `.jsonl` extension, URL-encoded.

For example, to append to `evals/sentiment-test.jsonl`, use `evals%2Fsentiment-test` as the `datasetName`.
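
The encoding step can be reproduced with the standard library. In this sketch, `dataset_rows_url` is a hypothetical helper; the important detail is `safe=""`, which makes `quote` encode the slash instead of leaving it as a path separator.

```python
from urllib.parse import quote


def dataset_rows_url(base_url, dataset_path):
    """Build the append-row URL for a dataset file path such as 'evals/x.jsonl'."""
    suffix = ".jsonl"
    name = dataset_path[: -len(suffix)] if dataset_path.endswith(suffix) else dataset_path
    # safe="" so that "/" becomes "%2F" rather than splitting the path.
    return f"{base_url}/v1/datasets/{quote(name, safe='')}/rows"
```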


# Import dataset rows from spans
Source: https://docs.agentmark.co/api-reference/datasets/import-dataset-rows-from-spans

/openapi.yaml post /v1/datasets/{datasetName}/rows/from-spans
Transforms one or more spans into canonical dataset rows and appends them to the specified dataset.


# Import dataset rows from traces
Source: https://docs.agentmark.co/api-reference/datasets/import-dataset-rows-from-traces

/openapi.yaml post /v1/datasets/{datasetName}/rows/from-traces
Transforms one or more traces into canonical dataset rows and appends them to the specified dataset.


# List datasets
Source: https://docs.agentmark.co/api-reference/datasets/list-datasets

/openapi.yaml get /v1/datasets
Returns a paginated list of datasets for your application with per-dataset metadata (`row_count`, `created_at`).

`?name=X` does an **exact match** on the dataset's leaf name (the file name without the `.jsonl` extension or any folder prefix), mirroring `/v1/prompts?name=`. For substring or prefix search, fetch the unfiltered list and filter client-side.

Standard `limit` / `offset` pagination.
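
Client-side substring search over the paginated list could be sketched as below. `fetch_page` is a hypothetical caller-supplied function that requests one page and returns its rows as a list, and the `name` key on each row is an assumption about the response shape.

```python
def iter_pages(fetch_page, limit=100):
    """Yield every row by walking limit/offset pages until a short page appears.

    fetch_page(limit=..., offset=...) -> list is supplied by the caller.
    """
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page
        if len(page) < limit:
            return
        offset += limit


def find_by_substring(fetch_page, substring, limit=100):
    """Return rows whose (assumed) `name` field contains `substring`."""
    return [row for row in iter_pages(fetch_page, limit) if substring in row["name"]]
```

When the total is an exact multiple of `limit`, the loop issues one extra request that returns an empty page before stopping.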


# Get deployment
Source: https://docs.agentmark.co/api-reference/deployments/get-deployment

/openapi.yaml get /v1/deployments/{deploymentId}
Returns a single deployment by id. The response includes status (`deployment_status`, `files_status`, `code_status`), commit metadata, timing, and any failure reason — sufficient to monitor progress without needing the dedicated logs endpoint for most use cases. For deployments created by an environment promote or rollback, the response also carries the saga columns (`env_version`, `source_env_id`, `source_deployment_id`, `aborted_by_deployment_id`, `note`, `actor_id`, `environment_name`, `environment_epoch`); these are null on legacy build-pipeline rows.


# List deployments
Source: https://docs.agentmark.co/api-reference/deployments/list-deployments

/openapi.yaml get /v1/deployments
Returns a paginated list of deployments for the authenticated application, newest first. Supports `?status=running|success|failure` to filter by deployment status — the typical CI use case ("is my deploy still going?") is `?status=running`.


# Create environment
Source: https://docs.agentmark.co/api-reference/environments/create-environment

/openapi.yaml post /v1/environments
Creates a new environment in the current app. The env starts in the no-pin state — promote (POST `/v1/environments/{id}/promote`) is the only path to pin a version. Synchronously provisions a per-env Fly app; failure rolls back the env row and surfaces 500 `env_fly_provisioning_failed`. Per FR-028 no API key is auto-minted; the response includes `api_key_creation_url` for the post-create CTA.


# Delete environment
Source: https://docs.agentmark.co/api-reference/environments/delete-environment

/openapi.yaml delete /v1/environments/{id}
Hard-deletes a non-default environment after a typed-name confirmation. Deleting the env row cascades (ON DELETE CASCADE) to its api_key, alert, deployment, template_snapshot, and env_config_snapshot rows. alert_history and annotation_queue rows are preserved via SET NULL, with their denormalized name and epoch kept intact. The Fly app destroy is attempted after the DB delete commits. If it fails, the env row is already gone and the response surfaces 500 `env_fly_provisioning_failed` for operator intervention.


# Get environment
Source: https://docs.agentmark.co/api-reference/environments/get-environment

/openapi.yaml get /v1/environments/{id}
Returns a single environment by id. Includes `cascade_preview` (counts of api_key / alert / deployment / template_snapshot rows that would be deleted on env delete — see FR-090) and `in_flight_saga_id` (id of the most-recent pending/snapshotting/deploying saga deployment row, or null).


# List environments
Source: https://docs.agentmark.co/api-reference/environments/list-environments

/openapi.yaml get /v1/environments
Returns environments for the authenticated application, default env first. Each env carries pin state (`current_version` — 0 when no-pin — plus `current_commit_sha`, the git commit of the pinned deployment, NULL when no-pin) and a stable `epoch` that distinguishes env instances across name reuse after delete.


# List the saga deployment audit log for an environment
Source: https://docs.agentmark.co/api-reference/environments/list-the-saga-deployment-audit-log-for-an-environment

/openapi.yaml get /v1/environments/{id}/deployments
Returns every saga `deployment` row (`env_version IS NOT NULL`) for this env in reverse chronological order. Per FR-030/FR-043. Replaces the removed 054-era `GET /v1/environments/{id}/promotion-history`.


# Promote to environment
Source: https://docs.agentmark.co/api-reference/environments/promote-to-environment

/openapi.yaml post /v1/environments/{id}/promote
Promotes the source environment's current content and code into this environment. The unified deployment runs in-process: it writes the env-keyed `template_snapshot`, atomically commits the env pointer, and dispatches the managed code build, so a promote deploys code as well as content. Returns 202 with the created `deployment` row. The snapshot and env-pointer commit are already complete at that point, while the build runs asynchronously; poll `GET /v1/deployments/{deployment_id}` until `code_status` reaches `deployed`. The default environment cannot be a promote target.
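
When polling the deployment after a promote, each response can be reduced to a coarse state as in this sketch. The `promote_state` helper and its labels are hypothetical, and treating `deployment_status: "failure"` as terminal is an assumption based on the status values listed for the deployments endpoints.

```python
def promote_state(deployment):
    """Classify a deployment row returned while polling after a promote."""
    if deployment.get("deployment_status") == "failure":
        # Assumed terminal: stop polling and inspect the failure reason.
        return "failed"
    if deployment.get("code_status") == "deployed":
        return "deployed"
    return "in_progress"
```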


# Roll back an environment
Source: https://docs.agentmark.co/api-reference/environments/roll-back-an-environment

/openapi.yaml post /v1/environments/{id}/rollback
Rolls this environment back to a prior pinned version (FR-018..FR-023). `target_version` must name a successful forward-promote deployment row on this env — rollback re-runs the deployment at that row's `commit_sha`. Rollback is a promote at an older commit: it writes a fresh env-keyed snapshot, commits the env pointer, and dispatches the build. Returns 202 with the created `deployment` row; poll `GET /v1/deployments/{deployment_id}` to watch the build complete.

Feature-gated: while the `enable_env_rollback` flag is OFF for the caller's tenant this route responds `404` as if it did not exist.


# Get baseline scores
Source: https://docs.agentmark.co/api-reference/experiments/get-baseline-scores

/openapi.yaml get /v1/experiments/baseline
Returns per-(row × scorer) scores from the baseline run matching `commit_sha` (a content-addressed git tree hash), optionally narrowed by `dataset_path`. Used by `agentmark run-experiment --baseline-commit` to drive the regression gate: each score is keyed by `inputHash` (a stable hash of the row’s dataset input) so a live run can match its rows to the baseline regardless of order. `rows` is empty when no matching run exists.
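
Order-independent matching by `inputHash` might be sketched as follows. This is not the CLI's actual gate logic: the `scorer` and `score` field names, the `tolerance` parameter, and the "lower is worse" comparison are all assumptions made for illustration.

```python
def find_regressions(baseline_rows, live_rows, tolerance=0.0):
    """Return (key, baseline_score, live_score) for scores that dropped.

    Rows are dicts with `inputHash` plus assumed `scorer` and `score` keys.
    Live rows with no baseline counterpart are ignored.
    """
    baseline = {(r["inputHash"], r["scorer"]): r["score"] for r in baseline_rows}
    regressions = []
    for row in live_rows:
        key = (row["inputHash"], row["scorer"])
        if key in baseline and row["score"] < baseline[key] - tolerance:
            regressions.append((key, baseline[key], row["score"]))
    return regressions
```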


# Get experiment
Source: https://docs.agentmark.co/api-reference/experiments/get-experiment

/openapi.yaml get /v1/experiments/{experimentId}
Retrieve a specific experiment by ID, including its per-item details (trace IDs, inputs/outputs, per-item cost/latency/tokens, and any scores attached to each trace).


# List experiments
Source: https://docs.agentmark.co/api-reference/experiments/list-experiments

/openapi.yaml get /v1/experiments
Retrieve a paginated list of experiments (dataset runs). Each experiment is a group of traces that share a `DatasetRunId`, typically produced when a prompt is evaluated against every row in a dataset.

Response includes per-experiment aggregates (item count, avg latency, total cost, avg score) plus filter options (distinct prompt names and dataset paths) so UI consumers can populate dropdowns without a second request.


# Files health
Source: https://docs.agentmark.co/api-reference/health/files-health

/openapi.yaml get /v1/health/files
Check the health of the files service and its dependencies.


# Ingestion health
Source: https://docs.agentmark.co/api-reference/health/ingestion-health

/openapi.yaml get /v1/health/ingestion
Check the health of the trace ingestion pipeline and its dependencies.


# Service health
Source: https://docs.agentmark.co/api-reference/health/service-health

/openapi.yaml get /health
Check if the gateway service is running. Returns healthy if all required environment variables are configured.


# Get metrics
Source: https://docs.agentmark.co/api-reference/metrics/get-metrics

/openapi.yaml get /v1/metrics
Retrieve aggregated analytics metrics for your application. Returns a summary and an hourly time series for trace volume, latency, cost, token usage, and error rates.

**Cloud only.** The local dev server does not serve this endpoint (requires ClickHouse aggregations).


# API reference
Source: https://docs.agentmark.co/api-reference/overview

Programmatic access to the AgentMark Gateway API.

The AgentMark Gateway API provides direct HTTP access to trace ingestion, scoring, and template retrieval.

<Note>
  Most developers should use the [AgentMark SDK](/introduction/overview) instead of calling the REST API directly.
  The SDK handles authentication, retries, and serialization automatically.
</Note>

## Base URL

<Tabs>
  <Tab title="Cloud">
    ```
    https://api.agentmark.co
    ```
  </Tab>

  <Tab title="Local">
    ```
    http://localhost:9418
    ```
  </Tab>
</Tabs>

The local dev server (`npx agentmark dev`) and Cloud share the same `/v1/*` wire contract — the same Zod schemas are the source of truth on both surfaces. What differs is which handlers are implemented where:

* **Both surfaces:** `/v1/config`, `/v1/traces` (ingest + read), `/v1/sessions`, `/v1/spans`, `/v1/scores` (full CRUD + batch), `/v1/datasets`, `/v1/experiments`, `/v1/templates`, `/v1/capabilities`, `/v1/pricing`.
* **Cloud-only** (local returns `404` or a `501 not_available_locally` stub): `/v1/metrics`, `/v1/scores/aggregations`, `/v1/traces/export`, annotation queues (`/v1/annotation-queues/*`), and the health endpoints (`/health`, `/v1/health/*`).
* **Local-only** (Cloud returns `501 not_available_on_cloud`): `/v1/prompts` lists the prompt files on disk — the Cloud handler is a documented stub pending an implementation decision.
* **Deprecated:** `/v1/runs/{runId}/traces` still works on local for backwards compatibility with older SDK versions, but new code should use `/v1/traces?dataset_run_id={runId}`; both paths apply the same predicate. The Cloud endpoint has always returned `501`.
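
As a rough illustration of the migration described in the last bullet, the sketch below rewrites the deprecated run-scoped path into the `/v1/traces` query form. The helper name `modernize_path` is hypothetical and not part of any AgentMark SDK; it assumes the run ID is already URL-safe.

```python
import re

# Matches the deprecated run-scoped path and captures the run ID.
_DEPRECATED_RUN_TRACES = re.compile(r"^/v1/runs/([^/]+)/traces$")


def modernize_path(path: str) -> str:
    """Rewrite /v1/runs/{runId}/traces to /v1/traces?dataset_run_id={runId}.

    Hypothetical helper: any other path is returned unchanged.
    """
    match = _DEPRECATED_RUN_TRACES.match(path)
    if match is None:
        return path
    return f"/v1/traces?dataset_run_id={match.group(1)}"


print(modernize_path("/v1/runs/run_123/traces"))
```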

Call `GET /v1/capabilities` to probe which features a server supports at runtime.

All endpoints are prefixed with `/v1/` except the root health check.
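
A client that targets both surfaces may want to tell "this route is not implemented here" apart from other failures. The sketch below is one possible way to classify a response, based only on the status codes and error codes named above. The function name and the returned labels are illustrative assumptions, and a `404` stays ambiguous because it can also mean that a specific resource does not exist.

```python
from typing import Optional

# Error codes this page associates with surface-specific stubs.
SURFACE_STUB_CODES = {"not_available_locally", "not_available_on_cloud"}


def classify_availability(status: int, body: Optional[dict]) -> str:
    """Return a coarse label for a response (labels are hypothetical)."""
    error = (body or {}).get("error") or {}
    if status == 501:
        # 501 is used for surface stubs; the code says which side is missing.
        code = error.get("code")
        return code if code in SURFACE_STUB_CODES else "not_implemented"
    if status == 404:
        # Could be an unimplemented route or simply a missing resource.
        return "missing_or_unavailable"
    return "other"


stub = {"error": {"code": "not_available_locally", "message": "Cloud only"}}
print(classify_availability(501, stub))
```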

## Available endpoints

The **Where** column shows which environments implement each route. "Cloud + Local" means the same handler semantics on both; "Cloud only" / "Local only" mean the other side returns `501` (with a `not_available_on_cloud` / `not_available_locally` error code) or `404`.

| Endpoint                                                 | Method                     | Where                       | Description                                                                                                                                                                                |
| -------------------------------------------------------- | -------------------------- | --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `/v1/traces`                                             | `POST`                     | Cloud + Local               | Ingest trace data in OTLP format (supports gzip)                                                                                                                                           |
| `/v1/traces`                                             | `GET`                      | Cloud + Local               | List traces with filtering — supports `dataset_run_id` for run-scoped listings                                                                                                             |
| `/v1/traces/{traceId}`                                   | `GET`                      | Cloud + Local               | Get a single trace with all its spans                                                                                                                                                      |
| `/v1/traces/{traceId}/spans`                             | `GET`                      | Cloud + Local               | List every span belonging to a trace                                                                                                                                                       |
| `/v1/traces/{traceId}/spans/{spanId}`                    | `GET`                      | Cloud + Local               | Get full input/output payload for a single span                                                                                                                                            |
| `/v1/traces/{traceId}/graph`                             | `GET`                      | Cloud + Local               | Return nodes + edges for visualizing a trace's agent-execution flow                                                                                                                        |
| `/v1/traces/export`                                      | `GET`                      | Cloud only                  | Export traces as JSONL, CSV, or OpenAI fine-tuning format                                                                                                                                  |
| `/v1/sessions`                                           | `GET`                      | Cloud + Local               | List sessions with filtering by name and user                                                                                                                                              |
| `/v1/sessions/{sessionId}/traces`                        | `GET`                      | Cloud + Local               | List traces for a specific session                                                                                                                                                         |
| `/v1/spans`                                              | `GET`                      | Cloud + Local               | Query spans across traces with filtering by type, status, model, and duration                                                                                                              |
| `/v1/scores`                                             | `POST`                     | Cloud + Local               | Create a score record for a span or trace                                                                                                                                                  |
| `/v1/scores/batch`                                       | `POST`                     | Cloud + Local               | Create up to 1000 scores in one request (per-item results, 207-style)                                                                                                                      |
| `/v1/scores`                                             | `GET`                      | Cloud + Local               | List scores for a specific span or trace                                                                                                                                                   |
| `/v1/scores/{scoreId}`                                   | `GET`                      | Cloud + Local               | Get a single score by ID                                                                                                                                                                   |
| `/v1/scores/{scoreId}`                                   | `DELETE`                   | Cloud + Local               | Delete a score record                                                                                                                                                                      |
| `/v1/scores/names`                                       | `GET`                      | Cloud + Local               | List distinct score names (for UI filters)                                                                                                                                                 |
| `/v1/scores/aggregations`                                | `GET`                      | Cloud only                  | Aggregated score statistics grouped by name                                                                                                                                                |
| `/v1/score-configs`                                      | `GET`                      | Cloud + Local               | List score configurations (reusable score schemas)                                                                                                                                         |
| `/v1/score-configs/{name}`                               | `GET`                      | Cloud + Local               | Get a single score configuration by name                                                                                                                                                   |
| `/v1/metrics`                                            | `GET`                      | Cloud only                  | Aggregated analytics (trace volume, latency, cost, tokens, error rates)                                                                                                                    |
| `/v1/config`                                             | `GET`                      | Cloud + Local               | Retrieve the synced `agentmark.json` project configuration plus the current commit SHA                                                                                                     |
| `/v1/datasets`                                           | `GET`                      | Cloud + Local               | List datasets with per-dataset metadata (`row_count`, `created_at`), case-insensitive `?name=` substring filter, and canonical `{ data, pagination }` envelope                             |
| `/v1/datasets/{datasetName}/rows`                        | `POST`                     | Cloud + Local               | Append a canonical dataset row with `input`, `expected_output`, and `metadata`                                                                                                             |
| `/v1/datasets/{datasetName}/rows/from-traces`            | `POST`                     | Cloud + Local               | Import one or more traces into canonical dataset rows using optional field mapping                                                                                                         |
| `/v1/datasets/{datasetName}/rows/from-spans`             | `POST`                     | Cloud + Local               | Import one or more spans into canonical dataset rows using optional field mapping                                                                                                          |
| `/v1/experiments`                                        | `GET`                      | Cloud + Local               | List experiments                                                                                                                                                                           |
| `/v1/experiments/{experimentId}`                         | `GET`                      | Cloud + Local               | Get an experiment by ID                                                                                                                                                                    |
| `/v1/prompts`                                            | `GET`                      | Local only                  | List prompt file paths in the project. Cloud returns `501 not_available_on_cloud`.                                                                                                         |
| `/v1/runs/{runId}/traces`                                | `GET`                      | Local only · **deprecated** | Use `/v1/traces?dataset_run_id={runId}` instead — both paths hit the same predicate. Kept on Local for older SDK versions; Cloud returns `501`.                                            |
| `/v1/capabilities`                                       | `GET`                      | Cloud + Local               | Check which features the server supports (no auth required)                                                                                                                                |
| `/v1/templates/{templatePath}`                           | `GET`                      | Cloud + Local               | Retrieve a prompt template by file path                                                                                                                                                    |
| `/v1/pricing`                                            | `GET`                      | Cloud + Local               | Per-model LLM pricing data (no auth required)                                                                                                                                              |
| `/v1/annotation-queues`                                  | `GET` · `POST`             | Cloud only                  | List / create annotation queues for human review                                                                                                                                           |
| `/v1/annotation-queues/{queueId}`                        | `GET` · `PATCH` · `DELETE` | Cloud only                  | Read / update / delete a queue                                                                                                                                                             |
| `/v1/annotation-queues/{queueId}/items`                  | `GET` · `POST`             | Cloud only                  | List items or add traces/spans/sessions to a queue                                                                                                                                         |
| `/v1/annotation-queues/{queueId}/items/{itemId}`         | `GET` · `PATCH` · `DELETE` | Cloud only                  | Read / update / remove a single queue item                                                                                                                                                 |
| `/v1/annotation-queues/{queueId}/items/{itemId}/reviews` | `POST`                     | Cloud only                  | Submit a review — LLM-as-judge pipelines can land annotations the same way human reviewers do                                                                                              |
| `/v1/api-keys`                                           | `GET` · `POST`             | Cloud only                  | List API keys (metadata only — no plaintext) or mint a new key. The plaintext value of a newly created key is returned exactly once in the `POST` response and is unrecoverable afterward. |
| `/v1/api-keys/{apiKeyId}`                                | `DELETE`                   | Cloud only                  | Revoke an API key. Revoked keys are rejected immediately.                                                                                                                                  |
| `/v1/connect`                                            | `GET` (WebSocket upgrade)  | Cloud only                  | Persistent connection for deployed workers to receive dispatched jobs.                                                                                                                     |
| `/health`                                                | `GET`                      | Cloud only                  | Root health check (no auth required)                                                                                                                                                       |
| `/v1/health/ingestion`                                   | `GET`                      | Cloud only                  | Ingestion pipeline health with dependency statuses                                                                                                                                         |
| `/v1/health/files`                                       | `GET`                      | Cloud only                  | Files service health with dependency statuses                                                                                                                                              |

Use the sidebar to browse interactive documentation for each endpoint.

<Tip>
  Two programmatic surfaces, same OpenAPI spec under the hood:

  * **From shell / CI:** call the REST endpoints with `curl` and an `AGENTMARK_API_KEY` (or the session bearer from `~/.agentmark/auth.json` after `agentmark login`).
  * **From an IDE agent:** run the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server. It fetches this spec at startup and exposes one MCP tool per operation (e.g. `list_traces`, `create_app`, `start_app_git_connect`), so your Claude Code / Cursor / etc. agent can drive the gateway headlessly.
</Tip>

## Response format

All responses are JSON unless otherwise noted (e.g., CSV exports). Error responses follow a consistent canonical envelope:

```json theme={null}
{
  "error": {
    "code": "string_snake_case",
    "message": "Description of what went wrong"
  }
}
```

The `error.code` field is the programmatic discriminator: use it to branch on specific error cases. The `error.message` field is the human-readable description to show to users. The one exception to the flat shape is `400` validation errors, which nest per-field messages in an `error.details` map (see [Authentication](/api-reference/authentication#error-responses)). Otherwise, additional context (e.g. `retry_after_seconds`, `jobId`) appears as extra fields **directly inside `error`**, alongside `code` and `message`:

```json theme={null}
{
  "error": {
    "code": "span_limit_exceeded",
    "message": "Monthly unit limit exceeded. Upgrade your plan for unlimited units.",
    "currentCount": 20000,
    "limit": 20000,
    "upgradeUrl": "https://app.agentmark.co/settings/billing"
  }
}
```

The shape matches Stripe, OpenAI, and Anthropic error conventions — one parser works across all endpoints.
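
To make the envelope concrete, here is a minimal parsing sketch that separates `code` and `message` from whatever extra fields sit beside them. The `ApiError` type and `parse_error` function are hypothetical names for illustration, not part of the AgentMark SDK.

```python
from typing import NamedTuple


class ApiError(NamedTuple):
    code: str      # programmatic discriminator
    message: str   # human-readable description
    extras: dict   # any other fields found directly inside "error"


def parse_error(body: dict) -> ApiError:
    """Split the canonical error envelope into code, message, and extras."""
    error = body["error"]
    extras = {k: v for k, v in error.items() if k not in ("code", "message")}
    return ApiError(error["code"], error["message"], extras)


# The quota example from this page.
example = {
    "error": {
        "code": "span_limit_exceeded",
        "message": "Monthly unit limit exceeded. Upgrade your plan for unlimited units.",
        "currentCount": 20000,
        "limit": 20000,
        "upgradeUrl": "https://app.agentmark.co/settings/billing",
    }
}
parsed = parse_error(example)
print(parsed.code, sorted(parsed.extras))
```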

## Rate limiting

Requests are rate-limited per tenant. When you exceed your rate limit, the API returns a `429` status code.
Trace ingestion has additional monthly span and storage quotas depending on your plan.

See [Authentication](/api-reference/authentication) for details.
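
If you call the REST API directly, you will probably want some backoff around `429` responses. The sketch below computes a wait time under two assumptions that this page does not confirm: that a `429` body may carry the `retry_after_seconds` extra mentioned under Response format, and that exponential backoff is an acceptable fallback when it does not. The function name and default values are illustrative.

```python
from typing import Optional


def retry_delay(status: int, body: dict, attempt: int,
                base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not a 429."""
    if status != 429:
        return None
    # Assumption: the server may include a hint inside the error envelope.
    hint = (body.get("error") or {}).get("retry_after_seconds")
    if isinstance(hint, (int, float)) and not isinstance(hint, bool) and hint >= 0:
        return float(hint)
    # Fallback: exponential backoff (1s, 2s, 4s, ...) capped at `cap` seconds.
    return min(cap, base * (2 ** attempt))


print(retry_delay(429, {"error": {"retry_after_seconds": 7}}, attempt=0))
```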

## Versioning

Every endpoint except the root `/health` check is prefixed with `/v1/`. Breaking changes ship under new version prefixes (`/v2/`, etc.) with a deprecation window of at least 90 days, and `/v1/` keeps working while you migrate.

See [API versioning & stability](/api-reference/versioning) for the full policy on what's breaking, what's additive, and how deprecations are announced.

## Why there is no `PATCH /v1/traces`

Traces are immutable in AgentMark. Once a span lands in ClickHouse, the row representing what happened during that execution is frozen — there is deliberately no endpoint that mutates it.

Other observability platforms expose a "patch trace" endpoint that lets clients backfill metadata, attach a label, or correct a field after ingestion. AgentMark covers those workflows through three separate, append-only resources instead:

* **Scores** (`POST /v1/scores`, `POST /v1/scores/batch`) — attach a graded value (numeric, categorical, or boolean) to a trace or span after the fact. Scores are versioned by `created_at` and never overwrite the underlying span.
* **Comments** — free-form human notes on a trace or span, stored alongside the trace as a separate resource.
* **Annotation queues** (`/v1/annotation-queues/*`) — structured human-in-the-loop review that produces new score and comment records, again without modifying the trace itself.

The three resources above are the migration targets for any "patch trace" workflow you'd build on a competitor. This split is intentional: it keeps the audit trail clean (you can always tell what the model did vs. what a reviewer added later) and lets retention, RBAC, and export rules apply differently to raw execution data than to human-attached metadata.

This is a permanent design choice, not a missing feature — `PATCH /v1/traces` will not ship in `/v1/`, `/v2/`, or any future version.

## Filtering on `/v1/spans` and `/v1/scores`

`/v1/spans` and `/v1/scores` accept the same filter vocabulary as `/v1/traces`. The point is that one filter expression composes across surfaces — write it once, reuse it for trace listings, span listings, score listings, and saved-filter exports.

`/v1/spans` accepts:

* `start_date`, `end_date` — ISO 8601 timestamps. Inclusive on both ends.
* `user_id`, `session_id` — scope the result to a specific user or session.
* `filter` — a JSON-encoded filter DSL, identical to the one `/v1/traces` accepts. Example:

  ```http theme={null}
  GET /v1/spans?filter=%7B%22op%22%3A%22and%22%2C%22exprs%22%3A%5B%7B%22field%22%3A%22model%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A%22gpt-4o%22%7D%2C%7B%22field%22%3A%22latency_ms%22%2C%22op%22%3A%22gt%22%2C%22value%22%3A2000%7D%5D%7D
  ```

  Decoded:

  ```json theme={null}
  {
    "op": "and",
    "exprs": [
      { "field": "model", "op": "eq", "value": "gpt-4o" },
      { "field": "latency_ms", "op": "gt", "value": 2000 }
    ]
  }
  ```

  See [Filtering & search](/observe/filtering-and-search) for the full operator list.
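
Hand-encoding the `filter` parameter is error-prone, so it can help to build it from a dictionary. The sketch below serializes the decoded example above as compact JSON and percent-encodes it with the Python standard library. The helper name `encode_filter` is an assumption for illustration.

```python
import json
from urllib.parse import quote


def encode_filter(expr: dict) -> str:
    """Serialize a filter expression as compact JSON and percent-encode it."""
    compact = json.dumps(expr, separators=(",", ":"))
    return quote(compact, safe="")


spans_filter = {
    "op": "and",
    "exprs": [
        {"field": "model", "op": "eq", "value": "gpt-4o"},
        {"field": "latency_ms", "op": "gt", "value": 2000},
    ],
}

# Produces the same query string as the encoded request shown above.
query = "/v1/spans?filter=" + encode_filter(spans_filter)
print(query)
```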

`/v1/scores` accepts `session_id` (newly added) to scope scores to a session, alongside `start_date`, `end_date`, and `source`, which were already supported.


# Get LLM pricing
Source: https://docs.agentmark.co/api-reference/pricing/get-llm-pricing

/openapi.yaml get /v1/pricing
Returns per-model pricing for cost calculation. Response is a dynamic map keyed by model ID (e.g. `gpt-4o`, `claude-opus-4-6`). Prices are per 1,000 tokens.

Public endpoint — no authentication required. Response is cached for 24 hours (`Cache-Control: public, max-age=86400`).
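
Because prices are quoted per 1,000 tokens, a cost estimate is the token count divided by 1,000 and multiplied by the price. The sketch below shows only that arithmetic. The function names and the prices used are made-up placeholders, and the actual field names in the pricing response are not assumed here.

```python
def token_cost(tokens: int, price_per_1k: float) -> float:
    """Cost in dollars for `tokens` at a price quoted per 1,000 tokens."""
    return tokens / 1000 * price_per_1k


def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Total cost of one call: input cost plus output cost."""
    return (token_cost(input_tokens, input_price_per_1k)
            + token_cost(output_tokens, output_price_per_1k))


# Placeholder prices for illustration only; read real values from /v1/pricing.
total = call_cost(input_tokens=1500, output_tokens=500,
                  input_price_per_1k=0.0025, output_price_per_1k=0.01)
print(f"{total:.5f}")
```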


# List or look up prompts
Source: https://docs.agentmark.co/api-reference/prompts/list-or-look-up-prompts

/openapi.yaml get /v1/prompts
List prompt files. With `?name=X`, filters to prompts whose frontmatter `name` matches — used by the trace drawer to map a span's `prompt_name` back to a file path. Without `name`, the full listing is OSS-only (cloud returns 501). Multiple matches are possible because frontmatter `name` is unique only within a single folder, not across an app.


# Get score config
Source: https://docs.agentmark.co/api-reference/score-configs/get-score-config

/openapi.yaml get /v1/score-configs/{name}
Returns a single score config by name. The name is the object key in `agentmark.json`'s `scores` map. Returns 404 if no config with that name is declared.


# List score configs
Source: https://docs.agentmark.co/api-reference/score-configs/list-score-configs

/openapi.yaml get /v1/score-configs
Returns the score configs declared in the application's `agentmark.json` (synced on deploy). Read-only — to add or modify a config, edit `agentmark.json` and redeploy. Returns an empty list if no configs are declared.


# Create score
Source: https://docs.agentmark.co/api-reference/scoring/create-score

/openapi.yaml post /v1/scores
Create a score record for a span or trace. Scores are used to track quality metrics,
evaluation results, and human feedback.

<Note>This endpoint was consolidated from `/v1/score` (singular). The legacy path still works but `/v1/scores` is preferred.</Note>


# Create scores (batch)
Source: https://docs.agentmark.co/api-reference/scoring/create-scores-batch

/openapi.yaml post /v1/scores/batch
Create up to 1000 scores in a single request. Each item is validated independently and the response always contains a per-item results array.

Status codes:
  - `201 Created`: every item succeeded.
  - `207 Multi-Status`: some items succeeded, but at least one failed validation (e.g. missing `resource_id` or invalid `dataType`).
  - `400 Bad Request`: every item failed validation (or the envelope itself is malformed).
  - `413 Payload Too Large`: the request contains more than 1000 items.
  - `500 Internal Server Error`: the batch insert against analytics storage failed; no items were persisted.

Pass an optional `client_id` on each item (max 128 chars) to correlate the server-generated `id` back to your own identifier in the results array. The server never inspects or stores `client_id`.
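
A caller with more than 1000 scores has to split them across requests to avoid the `413`. The sketch below shows one way to chunk a list and tag each item with a `client_id` for later correlation. The helper names and the `item-N` ID scheme are illustrative assumptions, and the score fields themselves are left opaque.

```python
from typing import Iterator, List

MAX_BATCH_ITEMS = 1000  # limit stated above; larger requests return 413


def with_client_ids(items: List[dict], prefix: str = "item") -> List[dict]:
    """Return copies of the items with a short, unique client_id attached."""
    return [{**item, "client_id": f"{prefix}-{i}"} for i, item in enumerate(items)]


def chunk_scores(items: List[dict], size: int = MAX_BATCH_ITEMS) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items, one per request."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


scores = with_client_ids([{"name": "accuracy"} for _ in range(2500)])
batch_sizes = [len(batch) for batch in chunk_scores(scores)]
print(batch_sizes)
```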


# Delete score
Source: https://docs.agentmark.co/api-reference/scoring/delete-score

/openapi.yaml delete /v1/scores/{scoreId}
Delete a score record by ID.


# Get score
Source: https://docs.agentmark.co/api-reference/scoring/get-score

/openapi.yaml get /v1/scores/{scoreId}
Retrieve a single score record by ID. Returns the full score object including its value, label, reason, and source.


# Get score aggregations
Source: https://docs.agentmark.co/api-reference/scoring/get-score-aggregations

/openapi.yaml get /v1/scores/aggregations
Returns aggregated statistics for scores grouped by name. Useful for understanding score distributions across your application.

<Note>This endpoint is only available on cloud. The local dev server returns 501.</Note>


# Get score names
Source: https://docs.agentmark.co/api-reference/scoring/get-score-names

/openapi.yaml get /v1/scores/names
Returns a list of distinct score names used in your application. Useful for building filter dropdowns and discovering available score types.


# List scores
Source: https://docs.agentmark.co/api-reference/scoring/list-scores

/openapi.yaml get /v1/scores
Returns a paginated list of scores for the authenticated application. Supports filtering by resource, name, source, and date range.


# List sessions
Source: https://docs.agentmark.co/api-reference/sessions/list-sessions

/openapi.yaml get /v1/sessions
Retrieve a paginated list of sessions. Sessions group related traces together for multi-turn conversations, workflows, and batch processing.


# Get span I/O detail
Source: https://docs.agentmark.co/api-reference/spans/get-span-io-detail

/openapi.yaml get /v1/traces/{traceId}/spans/{spanId}
Returns the full input/output payload for a specific span, plus parsed output objects and tool calls when present. Useful for rendering a single span in isolation without loading the entire trace.


# List spans
Source: https://docs.agentmark.co/api-reference/spans/list-spans

/openapi.yaml get /v1/spans
Query spans across all traces. Supports filtering by type, status, model, name, and duration range.


# List spans for a trace
Source: https://docs.agentmark.co/api-reference/spans/list-spans-for-a-trace

/openapi.yaml get /v1/traces/{traceId}/spans
Returns every span belonging to the given trace, ordered by start time. Not paginated — traces are bounded by span volume (capped at ingest), so the full list is returned in a single response.


# Get template
Source: https://docs.agentmark.co/api-reference/templates/get-template

/openapi.yaml get /v1/templates
Retrieve a prompt template by its file path. Templates must have a `.mdx` or `.jsonl` extension.


# Get trace
Source: https://docs.agentmark.co/api-reference/traces/get-trace

/openapi.yaml get /v1/traces/{traceId}
Retrieve a specific trace by ID, including all its spans.

Pass `?fields=graph` to include agent-workflow DAG nodes in the `graph` field of the response. Successor to the deprecated `GET /v1/traces/{traceId}/graph` sub-resource.


# Ingest traces
Source: https://docs.agentmark.co/api-reference/traces/ingest-traces

/openapi.yaml post /v1/traces
Ingest trace data in [OTLP (OpenTelemetry Protocol)](https://opentelemetry.io/docs/specs/otlp/) format.
Traces are buffered in a queue and processed asynchronously.

Supports gzip-compressed payloads via the `Content-Encoding: gzip` header.
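
For direct HTTP callers, a gzip-compressed request might be assembled as in the sketch below. It only builds the request object and does not send it. Several details are assumptions rather than documented behavior: that the gateway accepts OTLP's JSON encoding with `Content-Type: application/json`, the empty `resourceSpans` payload, and the helper name. Authentication headers are omitted; see the Authentication page for those.

```python
import gzip
import json
import urllib.request


def build_ingest_request(base_url: str, otlp_payload: dict) -> urllib.request.Request:
    """Build (but do not send) a gzip-compressed POST to /v1/traces."""
    raw = json.dumps(otlp_payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/traces",
        data=gzip.compress(raw),
        method="POST",
        headers={
            "Content-Type": "application/json",  # assumption: OTLP/JSON is accepted
            "Content-Encoding": "gzip",
        },
    )


# Local base URL from the overview page; the payload is an empty OTLP/JSON body.
request = build_ingest_request("http://localhost:9418", {"resourceSpans": []})
print(request.full_url, len(request.data))
```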


# List traces
Source: https://docs.agentmark.co/api-reference/traces/list-traces

/openapi.yaml get /v1/traces
Retrieve a paginated list of traces. Supports filtering by status, user, model, and date range.


# API versioning & stability
Source: https://docs.agentmark.co/api-reference/versioning

How the AgentMark Gateway API versions itself, what counts as a breaking change, and how deprecations are communicated.

## Current version

The AgentMark Gateway API is **v1**. Every endpoint is prefixed with `/v1/` (except `GET /health`, which is unversioned). The current version will remain available as `/v1/*` for the foreseeable future.

## Versioning strategy

Breaking changes are released under a **new path prefix**. When a future `/v2/` is introduced, `/v1/` will continue to work in parallel — you upgrade when you're ready, not when we ship.

This matches how Stripe, Twilio, and most mature public APIs version their endpoints. A path-based scheme is visible in every request, trivial to grep for in consumer code, and easy to pin in configuration.

## What's non-breaking

The following are **safe to ship within a version** — client code written today will keep working:

* Adding a new endpoint
* Adding a new optional response field
* Adding a new optional request parameter
* Adding a new enum value to a **request** parameter (you send more; we accept more)
* Adding a new response status code (documented)
* Widening a response field's type (e.g. from `integer` to `number`)
* Relaxing a validation rule (accepting inputs that were previously rejected)

If you use the [AgentMark SDK](/introduction/overview), these changes surface as non-breaking SDK releases (`sdk@1.x → sdk@1.y`).

## What's breaking

The following changes require a new version (`/v2/`) if we need them:

* Removing an endpoint
* Removing a response field
* Removing a request parameter
* Changing a response field's type in an incompatible or narrowing way (e.g. `string → number`)
* Adding a new enum value to a **response** field (you parse; we send something you don't recognize)
* Tightening a validation rule (rejecting inputs that were previously accepted)
* Changing authentication requirements
* Changing the required `Content-Type` of a request or response
* Changing an HTTP status code for an existing response class

Changes in this list will not ship to `/v1/` without a deprecation window (see below).

## Deprecation policy

When we need to break something, the timeline is:

1. **Announce** in the [changelog](/changelog) with the date the change will land.
2. **Add a deprecation notice** to the endpoint's OpenAPI entry (`deprecated: true`) and include a `Deprecation` header in live responses pointing at the replacement.
3. **Wait at least 90 days** between announcement and removal — longer for auth or billing-affecting changes.
4. **Ship the new version alongside the old.** Both work in parallel during the transition.
5. **Remove** after the notice window, only if telemetry shows usage has migrated.

Breaking changes are rare. We'd rather deprecate slowly than ship fast.
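
Since deprecated endpoints announce themselves with a `Deprecation` header, a client can surface that signal early, for example by logging it. The sketch below is a minimal, case-insensitive header check. The function name is hypothetical, and the header value format is not assumed.

```python
from typing import Mapping, Optional


def deprecation_notice(headers: Mapping[str, str]) -> Optional[str]:
    """Return the Deprecation header value if present (case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "deprecation":
            return value
    return None


# Example with a made-up header value.
notice = deprecation_notice({"Content-Type": "application/json", "deprecation": "true"})
if notice is not None:
    print(f"Endpoint is deprecated: {notice}")
```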

## What's not versioned

A few things are intentionally outside the version contract — they can change without a `/v2/` bump:

* **Error response body contents.** The shape is stable: every error returns `{ error: { code: string, message: string } }`. But the set of `code` values may grow over time (we add new error codes when we add new behaviors). Use `message` for human display and `code` for programmatic dispatch, and always keep a fallback branch for codes you don't recognize yet.

* **Error response `extras`.** Some errors include additional fields (`retry_after`, `required_permission`, etc.). New fields may appear; existing ones won't change shape.

* **Rate limit values.** Throughput caps adjust based on plan and infrastructure. The `429` response is stable; the exact threshold isn't.

* **Performance characteristics.** Latency targets, batch size guarantees, and read-after-write consistency windows are SLOs, not API contract.
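
The advice about unrecognized error codes can be sketched as a dispatch with an explicit fallback branch. The codes handled below all appear elsewhere in this reference; the function name and the returned strings are illustrative assumptions.

```python
def describe_error(body: dict) -> str:
    """Branch on error.code, falling back for codes this client does not know."""
    error = body.get("error") or {}
    code = error.get("code")
    message = error.get("message", "Unknown error")
    if code == "span_limit_exceeded":
        return f"Quota reached: {message}"
    if code in ("not_available_locally", "not_available_on_cloud"):
        return f"Not supported on this server: {message}"
    # Fallback: new codes may appear within /v1/, so never assume a closed set.
    return f"Unhandled error ({code}): {message}"


print(describe_error({"error": {"code": "some_future_code", "message": "Something new"}}))
```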

## Using the SDK insulates you from most of this

The [AgentMark SDK](/introduction/overview) handles version negotiation, retries, and response parsing. If you use the SDK, most of this page is transparent — a minor SDK bump follows a minor gateway release, a major SDK bump follows a major gateway release.

The direct HTTP API is supported for custom integrations, but the SDK is the path of least friction.

## Questions or migration help

* Changelog: [/changelog](/changelog)
* Contact: [hello@agentmark.co](mailto:hello@agentmark.co)


# Components
Source: https://docs.agentmark.co/build/components

Create reusable components to share prompting patterns across your prompts

AgentMark supports reusable components to help you maintain consistent prompting patterns across your prompts. Create components in the Dashboard or locally as files — either way, a component is an `.mdx`/`.md` file you import into your prompts.

## Create a component

<Tabs>
  <Tab title="Cloud">
    <img alt="Create component dialog in the Dashboard" />

    The Create Component dialog prompts for the component name and drops a new `.mdx` file into the selected folder. After you edit and publish it, it's available for import in any prompt on this app.

    1. Open the **Files** tab for your app.
    2. Click the kebab menu (⋮) next to the folder where the component should live.
    3. Select **Create component**, give it a name, and click **Create**.
    4. Edit the component in the visual editor, then click **Publish**.
  </Tab>

  <Tab title="Local">
    Create components as `.mdx` files in your `agentmark/` directory:

    ```shell theme={null}
    agentmark/
      ├── components/
      │   ├── math-instructions.mdx
      │   └── language-instructions.md
      └── prompts/
          └── example.prompt.mdx
    ```

    Add a new `.mdx` (or `.md`) file under `components/`. The file is picked up on save — if your app is synced to Cloud, the next `git push` makes it available to your deployed handler; in Local mode, `agentmark dev` reads it directly from disk.
  </Tab>
</Tabs>

## Using components

Import and use components in your prompts:

```jsx example.prompt.mdx theme={null}
import MathInstructions from '../components/math-instructions.mdx';

<System>
  You are a math tutor.
  <MathInstructions level={props.difficulty} />
</System>
```

Component example:

```jsx math-instructions.mdx theme={null}
You are a patient and knowledgeable math tutor. Follow these guidelines:

<If condition={props.level === "basic"}>
  - Explain concepts using simple, everyday examples
  - Break down problems into very small steps
  - Use visual representations when possible
  - Avoid complex mathematical notation
</If>

<If condition={props.level === "intermediate"}>
  - Provide detailed step-by-step solutions
  - Introduce formal mathematical notation gradually
  - Explain the reasoning behind each step
  - Include relevant formulas with explanations
</If>

<If condition={props.level === "advanced"}>
  - Use formal mathematical notation
  - Include theoretical foundations
  - Reference related concepts and theorems
  - Provide rigorous mathematical proofs
</If>
```

In the calling prompt, the parent passes `level` as a prop; the component reads it as `props.level`. Component props are scoped to the component — they don't collide with the parent prompt's props.

## Learn more

For the full component feature reference — props, conditional rendering, plain markdown imports, and composition — see [TemplateDX Components](/templatedx/components).

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Create a prompt
Source: https://docs.agentmark.co/build/creating-prompts

Create prompts in the Dashboard visual editor or as local .prompt.mdx files

<Tabs>
  <Tab title="Cloud">
    ### 1. Create a prompt

    Open the create menu in the Dashboard to start a new prompt.

    <img alt="Create-prompt menu in the Dashboard" />

    The **Create** menu in the top bar lets you add a new prompt, dataset, or component to the current app.

    ### 2. Name your prompt

    Give your prompt a descriptive name and choose the generation type (Text, Object, Image, or Speech).

    <img alt="Prompt name and generation-type selection dialog" />

    The new-prompt dialog takes a name (used as the filename under `agentmark/`) and a generation type — **Text**, **Object**, **Image**, or **Speech** — which determines the frontmatter config key and supported message tags.

    ### 3. Write your prompt

    Select your model and write your prompt using message tags.

    <img alt="Prompt editor with message tags" />

    The editor has a model selector, frontmatter for config (model, temperature, etc.), and a message-tag editor for the prompt body (`<System>`, `<User>`, `<Assistant>`). Syntax highlighting and inline validation help catch frontmatter or tag errors as you type.

    ### 4. Run your prompt

    Add your input variables and run the prompt to see results.

    <img alt="Running a prompt with input variables" />

    Fill the input-variable fields (any `{props.foo}` references in your prompt), then click **Run** to execute the prompt. The output streams into the right panel in real time.

    ### 5. Publish

    Publish to save your changes and make them available to your application.

    <img alt="Publish button in the prompt editor" />

    The **Publish** button commits your edits to the app's default Git branch and makes the new version available to your application via the SDK or CLI.

    Every change is automatically versioned. You can view the full version history, compare changes between versions, and roll back to any previous version from the Dashboard. See [Version Control](./version-control) for details.
  </Tab>

  <Tab title="Local">
    ### 1. Create a `.prompt.mdx` file

    Add a new file in your project's `agentmark/` directory:

    ```mdx agentmark/greeting.prompt.mdx theme={null}
    ---
    name: greeting
    text_config:
      model_name: gpt-4o-mini
      temperature: 0.7
    ---

    <System>
    You are a friendly assistant.
    </System>

    <User>
    Say hello to {props.name} and tell them something interesting.
    </User>
    ```

    ### 2. Choose a generation type

    The frontmatter key determines the output type:

    | Key             | Output          | Tags                                |
    | --------------- | --------------- | ----------------------------------- |
    | `text_config`   | Text response   | `<System>`, `<User>`, `<Assistant>` |
    | `object_config` | Structured JSON | `<System>`, `<User>`, `<Assistant>` |
    | `image_config`  | Image           | `<ImagePrompt>`                     |
    | `speech_config` | Audio           | `<SpeechPrompt>`, `<System>`        |

    [Learn more about generation types →](./generation-types/overview)

    ### 3. Run your prompt

    ```bash theme={null}
    npx agentmark run-prompt agentmark/greeting.prompt.mdx
    ```

    Or run it from your application code — see [Running Prompts](./running-prompts).

    ### 4. Iterate

    Edit the file, save, and run again. With `agentmark dev` running, changes are picked up automatically.

    Your prompts are version-controlled in git alongside your code. When synced to Cloud, every change is tracked with full history and rollback. See [Version Control](./version-control) for details.
  </Tab>
</Tabs>

## Next steps

<CardGroup>
  <Card title="Running Prompts" icon="play" href="./running-prompts">
    Execute prompts from your application via SDK
  </Card>

  <Card title="Playground" icon="flask-vial" href="./playground">
    Compare models and prompts side-by-side
  </Card>

  <Card title="Generation Types" icon="sparkles" href="./generation-types/overview">
    Text, object, image, and speech output
  </Card>

  <Card title="TemplateDX Syntax" icon="brackets-curly" href="/templatedx/syntax">
    Variables, conditionals, loops, and components
  </Card>
</CardGroup>



# Example prompts
Source: https://docs.agentmark.co/build/example-prompts

Copy-paste starter prompts that cover all four AgentMark generation types — object, text + tools, image, and speech — plus a runnable dataset for each.

Four working `.prompt.mdx` examples you can drop into your `agentmark/` directory and run with `npx agentmark run-prompt`. Each one covers a different generation type and a different AgentMark feature — pick the closest to what you're building and adapt.

<Note>
  These prompts were previously scaffolded by `npm create agentmark`. The CLI now does only the minimum bootstrap (`agentmark.json` + empty `agentmark/` + MCP wiring). The starter examples live here so you can take only the one you need, instead of being handed a four-prompt menu you didn't ask for.
</Note>

<Tabs>
  <Tab title="Object (party-planner)">
    Structured JSON output, schema validation, and a linked dataset with an eval — the canonical "extract this shape from text" prompt.

    **Demonstrates:** `object_config`, JSON schema validation, `test_settings.dataset`, `test_settings.evals` (`exact_match_json`).

    ```mdx agentmark/party-planner.prompt.mdx theme={null}
    ---
    name: party-planner
    object_config:
      model_name: openai/gpt-4o-mini
      schema:
        type: object
        properties:
          names:
            type: array
            description: "List of names of people attending the party."
            items:
              type: string
        required:
          - names
    test_settings:
      dataset: party.jsonl
      evals:
        - exact_match_json
      props:
        party_text: "We're having a party with Alice, Bob, and Carol."
    input_schema:
      type: object
      properties:
        party_text:
          type: string
          description: "A block of text describing the upcoming party and attendees."
      required:
        - party_text
    ---

    <System>
    Extract the names of all people attending the party from the following text. Respond with a list of names only.
    </System>

    <User>
    Text: {props.party_text}
    </User>
    ```

    ```jsonl agentmark/party.jsonl theme={null}
    {"input": {"party_text": "We're having a party with Alice, Bob, and Carol."}, "expected_output": "{\"names\": [\"Alice\", \"Bob\", \"Carol\"]}"}
    {"input": {"party_text": "The guest list includes Dave, Emma, and Frank."}, "expected_output": "{\"names\": [\"Dave\", \"Emma\", \"Frank\"]}"}
    {"input": {"party_text": "Join us for a celebration with Grace, Henry, and Isla."}, "expected_output": "{\"names\": [\"Grace\", \"Henry\", \"Isla\"]}"}
    ```

    ```bash Run it theme={null}
    # Single execution against the inline test_settings.props:
    npx agentmark run-prompt agentmark/party-planner.prompt.mdx

    # Full dataset with the exact_match_json eval:
    npx agentmark run-experiment agentmark/party-planner.prompt.mdx
    ```
  </Tab>

  <Tab title="Text + tools (customer-support)">
    A text-generation agent with tool use and a multi-call budget. Realistic shape for a support bot, knowledge-base lookups, or any agent that needs to chain tool calls before answering.

    **Demonstrates:** `text_config`, `tools:` (by name), `max_calls`, dataset for regression testing.

    <Warning>
      `tools: - search_knowledgebase` references a tool **by name**. You have to register the tool's implementation in your `agentmark.client.ts` (TypeScript) or `agentmark_client.py` (Python) before this prompt will execute end-to-end. See [Tools and agents](./tools-and-agents) for the wiring.
    </Warning>

    ```mdx agentmark/customer-support-agent.prompt.mdx theme={null}
    ---
    name: customer-support-agent
    text_config:
      model_name: openai/gpt-4o-mini
      max_calls: 2
      tools:
        - search_knowledgebase
    test_settings:
      dataset: customer-query.jsonl
      props:
        customer_question: "I'm having trouble with my order. How long does shipping take?"
    input_schema:
      type: object
      properties:
        customer_question:
          type: string
          description: "The customer's question"
      required:
        - customer_question
    ---

    <System>
    You are a customer service agent for a company that sells products online. You are given a customer's question and you need to respond to the customer. You need to be friendly, professional, and helpful.

    You have access to the following tool:
    - search_knowledgebase: Search the company knowledgebase for information about shipping, warranty, and returns. Use this when customers ask about these topics.
    </System>

    <User>{props.customer_question}</User>
    ```

    ```jsonl agentmark/customer-query.jsonl theme={null}
    {"input": {"customer_question": "My package hasn't arrived yet. Can you help me track it?"}}
    {"input": {"customer_question": "I received the wrong item in my order. What should I do?"}}
    {"input": {"customer_question": "How do I return a product that I purchased last week?"}}
    ```

    ```bash Run it theme={null}
    # Make sure search_knowledgebase is registered in your client first.
    npx agentmark run-prompt agentmark/customer-support-agent.prompt.mdx
    ```
  </Tab>

  <Tab title="Image (animal-drawing)">
    DALL-E image generation with a single input prop — the smallest possible end-to-end image prompt.

    **Demonstrates:** `image_config`, `<ImagePrompt>` tag, single-prop interpolation.

    ```mdx agentmark/animal-drawing.prompt.mdx theme={null}
    ---
    name: animal-drawing
    image_config:
      model_name: openai/dall-e-3
      num_images: 1
      size: 1024x1024
      aspect_ratio: 1:1
    test_settings:
      dataset: animal.jsonl
      props:
        animal: "cat"
    ---

    <ImagePrompt>
    Draw a hyper-realistic picture of a {props.animal}
    </ImagePrompt>
    ```

    ```jsonl agentmark/animal.jsonl theme={null}
    {"input": {"animal": "cat"}, "expected_output": "A realistic picture of a cat"}
    {"input": {"animal": "dog"}, "expected_output": "A realistic picture of a dog"}
    {"input": {"animal": "bird"}, "expected_output": "A realistic picture of a bird"}
    ```

    ```bash Run it theme={null}
    npx agentmark run-prompt agentmark/animal-drawing.prompt.mdx --props '{"animal":"otter"}'
    ```
  </Tab>

  <Tab title="Speech (story-teller)">
    Text-to-speech with OpenAI's `tts-1-hd`. Notice the `<SpeechPrompt>` tag — speech config uses it instead of `<User>`.

    **Demonstrates:** `speech_config`, `<SpeechPrompt>` tag, voice/speed/output-format options.

    ```mdx agentmark/story-teller.prompt.mdx theme={null}
    ---
    name: story-teller
    speech_config:
      model_name: openai/tts-1-hd
      voice: "nova"
      speed: 1.0
      output_format: "mp3"
    test_settings:
      dataset: story.jsonl
      props:
        story: "Once upon a time, there was a cat who loved to play with a ball."
    ---

    <System>
    You are a storyteller for children. Make sure your story is engaging and interesting.
    </System>

    <SpeechPrompt>
    - {props.story}
    </SpeechPrompt>
    ```

    ```jsonl agentmark/story.jsonl theme={null}
    {"input": {"story": "Once upon a time, the Moon woke up and found her glow missing! She floated around the sky asking stars, clouds, and even comets if they'd seen her light. It wasn't until she peeked into a mountain lake that she saw her glow shining back—hidden in her own reflection!"}}
    {"input": {"story": "Benny was no ordinary banana—he dreamed of becoming a superhero. One day, when a monkey slipped in the jungle and cried for help, Benny rolled into action, dodging vines and swinging from branches using his peel like a lasso."}}
    {"input": {"story": "In the town of Maplebrook, there was a library that whispered stories when no one was looking. Curious little Nia tiptoed in one rainy day and heard the books giggling softly."}}
    ```

    ```bash Run it theme={null}
    npx agentmark run-prompt agentmark/story-teller.prompt.mdx --props '{"story":"A whale who learned to fly."}'
    ```
  </Tab>
</Tabs>

## Wiring these into your project

Drop the `.prompt.mdx` file into `<your-project>/agentmark/` (the `agentmark/` directory `npm create agentmark` left empty). Drop the `.jsonl` dataset next to it. Then either run from the CLI as shown in each "Run it" block, or load by name from your SDK client.

If you ran `npm create agentmark` and then asked your AI tool to "Set up AgentMark in this project," the setup workflow will have proposed the right SDK package and client file for your stack — the recipes above slot into that wiring directly.
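
Before running an experiment against one of these datasets, it can help to confirm that every row supplies the props the prompt's `input_schema` marks as required. The TypeScript sketch below is one way to do that. `parseJsonl` and `missingProps` are hypothetical helpers written for this example, not part of the AgentMark SDK, and the dataset text is inlined so the snippet runs on its own.

```typescript
interface DatasetRow {
  input: Record<string, unknown>;
  expected_output?: string;
}

// Parse JSONL text into rows, skipping blank lines.
// Row numbers in error messages count non-blank lines only.
function parseJsonl(text: string): DatasetRow[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line, index) => {
      const row = JSON.parse(line);
      const hasInput =
        typeof row === "object" && row !== null &&
        typeof row.input === "object" && row.input !== null;
      if (!hasInput) {
        throw new Error(`row ${index + 1}: expected an object with an "input" object`);
      }
      return row as DatasetRow;
    });
}

// List every required prop that a row's input does not provide.
function missingProps(rows: DatasetRow[], required: string[]): string[] {
  const problems: string[] = [];
  rows.forEach((row, index) => {
    for (const key of required) {
      if (!(key in row.input)) {
        problems.push(`row ${index + 1}: missing input.${key}`);
      }
    }
  });
  return problems;
}

// Two rows copied from customer-query.jsonl above.
const dataset = `
{"input": {"customer_question": "My package hasn't arrived yet. Can you help me track it?"}}
{"input": {"customer_question": "How do I return a product that I purchased last week?"}}
`;

const rows = parseJsonl(dataset);
console.log(missingProps(rows, ["customer_question"])); // prints []
```

In a real project you would read the `.jsonl` file from disk and take the `required` list from the prompt's `input_schema` rather than hard-coding it.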

## Next steps

<CardGroup>
  <Card title="Create a prompt" icon="file-plus" href="./creating-prompts">
    Author your own .prompt.mdx files from scratch
  </Card>

  <Card title="Generation types" icon="sparkles" href="./generation-types/overview">
    Reference for text, object, image, and speech configs
  </Card>

  <Card title="Tools and agents" icon="wrench" href="./tools-and-agents">
    Wire tool implementations into prompts (used in customer-support)
  </Card>

  <Card title="Running experiments" icon="flask-vial" href="../evaluate/running-experiments">
    Datasets + evals (used in party-planner)
  </Card>
</CardGroup>



# File attachments
Source: https://docs.agentmark.co/build/file-attachments

Attach images and files to prompts using the ImageAttachment and FileAttachment tags.

## Overview

Attach images and other files to your prompts for tasks like image analysis, document processing, or any scenario where the model needs to work with external media.

## Components

<Warning>
  `<ImageAttachment>` and `<FileAttachment>` must be placed inside a `<User>` tag. Placing them inside `<System>` or `<Assistant>` throws `"ImageAttachment and FileAttachment tags must be inside User tag."` at template compile time.
</Warning>

### ImageAttachment

The `<ImageAttachment>` component attaches an image to your prompt:

```jsx theme={null}
<ImageAttachment image={props.imageLink} />
```

Parameters:

* `image` (required): URL to the image file
* `mimeType` (optional): MIME type of the image (e.g., `image/jpeg`, `image/png`)

### FileAttachment

The `<FileAttachment>` component attaches any type of file:

```jsx theme={null}
<FileAttachment data="https://example.com/document.pdf" mimeType={props.fileMimeType} />
```

Parameters:

* `data` (required): URL to the file
* `mimeType` (required): MIME type of the file (e.g., `application/pdf`, `text/plain`)
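
Because `mimeType` is required on `<FileAttachment>`, callers that only have a URL need some way to fill a prop such as `fileMimeType`. A minimal TypeScript sketch, assuming the file extension in the URL can be trusted; `mimeTypeForUrl` and its lookup table are hypothetical helpers, not AgentMark APIs:

```typescript
// Small extension-to-MIME table; extend it for the file types you accept.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: "application/pdf",
  txt: "text/plain",
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
};

// Returns the MIME type implied by the URL's file extension,
// or undefined when there is no extension or it is not in the table.
function mimeTypeForUrl(url: string): string | undefined {
  const pathname = new URL(url).pathname; // drops query string and fragment
  const dot = pathname.lastIndexOf(".");
  if (dot === -1) return undefined;
  return MIME_BY_EXTENSION[pathname.slice(dot + 1).toLowerCase()];
}

console.log(mimeTypeForUrl("https://example.com/document.pdf")); // application/pdf
```

When the helper returns `undefined`, it is safer to reject the file than to guess, since the attachment needs an explicit type.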

## Example

A complete prompt using both attachment types:

```jsx describe-media.prompt.mdx theme={null}
---
name: describe-media
text_config:
  model_name: gpt-4o-mini
---

<System>
You are an observer that comments on images and files.
</System>

<User>
  {props.userMessage}
  <ImageAttachment image={props.imageLink} />
  <FileAttachment data="https://example.com/document.pdf" mimeType={props.fileMimeType} />
</User>
```



# Generating images
Source: https://docs.agentmark.co/build/generation-types/image

Generate images from prompts using AgentMark with DALL-E, Stable Diffusion, and other image models.

AgentMark generates images with prompts that declare `image_config` in frontmatter. The image description itself goes in an `<ImagePrompt>` tag.

## Example configuration

```jsx example.prompt.mdx theme={null}
---
name: image
image_config:
  model_name: dall-e-3
  num_images: 1
  size: 1024x1024
  aspect_ratio: 1:1
  seed: 12345
---

<ImagePrompt>
A futuristic cityscape at sunset with flying cars and neon lights
</ImagePrompt>
```

## Tags

| Tag             | Description                                                                                                                      |
| --------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `<ImagePrompt>` | The text description for image generation. AgentMark reads the contents at compile time and sends it to the model as the prompt. |

## Available configuration

| Property       | Type     | Description                                                               | Required |
| -------------- | -------- | ------------------------------------------------------------------------- | -------- |
| `model_name`   | `string` | The name of the model to use for image generation.                        | Yes      |
| `num_images`   | `number` | Number of images to generate.                                             | No       |
| `size`         | `string` | Image dimensions in format `WIDTHxHEIGHT` (e.g., `1024x1024`, `512x512`). | No       |
| `aspect_ratio` | `string` | Aspect ratio in format `WIDTH:HEIGHT` (e.g., `1:1`, `16:9`, `9:16`).      | No       |
| `seed`         | `number` | Random-number seed for reproducibility.                                   | No       |
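
The `size` and `aspect_ratio` formats above are easy to check before a prompt runs. The TypeScript sketch below parses both values and tests whether they describe the same shape. Whether AgentMark or a given provider requires the two to agree is not stated here, so treat this as an optional sanity check; `parsePair` and `sizeMatchesAspectRatio` are hypothetical names.

```typescript
// Parse "AxB" or "A:B" into two positive integers.
function parsePair(value: string, separator: "x" | ":"): [number, number] {
  const match = new RegExp(`^(\\d+)${separator}(\\d+)$`).exec(value);
  if (!match) {
    throw new Error(`"${value}" is not in the form N${separator}N`);
  }
  const first = Number(match[1]);
  const second = Number(match[2]);
  if (first === 0 || second === 0) {
    throw new Error(`"${value}" must use positive numbers`);
  }
  return [first, second];
}

function sizeMatchesAspectRatio(size: string, aspectRatio: string): boolean {
  const [width, height] = parsePair(size, "x");
  const [ratioWidth, ratioHeight] = parsePair(aspectRatio, ":");
  // Cross-multiply to compare the ratios without floating-point division.
  return width * ratioHeight === height * ratioWidth;
}

console.log(sizeMatchesAspectRatio("1024x1024", "1:1")); // true
console.log(sizeMatchesAspectRatio("1024x1024", "16:9")); // false
```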

## Running an image prompt

See [Running prompts → Image generation](/build/running-prompts#image-generation) for the SDK code pattern using Vercel AI SDK's `experimental_generateImage`. (The `experimental_` prefix is upstream — the API may evolve.)



# Generating objects
Source: https://docs.agentmark.co/build/generation-types/object

Generate structured JSON objects from prompts with schema validation using AgentMark.

AgentMark generates structured objects with prompts that declare `object_config` in frontmatter and a JSON schema for the expected output. Object prompts use the same message-role tags as text prompts.

## Example configuration

```jsx example.prompt.mdx theme={null}
---
name: example
object_config:
  model_name: gpt-4
  schema:
    type: object
    properties:
      event:
        type: object
        properties:
          name:
            type: string
            description: The name of the event
          date:
            type: string
            description: The date of the event
          attendees:
            type: array
            items:
              type: object
              properties:
                name:
                  type: string
                  description: The name of the attendee
                role:
                  type: string
                  description: The role of the attendee
              required:
                - name
                - role
        required:
          - name
          - date
          - attendees
---

<System>You are an event planner that creates detailed event objects with attendees and their roles.</System>
<User>Create an event for a team meeting next Friday with John as the facilitator and Sarah as the note-taker.</User>
```
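
On the consuming side, it can help to mirror the schema as a type so the generated object is checked before use. The TypeScript sketch below hand-writes an interface and a runtime guard for the example schema above. It is an illustration only, not how AgentMark validates output, and `EventResult` and `isEventResult` are hypothetical names. Note one assumption: the guard treats `event` as required, which is stricter than the schema, since the schema lists no top-level `required` array.

```typescript
interface Attendee {
  name: string;
  role: string;
}

interface EventResult {
  event: {
    name: string;
    date: string;
    attendees: Attendee[];
  };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isAttendee(value: unknown): value is Attendee {
  return isRecord(value) && typeof value.name === "string" && typeof value.role === "string";
}

// Checks the fields the schema marks as required on `event` and on each attendee.
function isEventResult(value: unknown): value is EventResult {
  if (!isRecord(value)) return false;
  const event = value.event;
  if (!isRecord(event)) return false;
  const attendees = event.attendees;
  return (
    typeof event.name === "string" &&
    typeof event.date === "string" &&
    Array.isArray(attendees) &&
    attendees.every(isAttendee)
  );
}
```

A guard like this is most useful at the boundary where the generated object leaves the SDK call and enters your own code.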

## Tags

| Tag           | Description                                                  |
| ------------- | ------------------------------------------------------------ |
| `<System>`    | System-level instructions                                    |
| `<User>`      | User message                                                 |
| `<Assistant>` | Assistant message (optional — include for few-shot examples) |

## Using schema references

Instead of writing a full schema inline, you can extract it into a `.json` file and use `$ref` to reference it. AgentMark resolves the reference at build time.

```jsx extract-event.prompt.mdx theme={null}
---
name: extract-event
object_config:
  model_name: gpt-4
  schema:
    $ref: ./schemas/event.json
---

<System>You are an event planner that creates detailed event objects.</System>
<User>Create an event for a team meeting next Friday with John as the facilitator.</User>
```

See [Schema references](/build/schema-references) for full documentation on `$ref` syntax, transitive references, and JSON Pointer fragments.

## Available configuration

| Property             | Type         | Description                                                                                                                         | Required |
| -------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------- | -------- |
| `model_name`         | `string`     | The name of the model to use for object generation.                                                                                 | Yes      |
| `schema`             | `JSONSchema` | Schema defining the expected structure of the model's output. Supports [`$ref`](/build/schema-references) for reusable definitions. | Yes      |
| `max_tokens`         | `number`     | Maximum number of tokens to generate.                                                                                               | No       |
| `temperature`        | `number`     | Controls the randomness of the output; higher values are more random.                                                               | No       |
| `max_calls`          | `number`     | Maximum number of LLM calls allowed.                                                                                                | No       |
| `top_p`              | `number`     | Cumulative probability for nucleus sampling.                                                                                        | No       |
| `top_k`              | `number`     | Limits next-token selection to the top `k` tokens.                                                                                  | No       |
| `presence_penalty`   | `number`     | Penalizes tokens based on presence in the text so far.                                                                              | No       |
| `frequency_penalty`  | `number`     | Penalizes tokens based on frequency in the text so far.                                                                             | No       |
| `stop_sequences`     | `string[]`   | Strings that, if encountered, stop generation.                                                                                      | No       |
| `seed`               | `number`     | Random-number seed for reproducibility.                                                                                             | No       |
| `max_retries`        | `number`     | Maximum number of retries on failure.                                                                                               | No       |
| `schema_name`        | `string`     | Name sent with the schema (used by OpenAI structured outputs).                                                                      | No       |
| `schema_description` | `string`     | Description sent with the schema.                                                                                                   | No       |
| `tools`              | `string[]`   | List of tool names or MCP URIs available to the model.                                                                              | No       |

## Running an object prompt

See [Running prompts → Object generation](/build/running-prompts#object-generation) for the SDK code patterns for `generateObject` / `streamObject` (TypeScript) and `run_object_prompt` (Python).



# Generation types overview
Source: https://docs.agentmark.co/build/generation-types/overview

Understand the different types of content you can generate with AgentMark prompts

Generation types define what kind of output your prompt will produce. AgentMark supports four types, each optimized for different use cases:

* **Text** — Natural language responses for chatbots, content generation, and analysis
* **Object** — Structured JSON data with schema validation for APIs and data extraction
* **Image** — Visual content generation using models like DALL-E 3
* **Speech** — Spoken audio for voice applications and text-to-speech

## Choosing the right type

| Type       | Best for                           | Output format    | Example use cases                               |
| ---------- | ---------------------------------- | ---------------- | ----------------------------------------------- |
| **Text**   | Conversational AI, content writing | String           | Chatbots, summarization, Q\&A                   |
| **Object** | Structured data extraction         | JSON with schema | Form parsing, data normalization, API responses |
| **Image**  | Visual content creation            | Image file       | Marketing assets, illustrations, prototypes     |
| **Speech** | Voice applications                 | Audio file       | Podcasts, audiobooks, voice assistants          |

## Configuration

Each generation type is configured in the prompt's frontmatter using a specific config key:

```mdx theme={null}
---
name: my-prompt
text_config:        # For text generation
  model_name: gpt-4o
  temperature: 0.7
---
```

```mdx theme={null}
---
name: extract-data
object_config:      # For object generation
  model_name: gpt-4o
  schema:
    type: object
    properties:
      name:
        type: string
---
```

```mdx theme={null}
---
name: create-image
image_config:       # For image generation
  model_name: dall-e-3
  size: 1024x1024
---
```

```mdx theme={null}
---
name: text-to-speech
speech_config:      # For speech generation
  model_name: tts-1
  voice: alloy
---
```

## Loading prompts

Use the appropriate loader method based on your generation type:

```typescript theme={null}
// Text generation
const textPrompt = await client.loadTextPrompt('my-prompt');

// Object generation
const objectPrompt = await client.loadObjectPrompt('extract-data');

// Image generation
const imagePrompt = await client.loadImagePrompt('create-image');

// Speech generation
const speechPrompt = await client.loadSpeechPrompt('text-to-speech');
```
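
The pairing between frontmatter config keys and loader methods can be written down as a lookup table. The TypeScript sketch below does that using only the key and method names listed on this page. The `loaderFor` helper is hypothetical, and the rule that a prompt declares exactly one config key is an assumption of the sketch, not something this page states.

```typescript
// Config key -> loader method name, as listed on this page.
const LOADER_BY_CONFIG_KEY = {
  text_config: "loadTextPrompt",
  object_config: "loadObjectPrompt",
  image_config: "loadImagePrompt",
  speech_config: "loadSpeechPrompt",
} as const;

type ConfigKey = keyof typeof LOADER_BY_CONFIG_KEY;

// Given parsed frontmatter, return the name of the loader method to call.
// Assumption: a prompt declares exactly one of the four config keys.
function loaderFor(frontmatter: Record<string, unknown>): string {
  const present = (Object.keys(LOADER_BY_CONFIG_KEY) as ConfigKey[]).filter(
    (key) => key in frontmatter,
  );
  if (present.length !== 1) {
    throw new Error(`expected exactly one config key, found ${present.length}`);
  }
  return LOADER_BY_CONFIG_KEY[present[0]];
}

console.log(loaderFor({ name: "extract-data", object_config: {} })); // loadObjectPrompt
```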

## Detailed guides

<CardGroup>
  <Card title="Text generation" icon="message" href="/build/generation-types/text">
    Natural language responses with conversation history
  </Card>

  <Card title="Object generation" icon="brackets-curly" href="/build/generation-types/object">
    Structured JSON with schema validation
  </Card>

  <Card title="Image generation" icon="image" href="/build/generation-types/image">
    Visual content with DALL-E and similar models
  </Card>

  <Card title="Speech generation" icon="microphone" href="/build/generation-types/speech">
    Audio synthesis with voice customization
  </Card>
</CardGroup>



# Generating speech
Source: https://docs.agentmark.co/build/generation-types/speech

Generate speech audio from prompts using AgentMark with text-to-speech models.

AgentMark generates speech audio with prompts that declare `speech_config` in frontmatter. The text to speak goes in a `<SpeechPrompt>` tag.

## Example configuration

```jsx example.prompt.mdx theme={null}
---
name: speech
speech_config:
  model_name: tts-1-hd
  voice: "nova"
  speed: 1.0
  output_format: "mp3"
---

<System>
Please read this text aloud.
</System>

<SpeechPrompt>
This is a test for the speech prompt to be spoken aloud.
</SpeechPrompt>
```

## Tags

| Tag              | Description                                                                                                |
| ---------------- | ---------------------------------------------------------------------------------------------------------- |
| `<SpeechPrompt>` | The text to convert to speech. AgentMark reads the contents at compile time and sends it to the TTS model. |
| `<System>`       | Optional system-level instructions passed to models that support them.                                     |

## Available configuration

| Property        | Type     | Description                                                                                                     | Required |
| --------------- | -------- | --------------------------------------------------------------------------------------------------------------- | -------- |
| `model_name`    | `string` | The name of the model to use for speech generation.                                                             | Yes      |
| `voice`         | `string` | Voice identifier (provider-specific; e.g. for OpenAI TTS: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`). | No       |
| `output_format` | `string` | Audio output format (e.g., `mp3`, `opus`, `aac`, `flac`).                                                       | No       |
| `instructions`  | `string` | Additional instructions for speech generation (provider-specific).                                              | No       |
| `speed`         | `number` | Playback speed multiplier.                                                                                      | No       |
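The table above can be read as a TypeScript type. The sketch below transcribes it and adds a small check for the one required field and the listed value types. `SpeechConfig` and `validateSpeechConfig` are hypothetical names used for illustration; AgentMark's own validation may be stricter or differ in detail.

```typescript
// Shape of `speech_config`, transcribed from the table above.
type SpeechConfig = {
  model_name: string;
  voice?: string;
  output_format?: string;
  instructions?: string;
  speed?: number;
};

// Hypothetical validator: returns a list of problems, empty when the value is valid.
function validateSpeechConfig(value: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const modelName = value["model_name"];
  if (typeof modelName !== "string" || modelName === "") {
    problems.push("model_name is required and must be a non-empty string");
  }
  for (const key of ["voice", "output_format", "instructions"] as const) {
    if (value[key] !== undefined && typeof value[key] !== "string") {
      problems.push(`${key} must be a string`);
    }
  }
  if (value["speed"] !== undefined && typeof value["speed"] !== "number") {
    problems.push("speed must be a number");
  }
  return problems;
}

// Same values as the example configuration earlier on this page.
const exampleConfig: SpeechConfig = {
  model_name: "tts-1-hd",
  voice: "nova",
  speed: 1.0,
  output_format: "mp3",
};
console.log(validateSpeechConfig(exampleConfig)); // prints []
```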

## Running a speech prompt

See [Running prompts → Speech generation](/build/running-prompts#speech-generation) for the SDK code pattern using Vercel AI SDK's `experimental_generateSpeech`. (The `experimental_` prefix is upstream — the API may evolve.)



# Generating text
Source: https://docs.agentmark.co/build/generation-types/text

Generate text completions from prompts using AgentMark with any LLM provider.

AgentMark generates text with prompts that declare a `text_config` in frontmatter. Text prompts use message-role tags (`<System>`, `<User>`, `<Assistant>`) and return a string.

## Example configuration

```jsx example.prompt.mdx theme={null}
---
name: example
text_config:
  model_name: gpt-4o-mini
---

<System>You are a math tutor that can perform calculations.</System>
<User>What's 235 * 18?</User>
```

## Tags

| Tag           | Description                                                                        |
| ------------- | ---------------------------------------------------------------------------------- |
| `<System>`    | System-level instructions                                                          |
| `<User>`      | User message                                                                       |
| `<Assistant>` | Assistant message (optional — include for few-shot examples or prior-turn context) |

## Available configuration

| Property            | Type                                                                    | Description                                                                                                              | Required |
| ------------------- | ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | -------- |
| `model_name`        | `string`                                                                | The name of the model to use for text generation.                                                                        | Yes      |
| `max_tokens`        | `number`                                                                | Maximum number of tokens to generate.                                                                                    | No       |
| `temperature`       | `number`                                                                | Controls the randomness of the output; higher values are more random.                                                    | No       |
| `max_calls`         | `number`                                                                | Maximum number of LLM calls allowed (for agent workflows).                                                               | No       |
| `top_p`             | `number`                                                                | Cumulative probability for nucleus sampling.                                                                             | No       |
| `top_k`             | `number`                                                                | Limits next-token selection to the top `k` tokens.                                                                       | No       |
| `presence_penalty`  | `number`                                                                | Penalizes tokens based on presence in the text so far, encouraging new topics.                                           | No       |
| `frequency_penalty` | `number`                                                                | Penalizes tokens based on frequency in the text so far, reducing verbatim repetition.                                    | No       |
| `stop_sequences`    | `string[]`                                                              | Strings that, if encountered, stop generation.                                                                           | No       |
| `seed`              | `number`                                                                | Random-number seed for reproducibility.                                                                                  | No       |
| `max_retries`       | `number`                                                                | Maximum number of retries on failure.                                                                                    | No       |
| `tool_choice`       | `"auto" \| "none" \| "required" \| { type: "tool", tool_name: string }` | Controls how tools are used during generation.                                                                           | No       |
| `tools`             | `string[]`                                                              | List of tool names or MCP URIs available to the model. Tools resolve from the `tools` passed to `createAgentMarkClient`. | No       |
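The `tool_choice` row is the only property with a non-trivial type. As a rough sketch, the union can be written out and checked with a type guard. `ToolChoice` and `isToolChoice` are hypothetical names for illustration, not exports of any AgentMark package.

```typescript
// The `tool_choice` union, transcribed from the table above.
type ToolChoice =
  | "auto"
  | "none"
  | "required"
  | { type: "tool"; tool_name: string };

// Hypothetical type guard: true when `value` is one of the four accepted forms.
function isToolChoice(value: unknown): value is ToolChoice {
  if (value === "auto" || value === "none" || value === "required") return true;
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return candidate["type"] === "tool" && typeof candidate["tool_name"] === "string";
}

console.log(isToolChoice("required")); // prints true
console.log(isToolChoice({ type: "tool", tool_name: "summarize" })); // prints true
console.log(isToolChoice("always")); // prints false
```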

## Running a text prompt

See [Running prompts → Text generation](/build/running-prompts#text-generation) for the SDK code patterns for `generateText` (TypeScript) and `run_text_prompt` (Python).



# MCP integration
Source: https://docs.agentmark.co/build/mcp

Use Model Context Protocol (MCP) tools from AgentMark prompts via adapters.

AgentMark supports calling Model Context Protocol (MCP) tools directly from your prompts.

## What is MCP?

The **Model Context Protocol (MCP)** is an open standard that allows AI applications to connect to external tools and data sources. Instead of hardcoding tool implementations, MCP lets you:

* **Connect to tool servers** — Use pre-built MCP servers for filesystems, databases, APIs, and more
* **Standardize tool interfaces** — All MCP tools follow the same protocol, making them interchangeable
* **Share tools across projects** — One MCP server can serve multiple AI applications

<Note>
  Think of MCP like USB for AI tools. Just as USB provides a standard way to connect peripherals to computers, MCP provides a standard way to connect tools to AI applications.
</Note>

## How MCP works with AgentMark

1. You configure **MCP servers** in your AgentMark client (either a local process or a remote URL)
2. You reference **MCP tools** in your prompt frontmatter using `mcp://{server}/{tool}` syntax
3. At runtime, AgentMark connects to the server and makes those tools available to the AI model

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Your Prompt    │────▶│  AgentMark      │────▶│  MCP Server     │
│  mcp://fs/read  │     │  Client         │     │  (filesystem)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```

***

## What you'll learn

* Configure MCP servers (local process or remote URL)
* Reference MCP tools in prompts using `mcp://` URIs
* Combine MCP tools with other tool names
* Use environment variable interpolation for secrets

***

## MCP server types

AgentMark supports two types of MCP servers:

| Type        | Use Case                             | Configuration                   |
| ----------- | ------------------------------------ | ------------------------------- |
| **stdio**   | Local tools that run as a subprocess | `command`, `args`, `cwd`, `env` |
| **URL/SSE** | Remote tools accessed over HTTP      | `url`, `headers`                |

### stdio servers (local process)

The server runs as a child process on your machine. AgentMark communicates with it via stdin/stdout.

```typescript theme={null}
{
  filesystem: {
    command: "npx",
    args: ["-y", "@modelcontextprotocol/server-filesystem", "./data"],
    cwd: "/path/to/project",  // optional working directory
    env: { NODE_ENV: "production" },  // optional environment
  }
}
```

**When to use:** Local development, accessing local files, running custom tools.

### URL servers (remote HTTP)

The server runs remotely and accepts requests over HTTP with Server-Sent Events (SSE).

```typescript theme={null}
{
  docs: {
    url: "https://docs.example.com/mcp",
    headers: { Authorization: "Bearer your-token" },  // optional auth
  }
}
```

**When to use:** Shared team tools, cloud-hosted services, production deployments.

***

## 1) Configure MCP servers

Define servers when creating your AgentMark client. You can mix both server types.

Use `env('VAR_NAME')` to interpolate environment variables and keep secrets out of your code. The variable name must be quoted, and the `env(...)` expression must be the **entire** string value.

```ts theme={null}
import { openai } from "@ai-sdk/openai";
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
import { ApiLoader } from "@agentmark-ai/loader-api";
import { tool } from "ai";
import { z } from "zod";

const loader = new ApiLoader({ apiKey: process.env.AGENTMARK_API_KEY! });
const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai });

const summarizeTool = tool({
  description: "Summarize a block of text",
  inputSchema: z.object({
    text: z.string(),
    maxSentences: z.number().optional().default(2),
  }),
  execute: async ({ text, maxSentences }) => {
    const sentences = String(text).split(/(?<=[.!?])\s+/).slice(0, maxSentences);
    return { summary: sentences.join(" ") };
  },
});

const agentMark = createAgentMarkClient({
  loader,
  modelRegistry,
  tools: { summarize: summarizeTool },
  mcpServers: {
    docs: {
      url: "env('AGENTMARK_MCP_SSE_URL')",
      headers: { Authorization: `Bearer ${process.env.MCP_TOKEN}` },
    },
    local: {
      command: "npx",
      args: ["-y", "@mastra/mcp-docs-server"],
      env: { NODE_ENV: "production" },
    },
  },
});
```

Server configuration rules:

* URL servers accept only `url` and optional `headers`.
* stdio servers accept only `command`, `args`, `cwd`, `env`.
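The two rules above amount to a simple check on the keys of each server entry. The following is a minimal sketch of that check, not AgentMark's own validation: `classifyServer`, `StdioServer`, and `UrlServer` are hypothetical names, and treating `url` or `command` as the discriminating key is an assumption based on the table in "MCP server types".

```typescript
// Field names come from the rules above; the types and helper are illustrative.
type StdioServer = {
  command: string;
  args?: string[];
  cwd?: string;
  env?: Record<string, string>;
};
type UrlServer = { url: string; headers?: Record<string, string> };

const STDIO_KEYS = ["command", "args", "cwd", "env"];
const URL_KEYS = ["url", "headers"];

// Returns the server type, or throws when keys from both types are mixed.
function classifyServer(config: Record<string, unknown>): "stdio" | "url" {
  const kind = "url" in config ? "url" : "command" in config ? "stdio" : undefined;
  if (kind === undefined) {
    throw new Error("Server config needs either `url` or `command`");
  }
  const allowed = kind === "url" ? URL_KEYS : STDIO_KEYS;
  const extra = Object.keys(config).filter((key) => !allowed.includes(key));
  if (extra.length > 0) {
    throw new Error(`${kind} servers do not accept: ${extra.join(", ")}`);
  }
  return kind;
}

const docs: UrlServer = { url: "https://docs.example.com/mcp" };
const local: StdioServer = { command: "npx", args: ["-y", "@mastra/mcp-docs-server"] };

console.log(classifyServer(docs)); // prints "url"
console.log(classifyServer(local)); // prints "stdio"
```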

### Environment interpolation

`env('VAR')` is parsed by AgentMark and resolved from `process.env.VAR` at runtime. The pattern requires:

* **Quoted** variable name inside `env(...)` — `env('VAR')` or `env("VAR")`.
* The `env(...)` expression must be the **entire** string value (anchored match). It cannot be a substring.

For cases that need a prefix (like `Bearer`), use a regular JS template string with `process.env.X` directly:

```ts theme={null}
const agentMark = createAgentMarkClient({
  loader,
  modelRegistry,
  mcpServers: {
    docs: {
      url: "env('DOCS_MCP_URL')",
      headers: { Authorization: `Bearer ${process.env.MCP_TOKEN}` },
    },
    local: {
      command: "env('NODE_BIN')",
      args: ["-y", "@mastra/mcp-docs-server"],
    },
  },
});
```
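To make the matching rules concrete, here is a minimal sketch of an anchored `env(...)` matcher. It is not AgentMark's parser. The regular expression, the `resolveEnvValue` name, the allowed characters in a variable name, and the choice to throw on an unset variable are all assumptions made for illustration. The environment is passed in as a plain object so the example runs without touching `process.env`.

```typescript
// Anchored pattern: the whole string must be env('NAME') or env("NAME"),
// with matching quotes. The allowed name characters are an assumption.
const ENV_PATTERN = /^env\((['"])([A-Za-z_][A-Za-z0-9_]*)\1\)$/;

function resolveEnvValue(
  value: string,
  env: Record<string, string | undefined>,
): string {
  const match = ENV_PATTERN.exec(value);
  if (match === null) return value; // not an env(...) expression: keep the literal string
  const name = match[2] as string;
  const resolved = env[name];
  if (resolved === undefined) {
    // Assumption: an unset variable is treated as an error.
    throw new Error(`Environment variable ${name} is not set`);
  }
  return resolved;
}

const fakeEnv = { DOCS_MCP_URL: "https://docs.example.com/mcp", MCP_TOKEN: "secret" };

console.log(resolveEnvValue("env('DOCS_MCP_URL')", fakeEnv)); // prints the URL
console.log(resolveEnvValue("Bearer env('MCP_TOKEN')", fakeEnv)); // unchanged: substring, not the entire value
console.log(resolveEnvValue("env(DOCS_MCP_URL)", fakeEnv)); // unchanged: name is not quoted
```

The second call shows why a `Bearer` prefix needs a template string: under an anchored match, an `env(...)` expression embedded in a longer string is left as literal text.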

## 2) Reference MCP tools in prompts

Declare MCP tools in your prompt frontmatter. You can mix MCP tools with other tool names.

```mdx mcp-example.prompt.mdx theme={null}
---
name: mcp-example
text_config:
  model_name: gpt-4
  tools:
    - mcp://docs/web-search
    - summarize
---

<System>
Use the web-search tool to look up relevant documentation when needed.
Use the summarize tool to condense content into a short summary.
</System>

<User>
Find the page that explains MCP integration and summarize it in 2 sentences.
</User>
```

* `mcp://docs/web-search` resolves to the MCP server `docs`, tool `web-search`.
* `summarize` is a tool provided via the `tools` option in your client configuration.
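The resolution described above can be sketched as a small parser that splits each `tools` entry into either an MCP reference or a plain tool name. `parseToolRef` and `ToolRef` are hypothetical names used for illustration; AgentMark's internal parsing may handle edge cases differently.

```typescript
type ToolRef =
  | { kind: "mcp"; server: string; tool: string }
  | { kind: "local"; name: string };

// Split a frontmatter `tools` entry following the mcp://{server}/{tool} syntax.
function parseToolRef(entry: string): ToolRef {
  const prefix = "mcp://";
  if (!entry.startsWith(prefix)) {
    return { kind: "local", name: entry }; // plain name, resolved from the client `tools` option
  }
  const rest = entry.slice(prefix.length);
  const slash = rest.indexOf("/");
  if (slash <= 0 || slash === rest.length - 1) {
    throw new Error(`Expected mcp://{server}/{tool}, got "${entry}"`);
  }
  return { kind: "mcp", server: rest.slice(0, slash), tool: rest.slice(slash + 1) };
}

console.log(parseToolRef("mcp://docs/web-search"));
// prints { kind: 'mcp', server: 'docs', tool: 'web-search' }
console.log(parseToolRef("summarize"));
// prints { kind: 'local', name: 'summarize' }
```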

### Wildcard: include all tools from a server

Include every tool exported by a server using `*`:

```mdx theme={null}
---
text_config:
  tools:
    - mcp://docs/*
---
```

* All tools exported by the `docs` server are included by their original names.
* If a tool name collides with an existing tool, the later-added tool overwrites the earlier one.
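The collision rule is ordinary "last write wins" merging. The sketch below shows the effect with plain objects. `mergeTools` is a hypothetical helper, and the order in which the sources are merged is chosen only for the example; this page does not state the order AgentMark uses.

```typescript
// Tool name -> label describing where the tool came from (for illustration).
type ToolMap = Record<string, string>;

// Merge tool maps in order; a later source replaces an earlier entry with the same name.
function mergeTools(...sources: ToolMap[]): ToolMap {
  const merged: ToolMap = {};
  for (const source of sources) {
    for (const [name, origin] of Object.entries(source)) {
      merged[name] = origin;
    }
  }
  return merged;
}

const clientTools: ToolMap = { summarize: "client tools option" };
// Pretend the `docs` server exports a tool that is also called `summarize`.
const docsServerTools: ToolMap = { "web-search": "mcp://docs", summarize: "mcp://docs" };

const merged = mergeTools(clientTools, docsServerTools);
console.log(merged.summarize); // prints "mcp://docs": the later-added tool wins
```

If both tools need to stay available, listing the server's tools individually instead of using `*` avoids pulling in the colliding name.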

## 3) Format and run (AI SDK example)

```ts theme={null}
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
import { ApiLoader } from "@agentmark-ai/loader-api";
import { tool } from "ai";
import { z } from "zod";

import { openai } from "@ai-sdk/openai";

const loader = new ApiLoader({ apiKey: process.env.AGENTMARK_API_KEY! });
const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai });

const summarizeTool = tool({
  description: "Summarize a block of text",
  inputSchema: z.object({
    text: z.string(),
    maxSentences: z.number().optional().default(2),
  }),
  execute: async ({ text, maxSentences }) => {
    const sentences = String(text).split(/(?<=[.!?])\s+/).slice(0, maxSentences);
    return { summary: sentences.join(" ") };
  },
});

const agentMark = createAgentMarkClient({
  loader,
  modelRegistry,
  tools: { summarize: summarizeTool },
  mcpServers: {
    // The key ("test") is the server name that prompts reference as mcp://test/{tool}.
    test: { command: "npx", args: ["-y", "@mastra/mcp-docs-server"] },
  },
});

(async () => {
  const prompt = await agentMark.loadTextPrompt("./mcp-text.prompt.mdx");
  const input = await prompt.format();
  // Pass input to your AI SDK, e.g. generateText(input)
})();
```

## Notes and best practices

* Keep server configs minimal — URL servers need only `url`, stdio servers need only `command`.
* Prefer environment interpolation for portability and secrets hygiene.
* Use wildcard import (`mcp://server/*`) to quickly expose a server’s full tool surface; be mindful of name collisions.
* Define tools using the AI SDK `tool()` helper and pass them via the `tools` option in `createAgentMarkClient`.



# Build
Source: https://docs.agentmark.co/build/overview

Create, run, and version prompts — in code or in the Dashboard

Create prompts as `.prompt.mdx` files in your editor, or use the visual editor in the Dashboard — both produce the same format and are fully interchangeable. Run them via SDK, CLI, or the Dashboard Playground.

## What are AgentMark prompts?

AgentMark prompts are `.prompt.mdx` files that combine the readability of Markdown with the power of JSX templating. They provide a structured, version-controllable way to define LLM prompts with type safety and reusability.

* **Readable** — Human-friendly syntax that's easy to review and understand
* **Reusable** — Share components across prompts and use variables for dynamic content
* **Type-Safe** — Full TypeScript support for props and outputs
* **Version-Controlled** — Store prompts in git alongside your code
* **Testable** — Run experiments with datasets and automated evaluations

## Basic structure

Every AgentMark prompt consists of two parts:

### 1. Frontmatter (YAML)

Defines the prompt's metadata and configuration:

```mdx example.prompt.mdx theme={null}
---
name: example
text_config:
  model_name: gpt-4o-mini
  temperature: 0.7
---
```

### 2. Template content

The actual prompt using message tags and dynamic content:

```mdx theme={null}
<System>
You are a helpful assistant.
</System>

<User>
Summarize the following text: {props.text}
</User>
```

## Creating prompts

<Tabs>
  <Tab title="Cloud">
    Use the Dashboard's visual editor to create and edit prompts — no coding or git knowledge required.

    <img alt="Creating a prompt in the Dashboard visual editor" />

    The visual editor shows the frontmatter and message-tag body, a model selector, input-variable fields, and a streaming output pane — the same UI features available to someone editing the `.prompt.mdx` file locally.

    * Write and edit prompts with syntax highlighting
    * Test prompts directly in the editor
    * Configure model settings through visual controls
    * Preview outputs in real-time

    [Step-by-step guide →](./creating-prompts)
  </Tab>

  <Tab title="Local">
    Create `.prompt.mdx` files in your project's `agentmark/` directory:

    ```mdx agentmark/greeting.prompt.mdx theme={null}
    ---
    name: greeting
    text_config:
      model_name: gpt-4o-mini
      temperature: 0.7
    ---

    <System>
    You are a friendly assistant.
    </System>

    <User>
    Say hello to {props.name}.
    </User>
    ```

    Run it from the CLI:

    ```bash theme={null}
    npx agentmark run-prompt agentmark/greeting.prompt.mdx
    ```

    Or execute it from your application — see [Running Prompts](./running-prompts).
  </Tab>
</Tabs>

## Key features

### Message tags

Structure conversations with role tags:

* `<System>` — System-level instructions
* `<User>` — User messages
* `<Assistant>` — Assistant responses (for few-shot examples)

### Dynamic variables

Access runtime data using props:

```mdx theme={null}
<User>
Hello {props.userName}, you have {props.messageCount} new messages.
</User>
```

[Learn about Props →](/templatedx/variables)

### Conditional logic and loops

```mdx theme={null}
<User>
  <If condition={props.isPremium}>
    You have access to premium features.
  </If>

  Products:
  <ForEach arr={props.products}>
    {(product) => (
      <>- {product.name}: ${product.price}</>
    )}
  </ForEach>
</User>
```

With `props.isPremium = true` and `props.products = [{ name: "Widget", price: 10 }, { name: "Gadget", price: 20 }]`, the `<User>` message renders as:

```
You have access to premium features.

Products:
- Widget: $10
- Gadget: $20
```
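For readers more familiar with plain code than with template tags, the same logic can be sketched as a TypeScript function: `<If>` corresponds to an `if` statement and `<ForEach>` to a loop. `renderUserMessage` is a hypothetical stand-in written for this comparison. It reproduces the rendered text shown above and makes no claim about how TemplateDX handles whitespace internally.

```typescript
type Product = { name: string; price: number };
type Props = { isPremium: boolean; products: Product[] };

// Plain-code equivalent of the <If> / <ForEach> template above.
function renderUserMessage(props: Props): string {
  const lines: string[] = [];
  if (props.isPremium) {
    // <If condition={props.isPremium}>
    lines.push("You have access to premium features.", "");
  }
  lines.push("Products:");
  for (const product of props.products) {
    // <ForEach arr={props.products}>
    lines.push(`- ${product.name}: $${product.price}`);
  }
  return lines.join("\n");
}

console.log(
  renderUserMessage({
    isPremium: true,
    products: [
      { name: "Widget", price: 10 },
      { name: "Gadget", price: 20 },
    ],
  }),
);
```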

[Learn about TemplateDX syntax →](/templatedx/tags)

## Generation types

AgentMark supports multiple output formats:

* **[Text](./generation-types/text)** — Natural language responses
* **[Object](./generation-types/object)** — Structured JSON with schema validation
* **[Image](./generation-types/image)** — Image generation
* **[Speech](./generation-types/speech)** — Audio generation

[Explore Generation Types →](./generation-types/overview)

## Advanced features

* **[Tools & Agents](./tools-and-agents)** — Extend prompts with function calling and multi-step agent workflows
* **[Components](./components)** — Create shared, reusable components across prompts
* **[Schema References](./schema-references)** — Reuse JSON schema definitions across prompts
* **[File Attachments](./file-attachments)** — Include images and documents in prompts
* **[MCP Integration](./mcp)** — Connect to Model Context Protocol servers
* **[Playground](./playground)** — Compare prompts across multiple models side-by-side
* **[Version Control](./version-control)** — Track and manage prompt versions

## Next steps

<CardGroup>
  <Card title="Running Prompts" icon="play" href="./running-prompts">
    Execute prompts in your application via SDK
  </Card>

  <Card title="Generation Types" icon="sparkles" href="./generation-types/overview">
    Explore text, object, image, and speech generation
  </Card>

  <Card title="Tools & Agents" icon="robot" href="./tools-and-agents">
    Build multi-step agents with function calling
  </Card>

  <Card title="TemplateDX Syntax" icon="brackets-curly" href="/templatedx/syntax">
    Learn the full template syntax
  </Card>
</CardGroup>



# Playground
Source: https://docs.agentmark.co/build/playground

Compare prompts and models side-by-side to find the best configuration before publishing

<Info>**Cloud feature.** The Playground is available in the [AgentMark Dashboard](https://app.agentmark.co).</Info>

## Overview

The Playground lets you run the same prompt across multiple models and parameter configurations side-by-side. Compare outputs, tweak prompt text per variant, and apply the winning configuration back to your editor — all without leaving the Dashboard.

## Entering comparison mode

Open any prompt in the editor and click the **Compare** button in the top-right corner of the tab bar.

<img alt="Compare button in the prompt editor" />

The **Compare** button sits at the right edge of the editor tab bar, next to the **Editor** and **Commit History** tabs. Its outlined style indicates comparison mode is off; clicking it switches the button to a filled style and collapses the file tree.

When you enter comparison mode:

* The file tree collapses to give variants maximum horizontal space
* The navigation sidebar minimizes to icons
* Two variant panels appear side-by-side, ready for configuration

Click **Compare** again to exit and return to the standard editor. Your variant configurations are preserved — re-entering comparison mode restores them.

## Configuring variants

Each variant panel has its own independent configuration:

<img alt="Two variants with different models selected" />

Two variant panels fill the editor area side-by-side. Each panel header shows a **Variant 1** / **Variant 2** label, an **Apply** button, duplicate and remove icons, a **Model** dropdown, a **Temperature** slider with a settings-gear button, and a **Run** button.

### Model selection

Select a different model for each variant from the **Model** dropdown. All models configured in your [model schema](/configure/model-schemas) are available.

### Temperature

The **Temperature** slider is inline for quick adjustments. Click the **gear icon** to open the **Parameters** popover for max tokens, top-p, and other settings.

### Prompt override

Click the **Prompt** accordion on any variant to expand the prompt editor. Each variant starts with the base prompt text and can be edited independently.

<img alt="Variant with prompt override expanded showing the code editor" />

The **Prompt** accordion is expanded on the left variant, revealing the code editor populated with the base prompt text. Edits here only affect this variant.

When a variant's prompt differs from the base, a **Modified** badge appears. This makes it easy to see which variants have custom prompt text at a glance.

## Running variants

### Run all

Click **Run All** in the toolbar to execute all variants simultaneously. Each variant streams its output independently — if one errors, the others continue.

### Run single

Each variant has its own **Run** button for re-running just that variant without affecting others.

### Output and metadata

After execution, each variant displays its output alongside metadata chips showing:

* **Model name** — which model generated the response
* **Latency** — end-to-end response time
* **Token usage** — prompt / completion / total tokens
* **Finish reason** — why the model stopped (e.g., `stop`, `length`)

<img alt="Side-by-side comparison with output and metadata" />

After **Run All** completes, each variant shows its streamed output as monospaced text, followed by a bottom metadata bar of chips: model name, latency (e.g. `3.45s`), a combined prompt / completion / total tokens chip, and the finish reason.

## Managing variants

### Add and remove

Click **Add Variant** to add panels, up to a maximum of 6. Remove a variant with the **X** button in its header; at least 2 variants must remain.

### Duplicate

Click the **copy icon** on any variant to duplicate its model, parameters, and prompt override into a new panel.

### Grid layout

Variants are arranged in a 3-column grid:

* **2-3 variants**: single row
* **4-6 variants**: wraps to two rows (3 per row)

<img alt="Six variants in a 3x2 grid layout" />

Six variants fill a 3×2 grid, with three variants per row. The **Run All** button in the toolbar header executes every variant in parallel.

## Applying a variant

Once you've found the best configuration, click the **Apply** button on that variant's header. This writes the variant's model, parameters, and prompt text back to the main editor — then exits comparison mode so you can review and publish.

## Limitations

* **Maximum 6 variants** at a time
* **Variables are shared** across all variants (per-variant variables are not yet supported)
* **No dataset integration** — for systematic evaluation across many inputs, use [Experiments](/evaluate/running-experiments)
* **Ephemeral state** — variant configurations are not persisted across page reloads

## What's next

* [Create a Prompt](/build/creating-prompts) — set up your base prompt before comparing
* [Version Control](/build/version-control) — publish the winning variant as a new version
* [Experiments](/evaluate/running-experiments) — run prompts against datasets for systematic evaluation



# Running prompts
Source: https://docs.agentmark.co/build/running-prompts

Run prompts from the Dashboard, CLI, or SDK — with streaming, tracing, and caching

<Tabs>
  <Tab title="Cloud">
    ## Run from the Dashboard

    Open any prompt in the Dashboard editor, fill in your input variables, and click **Run**. Results stream back in real time.

    <img alt="Running a prompt in the Dashboard" />

    The animation shows the Dashboard's prompt editor running a prompt: the user fills input variables in the right-hand panel, clicks **Run**, and the response streams back in the output pane while tokens, cost, and model information appear in the footer.

    Every run is automatically traced. Navigate to **Traces** to see the execution timeline, token usage, cost, and model information for each run.

    ## Run from the playground

    The [Playground](/build/playground) lets you run the same prompt across multiple models and parameter configurations side-by-side. Compare outputs, tweak prompt text per variant, and apply the winning configuration back to your editor.
  </Tab>

  <Tab title="Local">
    ## CLI usage

    Run prompts from the command line for quick testing during development.

    ```bash theme={null}
    npx agentmark run-prompt agentmark/greeting.prompt.mdx
    ```

    <Note>
      Requires the development server running (`npx agentmark dev`).
    </Note>

    ### Passing props

    **Inline JSON**:

    ```bash theme={null}
    npx agentmark run-prompt agentmark/greeting.prompt.mdx \
      --props '{"name": "Alice", "role": "developer"}'
    ```

    **From file**:

    ```bash theme={null}
    npx agentmark run-prompt agentmark/greeting.prompt.mdx \
      --props-file ./test-data.json
    ```

    ### Output examples

    **Text generation**:

    ```
    === Text Prompt Results ===
    Once upon a time...

    ────────────────────────────────────────────────────────────
    🪙 250 in, 100 out, 350 total
    📊 View trace: http://localhost:3000/traces?traceId=<id>
    ```

    **Object generation**:

    ```
    === Object Prompt Results ===
    {
      "name": "John Smith",
      "email": "john@example.com"
    }

    ────────────────────────────────────────────────────────────
    🪙 180 in, 45 out, 225 total
    📊 View trace: http://localhost:3000/traces?traceId=<id>
    ```

    **Image and speech generation** — saved to `.agentmark-outputs/`:

    ```
    === Image Prompt Results ===
    Saved 2 image(s) to:
    - .agentmark-outputs/image-1-1698765432.png
    - .agentmark-outputs/image-2-1698765432.png
    ```

    ## SDK usage

    AgentMark works with multiple AI SDKs through adapters. The pattern is always:

    1. Load prompt with the appropriate loader
    2. Format with props (and optionally telemetry)
    3. Pass to your adapter's generation function

    ### Text generation

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        import { client } from './agentmark.client';
        import { generateText } from 'ai';

        const prompt = await client.loadTextPrompt('agentmark/greeting.prompt.mdx');
        const input = await prompt.format({
          props: { name: 'Alice', role: 'developer' }
        });

        const result = await generateText(input);
        console.log(result.text);
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark_client import client
        from agentmark_pydantic_ai_v0 import run_text_prompt

        prompt = await client.load_text_prompt("agentmark/greeting.prompt.mdx")
        params = await prompt.format(props={"name": "Alice", "role": "developer"})

        result = await run_text_prompt(params)
        print(result.text)
        ```
      </Tab>
    </Tabs>

    ### Streaming

    Use `streamText()` and `streamObject()` to stream responses token-by-token:

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        import { client } from './agentmark.client';
        import { streamText } from 'ai';

        const prompt = await client.loadTextPrompt('agentmark/story.prompt.mdx');
        const input = await prompt.format({
          props: { topic: 'space exploration' }
        });

        const result = streamText(input);

        for await (const chunk of result.textStream) {
          process.stdout.write(chunk);
        }
        ```

        For structured output:

        ```typescript theme={null}
        import { streamObject } from 'ai';

        const prompt = await client.loadObjectPrompt('agentmark/extract-data.prompt.mdx');
        const input = await prompt.format({
          props: { input: 'Contact John Smith at john@example.com' }
        });

        const result = streamObject(input);

        for await (const partial of result.partialObjectStream) {
          console.log(partial);
        }
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark_client import client
        from agentmark_pydantic_ai_v0 import stream_text_prompt

        prompt = await client.load_text_prompt("agentmark/story.prompt.mdx")
        params = await prompt.format(props={"topic": "space exploration"})

        async for chunk in stream_text_prompt(params):
            print(chunk, end="")
        ```
      </Tab>
    </Tabs>

    ### Object generation

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        import { client } from './agentmark.client';
        import { generateObject } from 'ai';

        const prompt = await client.loadObjectPrompt('agentmark/extract-data.prompt.mdx');
        const input = await prompt.format({
          props: { input: 'Contact John Smith at john@example.com' }
        });

        const result = await generateObject(input);
        console.log(result.object);
        // { name: "John Smith", email: "john@example.com" }
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark_client import client
        from agentmark_pydantic_ai_v0 import run_object_prompt

        prompt = await client.load_object_prompt("agentmark/extract-data.prompt.mdx")
        params = await prompt.format(
            props={"input": "Contact John Smith at john@example.com"}
        )

        result = await run_object_prompt(params)
        print(result.object)
        ```
      </Tab>
    </Tabs>

    ### Image generation

    ```typescript theme={null}
    import * as fs from 'node:fs';
    import { client } from './agentmark.client';
    import { experimental_generateImage } from 'ai';

    const prompt = await client.loadImagePrompt('agentmark/logo.prompt.mdx');
    const input = await prompt.format({
      props: { company: 'Acme Corp', style: 'modern' }
    });

    const result = await experimental_generateImage(input);
    result.images.forEach((image, i) => {
      // AI SDK generated files expose their bytes as `uint8Array` (or `base64`).
      fs.writeFileSync(`logo-${i}.png`, image.uint8Array);
    });
    ```

    <Note>TypeScript only — no Python equivalent yet.</Note>

    ### Speech generation

    ```typescript theme={null}
    import * as fs from 'node:fs';
    import { client } from './agentmark.client';
    import { experimental_generateSpeech } from 'ai';

    const prompt = await client.loadSpeechPrompt('agentmark/narration.prompt.mdx');
    const input = await prompt.format({
      props: { script: 'Welcome to our podcast' }
    });

    const result = await experimental_generateSpeech(input);
    // `result.audio` is a generated-file object; write its raw bytes.
    fs.writeFileSync('narration.mp3', result.audio.uint8Array);
    ```

    <Note>TypeScript only — no Python equivalent yet.</Note>

    ### Using other adapters

    The pattern is the same for all adapters:

    * **Vercel AI SDK**: `generateText()`, `generateObject()`, `streamText()`, `streamObject()`
    * **Mastra**: `agent.generate()`
    * **Custom**: Your own generation function

    [Learn more about adapters →](/integrations/overview)

    ## Tracing prompt runs

    Enable telemetry to automatically trace every prompt execution. Traces capture input/output, token usage, cost, latency, and custom metadata.

    ```typescript theme={null}
    const input = await prompt.format({
      props: { name: 'Alice' },
      telemetry: {
        isEnabled: true,
        functionId: 'greeting-handler',
        metadata: {
          userId: 'user-123',
          environment: 'production'
        }
      }
    });

    const result = await generateText(input);
    ```

    View traces locally at `http://localhost:3000` or in the Dashboard under **Traces**. See [Tracing Setup](/observe/tracing-setup) for the full API.

    ## Caching

    The AgentMark API loader caches loaded prompts client-side with a 60-second TTL by default. This means repeated calls to `loadTextPrompt()` within the TTL window return the cached version without a network request.

    Caching is automatic — no configuration needed. After the TTL expires, the next request re-fetches from the server.
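
    To make the TTL behavior concrete, here is a minimal sketch of a time-based cache. It is not AgentMark's loader code: `TtlCache`, its `get` method, and the injectable clock are hypothetical names chosen for illustration, and only the 60-second default comes from the text above.

    ```typescript
    // Hypothetical TTL cache, for illustration only (not the AgentMark loader).
    type Entry<T> = { value: T; expiresAt: number };

    class TtlCache<T> {
      private entries = new Map<string, Entry<T>>();
      private ttlMs: number;
      private now: () => number;

      constructor(ttlMs: number = 60_000, now: () => number = Date.now) {
        this.ttlMs = ttlMs;
        this.now = now;
      }

      // Returns the cached value while it is fresh; otherwise calls `load` again.
      get(key: string, load: () => T): T {
        const hit = this.entries.get(key);
        if (hit !== undefined && hit.expiresAt > this.now()) {
          return hit.value; // fresh entry: no fetch
        }
        const value = load(); // missing or expired: re-fetch
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      }
    }
    ```

    Passing a `load` function that returns a promise caches the promise itself, so concurrent callers within the TTL window would share one in-flight request.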

    ## Troubleshooting

    **Server connection error** — Ensure `npx agentmark dev` is running. Check ports 9417 and 9418 are available.

    **File not found** — Verify the file path and `.prompt.mdx` extension.

    **Invalid JSON in props** — Use valid JSON with double quotes.
  </Tab>
</Tabs>

## Next steps

<CardGroup>
  <Card title="Running Experiments" icon="flask" href="/evaluate/running-experiments">
    Test prompts against datasets
  </Card>

  <Card title="Generation Types" icon="sparkles" href="./generation-types/overview">
    Text, objects, images, and audio
  </Card>

  <Card title="Version Control" icon="code-branch" href="./version-control">
    Track changes and rollback to previous versions
  </Card>

  <Card title="Integrations" icon="plug" href="/integrations/overview">
    Vercel AI SDK, Pydantic AI, Mastra, and more
  </Card>
</CardGroup>

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Schema references
Source: https://docs.agentmark.co/build/schema-references

Reuse JSON schema definitions across AgentMark prompts with $ref

## Overview

AgentMark supports JSON Schema `$ref` references in prompt schemas. Instead of duplicating the same schema definition across multiple prompts, you can extract shared definitions into `.json` files and reference them with `$ref`. At build time, AgentMark resolves all references and inlines the content.

**Benefits:**

* **DRY schemas**: Define a schema once, reuse it across many prompts
* **Easier maintenance**: Update a schema in one place, and every prompt that references it picks up the change

Schema references work in both `input_schema` (input validation) and `object_config.schema` (structured output).

## Basic usage

Create a JSON schema file, then reference it from your prompt's frontmatter using `$ref`.

### Schema file

```json schemas/user.json theme={null}
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "email": { "type": "string", "format": "email" },
    "role": { "type": "string", "enum": ["admin", "member", "viewer"] }
  },
  "required": ["name", "email"]
}
```

### Prompt file with `$ref`

```jsx greet-user.prompt.mdx theme={null}
---
name: greet-user
text_config:
  model_name: gpt-4o
input_schema:
  $ref: ./schemas/user.json
---

<System>You are a friendly assistant.</System>
<User>Greet the user: {props.name} ({props.email}), who has the role {props.role}.</User>
```

When AgentMark processes this prompt, it loads `schemas/user.json` and replaces the `$ref` with the full schema content. The result is identical to having written the schema inline.
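
The replacement step can be pictured with a small sketch. This is an illustration under assumptions, not AgentMark's resolver: `inlineRefs`, the `Json` type, and the in-memory `files` map (standing in for schema files on disk) are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical stand-in for schema files on disk.
const files: Record<string, Json> = {
  "./schemas/user.json": {
    type: "object",
    properties: { name: { type: "string" }, email: { type: "string", format: "email" } },
    required: ["name", "email"],
  },
};

// Walks a frontmatter value and swaps each file `$ref` object for the loaded schema.
function inlineRefs(node: Json, load: (path: string) => Json): Json {
  if (Array.isArray(node)) return node.map((item) => inlineRefs(item, load));
  if (node === null || typeof node !== "object") return node;
  const ref = node["$ref"];
  if (typeof ref === "string" && !ref.startsWith("#")) {
    return load(ref); // the whole object is replaced by the referenced content
  }
  const out: { [key: string]: Json } = {};
  for (const [key, value] of Object.entries(node)) out[key] = inlineRefs(value, load);
  return out;
}

const frontmatter: Json = { input_schema: { $ref: "./schemas/user.json" } };
const resolved = inlineRefs(frontmatter, (path) => {
  if (!(path in files)) throw new Error(`file not found "${path}"`);
  return files[path] as Json;
});
```

After this runs, `resolved.input_schema` holds the same object you would get by writing the schema inline in the frontmatter.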

### Using `$ref` in output schemas

You can also use `$ref` in `object_config.schema` to define the structure of generated objects.

```jsx extract-contact.prompt.mdx theme={null}
---
name: extract-contact
object_config:
  model_name: gpt-4o
  schema:
    $ref: ./schemas/user.json
---

<System>Extract contact information from the following text.</System>
<User>{props.text}</User>
```

### Using `$ref` for nested properties

You do not need to replace the entire schema with a `$ref`. You can use `$ref` for individual properties within a larger schema.

```jsx create-order.prompt.mdx theme={null}
---
name: create-order
object_config:
  model_name: gpt-4o
  schema:
    type: object
    properties:
      customer:
        $ref: ./schemas/user.json
      items:
        type: array
        items:
          $ref: ./schemas/product.json
      total:
        type: number
    required:
      - customer
      - items
---

<System>Create an order from the customer's request.</System>
<User>{props.request}</User>
```

## JSON Pointer fragments

You can reference a specific definition within a schema file using a JSON Pointer fragment (RFC 6901). This is useful when you have a file containing multiple related definitions.

### Definitions file

```json schemas/common.json theme={null}
{
  "$defs": {
    "Address": {
      "type": "object",
      "properties": {
        "street": { "type": "string" },
        "city": { "type": "string" },
        "zip": { "type": "string" },
        "country": { "type": "string" }
      },
      "required": ["street", "city"]
    },
    "PhoneNumber": {
      "type": "string",
      "pattern": "^\\+[0-9]{1,15}$"
    }
  }
}
```

### Referencing a specific definition

```jsx contact-form.prompt.mdx theme={null}
---
name: contact-form
object_config:
  model_name: gpt-4o
  schema:
    type: object
    properties:
      name:
        type: string
      address:
        $ref: ./schemas/common.json#/$defs/Address
      phone:
        $ref: ./schemas/common.json#/$defs/PhoneNumber
---

<System>Extract contact details from the message.</System>
<User>{props.message}</User>
```

The `#/$defs/Address` fragment tells AgentMark to navigate into the JSON object at `$defs` then `Address`, and inline only that portion. The fragment follows standard JSON Pointer syntax, so any nested path works (e.g., `#/$defs/contact/email`).

<Note>
  Older JSON Schema drafts (Draft 4-7) used `definitions` instead of `$defs`. Both work with AgentMark since the fragment is just a JSON Pointer path into the file.
</Note>
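
As a rough illustration of how a fragment is walked, the sketch below resolves a JSON Pointer following RFC 6901. The function name `resolvePointer` is hypothetical and the error wording only approximates what AgentMark reports.

```typescript
// RFC 6901: split on "/", then unescape "~1" to "/" and "~0" to "~" (in that order).
function resolvePointer(doc: unknown, fragment: string): unknown {
  const pointer = decodeURIComponent(fragment.replace(/^#/, ""));
  if (pointer === "") return doc; // "#" alone refers to the whole document
  if (!pointer.startsWith("/")) throw new Error(`invalid JSON pointer "${fragment}"`);

  let current: unknown = doc;
  for (const raw of pointer.slice(1).split("/")) {
    const token = raw.replace(/~1/g, "/").replace(/~0/g, "~");
    if (current === null || typeof current !== "object" || !(token in current)) {
      throw new Error(`JSON pointer "${fragment}" not found`);
    }
    current = (current as Record<string, unknown>)[token];
  }
  return current;
}
```

Because the walk only follows keys, `#/$defs/Address` and `#/definitions/Address` are handled identically; the pointer just has to match the file's actual structure.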

## Transitive references

Schema files can themselves contain `$ref` entries that point to other files. AgentMark follows the entire chain and inlines everything.

### Example

```json schemas/user.json theme={null}
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "address": { "$ref": "./address.json" }
  }
}
```

```json schemas/address.json theme={null}
{
  "type": "object",
  "properties": {
    "street": { "type": "string" },
    "city": { "type": "string" },
    "zip": { "type": "string" }
  }
}
```

```jsx user-profile.prompt.mdx theme={null}
---
name: user-profile
text_config:
  model_name: gpt-4o
input_schema:
  $ref: ./schemas/user.json
---

<User>Describe the user profile for {props.name}.</User>
```

AgentMark resolves `user.json`, then sees the `$ref` to `address.json` inside it, resolves that too, and produces a fully inlined schema. Transitive references also work with JSON Pointer fragments: a referenced file can use `$ref: ./geo.json#/definitions/Coordinate` and it resolves correctly.

AgentMark supports up to 50 levels of transitive references.

<Note>
  Fragment-only references like `$ref: "#/$defs/Address"` (no file path) are standard JSON Schema internal references. AgentMark preserves these as-is for runtime validation and does not attempt to resolve them.
</Note>
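
The chain-following behavior, including the 50-level limit and the pass-through of fragment-only references, might look roughly like the sketch below. It is a simplification under stated assumptions: `resolveTransitive` is a hypothetical name, references are treated as flat keys rather than paths relative to the referencing file, and the error strings are modeled on the messages AgentMark documents.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const MAX_DEPTH = 50; // limit stated in the text above

function resolveTransitive(
  node: Json,
  load: (path: string) => Json,
  chain: string[] = [], // files currently being resolved, outermost first
): Json {
  if (Array.isArray(node)) return node.map((item) => resolveTransitive(item, load, chain));
  if (node === null || typeof node !== "object") return node;

  const ref = node["$ref"];
  if (typeof ref === "string" && !ref.startsWith("#")) {
    if (chain.includes(ref)) {
      throw new Error(`circular reference detected: ${[...chain, ref].join(" -> ")}`);
    }
    if (chain.length >= MAX_DEPTH) {
      throw new Error(`maximum resolution depth (${MAX_DEPTH}) exceeded`);
    }
    // Resolve the loaded file too, so references inside it are followed.
    return resolveTransitive(load(ref), load, [...chain, ref]);
  }

  // Fragment-only refs ("#/...") fall through here and are kept as-is.
  const out: { [key: string]: Json } = {};
  for (const [key, value] of Object.entries(node)) {
    out[key] = resolveTransitive(value, load, chain);
  }
  return out;
}
```

Tracking the chain per branch, rather than a global "seen" set, lets two sibling properties reference the same file without being flagged as a cycle.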

## Security constraints

AgentMark enforces security boundaries on `$ref` resolution.

* **Local files only**: Only relative file paths are supported. Remote URLs like `https://example.com/schema.json` are not fetched.
* **Project directory boundary**: Resolved paths must stay within the project directory. Path traversal attempts like `../../../etc/passwd` are rejected by the content loader.
* **No absolute paths**: Absolute paths like `/etc/passwd` are rejected.
* **Sibling properties are dropped**: When AgentMark resolves a `$ref`, the entire object is replaced by the referenced content. Any sibling properties next to `$ref` (like `description`) are discarded. Place additional properties in the referenced schema file instead.
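
The path rules above can be approximated with a short check. This is a sketch under assumptions rather than AgentMark's content loader: `checkRefPath` is a hypothetical helper, `promptDir` is assumed to be the prompt's directory expressed relative to the project root, and paths use `/` separators.

```typescript
// Returns the normalized project-relative path, or throws if the reference is not allowed.
function checkRefPath(promptDir: string, ref: string): string {
  const [filePart = ""] = ref.split("#"); // ignore any JSON Pointer fragment

  if (/^[a-z][a-z0-9+.-]*:/i.test(filePart)) {
    throw new Error(`remote references are not supported: "${ref}"`);
  }
  if (filePart.startsWith("/")) {
    throw new Error(`absolute paths are not allowed: "${ref}"`);
  }

  const segments: string[] = [];
  for (const part of `${promptDir}/${filePart}`.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      // Popping past the root would leave the project directory.
      if (segments.length === 0) {
        throw new Error(`path escapes the project directory: "${ref}"`);
      }
      segments.pop();
    } else {
      segments.push(part);
    }
  }
  return segments.join("/");
}
```

For example, `checkRefPath("agentmark/prompts", "../schemas/user.json")` normalizes to `agentmark/schemas/user.json`, while `../../../etc/passwd` from the same directory throws.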

## Error handling

When a `$ref` cannot be resolved, AgentMark reports a descriptive error.

| Error                                      | Cause                                               | Fix                                                          |
| ------------------------------------------ | --------------------------------------------------- | ------------------------------------------------------------ |
| `file not found "path"`                    | Referenced file does not exist                      | Check the file path is correct and relative to the prompt    |
| `"path" is not valid JSON`                 | File contains malformed JSON                        | Validate with a JSON linter                                  |
| `circular reference detected: A -> B -> A` | Two or more schemas reference each other in a cycle | Break the cycle by inlining the shared portion               |
| `maximum resolution depth (50) exceeded`   | Reference chain is deeper than 50 levels            | Simplify the schema hierarchy                                |
| `JSON pointer "..." not found in "path"`   | Fragment path does not match any key in the file    | Check the `#/...` fragment matches the target file structure |



# Prompt syntax
Source: https://docs.agentmark.co/build/syntax

Learn the AgentMark template syntax for creating dynamic prompts

AgentMark prompts use TemplateDX, a syntax that combines Markdown with JSX-like components.

## Quick reference

### Message roles

Text and Object prompts use message-role tags (`<System>`, `<User>`, `<Assistant>`). Image and Speech prompts use dedicated root tags (`<ImagePrompt>`, `<SpeechPrompt>`).

#### Text prompts

The `<Assistant>` tag is optional — include it to provide few-shot examples or prior conversation turns.

```mdx theme={null}
<System>You are a helpful assistant</System>

<User>Hello, how are you?</User>

<Assistant>I'm doing well, thank you!</Assistant>
```

#### Object prompts

Object prompts produce structured output defined by a `schema` field in the `object_config` frontmatter. The `<Assistant>` tag below is an optional few-shot example showing the model what a valid response looks like — it does not drive the structure itself.

```mdx theme={null}
---
name: extract-contact
object_config:
  model_name: gpt-4o-mini
  schema:
    type: object
    properties:
      name: { type: string }
      email: { type: string }
---

<System>You extract structured data from text</System>

<User>Extract the name and email from: John Smith (john@example.com)</User>

<Assistant>{"name": "John Smith", "email": "john@example.com"}</Assistant>
```

#### Image prompts

```mdx theme={null}
<ImagePrompt>
A futuristic cityscape at sunset with flying cars
</ImagePrompt>
```

#### Speech prompts

```mdx theme={null}
<System>Read this text clearly and slowly</System>

<SpeechPrompt>
Welcome to AgentMark. This is a test of speech generation.
</SpeechPrompt>
```

### Variables

Use variables to make prompts dynamic:

```mdx theme={null}
<User>
  Hello {props.userName}, you have {props.messageCount} new messages.
</User>
```

### Conditionals

Show or hide content based on conditions:

```mdx theme={null}
<User>
  <If condition={props.isPremium}>
    Welcome, premium member!
  </If>
  <Else>
    Consider upgrading to premium.
  </Else>
</User>
```

### Loops

Iterate over arrays:

```mdx theme={null}
<User>
  Products:
  <ForEach arr={props.products}>
    {(product, index) => (
      <>- {product.name}: ${product.price}</>
    )}
  </ForEach>
</User>
```

### Filters

Transform values with built-in filters:

```mdx theme={null}
<User>
  Your name is {capitalize(props.name)}
  Price: {round(props.price, 2)}
</User>
```

## Learn more

For complete documentation on all syntax features, including advanced templating, filters, components, and type safety:

<Card title="Visit TemplateDX Documentation" icon="book" href="/templatedx/introduction">
  Complete syntax guide →
</Card>



# Tools and agents
Source: https://docs.agentmark.co/build/tools-and-agents

Extend prompts with function calling and build multi-step agent workflows

Tools allow your prompts to call external functions — web searches, calculations, API calls, and more. Agents use tools across multiple LLM calls to solve complex tasks.

## Creating tools

Tools are defined using the AI SDK's `tool()` function. Each tool includes a description, a Zod schema for its input, and an `execute` function:

<Warning>
  The examples below use the AI SDK v5 signature (`inputSchema:`). AI SDK v4 used `parameters:` — do not mix them. The `@agentmark-ai/ai-sdk-v5-adapter` package requires v5.
</Warning>

```typescript theme={null}
import { tool } from "ai";
import { z } from "zod";

const calculateTool = tool({
  description: "Performs basic arithmetic calculations",
  inputSchema: z.object({
    expression: z.string().describe("The mathematical expression to evaluate"),
  }),
  execute: async ({ expression }) => {
    // Demo only: Function() evaluates arbitrary JavaScript, so this is NOT safe for untrusted input.
    // Swap in a real expression parser for production.
    const result = Function(`"use strict"; return (${expression})`)();
    return { result };
  },
});
```
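
If you want to avoid evaluating arbitrary JavaScript, one option is a small recursive-descent evaluator that accepts only numbers, `+`, `-`, `*`, `/`, and parentheses. The sketch below is one possible approach, not part of AgentMark or the AI SDK; `evaluateArithmetic` is a hypothetical helper you could call from the tool's `execute` function.

```typescript
// Grammar:
//   expr   := term (("+" | "-") term)*
//   term   := factor (("*" | "/") factor)*
//   factor := number | "(" expr ")" | "-" factor
function evaluateArithmetic(source: string): number {
  // Numbers and operators become tokens; any other non-space character is
  // kept as a one-character token so the parser can reject it.
  const tokens: string[] = source.match(/\d+(?:\.\d+)?|[-+*\/()]|\S/g) ?? [];
  let pos = 0;

  const peek = (): string | undefined => tokens[pos];
  const next = (): string | undefined => tokens[pos++];

  function factor(): number {
    const token = next();
    if (token === undefined) throw new Error("unexpected end of expression");
    if (token === "-") return -factor();
    if (token === "(") {
      const value = expr();
      if (next() !== ")") throw new Error("expected ')'");
      return value;
    }
    if (/^\d/.test(token)) return Number(token);
    throw new Error(`unexpected token "${token}"`);
  }

  function term(): number {
    let value = factor();
    while (peek() === "*" || peek() === "/") {
      const op = next();
      const right = factor();
      value = op === "*" ? value * right : value / right;
    }
    return value;
  }

  function expr(): number {
    let value = term();
    while (peek() === "+" || peek() === "-") {
      const op = next();
      const right = term();
      value = op === "+" ? value + right : value - right;
    }
    return value;
  }

  const result = expr();
  if (pos < tokens.length) throw new Error(`unexpected token "${tokens[pos]}"`);
  return result;
}
```

Inside the tool, `execute` could then return `{ result: evaluateArithmetic(expression) }`, and input such as `process.exit(1)` is rejected with an error instead of being executed.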

## Passing tools to the client

Pass tools to `createAgentMarkClient` via the `tools` option — a plain object keyed by name:

```typescript theme={null}
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
import { openai } from "@ai-sdk/openai";
import { tool } from "ai";
import { z } from "zod";

const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai });

const calculateTool = tool({
  description: "Performs basic arithmetic calculations",
  inputSchema: z.object({
    expression: z.string().describe("The mathematical expression to evaluate"),
  }),
  execute: async ({ expression }) => {
    // Demo only: Function() runs arbitrary JavaScript; use a real expression parser in production.
    const result = Function(`"use strict"; return (${expression})`)();
    return { result };
  },
});

const agentmark = createAgentMarkClient({
  modelRegistry,
  tools: {
    calculate: calculateTool,
  },
});
```

## Tool configuration in frontmatter

Reference tools by name in your prompt's frontmatter. The tool names must match the keys in your `tools` object:

```jsx calculator.prompt.mdx theme={null}
---
name: calculator
text_config:
  model_name: gpt-4
  tools:
    - calculate
---

<System>
You are a math tutor that can perform calculations. Use the calculate tool when you need to compute something.
</System>

<User>What's 235 * 18 plus 42?</User>
```

The tool implementation (description, schema, execute function) is defined in your code, not in the frontmatter.
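
Since a mismatch between frontmatter names and `tools` keys is an easy mistake, a pre-flight check can catch it early. The helper below is a hypothetical sketch (AgentMark may already validate this for you); it skips `mcp://` references because those resolve against MCP servers rather than the `tools` object.

```typescript
// Returns the frontmatter tool names that have no matching key in `tools`.
function findMissingTools(
  frontmatterTools: string[],
  tools: Record<string, unknown>,
): string[] {
  return frontmatterTools.filter(
    (name) =>
      !name.startsWith("mcp://") &&
      !Object.prototype.hasOwnProperty.call(tools, name),
  );
}
```

An empty result means every plain tool name in the prompt has an implementation registered in code.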

### MCP tools in frontmatter

Reference Model Context Protocol tools directly using `mcp://{server}/{tool}`:

```mdx mcp-example.prompt.mdx theme={null}
---
name: mcp-example
text_config:
  model_name: gpt-4
  tools:
    - mcp://docs/web-search
    - summarize
---

<System>
Use the web-search tool to look up relevant documentation when needed.
Use the summarize tool to condense content into a short summary.
</System>

<User>
Find the page that explains MCP integration and summarize it in 2 sentences.
</User>
```

* `mcp://docs/web-search` resolves to the MCP server named `docs`, tool `web-search`
* `summarize` is a tool provided via the `tools` option in `createAgentMarkClient`
* Use `mcp://docs/*` to include every tool exported by a server

See [MCP integration](./mcp) for details on configuring MCP servers.
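
To illustrate how the three forms above differ, here is a minimal parser for tool references. The `ToolRef` shape and `parseToolRef` are hypothetical names for this sketch, not AgentMark APIs.

```typescript
type ToolRef =
  | { kind: "local"; name: string }
  | { kind: "mcp"; server: string; tool: string; allTools: boolean };

// Splits "mcp://{server}/{tool}" into its parts; anything else is a local tool name.
function parseToolRef(ref: string): ToolRef {
  const prefix = "mcp://";
  if (!ref.startsWith(prefix)) return { kind: "local", name: ref };

  const rest = ref.slice(prefix.length);
  const slash = rest.indexOf("/");
  if (slash <= 0 || slash === rest.length - 1) {
    throw new Error(`expected mcp://{server}/{tool}, got "${ref}"`);
  }
  const server = rest.slice(0, slash);
  const tool = rest.slice(slash + 1);
  return { kind: "mcp", server, tool, allTools: tool === "*" };
}
```

With this shape, `mcp://docs/*` parses to server `docs` with `allTools: true`, and `summarize` parses to a local tool.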

## Agents

Enable multi-step agent workflows by setting `max_calls`. The SDK automatically handles multiple LLM calls, passing tool results back until the task is complete:

```jsx travel-agent.prompt.mdx theme={null}
---
name: travel-agent
text_config:
  model_name: gpt-4
  max_calls: 3
  tools:
    - search_flights
    - check_weather
---

<System>
You are a helpful travel assistant that can search flights and check weather conditions.
When helping users plan trips:
1. Search for available flights
2. Check the weather at the destination
3. Make recommendations based on both flight options and weather
</System>

<User>
I want to fly from San Francisco to New York next week. Can you help me plan my trip?
</User>
```
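
The SDK runs this loop for you, but a simplified model can help when reasoning about what `max_calls` bounds. The sketch below assumes `max_calls` caps the number of model calls; `runAgent`, `callModel`, and the message shapes are hypothetical, and the real loop is asynchronous.

```typescript
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelTurn = { text?: string; toolCalls: ToolCall[] };
type Message = { role: "user" | "assistant" | "tool"; content: string };

// `callModel` stands in for the LLM; `tools` holds local tool implementations.
function runAgent(
  callModel: (messages: Message[]) => ModelTurn,
  tools: Record<string, (args: Record<string, unknown>) => unknown>,
  userMessage: string,
  maxCalls: number,
): { text: string; calls: number } {
  const messages: Message[] = [{ role: "user", content: userMessage }];
  let text = "";
  let calls = 0;

  while (calls < maxCalls) {
    const turn = callModel(messages);
    calls += 1;
    text = turn.text ?? text;
    messages.push({ role: "assistant", content: turn.text ?? "" });

    if (turn.toolCalls.length === 0) break; // final answer: stop early

    // Run each requested tool and feed the result back for the next call.
    for (const call of turn.toolCalls) {
      const impl = tools[call.tool];
      const result = impl ? impl(call.args) : { error: `unknown tool "${call.tool}"` };
      messages.push({ role: "tool", content: JSON.stringify({ tool: call.tool, result }) });
    }
  }
  return { text, calls };
}
```

In this model, a travel agent that needs two tool calls before answering requires a budget of at least three model calls, which matches `max_calls: 3` in the prompt above.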

## Testing agents in the Dashboard

<Info>**Cloud feature.** Test agents visually in the [AgentMark Dashboard](https://app.agentmark.co).</Info>

Run agents directly in the Dashboard to see how they use tools in real time:

<img alt="Running an agent with tools in the Dashboard" />

The agent panel shows each step the model takes: the tool it called, the arguments it passed, the tool's response, and the model's next move. You can inspect the full tool-call trace without leaving the editor.

View configured tools and their schemas:

<img alt="Viewing tool schema in the Dashboard" />

The tool-schema panel lists every tool referenced in your prompt's frontmatter, with its description and the full Zod-derived JSON schema for its inputs.

## Best practices

1. Keep tools focused on a single responsibility.
2. Provide clear descriptions to help the LLM use tools appropriately.
3. Handle errors gracefully and return informative error messages.
4. Use descriptive parameter names and include helpful descriptions.
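
For point 3, one pattern is to wrap each tool's `execute` function so that failures come back as structured data the model can read, instead of an unhandled exception. The wrapper below is a sketch; `withErrorHandling` and the `ToolResult` shape are hypothetical, not part of AgentMark or the AI SDK.

```typescript
type ToolResult<T> = { ok: true; result: T } | { ok: false; error: string };

// Wraps an execute function so thrown errors become informative return values.
function withErrorHandling<Args, T>(
  toolName: string,
  execute: (args: Args) => Promise<T>,
): (args: Args) => Promise<ToolResult<T>> {
  return async (args) => {
    try {
      return { ok: true, result: await execute(args) };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      return { ok: false, error: `${toolName} failed: ${reason}` };
    }
  };
}
```

You could pass the wrapped function as the `execute` option when defining a tool, so the model sees a message such as `lookup failed: record not found` and can retry or explain the problem.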



# Version control
Source: https://docs.agentmark.co/build/version-control

Track, compare, and manage prompt versions in the AgentMark Dashboard

Every prompt change is automatically tracked and versioned in the Dashboard. Each publish creates a new commit on the app's connected Git branch; the full history is visible in the **Commit History** tab of any prompt.

<Note>
  **Local vs Cloud.** In Local mode your prompts are plain `.prompt.mdx` files — track versions with your own Git workflow (`git log`, `git diff`, `git revert`). AgentMark Cloud builds the visual commit history, diffs, and one-click rollback described below on top of that same Git branch.
</Note>

<img alt="Commit History tab showing prompt version timeline" />

The Commit History list shows every version with author, timestamp, and commit message. The most recent commit is marked as current.

## View a commit

Click any version in the history to view the commit:

* Changes made in that version
* Author and timestamp
* Commit message
* Link to the commit in your Git repository

<img alt="Commit detail view showing diff and metadata" />

The commit detail view shows the diff between this version and the previous one, along with the full commit metadata.

## Rollback

Revert to any previous version of your prompt. Rollback is **non-destructive** — it creates a new commit that restores the prompt content from the selected version, so your history stays intact.

1. Open the **Commit History** tab for the prompt.
2. In the row for the target version, click the rollback icon in the **Actions** column.
3. Confirm the rollback in the dialog.

<img alt="Rollback confirmation dialog" />

The confirmation dialog summarizes which version you're rolling back to. After confirming, a new commit is created and the prompt returns to that version's content.
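
The non-destructive behavior can be modeled as appending to the history rather than truncating it. The sketch below is a conceptual illustration only: the `Commit` shape, the `rollback` function, and the commit message format are assumptions, not how the Dashboard is implemented.

```typescript
type Commit = { id: number; message: string; content: string };

// Non-destructive rollback: append a new commit that restores older content.
function rollback(history: Commit[], targetId: number): Commit[] {
  const target = history.find((commit) => commit.id === targetId);
  if (target === undefined) throw new Error(`unknown commit ${targetId}`);

  const restored: Commit = {
    id: history.length + 1,
    message: `Rollback to version ${targetId}`, // hypothetical message format
    content: target.content,
  };
  return [...history, restored]; // earlier commits are left untouched
}
```

Because the old commits remain, rolling back a rollback is just another append. In Local mode, `git revert` is the analogous non-destructive Git operation, since it also records a new commit instead of rewriting history.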

## Next steps

<CardGroup>
  <Card title="Testing with datasets" icon="flask" href="/evaluate/datasets">
    Validate versions against test data
  </Card>

  <Card title="Running evaluations" icon="chart-line" href="/evaluate/writing-evals">
    Measure quality across versions
  </Card>

  <Card title="Team permissions" icon="users" href="/deploy/users-and-access-control">
    Control who can edit and approve changes
  </Card>

  <Card title="Experiments" icon="beaker" href="/evaluate/running-experiments">
    Compare versions with A/B testing
  </Card>
</CardGroup>



# Client config
Source: https://docs.agentmark.co/configure/client-config

Configure your AgentMark client for models, tools, scores, and prompt loading

The AgentMark client is configured in `agentmark.client.ts` (or `agentmark_client.py`). It connects your prompts to AI models, tools, evaluations, and prompt loading — used by the CLI, AgentMark Cloud, and your application code.

## Basic configuration

The client file is generated by `npm create agentmark@latest`. Each adapter has its own client pattern:

<Tabs>
  <Tab title="AI SDK (Vercel)">
    ```typescript agentmark.client.ts theme={null}
    import {
      createAgentMarkClient,
      VercelAIModelRegistry,
    } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { ApiLoader } from "@agentmark-ai/loader-api";
    import { openai } from "@ai-sdk/openai";

    const loader =
      process.env.NODE_ENV === "development"
        ? ApiLoader.local({
            baseUrl: process.env.AGENTMARK_BASE_URL || "http://localhost:9418",
          })
        : ApiLoader.cloud({
            apiKey: process.env.AGENTMARK_API_KEY!,
            appId: process.env.AGENTMARK_APP_ID!,
          });

    const modelRegistry = new VercelAIModelRegistry()
      .registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name))
      .registerModels(["dall-e-3"], (name) => openai.image(name))
      .registerModels(["tts-1-hd"], (name) => openai.speech(name));

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
    });
    ```

    Install:

    ```bash theme={null}
    npm install @agentmark-ai/ai-sdk-v5-adapter @agentmark-ai/loader-api @ai-sdk/openai
    ```
  </Tab>

  <Tab title="Claude Agent SDK">
    ```typescript agentmark.client.ts theme={null}
    import {
      createAgentMarkClient,
      ClaudeAgentModelRegistry,
    } from "@agentmark-ai/claude-agent-sdk-v0-adapter";
    import { ApiLoader } from "@agentmark-ai/loader-api";

    const loader =
      process.env.NODE_ENV === "development"
        ? ApiLoader.local({
            baseUrl: process.env.AGENTMARK_BASE_URL || "http://localhost:9418",
          })
        : ApiLoader.cloud({
            apiKey: process.env.AGENTMARK_API_KEY!,
            appId: process.env.AGENTMARK_APP_ID!,
          });

    const modelRegistry = ClaudeAgentModelRegistry.createDefault();

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      adapterOptions: {
        permissionMode: "bypassPermissions",
        maxTurns: 20,
      },
    });
    ```

    Install:

    ```bash theme={null}
    npm install @agentmark-ai/claude-agent-sdk-v0-adapter @agentmark-ai/loader-api
    ```

    The `adapterOptions` are unique to this adapter:

    | Option            | Description                                                      |
    | ----------------- | ---------------------------------------------------------------- |
    | `permissionMode`  | `'default'`, `'acceptEdits'`, `'bypassPermissions'`, or `'plan'` |
    | `maxTurns`        | Maximum number of agent turns                                    |
    | `maxBudgetUsd`    | Spending limit per run                                           |
    | `cwd`             | Working directory for the agent                                  |
    | `allowedTools`    | Whitelist of tool names                                          |
    | `disallowedTools` | Blacklist of tool names                                          |
  </Tab>

  <Tab title="Mastra">
    ```typescript agentmark.client.ts theme={null}
    import {
      createAgentMarkClient,
      MastraModelRegistry,
    } from "@agentmark-ai/mastra-v0-adapter";
    import { ApiLoader } from "@agentmark-ai/loader-api";
    import { openai } from "@ai-sdk/openai";

    const loader =
      process.env.NODE_ENV === "development"
        ? ApiLoader.local({
            baseUrl: process.env.AGENTMARK_BASE_URL || "http://localhost:9418",
          })
        : ApiLoader.cloud({
            apiKey: process.env.AGENTMARK_API_KEY!,
            appId: process.env.AGENTMARK_APP_ID!,
          });

    const modelRegistry = new MastraModelRegistry()
      .registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name));

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
    });
    ```

    Install:

    ```bash theme={null}
    npm install @agentmark-ai/mastra-v0-adapter @agentmark-ai/loader-api @ai-sdk/openai
    ```
  </Tab>

  <Tab title="Claude Agent SDK (Python)">
    ```python agentmark_client.py theme={null}
    import os
    from dotenv import load_dotenv
    from agentmark.prompt_core import ApiLoader
    from agentmark_claude_agent_sdk_v0 import (
        create_claude_agent_client,
        ClaudeAgentModelRegistry,
        ClaudeAgentAdapterOptions,
    )

    load_dotenv()

    if os.getenv("NODE_ENV") == "development":
        loader = ApiLoader.local(
            base_url=os.getenv("AGENTMARK_BASE_URL", "http://localhost:9418")
        )
    else:
        loader = ApiLoader.cloud(
            api_key=os.environ["AGENTMARK_API_KEY"],
            app_id=os.environ["AGENTMARK_APP_ID"],
        )

    model_registry = ClaudeAgentModelRegistry()
    model_registry.register_providers({
        # register your providers here
    })

    client = create_claude_agent_client(
        model_registry=model_registry,
        loader=loader,
        adapter_options=ClaudeAgentAdapterOptions(
            permission_mode="bypassPermissions",
        ),
    )
    ```

    Install:

    ```bash theme={null}
    pip install agentmark-sdk agentmark-claude-agent-sdk-v0 agentmark-prompt-core python-dotenv claude-agent-sdk
    ```

    See [Claude Agent SDK](/integrations/typescript/claude-agent-sdk) for the full adapter options reference.
  </Tab>

  <Tab title="Pydantic AI (Python)">
    ```python agentmark_client.py theme={null}
    import os
    from agentmark.prompt_core import ApiLoader
    from agentmark_pydantic_ai_v0 import (
        create_pydantic_ai_client,
        PydanticAIModelRegistry,
    )

    if os.getenv("NODE_ENV") == "development":
        loader = ApiLoader.local(
            base_url=os.getenv("AGENTMARK_BASE_URL", "http://localhost:9418")
        )
    else:
        loader = ApiLoader.cloud(
            api_key=os.environ["AGENTMARK_API_KEY"],
            app_id=os.environ["AGENTMARK_APP_ID"],
        )

    model_registry = PydanticAIModelRegistry()
    model_registry.register_models(
        ["gpt-4o", "gpt-4o-mini"],
        lambda name, opts=None: f"openai:{name}",
    )
    model_registry.register_models(
        ["claude-sonnet-4-20250514"],
        lambda name, opts=None: f"anthropic:{name}",
    )

    client = create_pydantic_ai_client(
        model_registry=model_registry,
        loader=loader,
    )
    ```

    Install:

    ```bash theme={null}
    pip install agentmark-sdk agentmark-pydantic-ai-v0 agentmark-prompt-core python-dotenv pydantic-ai
    ```

    <Note>
      The Python adapters don't ship a "default" model registry — you register provider prefixes explicitly with `register_models()`. The format `"openai:<model>"` tells Pydantic AI which provider to use at runtime.
    </Note>
  </Tab>
</Tabs>

## Prompt loading

The loader determines how prompts are fetched at runtime. AgentMark provides two loaders:

<Tabs>
  <Tab title="ApiLoader (recommended)">
    Use `ApiLoader` for both development and production:

    ```typescript theme={null}
    import { ApiLoader } from "@agentmark-ai/loader-api";

    // Development: loads from the local dev server
    const devLoader = ApiLoader.local({
      baseUrl: "http://localhost:9418",
    });

    // Production: loads from AgentMark Cloud CDN
    const prodLoader = ApiLoader.cloud({
      apiKey: process.env.AGENTMARK_API_KEY!,
      appId: process.env.AGENTMARK_APP_ID!,
    });
    ```

    `ApiLoader.cloud()` fetches prompts from the AgentMark API with a 60-second TTL cache. `ApiLoader.local()` fetches from your running `agentmark dev` server.
  </Tab>

  <Tab title="FileLoader (self-hosted)">
    Use `FileLoader` to load pre-built prompts from disk (no AgentMark Cloud dependency):

    ```typescript theme={null}
    import { FileLoader } from "@agentmark-ai/loader-file";

    const loader = new FileLoader("./dist/agentmark");
    ```

    Requires running `npx agentmark build --out dist/agentmark` before deployment to compile your `.prompt.mdx` files into JSON.

    A common pattern is to use `ApiLoader.local()` in development and `FileLoader` in production:

    ```typescript theme={null}
    import { ApiLoader } from "@agentmark-ai/loader-api";
    import { FileLoader } from "@agentmark-ai/loader-file";

    const loader =
      process.env.NODE_ENV === "development"
        ? ApiLoader.local({ baseUrl: "http://localhost:9418" })
        : new FileLoader("./dist/agentmark");
    ```
  </Tab>
</Tabs>
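
To make the 60-second TTL cache mentioned for `ApiLoader.cloud()` more concrete, here is a minimal sketch of how a time-based cache behaves. It is not AgentMark's implementation: `TtlCache`, `loadWithCache`, and `fetchPrompt` are hypothetical names used only for illustration.

```typescript
// Illustrative only: a tiny TTL cache, not AgentMark's implementation.
type Entry<V> = { value: V; expiresAt: number };

class TtlCache<V> {
  private entries = new Map<string, Entry<V>>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // stale: drop it so the caller refetches
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Hypothetical helper: return the cached prompt, or call `fetchPrompt` on a miss.
async function loadWithCache<V>(
  cache: TtlCache<V>,
  path: string,
  fetchPrompt: (path: string) => Promise<V>,
): Promise<V> {
  const cached = cache.get(path);
  if (cached !== undefined) return cached;
  const fresh = await fetchPrompt(path);
  cache.set(path, fresh);
  return fresh;
}
```

With a cache like this, a prompt edited in the Cloud could take up to the TTL to reach a running process, which is the trade-off for avoiding a network request on every load.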

## Registering models

The model registry maps model names (from prompt frontmatter) to actual AI SDK model instances. Each adapter has its own registry class.

<Tabs>
  <Tab title="AI SDK (Vercel)">
    ```typescript theme={null}
    import { VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { openai } from "@ai-sdk/openai";
    import { anthropic } from "@ai-sdk/anthropic";
    import { google } from "@ai-sdk/google";

    const modelRegistry = new VercelAIModelRegistry()
      // Language models
      .registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name))
      .registerModels(["claude-sonnet-4-20250514"], (name) => anthropic(name))
      .registerModels(["gemini-2.0-flash"], (name) => google(name))
      // Image models
      .registerModels(["dall-e-3"], (name) => openai.image(name))
      // Speech models
      .registerModels(["tts-1-hd"], (name) => openai.speech(name));
    ```

    You can also use regex patterns for dynamic matching:

    ```typescript theme={null}
    const modelRegistry = new VercelAIModelRegistry()
      .registerModels(/^gpt-/, (name) => openai(name))
      .registerModels(/^claude-/, (name) => anthropic(name));
    ```
  </Tab>

  <Tab title="Claude Agent SDK">
    ```typescript theme={null}
    import { ClaudeAgentModelRegistry } from "@agentmark-ai/claude-agent-sdk-v0-adapter";

    // Option 1: Default registry (passes model names through)
    const defaultRegistry = ClaudeAgentModelRegistry.createDefault();

    // Option 2: Custom configuration per model
    const customRegistry = new ClaudeAgentModelRegistry()
      .registerModels(["claude-sonnet-4-20250514"], (name) => ({
        model: name,
        maxThinkingTokens: 10000,
      }));
    ```
  </Tab>

  <Tab title="Mastra">
    ```typescript theme={null}
    import { MastraModelRegistry } from "@agentmark-ai/mastra-v0-adapter";
    import { openai } from "@ai-sdk/openai";

    const modelRegistry = new MastraModelRegistry()
      .registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name));
    ```
  </Tab>

  <Tab title="Claude Agent SDK (Python)">
    ```python theme={null}
    from agentmark_claude_agent_sdk_v0 import ClaudeAgentModelRegistry, ModelConfig

    # Configure per-model settings. The Python adapter does not ship a
    # `.create_default()` — register models explicitly. Creators must return
    # a `ModelConfig` dataclass, not a plain dict.
    model_registry = ClaudeAgentModelRegistry()
    model_registry.register_models(
        ["claude-sonnet-4-20250514"],
        lambda name, _: ModelConfig(model=name),
    )
    model_registry.register_models(
        ["claude-opus-4-20250514"],
        lambda name, _: ModelConfig(model=name, max_thinking_tokens=10000),
    )
    ```
  </Tab>

  <Tab title="Pydantic AI (Python)">
    ```python theme={null}
    from agentmark_pydantic_ai_v0 import PydanticAIModelRegistry

    model_registry = PydanticAIModelRegistry()
    model_registry.register_models(
        ["gpt-4o", "gpt-4o-mini"],
        lambda name, opts=None: f"openai:{name}",
    )
    model_registry.register_models(
        ["claude-sonnet-4-20250514"],
        lambda name, opts=None: f"anthropic:{name}",
    )
    ```

    <Note>
      The `"<provider>:<model>"` string is the format Pydantic AI uses to pick a provider at runtime. AgentMark doesn't ship a pre-built default registry for Python — register the providers you use.
    </Note>
  </Tab>
</Tabs>

Models referenced in prompt frontmatter must be registered in the model registry:

```mdx theme={null}
---
text_config:
  model_name: gpt-4o
---
```

<Tip>
  Use `npx agentmark pull-models` to add built-in models to your `agentmark.json`. You still need to register them in the client for runtime use.
</Tip>
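
The lookup a registry performs can be pictured with a simplified stand-in. The sketch below is an assumption about the general idea (match a name against string lists or regex patterns, then call the creator), not the adapters' real classes. `SketchModelRegistry` and `resolve` are hypothetical names, and the creators return plain strings instead of model instances to keep the example self-contained.

```typescript
// Illustrative only: a simplified registry, not the adapter's real class.
type ModelCreator<M> = (name: string) => M;
type Matcher = string[] | RegExp;

class SketchModelRegistry<M> {
  private rules: Array<{ matcher: Matcher; create: ModelCreator<M> }> = [];

  registerModels(matcher: Matcher, create: ModelCreator<M>): this {
    this.rules.push({ matcher, create });
    return this; // returning `this` allows chained registration
  }

  // Hypothetical lookup: the first matching rule wins, unknown names throw.
  resolve(name: string): M {
    for (const { matcher, create } of this.rules) {
      const matches = Array.isArray(matcher)
        ? matcher.includes(name)
        : matcher.test(name);
      if (matches) return create(name);
    }
    throw new Error(`Model not found: ${name}`);
  }
}

const registry = new SketchModelRegistry<string>()
  .registerModels(["gpt-4o", "gpt-4o-mini"], (name) => `openai:${name}`)
  .registerModels(/^claude-/, (name) => `anthropic:${name}`);
```

Here `registry.resolve("gpt-4o")` finds a creator, while a name that matches no rule raises an error. That mirrors why every `model_name` used in frontmatter needs a matching registration.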

## Registering tools

Tools allow prompts to call functions during generation. Pass tools directly as a plain object to `createAgentMarkClient` and reference them by name in prompt frontmatter.

<Tabs>
  <Tab title="AI SDK (Vercel)">
    Use the native `tool()` function from the `ai` package to define tools. AI SDK v5 uses `inputSchema` (Zod) — `parameters` is the v4 name and fails type-checking in v5.

    ```typescript theme={null}
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { tool } from "ai";
    import { z } from "zod";

    const searchTool = tool({
      description: "Search the knowledge base",
      inputSchema: z.object({ query: z.string() }),
      execute: async ({ query }) => ({ results: [`Result for ${query}`] }),
    });

    const weatherTool = tool({
      description: "Get current weather for a location",
      inputSchema: z.object({ location: z.string() }),
      execute: async ({ location }) => ({ temp: 72, condition: "sunny" }),
    });

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      tools: {
        search_knowledgebase: searchTool,
        get_weather: weatherTool,
      },
    });
    ```
  </Tab>

  <Tab title="Claude Agent SDK">
    The Claude Agent SDK adapter uses `mcpServers` (camelCase) instead of tools, since the Claude agent accesses tools through MCP:

    ```typescript theme={null}
    import { createAgentMarkClient, ClaudeAgentModelRegistry } from "@agentmark-ai/claude-agent-sdk-v0-adapter";

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      mcpServers: {
        tools: { url: "https://tools.example.com/mcp" },
      },
    });
    ```
  </Tab>

  <Tab title="Claude Agent SDK (Python)">
    The Python Claude Agent SDK adapter takes the same approach, using the snake_case `mcp_servers` argument instead of `tools`:

    ```python theme={null}
    from agentmark_claude_agent_sdk_v0 import create_claude_agent_client

    client = create_claude_agent_client(
        model_registry=model_registry,
        loader=loader,
        mcp_servers={
            "tools": {"url": "https://tools.example.com/mcp"},
        },
    )
    ```
  </Tab>

  <Tab title="Mastra">
    Pass tools as a plain object to `createAgentMarkClient`. In the snippet below, `searchTool` stands for a tool you have defined elsewhere in your code:

    ```typescript theme={null}
    import { createAgentMarkClient, MastraModelRegistry } from "@agentmark-ai/mastra-v0-adapter";

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      tools: {
        search_knowledgebase: searchTool,
      },
    });
    ```
  </Tab>

  <Tab title="Pydantic AI (Python)">
    Pass native Python functions (or `pydantic_ai.Tool` objects) as a `tools` list. The adapter filters the list at adapt time against the tool names in the prompt's frontmatter:

    ```python theme={null}
    from agentmark_pydantic_ai_v0 import create_pydantic_ai_client

    async def search_knowledgebase(query: str) -> dict:
        return {"results": [f"Result for {query}"]}

    client = create_pydantic_ai_client(
        model_registry=model_registry,
        tools=[search_knowledgebase],
        loader=loader,
    )
    ```
  </Tab>
</Tabs>

Reference tools in prompt frontmatter:

```mdx theme={null}
---
text_config:
  model_name: gpt-4o
  tools:
    - search_knowledgebase
---
```
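
The name matching between frontmatter and the registered tools can be sketched as a simple lookup. This is only an illustration under assumptions: `selectTools` is a hypothetical helper, and whether the real adapters throw or silently skip an unknown name is not specified here.

```typescript
// Illustrative only: picks registered tools by the names a prompt requests.
type ToolMap = Record<string, unknown>;

function selectTools(registered: ToolMap, requested: string[]): ToolMap {
  const selected: ToolMap = {};
  for (const name of requested) {
    // Own-property check so inherited keys like "toString" never match.
    if (!Object.prototype.hasOwnProperty.call(registered, name)) {
      // Assumption: surface the mismatch early instead of ignoring it.
      throw new Error(`Tool not available: ${name}`);
    }
    selected[name] = registered[name];
  }
  return selected;
}

// Placeholder values stand in for real tool definitions.
const registeredTools: ToolMap = {
  search_knowledgebase: { description: "Search the knowledge base" },
  get_weather: { description: "Get current weather for a location" },
};

const promptTools = selectTools(registeredTools, ["search_knowledgebase"]);
```

Only `search_knowledgebase` ends up in `promptTools`, because it is the only name listed in the frontmatter above. A typo in either place would fail the lookup.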

[Learn more about tools](/build/tools-and-agents)

## Registering evals

Eval functions score prompt outputs during experiments. Score schemas are defined separately in `agentmark.json` (see [Project config](/configure/project-config#scores)) and deployed to AgentMark Cloud. Eval functions are registered in your client config and connected to scores by name.

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import type { EvalFunction } from "@agentmark-ai/prompt-core";

    const evals: Record<string, EvalFunction> = {
      exact_match: ({ output, expectedOutput }) => {
        const match = output === expectedOutput;
        return { score: match ? 1 : 0, passed: match };
      },
      contains_keyword: ({ output, expectedOutput }) => {
        const contains = String(output).includes(String(expectedOutput));
        return { passed: contains };
      },
    };
    ```

    Pass the evals to your client:

    ```typescript theme={null}
    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      evals,
    });
    ```
  </Tab>

  <Tab title="Pydantic AI (Python)">
    ```python theme={null}
    from agentmark.prompt_core import EvalParams, EvalResult
    from agentmark_pydantic_ai_v0 import create_pydantic_ai_client

    evals = {
        "exact_match": lambda params: {
            "passed": params["output"] == params.get("expectedOutput"),
        },
        "contains_keyword": lambda params: {
            "passed": str(params.get("expectedOutput", "")) in str(params["output"]),
        },
    }

    client = create_pydantic_ai_client(
        model_registry=model_registry,
        loader=loader,
        evals=evals,
    )
    ```
  </Tab>

  <Tab title="Claude Agent SDK (Python)">
    ```python theme={null}
    from agentmark.prompt_core import EvalParams, EvalResult
    from agentmark_claude_agent_sdk_v0 import (
        create_claude_agent_client,
        ClaudeAgentModelRegistry,
        ModelConfig,
    )

    def exact_match(params: EvalParams) -> EvalResult:
        match = str(params["output"]).strip() == str(params.get("expectedOutput", "")).strip()
        return {"passed": match, "score": 1.0 if match else 0.0}

    evals = {
        "exact_match": exact_match,
    }

    model_registry = ClaudeAgentModelRegistry()
    model_registry.register_models(
        ["claude-sonnet-4-20250514"],
        lambda name, _: ModelConfig(model=name),
    )

    client = create_claude_agent_client(
        model_registry=model_registry,
        loader=loader,
        evals=evals,
    )
    ```
  </Tab>
</Tabs>

Reference evals in prompt frontmatter:

```mdx theme={null}
---
test_settings:
  dataset: ./datasets/sentiment.jsonl
  evals:
    - exact_match
---
```
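
As a rough picture of what an experiment does with an eval, the sketch below applies one eval function to a few dataset rows and reports the pass rate. The types and the `passRate` helper are hypothetical stand-ins written for this example. They are not the `EvalFunction` type from `@agentmark-ai/prompt-core`, and the real runner may aggregate differently.

```typescript
// Illustrative only: local stand-in types, not the prompt-core definitions.
type SketchEvalParams = { output: unknown; expectedOutput?: unknown };
type SketchEvalResult = { score?: number; passed?: boolean };
type SketchEvalFn = (params: SketchEvalParams) => SketchEvalResult;

const exactMatch: SketchEvalFn = ({ output, expectedOutput }) => {
  const match = output === expectedOutput;
  return { score: match ? 1 : 0, passed: match };
};

// Hypothetical aggregation: the fraction of rows where the eval passed.
function passRate(evalFn: SketchEvalFn, rows: SketchEvalParams[]): number {
  if (rows.length === 0) return 0; // avoid dividing by zero
  const passed = rows.filter((row) => evalFn(row).passed === true).length;
  return passed / rows.length;
}

const rows: SketchEvalParams[] = [
  { output: "positive", expectedOutput: "positive" },
  { output: "negative", expectedOutput: "positive" },
];

const rate = passRate(exactMatch, rows);
```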

[Learn more about evaluations](/evaluate/writing-evals)

## MCP servers

MCP servers provide additional tools to your prompts. Pass them as a plain `mcpServers` object to `createAgentMarkClient`:

```typescript theme={null}
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";

export const client = createAgentMarkClient({
  loader,
  modelRegistry,
  mcpServers: {
    filesystem: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-filesystem", "./docs"],
    },
    github: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-github"],
      env: { GITHUB_PERSONAL_ACCESS_TOKEN: process.env.GITHUB_TOKEN! },
    },
    docs: {
      url: "https://docs.example.com/mcp",
      headers: { Authorization: "Bearer env(MCP_TOKEN)" },
    },
  },
});
```

Each key in the `mcpServers` object is the server name. Local servers use `command` and `args`, while remote servers use `url` and optional `headers`.
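
The two shapes can be written as a discriminated union. The types below are a sketch inferred from the example above, not the adapter's published types, and `describeServer` is a hypothetical helper.

```typescript
// Illustrative only: types inferred from the config example, not the real ones.
type LocalServer = {
  command: string;
  args?: string[];
  env?: Record<string, string>;
};
type RemoteServer = { url: string; headers?: Record<string, string> };
type McpServerConfig = LocalServer | RemoteServer;

// The presence of `url` distinguishes a remote server from a local one.
function describeServer(name: string, config: McpServerConfig): string {
  if ("url" in config) {
    return `${name}: remote at ${config.url}`;
  }
  const commandLine = [config.command, ...(config.args ?? [])].join(" ");
  return `${name}: local via ${commandLine}`;
}

const servers: Record<string, McpServerConfig> = {
  filesystem: {
    command: "npx",
    args: ["-y", "@modelcontextprotocol/server-filesystem", "./docs"],
  },
  docs: { url: "https://docs.example.com/mcp" },
};

const summary = Object.entries(servers).map(([name, config]) =>
  describeServer(name, config),
);
```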

<Note>
  MCP servers configured in `agentmark.json` are available in the AgentMark Dashboard prompt editor. MCP servers configured in the client code are available at runtime.
</Note>

[Learn more about MCP](/build/mcp)

## Observability

The AgentMark SDK provides OpenTelemetry-based tracing for monitoring prompts in production.

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { AgentMarkSDK } from "@agentmark-ai/sdk";

    const sdk = new AgentMarkSDK({
      apiKey: process.env.AGENTMARK_API_KEY!,
      appId: process.env.AGENTMARK_APP_ID!,
    });

    // Initialize tracing (call once at startup)
    sdk.initTracing();

    // Use the SDK's built-in loader
    const loader = sdk.getApiLoader();
    ```

    `initTracing()` sets up an OpenTelemetry `BatchSpanProcessor` that exports traces to the AgentMark API. For debugging, use `sdk.initTracing({ disableBatch: true })` for immediate span export.

    To redact sensitive data from traces, pass a `mask` function. See [PII masking](/observe/pii-masking).
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    import os

    from agentmark_sdk import AgentMarkSDK

    sdk = AgentMarkSDK(
        api_key=os.environ["AGENTMARK_API_KEY"],
        app_id=os.environ["AGENTMARK_APP_ID"],
    )

    sdk.init_tracing()
    ```

    To redact sensitive data from traces, pass a `mask` function. See [PII masking](/observe/pii-masking).
  </Tab>
</Tabs>

You can also pass a `mask` function to redact sensitive data from traces before they leave your application:

```typescript theme={null}
import { AgentMarkSDK, createPiiMasker } from '@agentmark-ai/sdk';

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
  mask: createPiiMasker({ email: true, phone: true, ssn: true }),
});
sdk.initTracing();
```

[Learn more about PII masking](/observe/pii-masking)

[Learn more about observability](/observe/overview)

## Type safety

Run `npx agentmark generate-types --root-dir agentmark > agentmark.types.ts` to generate TypeScript types for all your prompts. The generated file exports a default interface named `AgentmarkTypes`. Pass it to `createAgentMarkClient` for autocomplete on prompt names, props, and outputs:

```typescript theme={null}
import type AgentmarkTypes from "./agentmark.types";

export const client = createAgentMarkClient<AgentmarkTypes>({
  loader,
  modelRegistry,
});

// Type-checked: prompt name, props, and output
const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
const input = await prompt.format({
  props: { name: "Alice", role: "developer" }, // type-checked
});
```
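
To show why a generic type parameter gives autocomplete and compile-time checks, here is a self-contained sketch. The shape of `SketchTypes` is an assumption made for this example, and the generated `AgentmarkTypes` interface may be structured differently. `createSketchClient` and `describe` are hypothetical and unrelated to the real client API.

```typescript
// Illustrative only: a made-up type map, not the generated AgentmarkTypes.
interface SketchTypes {
  "greeting.prompt.mdx": {
    input: { name: string; role: string };
    output: string;
  };
  "summary.prompt.mdx": { input: { text: string }; output: string };
}

// Extracts the `input` type for a given prompt name.
type InputOf<T, K extends keyof T> = T[K] extends { input: infer I } ? I : never;

function createSketchClient<T>() {
  return {
    // `name` is limited to the keys of T, and `props` must match that key's input.
    describe<K extends keyof T & string>(name: K, props: InputOf<T, K>): string {
      return `${name} ${JSON.stringify(props)}`;
    },
  };
}

const sketch = createSketchClient<SketchTypes>();
const line = sketch.describe("greeting.prompt.mdx", {
  name: "Alice",
  role: "developer",
});

// Both of these would be rejected by the compiler:
// sketch.describe("greeting.prompt.mdx", { name: "Alice" }); // `role` is missing
// sketch.describe("missing.prompt.mdx", {});                 // unknown prompt name
```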

[Learn more about type safety](/sdk-reference/typescript/type-safety)

## Using the client

Import the client in your application to load and run prompts:

<Tabs>
  <Tab title="AI SDK (Vercel)">
    ```typescript theme={null}
    import { client } from "./agentmark.client";
    import { generateText } from "ai";

    const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
    const input = await prompt.format({
      props: { name: "Alice" },
      telemetry: { isEnabled: true },
    });

    const result = await generateText(input);
    console.log(result.text);
    ```
  </Tab>

  <Tab title="Claude Agent SDK">
    ```typescript theme={null}
    import { client } from "./agentmark.client";
    import { query } from "@anthropic-ai/claude-agent-sdk";

    const prompt = await client.loadTextPrompt("agent-task.prompt.mdx");
    const adapted = await prompt.format({
      props: { task: "Refactor the auth module" },
      telemetry: { isEnabled: true },
    });

    // adapted.query has { prompt, options } shaped for the SDK
    for await (const message of query(adapted.query)) {
      console.log(message);
    }
    ```

    To add OpenTelemetry tracing, wrap the call with `withTracing`:

    ```typescript theme={null}
    import { withTracing } from "@agentmark-ai/claude-agent-sdk-v0-adapter";

    const tracedResult = await withTracing(query, {
      query: adapted.query,
      telemetry: adapted.telemetry,
    });

    for await (const message of tracedResult) {
      console.log(message);
    }
    ```
  </Tab>

  <Tab title="Mastra">
    Mastra prompts expose `formatAgent()` and `formatMessages()` — use these instead of the base `format()`:

    ```typescript theme={null}
    import { Agent } from "@mastra/core/agent";
    import { client } from "./agentmark.client";

    const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
    const agentConfig = await prompt.formatAgent({
      props: { name: "Alice" },
      options: { telemetry: { isEnabled: true } },
    });

    const [messages, generateOptions] = await agentConfig.formatMessages();

    const agent = new Agent(agentConfig);
    const result = await agent.generate(messages, generateOptions);
    ```
  </Tab>

  <Tab title="Claude Agent SDK (Python)">
    ```python theme={null}
    from agentmark_claude_agent_sdk_v0 import traced_query
    from agentmark_client import client

    prompt = await client.load_text_prompt("code-reviewer.prompt.mdx")
    adapted = await prompt.format(props={
        "task": "Analyze the auth module and suggest improvements"
    })

    async for message in traced_query(adapted):
        print(message)
    ```
  </Tab>

  <Tab title="Pydantic AI (Python)">
    ```python theme={null}
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    prompt = await client.load_text_prompt("greeting.prompt.mdx")
    params = await prompt.format(props={"name": "Alice"})

    result = await run_text_prompt(params)
    print(result.output)
    ```
  </Tab>
</Tabs>

## Troubleshooting

| Issue                     | Solution                                                                                                                  |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Model not found           | Ensure the model name in prompt frontmatter is registered in your model registry                                          |
| Tool not available        | Check the tool is included in the `tools` object passed to `createAgentMarkClient` and the name matches the prompt config |
| Loader connection failed  | Verify `agentmark dev` is running for local mode, or check `AGENTMARK_API_KEY` / `AGENTMARK_APP_ID` for Cloud mode        |
| MCP server not connecting | Verify the command/args are correct and any required env vars are set                                                     |
| Type errors               | Run `npx agentmark generate-types --root-dir agentmark > agentmark.types.ts` to regenerate types                          |

## Next steps

<CardGroup>
  <Card title="Running prompts" icon="play" href="/build/running-prompts">
    Use the client to run prompts
  </Card>

  <Card title="Tools and agents" icon="wrench" href="/build/tools-and-agents">
    Register and use tools
  </Card>

  <Card title="MCP integration" icon="server" href="/build/mcp">
    Connect MCP servers
  </Card>

  <Card title="Type safety" icon="shield-check" href="/sdk-reference/typescript/type-safety">
    Add TypeScript types
  </Card>
</CardGroup>

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Environment variables
Source: https://docs.agentmark.co/configure/environment-variables

Complete reference for all environment variables used by AgentMark

This page documents the environment variables AgentMark actually reads, organized by purpose. Every entry below was verified by grepping `process.env.<NAME>` and `os.environ` references in the repo at HEAD.

<Tip>
  Create a `.env` file in your project root. The AgentMark CLI automatically loads it before running commands.
</Tip>

## AgentMark core

| Variable             | Required   | Default                    | Description                                                                                                                                                                                                                                                                                                                      |
| -------------------- | ---------- | -------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `AGENTMARK_API_KEY`  | Cloud only | —                          | Your AgentMark API key. Generate one from the **API Keys** page in your app's settings (Dashboard → your app → Settings → API Keys).                                                                                                                                                                                             |
| `AGENTMARK_APP_ID`   | Cloud only | —                          | Your AgentMark application ID. Find it on your app's settings page (Dashboard → your app → Settings → General).                                                                                                                                                                                                                  |
| `AGENTMARK_BASE_URL` | No         | `https://api.agentmark.co` | Base URL read by the Python `ApiLoader.cloud()` (defaults to `https://api.agentmark.co`) and the scaffolded TypeScript `agentmark.client.ts` when `NODE_ENV === 'development'` (where the TS scaffold falls back to `http://localhost:9418`).                                                                                    |
| `NODE_ENV`           | No         | —                          | The scaffolded **TypeScript** client branches on `NODE_ENV === 'development'` to pick `ApiLoader.local` vs `ApiLoader.cloud`. The Python scaffold does not read `NODE_ENV` — it uses `ApiLoader.cloud()` directly and relies on the CLI setting `AGENTMARK_BASE_URL` for local dev. The SDK itself never branches on `NODE_ENV`. |

<Note>
  `NODE_ENV` is a user-code convention in the scaffolded TypeScript client, not something the SDK reads internally. If you write your own client, pick whatever flag fits your setup.
</Note>

### Example: development vs production

The scaffolded TypeScript client uses `NODE_ENV` to decide which loader to use. A typical setup:

```bash theme={null}
# .env (development)
NODE_ENV=development
AGENTMARK_BASE_URL=http://localhost:9418

# .env (production)
NODE_ENV=production
AGENTMARK_API_KEY=sk_agentmark_xxxxx
AGENTMARK_APP_ID=app_xxxxx
```
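
The branching these two `.env` files drive can be sketched as a pure function. This is not the scaffolded client itself: `chooseLoader` and `LoaderChoice` are hypothetical, and the function takes the environment as an argument (pass `process.env` in a real app) so that it stays easy to test.

```typescript
// Illustrative only: mirrors the NODE_ENV convention described above.
type Env = Record<string, string | undefined>;
type LoaderChoice =
  | { mode: "local"; baseUrl: string }
  | { mode: "cloud"; apiKey: string; appId: string };

function chooseLoader(env: Env): LoaderChoice {
  if (env["NODE_ENV"] === "development") {
    // Fall back to the default local dev server address.
    const baseUrl = env["AGENTMARK_BASE_URL"] ?? "http://localhost:9418";
    return { mode: "local", baseUrl };
  }
  const apiKey = env["AGENTMARK_API_KEY"];
  const appId = env["AGENTMARK_APP_ID"];
  if (!apiKey || !appId) {
    // Assumption: fail fast at startup instead of on the first prompt load.
    throw new Error(
      "AGENTMARK_API_KEY and AGENTMARK_APP_ID are required outside development",
    );
  }
  return { mode: "cloud", apiKey, appId };
}
```

A result with `mode: "local"` would map to `ApiLoader.local({ baseUrl })`, and `mode: "cloud"` to `ApiLoader.cloud({ apiKey, appId })`.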

## AI provider API keys

Configure API keys for the AI providers you use. Only set the keys for providers you actually call.

<Note>
  When set on a managed AgentMark Cloud deployment (Dashboard → your app → **Settings → Environment variables**), values are stored encrypted in our vault, scoped to the app, decrypted only at build time, and excluded from logs. See [Security → Provider API keys](/deploy/security#provider-api-keys-managed-deployments) for details.
</Note>

### OpenAI

| Variable          | Required | Description                                   |
| ----------------- | -------- | --------------------------------------------- |
| `OPENAI_API_KEY`  | Yes\*    | OpenAI API key for GPT models, DALL-E, TTS    |
| `OPENAI_ORG_ID`   | No       | Organization ID for OpenAI API calls          |
| `OPENAI_BASE_URL` | No       | Custom base URL (for Azure OpenAI or proxies) |

```bash theme={null}
OPENAI_API_KEY=sk-xxxxx
```

### Anthropic

| Variable            | Required | Description                         |
| ------------------- | -------- | ----------------------------------- |
| `ANTHROPIC_API_KEY` | Yes\*    | Anthropic API key for Claude models |

```bash theme={null}
ANTHROPIC_API_KEY=sk-ant-xxxxx
```

### Google

| Variable                       | Required | Description                         |
| ------------------------------ | -------- | ----------------------------------- |
| `GOOGLE_GENERATIVE_AI_API_KEY` | Yes\*    | Google AI API key for Gemini models |

```bash theme={null}
GOOGLE_GENERATIVE_AI_API_KEY=xxxxx
```

### AWS Bedrock

| Variable                | Required | Description                                              |
| ----------------------- | -------- | -------------------------------------------------------- |
| `AWS_ACCESS_KEY_ID`     | Yes\*    | AWS access key                                           |
| `AWS_SECRET_ACCESS_KEY` | Yes\*    | AWS secret key                                           |
| `AWS_REGION`            | Yes\*    | AWS region (delegated to AWS SDK — no AgentMark default) |

```bash theme={null}
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=xxxxx
AWS_REGION=us-west-2
```

### Azure OpenAI

| Variable                   | Required | Description                                                     |
| -------------------------- | -------- | --------------------------------------------------------------- |
| `AZURE_OPENAI_API_KEY`     | Yes\*    | Azure OpenAI API key                                            |
| `AZURE_OPENAI_ENDPOINT`    | Yes\*    | Azure OpenAI endpoint URL                                       |
| `AZURE_OPENAI_API_VERSION` | Yes\*    | API version (delegated to the Azure SDK — no AgentMark default) |

```bash theme={null}
AZURE_OPENAI_API_KEY=xxxxx
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-06-01
```

<Note>
  \*Required only if you use models from that provider. You don't need to set keys for providers you don't use.
</Note>

## MCP server configuration

AgentMark does not read any MCP-specific env vars itself. MCP servers reference whatever env vars you configure in your `agentmark.json` `mcpServers` block or pass via `env:` to a stdio server:

| Variable                                        | Purpose                                                                                                              |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `GITHUB_TOKEN` / `GITHUB_PERSONAL_ACCESS_TOKEN` | Used by the [GitHub MCP server](https://github.com/github/github-mcp-server) when referenced via `env(GITHUB_TOKEN)` |
| Any others                                      | Anything you reference via `env(VAR_NAME)` interpolation                                                             |

### Using `env()` interpolation

You can reference environment variables in your MCP server config using `env(VAR_NAME)` placeholders inside string values:

```typescript theme={null}
const mcpServers = {
  docs: {
    url: "env(MCP_DOCS_URL)",
    headers: { Authorization: "Bearer env(MCP_AUTH_TOKEN)" },
  },
  github: {
    command: "npx",
    args: ["-y", "@modelcontextprotocol/server-github"],
    env: { GITHUB_PERSONAL_ACCESS_TOKEN: "env(GITHUB_TOKEN)" },
  },
};

const client = createAgentMarkClient({
  mcpServers,
  // ... other options
});
```

Then set the variables in `.env`:

```bash theme={null}
MCP_DOCS_URL=https://docs.example.com/mcp
MCP_AUTH_TOKEN=your-auth-token
GITHUB_TOKEN=ghp_xxxxx
```
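
Conceptually, each `env(VAR_NAME)` placeholder is replaced with the variable's value before the server is started. The sketch below shows one way such substitution could work. It is not AgentMark's resolver: `interpolateEnv` is a hypothetical function, and throwing on a missing variable is an assumption about desirable behavior.

```typescript
// Illustrative only: replaces env(VAR_NAME) placeholders in a string.
function interpolateEnv(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(
    /env\(([A-Za-z_][A-Za-z0-9_]*)\)/g,
    (_match: string, name: string) => {
      const resolved = env[name];
      if (resolved === undefined) {
        // Assumption: a missing variable is an error, not an empty string.
        throw new Error(`Missing environment variable: ${name}`);
      }
      return resolved;
    },
  );
}

const exampleEnv = {
  MCP_DOCS_URL: "https://docs.example.com/mcp",
  MCP_AUTH_TOKEN: "your-auth-token",
};

const url = interpolateEnv("env(MCP_DOCS_URL)", exampleEnv);
const authHeader = interpolateEnv("Bearer env(MCP_AUTH_TOKEN)", exampleEnv);
```

Because the placeholder can appear anywhere in a string, a value such as `"Bearer env(MCP_AUTH_TOKEN)"` keeps its prefix and only the placeholder is swapped out.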

## Observability and tracing

| Variable                 | Required | Default | Description                                                                                                                              |
| ------------------------ | -------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `AGENTMARK_HIDE_INPUTS`  | No       | `false` | Replace all input attributes with `[REDACTED]` before export. See [PII masking](/observe/pii-masking#environment-variable-suppression).  |
| `AGENTMARK_HIDE_OUTPUTS` | No       | `false` | Replace all output attributes with `[REDACTED]` before export. See [PII masking](/observe/pii-masking#environment-variable-suppression). |

Tracing is controlled by calling `sdk.initTracing()` in your code, not by an environment variable. The OTLP endpoint is derived from the SDK's configured `baseUrl` and is not overridable via `OTEL_EXPORTER_OTLP_ENDPOINT`. The Claude Agent SDK adapter actively strips `OTEL_EXPORTER_OTLP_ENDPOINT` from child processes to prevent duplicate spans.

### MCP trace server

For the [MCP trace server](/sdk-reference/tools/mcp-trace-server):

| Variable               | Required | Default                 | Description                                                                |
| ---------------------- | -------- | ----------------------- | -------------------------------------------------------------------------- |
| `AGENTMARK_URL`        | No       | `http://localhost:9418` | AgentMark API server URL                                                   |
| `AGENTMARK_API_KEY`    | No       | —                       | Forwarded as the API key when the MCP trace server calls the AgentMark API |
| `AGENTMARK_TIMEOUT_MS` | No       | `30000`                 | Request timeout in milliseconds                                            |

## CLI configuration

| Variable                       | Required | Default                    | Description                                                                                                                                                                                                                                            |
| ------------------------------ | -------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `AGENTMARK_API_PORT`           | No       | `9418`                     | Port of the local API server that `agentmark run-experiment`'s score-posting step calls back to. To change the port that `agentmark dev` binds, pass `--api-port` instead.                                                                             |
| `AGENTMARK_API_URL`            | No       | `https://api.agentmark.co` | AgentMark Cloud gateway URL. Used by the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server (`AGENTMARK_API_URL=https://api-stg.agentmark.co` points it at staging) and read by `agentmark run-experiment` when posting scores to Cloud. |
| `AGENTMARK_WEBHOOK_URL`        | No       | `http://localhost:9417`    | Override the webhook server URL for `agentmark run-prompt` and `agentmark run-experiment`.                                                                                                                                                             |
| `AGENTMARK_NO_UPDATE_NOTIFIER` | No       | —                          | Set to any truthy value to suppress the CLI's upgrade-available banner. See `cli-src/update-notifier/constants.ts:15`.                                                                                                                                 |

## Webhook configuration

Variables for [alert webhook](/deploy/webhooks) endpoints.

| Variable                   | Required     | Default | Description                                                                                                                                                                                                                                                                                                                                                            |
| -------------------------- | ------------ | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `AGENTMARK_WEBHOOK_SECRET` | Webhook only | —       | Secret for verifying alert webhook signatures, shown in the dashboard under your app's **Settings → Integrations → LLM Call URL** form. This is a user-code convention — AgentMark's SDK does not read this variable; your webhook handler passes it to `verifySignature()`. Only needed if you deploy a webhook endpoint for [alert notifications](/deploy/webhooks). |

```bash theme={null}
AGENTMARK_WEBHOOK_SECRET=whsec_xxxxx
```

## Complete example

Here's a complete `.env` file for a typical project:

```bash theme={null}
# AgentMark configuration
AGENTMARK_API_KEY=sk_agentmark_xxxxx
AGENTMARK_APP_ID=app_xxxxx

# AI provider keys (only include providers you use)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx

# MCP servers
GITHUB_TOKEN=ghp_xxxxx

# Alert webhook (only if using webhook endpoint)
# AGENTMARK_WEBHOOK_SECRET=whsec_xxxxx

# Development overrides (uncomment for local dev)
# NODE_ENV=development
# AGENTMARK_BASE_URL=http://localhost:9418
```

## Loading environment variables

### Automatic loading

The AgentMark CLI automatically loads `.env` files from your project root before running any command.

### Manual loading (application code)

For your application code, use a package like `dotenv`:

```typescript theme={null}
import "dotenv/config";
// Now process.env.AGENTMARK_API_KEY is available
```

In Next.js, environment variables in `.env.local` are loaded automatically, so no extra package is needed.
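
However the variables are loaded, it can help to fail fast at startup when a required one is absent. The following is a minimal sketch; `missingEnv` and `assertEnv` are hypothetical helper names, not part of any AgentMark package.

```typescript
// Hypothetical helpers for illustration; not an AgentMark API.
type Env = Record<string, string | undefined>;

// Returns the names that are unset or blank.
function missingEnv(env: Env, required: readonly string[]): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Throws a single error listing every missing variable.
function assertEnv(env: Env, required: readonly string[]): void {
  const missing = missingEnv(env, required);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}

// Usage in Node.js, after loading .env:
// assertEnv(process.env, ["AGENTMARK_API_KEY", "AGENTMARK_APP_ID"]);
```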

### CI/CD

Set environment variables through your CI/CD provider's secrets management:

* **GitHub Actions**: repository secrets or environment secrets
* **Vercel**: project environment variables
* **AWS**: Secrets Manager or Parameter Store

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Model schemas
Source: https://docs.agentmark.co/configure/model-schemas

Add built-in and custom models in AgentMark

You can add models to your AgentMark project in two ways: pull pre-configured models from supported providers using the CLI, or define custom model schemas with full control over settings and pricing.

## Pulling built-in models

Use the `pull-models` command to interactively add models to your `agentmark.json`:

```bash theme={null}
npx agentmark pull-models
```

This will:

1. Show you the available providers
2. Let you select which models to add
3. Update `builtInModels` in your `agentmark.json`

Each pulled entry is written in `provider/model` form — for example, selecting OpenAI's `gpt-4o` writes `"openai/gpt-4o"` to `builtInModels`, and selecting Ollama's `llama3.1` writes `"ollama/llama3.1"`. The model IDs in the tables below are shown without the provider prefix for readability.
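
If your own tooling needs to split these IDs, one possible approach is to treat everything before the first slash as the provider. This is a sketch under that assumption, and `parseModelId` is a hypothetical helper. Splitting on the first slash matters because some model names contain slashes themselves (for example, Groq's `openai/gpt-oss-120b`).

```typescript
// Hypothetical helper for illustration; not an AgentMark API.
interface ModelRef {
  provider: string;
  model: string;
}

function parseModelId(id: string): ModelRef {
  const slash = id.indexOf("/");
  // Require a non-empty provider and a non-empty model name.
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`Expected "provider/model", got "${id}"`);
  }
  // Assumption: the provider is the segment before the FIRST slash.
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```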

The full set of models comes from AgentMark's [model registry](https://github.com/agentmark-ai/agentmark/tree/main/packages/model-registry) (sourced from LiteLLM and OpenRouter). Run `pull-models` to see the authoritative list for each provider — the tables below highlight a handful of commonly used IDs.

### Supported providers

AgentMark ships provider labels for: OpenAI, Anthropic, Google (including Vertex AI), xAI, Groq, Cohere, Mistral, DeepSeek, Together AI, Ollama, Fireworks, AWS Bedrock, Azure OpenAI, and Perplexity. The tabs below show a subset with example IDs.

<Tabs>
  <Tab title="OpenAI">
    **Language models:** `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-3.5-turbo`, `o1`, `o3`, and more

    **Image models:** `dall-e-3`, `dall-e-2`

    **Speech models:** `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`
  </Tab>

  <Tab title="Anthropic">
    **Language models:** `claude-4-opus-20250514`, `claude-4-sonnet-20250514`, `claude-3-7-sonnet-20250219`, `claude-3-opus-20240229`, `claude-3-haiku-20240307`, plus `*-latest` aliases, and more.

    The registry contains both dated IDs (e.g. `claude-4-opus-20250514`) and bare IDs (e.g. `claude-opus-4`, `claude-3-haiku`, `claude-3.5-sonnet`). A few bare names such as `claude-3-sonnet` and `claude-3-opus` are not registry entries — use the dated ID or `-latest` alias instead.
  </Tab>

  <Tab title="Google">
    **Language models:** `gemini-2.0-flash`, `gemini-2.5-pro`, and more
  </Tab>

  <Tab title="Ollama">
    **Language models:** `codellama`, `deepseek-r1`, `gemma`, `gemma2`, `llama3.1`, `llama3.2`, `llava`, `mistral`, `mistral-small`, `mistral-small3.1`, `qwen`, `qwen2.5`, `tinyllama`
  </Tab>

  <Tab title="xAI">
    **Language models:** `grok-3`, `grok-3-mini`, `grok-3-fast-beta`, `grok-3-fast-latest`, `grok-2-vision`, and more
  </Tab>

  <Tab title="Groq">
    **Language models:** `llama-3.3-70b-versatile`, `llama-3.1-8b-instant`, `openai/gpt-oss-120b`, `moonshotai/kimi-k2-instruct-0905`, `qwen/qwen3-32b`, `meta-llama/llama-4-scout-17b-16e-instruct`, and more
  </Tab>
</Tabs>

<Note>
  For AI SDK and Mastra adapters, you still need to register pulled models in your `agentmark.client.ts` with your adapter's model registry so they work at runtime. The Claude Agent SDK adapter handles registration natively — see [Registering models](/configure/client-config#registering-models).
</Note>

## Custom model schemas

For models not covered by the built-in providers, or when you need custom settings and pricing, define model schemas in your `agentmark.json` under `modelSchemas`.

### Basic structure

Each model schema includes:

* **label**: Display name shown in the AgentMark Dashboard prompt editor
* **cost**: Pricing configuration for cost tracking
* **settings**: Configurable parameters with UI controls

```json theme={null}
{
  "modelSchemas": {
    "my-custom-model": {
      "label": "My Custom Model",
      "cost": {
        "inputCost": 0.01,
        "outputCost": 0.03,
        "unitScale": 1000000
      },
      "settings": {}
    }
  }
}
```

### Cost configuration

The `cost` object defines pricing for cost tracking:

| Property     | Description                                                           |
| ------------ | --------------------------------------------------------------------- |
| `inputCost`  | Cost per unit for input tokens (USD)                                  |
| `outputCost` | Cost per unit for output tokens (USD)                                 |
| `unitScale`  | Number of tokens per unit (e.g., `1000000` = cost per million tokens) |

```json theme={null}
"cost": {
  "inputCost": 0.01,
  "outputCost": 0.03,
  "unitScale": 1000000
}
```

This means $0.01 per million input tokens and $0.03 per million output tokens.
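
To make the arithmetic concrete, here is a small sketch of how a request's cost follows from these three fields. `estimateCostUsd` is a hypothetical helper that only mirrors the definitions in the table above. It is not AgentMark's internal cost calculation.

```typescript
// Hypothetical helper for illustration; mirrors the table above.
interface ModelCost {
  inputCost: number;  // USD per unit of input tokens
  outputCost: number; // USD per unit of output tokens
  unitScale: number;  // tokens per unit
}

function estimateCostUsd(cost: ModelCost, inputTokens: number, outputTokens: number): number {
  const inputUnits = inputTokens / cost.unitScale;
  const outputUnits = outputTokens / cost.unitScale;
  return inputUnits * cost.inputCost + outputUnits * cost.outputCost;
}

const exampleCost: ModelCost = { inputCost: 0.01, outputCost: 0.03, unitScale: 1000000 };
console.log(estimateCostUsd(exampleCost, 250000, 50000));
```

With the configuration above, 250,000 input tokens and 50,000 output tokens work out to 0.25 × 0.01 + 0.05 × 0.03 = 0.0025 + 0.0015, or about $0.004.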

### Settings configuration

Settings define configurable parameters that appear in the Dashboard prompt editor. Each setting has:

| Property  | Description                                                                                 |
| --------- | ------------------------------------------------------------------------------------------- |
| `label`   | Display name shown in the Dashboard                                                         |
| `order`   | Sort order (ascending — lower values appear first)                                          |
| `default` | Default value                                                                               |
| `type`    | Either `"slider"` (numeric) or `"string"` (for select, imageSize, and aspectRatio controls) |
| `ui`      | Which control to render: `slider`, `select`, `imageSize`, or `aspectRatio`                  |

The Dashboard editor renders a control only when `ui` matches one of the supported values above. Settings with any other `ui` value render as "unsupported".

The available controls are:

<Tabs>
  <Tab title="Slider">
    For numeric values with a range:

    ```json theme={null}
    "temperature": {
      "label": "Temperature",
      "order": 1,
      "default": 0.7,
      "minimum": 0,
      "maximum": 2,
      "multipleOf": 0.1,
      "type": "slider",
      "ui": "slider"
    }
    ```
  </Tab>

  <Tab title="Select">
    For dropdown selection — use `type: "string"` with `ui: "select"` and an `options` array:

    ```json theme={null}
    "response_format": {
      "label": "Response format",
      "order": 3,
      "default": "json",
      "type": "string",
      "ui": "select",
      "options": [
        { "label": "JSON", "value": "json" },
        { "label": "Text", "value": "text" }
      ]
    }
    ```
  </Tab>

  <Tab title="Image size / aspect ratio">
    Specialized controls for image generation models:

    ```json theme={null}
    "image_size": {
      "label": "Image size",
      "order": 4,
      "default": "1024x1024",
      "type": "string",
      "ui": "imageSize"
    }
    ```

    ```json theme={null}
    "aspect_ratio": {
      "label": "Aspect ratio",
      "order": 5,
      "default": "16:9",
      "type": "string",
      "ui": "aspectRatio"
    }
    ```
  </Tab>
</Tabs>
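
The ordering and `ui` rules above can be sketched as follows. This is an illustration of the documented behavior only: `planControls` is a hypothetical function, and the Dashboard's real rendering logic may differ.

```typescript
// Hypothetical sketch of the documented rules; not the Dashboard's code.
const SUPPORTED_UI = ["slider", "select", "imageSize", "aspectRatio"] as const;

interface SettingLike {
  label: string;
  order: number;
  ui?: string;
}

interface ControlPlan {
  name: string;
  label: string;
  order: number;
  control: string; // a supported ui value, or "unsupported"
}

function planControls(settings: Record<string, SettingLike>): ControlPlan[] {
  const plans: ControlPlan[] = [];
  for (const name of Object.keys(settings)) {
    const setting = settings[name];
    if (setting === undefined) continue;
    const ui = setting.ui;
    // A control is planned only when `ui` is one of the supported values.
    const supported = ui !== undefined && SUPPORTED_UI.some((s) => s === ui);
    plans.push({
      name,
      label: setting.label,
      order: setting.order,
      control: supported && ui !== undefined ? ui : "unsupported",
    });
  }
  // Ascending order: lower `order` values come first.
  return plans.sort((a, b) => a.order - b.order);
}
```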

### Complete example

Here's a full example with a text model and an image model:

```json agentmark.json theme={null}
{
  "$schema": "https://raw.githubusercontent.com/agentmark-ai/agentmark/refs/heads/main/packages/cli/agentmark.schema.json",
  "version": "2.0.0",
  "agentmarkPath": ".",
  "modelSchemas": {
    "gpt-4-custom": {
      "label": "GPT-4 Custom",
      "cost": {
        "inputCost": 0.03,
        "outputCost": 0.06,
        "unitScale": 1000000
      },
      "settings": {
        "temperature": {
          "label": "Temperature",
          "order": 1,
          "default": 0.7,
          "minimum": 0,
          "maximum": 2,
          "multipleOf": 0.1,
          "type": "slider",
          "ui": "slider"
        },
        "max_tokens": {
          "label": "Max tokens",
          "order": 2,
          "default": 2048,
          "minimum": 1,
          "maximum": 8192,
          "multipleOf": 1,
          "type": "slider",
          "ui": "slider"
        },
        "response_format": {
          "label": "Response format",
          "order": 3,
          "default": "text",
          "type": "string",
          "ui": "select",
          "options": [
            { "label": "Text", "value": "text" },
            { "label": "JSON", "value": "json" },
            { "label": "JSON Schema", "value": "json_schema" }
          ]
        }
      }
    },
    "dall-e-3": {
      "label": "DALL-E 3",
      "cost": {
        "inputCost": 0.04,
        "outputCost": 0,
        "unitScale": 1
      },
      "settings": {
        "image_size": {
          "label": "Image size",
          "order": 1,
          "default": "1024x1024",
          "type": "string",
          "ui": "imageSize"
        },
        "quality": {
          "label": "Quality",
          "order": 2,
          "default": "standard",
          "type": "string",
          "ui": "select",
          "options": [
            { "label": "Standard", "value": "standard" },
            { "label": "HD", "value": "hd" }
          ]
        }
      }
    }
  }
}
```

## Best practices

1. **Use descriptive labels** — make it clear what each setting does in the Dashboard
2. **Set appropriate ranges** — define minimum and maximum values that make sense for your model
3. **Order settings logically** — lower `order` values appear first in the prompt editor
4. **Provide sensible defaults** — choose default values that work well for most use cases
5. **Document costs accurately** — make sure the cost configuration matches your provider's pricing



# Project config
Source: https://docs.agentmark.co/configure/project-config

Learn how to configure AgentMark for your application needs

AgentMark projects are configured through two main files: `agentmark.json` for project-level settings, and `agentmark.client.ts` (or `agentmark_client.py`) for runtime configuration like models, tools, and loaders.

## agentmark.json

The `agentmark.json` file lives at your project root and configures your AgentMark application. It is read by both the CLI and AgentMark Cloud.

### Basic example

A freshly-scaffolded `agentmark.json` (from `npm create agentmark@latest`) contains only the four base fields:

```json agentmark.json theme={null}
{
  "$schema": "https://raw.githubusercontent.com/agentmark-ai/agentmark/refs/heads/main/packages/cli/agentmark.schema.json",
  "version": "2.0.0",
  "mdxVersion": "1.0",
  "agentmarkPath": "."
}
```

Add the optional properties documented below — such as `builtInModels`, `scores`, and `mcpServers` — as your project needs them.

### Configuration properties

#### \$schema (optional)

Points to the JSON Schema for editor autocompletion and validation.

```json theme={null}
"$schema": "https://raw.githubusercontent.com/agentmark-ai/agentmark/refs/heads/main/packages/cli/agentmark.schema.json"
```

#### agentmarkPath (required)

The base directory (relative to your project root) where AgentMark looks for the `agentmark/` folder containing prompts, components, and datasets. Projects scaffolded with `npm create agentmark@latest` use `"."` — the `agentmark/` directory at the project root.

```json theme={null}
"agentmarkPath": "."
```

<Tip>
  In a monorepo, set this to the relative path of the package containing your AgentMark files (e.g., `"packages/ai"`).
</Tip>

<Warning>
  Use `"."` (or a relative path like `"packages/ai"`), not `"/"`. A leading slash resolves to the filesystem root and breaks `agentmark build`.
</Warning>

#### version (required)

The AgentMark configuration version. Use `"2.0.0"` for new projects. AgentMark Cloud uses this to choose the storage folder for deployed prompts — versions `>= "2.0.0"` use the `agentmark/` folder, earlier versions use the legacy `puzzlet/` folder.
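
The folder rule can be pictured with the sketch below. It assumes `version` is a dotted numeric string and that comparing the major component is enough, since the threshold is `2.0.0`. `storageFolder` is a hypothetical name, and how AgentMark Cloud actually parses the value is not documented here.

```typescript
// Hypothetical sketch; AgentMark Cloud's real comparison may differ.
function storageFolder(version: string): "agentmark" | "puzzlet" {
  const major = parseInt(version.split(".")[0] ?? "", 10);
  if (isNaN(major)) {
    throw new Error(`Unrecognized version: "${version}"`);
  }
  // Versions >= 2.0.0 use agentmark/, earlier versions use the legacy puzzlet/.
  return major >= 2 ? "agentmark" : "puzzlet";
}
```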

#### mdxVersion (optional)

The prompt format version. Accepts `"1.0"` (current) or `"0.0"` (legacy). Use `"1.0"` for new projects.

#### builtInModels (optional)

An array of model IDs allowed in prompts. When set and non-empty, `prompt-core` rejects any prompt whose `model_name` is not in the list. IDs use the `provider/model` format (e.g., `openai/gpt-4o`) so the adapter's model registry can auto-resolve the provider when you call `.registerProviders({ openai, anthropic })`. Pricing and settings for these models come from the bundled AgentMark [model registry](/configure/model-schemas).

```json theme={null}
"builtInModels": ["openai/gpt-4o", "openai/gpt-4o-mini", "anthropic/claude-sonnet-4-20250514"]
```

Use the `pull-models` CLI command to interactively add models from supported providers — it emits the correct `provider/model` format automatically:

```bash theme={null}
npx agentmark pull-models
```

See [Model schemas](/configure/model-schemas) for details.
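
The allow-list behavior described in this section amounts to a membership check that only applies when the list is set and non-empty. The sketch below illustrates that rule. `isModelAllowed` is a hypothetical function rather than the `prompt-core` implementation, and it ignores any interaction with `modelSchemas`.

```typescript
// Hypothetical sketch of the documented rule; not prompt-core's code.
function isModelAllowed(modelName: string, builtInModels?: readonly string[]): boolean {
  // An unset or empty list means the allow-list is not enforced.
  if (builtInModels === undefined || builtInModels.length === 0) {
    return true;
  }
  return builtInModels.indexOf(modelName) !== -1;
}
```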

#### evals (deprecated)

Use [`scores`](#scores-optional) instead. The `evals` field lists evaluation function names but does not include schema definitions. It remains supported for backward compatibility.

```json theme={null}
"evals": ["correctness", "hallucination", "relevance"]
```

#### scores (optional)

Define score schemas for evaluation and human annotation. Each entry declares a score name and its type (boolean, numeric, or categorical). These schemas are synced to AgentMark Cloud through the [deployment pipeline](/deploy/deployment) and used by both the annotation UI and experiment runner.

```json theme={null}
"scores": {
  "accuracy": {
    "type": "boolean",
    "description": "Was the response factually correct?"
  },
  "helpfulness": {
    "type": "numeric",
    "min": 1,
    "max": 5,
    "description": "Rate helpfulness on a 1-5 scale"
  },
  "tone": {
    "type": "categorical",
    "description": "Response tone",
    "categories": [
      { "label": "professional", "value": 1 },
      { "label": "casual", "value": 0.5 },
      { "label": "inappropriate", "value": 0 }
    ]
  }
}
```

To add automated eval functions for these scores, define them in your client config using the `evals` option. See [Evaluations](/evaluate/writing-evals) for details.
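
To show how the three score types constrain values, here is a sketch of a validator typed after the JSON above. Everything in it is an assumption made for illustration: `ScoreSchema` and `isValidScore` are hypothetical names, `min` and `max` are treated as optional, and categorical values are matched by `label`.

```typescript
// Hypothetical types and validator; not AgentMark's schema definitions.
type ScoreSchema =
  | { type: "boolean"; description?: string }
  | { type: "numeric"; min?: number; max?: number; description?: string }
  | {
      type: "categorical";
      categories: { label: string; value: number }[];
      description?: string;
    };

function isValidScore(schema: ScoreSchema, value: unknown): boolean {
  switch (schema.type) {
    case "boolean":
      return typeof value === "boolean";
    case "numeric": {
      if (typeof value !== "number" || isNaN(value)) return false;
      const min = schema.min ?? -Infinity;
      const max = schema.max ?? Infinity;
      return value >= min && value <= max;
    }
    case "categorical":
      // Assumption: annotators submit one of the category labels.
      return typeof value === "string" && schema.categories.some((c) => c.label === value);
  }
}
```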

#### modelSchemas (optional)

Define custom model configurations with settings, pricing, and UI controls. Use this for models not covered by `builtInModels`, or to customize settings for existing models.

```json theme={null}
"modelSchemas": {
  "my-custom-model": {
    "label": "My Custom Model",
    "cost": {
      "inputCost": 0.01,
      "outputCost": 0.03,
      "unitScale": 1000000
    },
    "settings": {
      "temperature": {
        "label": "Temperature",
        "order": 1,
        "default": 0.7,
        "minimum": 0,
        "maximum": 2,
        "multipleOf": 0.1,
        "type": "slider",
        "ui": "slider"
      }
    }
  }
}
```

See [Model schemas](/configure/model-schemas) for the full schema reference.

#### mcpServers (optional)

Configure Model Context Protocol (MCP) servers that your prompts can reference as tools. Servers listed here are registered with the adapter at runtime (AI SDK, Mastra, Claude Agent SDK) and become available to prompts that reference them as `mcp://<server-name>/<tool>` in the `tools:` frontmatter.

<Tabs>
  <Tab title="URL / SSE">
    For remote MCP servers accessible via HTTP:

    ```json theme={null}
    "mcpServers": {
      "docs": {
        "url": "https://example.com/mcp",
        "headers": {
          "Authorization": "Bearer your-token"
        }
      }
    }
    ```
  </Tab>

  <Tab title="Stdio">
    For local MCP servers that run as a subprocess:

    ```json theme={null}
    "mcpServers": {
      "local-tools": {
        "command": "node",
        "args": ["./mcp-server.js"],
        "cwd": "/path/to/server",
        "env": {
          "API_KEY": "secret"
        }
      }
    }
    ```
  </Tab>
</Tabs>

See [MCP Integration](/build/mcp) for usage in prompts.
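
The `mcp://<server-name>/<tool>` reference format can be split into its two parts with a small parser. The sketch below is only an illustration of the format. `parseMcpToolRef` is a hypothetical helper, and the adapters' own parsing may accept or reject different inputs.

```typescript
// Hypothetical helper for illustration; not an adapter API.
interface McpToolRef {
  server: string; // key under "mcpServers" in agentmark.json
  tool: string;   // tool name exposed by that server
}

function parseMcpToolRef(ref: string): McpToolRef | null {
  // Server name: everything up to the first slash after "mcp://".
  const match = /^mcp:\/\/([^/]+)\/(.+)$/.exec(ref);
  if (match === null) return null;
  const server = match[1];
  const tool = match[2];
  if (server === undefined || tool === undefined) return null;
  return { server, tool };
}
```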

#### handler (optional)

Path to your handler file for [managed code deployment](/deploy/deployment). AgentMark Cloud bundles and deploys this file so prompts can be executed from the Dashboard. The file extension determines the runtime (`.py` → Python, anything else → Node.js).

```json agentmark.json theme={null}
"handler": "handler.ts"
```

If omitted, AgentMark Cloud checks for `handler.py` at the repository root first (Python runtime), then falls back to `handler.ts` (Node.js runtime). If neither is found, managed code deployment is skipped. The setup-and-integration skill workflow writes this field when it scaffolds a handler (`handler.ts` for TypeScript projects, `handler.py` for Python).
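
The resolution order described above can be sketched as a small function. `resolveHandler` is a hypothetical name, and the sketch takes the list of files at the repository root as input instead of reading the filesystem. It illustrates the documented rules and is not AgentMark Cloud's implementation.

```typescript
// Hypothetical sketch of the documented resolution order.
type HandlerRuntime = "python" | "node";

interface ResolvedHandler {
  path: string;
  runtime: HandlerRuntime;
}

// `.py` selects Python; any other extension selects Node.js.
function runtimeFor(path: string): HandlerRuntime {
  return /\.py$/.test(path) ? "python" : "node";
}

function resolveHandler(
  configured: string | undefined,
  rootFiles: readonly string[],
): ResolvedHandler | null {
  if (configured !== undefined && configured !== "") {
    return { path: configured, runtime: runtimeFor(configured) };
  }
  // No "handler" field: check handler.py first, then handler.ts.
  if (rootFiles.indexOf("handler.py") !== -1) {
    return { path: "handler.py", runtime: "python" };
  }
  if (rootFiles.indexOf("handler.ts") !== -1) {
    return { path: "handler.ts", runtime: "node" };
  }
  // Neither file found: managed code deployment is skipped.
  return null;
}
```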

### Full example

An illustrative config showing every top-level field (not all are written by the scaffolder — see each field's section above for when it applies):

```json agentmark.json theme={null}
{
  "$schema": "https://raw.githubusercontent.com/agentmark-ai/agentmark/refs/heads/main/packages/cli/agentmark.schema.json",
  "version": "2.0.0",
  "mdxVersion": "1.0",
  "agentmarkPath": ".",
  "builtInModels": ["openai/gpt-4o", "openai/gpt-4o-mini", "anthropic/claude-sonnet-4-20250514"],
  "scores": {
    "correctness": {
      "type": "boolean",
      "description": "Was the response correct?"
    },
    "hallucination": {
      "type": "boolean",
      "description": "Did the response contain hallucinated content?"
    },
    "helpfulness": {
      "type": "numeric",
      "min": 1,
      "max": 5,
      "description": "Rate helpfulness on a 1-5 scale"
    }
  },
  "handler": "handler.ts",
  "mcpServers": {
    "docs": {
      "url": "https://example.com/mcp"
    }
  },
  "modelSchemas": {
    "my-fine-tuned-model": {
      "label": "My Fine-tuned Model",
      "cost": {
        "inputCost": 0.005,
        "outputCost": 0.015,
        "unitScale": 1000000
      },
      "settings": {
        "temperature": {
          "label": "Temperature",
          "order": 1,
          "default": 0.7,
          "minimum": 0,
          "maximum": 2,
          "multipleOf": 0.1,
          "type": "slider",
          "ui": "slider"
        }
      }
    }
  }
}
```

## Client configuration

The client configuration file (`agentmark.client.ts` or `agentmark_client.py`) defines your runtime setup: which models to use, what tools are available, how to load prompts, and which evaluations to run.

This file is auto-generated by `npm create agentmark@latest` and can be customized for your project.

<Tabs>
  <Tab title="Cloud mode">
    In Cloud mode, prompts are loaded from the AgentMark API in production and from your local dev server during development:

    ```typescript agentmark.client.ts theme={null}
    import { ApiLoader } from "@agentmark-ai/loader-api";

    const loader = process.env.NODE_ENV === 'development'
      ? ApiLoader.local({
          baseUrl: process.env.AGENTMARK_BASE_URL || 'http://localhost:9418'
        })
      : ApiLoader.cloud({
          apiKey: process.env.AGENTMARK_API_KEY!,
          appId: process.env.AGENTMARK_APP_ID!,
        });
    ```
  </Tab>

  <Tab title="Self-hosted mode">
    In self-hosted mode, prompts are loaded from pre-built files in production:

    ```typescript agentmark.client.ts theme={null}
    import { ApiLoader } from "@agentmark-ai/loader-api";
    import { FileLoader } from "@agentmark-ai/loader-file";

    const loader = process.env.NODE_ENV === 'development'
      ? ApiLoader.local({
          baseUrl: process.env.AGENTMARK_BASE_URL || 'http://localhost:9418'
        })
      : new FileLoader('./dist/agentmark');
    ```

    <Note>
      Self-hosted mode requires running `npx agentmark build --out dist/agentmark` to pre-compile your prompts before deployment.
    </Note>
  </Tab>
</Tabs>

## Environment variables

| Variable             | Required           | Description                                                                                |
| -------------------- | ------------------ | ------------------------------------------------------------------------------------------ |
| `AGENTMARK_API_KEY`  | Cloud mode         | API key from AgentMark Dashboard settings                                                  |
| `AGENTMARK_APP_ID`   | Cloud mode         | App ID from AgentMark Dashboard settings                                                   |
| `AGENTMARK_BASE_URL` | No                 | Override the local dev server URL in scaffolded clients (default: `http://localhost:9418`) |
| `OPENAI_API_KEY`     | Depends on adapter | OpenAI API key for AI SDK, Mastra, or Pydantic AI adapters                                 |
| `ANTHROPIC_API_KEY`  | Depends on adapter | Anthropic API key for Claude Agent SDK adapter                                             |

See [Environment variables](/configure/environment-variables) for the complete list.



# API keys
Source: https://docs.agentmark.co/deploy/api-keys

Create, scope, and manage API keys for the AgentMark Gateway from the Dashboard.

API keys authenticate every request to the AgentMark Gateway. Each key is scoped to a single app and carries either a preset **role** (SDK, Read-Only, Full Access) or a **custom** permission set you assemble yourself. The gateway checks a key's permissions on every request and returns `403 Forbidden` when a key lacks the required permission.

For the header format and the full endpoint-to-permission mapping, see [Authentication](/api-reference/authentication). For the permission catalog and role definitions, see [Users and access control](/deploy/users-and-access-control#api-keys).

## Locate the API keys settings

Open the AgentMark Dashboard, switch to the app you want to scope the key to (the app shown in the breadcrumb at the top), then navigate to **Settings → API keys**.

<img alt="Settings > API Keys page with empty state and a Create API Key button" />

<Info>Keys are scoped to the app shown in the breadcrumb. A key created here cannot access any other app's traces, templates, or datasets.</Info>

## Create an API key

1. Click **Create API key**.
2. Enter a **Name** (used for identification in the list; must be unique within the app).
3. Select a **Role** or choose **Custom** to pick permissions individually.
4. Click **Create**.

### Role presets

Three presets cover the common integration patterns:

**SDK** — `trace.write`, `template.read`, `score.write`. CLI and SDK integrations: ingest traces, read templates, write scores.

<img alt="Create API Key modal with the SDK role selected, showing trace.write, template.read, and score.write permission chips" />

**Read-Only** — `trace.read`, `span.read`, `session.read`, `score.read`, `score_config.read`, `dataset.read`, `metrics.read`, `deployment.read`, `environment.read`, `alert.read`, `slack_integration.read`, `app.read`. Dashboard and BI tools: read-only access to all data.

<img alt="Create API Key modal with the Read-Only role selected, showing read-only permission chips across traces, spans, sessions, scores, datasets, and metrics" />

**Full Access** — every permission in the catalog. Admin and CI pipelines; grant only when needed.
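
Conceptually, the per-request check is a membership test against the key's permission list. The sketch below encodes the SDK and Read-Only presets exactly as listed above. `ROLE_PRESETS` and `keyAllows` are hypothetical names, and the gateway's actual implementation is not documented here.

```typescript
// Hypothetical sketch; permission names are copied from the presets above.
const ROLE_PRESETS: Record<"sdk" | "readOnly", readonly string[]> = {
  sdk: ["trace.write", "template.read", "score.write"],
  readOnly: [
    "trace.read",
    "span.read",
    "session.read",
    "score.read",
    "score_config.read",
    "dataset.read",
    "metrics.read",
    "deployment.read",
    "environment.read",
    "alert.read",
    "slack_integration.read",
    "app.read",
  ],
};

// True when the key carries the permission an endpoint requires.
// A false result corresponds to the gateway's 403 Forbidden response.
function keyAllows(granted: readonly string[], required: string): boolean {
  return granted.indexOf(required) !== -1;
}
```

For example, a key with the SDK preset can ingest traces (`trace.write`) but cannot read them back (`trace.read`).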

## Custom permissions

Select **Custom** to build a permission set from scratch. Permissions are grouped by resource (Traces, Templates, Scores, Spans, Sessions, Datasets, Metrics, Experiments) so you can mix and match — for example, `trace.read + score.write` for a scoring worker, or `dataset.write` only for a CI job that uploads eval rows.

<img alt="Create API Key modal with Custom role expanded, showing grouped permission checkboxes for Traces, Templates, and Scores" />

You must select at least one permission. Submitting with no permissions selected shows the validation message **Select a role or at least one permission**.

<img alt="Custom permissions form scrolled to show all permission groups with a red validation error requiring at least one permission" />

## Copy the key

After the key is created, the Dashboard shows the key value **once**. Copy it now — it cannot be retrieved later.

<img alt="Post-create dialog showing the generated API key value with a copy icon and a warning that the key won't be saved" />

<Warning>
  **The key is shown once.** If you lose it, you must delete the key and create a new one — AgentMark does not store the raw key value after this step.
</Warning>

Store the key in a secrets manager and load it from environment variables in your application:

```bash theme={null}
AGENTMARK_API_KEY=sk_agentmark_your_key_here
AGENTMARK_APP_ID=app_your_app_id_here
```

Both variables are documented in [Environment variables](/configure/environment-variables).

## Edit key permissions

To change a key's permissions, click the pencil icon next to the key in the list.

<img alt="API Keys list showing a single key row with Name, Created By, and Created At columns and pencil and trash action icons" />

The edit modal works the same as the create modal — pick a role or toggle individual permissions. The Custom view shows every permission grouped by resource.

<img alt="Edit API Key Permissions modal with the Custom role selected, showing ungrouped permission checkboxes" />

Selecting **Full Access** reveals the complete permission set the key will carry:

<img alt="Edit API Key Permissions modal with the Full Access role selected, showing every permission chip including trace, template, score, span, session, dataset, metrics, and experiment permissions" />

Click **Save** to apply. The key value does not change, so your deployed integrations continue working with the new permissions immediately — no redeploy required.

<img alt="API Keys list after saving an edit, showing the same key with an updated timestamp" />

<Tip>Edits apply to the same key value, so you can tighten or loosen permissions on a production key without rotating secrets.</Tip>

## Delete a key

Click the red trash icon next to a key and confirm the deletion.

<img alt="API Keys list back to the empty state after the key was deleted" />

<Warning>
  Deletion is immediate. Any integration still using the deleted key will start receiving `401 Unauthorized` on the next request.
</Warning>

## Rotate a key

To rotate a key without downtime:

1. Create a new key with the same permissions.
2. Copy the new value and update `AGENTMARK_API_KEY` in your deployment (or secrets manager).
3. Confirm your application is using the new key (check traces or logs for the expected activity).
4. Delete the old key from the Dashboard.

This ordering guarantees no `401` gap — the old key keeps working until you delete it.

## Related reading

* [Authentication](/api-reference/authentication) — request headers, endpoint permissions, error codes
* [Users and access control](/deploy/users-and-access-control#api-keys) — permission catalog and role definitions
* [Security](/deploy/security#api-key-security) — how keys are stored and rate-limited



# Billing and usage
Source: https://docs.agentmark.co/deploy/billing-and-usage

Pricing tiers, usage limits, entitlements, and rate limits

AgentMark offers four tiers. Manage your subscription from **Settings → Billing** in the AgentMark Dashboard.

## Tiers

|                | Free | Growth     | Team        | Enterprise |
| -------------- | ---- | ---------- | ----------- | ---------- |
| **Price**      | \$0  | \$59/month | \$499/month | Custom     |
| **Self-serve** | Yes  | Yes        | Yes         | No         |

## Usage limits

| Resource              | Free   | Growth    | Team      | Enterprise   |
| --------------------- | ------ | --------- | --------- | ------------ |
| **Units per month**   | 20,000 | 20,000    | 100,000   | Configurable |
| **Apps**              | 1      | 3         | 10        | Configurable |
| **Users**             | 2      | Unlimited | Unlimited | Unlimited    |
| **API keys**          | 25     | 25        | Unlimited | Unlimited    |
| **Data retention**    | 7 days | 90 days   | 90 days   | Configurable |
| **Storage per month** | 1 GB   | 5 GB      | 25 GB     | Configurable |

"Configurable" means the tier default can be raised through a custom [entitlement override](#entitlement-overrides) — typical for Enterprise accounts.

## Rate limits

API requests are rate-limited per tier:

| Tier       | Rate limit |
| ---------- | ---------- |
| Free       | 3,000 RPM  |
| Growth     | 3,000 RPM  |
| Team       | 6,000 RPM  |
| Enterprise | 12,000 RPM |

## Feature availability

Some features are gated by tier:

| Feature            | Free | Growth | Team | Enterprise |
| ------------------ | ---- | ------ | ---- | ---------- |
| Alerts             | —    | Yes    | Yes  | Yes        |
| Custom metrics     | —    | Yes    | Yes  | Yes        |
| GitLab integration | —    | Yes    | Yes  | Yes        |
| Custom roles       | —    | —      | Yes  | Yes        |
| App-level roles    | —    | —      | Yes  | Yes        |
| SSO (SAML)         | —    | —      | Yes  | Yes        |

Need data residency or regional hosting? [Contact us](mailto:hello@agentmark.co) — Enterprise deals can accommodate custom arrangements.

## Metered billing

Each tier includes a base monthly allocation of **units**. On Growth and Team, usage beyond the included limit is metered and billed monthly at \$10 per 100k additional units. On the Free tier, units are enforced as a hard cap instead of metered — upgrade or contact sales to raise the limit.
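As a rough illustration of the metering rule, the sketch below estimates an overage charge from the included units in the table above and the \$10 per 100k rate. It assumes linear proration of partial 100k blocks, which this page does not confirm, and the function and constant names are hypothetical.

```python
from typing import Optional

# Included monthly units, copied from the usage-limits table on this page.
INCLUDED_UNITS = {"free": 20_000, "growth": 20_000, "team": 100_000}
# Only Growth and Team are metered; Free is a hard cap and Enterprise is custom.
METERED_TIERS = {"growth", "team"}
OVERAGE_USD_PER_100K = 10.0


def estimate_overage_usd(tier: str, units_used: int) -> Optional[float]:
    """Estimate the overage charge in USD, or None when the tier is not metered.

    Assumption: partial blocks are prorated linearly. Actual invoices may
    round differently.
    """
    if tier not in METERED_TIERS:
        return None
    extra_units = max(0, units_used - INCLUDED_UNITS[tier])
    return extra_units / 100_000 * OVERAGE_USD_PER_100K


if __name__ == "__main__":
    # 120,000 units on Growth is 100,000 units over the included 20,000.
    print(estimate_overage_usd("growth", 120_000))
```

Treat the result as a planning estimate only; the invoice in **Settings → Billing** is authoritative.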

## Entitlement overrides

Enterprise customers can have custom entitlements configured by AgentMark — for example, higher span limits, extended data retention, or elevated rate limits beyond the standard tier defaults.



# Deployment
Source: https://docs.agentmark.co/deploy/deployment

Deploy your prompts and handler code to AgentMark Cloud using the deployment pipeline

Deploy your AgentMark project by connecting a Git repository. On every push, AgentMark Cloud runs a deployment pipeline that syncs your files and deploys your handler code.

## Deployment pipeline

Connect a Git repository to your app in the AgentMark Dashboard. When you push, the pipeline runs two steps: file sync and code deploy.

### Setup

1. Push your project to a Git repository (GitHub or GitLab).
2. From your org's **Apps** page, create an app (name-only — the Create App modal doesn't connect a repo). Then open the app's settings menu → **Link repository** to connect your Git repository.
3. Pick a branch to deploy from.

### How the pipeline works

Every push to your connected branch triggers a two-step deployment:

1. **File sync** — AgentMark Cloud syncs your prompt templates (`.prompt.mdx`), components (`.mdx`, `.md`), and datasets (`.jsonl`) between your repository and the app.
2. **Code deploy** — If a handler file is detected, AgentMark Cloud bundles your code and deploys it to a managed machine. Your handler executes prompts when triggered from the Dashboard, API, or experiments.

This pipeline runs on the app's default environment (`dev`), which tracks your branch HEAD live. To ship the same code to an isolated, pinned environment like `staging` or `prod`, see [Environments and promotions](/deploy/environments-and-promotions).

A handler is the entry point for prompt execution in AgentMark Cloud. It receives prompt requests and runs them using your adapter and models. The setup-and-integration skill workflow scaffolds a handler when you ask your AI tool to "set up AgentMark in this project" against a Cloud-targeted app; you can also write one by hand following the framework's [integration guide](/integrations/overview).

<Tip>
  If no handler is detected, the pipeline completes after file sync and skips the code deploy step. Prompts can still be loaded by your own runtime via the SDK + Cloud API key — see [API keys](/deploy/api-keys).
</Tip>

Each push creates its own deployment record; rapid consecutive pushes each run through the pipeline.

### Handler detection

AgentMark Cloud resolves your handler file in this order:

1. **`handler` key in `agentmark.json`** — If your config includes a `handler` field, that path is used.

   ```json agentmark.json theme={null}
   {
     "version": "2.0.0",
     "agentmarkPath": ".",
     "handler": "src/handler.ts"
   }
   ```

   Use `"src/handler.py"` if your project is Python.

2. **Fallback** — If no `handler` key is set, AgentMark Cloud looks for `handler.py` first, then `handler.ts` at the repository root.

Both TypeScript and Python handlers are fully supported. The setup-and-integration skill workflow places the appropriate entry point when you wire AgentMark into your project (`handler.ts` for TypeScript, `handler.py` for Python) and adds the `handler` key to `agentmark.json` if needed. If you're authoring by hand, follow the [integration guide](/integrations/overview) for your framework.

If neither file is found, the code deploy step is skipped and the deployment completes after file sync only.
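The lookup order above can be sketched locally, for example to check which file a push would deploy. This is a minimal sketch, not AgentMark Cloud's implementation: `resolve_handler` is a hypothetical helper, and how the pipeline treats a `handler` key that points at a missing file is not documented here, so the sketch simply returns the configured path.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_handler(repo_root: Path) -> Optional[Path]:
    """Follow the documented order: `handler` key first, then root-level fallbacks.

    Returns None when no handler is found, which corresponds to the code
    deploy step being skipped.
    """
    config_path = repo_root / "agentmark.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        configured = config.get("handler")
        if configured:
            # Assumption: the configured path is used as-is, relative to the repo root.
            return repo_root / configured
    # Fallback order from the docs: handler.py first, then handler.ts.
    for name in ("handler.py", "handler.ts"):
        candidate = repo_root / name
        if candidate.is_file():
            return candidate
    return None
```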

### Re-triggering deployments

After your first successful deployment, you can re-trigger individual steps from the deployment card in the Dashboard:

* **Re-sync** — pull the latest files from your repository without rebuilding code. Use this when you only changed prompt templates or datasets.
* **Rebuild** — re-bundle and redeploy your handler code without re-syncing files. Use this when you need to pick up new environment variables.
* **Full deploy** — run both file sync and code deploy.

### Build caching

When you push a commit whose **import graph** hasn't changed since your last successful deployment, AgentMark Cloud skips the builder and marks the new deployment as deployed using the artifact already running. No setup needed — caching is automatic.

#### How it works

Each successful build emits a **build manifest** — the list of files that actually participated in the bundle. For TypeScript that's the import graph from your handler (resolved by esbuild), plus the lockfile, `package.json`, `tsconfig.json`, and `agentmark.json`. For Python it's the transitively-imported modules from your handler (resolved by Python's `modulefinder`), plus `requirements.txt` / `pyproject.toml` / `Pipfile.lock` / etc. Each manifest entry records the file's git blob SHA at the time of that build.

On the next push, AgentMark Cloud fetches the recursive git tree at the new commit and compares each manifest entry's recorded blob SHA to the file's blob SHA at the new commit. If they all match, the cache hits — your code's actual inputs haven't changed, so the prior bundle is still correct.

When the cache hits, the deployment record gets `cache_hit = true` and completes in under a second. The running production machine continues to serve.
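The comparison described above can be sketched in a few lines. The blob hash follows git's standard SHA-1 object format; `is_cache_hit` and the dictionary shapes (path mapped to blob SHA) are assumptions for illustration, not the actual manifest format.

```python
import hashlib
from typing import Dict


def git_blob_sha(content: bytes) -> str:
    """SHA-1 that git assigns to a blob: 'blob <size>' + NUL byte + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def is_cache_hit(manifest: Dict[str, str], tree: Dict[str, str]) -> bool:
    """True when every manifest entry has the same blob SHA in the new commit's tree.

    Files outside the manifest (prompts, datasets, unreferenced scripts) are
    ignored. A manifest file missing from the tree counts as a miss.
    """
    return all(tree.get(path) == sha for path, sha in manifest.items())


if __name__ == "__main__":
    manifest = {
        "handler.ts": git_blob_sha(b"export default 1\n"),
        "package.json": git_blob_sha(b"{}\n"),
    }
    # A prompt-only push: manifest files are unchanged, one unrelated file is new.
    prompt_only = dict(manifest)
    prompt_only["prompts/qa.prompt.mdx"] = git_blob_sha(b"new prompt\n")
    print(is_cache_hit(manifest, prompt_only))
```

The sketch covers only the blob comparison. Other invalidation triggers, such as a newly added lockfile or a builder image upgrade, are outside its scope.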

#### When you'll see cache hits

* **Prompt-, dataset-, or component-only pushes.** This is the most common case, for example PMs editing prompts. `.prompt.mdx`, `.mdx`, `.md`, and `.jsonl` files don't appear in the import graph, so the manifest doesn't include them and changes to them don't affect the cache.
* **Edits to source files your handler doesn't import** — e.g., one-off scripts, examples, internal tools, or dead code. If the file isn't reachable from your entry point, it's not in the manifest.
* **Retries of the same commit** (a previous push errored mid-pipeline; you retry the same commit).
* **Re-deploys after rolling forward and back** to the most recently deployed code state (older matching deployments don't count — only the latest successful one).

#### When the cache misses

* Any change to a file that's part of your handler's import graph (the entry point itself, anything it imports transitively).
* Any change to your **lockfile** (`package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `bun.lockb`, `Pipfile.lock`, `poetry.lock`).
* Any change to your **package manifest** (`package.json`, `pyproject.toml`, `requirements.txt`, etc.).
* Any change to **build configs** (`tsconfig.*`, `setup.py`/`setup.cfg`).
* Any change to **`agentmark.json` / `puzzlet.json`**.
* A new lockfile or manifest appearing where there wasn't one before.
* A builder image upgrade (system-wide cache invalidation when AgentMark Cloud upgrades the build environment).
* The first deploy of an app, or the first deploy after this feature shipped (no prior manifest to compare against).

#### Manual deploys always bypass the cache

The dashboard's **Full deploy** and **Rebuild** buttons always run the builder, regardless of whether the manifest matches. A manual click is an explicit signal that you want a fresh build — typically to pick up new environment variables, recover from a transient failure, or rebuild against an updated builder image. The cache only applies to automatic deploys triggered by `git push`.

#### Commit-message overrides

You can override the cache decision by including a directive in your commit message:

| Directive       | Effect                                                                                                                                                                |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `[force build]` | Always run the builder, even when the build manifest matches. Use to force a fresh build (e.g., to pick up a new base image).                                         |
| `[skip build]`  | Skip the builder and reuse the most recent successful artifact. Use for documentation-only or other no-op pushes when you want a deployment record without a rebuild. |

Both directives are **case-insensitive** and may appear anywhere in the commit message. If both appear, `[force build]` wins.

```bash theme={null}
# Forces a rebuild even if the build manifest matches the prior successful deploy.
git commit -m "fix: bump base image [force build]"

# Skips the build entirely and reuses the prior artifact.
git commit -m "docs: typo in README [skip build]"
```
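The precedence rule (case-insensitive, anywhere in the message, `[force build]` wins) can be sketched as a small parser, for instance in a pre-push hook that warns about an accidental directive. `build_directive` is a hypothetical helper, and it matches the literal bracketed text only, since tolerance for extra whitespace is not documented.

```python
import re
from typing import Optional

# Literal directives from the table above, matched case-insensitively.
_FORCE = re.compile(r"\[force build\]", re.IGNORECASE)
_SKIP = re.compile(r"\[skip build\]", re.IGNORECASE)


def build_directive(commit_message: str) -> Optional[str]:
    """Return 'force', 'skip', or None for a commit message.

    '[force build]' is checked first so that it wins when both are present.
    """
    if _FORCE.search(commit_message):
        return "force"
    if _SKIP.search(commit_message):
        return "skip"
    return None


if __name__ == "__main__":
    print(build_directive("docs: typo in README [Skip Build]"))
    print(build_directive("chore: [skip build] but also [FORCE BUILD]"))
```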

<Warning>
  `[skip build]` reuses the **most recent successful artifact regardless of whether your code changed**. That means any code differences in the current push will not run in production until your next normal deploy. Use it for genuine no-ops only — not as a way to ship code without testing the build.
</Warning>

<Tip>
  If you push `[skip build]` to an app with no prior successful deployment, the deployment fails with a clear reason (`[skip build] requested but no prior successful build to reuse`) — there's no artifact to fall back to. Push without the directive to trigger a normal first build.
</Tip>

### Environment variables

Configure environment variables for your deployed handler in the Dashboard under **Settings → Environment variables**. These variables are injected during the build step and available to your handler at runtime.

<Warning>
  A small set of names is reserved and cannot be overridden: `AGENTMARK_API_KEY`, `AGENTMARK_APP_ID`, `AGENTMARK_BASE_URL`, `AGENTMARK_DISPATCH_SECRET`, and `PORT`. Any other name, including ones starting with `AGENTMARK_`, can be used freely.
</Warning>

Add your AI provider keys (such as `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`) and any other secrets your handler needs. Changes to environment variables take effect on the next deployment — trigger a **Rebuild** to apply them immediately.



# Environments and promotions
Source: https://docs.agentmark.co/deploy/environments-and-promotions

Run the same app at different versions in isolated environments, and promote a tested version forward when it's ready.

A new app starts with one environment, `dev`, that tracks your branch HEAD live. Create `staging` and `prod` to run the same code at pinned, immutable versions — and **promote** from one environment into another when a version is ready to ship.

## Why environments?

Without environments, the latest push to your connected branch is the only version of your prompts and code that the app can serve. That's fine while you're experimenting on `dev`, but it leaves no place to test a release candidate against real traffic, and no way to keep an older version live while you iterate on the next.

Environments solve both:

* **`dev` stays live on HEAD** — prompt-only and dataset-only pushes show up instantly, the way iteration should feel.
* **`staging`, `prod`, and any other env you create are pinned** — each runs an immutable version snapshot of your content + code that does not move until you explicitly promote into it.

The version that runs in `prod` is the one you tested in `staging`. Pushing a typo fix to `dev` doesn't change either.

## The `dev` environment

Every app has exactly one default environment, named `dev`. It is created automatically with the app and has three behaviors no other env shares:

* **Tracks branch HEAD live.** Every push to your connected branch deploys to `dev` through the normal [deployment pipeline](/deploy/deployment) — the build cache, handler detection, and commit-message overrides all apply here.
* **Cannot be promoted into.** `dev` is the source of promotions, never the target. Promoting *into* `dev` would silently move the live pointer off HEAD, breaking the "`dev` always reflects my branch" invariant.
* **Cannot be deleted.** The default env is locked for the lifetime of the app.

`dev` does not get its own dedicated runtime until something needs one — it runs through the app's primary deployment machinery. Non-default envs each get their own isolated runtime.

## Creating an environment

Open the environment dropdown in the breadcrumb at the top of the dashboard and click **New environment**.

<img alt="Breadcrumb environment dropdown open from a non-default env, listing dev with the default suffix, staging with the no-pin suffix, prod with the v1 suffix, plus the New environment, Promote to staging, and Roll back menu items" />

The dropdown shows every env on the app suffixed with its pin state: `default` for `dev`, `no-pin` for a freshly created env, or the version number (`v1`, `v3`) for a pinned env. The same menu hosts **New environment** and, when the selected env is not `dev` and you hold `environment.promote`, the **Promote to** action labeled with the selected env's name (for example, **Promote to staging**).

Names must match `^[a-z][a-z0-9-]{1,39}$` — lowercase letters, digits, and hyphens, starting with a letter, 2–40 characters. Common choices are `staging`, `prod`, `preview`, `eu`, `tenant-acme`.
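To validate a name before calling the API, the documented pattern can be applied directly. This is a minimal sketch; `is_valid_env_name` is a hypothetical helper, and it uses `fullmatch` rather than `^...$` because Python's `$` would also accept a trailing newline.

```python
import re

# Same character rules as the documented pattern ^[a-z][a-z0-9-]{1,39}$.
ENV_NAME = re.compile(r"[a-z][a-z0-9-]{1,39}")


def is_valid_env_name(name: str) -> bool:
    """True for 2 to 40 characters: lowercase letters, digits, hyphens, letter first."""
    return ENV_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("staging", "tenant-acme", "a", "Prod", "1st-env"):
        print(candidate, is_valid_env_name(candidate))
```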

<img alt="Create environment modal dialog with a single Name field and helper text reading 'Lowercase letters, digits, and hyphens. Must start with a letter. 2–40 chars.', plus Cancel and Create actions" />

A newly created env starts in the **no-pin** state: the runtime is provisioned but no version has been promoted into it yet. Until the first promote, the env serves nothing.

<Tip>
  Creating an env does not auto-mint an API key. The post-create dialog links straight to the [API keys](/deploy/api-keys) page filtered to the new env so you can mint one scoped to it. Keys are scoped per env, so a `prod` key cannot reach `staging` traces, templates, or datasets.
</Tip>

<img alt="Environment created dialog appearing after a successful create, naming the new environment (preview), with an info note that no API key was created automatically and a primary Create an API key for this environment button alongside a Later button" />

<Warning>
  The number of environments you can create per app depends on your tier. When you hit the limit, env creation fails with a message naming your current plan — upgrade to add more.
</Warning>

## Promoting between environments

A promotion copies one environment's current content into another environment and rebuilds the target's code at the source's commit. The target env's version counter advances by one, its runtime rebuilds, and its pin moves to the new version.

### Opening the dialog

Promote is always launched from the **target** — the env you're promoting *into*:

1. In the breadcrumb, switch to the env you want to promote into (`staging`, `prod`, etc.).
2. Open the env dropdown and click **Promote to** followed by the target's name (for example, **Promote to staging**).
3. In the dialog, pick the **source** env from the dropdown and (optionally) add a note explaining why.
4. Click the **Promote to** button, which is labeled with the target's name, to confirm.

The dialog always shows the direction as a banner: `source → target`. You can change the source, but the target is fixed by the env you triggered the action from. This is intentional — promotions have a direction, and the UI never lets it go ambiguous.

<img alt="Promote to staging dialog with the Select source → staging direction banner across the top, the Source environment dropdown expanded to show two options annotated with their pin state — 'dev (unpinned)' and 'prod (v1)' — and an empty Note (optional) field below" />

The source dropdown annotates each option with its current pin state (`dev (unpinned)`, `prod (v1)`) so you can see at a glance which commit each source would send. The target env is excluded from the source list — an env cannot promote into itself.

### What gets promoted

A promote takes the source's **current content** and the **commit it was built from**, and applies both to the target:

* **Content** — prompt templates (`.prompt.mdx`), components (`.mdx`, `.md`), datasets (`.jsonl`), and schemas. A fresh env-keyed snapshot is written for the target, so its content is fully independent of the source going forward.
* **Code** — the target's runtime is rebuilt at the source's commit. The same build cache, env vars, and handler detection rules from the [deployment pipeline](/deploy/deployment) apply.

The source supplies the commit, not the latest of its branch — so what you tested is what gets promoted:

| Source state                      | Commit promoted                 |
| --------------------------------- | ------------------------------- |
| Pinned env (e.g. `staging` at v3) | The env's current pinned commit |
| `dev` or any no-pin env           | The branch's current HEAD       |

### Rules

* **Target cannot be `dev`.** The default env tracks branch HEAD; pinning it would break that invariant. The dialog hides `dev` from anywhere it would appear as a target.
* **Source cannot equal target.** Promoting an env into itself is rejected; the dialog drops the target from the source dropdown.
* **The app needs a git connection.** Promote builds code against a commit, so the app must have a [linked repository](/deploy/deployment#setup). Promoting from an app with no git connection returns 400.
* **Permission required.** Only roles that hold `environment.promote` see the action.
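The rules above, together with the source-state table in the previous section, can be sketched as a single pre-flight check. Everything here is illustrative: the `Env` type, `commit_to_promote`, and the use of `branch_head=None` to stand for "no git connection" are assumptions, and the permission check is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Env:
    name: str
    is_default: bool                      # True only for `dev`
    pinned_commit: Optional[str] = None   # None for `dev` and no-pin envs


def commit_to_promote(source: Env, target: Env, branch_head: Optional[str]) -> str:
    """Validate a promotion and return the commit the target would be built at."""
    if target.is_default:
        raise ValueError("target cannot be the default env")
    if source.name == target.name:
        raise ValueError("source cannot equal target")
    if branch_head is None:
        # Hypothetical signal for an app without a linked repository.
        raise ValueError("app needs a git connection")
    if source.is_default or source.pinned_commit is None:
        # `dev` and no-pin envs promote the branch's current HEAD.
        return branch_head
    # Pinned envs promote the commit they are pinned to, not the branch HEAD.
    return source.pinned_commit
```

Promoting `staging` pinned at one commit into `prod` returns that pinned commit even when the branch has since moved on, which is the "what you tested is what gets promoted" guarantee.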

### Versions

Each non-default env has a monotonic `current_version` counter. A successful forward promote advances it by one — `v1`, `v2`, `v3`. The version that the env is currently pinned to is shown next to its name in the breadcrumb dropdown (`prod · v3`).

`dev` is always at `current_version: 0` — it has no pin to count.

## Watching a promotion run

When you submit a promotion, the API returns as soon as the content snapshot is written and the env pointer is committed. The code build runs asynchronously after that. The dialog stays open and shows a live progress view with two sub-status rows:

* **File sync** — writing the env-keyed content snapshot. Completes before the API returns; the dialog shows this as already done.
* **Code build** — the managed build that produces the target's new runtime. This is what the progress view is actively polling.

<img alt="Promotion in progress dialog showing the dev → staging direction banner, a blue 'In progress' chip, the line 'Deploying version v1 at commit a1b2c3d', and two sub-status rows — File sync with a green 'Synced' pill and Code build with a blue 'Building…' pill — and a single Close action" />

Closing the dialog mid-build does not cancel the promotion — the build keeps going. While a promotion is in flight, the env shows a **Deploying…** badge next to its name on the **Settings → General** page, so you can come back to it from anywhere.

<img alt="Settings → General page showing the Environment card for staging with the env name, a 'No pin' chip, a blue 'Deploying…' badge, and the version label 'v1', followed by a Delete environment affordance" />

The overall status collapses onto one of three values:

| Status          | Meaning                                                                                        |
| --------------- | ---------------------------------------------------------------------------------------------- |
| **In progress** | Either the env-level commit or the code build is still working.                                |
| **Deployed**    | The content commit succeeded and the build settled (either deployed or skipped via cache hit). |
| **Failed**      | Any one of the steps failed; the env stays on its previous version.                            |
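The table above amounts to a small reduction over the two step statuses. The sketch below is one way to express it; apart from `deployed` and `failed`, the raw status strings (`succeeded`, `skipped`, and anything still running) are assumed names, not confirmed API values.

```python
def overall_status(content_status: str, code_status: str) -> str:
    """Collapse the content-commit and code-build statuses into one label."""
    if "failed" in (content_status, code_status):
        # Any failed step fails the promotion; the env keeps its previous version.
        return "Failed"
    if content_status == "succeeded" and code_status in ("deployed", "skipped"):
        # "skipped" stands for a build settled through a cache hit (assumed name).
        return "Deployed"
    return "In progress"
```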

A failed promotion does not change the target's pin or its `current_version` — you can fix the underlying cause and click **Retry** to re-run the same promotion, or close the dialog and trigger a fresh one.

## Deployment history

The **Deployments** tab on each env shows every promotion targeting that env in reverse chronological order. Each card surfaces:

* The short commit SHA the build ran against
* A **Promotion** type chip and the deployment status pill
* The two pipeline steps — **File Sync** and **Code Deploy** — with each step's outcome
* The actor who triggered the promotion and how long ago it ran

<img alt="Deployments tab scoped to the prod env via the breadcrumb, showing two reverse-chronological cards — the newer 7f3c9a2 promoted 20 minutes ago and the older e2e0fix promoted 1 day ago — each with a 'Promotion' type chip, a green 'Success' status, and File Sync and Code Deploy step rows marked Files synced and Code deployed" />

This is the audit log for "what version did `prod` run between Tuesday and Friday?". It is per-env: history entries belong to the env they targeted and stay on it for the life of the app.

<Tip>
  The `GET /v1/environments/{id}/deployments` endpoint returns more detail than the cards show — including the pinned `env_version`, the source env id, and the promotion note. Use it when you need the full audit record programmatically.
</Tip>

## Permissions

Environment actions are gated by these per-app permissions:

| Permission            | Action                                                          |
| --------------------- | --------------------------------------------------------------- |
| `environment.read`    | See the env dropdown and per-env settings                       |
| `environment.insert`  | Create a new environment                                        |
| `environment.promote` | Promote into an environment                                     |
| `environment.delete`  | Delete a non-default environment (with typed-name confirmation) |

See [Users and access control](/deploy/users-and-access-control) for how permissions roll up into roles, and [API keys](/deploy/api-keys) for scoping a key to a specific env.

## API reference

The same actions are available over the API. Every endpoint is scoped to the app passed in the `X-Agentmark-App-Id` header. See the [API reference](/api-reference/overview) — Environments tag — for full request and response schemas.

| Method   | Path                                | Description                                                             |
| -------- | ----------------------------------- | ----------------------------------------------------------------------- |
| `GET`    | `/v1/environments`                  | List environments for the app (default env first)                       |
| `POST`   | `/v1/environments`                  | Create a non-default environment                                        |
| `GET`    | `/v1/environments/{id}`             | Get one env, including a cascade-preview for delete                     |
| `DELETE` | `/v1/environments/{id}`             | Delete a non-default env (requires typed-name confirmation in the body) |
| `POST`   | `/v1/environments/{id}/promote`     | Promote a source env into this one; returns the new deployment          |
| `GET`    | `/v1/environments/{id}/deployments` | List promotion history for an env                                       |

A `POST .../promote` returns `202 Accepted` with the new `deployment_id` and `env_version`. The content snapshot and env-pointer commit are already done by the time the response lands; the build runs asynchronously. Poll `GET /v1/deployments/{deploymentId}` and watch `code_status` until it reaches `deployed` (success) or `failed`.

```bash theme={null}
# Promote staging into prod
curl -X POST https://api.agentmark.co/v1/environments/$PROD_ENV_ID/promote \
  -H "Authorization: Bearer $AGENTMARK_API_KEY" \
  -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "source_environment_id": "'"$STAGING_ENV_ID"'",
    "note": "Release 2026-05-29 — sign-off in #releases"
  }'
```
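After the `202`, a caller typically polls the deployment until the build settles. The following is a minimal sketch using only the standard library. It assumes `code_status` is a top-level field of the JSON response (the real response may wrap it in an envelope), and `fetch_deployment` and `wait_for_build` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

BASE_URL = "https://api.agentmark.co"


def fetch_deployment(deployment_id: str, api_key: str, app_id: str) -> Dict:
    """GET /v1/deployments/{deploymentId} and return the parsed JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/deployments/{deployment_id}",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Agentmark-App-Id": app_id,
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_build(
    fetch: Callable[[], Dict], interval_s: float = 5.0, timeout_s: float = 900.0
) -> str:
    """Poll until code_status is 'deployed' or 'failed', then return it."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch().get("code_status")
        if status in ("deployed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"build still '{status}' after {timeout_s} seconds")
        time.sleep(interval_s)
```

A caller would pass something like `lambda: fetch_deployment(deployment_id, api_key, app_id)` as `fetch`. Taking the fetcher as a parameter keeps the polling loop testable without network access.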

## Related reading

* [Deployment](/deploy/deployment) — the underlying pipeline that promote dispatches a build into
* [API keys](/deploy/api-keys) — keys are scoped per environment
* [Users and access control](/deploy/users-and-access-control) — env permissions and roles
* [Regression gates](/deploy/regression-gates) — block a promotion candidate before it ships



# Organizations and teams
Source: https://docs.agentmark.co/deploy/organizations-and-teams

Organizations, apps, and how they fit together

AgentMark uses an **Organization → App** hierarchy.

```mermaid theme={null}
graph TD
    Org[Organization]
    App1[App: staging]
    App2[App: production]
    App3[App: experiments]
    Org --> App1
    Org --> App2
    Org --> App3
    style Org fill:#fff3c4,stroke:#333,stroke-width:2px
    style App1 fill:#e2f0d9,stroke:#333,stroke-width:2px
    style App2 fill:#e2f0d9,stroke:#333,stroke-width:2px
    style App3 fill:#e2f0d9,stroke:#333,stroke-width:2px
```

## Organizations

Each organization represents a company or team. Organizations own billing, members, and all apps within them.

* One billing configuration per organization
* Multiple members with [role-based access](/deploy/users-and-access-control)
* Multiple apps for different services, environments, or teams

## Apps

Apps are the primary unit of isolation. Each app has its own prompts, traces, datasets, experiments, API keys, and metrics. Apps can be synced to a Git repository (GitHub or GitLab).

Common patterns:

* **Per-environment**: `staging`, `production`, `development`
* **Per-service**: `customer-support-agent`, `search-pipeline`, `onboarding-flow`
* **Per-team**: `team-alpha`, `team-platform`

## Setting up

1. **Create an organization** after your account is provisioned — [request access](https://forms.gle/r2z6HuvEoYfHKDxp6) and we'll send you an invite
2. **Create an app** from your org's **Apps** page — click **Create App**. The app is created with a generated name; you can connect a Git repo afterwards from the app card's settings menu → **Connect Git Repository**.
3. **Invite members** from **Settings → Members** — assign roles (Owner, Admin, Write, Read)
4. **Sync a Git repo** to your app for prompt version control and automatic deployment

## Related

<CardGroup>
  <Card title="Users and access control" icon="shield" href="/deploy/users-and-access-control">
    Roles, permissions, custom roles, and SSO
  </Card>

  <Card title="Billing and usage" icon="credit-card" href="/deploy/billing-and-usage">
    Tiers, limits, and feature availability
  </Card>
</CardGroup>



# Regression gates
Source: https://docs.agentmark.co/deploy/regression-gates

Fail a PR or build when experiment scores regress against a baseline run

A regression gate fails a CI build when an experiment's scores drop against a baseline run. Unlike the absolute `--threshold` pass-rate gate — which fails when too few rows pass in this run alone — a regression gate compares each case to how that same case scored before, so it catches silent quality drops even when a scorer still passes in absolute terms.

Use it to keep a prompt or agent from getting quietly worse as you iterate: you record a baseline once on your default branch, and every PR after that is gated against it.

## When to use this versus `--threshold`

AgentMark has two complementary CI gates. They answer different questions, and you can run both at once.

* **Absolute pass-rate gate** (`--threshold <percent>`): fails when the share of passing rows in *this* run falls below a fixed floor. It needs no baseline and answers "is this run good enough on its own?". See [Running experiments](/evaluate/running-experiments) for the `--threshold` flag and JUnit output.
* **Regression gate** (this page): fails when a case scored *worse than its own baseline*, or when a scorer's mean across the run drops below a configured floor. It needs a baseline run and answers "did this change make anything worse than before?".

A prompt change can keep the overall pass rate at 90% while quietly degrading ten specific cases that used to score higher. The absolute gate misses that; the regression gate catches it.

## How it works

A regression gate compares this run's per-(row × scorer) scores against a baseline run and applies two independent checks. Either one failing fails the build.

### The two checks

**Per-case regression** (`test_settings.regression_tolerance`): a single row × scorer pair fails when its score drops more than `regression_tolerance` *below that same case's baseline score*. The tolerance is a fraction — `0.05` means "fail if the score fell more than 5% below baseline." This is relative and per-case: a score of 0.80 against a baseline of 0.90 is an 11% drop and fails at `0.05`; the same 0.80 with no baseline does not fire this check.

**Run-level threshold** (`test_settings.score_thresholds`): a scorer fails when its *mean* score across the whole run falls below a configured floor. You write it as a `{ scorerName: minMeanScore }` map, for example `{ groundedness: 0.9 }`. This is absolute and run-level — it does not need a baseline, so it stays in force even on the first run.

<Note>
  The per-case regression check only fires when a baseline score exists for that row × scorer pair and the baseline score is greater than zero. It never fires on a missing baseline, a non-numeric score, or a zero baseline — so it cannot fail a build spuriously.
</Note>
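
The two checks can be sketched in TypeScript as follows. This is a minimal illustration of the rules described above, not AgentMark's implementation: the helper names `isRegressed` and `failedThresholds` are hypothetical, and skipping a scorer that has no scores is an assumption.

```typescript
// Hypothetical helpers illustrating the two gate checks; not AgentMark's code.

// Per-case check: fires only when a numeric baseline greater than zero exists
// and the score fell more than `tolerance` (a fraction) below that baseline.
function isRegressed(
  score: number | undefined,
  baseline: number | undefined,
  tolerance: number,
): boolean {
  if (typeof score !== "number" || Number.isNaN(score)) return false;
  if (typeof baseline !== "number" || !(baseline > 0)) return false;
  const fractionalDrop = (baseline - score) / baseline;
  return fractionalDrop > tolerance;
}

// Run-level check: returns the scorers whose mean score is below the floor.
function failedThresholds(
  scoresByScorer: Record<string, number[]>,
  thresholds: Record<string, number>,
): string[] {
  return Object.entries(thresholds)
    .filter(([scorer, floor]) => {
      const scores = scoresByScorer[scorer] ?? [];
      if (scores.length === 0) return false; // assumption: nothing to average
      const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
      return mean < floor;
    })
    .map(([scorer]) => scorer);
}
```

With the numbers from the text, `isRegressed(0.80, 0.90, 0.05)` returns `true` (a drop of about 11%), while the same score with no baseline returns `false`.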

### How the baseline is resolved

Each run is identified by a stable `experiment_key`. It defaults to the prompt's repo-relative entrypoint path (for example `./prompts/qa.prompt.mdx`), so two distinct evaluations never collide even when they share a dataset. Set it explicitly when your subject has no single entrypoint file (a code-assembled agent or workflow), or to keep the identity stable across file renames.

AgentMark resolves the baseline by `experiment_key`, environment, and the git **tree hash** of the code at the base commit. It prefers the run recorded at that exact tree hash. If none exists, it falls back to the most recent prior run of the same `experiment_key` — and reports which one it used, so the comparison is never silent.
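
The resolution order above can be sketched as a small function. This is an illustration under stated assumptions, not the Cloud implementation: the `RunRecord` shape and `resolveBaseline` name are hypothetical, `runs` is assumed to hold only prior runs, and applying the environment filter to the fallback is also an assumption.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface RunRecord {
  experimentKey: string;
  environment: string;
  treeHash: string;
  recordedAt: number; // epoch milliseconds
}

interface ResolvedBaseline {
  run: RunRecord;
  matchedExactCommit: boolean; // false when the recency fallback was used
}

function resolveBaseline(
  runs: RunRecord[],
  experimentKey: string,
  environment: string,
  baseTreeHash: string,
): ResolvedBaseline | undefined {
  const candidates = runs
    .filter((r) => r.experimentKey === experimentKey && r.environment === environment)
    .sort((a, b) => b.recordedAt - a.recordedAt); // newest first

  // Prefer the run recorded at the exact base tree hash.
  const exact = candidates.find((r) => r.treeHash === baseTreeHash);
  if (exact) return { run: exact, matchedExactCommit: true };

  // Otherwise fall back to the most recent prior run of the same key.
  const mostRecent = candidates[0];
  return mostRecent ? { run: mostRecent, matchedExactCommit: false } : undefined;
}
```

Returning `matchedExactCommit` alongside the run keeps the fallback visible to the caller, which mirrors the "never silent" behavior described above.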

Rows are matched between runs by a content hash of the dataset input, not by position or ID. Reordering your dataset or regenerating row IDs does not break the comparison.
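
To see why reordering and regenerated IDs do not matter, here is a sketch of joining rows by a content hash of the input. The actual hash function and serialization AgentMark uses are not documented here, so SHA-256 over key-sorted JSON is an assumption, as are the names `canonicalJson`, `inputHash`, and `matchRows`.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so structurally equal inputs hash equally.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// Assumption: SHA-256 of the canonical JSON stands in for the real content hash.
function inputHash(input: unknown): string {
  return createHash("sha256").update(canonicalJson(input)).digest("hex");
}

interface Row {
  id: string;
  input: unknown;
  score: number;
}

// Join current rows to baseline rows by input hash, ignoring position and ID.
function matchRows(current: Row[], baseline: Row[]): Array<{ current: Row; baseline: Row }> {
  const byHash = new Map(baseline.map((row) => [inputHash(row.input), row] as const));
  return current.flatMap((row) => {
    const match = byHash.get(inputHash(row.input));
    return match ? [{ current: row, baseline: match }] : [];
  });
}
```

Rows whose input has no counterpart in the baseline simply drop out of the join, which is why a dataset whose inputs all changed compares nothing.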

## Prerequisites

A regression gate compares against a baseline run, so a baseline has to exist first.

* **Baselines are stored in [AgentMark Cloud](/introduction/deployment-modes).** The local dev server's run storage is ephemeral, so it cannot serve as a durable baseline across CI runs. Both setup paths below require an `AGENTMARK_API_KEY`. (The eval-action's `api-key` input is optional for plain evals, but the regression gate needs Cloud-stored baselines, so a key is required here.)
* **Bootstrap by recording a baseline on your default branch.** Run the experiment once on `main` (through the same eval-action or SDK call you use in PRs). From then on, each PR gates against the run recorded on its base commit.
* **No prior run means the gate is inert, not failing.** If AgentMark finds no baseline for the `experiment_key`, the per-case regression check is skipped — there's nothing to compare against yet. The run-level `score_thresholds` gate still applies.

## Set it up for prompts

For prompt-based evals, AgentMark publishes a CI integration for each major platform that runs the changed `.prompt.mdx` files on each PR/MR, compares each case to the base commit's run, and fails the check with per-case annotations. The gate thresholds live in the prompt's frontmatter, so the workflow only needs to point the integration at a baseline ref.

<Tabs>
  <Tab title="GitHub Actions">
    Use the [`agentmark-ai/eval-action`](https://github.com/agentmark-ai/eval-action) GitHub Action.

    <Note>
      The `agentmark-ai/eval-action` repo publishes alongside the first regression-gate release. If `agentmark-ai/eval-action@v1` resolves to a 404 for you, the action hasn't been published yet — use the SDK setup below in the meantime (it runs the same gate from your existing test suite, and works for GitLab and other CI platforms too).
    </Note>

    <Steps>
      <Step title="Add the gate config to the prompt frontmatter">
        Set `regression_tolerance` and `score_thresholds` in the prompt's `test_settings`.

        ```yaml theme={null}
        test_settings:
          dataset: ./data/qa.jsonl
          regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
          score_thresholds:
            groundedness: 0.9                   # fail the run if mean groundedness < 0.9
        ```
      </Step>

      <Step title="Add the action to your PR workflow">
        Check out with full history so the action can resolve the base ref to a tree hash, then add the action. `baseline-ref` defaults to the PR base SHA, so you can omit it.

        ```yaml theme={null}
        name: Evals
        on: pull_request

        jobs:
          eval:
            runs-on: ubuntu-latest
            steps:
              - uses: actions/checkout@v4
                with:
                  fetch-depth: 0                # required: the gate resolves the base ref to a tree hash
              - uses: agentmark-ai/eval-action@v1
                with:
                  api-key: ${{ secrets.AGENTMARK_API_KEY }}
                  # baseline-ref defaults to ${{ github.event.pull_request.base.sha }}
        ```
      </Step>
    </Steps>

    <Warning>
      `fetch-depth: 0` is required. The default shallow checkout does not contain the base commit, so the action cannot resolve it to a tree hash. When that happens the action prints a warning and disables the regression gate for that run rather than failing.
    </Warning>

    The action reference is `agentmark-ai/eval-action@v1` — a standalone action, so the `uses:` path is just the org and repo. It writes JUnit XML, so failures appear in the same PR check panel as your existing `pytest`, `jest`, or `vitest` runs.
  </Tab>

  <Tab title="GitLab CI/CD">
    Use the [`agentmark-ai/eval-component`](/evaluate/gitlab-ci) GitLab CI/CD Catalog component. It wraps the same CLI as the GitHub Action and accepts the same `baseline-ref` / `threshold` semantics — only the integration boilerplate differs.

    <Note>
      The `agentmark-ai/eval-component` Catalog project publishes alongside the first GitLab-parity release. If `gitlab.com/agentmark-ai/eval-component/eval@v1` resolves to a 404, see the [raw-CLI fallback](/evaluate/gitlab-ci#raw-cli-fallback) — same JUnit output, hand-rolled YAML. The platform-neutral SDK setup below also works for GitLab.
    </Note>

    <Steps>
      <Step title="Add the gate config to the prompt frontmatter">
        Set `regression_tolerance` and `score_thresholds` in the prompt's `test_settings` — the same frontmatter the GitHub Action consumes.

        ```yaml theme={null}
        test_settings:
          dataset: ./data/qa.jsonl
          regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
          score_thresholds:
            groundedness: 0.9                   # fail the run if mean groundedness < 0.9
        ```
      </Step>

      <Step title="Add the component to your MR pipeline">
        Set `GIT_DEPTH: "0"` so the diff base resolves to a tree hash, then include the component. `baseline-ref` defaults to `$CI_MERGE_REQUEST_DIFF_BASE_SHA`, so MRs pick the right baseline automatically.

        ```yaml theme={null}
        include:
          - component: gitlab.com/agentmark-ai/eval-component/eval@v1
            inputs:
              api-key: $AGENTMARK_API_KEY    # masked, protected CI variable

        variables:
          GIT_DEPTH: "0"                     # required so the diff base resolves
        ```
      </Step>
    </Steps>

    <Warning>
      `GIT_DEPTH: "0"` is required. GitLab's default shallow checkout does not contain the diff base, so the component cannot resolve `$CI_MERGE_REQUEST_DIFF_BASE_SHA` to a tree hash. When that happens the component disables the regression gate for the run rather than failing.
    </Warning>

    JUnit XML lands as an `artifacts:reports:junit` artifact, so per-case failures render inline in the MR widget and the pipeline **Tests** tab alongside any other `pytest` / `jest` / `vitest` failures. See [GitLab CI/CD](/evaluate/gitlab-ci) for the full inputs reference.
  </Tab>
</Tabs>

## Set it up for agents and workflows (SDK)

When the thing under test is an agent or a multi-step workflow rather than a single prompt, gate it from inside your existing test suite with the TypeScript SDK. There are no separate eval files and no CLI — your `task` function *is* the execution, so it works with any framework and needs no adapter.

The trade-off: the CLI derives the two git tree hashes automatically, but the SDK does not. You pass them yourself from your CI environment.

```ts theme={null}
import { AgentMarkSDK } from "@agentmark-ai/sdk";

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
});

// Record this run as a trace so it can serve as a future baseline.
sdk.initTracing();

// `dataset`, `supportAgent`, and `groundedness` are defined elsewhere in your test suite.
const result = await sdk.runExperiment({
  experimentKey: "support-agent",
  dataset,                                    // [{ input, expectedOutput? }, ...]
  task: (input) => supportAgent.run(input),   // any callable — your agent or workflow
  evaluators: [groundedness],
  sourceTreeHash: process.env.TREE_SHA,       // `git rev-parse HEAD^{tree}`
  baselineTreeHash: process.env.BASE_SHA,     // base commit's tree hash; omit to skip the gate
  regressionTolerance: 0.05,
  scoreThresholds: { groundedness: 0.9 },
  junitPath: "agentmark-results-support-agent.xml", // emit the same JUnit the CLI does
});

// Fail the test when either gate fired (the JUnit file also lands for CI to report).
expect(result.passed).toBe(true);
```

The SDK constructor needs both `apiKey` and `appId`. `initTracing()` registers the run with AgentMark Cloud so a later PR can use it as a baseline — without it, the run executes and gates, but it won't be stored as a baseline for next time.

Setting `junitPath` writes the run as JUnit XML — the same shape the eval-action produces for prompts — so a code experiment surfaces in the PR check exactly like a prompt one. See [Surface both in one check](#surface-prompt-and-code-experiments-in-one-check).

To compute the tree hashes in CI:

```bash theme={null}
TREE_SHA=$(git rev-parse "HEAD^{tree}")
# After actions/checkout@v4, the base branch only exists as a remote-tracking
# ref (origin/<branch>), so prefix with `origin/`. GITHUB_BASE_REF is set only
# on pull_request events (${{ github.event.pull_request.base.sha }} also works
# there); for other events, substitute the appropriate base SHA.
BASE_SHA=$(git rev-parse "origin/$GITHUB_BASE_REF^{tree}")
```

<Note>
  Pass a git **tree hash**, not a commit SHA, for both `sourceTreeHash` and `baselineTreeHash`. Tree hashes are content-addressed, so two commits with identical file contents resolve to the same baseline. `git rev-parse <ref>^{tree}` converts any commit ref to its tree hash.
</Note>
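
If you would rather compute the hashes inside the test suite than in a shell step, a small wrapper around `git rev-parse` is enough. This is a sketch, assuming `git` is on the `PATH` and the checkout has full history; `treeHashOf` and the injectable `GitRunner` are hypothetical names, not part of the SDK.

```typescript
import { execFileSync } from "node:child_process";

// A function that runs git with the given arguments and returns its stdout.
type GitRunner = (args: string[]) => string;

const runGit: GitRunner = (args) => execFileSync("git", args, { encoding: "utf8" });

// Resolve any commit ref to its tree hash via `git rev-parse <ref>^{tree}`.
// The runner is injectable so the helper can be tested without a repository.
function treeHashOf(ref: string, git: GitRunner = runGit): string {
  return git(["rev-parse", `${ref}^{tree}`]).trim();
}
```

In a pull request job you could then pass `treeHashOf("HEAD")` as `sourceTreeHash` and `treeHashOf("origin/" + process.env.GITHUB_BASE_REF)` as `baselineTreeHash`.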

## Surface prompt and code experiments in one check

JUnit is the shared contract. The eval-action emits it for prompt experiments, and `runExperiment({ junitPath })` emits the *identical* shape for code experiments — same per-`(row × scorer)` testcases, same regression `<failure>`s, same run-level threshold cases. Point one reporter at both and a single PR check covers everything, regardless of origin.

The action's `report: false` makes it *produce* the XML without opening its own check, so a downstream reporter can combine both producers:

```yaml theme={null}
- uses: actions/checkout@v4
  with:
    fetch-depth: 0

# Prompt experiments → agentmark-results-*.xml (no check yet)
- uses: agentmark-ai/eval-action@v1
  with:
    api-key: ${{ secrets.AGENTMARK_API_KEY }}
    report: false

# Code experiments → the same glob, from your test suite
- run: npm test            # runExperiment({ junitPath: 'agentmark-results-<key>.xml' })
  env:
    AGENTMARK_API_KEY: ${{ secrets.AGENTMARK_API_KEY }}
    AGENTMARK_APP_ID: ${{ secrets.AGENTMARK_APP_ID }}

# One reporter for both → a single "AgentMark Evals" check
- uses: mikepenz/action-junit-report@v5
  if: always()
  with:
    report_paths: 'agentmark-results-*.xml'
    check_name: 'AgentMark Evals'
    fail_on_failure: true
```

Because the format is shared, this also works outside GitHub — any JUnit consumer (GitLab, Jenkins, CircleCI) renders both the same way.

## Read the results

Both setup paths surface the same gate outcome — overall pass/fail plus the exact cases that regressed.

In **CI with eval-action**, every row × scorer pair is a JUnit `<testcase>`. A regressed case emits a `<failure>`, so the PR check panel and the Checks tab point at the specific inputs that got worse. The run-level `score_thresholds` failures appear as their own testcases.

In the **SDK**, the return value pinpoints each regression. `result.passed` is the gate verdict: `false` if any case regressed or a `score_thresholds` floor was breached. It does not consider each row's absolute pass/fail, so assert on `row.evals[].passed` yourself if you want that too. `result.regressionFailures` counts the regressed pairs, and each row carries per-eval detail so you can list exactly what dropped:

```ts theme={null}
const regressed = result.rows.flatMap((row) =>
  row.evals
    .filter((e) => e.regressed)
    .map((e) => ({ input: row.input, scorer: e.name, score: e.score, baseline: e.baselineScore })),
);

for (const r of regressed) {
  console.log(`${r.scorer}: ${r.score} (baseline ${r.baseline}) — ${JSON.stringify(r.input)}`);
}
```

Each eval result carries `regressed` (whether this specific score fell beyond tolerance) and `baselineScore` (what the matched baseline scored), alongside the run's `failedScoreThresholds` and the `resolved` baseline descriptor.

For the underlying flag that drives the CLI path, see [`--baseline-commit`](/sdk-reference/cli/commands#agentmark-run-experiment) in the CLI reference. The eval-action resolves `baseline-ref` to a tree hash and passes it as `--baseline-commit` for you.

## Caveats

* **No baseline disables only the regression check.** When no prior run exists for the `experiment_key`, the per-case regression check is skipped; `score_thresholds` still runs. Absolute per-row pass/fail is gated only in **eval-action** — a `passed: false` scorer becomes a JUnit `<failure>` the reporter fails on; the CLI's own exit code does not gate it, and the **SDK**'s `result.passed` covers only the regression and `score_thresholds` gates. The CLI prints `⚠️  No baseline run found for "<experiment_key>" — regression gate inactive.` to stderr; stdout stays clean for redirecting to a results file.
* **Exact-match versus recency fallback is reported, never silent.** If there's no run at the base commit's exact tree hash, AgentMark compares against the most recent prior run of the same key instead. The CLI prints `⚠️  No run at <tree-hash> for "<experiment_key>"; comparing against the most recent prior run instead.` to stderr, and the SDK returns `resolved.matchedExactCommit: false`. A recency fallback can compare against a different code state than the PR base, so treat its results as advisory.
* **Row matching is by input hash, so masking or input drift can leave it matching nothing.** Rows are joined to the baseline by a content hash of the dataset input. If you redact inputs — the SDK tracing `hideInputs` option, or a `mask` function that rewrites the stored `agentmark.dataset_input` the gate hashes — or the dataset input otherwise differs from the baseline run, the live rows won't match and the per-case check compares nothing. A baseline that matched nothing is treated as inert (like no baseline), not a failure, but it is reported, never silent: the CLI prints `⚠️  Baseline resolved but 0/<N> rows matched it by input hash — regression gate compared nothing.` to stderr, and the SDK returns `baselineRowsMatched: 0` (with a `console.warn`). Assert on `result.baselineRowsMatched > 0` in CI if a silently inert gate would be worse than a hard failure.
* **`experiment_key` must be stable across runs to match.** The CLI defaults `experiment_key` to the repo-relative entrypoint path, derived from the git top level. A run recorded where git is unavailable falls back to the prompt name or file basename, which won't match a git-derived key — so a baseline and a candidate computed in different environments can silently fail to resolve. Set `test_settings.experiment_key` explicitly to pin the identity when your runs span environments.
* **A non-positive baseline score is skipped.** The per-case check needs a baseline score greater than zero to compute a fractional drop, so a baseline of `0` never fires a regression for that pair.
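
Several of these caveats leave the gate passing while comparing little or nothing. A test suite can surface that explicitly. The sketch below uses only the result fields named on this page (`passed`, `baselineRowsMatched`, `resolved.matchedExactCommit`); the `GateResult` interface is a hypothetical minimal shape, and treating a missing `resolved` as "no baseline" is an assumption about the SDK's return value.

```typescript
// Minimal result shape built from the fields named on this page; the real
// SDK type may differ (assumption).
interface GateResult {
  passed: boolean;
  baselineRowsMatched: number;
  resolved?: { matchedExactCommit: boolean };
}

// Returns reasons why the regression gate was inert or only advisory.
function gateWarnings(result: GateResult): string[] {
  const warnings: string[] = [];
  if (!result.resolved) {
    warnings.push("no baseline resolved: per-case regression check was skipped");
    return warnings;
  }
  if (!result.resolved.matchedExactCommit) {
    warnings.push("recency fallback used: baseline may reflect a different code state");
  }
  if (result.baselineRowsMatched === 0) {
    warnings.push("baseline matched 0 rows by input hash: nothing was compared");
  }
  return warnings;
}
```

Whether to fail the build on a non-empty list or only log it depends on whether a silently inert gate is worse for you than a hard failure.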

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Security
Source: https://docs.agentmark.co/deploy/security

How AgentMark protects your data — SSO, encryption, PII masking, and data residency

## Authentication

AgentMark supports email/password authentication for all tiers and SAML 2.0 SSO for Team and Enterprise tiers.

### SSO (Team and Enterprise)

Configure SAML 2.0 single sign-on for your organization:

* **Supported providers**: Azure AD, Okta, Google Workspace, and any SAML 2.0-compliant IdP
* **Domain allowlisting**: restrict sign-in to specific email domains
* **Enforcement mode**: require SSO for all org members (no password fallback)
* **Attribute mapping**: map IdP attributes (full name, first/last name) to AgentMark profiles

To configure SSO, navigate to **Settings → SSO** in the AgentMark Dashboard.

## Data protection

### PII masking

Redact sensitive data from traces before it leaves your application. Masking runs in your application process, so configured attributes are redacted before the OTel exporter sends them.

```typescript theme={null}
import { AgentMarkSDK, createPiiMasker } from '@agentmark-ai/sdk';

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
  mask: createPiiMasker({ email: true, ssn: true }),
});
sdk.initTracing();
```

For a zero-code option, set `AGENTMARK_HIDE_INPUTS=true` or `AGENTMARK_HIDE_OUTPUTS=true` to redact LLM request inputs or response outputs (the `gen_ai.request.*` and `gen_ai.response.*` attributes) to `[REDACTED]` before export.
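
To make the idea of in-process redaction concrete, here is a sketch of pattern-based masking applied to a string before it is exported. It is not the implementation of `createPiiMasker`, and the signature the SDK expects for a custom `mask` function is not shown on this page; `redactPii` and both regular expressions are illustrative assumptions.

```typescript
// Illustrative patterns only; real-world PII detection needs broader coverage.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

// Replace every email address and US SSN-shaped number with a placeholder.
function redactPii(text: string): string {
  return text.replace(EMAIL_PATTERN, "[REDACTED]").replace(SSN_PATTERN, "[REDACTED]");
}
```

Because the replacement happens on the string itself, nothing matching the patterns remains by the time an exporter serializes the value.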

[Full PII masking docs →](/observe/pii-masking)

### Encryption

* **In transit**: all API communication uses TLS 1.2+ (terminated at Cloudflare)
* **At rest**: data stored in Supabase (PostgreSQL) and ClickHouse with provider-managed encryption at rest

### Provider API keys (managed deployments)

Every AgentMark app runs as a managed deployment. The AI provider keys you configure (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) are stored as encrypted secrets in our vault, using authenticated encryption with a root key held outside the application database.

* **Scoped per app** — one app's keys are never visible to another
* **Decrypted only when needed** — values are pulled from the vault at build time and injected into the handler runtime, or when an authorized dashboard user explicitly clicks "reveal" on a single variable
* **Never written to logs** — env var values are excluded from build logs, request logs, and trace exports
* **Deleted on demand** — removing a variable from the dashboard deletes the underlying vault secret in the same transaction

### AgentMark API key security

AgentMark API keys (the keys your code uses to authenticate with AgentMark) are issued with per-tier rate limiting and scoped to individual apps — no single key grants access across your organization.

Each key carries a permission set that controls which API endpoints it can call. Choose a preset role (**SDK**, **Read-Only**, **Full Access**) or build a **Custom** permission set. The gateway enforces these permissions on every request and returns `403` when a key lacks the required permission. See the [API keys walkthrough](/deploy/api-keys) and [Users and access control](/deploy/users-and-access-control#api-keys) for details.

## Data residency

If you have regional hosting, on-prem, or data residency requirements, [contact us](mailto:hello@agentmark.co) — Enterprise deals can accommodate custom arrangements.

## Data retention

Data retention varies by tier:

| Tier       | Retention    |
| ---------- | ------------ |
| Free       | 7 days       |
| Growth     | 90 days      |
| Team       | 90 days      |
| Enterprise | Configurable |

## Temporary support access

When Enterprise customers need hands-on support, AgentMark support engineers can grant themselves temporary read-only access to a tenant's data. Access auto-expires after 24 hours and requires customer permission confirmation. All access grants are recorded in an immutable audit log.



# Status checks
Source: https://docs.agentmark.co/deploy/status-checks

Fail pull/merge requests when AgentMark prompts don't compile, the same way you'd fail them for a broken build.

When you connect a repository to AgentMark — GitHub or GitLab — every push gets a status check on the commit: **AgentMark / Build**. It's the same pattern Vercel uses: if the deploy would fail, the PR/MR shows red and (once you make it required) blocks merge.

## What gets checked

On both providers, the check runs every prompt and component file in your push through the same compiler the runtime uses (`getTemplateDXInstance(...).parse(...)`). If a file would fail to load at request time, the check fails — with that file annotated on the PR (GitHub) or summarized in the status description (GitLab).

What surfaces as a failure today:

* **Template syntax errors** — malformed MDX/JSX in `.prompt.mdx` or `.mdx` files (unclosed tags, stray HTML comments, unexpected characters).
* **Frontmatter errors** — missing `text_config`/`object_config`/`image_config`/`speech_config`; invalid model names; missing required fields.
* **Schema reference errors** — `$ref` paths that don't resolve in the push.

What the check does **not** cover:

* Your handler's TypeScript or Python build — that runs on the code-deploy step, and surfaces in the AgentMark dashboard under the deployment's build logs.
* Files outside the `agentmark` directory configured by `agentmark.json` — only files AgentMark Cloud imports are validated.

## Conclusion states

The check reports one of three conclusions per push:

| Conclusion  | GitHub appearance                                              | GitLab appearance                                                                                                        |
| ----------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| **success** | green check                                                    | `success` status, "All prompt and component files compiled."                                                             |
| **failure** | red ✕, with per-file annotations on the PR's Files changed tab | `failed` status, description shows count by category (e.g. `3 prompt compile errors (1 frontmatter, 2 template syntax)`) |
| **neutral** | gray, "No prompts to compile"                                  | `success` status, "No AgentMark files in this push." — GitLab has no native neutral state, so we collapse to success     |

## GitHub

### Making the check required

GitHub's "required status check" setting is what actually blocks merge — it's a repo-level setting we deliberately don't manage on your behalf, because doing so needs `administration` scope on the GitHub App and that's a security surface most teams don't want us holding.

To enforce the check:

1. In your GitHub repo, go to **Settings → Branches → Branch protection rules**.

2. Add a rule for the branch you deploy from (usually `main`).

3. Enable **Require status checks to pass before merging**.

4. In the search box, type `AgentMark` and select **AgentMark / Build**.

   <Tip>
     The check name only appears in GitHub's picker after the first push that triggers it. If you don't see it, push any commit to the repo first, then come back to this screen.
   </Tip>

5. Save.

From now on, any PR targeting that branch can't merge until the check passes.

### What a failure looks like

When a prompt fails to compile, the PR's **Files changed** tab gets an inline annotation on the offending file. The annotation title categorizes the failure so you can spot the class of problem from the sidebar without opening each file:

* **Frontmatter error** — fix the `---` block at the top of the prompt.
* **Template syntax error** — MDX/JSX in the body didn't parse.
* **Schema reference error** — a `$ref` couldn't be resolved.
* **Prompt compile error** — generic fallback.

Annotations include the parser's line and column when the underlying templatedx error carries them. Frontmatter errors without a line number fall back to line 1 (the start of the frontmatter block).
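
The title and line fallback rules can be sketched as a mapping from a compile error to an annotation. The `CompileError` shape, the category identifiers, and `toAnnotation` are hypothetical; the source only states the line 1 fallback for frontmatter errors, so applying it to every category here is an assumption.

```typescript
type Category = "frontmatter" | "template_syntax" | "schema_reference" | "generic";

// Hypothetical error shape; line and column are present only when the
// underlying parser error carries them.
interface CompileError {
  category: Category;
  message: string;
  line?: number;
  column?: number;
}

// Annotation titles as listed above.
const TITLES: Record<Category, string> = {
  frontmatter: "Frontmatter error",
  template_syntax: "Template syntax error",
  schema_reference: "Schema reference error",
  generic: "Prompt compile error",
};

function toAnnotation(error: CompileError) {
  return {
    title: TITLES[error.category],
    message: error.message,
    line: error.line ?? 1, // fall back to the first line when no position is known
    column: error.column,
  };
}
```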

## GitLab

### What the check looks like

GitLab's commit status API is intentionally thinner than GitHub's Checks API — it carries `{state, name, description, target_url}` and nothing else. So the GitLab experience differs in two ways:

* **No line-level annotations on the MR diff.** Failure details ride on the status description as a categorized count: `4 prompt compile errors (1 frontmatter, 2 template syntax, 1 schema reference)`.
* **The `target_url` points to the AgentMark dashboard** — click through for the per-file failure list with full error messages.
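
As an illustration of how little the commit status carries, the sketch below maps the three conclusions to a GitLab state and builds the categorized description string. The names `toGitLabState`, `statusDescription`, and `ErrorCounts` are hypothetical, the generic "prompt compile error" category is left out for brevity, and the wording for a single error is not documented, so the plural form is always used here.

```typescript
type Conclusion = "success" | "failure" | "neutral";
type GitLabState = "success" | "failed";

// GitLab has no neutral state, so neutral collapses to success.
function toGitLabState(conclusion: Conclusion): GitLabState {
  switch (conclusion) {
    case "failure":
      return "failed";
    case "success":
    case "neutral":
      return "success";
  }
}

interface ErrorCounts {
  frontmatter: number;
  templateSyntax: number;
  schemaReference: number;
}

// Build the categorized count shown in the status description.
function statusDescription(counts: ErrorCounts): string {
  const parts: Array<[number, string]> = [
    [counts.frontmatter, "frontmatter"],
    [counts.templateSyntax, "template syntax"],
    [counts.schemaReference, "schema reference"],
  ];
  const nonZero = parts.filter(([n]) => n > 0);
  const total = nonZero.reduce((sum, [n]) => sum + n, 0);
  if (total === 0) return "All prompt and component files compiled.";
  const breakdown = nonZero.map(([n, label]) => `${n} ${label}`).join(", ");
  return `${total} prompt compile errors (${breakdown})`;
}
```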

A planned follow-up will use GitLab's **MR Discussions API** to post line-positioned comments alongside the status, giving GitLab users back the line specificity GitHub gets via annotations.

### Making the check required

GitLab's options depend on your tier:

| GitLab tier            | Mechanism                                                                      | Limitation                                                                                                                                                                                                                                                                                                                                                                            |
| ---------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Free**               | "Pipelines must succeed" in **Settings → Merge requests**                      | Only blocks merge if the status comes from a real GitLab CI job. The Cloud-managed status posted by AgentMark Cloud is informational. To block merge on free tier, use the self-hosted CI job pattern in [Hardcoded merge blocking via CI](#hardcoded-merge-blocking-via-ci) below — it runs `agentmark build` as a real pipeline job, which "Pipelines must succeed" *can* block on. |
| **Premium / Ultimate** | **External Status Checks** in **Settings → General → Merge request approvals** | Add a rule pointing at AgentMark; the Cloud-managed status counts as a required external approval.                                                                                                                                                                                                                                                                                    |

## Hardcoded merge blocking via CI

The Cloud-managed status checks above are convenient — zero config beyond installing the App or wiring a webhook — but they depend on AgentMark Cloud being reachable, and on the provider tier supporting "required external checks" (GitLab Free doesn't).

For a **fully self-hosted** alternative that blocks merge on *any* provider regardless of tier, run `agentmark build` as a CI job in your own repository. The CLI compiles every prompt with the same compiler the runtime uses and exits non-zero on failure — failing the pipeline naturally and triggering the provider's built-in "pipelines must succeed" merge guard.

This pattern has three advantages over (or alongside) the Cloud-managed check:

* **Works on GitLab Free.** "Pipelines must succeed" blocks merge only on pipeline failures, not on external status checks, and a failing `agentmark build` job *is* a pipeline failure.
* **Doesn't depend on AgentMark Cloud uptime.** The validation runs entirely inside your CI. If our webhook handler is degraded, your merge guard still works.
* **Faster feedback for big repos.** No webhook round-trip — the CI job runs in seconds against the pushed commit directly.

### GitLab CI

```yaml .gitlab-ci.yml theme={null}
agentmark-build:
  image: node:22
  stage: test
  script:
    - npx --yes @agentmark-ai/cli build
  rules:
    - changes:
        - "agentmark/**/*"
        - "agentmark.json"
        - "**/*.prompt.mdx"
        - "**/*.mdx"
```

Then in **Settings → Merge requests → Merge checks**, enable **Pipelines must succeed**. From now on, an MR can't merge until the `agentmark-build` job is green.

### GitHub Actions

```yaml .github/workflows/agentmark-build.yml theme={null}
name: AgentMark / Build
on:
  pull_request:
    paths:
      - "agentmark/**"
      - "agentmark.json"
      - "**/*.prompt.mdx"
      - "**/*.mdx"

jobs:
  build:
    name: AgentMark / Build   # the required-check picker lists jobs by this name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npx --yes @agentmark-ai/cli build
```

Then in **Settings → Branches → Branch protection rules**, add a rule requiring the **AgentMark / Build** check name. Same picker, same UX as the App-based check — your CI job's status takes the place of (or runs alongside) the App's.

<Tip>
  You can run both the Cloud-managed check and the self-hosted CI check in parallel. They use the same name (`AgentMark / Build`) but appear as separate entries in the required-status picker on GitHub. On GitLab they appear as separate statuses on the MR pipeline. Belt-and-suspenders: if either signals failure, the merge blocks.
</Tip>

## Re-running the check

There's no "Re-run" button on either platform — the check is keyed to the commit SHA, and pushing a new commit (even an empty one) is the way to retry:

```bash theme={null}
git commit --allow-empty -m "Retry AgentMark check"
git push
```

## Limitations

* **PRs from forks don't get checks today.** On GitHub, the check is posted from the `push` webhook, which forks don't send to the upstream App. PRs from branches in the same repo do get checks. GitLab MRs from forked projects have the analogous limitation.
* **No retries on transient outages.** If the status API itself errors when we post the result, we log it and move on; the deploy still runs. The check will sit at *in progress* (GitHub) / *running* (GitLab) until the provider's timeout. You can re-push to recover.
* **One check per push**, not per file. The aggregate conclusion comes from every prompt file in the change set.
* **GitLab line-level annotations are deferred to a follow-up** that uses the MR Discussions API to post comments at failing line positions. Today, GitLab users get a categorized count in the status description; the full per-file list lives in the AgentMark dashboard.



# Users and access control
Source: https://docs.agentmark.co/deploy/users-and-access-control

Roles, permissions, custom roles, app-level access, and team management

AgentMark uses a role-based access control (RBAC) system with granular permissions at the organization and app level.

## Built-in roles

Every organization member is assigned one of these roles:

| Role      | Access                                                                                                                                                                                                |
| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Owner** | Full access, including billing and subscription management. Assigned to the org creator. Only Owners can promote other members to Owner.                                                              |
| **Admin** | Full access to all org resources, settings, and member management. Read-only billing — only Owners can change the subscription plan or payment method.                                                |
| **Write** | Create and edit prompts and datasets; create and view API keys; run experiments; view traces. Read-only on apps — only Admins and Owners can create or delete apps. Cannot manage members or billing. |
| **Read**  | Read-only access to all org resources. Cannot create, edit, or delete anything.                                                                                                                       |

## Inviting members

Invite team members from **Settings → Members** in the AgentMark Dashboard. Invitations are sent by email and expire after 7 days. Each invitation includes a role assignment.

## Custom roles and app-level access

<Info>**Team tier and above.** Custom roles and app-level role assignments require a Team or Enterprise subscription.</Info>

### Custom roles

Create custom roles with cherry-picked permissions for fine-grained access control:

1. Navigate to **Settings → Roles** in the Dashboard
2. Click **Create role**
3. Name the role and select the specific permissions to grant
4. Assign the role to members

Custom roles draw from the full permission catalog — you can grant access to specific features (e.g., "can view traces and run experiments but cannot edit prompts or manage billing").

### App-level roles

Assign different roles per app within the same organization. A member might have **Write** access to your staging app but **Read** access to production.

To configure per-app access, open **Settings → Members** in the AgentMark Dashboard, click the row action menu next to a member, and choose **Manage app access**. From the dialog, toggle each app on or off and set a built-in or custom role per app.

## API keys

API keys are scoped to individual apps. Each key grants access only to that app's resources (prompts, traces, experiments).

* Create and manage keys from the app-level **Settings → API keys** page in the Dashboard (under `/orgs/<org>/apps/<app>/settings/api-keys`)
* Keys are rate-limited by tier (see [Billing and usage](/deploy/billing-and-usage) for limits)
* Key names must be unique within an app

For a step-by-step Dashboard walkthrough with screenshots, see [API keys](/deploy/api-keys).

### Role presets

Each API key carries either a preset role or a custom permission set. When you create or edit a key, choose one of these presets or select **Custom** to toggle individual permissions.

| Role            | Access                                                                                                                                                                                                                                                           |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **SDK**         | `trace.write`, `template.read`, `score.write`. CLI and SDK integrations that ingest traces, read templates, and write scores.                                                                                                                                    |
| **Read-Only**   | `trace.read`, `span.read`, `session.read`, `score.read`, `score_config.read`, `dataset.read`, `metrics.read`, `deployment.read`, `environment.read`, `alert.read`, `slack_integration.read`, `app.read`. Dashboards and BI tools — read-only access to all data. |
| **Full Access** | Every permission in the catalog. Admin and CI pipelines.                                                                                                                                                                                                         |
| **Custom**      | Toggle individual permissions. At least one permission is required.                                                                                                                                                                                              |

### Permission catalog

The custom permission picker exposes these permissions, grouped by resource:

| Permission                | Description                                       |
| ------------------------- | ------------------------------------------------- |
| `trace.write`             | Ingest new traces via `POST /v1/traces`           |
| `trace.read`              | Read traces and graph views                       |
| `span.read`               | Read spans via `GET /v1/spans`                    |
| `session.read`            | Read sessions                                     |
| `template.read`           | Read prompt templates                             |
| `score.read`              | Read scores, aggregations, and score names        |
| `score.write`             | Create scores                                     |
| `score.delete`            | Delete scores                                     |
| `score_config.read`       | Read score configs                                |
| `dataset.read`            | Read datasets                                     |
| `dataset.write`           | Create dataset rows                               |
| `metrics.read`            | Read aggregate metrics                            |
| `experiment.read`         | Read experiments, runs, and prompt execution logs |
| `annotation_queue.read`   | Read annotation queues and queue items            |
| `annotation_queue.write`  | Create annotation queues and queue items          |
| `annotation_queue.delete` | Delete annotation queues                          |
| `annotation_queue.review` | Submit reviews to annotation queues               |
| `api_key.read`            | Read API keys                                     |
| `api_key.insert`          | Create API keys                                   |
| `api_key.delete`          | Delete API keys                                   |
| `deployment.read`         | Read deployments                                  |
| `environment.read`        | Read environments                                 |
| `environment.insert`      | Create environments                               |
| `environment.update`      | Update environments                               |
| `environment.delete`      | Delete environments                               |
| `environment.promote`     | Promote / roll back environments                  |
| `alert.read`              | Read alerts                                       |
| `alert.insert`            | Create alerts                                     |
| `alert.update`            | Update alerts                                     |
| `alert.delete`            | Delete alerts                                     |
| `slack_integration.read`  | Read Slack channels                               |
| `app.read`                | Read apps                                         |
| `app.insert`              | Create apps                                       |
| `app.update`              | Update apps                                       |
| `app.delete`              | Delete apps                                       |

See [Endpoint permissions](/api-reference/authentication#endpoint-permissions) for the full mapping of API endpoints to permissions.

When a key attempts an operation it does not have permission for, the API returns `403 Forbidden`.
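
To avoid discovering a missing permission only through a `403`, a caller can compare the permissions an operation needs against what a key was granted. The sketch below hard-codes the SDK and Read-Only presets from the role presets table; `PRESETS` and `missingPermissions` are hypothetical names used for illustration, not part of any AgentMark SDK.

```typescript
// Permission lists copied from the role presets table above.
// PRESETS and missingPermissions are illustrative names, not an AgentMark API.
const PRESETS = {
  sdk: ["trace.write", "template.read", "score.write"],
  readOnly: [
    "trace.read", "span.read", "session.read", "score.read",
    "score_config.read", "dataset.read", "metrics.read", "deployment.read",
    "environment.read", "alert.read", "slack_integration.read", "app.read",
  ],
} as const;

// Returns the required permissions that are absent from the granted set.
function missingPermissions(
  granted: readonly string[],
  required: readonly string[]
): string[] {
  const have = new Set(granted);
  return required.filter((permission) => !have.has(permission));
}

// An SDK-preset key can ingest traces but lacks the permission to create alerts.
const missingForSdk = missingPermissions(PRESETS.sdk, ["trace.write", "alert.insert"]);
```

An empty result means the key should be allowed; any entries name the permissions to add when editing the key.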

### Creating a scoped key

1. Navigate to the app's **Settings → API keys** page in the Dashboard
2. Click **Create API key**
3. Enter a name for the key
4. Select a **role** or choose **Custom** to toggle individual permissions
5. Click **Create** — copy the key immediately, as it is only shown once

### Editing key permissions

You can change an existing key's permissions at any time:

1. Open the app's **Settings → API keys** page
2. Click the pencil icon next to the key you want to modify
3. Update the role or individual permissions
4. Save your changes — the key value stays the same

## SSO enforcement

Team and Enterprise organizations can enforce SAML SSO for all members. When SSO enforcement is enabled, members must authenticate through your identity provider — no password fallback is available.

See [Security](/deploy/security) for SSO configuration details.



# Webhooks
Source: https://docs.agentmark.co/deploy/webhooks

Receive alert notifications from AgentMark Cloud via webhooks

<Info>**Paid feature.** Webhook alert delivery requires a Growth, Team, or Enterprise plan — the `alerts_enabled` entitlement is off on the Free plan.</Info>

AgentMark webhooks deliver alert notifications to your application when cost, latency, error-rate, or evaluation-score thresholds are breached or resolved.

## How it works

When an alert fires or resolves, AgentMark Cloud sends an HTTP POST request to your configured webhook URL. The request body is the event payload and the `x-agentmark-signature-256` header is an HMAC-SHA256 signature of the request body (using your webhook secret), formatted as `sha256=<hex-digest>`. Your endpoint verifies the signature, processes the alert, and returns a 2xx status.
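
For readers who want to see what the signature check involves, here is a minimal sketch using Node's built-in `node:crypto` module. It assumes the digest is lowercase hex and is computed over the raw request body as UTF-8; `signBody` and `isValidSignature` are hypothetical helper names. In practice, `verifySignature` from `@agentmark-ai/shared-utils` (used in the setup below) is the supported path.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Builds the header value described above: "sha256=" + hex HMAC-SHA256 of the body.
// Assumption: the digest is lowercase hex over the UTF-8 bytes of the raw body.
function signBody(secret: string, rawBody: string): string {
  const digest = createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  return `sha256=${digest}`;
}

// Compares the received header with the expected value in constant time.
function isValidSignature(secret: string, header: string, rawBody: string): boolean {
  const expected = Buffer.from(signBody(secret, rawBody), "utf8");
  const received = Buffer.from(header, "utf8");
  // timingSafeEqual throws when lengths differ, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The comparison runs over the exact body string that was received, so any change to the body after signing, even a single added space, makes the check fail.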

## Setup

### 1. Get your webhook secret

AgentMark stores one webhook URL and one secret per app, and alerts reuse that configuration:

1. Open your app in the AgentMark Dashboard.
2. Navigate to **Settings → Integrations**.
3. Under **Webhook Url**, find:
   * **Webhook Url** — enter your production webhook endpoint URL.
   * **Secret Key** — used for signature verification.

<Warning>
  Keep your webhook secret secure. Use environment variables — never commit it to source control.
</Warning>

### 2. Install dependencies

```bash theme={null}
npm install @agentmark-ai/shared-utils
```

### 3. Create the webhook endpoint

Set up environment variables:

```env theme={null}
AGENTMARK_WEBHOOK_SECRET=your_webhook_secret
```

Create a POST endpoint that verifies signatures and handles alert events. This example uses Next.js App Router:

```typescript app/api/agentmark-alerts/route.ts theme={null}
import { NextRequest, NextResponse } from "next/server";
import { verifySignature } from "@agentmark-ai/shared-utils";

export const dynamic = "force-dynamic";

export async function POST(request: NextRequest) {
  // Read the raw body: the signature covers the exact bytes that were sent,
  // and re-serializing parsed JSON is not guaranteed to reproduce them.
  const rawBody = await request.text();
  const signature = request.headers.get("x-agentmark-signature-256");

  // 1. Verify signature
  if (
    !signature ||
    !(await verifySignature(
      process.env.AGENTMARK_WEBHOOK_SECRET!,
      signature,
      rawBody
    ))
  ) {
    return NextResponse.json(
      { message: "Invalid signature" },
      { status: 401 }
    );
  }

  try {
    const { event } = JSON.parse(rawBody);

    // 2. Handle alert events
    if (event.type === "alert") {
      const { alert, message, timestamp } = event.data;

      console.log(
        `Alert ${alert.status}: ${alert.type} — ${message} (${timestamp})`
      );

      // Route to your notification system (Slack, PagerDuty, email, etc.)
      // await sendSlackNotification(alert, message);

      return NextResponse.json({
        message: "Alert processed",
        alertId: alert.id,
        status: alert.status,
      });
    }

    return NextResponse.json(
      { message: `Unknown event type: ${event.type}` },
      { status: 400 }
    );
  } catch (error) {
    console.error("Webhook error:", error);
    return NextResponse.json(
      { message: "Internal server error" },
      { status: 500 }
    );
  }
}
```

### 4. Deploy and configure

<Steps>
  <Step title="Deploy your endpoint">
    Deploy your application to a publicly accessible URL (e.g., Vercel, Railway, AWS).
  </Step>

  <Step title="Add the webhook URL">
    In the AgentMark Dashboard, go to **Settings → Integrations** and enter your endpoint URL in the **Webhook Url** form (e.g., `https://your-app.vercel.app/api/agentmark-alerts`).
  </Step>

  <Step title="Set the webhook secret">
    Add the **Secret Key** from the same **Webhook Url** form to your deployment's environment variables as `AGENTMARK_WEBHOOK_SECRET`.
  </Step>

  <Step title="Enable webhook delivery on the alert">
    Webhook delivery is opt-in per alert. When you create or edit an alert in the Dashboard (under **Alerts**), toggle on **Custom Webhook** — it's off by default. Without this toggle, AgentMark will not POST to your webhook URL even if the app has one configured.
  </Step>

  <Step title="Test">
    Configure an alert in the Dashboard (e.g., cost threshold), trigger the condition, and verify your endpoint receives the event.
  </Step>
</Steps>

## Event format

```json theme={null}
{
  "event": {
    "type": "alert",
    "data": {
      "alert": {
        "id": "string",
        "currentValue": 0,
        "threshold": 0,
        "status": "triggered | resolved",
        "timeWindow": 60,
        "type": "cost | latency | error_rate | evaluation_score",
        "commitSha": "string (optional)",
        "evaluationName": "string (only when type is evaluation_score)",
        "evaluationAggregation": "avg | individual (only when type is evaluation_score)",
        "evaluationThresholdDirection": "above | below (only when type is evaluation_score)"
      },
      "message": "string",
      "timestamp": 1712764245000
    }
  }
}
```

* `timeWindow` is the measurement window in **minutes** (number).
* `timestamp` is a Unix epoch in **milliseconds** (number) — from `Date.now()`.
* `commitSha` is included inside `alert` when the app has a commit SHA on record at trigger time; otherwise it's omitted. On `resolved` events it is always omitted.
* The three `evaluation*` fields on `alert` are included only when `type` is `evaluation_score`.
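
The payload shape above can be written down as TypeScript types, which helps catch typos in field names at compile time. This is a sketch derived from the JSON example and the notes above, not an official type definition; `AlertPayload`, `AlertEvent`, and `isAlertEvent` are hypothetical names.

```typescript
type AlertType = "cost" | "latency" | "error_rate" | "evaluation_score";

interface AlertPayload {
  id: string;
  currentValue: number;
  threshold: number;
  status: "triggered" | "resolved";
  timeWindow: number; // measurement window in minutes
  type: AlertType;
  commitSha?: string; // omitted when no SHA is on record, and on resolved events
  evaluationName?: string; // the three evaluation* fields appear only for evaluation_score
  evaluationAggregation?: "avg" | "individual";
  evaluationThresholdDirection?: "above" | "below";
}

interface AlertEvent {
  type: "alert";
  data: { alert: AlertPayload; message: string; timestamp: number }; // timestamp in epoch ms
}

// Narrows an unknown parsed body to an alert event by checking a few key fields.
// It is a light structural check, not a full schema validation.
function isAlertEvent(value: unknown): value is { event: AlertEvent } {
  const event = (value as { event?: Partial<AlertEvent> } | null)?.event;
  const alert = event?.data?.alert;
  return (
    event?.type === "alert" &&
    typeof event?.data?.message === "string" &&
    typeof event?.data?.timestamp === "number" &&
    typeof alert?.id === "string" &&
    (alert?.status === "triggered" || alert?.status === "resolved")
  );
}
```

A handler can call `isAlertEvent(parsedBody)` before destructuring, and return a `400` when it is false.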

### Alert types

| Type               | Monitors        | Example threshold    |
| ------------------ | --------------- | -------------------- |
| `cost`             | API usage costs | Spending > \$50/day  |
| `latency`          | Response times  | P95 latency > 5000ms |
| `error_rate`       | Error frequency | Error rate > 5%      |
| `evaluation_score` | Score pipeline  | Score \< 0.8 average |

### Processing alerts

```typescript theme={null}
if (event.type === "alert") {
  const { alert, message, timestamp } = event.data;

  if (alert.status === "triggered") {
    switch (alert.type) {
      case "cost":
        console.log(`Cost alert: ${alert.currentValue} exceeded threshold ${alert.threshold}`);
        break;
      case "latency":
        console.log(`Latency alert: ${alert.currentValue}ms exceeded threshold ${alert.threshold}ms`);
        break;
      case "error_rate":
        console.log(`Error rate alert: ${alert.currentValue}% exceeded threshold ${alert.threshold}%`);
        break;
      case "evaluation_score":
        console.log(`Score alert: ${alert.currentValue} crossed threshold ${alert.threshold}`);
        break;
    }
  } else if (alert.status === "resolved") {
    console.log(`Alert ${alert.id} resolved at ${new Date(timestamp).toISOString()}`);
  }
}
```

## Integration examples

### Slack notifications

```typescript theme={null}
async function sendSlackNotification(alert: any, message: string) {
  const slackWebhookUrl = process.env.SLACK_WEBHOOK_URL;
  if (!slackWebhookUrl) return;

  await fetch(slackWebhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `AgentMark Alert: ${alert.type.toUpperCase()}`,
      blocks: [
        {
          type: "section",
          text: { type: "mrkdwn", text: message },
        },
        {
          type: "section",
          fields: [
            { type: "mrkdwn", text: `*Status:*\n${alert.status}` },
            { type: "mrkdwn", text: `*Current:*\n${alert.currentValue}` },
            { type: "mrkdwn", text: `*Threshold:*\n${alert.threshold}` },
            { type: "mrkdwn", text: `*Window:*\n${alert.timeWindow} min` },
          ],
        },
      ],
    }),
  });
}
```

### Email notifications

```typescript theme={null}
async function sendEmailAlert(alert: any, message: string) {
  // Use your preferred email service (Resend, SendGrid, AWS SES, etc.)
  const emailConfig = {
    from: process.env.ALERT_EMAIL_FROM,
    to: process.env.ALERT_EMAIL_TO,
    subject: `AgentMark Alert: ${alert.type} - ${alert.status}`,
    html: `
      <h2>AgentMark Alert</h2>
      <p>${message}</p>
      <table>
        <tr><td><strong>Alert ID:</strong></td><td>${alert.id}</td></tr>
        <tr><td><strong>Type:</strong></td><td>${alert.type}</td></tr>
        <tr><td><strong>Status:</strong></td><td>${alert.status}</td></tr>
        <tr><td><strong>Current Value:</strong></td><td>${alert.currentValue}</td></tr>
        <tr><td><strong>Threshold:</strong></td><td>${alert.threshold}</td></tr>
      </table>
    `,
  };

  // await sendEmail(emailConfig);
}
```

## Security best practices

1. **Always verify signatures** -- reject requests with missing or invalid `x-agentmark-signature-256` headers.
2. **Use HTTPS** -- your production endpoint must use HTTPS.
3. **Store secrets in environment variables** -- never hardcode credentials.
4. **Return proper status codes** -- `401` for auth failures, `400` for bad requests, `500` for server errors.
5. **Respond quickly** -- return a `200` status promptly, then process the alert asynchronously if needed.
6. **Route by type** -- send cost alerts to finance channels, latency alerts to engineering, etc.
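
Routing by type can be as simple as a lookup table. The sketch below is one possible shape; the channel names are placeholders and `routeFor` is a hypothetical helper, so substitute your own destinations.

```typescript
type AlertType = "cost" | "latency" | "error_rate" | "evaluation_score";

// Placeholder destinations; replace with your own channels or on-call targets.
const ROUTES: Record<AlertType, string> = {
  cost: "#finance-alerts",
  latency: "#eng-oncall",
  error_rate: "#eng-oncall",
  evaluation_score: "#ml-quality",
};

// Looks up the destination for an alert type.
function routeFor(type: AlertType): string {
  return ROUTES[type];
}
```

Because `ROUTES` is typed as `Record<AlertType, string>`, the compiler flags any alert type that has no destination.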



# Human annotation
Source: https://docs.agentmark.co/evaluate/annotations

Score traces with human reviewers — individually or in structured batch queues

<Info>**Cloud feature.** Annotations are available in the [AgentMark Dashboard](https://app.agentmark.co).</Info>

Human annotation adds manual scores, labels, and feedback to your traces. Use it to evaluate subjective quality, flag edge cases, curate training datasets, and calibrate your automated evals.

AgentMark supports two annotation workflows:

* **Inline annotation** — score a single trace directly from the trace drawer
* **Annotation queues** — batch traces into structured review queues with assignment, progress tracking, and multi-reviewer support

<img alt="Animated walkthrough of the annotation queue review flow: queue list, detail view, and review panel" />

The animation shows a reviewer moving through a queue: list view → queue detail with assigned items → side-by-side review panel (trace on the left, score controls on the right) → next item.

## When to use human annotation

| Use case                      | Example                                                       | Workflow                                                              |
| ----------------------------- | ------------------------------------------------------------- | --------------------------------------------------------------------- |
| **Quality audits**            | Review a sample of production traces for correctness and tone | Create a queue, add traces, assign to domain experts                  |
| **Edge case triage**          | Flag and investigate unexpected model behavior                | Inline annotation from the trace drawer                               |
| **Dataset curation**          | Build high-quality test datasets from real production data    | Review in queue, save passing traces to a dataset                     |
| **Calibrate automated evals** | Align your LLM-as-judge scorers with human judgment           | Score the same traces manually that your evals score, compare results |
| **Multi-reviewer consensus**  | Get independent assessments from multiple team members        | Set reviewers required > 1 on a queue                                 |

## Score types

Score configs define what reviewers score on. They are declared as JSON in your `agentmark.json` under the top-level `scores` field and synced to AgentMark Cloud through the [deployment pipeline](/deploy/deployment). When creating a queue, you select which score configs to include.

<Note>
  Score configs must be synced to AgentMark Cloud before you can create a queue. Push your changes to the connected branch so the deployment pipeline picks up your `agentmark.json` scores. Once synced, score configs are always available in the Dashboard — no worker dependency required. See [Project configuration](/configure/project-config) for the full `scores` schema and [Evaluations](/evaluate/writing-evals) for adding automated eval functions.
</Note>

<Tabs>
  <Tab title="Boolean (pass/fail)">
    A binary judgment. The reviewer clicks **Pass** or **Fail**.

    ```json agentmark.json theme={null}
    {
      "scores": {
        "factual_accuracy": {
          "type": "boolean",
          "description": "Was the response factually correct?"
        }
      }
    }
    ```

    Saved as score `1` (pass) or `0` (fail). Best for clear-cut criteria.
  </Tab>

  <Tab title="Numeric (scale)">
    A number within a configurable range. The reviewer enters a value.

    ```json agentmark.json theme={null}
    {
      "scores": {
        "helpfulness": {
          "type": "numeric",
          "min": 1,
          "max": 5,
          "description": "Rate helpfulness on a 1-5 scale"
        }
      }
    }
    ```

    Best for graded assessments.
  </Tab>

  <Tab title="Categorical (labels)">
    A dropdown of predefined options. The reviewer picks one.

    ```json agentmark.json theme={null}
    {
      "scores": {
        "tone": {
          "type": "categorical",
          "description": "Response tone",
          "categories": [
            { "label": "professional", "value": 1 },
            { "label": "casual", "value": 0.5 },
            { "label": "inappropriate", "value": 0 }
          ]
        }
      }
    }
    ```

    Each category is a `{label, value}` pair. The `label` is shown in the reviewer dropdown, and the `value` is the numeric score recorded when selected.

    Best for classification.
  </Tab>
</Tabs>

Every score type includes an optional **reason** field where the reviewer can explain their judgment.
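
To make the mapping from reviewer input to recorded score concrete, here is a small sketch of the three config shapes and a conversion function. It follows the descriptions above (pass/fail saved as `1`/`0`, categorical labels mapped to their `value`) and assumes that numeric bounds are inclusive; `ScoreConfig` and `toScore` are hypothetical names, not AgentMark types.

```typescript
type ScoreConfig =
  | { type: "boolean"; description?: string }
  | { type: "numeric"; min: number; max: number; description?: string }
  | {
      type: "categorical";
      description?: string;
      categories: { label: string; value: number }[];
    };

// Converts a reviewer's input into the numeric score that would be recorded.
function toScore(config: ScoreConfig, input: boolean | number | string): number {
  switch (config.type) {
    case "boolean":
      if (typeof input !== "boolean") throw new Error("expected pass or fail");
      return input ? 1 : 0; // pass is saved as 1, fail as 0
    case "numeric":
      // Assumption: min and max are inclusive bounds.
      if (typeof input !== "number" || input < config.min || input > config.max) {
        throw new Error(`expected a number between ${config.min} and ${config.max}`);
      }
      return input;
    case "categorical": {
      // The label is what the reviewer picks; the value is what gets recorded.
      const match = config.categories.find((category) => category.label === input);
      if (!match) throw new Error(`unknown label: ${String(input)}`);
      return match.value;
    }
  }
}
```

With the `tone` config from the categorical tab, picking `casual` would record `0.5`.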

## Inline annotation

Add a score to any trace directly from the trace drawer — no queue required.

<img alt="Evaluations tab in the trace drawer showing inline annotation scores alongside automated eval results" />

The Evaluations tab lists every score attached to the selected span — both automated eval results and human annotations. Each row shows the score name, label/value, and reason; annotations carry an **annotation** badge to distinguish them from automated scores.

<Steps>
  <Step title="Open a trace">
    Navigate to **Traces** and click on any trace to open the detail drawer.
  </Step>

  <Step title="Select a span">
    Choose the span you want to annotate from the trace tree.
  </Step>

  <Step title="Go to the evaluations tab">
    Click the **Evaluations** tab in the drawer.
  </Step>

  <Step title="Add annotation">
    Click **Add annotation**, fill in the name, label, score, and reason, then click **Save**.
  </Step>
</Steps>

Inline annotations appear alongside automated eval scores, distinguished by an "annotation" badge.

## Annotation queues

For batch review, use annotation queues. Queues let you organize items, assign reviewers, track progress, and require multiple independent reviews.

<img alt="Review queues list showing active queues with progress bars and pending badge in sidebar" />

The Review Queues page lists every queue in the app with its name, status, progress, and creation time. Filter tabs narrow by status (All, Active, Completed, Archived), and an **Assigned to me** toggle restricts the list to queues with items assigned to you. A sidebar badge surfaces the total number of pending items across all active queues.

### Create a queue

Navigate to **Review Queues** in the sidebar and click **Create Queue**.

<img alt="Create review queue dialog with name, instructions, reviewers required, and score config fields" />

The **Create Review Queue** dialog takes a name, optional description and annotator instructions, the number of independent reviews required per item, and the set of score configs to show to reviewers. A default dataset can be selected to pre-fill the "Save to dataset" action during review.

| Field                           | Required | Description                                                                            |
| ------------------------------- | -------- | -------------------------------------------------------------------------------------- |
| **Name**                        | Yes      | Descriptive name for the review batch                                                  |
| **Description**                 | No       | Context for what this queue covers                                                     |
| **Instructions for annotators** | No       | Guidance shown during review (e.g., "Mark PASS if factually correct and professional") |
| **Reviewers required**          | Yes      | Independent reviews needed per item (default: 1)                                       |
| **Score configs**               | Yes      | Which scoring dimensions to show during review                                         |
| **Default dataset**             | No       | Pre-selects a dataset for the "Save to dataset" action                                 |

### Add items

<Tabs>
  <Tab title="Bulk from traces">
    <Steps>
      <Step title="Select traces">
        Go to the **Traces** page and select traces using the checkboxes.
      </Step>

      <Step title="Add to queue">
        Click **Add to Queue** in the bulk actions bar, choose a queue, and confirm.
      </Step>
    </Steps>
  </Tab>

  <Tab title="Individual spans">
    When viewing a trace in the drawer, click **Add to Queue** in the action bar. This adds the selected span — not the whole trace — to the queue.
  </Tab>

  <Tab title="From experiments">
    On the experiment detail page, select items and add them to a queue for review.
  </Tab>
</Tabs>

### Queue detail

Click any queue to see its items, progress, and reviewer assignments.

<img alt="Queue detail view showing items table with status, type, assignment, filter tabs, and multi-reviewer badge" />

The queue detail view lists every item with status (pending, completed, skipped), resource type (trace, span, or session), and assignee. Filter tabs across the top narrow by status, and a multi-reviewer badge shows how many independent reviews remain per item.

Filter items using the tabs:

| Tab                | Shows                     |
| ------------------ | ------------------------- |
| **All**            | Every item in the queue   |
| **Pending**        | Items waiting for review  |
| **Completed**      | Reviewed or skipped items |
| **Assigned to me** | Items assigned to you     |

Click the assign icon on any row to assign it to a team member. Use the three-dot menu to archive a queue when review is complete.

### Review workflow

Click **Start Review** to begin. The review view splits into two panels.

**Left panel — trace content:**

* Metadata bar with trace name, latency, cost, tokens, and model
* Root span input/output formatted as JSON
* Expandable spans tree — click any span to see its I/O
* For session items, a conversation timeline showing all turns

**Right panel — annotation:**

* Annotator instructions (collapsible, from queue config)
* Score controls for each configured dimension
* Prior annotations on this resource (read-only)
* Save to dataset section with auto-extracted I/O

| Action              | Shortcut | What it does                        |
| ------------------- | -------- | ----------------------------------- |
| **Complete + Next** | `Enter`  | Save scores, mark complete, advance |
| **Skip**            | —        | Mark as skipped, advance            |
| **Back**            | —        | Return to queue detail              |

<Note>
  Dataset items added through the **Save to dataset** section are staged on the queue while review is in progress. They are committed to the target dataset in a single batch when the queue is marked completed, so saved items will not appear in the dataset until queue completion. This keeps the dataset clean if a review is paused, abandoned, or reverted.
</Note>

### Multi-reviewer

When **reviewers required** is set above 1, each reviewer annotates independently:

* The review header shows a progress badge (e.g., "0/2 reviewed") tracking how many reviewers have completed their assessment
* Each reviewer sees their own fresh annotation form — they don't see other reviewers' scores while annotating
* An item is only marked **complete** when the required number of independent reviews is reached
* The `/next` endpoint automatically skips items the current reviewer has already reviewed, so each reviewer only sees items they haven't scored yet

<Tip>
  Use multi-reviewer for high-stakes evaluations like safety reviews or fine-tuning dataset curation where a single reviewer's judgment isn't sufficient.
</Tip>
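
The completion and skip rules above can be modeled in a few lines. This is a simplified in-memory sketch for intuition only, not the server implementation: it ignores assignment and skipped items, and `QueueItem`, `isComplete`, and `nextItemFor` are hypothetical names.

```typescript
interface QueueItem {
  id: string;
  reviewedBy: string[]; // IDs of reviewers who have submitted a review
}

// An item is complete once enough distinct reviewers have reviewed it.
function isComplete(item: QueueItem, reviewersRequired: number): boolean {
  return new Set(item.reviewedBy).size >= reviewersRequired;
}

// Returns the first item that still needs reviews and that this reviewer
// has not already scored, or undefined when nothing is left for them.
function nextItemFor(
  items: QueueItem[],
  reviewerId: string,
  reviewersRequired: number
): QueueItem | undefined {
  return items.find(
    (item) =>
      !isComplete(item, reviewersRequired) && !item.reviewedBy.includes(reviewerId)
  );
}
```

With two reviewers required, an item reviewed once stays pending but is hidden from the reviewer who already scored it.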

### Resource types

Queues support three item types:

| Type        | When to use                                 | What the reviewer sees                  |
| ----------- | ------------------------------------------- | --------------------------------------- |
| **Trace**   | Review a complete request                   | Full trace with expandable per-span I/O |
| **Span**    | Review a single LLM call or tool invocation | Individual span content                 |
| **Session** | Review a multi-turn conversation            | Conversation timeline across traces     |

## Programmatic queue management

Annotation queues are fully exposed on the public REST API at `/v1/annotation-queues` (Cloud only; the local dev server returns `404`). CI pipelines can create queues, enqueue traces, and, via the `/reviews` endpoint, submit annotations through the same path a human reviewer uses in the Dashboard.

| Method                     | Path                                                     | Purpose                                    |
| -------------------------- | -------------------------------------------------------- | ------------------------------------------ |
| `GET` · `POST`             | `/v1/annotation-queues`                                  | List / create queues                       |
| `GET` · `PATCH` · `DELETE` | `/v1/annotation-queues/{queueId}`                        | Read / update / delete a queue             |
| `GET` · `POST`             | `/v1/annotation-queues/{queueId}/items`                  | List or add traces, spans, or sessions     |
| `GET` · `PATCH` · `DELETE` | `/v1/annotation-queues/{queueId}/items/{itemId}`         | Read, update, or remove an item            |
| `POST`                     | `/v1/annotation-queues/{queueId}/items/{itemId}/reviews` | Submit a review (LLM-as-judge entry point) |

The review-submission endpoint is the one that makes this more than just queue CRUD: posting `{ "status": "completed" }` records the authenticated user as a reviewer, and when the queue's `reviewers_required` threshold is met the item auto-advances to `completed`. That lets an LLM-as-judge pipeline submit annotations that count toward the same threshold as human reviewers.
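
As a rough illustration of the review-submission call described above, the following TypeScript sketch builds and sends the request. It assumes the Cloud base URL and the `Authorization` / `X-Agentmark-App-Id` headers used by the other REST examples in these docs also apply here; `buildReviewRequest` and `submitReview` are hypothetical helper names, and nothing is assumed about the response body.

```typescript
// Assumption: same Cloud base URL and headers as the datasets REST examples.
const API_BASE = "https://api.agentmark.co";

interface ReviewRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the POST request for the /reviews endpoint.
function buildReviewRequest(queueId: string, itemId: string, apiKey: string, appId: string): ReviewRequest {
  const path =
    `/v1/annotation-queues/${encodeURIComponent(queueId)}` +
    `/items/${encodeURIComponent(itemId)}/reviews`;
  return {
    url: `${API_BASE}${path}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "X-Agentmark-App-Id": appId,
        "Content-Type": "application/json",
      },
      // Records the authenticated caller as a reviewer on this item.
      body: JSON.stringify({ status: "completed" }),
    },
  };
}

// Hypothetical helper: sends the review and returns the HTTP status code.
async function submitReview(queueId: string, itemId: string, apiKey: string, appId: string): Promise<number> {
  const { url, init } = buildReviewRequest(queueId, itemId, apiKey, appId);
  const response = await fetch(url, init);
  return response.status;
}
```

Keeping request construction separate from the network call makes the URL and payload easy to check in CI without touching the API.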

API keys need the `annotation_queue.read`, `annotation_queue.write`, `annotation_queue.delete`, and `annotation_queue.review` permissions (split so CI pipelines can be granted `review` without queue-CRUD access). See the [API reference](/api-reference/overview) for full endpoint schemas.

## End-to-end example: dataset curation

A common workflow is using annotation queues to curate high-quality datasets from production traces.

<Steps>
  <Step title="Create a queue">
    Create a queue with a boolean score config (e.g., `dataset_quality`) and set the **default dataset** to your target dataset.
  </Step>

  <Step title="Add production traces">
    Go to **Traces**, filter to interesting traces (errors, low automated scores, specific prompts), select them, and add to the queue.
  </Step>

  <Step title="Review and score">
    Click **Start Review**. For each trace, read the I/O, mark Pass or Fail, and optionally edit the input/output before saving to the dataset.
  </Step>

  <Step title="Save to dataset">
    Expand the **Save to dataset** section, verify the auto-extracted fields, and click **Save**. The default dataset is pre-selected. Saved items are staged on the queue and remain pending until the queue is completed.
  </Step>

  <Step title="Complete the queue">
    Once every item has been reviewed, mark the queue as completed. All staged dataset items are committed to the target dataset in a single batch at this point.
  </Step>

  <Step title="Use in experiments">
    Run experiments against the curated dataset to validate prompt changes against human-verified examples.
  </Step>
</Steps>

## Human annotation vs automated evals

Use both together. They serve different purposes.

|                | Human annotation                       | Automated evals                        |
| -------------- | -------------------------------------- | -------------------------------------- |
| **Created by** | Team members in the Dashboard          | Eval functions during experiments      |
| **Best for**   | Subjective quality, edge cases, nuance | Regression testing, scale, consistency |
| **Scale**      | Tens to hundreds of items              | Entire datasets                        |
| **When**       | Anytime, on any trace                  | During experiment runs                 |

Automated evals catch regressions at scale. Human annotations handle the cases machines can't judge — and provide the ground truth to calibrate your automated scorers against.

<Tip>
  Score the same set of traces with both human reviewers and your LLM-as-judge eval. Compare the results to identify where your automated scorer disagrees with human judgment, then tune your eval prompt accordingly.
</Tip>

## Related

<CardGroup>
  <Card title="Evaluations" icon="gauge" href="/evaluate/writing-evals">
    Automate scoring with eval functions
  </Card>

  <Card title="Datasets" icon="database" href="/evaluate/datasets">
    Create and manage test datasets
  </Card>

  <Card title="Experiments" icon="flask" href="/evaluate/running-experiments">
    Run prompts against datasets to validate quality
  </Card>

  <Card title="Traces" icon="route" href="/observe/traces-and-logs">
    View and explore trace data
  </Card>
</CardGroup>

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Datasets
Source: https://docs.agentmark.co/evaluate/datasets

Create test datasets for your prompts

Datasets are JSONL files containing test cases to validate prompt behavior. Each line has an `input` (required) and an optional `expected_output`. The same files power both Cloud and Local — in Cloud they sync to the Dashboard through the deployment pipeline, and in Local you run them directly from the CLI.

<Tabs>
  <Tab title="Cloud">
    ## Datasets in the Dashboard

    Datasets live as JSONL files in your repo. The git deployment pipeline syncs them to AgentMark Cloud, where you select them when you create experiments and when you configure review queues.

    <img alt="New Experiment dialog in the AgentMark Dashboard showing the dataset selector" />

    The New Experiment dialog includes a **Dataset** field listing the datasets synced to your app. When you select a prompt, the dataset auto-fills from its `test_settings` frontmatter.

    ### How datasets reach Cloud

    <Steps>
      <Step title="Add a JSONL file to your repo">
        Create the dataset alongside your prompts, for example `agentmark/datasets/sentiment.jsonl`.
      </Step>

      <Step title="Reference it from a prompt">
        Set `test_settings.dataset` in the prompt frontmatter so the dialog can auto-fill it.
      </Step>

      <Step title="Deploy to sync">
        Push to your connected branch. The deployment pipeline syncs the dataset to AgentMark Cloud, where it appears in the dataset selector.
      </Step>
    </Steps>

    The dataset structure is identical to Local — see the **Local** tab for the JSONL schema, what to test, sizing guidance, held-out sets, and statistical significance.

    ### Where dataset rows appear

    * **Experiment detail** — each dataset row's input and `expected_output` are shown next to the actual AI output and evaluator scores. See [Running experiments](/evaluate/running-experiments).
    * **Review queues** — set a default dataset on a queue so the "Save to dataset" action is pre-filled during annotation review.

    ### Appending rows

    Rows are appended to a synced dataset in two ways:

    * **Save to dataset** — during annotation review, save a reviewed trace's input and output to the queue's default dataset. Saved items are staged and committed when the queue is marked completed. See [Human annotation](/evaluate/annotations).
    * **REST API** — POST a row to `/v1/datasets/{datasetName}/rows`. See [Programmatic access](#programmatic-access) in the **Local** tab for the request shape (the same endpoint serves Cloud and Local).
  </Tab>

  <Tab title="Local">
    <img alt="Dataset JSONL file editor" />

    The dataset editor shows each JSONL row on its own line, with syntax highlighting for the `input` and `expected_output` fields. Add rows inline or upload a `.jsonl` file from disk.

    ## Quick start

    **1. Create a dataset file** (`agentmark/datasets/sentiment.jsonl`):

    ```jsonl theme={null}
    {"input": {"text": "I love this!"}, "expected_output": "positive"}
    {"input": {"text": "Terrible product"}, "expected_output": "negative"}
    {"input": {"text": ""}}
    ```

    **2. Link to your prompt** (frontmatter):

    ```mdx theme={null}
    ---
    name: sentiment-classifier
    test_settings:
      dataset: ./datasets/sentiment.jsonl
    ---

    <System>
    Classify the sentiment
    </System>
    <User>{props.text}</User>
    ```

    **3. Run experiments**:

    ```bash theme={null}
    npx agentmark run-experiment agentmark/sentiment.prompt.mdx
    ```

    ## Dataset structure

    Each line must be valid JSON:

    * **`input`** (required) - Props passed to your prompt
    * **`expected_output`** (optional) - Expected result for evaluation

    **With expected output** (enables evaluations):

    ```jsonl theme={null}
    {"input": {"text": "Great!", "category": "electronics"}, "expected_output": "positive"}
    ```

    **Without expected output** (output-only mode):

    ```jsonl theme={null}
    {"input": {"text": "Great!", "category": "electronics"}}
    ```
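
    The structure rules above can be checked before a dataset is committed. The sketch below is one possible validator, not part of AgentMark: `parseDataset` and `DatasetRow` are hypothetical names, and it assumes `input` is always a JSON object and that blank lines (such as a trailing newline) are acceptable.

    ```typescript
    // Hypothetical row shape based on the two documented fields.
    interface DatasetRow {
      input: Record<string, unknown>;
      expected_output?: unknown;
    }

    // Parses JSONL text and throws on the first line that breaks the rules.
    function parseDataset(jsonl: string): DatasetRow[] {
      const rows: DatasetRow[] = [];
      jsonl.split("\n").forEach((line, index) => {
        if (line.trim() === "") return; // assumption: blank lines are skipped
        let parsed: unknown;
        try {
          parsed = JSON.parse(line);
        } catch {
          throw new Error(`Line ${index + 1}: invalid JSON`);
        }
        const row = parsed as Partial<DatasetRow> | null;
        if (typeof row !== "object" || row === null || typeof row.input !== "object" || row.input === null) {
          throw new Error(`Line ${index + 1}: missing required "input" object`);
        }
        rows.push(row as DatasetRow);
      });
      return rows;
    }
    ```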

    ## What to test

    **Common cases**:

    ```jsonl theme={null}
    {"input": {"query": "What is AI?"}, "expected_output": "explanation"}
    {"input": {"query": "Explain ML"}, "expected_output": "explanation"}
    ```

    **Edge cases**:

    ```jsonl theme={null}
    {"input": {"text": ""}, "expected_output": "error"}
    {"input": {"text": "a"}, "expected_output": "too_short"}
    {"input": {"text": "Lorem ipsum... [5000 chars]"}, "expected_output": "truncated"}
    ```

    **Failure modes**:

    ```jsonl theme={null}
    {"input": {"email": "invalid-email"}, "expected_output": "error: invalid email"}
    {"input": {"amount": -100}, "expected_output": "error: amount must be positive"}
    ```

    **Real-world data** - Use anonymized production data when possible.

    <Tip>
      **LLM-assisted generation** - Use LLMs to generate test cases, but have humans verify outputs before using them.
    </Tip>

    ## Expected-output types

    **Strings** (classification):

    ```jsonl theme={null}
    {"input": {"text": "sunny day"}, "expected_output": "positive"}
    ```

    **Objects** (structured data):

    ```jsonl theme={null}
    {"input": {"text": "John, john@example.com"}, "expected_output": {"name": "John", "email": "john@example.com"}}
    ```

    **Flexible** (patterns, not exact matches):

    ```jsonl theme={null}
    {"input": {"topic": "AI"}, "expected_output": "explanation containing: artificial intelligence"}
    ```

    Your evaluation function validates flexible expectations.
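
    As a minimal sketch of such an evaluation function, the example below handles the `containing:` pattern shown above. The argument and return shapes follow the eval example in the Evaluate overview; the `containing:` parsing is an illustrative convention rather than a built-in feature, and `containsExpectedPhrase` is a hypothetical name.

    ```typescript
    // Marker that separates the description from the phrase to look for.
    const CONTAINS_MARKER = "containing:";

    // Hypothetical eval: passes when the output contains the expected phrase.
    const containsExpectedPhrase = async ({
      output,
      expectedOutput,
    }: {
      output: string;
      expectedOutput: string;
    }) => {
      const markerIndex = expectedOutput.indexOf(CONTAINS_MARKER);
      if (markerIndex === -1) {
        // No marker: fall back to a case-insensitive exact match.
        const exact = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
        return { passed: exact, score: exact ? 1 : 0 };
      }
      const phrase = expectedOutput
        .slice(markerIndex + CONTAINS_MARKER.length)
        .trim()
        .toLowerCase();
      const found = output.toLowerCase().includes(phrase);
      return { passed: found, score: found ? 1 : 0 };
    };
    ```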

    ## Dataset size

    **Start small** (10-20 cases):

    * 5-7 common scenarios
    * 3-5 edge cases
    * 2-3 failure modes

    **Scale based on needs**:

    * **Initial development**: 50-100 cases (recommended by [Confident AI](https://www.confident-ai.com/blog/evaluating-llm-systems-metrics-benchmarks-and-best-practices))
    * **Statistical significance**: \~250 cases (for 95% confidence, 5% margin of error)
    * **Production systems**: 100-300 cases minimum
    * **High-stakes applications**: 300+ cases

    Quality matters more than quantity. Start with 50-100 high-quality cases, then grow based on statistical power analysis and real-world findings.

    ## Best practices

    * One test case per line (valid JSONL)
    * Use descriptive inputs that clearly show what's being validated
    * Version control datasets alongside prompts
    * Avoid duplicates - each case should validate something unique
    * Always anonymize data (never leak sensitive information)

    ## Advanced: held-out test sets

    Create separate datasets to avoid overfitting:

    ```
    datasets/
    ├── development.jsonl       # Use during iteration (60-70%)
    ├── validation.jsonl        # Check progress periodically (15-20%)
    └── held-out.jsonl          # Final test before production (15-20%)
    ```
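
    One way to produce these three files is a seeded shuffle followed by a fixed split. The sketch below is an assumption-laden illustration: `splitDataset` and `mulberry32` are hypothetical helpers, and the 70/15/15 ratio is just one choice within the ranges shown above.

    ```typescript
    // Small seeded PRNG (mulberry32-style) so the split is reproducible.
    function mulberry32(seed: number): () => number {
      let a = seed >>> 0;
      return () => {
        a = (a + 0x6d2b79f5) >>> 0;
        let t = a;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
      };
    }

    // Shuffles a copy of the rows, then cuts it into 70% / 15% / remainder.
    function splitDataset<T>(rows: T[], seed = 42) {
      const shuffled = [...rows];
      const random = mulberry32(seed);
      // Fisher-Yates shuffle driven by the seeded generator.
      for (let i = shuffled.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
      }
      const devEnd = Math.round(shuffled.length * 0.7);
      const valEnd = devEnd + Math.round(shuffled.length * 0.15);
      return {
        development: shuffled.slice(0, devEnd),
        validation: shuffled.slice(devEnd, valEnd),
        heldOut: shuffled.slice(valEnd),
      };
    }
    ```

    Fixing the seed keeps the held-out rows stable between runs, which matters because any row that leaks into the development set stops being a fair final test.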

    **Critical rules**:

    * Never iterate on held-out data
    * Don't peek at held-out results during development
    * If you look at held-out results and make changes, create a new held-out set

    **Example workflow**:

    ```
    Week 1-2: Iterate on development set
      ├─ Test prompt v1 → 75% pass rate
      └─ Test prompt v2 → 82% pass rate

    Week 3: Check validation set
      └─ Test prompt v2 → 79% pass rate (close to dev, good sign!)

    Before deploy: Test held-out set
      └─ Test prompt v2 → 81% pass rate → Deploy if it meets requirements
    ```

    ## Advanced: statistical significance

    **Sample size requirements** ([source](https://www.confident-ai.com/blog/evaluating-llm-systems-metrics-benchmarks-and-best-practices)):

    * **Quick iteration**: 10-20 cases (directional feedback only)
    * **Initial development**: 50-100 cases (industry standard)
    * **Statistical rigor**: \~250 cases (95% confidence, 5% margin of error)
    * **Production deployment**: 100-300 cases minimum
    * **High-stakes systems**: 300+ cases

    **Why size matters**: With 10 cases, one failure moves the pass rate by 10 percentage points. With 100 cases, one failure moves it by 1 point. Research shows datasets with N ≤ 300 often overestimate performance.

    **Confidence intervals** - Report uncertainty:

    ```
    Pass rate: 85% (85 passed out of 100 tests)
    Standard error: √(0.85 × 0.15 / 100) = 0.036
    95% confidence interval: 85% ± 7% → [78%, 92%]
    ```

    ✅ "Pass rate: 85% \[CI: 78%-92%]"
    ❌ "Pass rate: 85%"
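
    The interval above can be computed with a few lines of code. This is a minimal sketch using the same normal approximation as the worked example; `passRateInterval` is a hypothetical helper name, and for very small samples or pass rates near 0% or 100% a different interval method may be more appropriate.

    ```typescript
    // Normal-approximation confidence interval for a pass rate.
    // z = 1.96 corresponds to 95% confidence.
    function passRateInterval(passed: number, total: number, z = 1.96) {
      const rate = passed / total;
      const standardError = Math.sqrt((rate * (1 - rate)) / total);
      const margin = z * standardError;
      return {
        rate,
        standardError,
        lower: Math.max(0, rate - margin), // clamp to the valid range
        upper: Math.min(1, rate + margin),
      };
    }

    const ci = passRateInterval(85, 100);
    const pct = (x: number) => `${(x * 100).toFixed(0)}%`;
    console.log(`Pass rate: ${pct(ci.rate)} [CI: ${pct(ci.lower)}-${pct(ci.upper)}]`);
    // prints "Pass rate: 85% [CI: 78%-92%]"
    ```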

    **Comparing prompts** - Use paired comparisons on same dataset:

    ```typescript theme={null}
    // For each test case, record if new prompt performed better
    const improvements = testCases.map(tc => {
      const oldPassed = evaluateOld(tc);
      const newPassed = evaluateNew(tc);
      return newPassed && !oldPassed ? 1 : (oldPassed && !newPassed ? -1 : 0);
    });

    const netImprovement = improvements.reduce((a, b) => a + b, 0);
    // netImprovement > 10 with 100 cases suggests real improvement
    ```
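
    To judge whether a net improvement is more than noise, one option is an exact sign test on the cases where the two prompts disagree. This technique is not taken from the text above; it is a sketch of a standard paired test, and `signTestPValue` is a hypothetical helper name.

    ```typescript
    // Two-sided exact sign test on discordant pairs.
    // wins   = cases the new prompt passed and the old prompt failed
    // losses = cases the old prompt passed and the new prompt failed
    // Ties (both passed or both failed) carry no information and are ignored.
    function signTestPValue(wins: number, losses: number): number {
      const n = wins + losses;
      if (n === 0) return 1;
      const k = Math.min(wins, losses);
      // Sum C(n, 0) + ... + C(n, k), the smaller tail of Binomial(n, 0.5).
      let coefficient = 1;
      let tail = 0;
      for (let i = 0; i <= k; i++) {
        tail += coefficient;
        coefficient = (coefficient * (n - i)) / (i + 1);
      }
      // Double the tail for a two-sided test; cap at 1.
      return Math.min(1, (2 * tail) / Math.pow(2, n));
    }

    // 15 wins against 5 losses is a net improvement of 10.
    console.log(signTestPValue(15, 5).toFixed(3)); // prints "0.041"
    ```

    The same net improvement is weaker evidence when there are more disagreements overall, so it helps to look at wins and losses separately rather than only at their difference. This sketch is intended for up to a few hundred discordant pairs; much larger counts would overflow `Math.pow(2, n)`.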

    **Power analysis** - Determine how many samples you need before creating your dataset.

    Power analysis answers: "How many test cases do I need to reliably detect a meaningful improvement?"

    **Key parameters**:

    * **Effect size**: Minimum improvement you want to detect (e.g., 5% better pass rate)
    * **Significance level (α)**: Probability of false positive (typically 0.05 = 5%)
    * **Statistical power (1-β)**: Probability of detecting real improvement (typically 0.80 = 80%)

    **Formula for binary outcomes** (pass/fail):

    ```
    # Simplified formula for comparing two proportions (n is per group)
    n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / (effect_size)²

    # Example: detect a 5-point improvement with 80% power at α = 0.05
    # Assuming baseline pass rate p = 0.80
    n ≈ (1.96 + 0.84)² × 2(0.80)(0.20) / (0.05)²
    n ≈ 7.84 × 0.32 / 0.0025
    n ≈ 1,004 test cases per group
    ```

    **Practical rules of thumb** (from the simplified formula above, with a baseline pass rate of 80%, α = 0.05, and 80% power):

    | Minimum detectable difference | Required sample size (per group) |
    | ----------------------------- | -------------------------------- |
    | 10% (e.g., 80% → 90%)         | \~250 samples                    |
    | 5% (e.g., 80% → 85%)          | \~1,000 samples                  |
    | 2% (e.g., 80% → 82%)          | \~6,300 samples                  |
    | 1% (e.g., 80% → 81%)          | \~25,000 samples                 |

    **Why this matters**:

    If you only have 50 test cases, you can only reliably detect large improvements (roughly 20 percentage points or more). Smaller improvements will look like noise. Plan your dataset size based on the smallest improvement that matters to your application.

    **Practical approach**:

    ```typescript theme={null}
    // 1. Define minimum improvement you care about
    const minImprovement = 0.05; // 5% better pass rate

    // 2. Calculate required sample size
    const alpha = 0.05;  // 5% false positive rate
    const power = 0.80;  // 80% chance to detect real improvement
    const baselineRate = 0.80; // Current pass rate

    const n = calculateSampleSize(alpha, power, baselineRate, minImprovement);
    console.log(`Need ${n} test cases to detect ${minImprovement * 100}% improvement`);

    // 3. Collect that many test cases before running experiments
    ```
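
    `calculateSampleSize` is not defined in the snippet above. A minimal sketch of one possible implementation follows, applying the simplified two-proportion formula from this section. The helper names `normalCdf` and `normalQuantile` are hypothetical, and the normal quantile is found numerically rather than taken from a statistics library.

    ```typescript
    // Standard normal CDF using the Abramowitz-Stegun erf approximation (7.1.26).
    function normalCdf(z: number): number {
      const x = Math.abs(z) / Math.SQRT2;
      const t = 1 / (1 + 0.3275911 * x);
      const poly =
        t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
      const erf = 1 - poly * Math.exp(-x * x);
      return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
    }

    // Inverse of normalCdf by bisection; adequate for the usual alpha/power values.
    function normalQuantile(p: number): number {
      let lo = -10;
      let hi = 10;
      for (let i = 0; i < 100; i++) {
        const mid = (lo + hi) / 2;
        if (normalCdf(mid) < p) lo = mid;
        else hi = mid;
      }
      return (lo + hi) / 2;
    }

    // Per-group sample size: n ≈ (Z_α/2 + Z_β)² × 2p(1-p) / effect².
    function calculateSampleSize(
      alpha: number,
      power: number,
      baselineRate: number,
      minImprovement: number,
    ): number {
      const zAlpha = normalQuantile(1 - alpha / 2);
      const zBeta = normalQuantile(power);
      const variance = 2 * baselineRate * (1 - baselineRate);
      const zSum = zAlpha + zBeta;
      return Math.ceil((zSum * zSum * variance) / (minImprovement * minImprovement));
    }
    ```

    With `alpha = 0.05`, `power = 0.80`, `baselineRate = 0.80`, and `minImprovement = 0.05`, this returns about 1,005 cases per group. The hand calculation above lands slightly lower because it uses z-values rounded to two decimals.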

    ## Programmatic access

    You can list datasets and append new rows through the [REST API](/api-reference/overview), or from an IDE agent via the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server. Use either to pull dataset metadata into external tools or automate dataset ingestion.

    ```bash theme={null}
    # List all datasets from the local dev server
    curl "http://localhost:9418/v1/datasets"

    # List datasets from AgentMark Cloud
    curl "https://api.agentmark.co/v1/datasets" \
      -H "Authorization: Bearer $AGENTMARK_API_KEY" \
      -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"
    ```

    Datasets are keyed by file path, not UUID. To append a row, POST to `/v1/datasets/{datasetName}/rows` where `datasetName` is the dataset path without the `.jsonl` extension, URL-encoded (for example, `evals/sentiment-test.jsonl` → `evals%2Fsentiment-test`):

    ```bash theme={null}
    # Append a row via curl
    curl -X POST \
      -H "Authorization: Bearer <API_KEY>" \
      -H "X-Agentmark-App-Id: <APP_ID>" \
      -H "Content-Type: application/json" \
      -d '{"input": {"text": "Great!"}, "expected_output": "positive"}' \
      https://api.agentmark.co/v1/datasets/evals%2Fsentiment-test/rows
    ```
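
    The path-to-URL rule above is easy to get wrong by hand, so here is a small sketch of it in TypeScript. `datasetRowsUrl` is a hypothetical helper name; the base URLs are the ones used in the curl examples.

    ```typescript
    const CLOUD_BASE = "https://api.agentmark.co";

    // Strips the .jsonl extension, URL-encodes the path, and builds the rows URL.
    function datasetRowsUrl(datasetPath: string, baseUrl: string = CLOUD_BASE): string {
      const datasetName = datasetPath.replace(/\.jsonl$/, "");
      return `${baseUrl}/v1/datasets/${encodeURIComponent(datasetName)}/rows`;
    }

    console.log(datasetRowsUrl("evals/sentiment-test.jsonl"));
    // prints "https://api.agentmark.co/v1/datasets/evals%2Fsentiment-test/rows"
    ```

    Passing `http://localhost:9418` as the second argument targets the local dev server instead.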

    The local dev server and the AgentMark Cloud gateway both implement the datasets endpoints, so you can develop integrations locally before deploying. Use the [`capabilities`](/api-reference/overview) endpoint to check which endpoints a given server supports.
  </Tab>
</Tabs>

## Next steps

<CardGroup>
  <Card title="Evaluations" icon="check-circle" href="/evaluate/writing-evals">
    Write evaluation functions
  </Card>

  <Card title="Running Experiments" icon="flask" href="/evaluate/running-experiments">
    Test your datasets
  </Card>

  <Card title="Testing overview" icon="clipboard-list" href="/evaluate/overview">
    Learn testing concepts
  </Card>
</CardGroup>



# GitLab CI/CD
Source: https://docs.agentmark.co/evaluate/gitlab-ci

Run AgentMark evals on changed prompts and gate merge requests on the results

The [`agentmark-ai/eval-component`](https://gitlab.com/agentmark-ai/eval-component) GitLab CI/CD Catalog component diffs each merge request, runs `@agentmark-ai/cli` against the changed `.prompt.mdx` files, and emits JUnit XML. Failures show up in the MR widget and the pipeline **Tests** tab natively — GitLab parses the JUnit via `artifacts:reports:junit:`, no third-party reporter required.

This is the GitLab counterpart of [`agentmark-ai/eval-action`](https://github.com/agentmark-ai/eval-action). Both wrap the same CLI command (`agentmark run-experiment --format junit`), accept the same `threshold` / `baseline-ref` semantics, and emit the same JUnit XML schema. Switching CI platforms doesn't require relearning the gates.

<Note>
  The `agentmark-ai/eval-component` Catalog project publishes alongside the first GitLab-parity release. If `gitlab.com/agentmark-ai/eval-component/eval@v1` resolves to a 404 for you, the component hasn't been published yet — use the raw-CLI fallback at the bottom of this page in the meantime (it runs the same gate from a hand-rolled `.gitlab-ci.yml`).
</Note>

## Quick start

Paste this into your repo's `.gitlab-ci.yml`:

```yaml theme={null}
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY    # masked, protected CI variable

variables:
  GIT_DEPTH: "0"                     # required so the diff base resolves
```

That's it. On every MR, the component evaluates the `.prompt.mdx` files changed in the diff and surfaces results inline in the MR widget.

<Warning>
  `GIT_DEPTH: "0"` is required. GitLab's default shallow checkout does not contain the diff base, so the component cannot resolve `$CI_MERGE_REQUEST_DIFF_BASE_SHA` to a tree hash. When that happens the regression gate is disabled for the run rather than failing the job.
</Warning>

## Set up the API key

Add `AGENTMARK_API_KEY` as a **masked**, **protected** CI/CD variable in your project's **Settings → CI/CD → Variables**:

<Steps>
  <Step title="Get the key from AgentMark Cloud">
    In the [AgentMark Dashboard](https://app.agentmark.co), open **Settings → API Keys** and create a key scoped to the app whose prompts you're gating.
  </Step>

  <Step title="Store it as a masked variable">
    In GitLab, **Settings → CI/CD → Variables → Add variable**:

    * Key: `AGENTMARK_API_KEY`
    * Value: the key from step 1
    * Type: **Variable** (not File)
    * Flags: **Masked**, **Protected**
  </Step>

  <Step title="Reference it in inputs">
    The component reads it via `inputs.api-key: $AGENTMARK_API_KEY`. Don't hard-code the key in `.gitlab-ci.yml`.
  </Step>
</Steps>

The key is required for cloud-backed runs (regression-gate baselines, dataset sync). For fully local evals — no Cloud features — you can omit the input and run without a key.

## Inputs

| Input               | Required | Default                     | Description                                                                                                                                                                           |
| ------------------- | -------- | --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `api-key`           | optional | —                           | AgentMark API key. Required for cloud-backed runs; omit for fully local evals.                                                                                                        |
| `prompts`           | optional | changed `.prompt.mdx` files | Newline- or space-separated list of prompt files to evaluate.                                                                                                                         |
| `threshold`         | optional | —                           | Pass-rate threshold (0–100). Fails the job if overall pass rate is below this number.                                                                                                 |
| `baseline-ref`      | optional | MR diff base                | Git ref to compare scores against for the regression gate. Resolved to a tree hash and passed to the CLI as `--baseline-commit`. Requires `GIT_DEPTH=0`. Set empty (`''`) to disable. |
| `working-directory` | optional | `.`                         | Directory to run from.                                                                                                                                                                |
| `results-glob`      | optional | `agentmark-results-*.xml`   | Pattern for per-prompt JUnit XML output files. Must contain exactly one `*` wildcard — the prefix and suffix around it become the per-prompt filename template.                       |
| `cli-version`       | optional | `latest`                    | npm version specifier for `@agentmark-ai/cli`. Pin for reproducible CI.                                                                                                               |
| `image`             | optional | `node:20-bookworm-slim`     | Docker image. Must include npm, git, bash.                                                                                                                                            |

## What gets gated

Up to four independent gates are evaluated on every run. If any one of them fails, the job fails.

1. **Per-row gate** — every `(row × scorer)` pair is a `<testcase>` in the JUnit XML. If the scorer's `passed` flag is `false`, the component emits `<failure>` and GitLab reports it inline in the MR widget.
2. **Threshold gate** (optional) — when `threshold:` is set, the job fails if the overall pass rate is below the threshold.
3. **Regression gate** (optional) — when `baseline-ref:` resolves to a prior run and the prompt sets `test_settings.regression_tolerance`, a row fails if a scorer's score dropped more than the tolerance below its baseline. Catches silent quality drops even when the scorer still "passes" in absolute terms.
4. **Per-scorer threshold gate** (optional) — when the prompt sets `test_settings.score_thresholds` (a `{ scorer: minMeanScore }` map), the run fails if a scorer's mean score across the run falls below the configured minimum.
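
To make the four gates concrete, the sketch below models them as a single function. It is an illustration of the rules listed above, not the CLI's implementation: the `ScoredRow` and `GateSettings` types and the `failedGates` name are hypothetical, and it assumes `regression_tolerance` is an absolute drop in score.

```typescript
// One (row × scorer) result, as described for the per-row gate.
interface ScoredRow {
  scorer: string;
  passed: boolean;
  score: number;
  baselineScore?: number; // absent when no baseline exists for this row
}

interface GateSettings {
  threshold?: number; // overall pass rate, 0-100
  regressionTolerance?: number; // assumption: absolute score drop
  scoreThresholds?: Record<string, number>; // scorer -> minimum mean score
}

// Returns the names of the gates that failed; an empty array means the job passes.
function failedGates(rows: ScoredRow[], settings: GateSettings): string[] {
  const failures: string[] = [];

  // 1. Per-row gate: any scorer reporting passed = false fails the job.
  if (rows.some((r) => !r.passed)) failures.push("per-row");

  // 2. Threshold gate: overall pass rate must reach the threshold.
  const passRate = (100 * rows.filter((r) => r.passed).length) / rows.length;
  if (settings.threshold !== undefined && passRate < settings.threshold) failures.push("threshold");

  // 3. Regression gate: rows without a baseline stay inert.
  const tolerance = settings.regressionTolerance;
  if (
    tolerance !== undefined &&
    rows.some((r) => r.baselineScore !== undefined && r.baselineScore - r.score > tolerance)
  ) {
    failures.push("regression");
  }

  // 4. Per-scorer threshold gate: mean score per scorer must reach its minimum.
  for (const [scorer, minimum] of Object.entries(settings.scoreThresholds ?? {})) {
    const scores = rows.filter((r) => r.scorer === scorer).map((r) => r.score);
    if (scores.length === 0) continue;
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    if (mean < minimum) failures.push(`score-threshold:${scorer}`);
  }
  return failures;
}
```

Note that a run can pass the per-row and threshold gates and still fail, which is the case the regression and per-scorer gates are meant to catch.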

The contract is identical to the GitHub Action because both wrap the same CLI. The full mechanics — how the baseline is resolved by tree hash, how rows are matched by input content, how missing baselines stay inert — are documented in [Regression gates](/deploy/regression-gates).

## When the job runs

The component's default `rules:` runs on:

* every **merge request** pipeline (the primary gate), and
* pushes to the **default branch** (so a fresh baseline is recorded after merge).

Override in your `.gitlab-ci.yml` to change the cadence:

```yaml theme={null}
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

# Run only on MRs — skip the default-branch baseline write.
agentmark_eval:
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
```

The default-branch run is what populates the baseline that subsequent MRs gate against. Skip it only if you're recording baselines through a separate process (a scheduled job, a manual trigger, or the SDK).

## With a regression-tolerance threshold

Set the per-case tolerance and run-level floors in the prompt's frontmatter — the component reads them automatically from `test_settings`. The component itself doesn't need any extra inputs.

```yaml theme={null}
# agentmark/qa.prompt.mdx (frontmatter)
test_settings:
  dataset: ./data/qa.jsonl
  regression_tolerance: 0.05            # fail a case if a scorer drops >5% below baseline
  score_thresholds:
    groundedness: 0.9                   # fail the run if mean groundedness < 0.9
```

```yaml theme={null}
# .gitlab-ci.yml
include:
  - component: gitlab.com/agentmark-ai/eval-component/eval@v1
    inputs:
      api-key: $AGENTMARK_API_KEY

variables:
  GIT_DEPTH: "0"
```

`baseline-ref` defaults to `$CI_MERGE_REQUEST_DIFF_BASE_SHA`, so MR pipelines pick up the right comparison automatically. The first run on the default branch records the baseline; from then on every MR gates against the run captured at its base commit's tree hash.

See [Regression gates](/deploy/regression-gates) for the full gate semantics — both the per-case tolerance check and the run-level `score_thresholds` apply identically here.

## Coexists with your existing tests

The component emits JUnit XML — the same format `pytest`, `jest`, and `vitest` already emit. Failures appear in the MR widget alongside any other failing test, and in the **Tests** tab of the pipeline view. No new dashboard to learn, no additional reporter to install.

## Raw-CLI fallback

If you can't yet consume the Catalog component — it hasn't been published yet (see the note at the top of this page), or you're pinning the CLI to a specific version and want the YAML in your repo — drop down to the raw CLI:

```yaml theme={null}
agentmark_eval:
  image: node:20-bookworm-slim
  variables:
    GIT_DEPTH: "0"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - npm install -g @agentmark-ai/cli@latest
    - npx agentmark run-experiment agentmark/qa.prompt.mdx --format junit > results.xml
  artifacts:
    when: always
    reports:
      junit: results.xml
```

This loses the automatic prompt-diff scoping (you list each prompt manually) and the baseline-ref resolution helper, but the JUnit output and the gate semantics are identical.

## See also

* [`agentmark-ai/eval-component` README](https://gitlab.com/agentmark-ai/eval-component) — the source, examples, and changelog.
* [`agentmark-ai/eval-action`](https://github.com/agentmark-ai/eval-action) — GitHub Actions sibling that wraps the same CLI.
* [Regression gates](/deploy/regression-gates) — full mechanics of the per-case and run-level gates the component applies.
* [Running experiments](/evaluate/running-experiments) — CLI reference and JUnit output details.



# Evaluate
Source: https://docs.agentmark.co/evaluate/overview

Test and improve your prompts with datasets, evaluators, experiments, and annotations

AgentMark gives you two ways to test prompts, and they share the same building blocks. In **Cloud**, you run and review experiments in the [AgentMark Dashboard](https://app.agentmark.co) — datasets and score configs are synced from your repo through the git deployment pipeline, and your deployed handler runs the evals. In **Local**, you keep datasets as JSONL files alongside your prompts, write eval functions in code, and run experiments from the CLI.

The Dashboard experiment views are the same shared UI components the local dev server renders, so Cloud and self-hosted Local look the same. The difference is the data source — git-synced versus local files — and who runs the eval handler.

<img alt="Experiment detail view in the AgentMark Dashboard showing per-row scores and aggregate metrics" />

The experiment detail view shows each dataset row's input, the AI output, expected output, and evaluator scores, alongside aggregate metrics for the run (average score, average latency, total cost, total tokens).

## Why test prompts?

LLM outputs are non-deterministic — the same prompt can produce different results. Testing helps you:

* **Catch regressions** — Know when prompt changes break existing functionality
* **Validate quality** — Ensure outputs meet standards across diverse scenarios
* **Measure improvements** — Quantify whether prompt iterations actually perform better
* **Build confidence** — Deploy changes backed by data, not guesswork

## Testing workflow

<Tabs>
  <Tab title="Cloud">
    <Steps>
      <Step title="Define a dataset in your repo">
        Add a JSONL file to your `agentmark/` directory. Each line is one test case.
      </Step>

      <Step title="Declare score configs and write evals">
        Add score configs to `agentmark.json` under `scores`, and write the eval functions on your handler.
      </Step>

      <Step title="Deploy to sync">
        Push to your connected branch. The deployment pipeline syncs your datasets and score configs to AgentMark Cloud.
      </Step>

      <Step title="Run an experiment">
        Open **Experiments** in the Dashboard, click **New Experiment**, choose the prompt, dataset, and evaluations, and run. Results stream in live, then open in the experiment detail view.
      </Step>
    </Steps>
  </Tab>

  <Tab title="Local">
    <Steps>
      <Step title="Create a dataset">
        Define test inputs in a JSONL file. Each line is one test case.
      </Step>

      <Step title="Write evaluations">
        Create eval functions that score outputs. Register them in your client.
      </Step>

      <Step title="Connect to prompts">
        Add `test_settings.dataset` and `test_settings.evals` to your prompt frontmatter.
      </Step>

      <Step title="Run experiments">
        Execute `npx agentmark run-experiment` to test your prompt against the dataset.
      </Step>
    </Steps>

    <Note>
      **Prerequisites:** You must have `npx agentmark dev` running in a separate terminal before running experiments.
    </Note>
  </Tab>
</Tabs>

## Core concepts

### Datasets

Collections of test inputs (and optionally expected outputs) that define the scenarios your prompt should handle — common cases, edge cases, and failure modes.

```jsonl theme={null}
{"input": {"text": "Great product!"}, "expected_output": "positive"}
{"input": {"text": "Terrible experience"}, "expected_output": "negative"}
{"input": {"text": ""}, "expected_output": "neutral"}
```

<Tabs>
  <Tab title="Cloud">
    Datasets live as JSONL files in your repo and sync to AgentMark Cloud through the deployment pipeline. In the Dashboard you pick a synced dataset when you create an experiment or configure a review queue. Rows are appended through the "Save to dataset" flow during annotation review and through the REST API.
  </Tab>

  <Tab title="Local">
    Store JSONL files alongside your prompts in the `agentmark/` directory and run them with the CLI.
  </Tab>
</Tabs>

[Learn more about datasets →](/evaluate/datasets)

### Evaluations

Functions that score prompt outputs and determine pass/fail status. Define your success criteria — what makes an output correct, high-quality, or acceptable.

```typescript theme={null}
export const accuracy = async ({ output, expectedOutput }) => {
  // expectedOutput is optional, so guard against rows without an expected value
  const match = output.trim().toLowerCase() === expectedOutput?.trim().toLowerCase();
  return { passed: match, score: match ? 1 : 0 };
};
```

<Tabs>
  <Tab title="Cloud">
    Score configs are declared in `agentmark.json` under `scores` and synced to AgentMark Cloud through the deployment pipeline. Eval functions run during experiments on your deployed handler. In the New Experiment dialog you select which registered evals to run, and results appear as per-row scores and aggregates in the experiment detail view.
  </Tab>

  <Tab title="Local">
    Write eval functions in code, register them in your client, and run them through the CLI with `run-experiment`.
  </Tab>
</Tabs>

[Learn more about evaluations →](/evaluate/writing-evals)

### Experiments

An experiment runs a prompt against a dataset with evaluations. Use experiments to validate prompt changes, compare model configurations, and enforce quality thresholds.

<Tabs>
  <Tab title="Cloud">
    Run experiments from the **Experiments** page in the Dashboard. Review results with per-row score drill-down, aggregate metrics, and charts, and compare runs side by side.
  </Tab>

  <Tab title="Local">
    Run from the CLI and view results as tables in your terminal:

    ```bash theme={null}
    npx agentmark run-experiment agentmark/<your-prompt>.prompt.mdx
    ```
  </Tab>
</Tabs>

[Learn more about running experiments →](/evaluate/running-experiments)

### Annotations

<Info>**Cloud feature.** Annotations are available in the [AgentMark Dashboard](https://app.agentmark.co).</Info>

Manually label and score traces for human-in-the-loop evaluation. Add scores, labels, and detailed reasoning to any span. Complement automated evals with human judgment.

[Learn more about annotations →](/evaluate/annotations)

## Testing strategies

* **Start small** (5-10 cases), then grow with real data
* **Test multiple dimensions** — accuracy, completeness, tone, format
* **Version control everything** — datasets live alongside prompts in your repo
* **Run in CI/CD** — gate deployments on pass-rate thresholds

## Programmatic access

Query datasets, experiments, runs, and prompt execution logs through the [REST API](/api-reference/overview), or from an IDE agent via the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server. Use either to build custom reporting, export evaluation results to external tools, or integrate experiment data into CI/CD pipelines.

```bash theme={null}
# List datasets and experiments from the local dev server
curl "http://localhost:9418/v1/datasets"
curl "http://localhost:9418/v1/experiments?limit=10"

# Get detailed results for a specific experiment
curl "http://localhost:9418/v1/experiments/<experimentId>"

# List traces produced by a specific experiment run
# (filter /v1/traces by dataset_run_id — the former /v1/runs/{runId}/traces
# endpoint is deprecated on Local and returns 501 on Cloud)
curl "http://localhost:9418/v1/traces?dataset_run_id=<runId>"

# Against Cloud, set the auth + app headers:
curl "https://api.agentmark.co/v1/experiments?limit=10" \
  -H "Authorization: Bearer $AGENTMARK_API_KEY" \
  -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"
```

The local dev server and the AgentMark Cloud gateway share the same `/v1/*` wire contract. A small number of routes are environment-specific — see [API reference → Available endpoints](/api-reference/overview#available-endpoints) for the `Where` column.
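
Because both environments share the same paths, one small helper can target either. The sketch below only builds the URL and headers shown in the curl examples above; `buildApiRequest` and `ApiTarget` are hypothetical names, and the response shape is deliberately left out because it is not described here.

```typescript
// Hypothetical helper: builds the same URL and headers as the curl examples.
interface ApiTarget {
  baseUrl: string;
  apiKey?: string; // only needed against Cloud
  appId?: string;  // only needed against Cloud
}

function buildApiRequest(
  target: ApiTarget,
  path: string,
  query: Record<string, string> = {},
): { url: string; headers: Record<string, string> } {
  const url = new URL(path, target.baseUrl);
  for (const key of Object.keys(query)) {
    url.searchParams.set(key, query[key]);
  }
  const headers: Record<string, string> = {};
  if (target.apiKey) headers["Authorization"] = `Bearer ${target.apiKey}`;
  if (target.appId) headers["X-Agentmark-App-Id"] = target.appId;
  return { url: url.toString(), headers };
}

const localRequest = buildApiRequest(
  { baseUrl: "http://localhost:9418" },
  "/v1/experiments",
  { limit: "10" },
);
console.log(localRequest.url); // http://localhost:9418/v1/experiments?limit=10
```

Pass the result to any HTTP client, for example `fetch(localRequest.url, { headers: localRequest.headers })` on a runtime that provides `fetch`.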

## Next steps

<CardGroup>
  <Card title="Datasets" icon="database" href="/evaluate/datasets">
    Create test datasets
  </Card>

  <Card title="Writing Evals" icon="check-circle" href="/evaluate/writing-evals">
    Write evaluation functions
  </Card>

  <Card title="Running Experiments" icon="flask" href="/evaluate/running-experiments">
    Execute tests with the CLI or Dashboard
  </Card>

  <Card title="Annotations" icon="pen" href="/evaluate/annotations">
    Human-in-the-loop scoring
  </Card>
</CardGroup>

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Running experiments
Source: https://docs.agentmark.co/evaluate/running-experiments

Test prompts with datasets in the Dashboard, CLI, or SDK

Run prompts against datasets with automatic evaluation to validate quality and consistency. In Cloud, you create and review experiments in the [AgentMark Dashboard](https://app.agentmark.co) against git-synced datasets. In Local, you run experiments from the CLI against JSONL files on disk.

<Tabs>
  <Tab title="Cloud">
    ## Experiments in the Dashboard

    The Dashboard runs experiments against datasets and score configs that are synced from your repo through the [deployment pipeline](/deploy/deployment), using the evals registered on your deployed handler. Open **Experiments** (flask icon) in the sidebar to get started.

    <Note>
      Running an experiment from the Dashboard requires the app to be connected to a deployed handler. If it isn't, the dialog returns an "app not connected" error. See [Deployment](/deploy/deployment) for connecting a handler.
    </Note>

    ### Browse the experiments list

    The Experiments page is a paginated list of every run in your app. Filter it by **prompt name** and **dataset path** to find a specific run.

    <img alt="Experiments list in the AgentMark Dashboard with prompt and dataset filters and the New Experiment button" />

    The Experiments list shows each run as a row, with filters for prompt name and dataset path and a **New Experiment** button in the top-right. Comparison charts — average latency, total cost, and average score across the runs — sit above the list. Select 2 to 3 runs to enable **Compare**.

    Running an experiment requires the `experiment.run` permission.

    ### Create and run an experiment

    Click **New Experiment** to open the dialog.

    <img alt="New Experiment dialog with name, prompt, dataset, and evaluations fields" />

    The New Experiment dialog has four fields: **Name**, **Prompt**, **Dataset**, and **Evaluations** (a multi-select populated from the evals your deployed handler registers). Selecting a prompt auto-fills the dataset and evaluations from its `test_settings` frontmatter.

    The **Name** must start with a letter and may contain letters, numbers, hyphens, and underscores, up to 100 characters.
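
    If you generate experiment names in a script, you can check them against this rule before submitting. The following is a minimal sketch that assumes "letters" means ASCII letters; `isValidExperimentName` is a hypothetical helper, not part of the AgentMark SDK.

    ```typescript
    // Hypothetical helper mirroring the documented rule: starts with a letter,
    // then letters, numbers, hyphens, or underscores, at most 100 characters total.
    const EXPERIMENT_NAME_PATTERN = /^[A-Za-z][A-Za-z0-9_-]{0,99}$/;

    function isValidExperimentName(candidate: string): boolean {
      return EXPERIMENT_NAME_PATTERN.test(candidate);
    }

    console.log(isValidExperimentName("baseline-v2"));   // true
    console.log(isValidExperimentName("2024-baseline")); // false (starts with a digit)
    ```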

    <Steps>
      <Step title="Name the experiment">
        Enter a **Name** that starts with a letter.
      </Step>

      <Step title="Choose a prompt">
        Pick the **Prompt** to test. The **Dataset** and **Evaluations** auto-fill from its `test_settings`.
      </Step>

      <Step title="Confirm dataset and evaluations">
        Adjust **Dataset** and **Evaluations** if you want to run against a different dataset or eval set.
      </Step>

      <Step title="Run">
        Click **Run Experiment**. Results stream in live, then open in the experiment detail view.
      </Step>
    </Steps>

    As the run executes, results stream in row by row, and a summary reports the item count and total tokens when it finishes. Open the experiment to review the full results.

    ### Read the experiment detail

    Click any experiment to open its detail view.

    <img alt="Experiment detail: per-row input, expected, and actual output with evaluator scores, plus aggregate metrics and charts" />

    The experiment detail view lists each dataset row in a table — **Item**, **Input**, **Output**, **Expected Output**, **Model**, latency, cost, tokens, **Scores**, and a **Trace** link. Above the table, aggregate metrics summarize the run (items, average score, total cost, average latency, total tokens) alongside charts.

    Use **Send to Review Queue** on the detail page to send the experiment's items to an annotation queue for human review. See [Human annotation](/evaluate/annotations).

    ### Compare runs

    Select 2 to 3 experiments in the list, then click **Compare** to view them side by side.

    <img alt="Two experiments compared side by side in the AgentMark Dashboard" />

    The comparison view places runs side by side (2 to 3) and tags each item as **Improved**, **Regressed**, or **Unchanged**, so you can see exactly which cases a prompt change fixed or broke.
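
    The exact rule the Dashboard uses to tag items is not specified here. As a rough mental model, the sketch below tags each item by comparing its score in a candidate run against a baseline run; the `compareScores` helper, the `Verdict` type, and the tolerance value are all assumptions for illustration.

    ```typescript
    // Hypothetical rule: compare per-item scores between two runs.
    type Verdict = "Improved" | "Regressed" | "Unchanged";

    function compareScores(baseline: number, candidate: number, epsilon = 1e-9): Verdict {
      const delta = candidate - baseline;
      if (delta > epsilon) return "Improved";
      if (delta < -epsilon) return "Regressed";
      return "Unchanged";
    }

    // Illustrative scores for the same three dataset rows in two runs.
    const baselineRun = [1, 0, 0.5];
    const candidateRun = [1, 1, 0.25];

    const verdicts = baselineRun.map((score, i) => compareScores(score, candidateRun[i]));
    console.log(verdicts); // [ 'Unchanged', 'Improved', 'Regressed' ]
    ```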
  </Tab>

  <Tab title="Local">
    <img alt="Running experiments with the AgentMark CLI" />

    The animation shows `npx agentmark run-experiment` executing against a dataset: each row is processed, the AI output is scored, and a results table prints to stdout with pass/fail status per evaluator.

    ## CLI usage

    ### Quick start

    ```bash theme={null}
    npx agentmark run-experiment agentmark/classifier.prompt.mdx
    ```

    **Requirements**:

    * Dataset configured in prompt frontmatter
    * Development server running (`npx agentmark dev`)
    * Optional: Evaluation functions defined

    <Note>
      Keep `npx agentmark dev` running in a separate terminal. The `run-experiment` command talks to it on port 9417.
    </Note>

    ### Full command signature

    ```bash theme={null}
    npx agentmark run-experiment <filepath> [options]

    Options:
      --server <url>        Webhook server URL (default: http://localhost:9417)
      --skip-eval           Skip running evals even if they exist
      --format <format>     Output format: table, csv, json, jsonl, or junit (default: table)
      --threshold <percent> Fail if pass percentage is below threshold (0-100)
      --truncate <chars>    Truncate long cells in table output (default 1000; 0 = unlimited)

    Dataset sampling (pick at most one):
      --sample <percent>    Run on a random N% of rows (1-100)
      --rows <spec>         Select specific rows by index or range (e.g., 0,3-5,9)
      --split <spec>        Train/test split (e.g., train:80 or test:80)
      --seed <number>       Seed for reproducible sampling/splitting
    ```

    The `--server` flag defaults to the `AGENTMARK_WEBHOOK_URL` environment variable if set, otherwise `http://localhost:9417`.

    ### Command options

    **Skip evaluations** (output-only mode):

    ```bash theme={null}
    npx agentmark run-experiment agentmark/test.prompt.mdx --skip-eval
    ```

    **Output format**:

    ```bash theme={null}
    npx agentmark run-experiment agentmark/test.prompt.mdx --format table   # Default
    npx agentmark run-experiment agentmark/test.prompt.mdx --format csv     # Spreadsheets
    npx agentmark run-experiment agentmark/test.prompt.mdx --format json    # Structured
    npx agentmark run-experiment agentmark/test.prompt.mdx --format jsonl   # Line-delimited
    npx agentmark run-experiment agentmark/test.prompt.mdx --format junit   # JUnit XML for CI gating
    ```

    **Pass rate threshold** (CI/CD):

    ```bash theme={null}
    npx agentmark run-experiment agentmark/test.prompt.mdx --threshold 85
    ```

    The command exits with a non-zero code if the pass rate falls below the threshold. This requires evaluations that return a `passed` field.

    <Note>
      `--threshold` is an absolute pass-rate gate on a single run. To gate CI on per-case regressions against a baseline — failing a PR when a case scores worse than it did before — see [Regression gates](/deploy/regression-gates).
    </Note>

    **JUnit XML for CI gating**:

    `--format junit` emits a [JUnit XML](https://github.com/testmoapp/junitxml) document, a format that major CI systems can ingest: GitHub Actions (via marketplace parsers), GitLab CI (via `artifacts:reports:junit`), Jenkins, CircleCI, and others. Each `(row × scorer)` pair becomes one `<testcase>`; failing scorers emit `<failure>` with the input, actual, and expected payload in CDATA.

    ```bash theme={null}
    npx agentmark run-experiment agentmark/test.prompt.mdx --format junit > results.xml
    ```

    The XML can be combined with `--threshold` for a suite-level gate on top of the per-row failures already surfaced in the report.

    **GitHub Actions** — use the [`agentmark-ai/eval-action`](https://github.com/agentmark-ai/eval-action) composite, which diffs the PR, runs `--format junit` per changed prompt, and pipes results to `mikepenz/action-junit-report`:

    ```yaml theme={null}
    - uses: agentmark-ai/eval-action@v1
      with:
        api-key: ${{ secrets.AGENTMARK_API_KEY }}
    ```

    **GitLab CI** — use the [`agentmark-ai/eval-component`](/evaluate/gitlab-ci) Catalog component, which diffs the MR, runs `--format junit` per changed prompt, and surfaces results in the MR widget via `artifacts:reports:junit:`:

    ```yaml theme={null}
    include:
      - component: gitlab.com/agentmark-ai/eval-component/eval@v1
        inputs:
          api-key: $AGENTMARK_API_KEY

    variables:
      GIT_DEPTH: "0"
    ```

    Other CI systems (Jenkins, CircleCI, Buildkite) consume the same XML via their native JUnit-report plugins.

    **Dataset sampling** (see [Dataset sampling](#dataset-sampling) below):

    ```bash theme={null}
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20
    npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
    npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80
    ```

    **Custom server**:

    ```bash theme={null}
    npx agentmark run-experiment agentmark/test.prompt.mdx --server http://staging:9417
    ```

    ### Dataset sampling

    Run experiments on a subset of your dataset without modifying the dataset file. The three sampling modes are mutually exclusive — use only one per run.

    **Random sample** (`--sample <percent>`):

    Run on a random N% of rows. Useful for quick smoke tests against large datasets.

    ```bash theme={null}
    # Run on ~20% of rows (random, non-reproducible)
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20

    # Reproducible: same 20% every time
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 20 --seed 42
    ```

    **Specific rows** (`--rows <spec>`):

    Select individual rows by zero-based index. Supports comma-separated indices and ranges.

    ```bash theme={null}
    # Row 0 only
    npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0

    # Rows 0, 3, 4, 5, and 9
    npx agentmark run-experiment agentmark/test.prompt.mdx --rows 0,3-5,9
    ```
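
    To make the spec syntax concrete, here is a minimal sketch of how a string such as `0,3-5,9` expands into row indices. `parseRowSpec` is a hypothetical function written for illustration; it is not the CLI's implementation, and its error handling is an assumption.

    ```typescript
    // Hypothetical parser for a row spec: comma-separated zero-based indices and ranges.
    function parseRowSpec(spec: string): number[] {
      const indices = new Set<number>();
      for (const part of spec.split(",")) {
        const match = /^(\d+)(?:-(\d+))?$/.exec(part.trim());
        if (!match) throw new Error(`Invalid row spec segment: "${part}"`);
        const start = Number(match[1]);
        const end = match[2] === undefined ? start : Number(match[2]);
        if (end < start) throw new Error(`Descending range: "${part}"`);
        for (let i = start; i <= end; i++) indices.add(i);
      }
      // Deduplicate and return in ascending order.
      return Array.from(indices).sort((a, b) => a - b);
    }

    console.log(parseRowSpec("0,3-5,9")); // [ 0, 3, 4, 5, 9 ]
    ```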

    **Train/test split** (`--split <spec>`):

    Split the dataset into train and test portions. Run only the train portion or only the test portion.

    ```bash theme={null}
    # Run on the first 80% (train portion), positional split
    npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80

    # Run on the remaining 20% (test portion), positional split
    npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80

    # Seeded split — random assignment, reproducible across runs
    npx agentmark run-experiment agentmark/test.prompt.mdx --split train:80 --seed 42
    npx agentmark run-experiment agentmark/test.prompt.mdx --split test:80 --seed 42
    ```

    <Note>
      Without `--seed`, `--split` uses positional assignment: the first N% of rows are "train" and the rest are "test". With `--seed`, each row is assigned to train or test by a deterministic hash — the order in the file does not matter.
    </Note>
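
    The two assignment modes can be sketched as follows. This is an illustration only: the FNV-1a hash, the choice to hash the row's serialized content, and the function names are assumptions, and the CLI's actual hashing scheme may differ. Note that a hash-based split yields approximately, not exactly, the requested percentage.

    ```typescript
    type Portion = "train" | "test";

    // FNV-1a, a simple 32-bit string hash (chosen here purely for illustration).
    function fnv1a(text: string): number {
      let hash = 0x811c9dc5;
      for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
      }
      return hash;
    }

    // Positional: the first N% of rows are train, the rest are test.
    function assignByPosition(index: number, totalRows: number, trainPercent: number): Portion {
      return index < Math.floor((totalRows * trainPercent) / 100) ? "train" : "test";
    }

    // Seeded: each row is bucketed by a deterministic hash, independent of file order.
    function assignBySeed(rowLine: string, trainPercent: number, seed: number): Portion {
      return fnv1a(`${seed}:${rowLine}`) % 100 < trainPercent ? "train" : "test";
    }

    const positional = [0, 1, 2, 3, 4].map((i) => assignByPosition(i, 5, 80));
    console.log(positional); // [ 'train', 'train', 'train', 'train', 'test' ]
    ```

    Because `train:80` and `test:80` evaluate the same predicate and take opposite sides, the two portions never overlap when the seed is the same.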

    **Reproducibility with `--seed`**:

    The `--seed` flag guarantees the same rows are selected every time, across TypeScript and Python. Pass the same seed to get identical results on any machine or language runtime.

    ```bash theme={null}
    # These two runs always process the exact same rows
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
    npx agentmark run-experiment agentmark/test.prompt.mdx --sample 30 --seed 99
    ```

    <Tip>
      Use `--seed` in CI/CD pipelines to prevent flaky results from random row selection.
    </Tip>
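
    To see why a seed makes sampling reproducible, consider the sketch below. It drives a Fisher-Yates shuffle with a small seeded generator (mulberry32) and takes the first N% of the shuffled indices. The generator, the rounding rule, and the function names are assumptions for illustration; the CLI's actual sampling algorithm is not documented here.

    ```typescript
    // mulberry32: a small seeded PRNG returning numbers in [0, 1).
    function mulberry32(seed: number): () => number {
      let state = seed >>> 0;
      return () => {
        state = (state + 0x6d2b79f5) >>> 0;
        let t = state;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
      };
    }

    function sampleIndices(totalRows: number, percent: number, seed: number): number[] {
      const random = mulberry32(seed);
      const indices = Array.from({ length: totalRows }, (_, i) => i);
      // Fisher-Yates shuffle driven by the seeded generator.
      for (let i = indices.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        [indices[i], indices[j]] = [indices[j], indices[i]];
      }
      const count = Math.max(1, Math.round((totalRows * percent) / 100));
      return indices.slice(0, count).sort((a, b) => a - b);
    }

    const firstSample = sampleIndices(50, 30, 99);
    const secondSample = sampleIndices(50, 30, 99);
    console.log(firstSample.length); // 15
    console.log(JSON.stringify(firstSample) === JSON.stringify(secondSample)); // true
    ```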

    ### Output example

    | # | Input                  | AI Result | Expected Output | sentiment\_check |
    | - | ---------------------- | --------- | --------------- | ---------------- |
    | 1 | `{"text":"I love it"}` | positive  | positive        | PASS (1.00)      |
    | 2 | `{"text":"Terrible"}`  | negative  | negative        | PASS (1.00)      |
    | 3 | `{"text":"It's okay"}` | neutral   | neutral         | PASS (1.00)      |

    On a normal run the table (or `--format` output) is the only result. Pass `--threshold <0-100>` to also print a pass-rate summary and gate the exit code — the pass rate is counted over **evaluations** (row × evaluator pairs), not rows:

    ```
    ✅ Experiment passed threshold check
       Pass rate: 100% (3/3 evaluations passed)
       Threshold: 85%
    ```

    If the pass rate falls below the threshold, the CLI prints `❌ Experiment failed threshold check` and exits non-zero — wire that into CI for regression gating.
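
    The distinction between counting evaluations and counting rows matters once a prompt has more than one evaluator. The sketch below computes a pass rate over row × evaluator pairs; the `RowResult` shape and `passRate` function are hypothetical and only illustrate the arithmetic.

    ```typescript
    // Hypothetical result shape: one entry per dataset row, one outcome per evaluator.
    interface EvalOutcome {
      evaluator: string;
      passed: boolean;
    }
    interface RowResult {
      evals: EvalOutcome[];
    }

    function passRate(results: RowResult[]): { passed: number; total: number; percent: number } {
      let passed = 0;
      let total = 0;
      for (const row of results) {
        for (const outcome of row.evals) {
          total += 1;
          if (outcome.passed) passed += 1;
        }
      }
      return { passed, total, percent: total === 0 ? 0 : (passed / total) * 100 };
    }

    // Two rows, two evaluators each: four evaluations in total.
    const rowResults: RowResult[] = [
      { evals: [{ evaluator: "accuracy", passed: true }, { evaluator: "tone", passed: true }] },
      { evals: [{ evaluator: "accuracy", passed: false }, { evaluator: "tone", passed: true }] },
    ];

    const summary = passRate(rowResults);
    console.log(summary); // { passed: 3, total: 4, percent: 75 }
    console.log(summary.percent >= 85 ? "passed" : "failed"); // failed
    ```

    Here one of two rows has a failing evaluator, yet the pass rate is 75%, not 50%, because three of the four evaluations passed.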

    The CLI supports both `.mdx` source files and pre-built `.json` files (from `npx agentmark build`). Media outputs (images, audio) are saved to `.agentmark-outputs/` with clickable file paths.

    ## How it works

    The `run-experiment` command:

    1. Loads your prompt file (`.mdx` or pre-built `.json`) and parses the frontmatter
    2. Reads the dataset specified in `test_settings.dataset`
    3. Sends the prompt and dataset to the dev server (default: `http://localhost:9417`)
    4. The server runs the prompt against each dataset row
    5. Evaluates results using the evals specified in `test_settings.evals`
    6. Streams results back to the CLI as they complete
    7. Displays formatted output (table, CSV, JSON, JSONL, or JUnit XML)

    ## Configuration

    Link dataset and evals in prompt frontmatter:

    ```mdx theme={null}
    ---
    name: sentiment-classifier
    test_settings:
      dataset: ./datasets/sentiment.jsonl
      evals:
        - sentiment_check
    ---

    <System>Classify the sentiment</System>
    <User>{props.text}</User>
    ```

    You can also provide default props via `test_settings.props`:

    ```yaml theme={null}
    test_settings:
      props:
        language: en
        verbose: false
      dataset: ./datasets/sentiment.jsonl
      evals:
        - sentiment_check
    ```

    Props from each dataset row override the defaults.
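
    The precedence can be pictured as a merge in which row values win. This is a minimal sketch assuming a shallow merge; `resolveProps` is a hypothetical helper, and whether AgentMark merges nested objects deeply is not stated here.

    ```typescript
    type Props = Record<string, unknown>;

    // Hypothetical helper: defaults first, then row input, so row values override.
    function resolveProps(defaults: Props, rowInput: Props): Props {
      return { ...defaults, ...rowInput };
    }

    const defaultProps: Props = { language: "en", verbose: false };
    const rowProps: Props = { text: "I love this!", language: "fr" };

    console.log(resolveProps(defaultProps, rowProps));
    // { language: 'fr', verbose: false, text: 'I love this!' }
    ```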

    **Dataset (sentiment.jsonl)**:

    ```jsonl theme={null}
    {"input": {"text": "I love this!"}, "expected_output": "positive"}
    {"input": {"text": "Terrible product"}, "expected_output": "negative"}
    {"input": {"text": "It's okay"}, "expected_output": "neutral"}
    ```

    [Learn more about datasets →](/evaluate/datasets)

    [Learn more about evals →](/evaluate/writing-evals)

    ## Workflow

    **1. Develop prompts** - Iterate on your prompt design

    **2. Create datasets** - Add test cases covering your scenarios

    **3. Write evaluations** - Define success criteria

    **4. Run experiments** - Test against dataset

    ```bash theme={null}
    npx agentmark run-experiment agentmark/my-prompt.prompt.mdx
    ```

    **5. Review results** - Identify failures and patterns

    **6. Iterate** - Fix issues, improve prompts, add test cases

    **7. Deploy with confidence** - Pass rate meets your threshold

    ## SDK usage

    Run experiments programmatically using `formatWithDataset()`:

    ```typescript theme={null}
    import { client } from './agentmark-client';
    import { generateText } from 'ai';  // Or your adapter's generation function

    const prompt = await client.loadTextPrompt('agentmark/classifier.prompt.mdx');

    // Returns a stream of formatted inputs from the dataset
    const datasetStream = await prompt.formatWithDataset();

    // Process each test case
    for await (const item of datasetStream) {
      const { dataset, formatted, evals } = item;

      // Run the prompt with your AI SDK
      const result = await generateText(formatted);

      // Check results
      const passed = result.text === dataset.expected_output;
      console.log(`Input: ${JSON.stringify(dataset.input)}`);
      console.log(`Expected: ${dataset.expected_output}`);
      console.log(`Got: ${result.text}`);
      console.log(`Result: ${passed ? 'PASS' : 'FAIL'}\n`);
    }
    ```

    The stream returns objects with:

    * `dataset` - The test case (`input` and `expected_output`)
    * `formatted` - The formatted prompt ready for your AI SDK
    * `evals` - List of evaluation names to run
    * `type` - Always `"dataset"`
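
    Since `evals` is a list of names, you need a lookup from name to function to score each result yourself. The sketch below shows one way to do that; the `EvalParams` and `EvalResult` shapes are simplified from the examples in these docs, and `evalRegistry` and `runNamedEvals` are hypothetical names rather than SDK exports.

    ```typescript
    // Simplified, hypothetical shapes for illustration.
    interface EvalParams {
      input: unknown;
      output: string;
      expectedOutput?: string;
    }
    interface EvalResult {
      passed: boolean;
      score?: number;
      reason?: string;
    }
    type EvalFn = (params: EvalParams) => Promise<EvalResult>;

    // Name-to-function lookup; keys match the names listed in test_settings.evals.
    const evalRegistry: Record<string, EvalFn> = {
      accuracy: async ({ output, expectedOutput }) => {
        const match = output.trim() === expectedOutput?.trim();
        return { passed: match, score: match ? 1 : 0 };
      },
    };

    async function runNamedEvals(
      evalNames: string[],
      params: EvalParams,
    ): Promise<Record<string, EvalResult>> {
      const results: Record<string, EvalResult> = {};
      for (const evalName of evalNames) {
        const evalFn = evalRegistry[evalName];
        if (!evalFn) throw new Error(`Eval "${evalName}" is not registered`);
        results[evalName] = await evalFn(params);
      }
      return results;
    }

    runNamedEvals(["accuracy"], {
      input: { text: "Great product!" },
      output: "positive ",
      expectedOutput: "positive",
    }).then((results) => console.log(results)); // { accuracy: { passed: true, score: 1 } }
    ```

    Inside the `for await` loop above, the equivalent call would pass `evals`, `dataset.input`, `result.text`, and `dataset.expected_output`.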

    **Options** (`FormatWithDatasetOptions`):

    * `datasetPath?: string` - Override dataset from frontmatter
    * `format?: 'ndjson' | 'json'` - Buffer all rows (`'json'`) or stream as available (`'ndjson'`, default)

    **When to use**:

    * Custom test logic in your test framework
    * Fine-grained control over test execution
    * Integrating with existing test infrastructure
    * Running experiments in application code

    ## Troubleshooting

    ### CLI issues

    **Dataset not found**:

    * Check dataset path in frontmatter
    * Verify file exists and is valid JSONL

    **Server connection error**:

    * Ensure `npx agentmark dev` is running
    * Check ports are available (default webhook port: 9417)
    * Verify `--server` URL if using a custom server

    **Invalid dataset format**:

    * Each line must be valid JSON
    * Required: `input` field
    * Optional: `expected_output` field
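
    To locate the offending line, you can validate the file contents before running an experiment. The sketch below checks the rules listed above; `validateJsonl` is a hypothetical helper, and skipping blank lines is an assumption rather than documented CLI behavior.

    ```typescript
    interface DatasetRow {
      input: unknown;
      expected_output?: unknown;
    }

    // Hypothetical validator: one JSON object per line, `input` required.
    function validateJsonl(content: string): { rows: DatasetRow[]; errors: string[] } {
      const rows: DatasetRow[] = [];
      const errors: string[] = [];
      content.split("\n").forEach((line, i) => {
        if (line.trim() === "") return; // assumption: blank lines are ignored
        try {
          const parsed: unknown = JSON.parse(line);
          if (typeof parsed !== "object" || parsed === null || !("input" in parsed)) {
            errors.push(`line ${i + 1}: missing required "input" field`);
            return;
          }
          rows.push(parsed as DatasetRow);
        } catch {
          errors.push(`line ${i + 1}: not valid JSON`);
        }
      });
      return { rows, errors };
    }

    const sampleContent = [
      '{"input": {"text": "I love this!"}, "expected_output": "positive"}',
      '{"expected_output": "negative"}',
      "{not json}",
    ].join("\n");

    console.log(validateJsonl(sampleContent).errors);
    // [ 'line 2: missing required "input" field', 'line 3: not valid JSON' ]
    ```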

    **No evaluations ran**:

    * Add `evals` to `test_settings` in frontmatter
    * Or use `--skip-eval` flag for output-only mode

    **Threshold check failed**:

    * The `--threshold` flag requires evals that return a `passed` field
    * Verify your eval functions return `{ passed: true/false, ... }`

    **Sampling options conflict**:

    * Only one of `--sample`, `--rows`, or `--split` may be used at a time
    * `--seed` can be combined with any of them

    ## Programmatic access

    You can query experiment results, run traces, and prompt file listings through the [REST API](/api-reference/overview), or from an IDE agent via the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server. Use either to build custom reporting, export results to external tools, or integrate experiment data into CI/CD pipelines.

    ```bash theme={null}
    # List experiments
    curl "http://localhost:9418/v1/experiments?limit=10"

    # Get a specific experiment, including its runs and evaluation results
    curl "http://localhost:9418/v1/experiments/<experimentId>"

    # List traces for a specific experiment run — filter `/v1/traces` by `dataset_run_id`
    # (the former `/v1/runs/{runId}/traces` endpoint is deprecated; both paths hit the
    # same predicate, but the filter approach works on Cloud + Local without a second
    # endpoint).
    curl "http://localhost:9418/v1/traces?dataset_run_id=<runId>"

    # List prompt files registered with the local dev server
    curl "http://localhost:9418/v1/prompts?limit=10"
    ```

    ```bash theme={null}
    # Same call against AgentMark Cloud — set auth + app headers
    curl "https://api.agentmark.co/v1/experiments?limit=10" \
      -H "Authorization: Bearer $AGENTMARK_API_KEY" \
      -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"
    ```

    <Note>
      `experiments` ships on Cloud + Local. `prompts` is Local-only today — Cloud returns `501 not_available_on_cloud`. The legacy `/v1/runs/{runId}/traces` endpoint is deprecated but still works on Local for backwards compatibility; use `/v1/traces?dataset_run_id=…` in new code. Call [`GET /v1/capabilities`](/api-reference/overview) to check which features a server supports at runtime.
    </Note>
  </Tab>
</Tabs>

## Next steps

<CardGroup>
  <Card title="Datasets" icon="database" href="/evaluate/datasets">
    Create test datasets
  </Card>

  <Card title="Evaluations" icon="check-circle" href="/evaluate/writing-evals">
    Write evaluation functions
  </Card>

  <Card title="Testing overview" icon="clipboard-list" href="/evaluate/overview">
    Learn testing concepts
  </Card>
</CardGroup>



# Evaluations
Source: https://docs.agentmark.co/evaluate/writing-evals

Write evaluation functions to score prompt outputs

Evaluations (evals) are functions that score prompt outputs and determine pass/fail status. You declare what each eval scores on in `agentmark.json`, write the function in code, and connect the two by name. In Cloud, score configs sync to the Dashboard and you pick evals to run from the New Experiment dialog. In Local, you register eval functions in your client and run them through the CLI.

<Tip>
  **Start with evals** - Build your evaluation framework before writing prompts. Evals give you a way to measure effectiveness as you iterate.
</Tip>

<Tabs>
  <Tab title="Cloud">
    ## Evals in the Dashboard

    Cloud evals come from two pieces that you maintain in your repo:

    * **Score configs** declared in `agentmark.json` under `scores`. These define what each eval scores on (boolean, numeric, or categorical) and sync to AgentMark Cloud through the [deployment pipeline](/deploy/deployment).
    * **Eval functions** on your deployed handler. These run during experiments and produce the scores.

    ### Declare score configs

    Add a `scores` block to `agentmark.json`. Each key is an eval name that your eval functions return scores for.

    ```json agentmark.json theme={null}
    {
      "scores": {
        "accuracy": {
          "type": "boolean",
          "description": "Was the response factually correct?"
        },
        "tone": {
          "type": "categorical",
          "description": "Response tone classification",
          "categories": [
            { "label": "professional", "value": 1 },
            { "label": "casual", "value": 0.5 },
            { "label": "inappropriate", "value": 0 }
          ]
        }
      }
    }
    ```

    Push to your connected branch so the deployment pipeline syncs the score configs to AgentMark Cloud. Once synced, they stay available in the Dashboard across deployments. See [Project configuration](/configure/project-config) for the full `scores` schema, and the **Local** tab for writing the eval functions themselves.

    ### Pick evals when you run an experiment

    In the New Experiment dialog, the **Evaluations** field is a multi-select populated from the evals your deployed handler registers. Selecting a prompt auto-fills the evaluations from its `test_settings` frontmatter.

    <img alt="New Experiment dialog in the AgentMark Dashboard showing the evals selector" />

    The New Experiment dialog includes an **Evaluations** multi-select. Options come from the evals your deployed handler registers.

    ### See results as scores

    Eval results appear as per-row scores and run-level aggregates in the experiment detail view.

    <img alt="Experiment detail view showing per-row evaluator scores and aggregate metrics" />

    The experiment detail view lists each dataset row with its evaluator scores, plus aggregate metrics for the run such as average score. See [Running experiments](/evaluate/running-experiments) for the full detail view.

    <Note>
      The same score configs power human annotation. Reviewers score traces against the configs you declare in `agentmark.json`. See [Human annotation](/evaluate/annotations).
    </Note>
  </Tab>

  <Tab title="Local">
    ## Quick start

    **1. Define score schemas** in `agentmark.json`:

    ```json agentmark.json theme={null}
    {
      "scores": {
        "accuracy": {
          "type": "boolean",
          "description": "Was the response factually correct?"
        }
      }
    }
    ```

    **2. Create an eval function** that returns `{passed, score, reason}`:

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        export const accuracy = async ({ output, expectedOutput, input }) => {
          // expectedOutput is optional, so guard against rows without an expected value
          const match = output.trim() === expectedOutput?.trim();
          return {
            passed: match,
            score: match ? 1.0 : 0.0,
            reason: match ? undefined : `Expected "${expectedOutput}", got "${output}"`
          };
        };
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark.prompt_core import EvalParams, EvalResult

        def accuracy(params: EvalParams) -> EvalResult:
            output = str(params["output"]).strip()
            expected = str(params["expectedOutput"]).strip()
            match = output == expected

            return {
                "passed": match,
                "score": 1.0 if match else 0.0,
                "reason": None if match else f'Expected "{expected}", got "{output}"'
            }
        ```
      </Tab>
    </Tabs>

    **3. Reference in your prompt's frontmatter:**

    ```mdx theme={null}
    ---
    name: sentiment-classifier
    test_settings:
      dataset: ./datasets/sentiment.jsonl
      evals:
        - accuracy
    ---
    ```

    ## Writing eval functions

    Eval functions are plain functions that score prompt outputs. Score schemas are defined separately in `agentmark.json` — the two are connected by name.

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        const evals = {
          accuracy: async ({ output, expectedOutput }) => {
            const match = output.trim() === expectedOutput?.trim();
            return { passed: match, score: match ? 1 : 0 };
          },
          tone: async ({ output }) => {
            const isProfessional = !output.includes("lol") && !output.includes("!!!");
            return {
              passed: isProfessional,
              label: isProfessional ? "professional" : "casual",
            };
          },
        };
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark.prompt_core import EvalParams, EvalResult

        def accuracy(params: EvalParams) -> EvalResult:
            # Compare the trimmed output with the trimmed expected output
            match = str(params["output"]).strip() == str(params.get("expectedOutput", "")).strip()
            return {"passed": match, "score": 1.0 if match else 0.0}

        def tone(params: EvalParams) -> EvalResult:
            output = str(params["output"])
            is_professional = "lol" not in output and "!!!" not in output
            return {
                "passed": is_professional,
                "label": "professional" if is_professional else "casual",
            }

        evals = {"accuracy": accuracy, "tone": tone}
        ```
      </Tab>
    </Tabs>

    Define the corresponding score schemas in `agentmark.json` so they are available in the Dashboard:

    ```json agentmark.json theme={null}
    {
      "scores": {
        "accuracy": {
          "type": "boolean",
          "description": "Was the response factually correct?"
        },
        "tone": {
          "type": "categorical",
          "description": "Response tone classification",
          "categories": [
            { "label": "professional", "value": 1 },
            { "label": "casual", "value": 0.5 },
            { "label": "inappropriate", "value": 0 }
          ]
        }
      }
    }
    ```

    Score schemas defined in `agentmark.json` are synced to AgentMark Cloud through the [deployment pipeline](/deploy/deployment) and remain available in the Dashboard across deployments.
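
    To make the link between an eval's `label` and a categorical schema concrete, here is a minimal sketch that resolves a label against the `categories` list from the example config above. The `resolveCategoryValue` helper and the local `CategoricalScore` type are illustrative assumptions, not part of the AgentMark SDK, and how AgentMark Cloud performs this lookup internally is not described here.

    ```typescript
    // Local stand-in for a categorical entry in agentmark.json (assumed shape,
    // copied from the example config above).
    interface CategoricalScore {
      type: "categorical";
      description: string;
      categories: { label: string; value: number }[];
    }

    const toneSchema: CategoricalScore = {
      type: "categorical",
      description: "Response tone classification",
      categories: [
        { label: "professional", value: 1 },
        { label: "casual", value: 0.5 },
        { label: "inappropriate", value: 0 },
      ],
    };

    // Hypothetical helper: returns the numeric value for a label, or undefined
    // when the eval returned a label the schema does not declare.
    function resolveCategoryValue(
      schema: CategoricalScore,
      label: string
    ): number | undefined {
      return schema.categories.find((c) => c.label === label)?.value;
    }

    const professionalValue = resolveCategoryValue(toneSchema, "professional");
    const casualValue = resolveCategoryValue(toneSchema, "casual");
    const unknownValue = resolveCategoryValue(toneSchema, "sarcastic");
    ```

    An `undefined` result is a useful signal that an eval function and its schema have drifted apart.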

    ## Function signature

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        interface EvalParams {
          input: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
          output: string | Record<string, unknown> | Array<Record<string, unknown> | string>;
          expectedOutput?: string;  // Maps from dataset's expected_output field
          metadata?: Record<string, unknown> | null;
        }

        interface EvalResult {
          score?: number;    // Numeric score (0-1 recommended)
          passed?: boolean;  // Pass/fail status (used by --threshold)
          label?: string;    // Classification label for categorization
          reason?: string;   // Explanation for the result
        }

        type EvalFunction = (params: EvalParams) => Promise<EvalResult> | EvalResult;
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from typing import Any, TypedDict, Callable, Awaitable

        class EvalParams(TypedDict, total=False):
            input: str | dict[str, Any] | list[dict[str, Any] | str]
            output: str | dict[str, Any] | list[dict[str, Any] | str]
            expectedOutput: str | None  # Note: camelCase in Python
            metadata: dict[str, Any] | None

        class EvalResult(TypedDict, total=False):
            passed: bool      # Pass/fail status
            score: float      # Numeric score (0-1)
            reason: str       # Explanation for failure
            label: str        # Custom label for categorization

        # Both sync and async functions are supported
        EvalFunction = Callable[[EvalParams], EvalResult | Awaitable[EvalResult]]
        ```
      </Tab>
    </Tabs>

    ## Registering evals

    Pass your eval functions to the AgentMark client using the `evals` option:

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        const client = createAgentMarkClient({
          loader,
          modelRegistry,
          evals: {
            accuracy: accuracyFn,
            contains_keyword: containsKeywordFn,
          },
        });
        ```
      </Tab>

      <Tab title="Python (Pydantic AI)">
        ```python theme={null}
        from agentmark_pydantic_ai_v0 import (
            create_pydantic_ai_client,
            PydanticAIModelRegistry,
        )
        from agentmark.prompt_core import FileLoader

        model_registry = PydanticAIModelRegistry()
        model_registry.register_providers({
            "openai": "openai",
            "anthropic": "anthropic",
        })

        evals = {
            "accuracy": accuracy,
            "contains_keyword": contains_keyword,
        }

        client = create_pydantic_ai_client(
            model_registry=model_registry,
            loader=FileLoader(base_dir="./"),
            evals=evals,
        )
        ```
      </Tab>

      <Tab title="Python (Claude Agent SDK)">
        ```python theme={null}
        from agentmark_claude_agent_sdk_v0 import (
            create_claude_agent_client,
            ClaudeAgentModelRegistry,
            ClaudeAgentAdapterOptions,
        )

        model_registry = ClaudeAgentModelRegistry()
        model_registry.register_providers({
            "anthropic": "anthropic",
        })

        evals = {
            "accuracy": accuracy,
        }

        client = create_claude_agent_client(
            model_registry=model_registry,
            adapter_options=ClaudeAgentAdapterOptions(
                permission_mode="bypassPermissions",
            ),
            evals=evals,
        )
        ```
      </Tab>
    </Tabs>

    ### Eval function registry

    Eval functions are plain functions mapped by name. In TypeScript, use `Record<string, EvalFunction>`. In Python, use `dict[str, EvalFunction]`. Score schemas are defined separately in `agentmark.json` and deployed to AgentMark Cloud. The eval functions run during experiments and are connected to scores by name.

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        import type { EvalFunction } from "@agentmark-ai/prompt-core";

        const evals: Record<string, EvalFunction> = {
          accuracy: accuracyFn,
          relevance: relevanceFn,
        };

        // Standard object operations
        evals["new_eval"] = newEvalFn;
        const fn = evals["accuracy"];
        const exists = "accuracy" in evals;
        delete evals["accuracy"];
        const names = Object.keys(evals);
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark.prompt_core import EvalFunction

        evals: dict[str, EvalFunction] = {
            "accuracy": accuracy_fn,
            "relevance": relevance_fn,
        }

        # Standard dict operations
        evals["new_eval"] = new_eval_fn
        fn = evals["accuracy"]
        exists = "accuracy" in evals
        del evals["accuracy"]
        names = list(evals.keys())
        ```
      </Tab>
    </Tabs>
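
    Because eval functions and score schemas are matched only by name, a typo on either side can go unnoticed. The sketch below shows one way to check for mismatches before running an experiment. `findUnmatchedEvals` is a hypothetical helper, and the `scores` object is a stand-in for the `scores` section of `agentmark.json`.

    ```typescript
    // Hypothetical helper: names of registered evals that have no score schema
    // under the same key.
    function findUnmatchedEvals(
      evals: Record<string, unknown>,
      scores: Record<string, unknown>
    ): string[] {
      const schemaNames = new Set(Object.keys(scores));
      return Object.keys(evals).filter((name) => !schemaNames.has(name));
    }

    // Stand-in registry: `tone_check` deliberately does not match the schema key.
    const evals = {
      accuracy: () => ({ passed: true, score: 1 }),
      tone_check: () => ({ passed: true, label: "professional" }),
    };

    // Stand-in for the `scores` section of agentmark.json.
    const scores = {
      accuracy: { type: "boolean" },
      tone: { type: "categorical" },
    };

    const unmatched = findUnmatchedEvals(evals, scores);
    ```

    Here `unmatched` contains only `tone_check`, which points at the naming mismatch with the `tone` schema.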

    <Note>
      All fields in `EvalResult` are optional. Return whichever fields are relevant to your eval. The `passed` field is used by the CLI `--threshold` flag to calculate pass rates.
    </Note>
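
    As a rough illustration of how a pass rate relates to `--threshold`, the sketch below computes the percentage of results with `passed: true`. It assumes that results omitting `passed` are left out of the calculation; the CLI's actual handling of such results is not specified here, and `passRate` is a hypothetical helper.

    ```typescript
    interface EvalResult {
      score?: number;
      passed?: boolean;
      label?: string;
      reason?: string;
    }

    // Hypothetical helper: percentage of results whose `passed` is true,
    // ignoring results that do not set `passed` at all (an assumption).
    function passRate(results: EvalResult[]): number {
      const judged = results.filter((r) => r.passed !== undefined);
      if (judged.length === 0) return 0;
      const passedCount = judged.filter((r) => r.passed === true).length;
      return (passedCount / judged.length) * 100;
    }

    const results: EvalResult[] = [
      { passed: true, score: 1 },
      { passed: false, score: 0, reason: "mismatch" },
      { passed: true },
      { label: "casual" }, // no `passed`, so it is skipped in this sketch
    ];

    const rate = passRate(results);
    // Compare against a threshold expressed as a percentage, as in `--threshold 80`.
    const meetsThreshold = rate >= 80;
    ```

    With two of three judged results passing, the rate is roughly 66.7, so an 80 percent threshold would not be met.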

    ## Evaluation types

    ### Reference-based (ground truth)

    Compare outputs against known correct answers:

    ```typescript theme={null}
    export const exact_match = async ({ output, expectedOutput }) => {
      return {
        passed: output === expectedOutput,
        score: output === expectedOutput ? 1 : 0
      };
    };
    ```

    **Use for:** Classification, extraction, math problems, multiple choice

    ### Reference-free (heuristic)

    Check structural requirements without ground truth:

    ```typescript theme={null}
    export const has_required_fields = async ({ output }) => {
      const required = ['name', 'email', 'summary'];
      const hasAll = required.every(field => output[field]);
      return {
        passed: hasAll,
        score: hasAll ? 1 : 0,
        reason: hasAll ? undefined : 'Missing required fields'
      };
    };
    ```

    **Use for:** Format validation, length checks, required content

    ### Model-graded (LLM-as-judge)

    Use an LLM to evaluate subjective criteria:

    ```typescript theme={null}
    import { generateObject } from 'ai';
    import { openai } from '@ai-sdk/openai';
    import { z } from 'zod';

    export const tone_eval = async ({ output, expectedOutput }) => {
      const { object } = await generateObject({
        model: openai('gpt-4o-mini'),
        schema: z.object({ passed: z.boolean(), reasoning: z.string() }),
        prompt: `Evaluate if this response has appropriate ${expectedOutput} tone:\n\n${output}`,
        temperature: 0.1,
      });

      return {
        passed: object.passed,
        score: object.passed ? 1 : 0,
        reason: object.reasoning,
      };
    };
    ```

    **Use for:** Tone, creativity, helpfulness, semantic similarity

    <Tip>
      **Combine approaches** - Use reference-based for correctness, reference-free for structure, and model-graded for subjective quality.
    </Tip>

    ## Common patterns

    ### Classification

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        export const classification_accuracy = async ({ output, expectedOutput }) => {
          const match = output.trim().toLowerCase() === expectedOutput.trim().toLowerCase();
          return {
            passed: match,
            score: match ? 1 : 0,
            reason: match ? undefined : `Expected ${expectedOutput}, got ${output}`
          };
        };
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark.prompt_core import EvalParams, EvalResult

        def classification_accuracy(params: EvalParams) -> EvalResult:
            output = str(params["output"]).strip().lower()
            expected = str(params["expectedOutput"]).strip().lower()
            match = output == expected

            return {
                "passed": match,
                "score": 1.0 if match else 0.0,
                "reason": None if match else f"Expected {expected}, got {output}"
            }
        ```
      </Tab>
    </Tabs>

    ### Contains keyword

    <Tabs>
      <Tab title="TypeScript">
        ```typescript theme={null}
        export const contains_keyword = async ({ output, expectedOutput }) => {
          const contains = output.includes(expectedOutput);
          return {
            passed: contains,
            score: contains ? 1 : 0,
            reason: contains ? undefined : `Output missing "${expectedOutput}"`
          };
        };
        ```
      </Tab>

      <Tab title="Python">
        ```python theme={null}
        from agentmark.prompt_core import EvalParams, EvalResult

        def contains_keyword(params: EvalParams) -> EvalResult:
            output = str(params["output"])
            expected = str(params["expectedOutput"])
            contains = expected in output

            return {
                "passed": contains,
                "score": 1.0 if contains else 0.0,
                "reason": None if contains else f'Output missing "{expected}"'
            }
        ```
      </Tab>
    </Tabs>

    ### Field presence

    ```typescript theme={null}
    export const required_fields = async ({ output }) => {
      const required = ['name', 'email', 'message'];
      const missing = required.filter(field => !(field in output));

      return {
        passed: missing.length === 0,
        score: (required.length - missing.length) / required.length,
        reason: missing.length > 0 ? `Missing: ${missing.join(', ')}` : undefined
      };
    };
    ```

    ### Length check

    ```typescript theme={null}
    export const length_check = async ({ output }) => {
      const length = output.length;
      const passed = length >= 10 && length <= 500;
      return {
        passed,
        score: passed ? 1 : 0,
        reason: passed ? undefined : `Length ${length} outside range [10, 500]`
      };
    };
    ```

    ### Format validation

    ```typescript theme={null}
    export const email_format = async ({ output }) => {
      const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
      const passed = emailRegex.test(output);
      return {
        passed,
        score: passed ? 1 : 0,
        reason: passed ? undefined : 'Invalid email format'
      };
    };
    ```

    ### Graduated scoring

    Use the `label` field to categorize results:

    ```typescript theme={null}
    export const sentiment_gradual = async ({ output, expectedOutput }) => {
      if (output === expectedOutput) {
        return { passed: true, score: 1.0, label: 'exact_match' };
      }

      const partialMatches = {
        'positive': ['very positive', 'somewhat positive'],
        'negative': ['very negative', 'somewhat negative']
      };

      if (partialMatches[expectedOutput]?.includes(output)) {
        return {
          passed: true,
          score: 0.7,
          label: 'partial_match',
          reason: 'Close semantic match'
        };
      }

      return {
        passed: false,
        score: 0,
        label: 'no_match',
        reason: `Expected ${expectedOutput}, got ${output}`
      };
    };
    ```

    Filter by label (`exact_match`, `partial_match`, `no_match`) to understand patterns.
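
    Where you filter depends on your tooling. As a minimal local sketch, the labels returned by `sentiment_gradual` can be tallied like this. The `countByLabel` helper and the sample results are illustrative only.

    ```typescript
    interface EvalResult {
      score?: number;
      passed?: boolean;
      label?: string;
      reason?: string;
    }

    // Hypothetical helper: count how many results carry each label.
    // Results without a label are grouped under "unlabeled".
    function countByLabel(results: EvalResult[]): Record<string, number> {
      const counts: Record<string, number> = {};
      for (const result of results) {
        const key = result.label ?? "unlabeled";
        counts[key] = (counts[key] ?? 0) + 1;
      }
      return counts;
    }

    // Illustrative sample results, shaped like the returns of sentiment_gradual.
    const sampleResults: EvalResult[] = [
      { passed: true, score: 1.0, label: "exact_match" },
      { passed: true, score: 0.7, label: "partial_match", reason: "Close semantic match" },
      { passed: false, score: 0, label: "no_match", reason: "Expected positive, got negative" },
      { passed: true, score: 1.0, label: "exact_match" },
    ];

    const counts = countByLabel(sampleResults);
    ```

    A high share of `partial_match` results, for example, may suggest the prompt needs tighter output instructions rather than a different model.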

    ## LLM-as-judge

    ### Using AgentMark prompts (recommended)

    **1. Create eval prompt** (`agentmark/evals/tone-judge.prompt.mdx`):

    ```mdx theme={null}
    ---
    name: tone-judge
    object_config:
      model_name: openai/gpt-4o-mini
      temperature: 0.1
      schema:
        type: object
        properties:
          passed:
            type: boolean
          reasoning:
            type: string
    ---

    <System>
    You are evaluating whether an AI response has appropriate professional tone.

    First explain your reasoning step-by-step, then provide your final judgment.
    </System>

    <User>
    **Output to evaluate:**
    {props.output}

    **Expected tone:**
    {props.expectedOutput}
    </User>
    ```

    **2. Use in eval function**:

    ```typescript theme={null}
    import { client } from './agentmark-client';
    import { generateObject } from 'ai';

    export const tone_check = async ({ output, expectedOutput }) => {
      const evalPrompt = await client.loadObjectPrompt('evals/tone-judge.prompt.mdx');
      const formatted = await evalPrompt.format({
        props: { output, expectedOutput }
      });
      const { object } = await generateObject(formatted);

      return {
        passed: object.passed,
        score: object.passed ? 1 : 0,
        reason: object.reasoning,
      };
    };
    ```

    **Benefits**: Version control eval logic, iterate independently, reuse prompts, leverage templating.

    ### LLM-as-judge best practices

    **Configuration**:

    * Use low temperature (0.1-0.3) for consistency
    * Ask for reasoning before judgment (chain-of-thought)
    * Use binary scoring (PASS/FAIL) not scales (1-10)
    * Test one dimension at a time

    **Model selection**:

    * Use a stronger model to grade weaker models (GPT-4 → GPT-3.5)
    * Avoid grading a model with itself
    * Validate with human evaluation before scaling

    **Usage**:

    * Use sparingly, since LLM judges are slower and more expensive than code-based checks
    * Reserve for subjective criteria
    * Watch for position bias, verbosity bias, self-enhancement bias
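
    One way to act on "validate with human evaluation before scaling" is to measure how often the judge agrees with human verdicts on a small hand-labeled sample. The sketch below is illustrative: `agreementRate` is a hypothetical helper, and the verdict arrays are made-up sample data.

    ```typescript
    // Hypothetical helper: fraction of examples where the LLM judge's pass/fail
    // verdict matches a human reviewer's verdict for the same example.
    function agreementRate(judge: boolean[], human: boolean[]): number {
      if (judge.length !== human.length) {
        throw new Error("Verdict lists must have the same length");
      }
      if (judge.length === 0) {
        throw new Error("Need at least one labeled example");
      }
      const agreed = judge.filter((verdict, i) => verdict === human[i]).length;
      return agreed / judge.length;
    }

    // Made-up sample: the judge and the human disagree on the second example.
    const judgeVerdicts = [true, true, false, true, false];
    const humanVerdicts = [true, false, false, true, false];

    const agreement = agreementRate(judgeVerdicts, humanVerdicts);
    ```

    What counts as acceptable agreement is a judgment call for your use case; the point is to measure it before relying on the judge at scale.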

    <Warning>
      **Avoid exact-match for open-ended outputs** - Use only for classification or short outputs. For longer text, use semantic similarity or LLM-based evaluation.
    </Warning>

    ## Domain-specific evals

    ### RAG (retrieval-augmented generation)

    ```typescript theme={null}
    export const faithfulness = async ({ output, input }) => {
      const context = input.retrieved_context;
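      // extractClaims and isSupported are placeholders: supply your own
      // claim-extraction and support-checking logic.
      const claims = extractClaims(output);
      const supported = claims.every(claim => isSupported(claim, context));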

      return {
        passed: supported,
        score: supported ? 1 : 0,
        reason: supported ? undefined : 'Output contains unsupported claims'
      };
    };

    export const answer_relevancy = async ({ output, input }) => {
      const isRelevant = output.toLowerCase().includes(input.query.toLowerCase());
      return {
        passed: isRelevant,
        score: isRelevant ? 1 : 0,
        reason: isRelevant ? undefined : 'Answer not relevant to query'
      };
    };
    ```

    ### Agent / tool calling

    ```typescript theme={null}
    export const tool_correctness = async ({ output, expectedOutput }) => {
      const correctTool = output.tool === expectedOutput.tool;
      const correctParams = JSON.stringify(output.parameters) ===
                           JSON.stringify(expectedOutput.parameters);

      return {
        passed: correctTool && correctParams,
        // Partial credit only when the right tool was chosen with wrong parameters
        score: correctTool && correctParams ? 1 : correctTool ? 0.5 : 0,
        reason: !correctTool ? 'Wrong tool selected' :
                !correctParams ? 'Incorrect parameters' : undefined
      };
    };
    ```

    ## Best practices

    * Test one thing per eval - separate functions for different criteria
    * Provide helpful failure reasons for debugging
    * Use meaningful names (`sentiment_accuracy` not `eval1`)
    * Keep scores in 0-1 range
    * Make evals deterministic and consistent (avoid flaky tests)
    * Validate general behavior, not specific outputs (avoid overfitting)
  </Tab>
</Tabs>

## Next steps

<CardGroup>
  <Card title="Datasets" icon="database" href="/evaluate/datasets">
    Create test datasets
  </Card>

  <Card title="Running Experiments" icon="flask" href="/evaluate/running-experiments">
    Run your evaluations
  </Card>

  <Card title="Testing Overview" icon="clipboard-list" href="/evaluate/overview">
    Learn testing concepts
  </Card>
</CardGroup>

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Quickstart
Source: https://docs.agentmark.co/getting-started/quickstart

Bootstrap an AgentMark project, let your AI tool wire it into your codebase, run your first prompt — in under 5 minutes.

`npm create agentmark` does the absolute minimum bootstrap — writes `agentmark.json`, creates an empty `agentmark/` directory, installs the [AgentMark agent skill](/sdk-reference/tools/agentmark-mcp) into your IDE, and hands off to your AI tool. The AI tool reads your project, asks the docs MCP for the right integration pattern, and wires the SDK into your existing code. No template menu, no opinionated scaffolding.

## Prerequisites

* Node.js 18+
* An AI-tool-aware editor: [Claude Code](https://code.claude.com), [Cursor](https://cursor.com), [VS Code](https://code.visualstudio.com/) (Copilot Chat), or [Zed](https://zed.dev)
* An LLM provider API key (OpenAI, Anthropic, etc.) for the model you want to run

## Step 1: Bootstrap

Run from inside your project directory (or pass a folder name to scaffold a fresh one):

<CodeGroup>
  ```bash npm theme={null}
  npm create agentmark@latest
  ```

  ```bash yarn theme={null}
  yarn create agentmark
  ```

  ```bash pnpm theme={null}
  pnpm create agentmark
  ```
</CodeGroup>

The CLI asks two short questions, scaffolds, and exits:

```text theme={null}
? Where would you like to set up AgentMark?  .
? Wire AgentMark MCP into which IDE clients?
  Space to toggle. Enter to submit. Skip all = empty selection.
  ◉ Claude Code
  ◉ Cursor
  ◉ VS Code
  ◉ Zed

✅ agentmark.json
✅ agentmark/ (empty, ready for your .prompt.mdx files)
✅ MCP wired (Claude Code): .mcp.json
✅ MCP wired (Cursor): .cursor/mcp.json
✅ MCP wired (VS Code): .vscode/mcp.json
✅ MCP wired (Zed): .zed/settings.json

📚 Installing AgentMark agent skill...
✅ Agent skill installed at ./.agents/skills/agentmark/

✨ AgentMark is wired up.

   Next: open this project in Claude Code, Cursor, VS Code, or Zed and say:

       "Set up AgentMark in this project."
```

<Tip>
  **Non-interactive (CI / scripting):**

  ```bash theme={null}
  npm create agentmark@latest my-app -- --client all --overwrite
  ```

  Flags: `--path <dir>` • `--client <id|all>` (repeatable or comma-separated) • `--overwrite` (replace existing `agentmark.json`) • positional folder name.
</Tip>

## Step 2: Tell your AI tool to integrate

Open your project in Claude Code, Cursor, VS Code, or Zed and send the agent this message:

> **Set up AgentMark in this project.**

The AgentMark skill takes over. It:

1. Detects your project's framework (Next.js, FastAPI, Hono, plain Node, etc.)
2. Queries the docs MCP for the right integration recipe
3. Proposes a concrete plan back to you — packages to install, where the client file goes, what your first prompt looks like
4. After you confirm, installs the SDK, writes the client (`agentmark.client.ts` / `agentmark_client.py`), scaffolds a first prompt, and smoke-tests it

It will **not** touch your existing LLM-SDK call sites during setup. Migrating those is a separate confirmation — ask the agent when you're ready.

## Step 3: Add your provider key

The agent will tell you which env var to set for the model it picked. For OpenAI's `gpt-4o-mini` (the common default) that's:

```bash theme={null}
echo "OPENAI_API_KEY=sk-..." >> .env
```

## Step 4: Run your first prompt

<Tabs>
  <Tab title="Local">
    Start the dev server (keep it running in a separate terminal):

    ```bash theme={null}
    npx agentmark dev
    ```

    Then run the prompt the agent scaffolded (the agent will tell you the path; `chat.prompt.mdx` is the conventional default):

    ```bash theme={null}
    npx agentmark run-prompt agentmark/chat.prompt.mdx --props '{"message":"hello"}'
    ```

    The CLI prints the model output, token counts, cost estimate, and a `📊 View trace` URL you can open in the browser for the full span tree.

    <Note>
      The dev server listens on ports `9418` (API), `9417` (webhook), and `3000` (UI app). Override with `--api-port` / `--webhook-port` / `--app-port` if you need different ports.
    </Note>
  </Tab>

  <Tab title="Cloud">
    1. Commit and push your project to a Git repository (GitHub or GitLab).
    2. In the [AgentMark Dashboard](https://app.agentmark.co), click **Create App** and select your repository.
    3. Add your LLM provider API key in **Settings → Environment Variables**.

    <img alt="Apps list in the AgentMark Dashboard showing the Create App button" />

    Once connected, AgentMark Cloud syncs your prompts on every push. Open a prompt and click **Run** — output streams back in real time.

    <img alt="Running a prompt in the AgentMark Dashboard" />
  </Tab>
</Tabs>

## Step 5: Run an experiment

An experiment runs a prompt against a dataset and scores each row. Add a `test_settings` block to your prompt's frontmatter pointing at a `.jsonl` dataset (see [Datasets](/evaluate/datasets) for the row shape), then:

```bash theme={null}
npx agentmark run-experiment agentmark/chat.prompt.mdx --threshold 80
```

The CLI runs every row, applies your evaluators, prints a results table, and **exits non-zero if pass rate is below `--threshold`** — wire that into CI for prompt regression gating.

<Tip>
  Need worked examples? See [Example prompts](/build/example-prompts) — four copy-paste recipes covering all four generation types (object, text+tools, image, speech).
</Tip>

## What's in your project after bootstrap

| File                                    | Source                     | Purpose                                                                 |
| --------------------------------------- | -------------------------- | ----------------------------------------------------------------------- |
| `agentmark.json`                        | CLI                        | Project config — `agentmarkPath`, `version`, models, scores             |
| `agentmark/.gitkeep`                    | CLI                        | Empty prompts directory (drop `.prompt.mdx` files here)                 |
| `.mcp.json` (and per-IDE configs)       | CLI                        | MCP wiring — docs MCP, agentmark-mcp (Cloud), agentmark-local (dev)     |
| `.agents/skills/agentmark/`             | CLI (via `npx skills add`) | Agent skill that knows AgentMark — teaches Claude Code / Cursor / etc.  |
| `agentmark.client.ts` (or `_client.py`) | **Skill**                  | Configured SDK client — added when you ask the AI tool to integrate     |
| Your first `.prompt.mdx`                | **Skill**                  | Scaffolded by the AI tool, named for your use case                      |
| `.env`                                  | **You**                    | Provider API key(s); `AGENTMARK_API_KEY` / `AGENTMARK_APP_ID` for Cloud |

The CLI ships **only** the unopinionated bits. Everything stack-specific comes from the AI tool reading your project + the docs MCP — so the integration matches whatever framework you're already on.

## Next steps

<CardGroup>
  <Card title="Build Prompts" icon="hammer" href="/build/overview">
    Author `.prompt.mdx` files: text, object, image, speech
  </Card>

  <Card title="Example prompts" icon="lightbulb" href="/build/example-prompts">
    Copy-paste starters for all four generation types
  </Card>

  <Card title="Evaluate" icon="check" href="/evaluate/overview">
    Test prompts with datasets + evaluators; gate CI on regressions
  </Card>

  <Card title="Observe" icon="chart-line" href="/observe/overview">
    Traces, sessions, cost-and-token tracking
  </Card>

  <Card title="Integrations" icon="plug" href="/integrations/overview">
    Vercel AI SDK, Mastra, Claude Agent SDK, Pydantic AI
  </Card>

  <Card title="Deploy" icon="rocket" href="/deploy/deployment">
    Git-based deploys to AgentMark Cloud
  </Card>
</CardGroup>



# Custom integration
Source: https://docs.agentmark.co/integrations/custom

Learn how to create your own custom integration with AgentMark.

AgentMark uses adapters to convert its normalized prompt configuration (`PromptShape`) into model-specific input formats.

This guide walks you through how to build a custom adapter, using the AI SDK adapter as a reference implementation.

## What is an adapter?

An **adapter** implements methods to transform AgentMark prompt types (`text`, `object`, `image`, `speech`) into provider-specific input formats (OpenAI chat completions, Claude messages, etc.).

It takes the raw prompt configuration (`TextConfig`, `ObjectConfig`, `ImageConfig`, `SpeechConfig`) and combines it with:

* Model configuration (name, temperature, max tokens, etc.)
* Tool definitions (functions to be called during inference)
* Optional telemetry or metadata

## Requirements

To create an adapter, implement the `Adapter` interface from `@agentmark-ai/prompt-core`:

```typescript theme={null}
export interface Adapter<D extends PromptShape<D>> {
  readonly __dict: D;
  readonly __name: string;

  adaptText<_K extends KeysWithKind<D, "text"> & string>(
    input: TextConfig,
    options: AdaptOptions,
    metadata: PromptMetadata
  ): any;

  adaptObject<_K extends KeysWithKind<D, "object"> & string>(
    input: ObjectConfig,
    options: AdaptOptions,
    metadata: PromptMetadata
  ): any;

  adaptImage<_K extends KeysWithKind<D, "image"> & string>(
    input: ImageConfig,
    options: AdaptOptions
  ): any;

  adaptSpeech<_K extends KeysWithKind<D, "speech"> & string>(
    input: SpeechConfig,
    options: AdaptOptions
  ): any;

  // Optional — adapters that support dev mode provide this.
  getDevServerFactory?(): (options: { port: number; client: any }) => Promise<any>;
}
```

Note the asymmetry: `adaptText` and `adaptObject` receive a third `metadata: PromptMetadata` argument; `adaptImage` and `adaptSpeech` do not.

### Minimal adapter skeleton

```typescript theme={null}
import {
  Adapter,
  AdaptOptions,
  ImageConfig,
  KeysWithKind,
  ObjectConfig,
  PromptMetadata,
  PromptShape,
  SpeechConfig,
  TextConfig,
} from "@agentmark-ai/prompt-core";

export class MyCustomAdapter<T extends PromptShape<T>> implements Adapter<T> {
  declare readonly __dict: T;
  readonly __name = "my-custom-adapter";

  adaptText<_K extends KeysWithKind<T, "text"> & string>(
    input: TextConfig,
    options: AdaptOptions,
    metadata: PromptMetadata
  ) {
    // Transform AgentMark config into your provider's request shape.
    return { /* ... */ };
  }

  adaptObject<_K extends KeysWithKind<T, "object"> & string>(
    input: ObjectConfig,
    options: AdaptOptions,
    metadata: PromptMetadata
  ) {
    return { /* ... */ };
  }

  adaptImage<_K extends KeysWithKind<T, "image"> & string>(
    input: ImageConfig,
    options: AdaptOptions
  ) {
    return { /* ... */ };
  }

  adaptSpeech<_K extends KeysWithKind<T, "speech"> & string>(
    input: SpeechConfig,
    options: AdaptOptions
  ) {
    return { /* ... */ };
  }
}
```

<Note>
  Each method must transform the AgentMark prompt config into the correct shape for your target provider.
</Note>
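
To give a feel for what such a transformation might look like, here is a minimal, self-contained sketch of an `adaptText`-style function that maps a text prompt config onto a chat-style request body. Every type in it (`SketchTextConfig`, `SketchMessage`, `SketchChatRequest`) is a simplified, hypothetical stand-in defined locally; the real `TextConfig` in `@agentmark-ai/prompt-core` and your provider's request format may differ, and stripping the `provider/` prefix from the model name is an assumption made for illustration.

```typescript
// Simplified, hypothetical stand-ins. The real TextConfig lives in
// @agentmark-ai/prompt-core and may have a different shape.
interface SketchMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface SketchTextConfig {
  name: string;
  messages: SketchMessage[];
  text_config: { model_name: string; temperature?: number; max_tokens?: number };
}

// Hypothetical provider request shape, loosely modeled on chat-style APIs.
interface SketchChatRequest {
  model: string;
  messages: SketchMessage[];
  temperature?: number;
  max_tokens?: number;
}

function adaptTextSketch(input: SketchTextConfig): SketchChatRequest {
  const { model_name, temperature, max_tokens } = input.text_config;
  // Assumption: drop an optional "provider/" prefix such as "openai/".
  const model = model_name.includes("/")
    ? model_name.split("/").slice(1).join("/")
    : model_name;
  return {
    model,
    messages: input.messages,
    // Only forward optional settings that the prompt actually set.
    ...(temperature !== undefined ? { temperature } : {}),
    ...(max_tokens !== undefined ? { max_tokens } : {}),
  };
}

const request = adaptTextSketch({
  name: "greeting",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Say hello." },
  ],
  text_config: { model_name: "openai/gpt-4o-mini", temperature: 0.1 },
});
```

In a real adapter the same logic would live inside `adaptText`, with `options` and `metadata` available for things like telemetry.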

## Creating a custom client

You can wrap `AgentMark` directly, or expose a factory that wires up your adapter:

```typescript theme={null}
import { AgentMark, Loader, PromptShape } from "@agentmark-ai/prompt-core";
import { MyCustomAdapter } from "./my-custom-adapter";

export function createCustomAgentMarkClient<D extends PromptShape<D>>(opts: {
  loader?: Loader<D>;
}) {
  return new AgentMark<D, MyCustomAdapter<D>>({
    loader: opts.loader,
    adapter: new MyCustomAdapter<D>(),
  });
}
```

`PromptShape<D>` describes the shape of your prompts (generated by `npx agentmark build` as `agentmark.types.ts`). `KeysWithKind<D, "object">` extracts the keys whose prompt kind is `object`. See [Type safety](/sdk-reference/typescript/type-safety) for details.

## Model registry

A **model registry** lets you register AI model constructors and look them up by name during `adaptText` / `adaptObject` / etc. It's useful for swapping providers or configuring per-model parameters.

```typescript theme={null}
class CustomModelRegistry {
  private models: Record<string, (name: string) => unknown> = {};

  register(modelName: string, creator: (name: string) => unknown) {
    this.models[modelName] = creator;
  }

  getModelFunction(name: string): (name: string) => unknown {
    const fn = this.models[name];
    if (!fn) throw new Error(`Model not registered: ${name}`);
    return fn;
  }
}
```

Your adapter's constructor can accept this registry and look up model instances when processing prompts. See the [AI SDK adapter source](https://github.com/agentmark-ai/agentmark/blob/main/packages/ai-sdk-v5-adapter/src/adapter.ts) for a reference implementation.
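
As a usage sketch, the registry above can be exercised on its own before it is wired into an adapter. The class is repeated so the example is self-contained; the `example-model` name and the object returned by the factory are hypothetical placeholders for a real SDK model instance.

```typescript
class CustomModelRegistry {
  private models: Record<string, (name: string) => unknown> = {};

  register(modelName: string, creator: (name: string) => unknown) {
    this.models[modelName] = creator;
  }

  getModelFunction(name: string): (name: string) => unknown {
    const fn = this.models[name];
    if (!fn) throw new Error(`Model not registered: ${name}`);
    return fn;
  }
}

const registry = new CustomModelRegistry();

// Hypothetical factory: a real adapter would return an SDK model object here.
registry.register("example-model", (name) => ({ provider: "example", name }));

// Look up the factory by name, then build the model instance.
const createModel = registry.getModelFunction("example-model");
const model = createModel("example-model");

// Looking up an unregistered name throws, which surfaces config typos early.
let lookupError = "";
try {
  registry.getModelFunction("missing-model");
} catch (err) {
  lookupError = err instanceof Error ? err.message : String(err);
}
```

Failing fast on unknown names keeps a misspelled `model_name` in prompt frontmatter from turning into a confusing provider error later.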

## Tools

Adapters accept native SDK tool objects as a `Record<string, Tool>` passed to `createAgentMarkClient`. Your adapter should implement a `resolveTools()` helper that takes the tool name strings from prompt frontmatter and looks them up in the provided `tools` record.

```typescript theme={null}
import { createCustomAgentMarkClient } from "./my-custom-adapter";
import { mySearchTool, myCalculatorTool } from "./tools";

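const client = createCustomAgentMarkClient({
  loader,
  // If your adapter supports tools, extend the factory options to accept them
  // and thread them through to the adapter, keyed by the names used in
  // prompt frontmatter, for example:
  // tools: { search: mySearchTool, calculator: myCalculatorTool },
});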
```

AgentMark prompt files reference tools by name:

```yaml theme={null}
---
tools:
  - search
  - calculator
---
```

The adapter's `resolveTools()` maps each name to its implementation before dispatching the LLM call. See the [AI SDK adapter source](https://github.com/agentmark-ai/agentmark/blob/main/packages/ai-sdk-v5-adapter/src/adapter.ts) for a reference implementation.
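
A name-to-implementation lookup of this kind can be sketched as follows. This is a minimal illustration, not the AI SDK adapter's actual code: `SketchTool` is a hypothetical stand-in for your SDK's native tool type, and throwing on an unknown name is one possible design choice.

```typescript
// Hypothetical stand-in for a native SDK tool object.
interface SketchTool {
  description: string;
  execute: (args: Record<string, unknown>) => unknown;
}

// Map tool names from prompt frontmatter to their registered implementations.
function resolveTools(
  names: string[],
  tools: Record<string, SketchTool>
): Record<string, SketchTool> {
  const resolved: Record<string, SketchTool> = {};
  for (const name of names) {
    const tool = tools[name];
    if (tool === undefined) {
      throw new Error(`Tool not registered: ${name}`);
    }
    resolved[name] = tool;
  }
  return resolved;
}

// Tools registered with the client (illustrative implementations).
const registeredTools: Record<string, SketchTool> = {
  search: {
    description: "Search for a query",
    execute: (args) => `results for ${String(args["query"])}`,
  },
  calculator: {
    description: "Add two numbers",
    execute: (args) => Number(args["a"]) + Number(args["b"]),
  },
  translate: {
    description: "Not referenced by the prompt below",
    execute: (args) => String(args["text"]),
  },
};

// Names as they would appear under `tools:` in prompt frontmatter.
const resolvedTools = resolveTools(["search", "calculator"], registeredTools);
```

Only the tools a prompt names are passed on, so each prompt sees just the capabilities it declares.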



# Default integration
Source: https://docs.agentmark.co/integrations/fallback

Use AgentMark with the fallback adapter for raw prompt configuration output.

The default (fallback) adapter formats prompts into AgentMark's raw configuration format without targeting a specific AI SDK, so you can map the output to your preferred provider yourself.

<Note>
  With the Default integration, you can map the parameters directly to your preferred provider (e.g., OpenAI, Ollama) or your own SDK. This allows you to maintain flexibility while using AgentMark's interface.
</Note>

## Installation

```bash theme={null}
npm install @agentmark-ai/fallback-adapter @agentmark-ai/loader-file
```

## Usage

```typescript theme={null}
import { createAgentMarkClient } from "@agentmark-ai/fallback-adapter";
import { FileLoader } from "@agentmark-ai/loader-file";

const loader = new FileLoader("./dist/agentmark");

const agentmark = createAgentMarkClient({
  loader,
});

const prompt = await agentmark.loadTextPrompt("<example>.prompt.mdx");
const result = await prompt.format({
  props: {
    // prompt props
  },
});

console.log(result);
```

<Note>
  `FileLoader` takes the output directory from `npx agentmark build` (typically `./dist/agentmark`), not your source prompt directory. Prompts are pre-compiled to JSON by `build` before `FileLoader` reads them.
</Note>

## What it returns

The fallback adapter returns the raw prompt configuration as-is, without transforming it for a specific SDK. The output includes:

* **Model configuration** — model name, temperature, max tokens, and other parameters from your prompt's frontmatter
* **Messages** — the rendered system/user/assistant messages after template processing
* **Tool references** — tool names referenced in the prompt frontmatter
* **Schema** — for object prompts, the output schema

This raw config can be mapped to any provider's API format (OpenAI, Anthropic, Google, Ollama, etc.) in your own code.

## When to use

* **Unsupported SDK** — Your AI SDK doesn't have a dedicated AgentMark adapter yet
* **Custom provider** — You're calling a provider API directly without an SDK
* **Inspection/debugging** — You want to see the raw config AgentMark produces before sending it to a provider
* **Adapter development** — You're building a new adapter and want to understand the input format

For most applications, prefer a dedicated adapter like the [AI SDK](/integrations/typescript/ai-sdk) or [Claude Agent SDK](/integrations/typescript/claude-agent-sdk) for a streamlined experience.

## Next steps

<CardGroup>
  <Card title="AI SDK" icon="bolt" href="/integrations/typescript/ai-sdk">
    Recommended adapter for most apps
  </Card>

  <Card title="Custom adapter" icon="code" href="/integrations/custom">
    Build your own adapter
  </Card>

  <Card title="Prompts" icon="file-code" href="/build/overview">
    Learn about prompt syntax
  </Card>

  <Card title="All integrations" icon="plug" href="/integrations/overview">
    See all available adapters
  </Card>
</CardGroup>


# Overview
Source: https://docs.agentmark.co/integrations/overview

Connect AgentMark to your AI SDK through adapters

AgentMark works with multiple AI SDKs through adapters. Choose the adapter that fits your tech stack, or build your own.

## What are adapters?

Adapters connect AgentMark prompts to AI SDKs. They translate AgentMark's prompt format into the format your AI SDK expects.

**The pattern is always the same:**

1. Load prompt: `client.loadTextPrompt()` / `loadObjectPrompt()`
2. Format with props: `await prompt.format({ props: {...} })`
3. Pass the result to your AI SDK's generation function

## Available adapters

### AI SDK

The Vercel AI SDK adapter for Next.js and Node.js applications. Supports text, object, image, and speech generation with streaming.

```typescript theme={null}
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";

const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai });

const client = createAgentMarkClient({ loader, modelRegistry });
const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
const input = await prompt.format({ props: { name: "Alice" } });
const result = await generateText(input);
```

[Learn more →](/integrations/typescript/ai-sdk)

### Claude Agent SDK

Run AgentMark prompts as agentic tasks with Anthropic's Claude Agent SDK. Supports tool use, budget controls, and tracing.

```typescript theme={null}
import { createAgentMarkClient, ClaudeAgentModelRegistry } from "@agentmark-ai/claude-agent-sdk-v0-adapter";
import { query } from "@anthropic-ai/claude-agent-sdk";

const client = createAgentMarkClient({
  loader,
  modelRegistry: ClaudeAgentModelRegistry.createDefault(),
});

const prompt = await client.loadTextPrompt("task.prompt.mdx");
const adapted = await prompt.format({ props: { task: "Refactor auth module" } });

for await (const message of query({
  prompt: adapted.query.prompt,
  options: adapted.query.options,
})) {
  console.log(message);
}
```

[Learn more →](/integrations/typescript/claude-agent-sdk)

### Mastra

Built for agentic workflows and multi-step LLM applications with Mastra's framework.

```typescript theme={null}
import { createAgentMarkClient, MastraModelRegistry } from "@agentmark-ai/mastra-v0-adapter";
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";

const modelRegistry = new MastraModelRegistry();
modelRegistry.registerProviders({ openai });

const client = createAgentMarkClient({ loader, modelRegistry });
const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
const agentConfig = await prompt.formatAgent({ props: { name: "Alice" } });

const agent = new Agent(agentConfig);
const [messages, options] = await agent.formatMessages(...);
const result = await agent.generate(messages, options);
```

[Learn more →](/integrations/typescript/mastra)

### Pydantic AI

The recommended adapter for Python applications. Supports text, object generation, streaming, and type-safe outputs via Pydantic models.

```python theme={null}
from agentmark_pydantic_ai_v0 import (
    create_pydantic_ai_client,
    PydanticAIModelRegistry,
    run_text_prompt,
)
from agentmark.prompt_core import FileLoader

model_registry = PydanticAIModelRegistry()
model_registry.register_models(
    ["gpt-4o", "gpt-4o-mini"],
    lambda name, opts=None: f"openai:{name}",
)

loader = FileLoader("./dist/agentmark")
client = create_pydantic_ai_client(model_registry=model_registry, loader=loader)

prompt = await client.load_text_prompt("greeting.prompt.mdx")
params = await prompt.format(props={"name": "Alice"})
result = await run_text_prompt(params)
```

[Learn more →](/integrations/python/pydantic-ai)

### Default (fallback)

Returns raw prompt configuration without SDK-specific formatting. Useful for mapping to any provider directly.

```typescript theme={null}
import { createAgentMarkClient } from "@agentmark-ai/fallback-adapter";

const client = createAgentMarkClient({ loader });
const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
const result = await prompt.format({ props: { name: "Alice" } });
// Returns raw config — pass to your own provider
```

[Learn more →](/integrations/fallback)

### Custom adapter

Build your own adapter for any AI SDK by implementing the `Adapter<D>` interface.

[Learn more →](/integrations/custom)

## How to choose

| Adapter              | Best for                                              | Language            | Streaming                  | Image / speech |
| -------------------- | ----------------------------------------------------- | ------------------- | -------------------------- | -------------- |
| **AI SDK**           | Next.js, Node.js apps with broad model support        | TypeScript          | Yes                        | Yes            |
| **Claude Agent SDK** | Agentic tasks with Claude (tool use, budget controls) | TypeScript + Python | Yes (async message stream) | No             |
| **Mastra**           | Complex agentic workflows and orchestration           | TypeScript          | Yes                        | No             |
| **Pydantic AI**      | Python applications with type-safe outputs            | Python              | Yes                        | No             |
| **Default**          | Direct provider mapping or unsupported SDKs           | TypeScript          | N/A                        | N/A            |
| **Custom**           | Any SDK with specific requirements                    | TypeScript          | You decide                 | You decide     |

### For Python developers

**Pydantic AI** — covers the common case: type-safe outputs via Pydantic models, streaming, sync and async tool functions, and all major LLM providers (OpenAI, Anthropic, Google).

**[Claude Agent SDK](/integrations/python/claude-agent-sdk)** — use when you need agentic capabilities with Claude (multi-turn tool use, budget controls, permission management).

<Note>
  **Image and speech generation** are not available in Python adapters. If you need these features, consider using the AI SDK adapter via a Node.js service, or use provider SDKs directly.
</Note>

## Switching adapters

Switch between adapters without changing your prompts. Only your client configuration changes:

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    // Switch from AI SDK to Mastra — your prompts stay exactly the same
    import { createAgentMarkClient, MastraModelRegistry } from "@agentmark-ai/mastra-v0-adapter";
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    # Switch from Pydantic AI to Claude Agent SDK — your prompts stay exactly the same
    from agentmark_claude_agent_sdk_v0 import create_claude_agent_client, ClaudeAgentModelRegistry
    ```
  </Tab>
</Tabs>

<Tip>
  **Your prompts are adapter-agnostic.** The same `.prompt.mdx` files work with any adapter. Only your client configuration (`agentmark.client.ts` or `agentmark_client.py`) needs to change.
</Tip>

## Package versioning

Adapter package names include a `v0` marker (e.g., `agentmark-pydantic-ai-v0`, `@agentmark-ai/mastra-v0-adapter`). This indicates the **adapter API version**, not stability:

* `v0` = current stable API
* Future breaking changes would be released as `v1`, `v2`, etc.

This lets you pin a specific adapter API version and still receive bug fixes.

## Next steps

<CardGroup>
  <Card title="AI SDK" icon="bolt" href="/integrations/typescript/ai-sdk">
    Next.js and Node.js apps
  </Card>

  <Card title="Claude Agent SDK" icon="microchip" href="/integrations/typescript/claude-agent-sdk">
    Agentic tasks with Claude
  </Card>

  <Card title="Mastra" icon="robot" href="/integrations/typescript/mastra">
    Agentic workflows
  </Card>

  <Card title="Pydantic AI" icon="python" href="/integrations/python/pydantic-ai">
    Python type-safe integration
  </Card>

  <Card title="Claude Agent SDK (Python)" icon="microchip" href="/integrations/python/claude-agent-sdk">
    Agentic Claude tasks in Python
  </Card>

  <Card title="Default adapter" icon="arrow-right" href="/integrations/fallback">
    Raw config pass-through
  </Card>

  <Card title="Custom adapter" icon="code" href="/integrations/custom">
    Build your own
  </Card>
</CardGroup>


# Claude Agent SDK (Python)
Source: https://docs.agentmark.co/integrations/python/claude-agent-sdk

Run AgentMark prompts as agentic tasks in Python with Anthropic's Claude Agent SDK

The Claude Agent SDK adapter runs AgentMark prompts as agentic tasks using [Anthropic's Claude Agent SDK](https://github.com/anthropics/claude-agent-sdk) — multi-turn tool use, budget controls, permission management, and tracing. This page covers the **Python** adapter; for TypeScript, see [Claude Agent SDK (TypeScript)](/integrations/typescript/claude-agent-sdk).

<Note>
  The Python adapter (`agentmark-claude-agent-sdk-v0`) is in **alpha**. The API surface documented here is largely settled, but expect occasional changes ahead of the `v1` release. For type-safe structured outputs across all major providers, [Pydantic AI](/integrations/python/pydantic-ai) is the recommended general-purpose Python adapter; reach for Claude Agent SDK when you specifically need Claude's agentic capabilities.
</Note>

## Installation

```bash theme={null}
pip install agentmark-claude-agent-sdk-v0 agentmark-prompt-core claude-agent-sdk
```

Set your Anthropic API key:

```bash theme={null}
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
```

## Setup

Create your client with a `ClaudeAgentModelRegistry`. There is **no** `create_default()` — you must register models or providers explicitly so the model names in your prompt frontmatter resolve.

The simplest setup registers a provider by prefix (`anthropic/...`):

```python agentmark_client.py theme={null}
from pathlib import Path
from dotenv import load_dotenv
from agentmark.prompt_core import FileLoader
from agentmark_claude_agent_sdk_v0 import (
    create_claude_agent_client,
    ClaudeAgentModelRegistry,
)

load_dotenv()

model_registry = ClaudeAgentModelRegistry()
model_registry.register_providers({"anthropic": "anthropic"})

loader = FileLoader(str(Path(__file__).parent / "dist" / "agentmark"))

client = create_claude_agent_client(
    model_registry=model_registry,
    loader=loader,
)
```

For per-model options like `max_thinking_tokens`, register models explicitly with a `ModelConfig` creator. The creator receives the model name and the run options:

```python agentmark_client.py theme={null}
from agentmark_claude_agent_sdk_v0 import (
    create_claude_agent_client,
    ClaudeAgentModelRegistry,
    ModelConfig,
)

model_registry = ClaudeAgentModelRegistry()
model_registry.register_models(
    ["claude-sonnet-4-20250514"],
    lambda name, _: ModelConfig(model=name),
)
model_registry.register_models(
    ["claude-opus-4-20250514"],
    lambda name, _: ModelConfig(model=name, max_thinking_tokens=10000),
)

client = create_claude_agent_client(
    model_registry=model_registry,
    loader=loader,
)
```

## Running prompts

Use `traced_query` — it accepts the output of `prompt.format()` directly, runs the Claude Agent SDK query internally, and yields messages as the agent works. (`run_text_prompt` / `run_object_prompt` belong to the Pydantic AI adapter and are **not** exported here.)

```python theme={null}
import asyncio
from agentmark_claude_agent_sdk_v0 import traced_query
from agentmark_client import client

async def main():
    prompt = await client.load_text_prompt("code-reviewer.prompt.mdx")
    adapted = await prompt.format(props={
        "task": "Analyze the auth module and suggest improvements",
    })

    async for message in traced_query(adapted):
        print(message)

asyncio.run(main())
```

Messages stream as the agent runs; the final aggregated result is only available once all turns complete.
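
As a rough illustration of that streaming behavior, the sketch below drains an async message stream and keeps intermediate messages apart from the final result. It does not call the real adapter: `fake_agent_stream` is a hypothetical stand-in for `traced_query(adapted)`, and the `type` / `result` attributes on its messages are assumptions modeled on the object-generation example on this page, not a documented message schema.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, AsyncIterator, List, Optional, Tuple


async def fake_agent_stream() -> AsyncIterator[Any]:
    """Hypothetical stand-in for traced_query(adapted); message shapes are invented."""
    yield SimpleNamespace(type="assistant", text="Reading the auth module...")
    yield SimpleNamespace(type="assistant", text="Found two issues.")
    yield SimpleNamespace(type="result", result="Rotate tokens and add rate limiting.")


async def collect(stream: AsyncIterator[Any]) -> Tuple[List[Any], Optional[str]]:
    """Drain the stream, separating intermediate messages from the final result."""
    intermediate: List[Any] = []
    final: Optional[str] = None
    async for message in stream:
        if getattr(message, "type", None) == "result":
            final = message.result  # only set once the agent has finished
        else:
            intermediate.append(message)
    return intermediate, final


if __name__ == "__main__":
    messages, final = asyncio.run(collect(fake_agent_stream()))
    print(f"{len(messages)} intermediate messages; final result: {final}")
```

Swapping `fake_agent_stream()` for a real `traced_query(adapted)` call should leave the loop unchanged, provided the real messages expose comparable attributes.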

## Adapter options

Adapter options are set at **client construction time**, not inside `prompt.format()`:

```python agentmark_client.py theme={null}
from agentmark_claude_agent_sdk_v0 import (
    create_claude_agent_client,
    ClaudeAgentAdapterOptions,
)

client = create_claude_agent_client(
    model_registry=model_registry,
    loader=loader,
    adapter_options=ClaudeAgentAdapterOptions(
        permission_mode="bypassPermissions",
        max_turns=10,
        cwd="/path/to/project",
        max_budget_usd=5.00,
        allowed_tools=["Read", "Write", "Bash"],
        disallowed_tools=["WebFetch"],
        system_prompt_preset=False,
        on_warning=lambda w: print(f"Warning: {w}"),
    ),
)
```

| Option                 | Type / values                                                 | Purpose                                       |
| ---------------------- | ------------------------------------------------------------- | --------------------------------------------- |
| `permission_mode`      | `'default' \| 'acceptEdits' \| 'bypassPermissions' \| 'plan'` | How the agent handles tool-permission prompts |
| `max_turns`            | `int`                                                         | Cap on agent turns                            |
| `cwd`                  | `str`                                                         | Working directory for file/Bash tools         |
| `max_budget_usd`       | `float`                                                       | Hard spend cap in USD                         |
| `allowed_tools`        | `list[str]`                                                   | Tool whitelist                                |
| `disallowed_tools`     | `list[str]`                                                   | Tool blacklist                                |
| `system_prompt_preset` | `bool` (default `False`)                                      | Use Claude Code's built-in system prompt      |
| `on_warning`           | `Callable[[str], None]`                                       | Warning handler                               |

## Object generation

For structured output, load an object prompt. The structured result arrives in the final `result` message:

```python theme={null}
import asyncio
from agentmark_claude_agent_sdk_v0 import traced_query
from agentmark_client import client

async def main():
    prompt = await client.load_object_prompt("sentiment.prompt.mdx")
    adapted = await prompt.format(props={"text": "This product is amazing!"})

    async for message in traced_query(adapted):
        # The structured output arrives on the final "result" message.
        if message.type == "result":
            print(message.result)

asyncio.run(main())
```

## Tools

The Claude Agent SDK adapter handles tools by **name**, not by registering executors. List tool names in your prompt frontmatter and the adapter passes them through as `allowed_tools` to the SDK. Tools can be the SDK's built-ins (`Read`, `Write`, `Bash`, …) or tools served by MCP servers.

Configure MCP servers on the client (the Python field is `mcp_servers`, snake\_case):

```python agentmark_client.py theme={null}
client = create_claude_agent_client(
    model_registry=model_registry,
    loader=loader,
    mcp_servers={"weather": {"url": "https://weather-mcp.example.com/sse"}},
)
```

Then reference tools by name in your prompt:

```mdx task.prompt.mdx theme={null}
---
name: task
text_config:
  model_name: claude-sonnet-4-20250514
  tools:
    - weather
---

<System>You are a helpful assistant with access to weather data.</System>
<User>{props.task}</User>
```

## Evals

Register evaluation functions to score prompt outputs during [experiments](/evaluate/running-experiments). Score schemas live in `agentmark.json`; eval functions connect to them by name.

```python agentmark_client.py theme={null}
from agentmark.prompt_core import EvalParams, EvalResult

def exact_match(params: EvalParams) -> EvalResult:
    match = str(params["output"]).strip() == str(params.get("expectedOutput", "")).strip()
    return {"passed": match, "score": 1.0 if match else 0.0}

evals = {
    "exact_match": exact_match,
}

client = create_claude_agent_client(
    model_registry=model_registry,
    loader=loader,
    evals=evals,
)
```

Reference evals in your prompt frontmatter:

```mdx theme={null}
---
test_settings:
  dataset: ./datasets/test.jsonl
  evals:
    - exact_match
---
```

[Learn more about evaluations](/evaluate/writing-evals).
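
Because eval functions are plain callables, they can be exercised without a client or a model call. The sketch below restates `exact_match` with plain `dict` types so it runs without AgentMark installed. The `output` and `expectedOutput` keys come from the example above; anything else about the `EvalParams` shape is an assumption.

```python
from typing import Any, Dict


def exact_match(params: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when output equals the expected output, ignoring surrounding whitespace."""
    actual = str(params["output"]).strip()
    expected = str(params.get("expectedOutput", "")).strip()
    match = actual == expected
    return {"passed": match, "score": 1.0 if match else 0.0}


# Quick checks that need no client, loader, or model call.
hit = exact_match({"output": " positive\n", "expectedOutput": "positive"})
miss = exact_match({"output": "negative", "expectedOutput": "positive"})
```

Checks like these fit naturally into an ordinary test suite, so a scoring bug surfaces before an experiment run.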

## Tracing

`traced_query` emits OpenTelemetry spans automatically when telemetry is enabled on `prompt.format()`. All tracing context — prompt name, model, system prompt, and props — is extracted from the adapted output:

```python theme={null}
from agentmark_claude_agent_sdk_v0 import traced_query

# Run inside an async function; `prompt` is a prompt loaded from `client` as in "Running prompts".
adapted = await prompt.format(
    props={"task": "..."},
    telemetry={"isEnabled": True},
)

async for message in traced_query(adapted):
    print(message)
```

See [Tracing setup](/observe/tracing-setup) for wiring spans to the local dev server or AgentMark Cloud.

## Limitations

* **No image generation** — use the [AI SDK adapter](/integrations/typescript/ai-sdk) via a Node.js service.
* **No speech generation** — same as above.
* The aggregated final result is only available after all turns complete; intermediate state arrives as streamed messages.

## Next steps

<CardGroup>
  <Card title="Pydantic AI" icon="python" href="/integrations/python/pydantic-ai">
    The general-purpose Python adapter
  </Card>

  <Card title="Tools and agents" icon="wrench" href="/build/tools-and-agents">
    Configure tools for your agents
  </Card>

  <Card title="Observability" icon="chart-line" href="/observe/overview">
    Monitor your agents in production
  </Card>

  <Card title="Other integrations" icon="plug" href="/integrations/overview">
    Explore other AI frameworks
  </Card>
</CardGroup>


# Pydantic AI
Source: https://docs.agentmark.co/integrations/python/pydantic-ai

Use AgentMark prompts with Pydantic AI in Python

The Pydantic AI adapter lets you use AgentMark prompts with [Pydantic AI](https://ai.pydantic.dev) in Python applications. It's the recommended adapter for Python projects.

## Installation

```bash theme={null}
pip install agentmark-pydantic-ai-v0 agentmark-prompt-core
```

<Note>
  **Package names vs import names:**

  * `agentmark-prompt-core` → `from agentmark.prompt_core import ...`
  * `agentmark-pydantic-ai-v0` → `from agentmark_pydantic_ai_v0 import ...`

  `ApiLoader` and `FileLoader` both ship with `agentmark-prompt-core` — there's no separate `agentmark-loader-api` package.
</Note>

For specific providers, install the Pydantic AI provider extras you need:

```bash theme={null}
pip install "pydantic-ai[openai]"     # OpenAI
pip install "pydantic-ai[anthropic]"  # Anthropic
pip install "pydantic-ai[google]"     # Google Gemini
```

## Setup

The Python adapters don't ship a "default" model registry — you register provider prefixes explicitly. The `"<provider>:<model>"` string format tells Pydantic AI which provider to use at runtime:

```python agentmark_client.py theme={null}
import os
from dotenv import load_dotenv
from agentmark.prompt_core import ApiLoader
from agentmark_pydantic_ai_v0 import (
    create_pydantic_ai_client,
    PydanticAIModelRegistry,
)

load_dotenv()

model_registry = PydanticAIModelRegistry()
model_registry.register_models(
    ["gpt-4o", "gpt-4o-mini"],
    lambda name, opts=None: f"openai:{name}",
)
model_registry.register_models(
    ["claude-sonnet-4-20250514"],
    lambda name, opts=None: f"anthropic:{name}",
)

if os.getenv("NODE_ENV") == "development":
    loader = ApiLoader.local(
        base_url=os.getenv("AGENTMARK_BASE_URL", "http://localhost:9418")
    )
else:
    loader = ApiLoader.cloud(
        api_key=os.environ["AGENTMARK_API_KEY"],
        app_id=os.environ["AGENTMARK_APP_ID"],
    )

client = create_pydantic_ai_client(
    model_registry=model_registry,
    loader=loader,
)
```

### Registering models with patterns

`register_models` accepts an exact string, a `re.Pattern`, or a list of strings. To handle names that match none of the registered entries, register a fallback creator with `set_default`:

```python theme={null}
import re
from agentmark_pydantic_ai_v0 import PydanticAIModelRegistry

model_registry = PydanticAIModelRegistry()

# Exact matches
model_registry.register_models(
    ["gpt-4o", "gpt-4o-mini"],
    lambda name, opts=None: f"openai:{name}",
)

# Regex pattern
model_registry.register_models(
    re.compile(r"^claude-"),
    lambda name, opts=None: f"anthropic:{name}",
)

# Fallback for unmatched names
model_registry.set_default(lambda name, opts=None: name)
```
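
The lookup order inside `PydanticAIModelRegistry` is not spelled out here. As a mental model only, the sketch below shows one plausible order: exact names first, then patterns in registration order, then the default. `SketchRegistry` and its `resolve` method are hypothetical and not part of the adapter.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple, Union

Creator = Callable[..., str]  # called as creator(name), mirroring the lambdas above


class SketchRegistry:
    """Toy registry illustrating exact -> pattern -> default resolution (assumed order)."""

    def __init__(self) -> None:
        self._exact: Dict[str, Creator] = {}
        self._patterns: List[Tuple[re.Pattern, Creator]] = []
        self._default: Optional[Creator] = None

    def register_models(self, names: Union[str, re.Pattern, List[str]], creator: Creator) -> None:
        if isinstance(names, re.Pattern):
            self._patterns.append((names, creator))
        elif isinstance(names, str):
            self._exact[names] = creator
        else:
            for name in names:
                self._exact[name] = creator

    def set_default(self, creator: Creator) -> None:
        self._default = creator

    def resolve(self, name: str) -> str:
        if name in self._exact:
            return self._exact[name](name)
        for pattern, creator in self._patterns:
            if pattern.search(name):
                return creator(name)
        if self._default is not None:
            return self._default(name)
        raise KeyError(f"Model not registered: {name}")


registry = SketchRegistry()
registry.register_models(["gpt-4o", "gpt-4o-mini"], lambda name, opts=None: f"openai:{name}")
registry.register_models(re.compile(r"^claude-"), lambda name, opts=None: f"anthropic:{name}")
registry.set_default(lambda name, opts=None: name)
```

Under these assumptions, `registry.resolve("claude-sonnet-4-20250514")` goes through the regex entry, while an unregistered name such as `"llama3"` falls through to the default creator.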

## Running prompts

Load and run prompts with `run_text_prompt`:

```python theme={null}
import asyncio
from agentmark_pydantic_ai_v0 import run_text_prompt
from agentmark_client import client

async def main():
    prompt = await client.load_text_prompt("greeting.prompt.mdx")
    params = await prompt.format(props={"name": "Alice"})

    result = await run_text_prompt(params)
    print(result.output)
    print(f"Tokens: {result.usage.total_tokens}")

asyncio.run(main())
```

## Object generation

For structured output, the adapter automatically converts JSON Schema to Pydantic models:

```python theme={null}
import asyncio
from agentmark_pydantic_ai_v0 import run_object_prompt
from agentmark_client import client

async def main():
    prompt = await client.load_object_prompt("sentiment.prompt.mdx")
    params = await prompt.format(props={"text": "This product is amazing!"})

    result = await run_object_prompt(params)
    print(result.output)              # Typed Pydantic model instance
    print(result.output.sentiment)    # 'positive'

asyncio.run(main())
```

## Streaming

Stream text responses for real-time output:

```python theme={null}
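from agentmark_pydantic_ai_v0 import stream_text_prompt

# Run inside an async function; `prompt` is a text prompt loaded from `client` as in "Running prompts".
params = await prompt.format(props={"query": "Explain quantum computing"})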

async for chunk in stream_text_prompt(params):
    print(chunk, end="", flush=True)
```

## Tools

Pass native Python tool functions as a **list** to `create_pydantic_ai_client`. The adapter derives the tool name from each function's `__name__`, then matches that name against the prompt's `tools:` frontmatter:

```python agentmark_client.py theme={null}
from agentmark_pydantic_ai_v0 import create_pydantic_ai_client

# Sync tool — name comes from `search.__name__` = "search"
def search(query: str) -> str:
    return f"Results for: {query}"

# Async tool — name comes from `fetch_data.__name__` = "fetch_data"
async def fetch_data(url: str) -> str:
    # `api` stands in for your own async HTTP client.
    return await api.get(url)

client = create_pydantic_ai_client(
    model_registry=model_registry,
    tools=[search, fetch_data],
    loader=loader,
)
```

<Note>
  The function name in your Python code must match the name used in the prompt's `tools:` frontmatter. Use `pydantic_ai.Tool(function=..., name="custom-name")` if you need to rename.
</Note>

Then reference tools in your prompts by name:

```mdx search.prompt.mdx theme={null}
---
name: search
text_config:
  model_name: gpt-4o
  tools:
    - search
---

<System>You are a helpful search assistant.</System>
<User>Search for {props.query}</User>
```
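
The name matching described in this section can be pictured with a small stand-alone sketch. `resolve_tools` is a hypothetical helper, not the adapter's API. It only illustrates indexing functions by `__name__` and looking up the names listed under `tools:`.

```python
from typing import Callable, Dict, Iterable, List


def search(query: str) -> str:
    return f"Results for: {query}"


async def fetch_data(url: str) -> str:
    return f"Fetched: {url}"  # placeholder body, no network access


def resolve_tools(names: Iterable[str], tools: Iterable[Callable]) -> List[Callable]:
    """Return the functions whose __name__ matches each frontmatter tool name."""
    wanted = list(names)
    by_name: Dict[str, Callable] = {fn.__name__: fn for fn in tools}
    missing = [name for name in wanted if name not in by_name]
    if missing:
        raise KeyError(f"No tool function named: {', '.join(missing)}")
    return [by_name[name] for name in wanted]


# Frontmatter lists only "search", so only that function is selected.
selected = resolve_tools(["search"], [search, fetch_data])
```

Failing loudly on an unknown name is a design choice of this sketch: a typo in the frontmatter is then caught at load time rather than during a model call.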

## MCP servers

MCP servers are passed via an `McpServerRegistry`, not a raw `mcp_servers` dict. Construct the registry and pass it as `mcp_registry`:

```python agentmark_client.py theme={null}
from agentmark_pydantic_ai_v0 import create_pydantic_ai_client, McpServerRegistry

mcp_registry = McpServerRegistry()
mcp_registry.register_servers({
    # URL-based server
    "search": {
        "url": "http://localhost:8000/mcp",
    },
    # Stdio-based server
    "python-runner": {
        "command": "python",
        "args": ["-m", "mcp_server"],
        "cwd": "/app",
    },
})

client = create_pydantic_ai_client(
    model_registry=model_registry,
    loader=loader,
    mcp_registry=mcp_registry,
)
```

Reference MCP tools in prompts with the `mcp://` prefix:

```mdx theme={null}
---
name: task
text_config:
  model_name: gpt-4o
  tools:
    - mcp://search/web_search
    - mcp://python-runner/*
---
```
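
A reference of this form splits into a server name and a tool name. The sketch below parses it with plain string operations. `McpToolRef` and `parse_tool_ref` are hypothetical names, and reading `*` as "every tool on that server" is an assumption based on the example above rather than documented behavior.

```python
from typing import NamedTuple, Optional

MCP_PREFIX = "mcp://"


class McpToolRef(NamedTuple):
    server: str  # key registered on the MCP registry, e.g. "search"
    tool: str    # tool name, or "*" (assumed to mean all tools on the server)


def parse_tool_ref(ref: str) -> Optional[McpToolRef]:
    """Parse 'mcp://<server>/<tool>'; return None for plain (non-MCP) tool names."""
    if not ref.startswith(MCP_PREFIX):
        return None
    server, sep, tool = ref[len(MCP_PREFIX):].partition("/")
    if not server or not sep or not tool:
        raise ValueError(f"Expected mcp://<server>/<tool>, got: {ref!r}")
    return McpToolRef(server=server, tool=tool)


refs = [parse_tool_ref(r) for r in ["mcp://search/web_search", "mcp://python-runner/*", "search"]]
```

Returning `None` for plain names lets the same `tools:` list mix native tool functions and MCP references.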

## Evals

Register evaluation functions to score prompt outputs during experiments. Pass an `evals` dictionary of plain functions to `create_pydantic_ai_client`. Score schemas are defined separately in `agentmark.json` — eval functions are connected to scores by name.

```python agentmark_client.py theme={null}
from agentmark.prompt_core import EvalParams, EvalResult
from agentmark_pydantic_ai_v0 import create_pydantic_ai_client

def exact_match(params: EvalParams) -> EvalResult:
    match = str(params["output"]).strip() == str(params.get("expectedOutput", "")).strip()
    return {"passed": match, "score": 1.0 if match else 0.0}

evals = {
    "exact_match": exact_match,
}

client = create_pydantic_ai_client(
    model_registry=model_registry,
    loader=loader,
    evals=evals,
)
```

Each entry maps a score name to a sync or async function that receives `EvalParams` and returns `EvalResult`.
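
Supporting both sync and async eval functions usually comes down to awaiting the return value only when it is awaitable. The sketch below shows that pattern with hypothetical `run_eval` and `run_all` helpers. It is not AgentMark's experiment runner, and the params are plain dicts rather than the real `EvalParams` type.

```python
import asyncio
import inspect
from typing import Any, Callable, Dict


def exact_match(params: Dict[str, Any]) -> Dict[str, Any]:
    match = str(params["output"]).strip() == str(params.get("expectedOutput", "")).strip()
    return {"passed": match, "score": 1.0 if match else 0.0}


async def contains_expected(params: Dict[str, Any]) -> Dict[str, Any]:
    await asyncio.sleep(0)  # stands in for real async work, e.g. a remote grader
    found = str(params.get("expectedOutput", "")) in str(params["output"])
    return {"passed": found, "score": 1.0 if found else 0.0}


async def run_eval(fn: Callable[[Dict[str, Any]], Any], params: Dict[str, Any]) -> Dict[str, Any]:
    """Call a sync or async eval function and return its result."""
    result = fn(params)
    if inspect.isawaitable(result):
        result = await result
    return result


async def run_all(evals: Dict[str, Callable], params: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Run every registered eval against one set of params, keyed by score name."""
    return {name: await run_eval(fn, params) for name, fn in evals.items()}


if __name__ == "__main__":
    evals = {"exact_match": exact_match, "contains_expected": contains_expected}
    print(asyncio.run(run_all(evals, {"output": "very positive", "expectedOutput": "positive"})))
```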

Reference evals in your prompt frontmatter:

```mdx theme={null}
---
test_settings:
  dataset: ./datasets/test.jsonl
  evals:
    - exact_match
---
```

[Learn more about evaluations](/evaluate/writing-evals).

## Getting started

The fastest way to scaffold a Python project:

```bash theme={null}
npm create agentmark@latest my-app
# Select "Python" when prompted for language
# Select "Pydantic AI" as the adapter
```

Run the local dev server:

```bash theme={null}
npx agentmark dev
```

The CLI automatically detects Python projects via `pyproject.toml` or `agentmark_client.py`.

## Limitations

* **No image generation** — use the [AI SDK adapter](/integrations/typescript/ai-sdk) (TypeScript) for `experimental_generateImage`.
* **No speech generation** — use the [AI SDK adapter](/integrations/typescript/ai-sdk) (TypeScript) for `experimental_generateSpeech`.

## Next steps

<CardGroup>
  <Card title="AI SDK" icon="bolt" href="/integrations/typescript/ai-sdk">
    TypeScript adapter for Node.js
  </Card>

  <Card title="Claude Agent SDK" icon="microchip" href="/integrations/typescript/claude-agent-sdk">
    Agentic tasks with Claude
  </Card>

  <Card title="Prompts" icon="file-code" href="/build/overview">
    Learn about prompt syntax
  </Card>

  <Card title="Observability" icon="chart-line" href="/observe/overview">
    Monitor prompts in production
  </Card>
</CardGroup>


# AI SDK
Source: https://docs.agentmark.co/integrations/typescript/ai-sdk

Use AgentMark prompts with the Vercel AI SDK

The AI SDK adapter allows you to use AgentMark prompts with [Vercel AI SDK](https://sdk.vercel.ai/docs)'s generation functions. This is the recommended adapter for most TypeScript/JavaScript applications.

AgentMark provides two versions of this adapter — one for AI SDK v4 and one for AI SDK v5. Both share the same API surface (`VercelAIModelRegistry`, `createAgentMarkClient`), so switching between them only requires changing the package import.

## Choosing a version

|                     | AI SDK v4 Adapter                            | AI SDK v5 Adapter                              |
| ------------------- | -------------------------------------------- | ---------------------------------------------- |
| **Package**         | `@agentmark-ai/ai-sdk-v4-adapter`            | `@agentmark-ai/ai-sdk-v5-adapter`              |
| **AI SDK peer**     | `ai` ^4.0.0                                  | `ai` ^5.0.52                                   |
| **MCP imports**     | Built into `ai` package                      | Separate `@ai-sdk/mcp` peer dependency         |
| **Tool definition** | `parameters` field                           | `inputSchema` field (wrapped via `jsonSchema`) |
| **Status**          | Stable — use if your project is on AI SDK v4 | **Recommended** — use for new projects         |

<Note>
  If you're starting a new project, use the **v5 adapter**. The v4 adapter is provided for projects that haven't yet upgraded to AI SDK v5.
</Note>

## Installation

Install the adapter, the `ai` core package, and the provider package(s) for the models you want to use. Provider packages must be compatible with your `ai` core version.

<Tabs>
  <Tab title="AI SDK v5 (Recommended)">
    ```bash theme={null}
    # Core
    npm install @agentmark-ai/ai-sdk-v5-adapter ai@^5

    # Provider packages (install the ones you need)
    npm install @ai-sdk/openai       # OpenAI / GPT models
    npm install @ai-sdk/anthropic    # Anthropic / Claude models
    npm install @ai-sdk/google       # Google / Gemini models

    # MCP server support (optional)
    npm install @ai-sdk/mcp
    ```
  </Tab>

  <Tab title="AI SDK v4">
    ```bash theme={null}
    # Core
    npm install @agentmark-ai/ai-sdk-v4-adapter ai@^4

    # Provider packages — use v4-compatible versions
    npm install @ai-sdk/openai@^1    # OpenAI / GPT models
    npm install @ai-sdk/anthropic@^1 # Anthropic / Claude models
    npm install @ai-sdk/google@^1    # Google / Gemini models
    ```

    <Note>
      AI SDK v4 uses `@ai-sdk/` provider packages at v1.x. AI SDK v5 uses v2.x+. Make sure you install the version that matches your `ai` core package.
    </Note>
  </Tab>
</Tabs>
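The version pairing above (AI SDK v4 with provider packages at v1.x, AI SDK v5 with v2.x or later) can be expressed as a quick sanity check. This is only a sketch: `majorOf` and `providerMatchesCore` are hypothetical helpers, not part of AgentMark or the AI SDK, and they look only at the first number in a version string.

```typescript
// Major version of a semver string or range, e.g. "^5.0.52" -> 5.
function majorOf(version: string): number {
  const match = version.match(/\d+/);
  if (!match) throw new Error(`No version number in "${version}"`);
  return Number(match[0]);
}

// Pairing described on this page: ai v4 <-> providers v1.x, ai v5 <-> providers v2.x or later.
function providerMatchesCore(aiVersion: string, providerVersion: string): boolean {
  const ai = majorOf(aiVersion);
  const provider = majorOf(providerVersion);
  if (ai === 4) return provider === 1;
  if (ai === 5) return provider >= 2;
  return false; // other majors are outside what this page documents
}

console.log(providerMatchesCore("^4.0.0", "^1.2.3")); // true
console.log(providerMatchesCore("^5.0.52", "^1.2.3")); // false
```

A check like this could run in CI against the versions listed in `package.json`, though `npm ls` and peer dependency warnings remain the authoritative signal.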

## Setup

Create your AgentMark client with a model registry. Use `.registerProviders()` to register provider packages; model IDs written as `"<provider>/<model>"` (e.g., `"openai/gpt-4o"`) then auto-resolve. Use `.registerModels()` for exact names or a pattern: it accepts `string | RegExp | string[]`, but not `RegExp[]`, so combine multiple patterns into a single regex.

<Tabs>
  <Tab title="AI SDK v5">
    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { anthropic } from "@ai-sdk/anthropic";
    import { openai } from "@ai-sdk/openai";

    const modelRegistry = new VercelAIModelRegistry();

    // Preferred: register providers, then use "<provider>/<model>" IDs in prompts
    modelRegistry.registerProviders({ openai, anthropic });

    // Or register models explicitly
    modelRegistry
      .registerModels(["claude-sonnet-4-20250514"], (name) => anthropic(name))
      .registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name))
      .registerModels(/^gpt-/, (name) => openai(name)); // single regex — not wrapped in []

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
    });
    ```
  </Tab>

  <Tab title="AI SDK v4">
    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v4-adapter";
    import { anthropic } from "@ai-sdk/anthropic";
    import { openai } from "@ai-sdk/openai";

    const modelRegistry = new VercelAIModelRegistry();
    modelRegistry.registerProviders({ openai, anthropic });

    modelRegistry
      .registerModels(["claude-3-5-sonnet-20241022"], (name) => anthropic(name))
      .registerModels(/^gpt-/, (name) => openai(name));

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
    });
    ```
  </Tab>
</Tabs>

The setup is identical — only the import path changes.
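Because `.registerModels()` takes one `RegExp` rather than an array of them, several patterns have to be merged before registering. A minimal sketch of one way to do that; `combinePatterns` is a hypothetical helper, not something the adapter exports, and it ignores regex flags:

```typescript
// Merge several patterns into one alternation: (?:a)|(?:b)|...
function combinePatterns(patterns: RegExp[]): RegExp {
  if (patterns.length === 0) {
    throw new Error("At least one pattern is required");
  }
  const source = patterns.map((p) => `(?:${p.source})`).join("|");
  return new RegExp(source);
}

const openaiModels = combinePatterns([/^gpt-/, /^o\d/]);

console.log(openaiModels.test("gpt-4o")); // true
console.log(openaiModels.test("claude-sonnet-4-20250514")); // false
```

The combined regex can then be passed as the first argument to `.registerModels()` in place of an array of patterns.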

## Running prompts

Load and run prompts with `generateText()`:

```typescript theme={null}
import { client } from "./agentmark.client";
import { generateText } from "ai";

const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
const input = await prompt.format({
  props: { name: "Alice" },
});

const result = await generateText(input);
console.log(result.text);
```

## Object generation

For structured output, use `generateObject()`. The output schema lives in the prompt's `object_config.schema` frontmatter, not in the load call:

```typescript theme={null}
import { client } from "./agentmark.client";
import { generateObject } from "ai";

const prompt = await client.loadObjectPrompt("extract.prompt.mdx");

const input = await prompt.format({
  props: { text: "This product is amazing!" },
});

const result = await generateObject(input);
console.log(result.object);
// { sentiment: 'positive', confidence: 0.95 }
```

## Streaming

Stream responses for real-time output with `streamText` or `streamObject`:

```typescript theme={null}
import { client } from "./agentmark.client";
import { streamText } from "ai";

const prompt = await client.loadTextPrompt("story.prompt.mdx");
const input = await prompt.format({
  props: { topic: "space exploration" },
});

const result = streamText(input);

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

For streaming structured objects:

```typescript theme={null}
import { client } from "./agentmark.client";
import { streamObject } from "ai";

const prompt = await client.loadObjectPrompt("extract.prompt.mdx");

const input = await prompt.format({ props: { text: "..." } });
const result = streamObject(input);

for await (const partial of result.partialObjectStream) {
  console.log(partial);
}
```

## Image generation

Generate images using `experimental_generateImage`:

```typescript theme={null}
import { client } from "./agentmark.client";
import { experimental_generateImage } from "ai";

const prompt = await client.loadImagePrompt("avatar.prompt.mdx");
const input = await prompt.format({
  props: { description: "A futuristic city skyline" },
});

const result = await experimental_generateImage(input);
```

## Speech generation

Generate speech using `experimental_generateSpeech`:

```typescript theme={null}
import { client } from "./agentmark.client";
import { experimental_generateSpeech } from "ai";

const prompt = await client.loadSpeechPrompt("narration.prompt.mdx");
const input = await prompt.format({
  props: { text: "Welcome to AgentMark." },
});

const result = await experimental_generateSpeech(input);
```

## Tools

Configure tools using the AI SDK's native `tool()` helper. **AI SDK v5 uses `inputSchema`; AI SDK v4 uses `parameters`.** Mixing them fails type-checking (`TS2769`):

<Tabs>
  <Tab title="AI SDK v5">
    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { tool } from "ai";
    import { z } from "zod";

    const weatherTool = tool({
      description: "Get current weather for a location",
      inputSchema: z.object({
        location: z.string(),
      }),
      execute: async ({ location }) => {
        return `The weather in ${location} is sunny and 72°F`;
      },
    });

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      tools: { weather: weatherTool },
    });
    ```
  </Tab>

  <Tab title="AI SDK v4">
    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v4-adapter";
    import { tool } from "ai";
    import { z } from "zod";

    const weatherTool = tool({
      description: "Get current weather for a location",
      parameters: z.object({
        location: z.string(),
      }),
      execute: async ({ location }) => {
        return `The weather in ${location} is sunny and 72°F`;
      },
    });

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      tools: { weather: weatherTool },
    });
    ```
  </Tab>
</Tabs>

Then reference tools in your prompts:

```mdx weather.prompt.mdx theme={null}
---
name: weather
text_config:
  model_name: claude-3-5-sonnet-20241022
  tools:
    - weather
---

<System>You are a helpful weather assistant.</System>
<User>What's the weather in {props.location}?</User>
```

## MCP servers

AgentMark supports [Model Context Protocol](https://modelcontextprotocol.io) servers for extended capabilities. Configure both stdio and URL-based servers:

<Tabs>
  <Tab title="AI SDK v5">
    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      mcpServers: {
        filesystem: {
          command: "npx",
          args: ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/files"],
        },
        remote: {
          url: "https://mcp.example.com/sse",
        },
      },
    });
    ```

    The v5 adapter imports MCP support from `@ai-sdk/mcp` (installed separately).
  </Tab>

  <Tab title="AI SDK v4">
    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v4-adapter";

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      mcpServers: {
        filesystem: {
          command: "npx",
          args: ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/files"],
        },
        remote: {
          url: "https://mcp.example.com/sse",
        },
      },
    });
    ```

    The v4 adapter uses MCP support built into the `ai` package.
  </Tab>
</Tabs>
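The two server kinds in the config above differ only in their fields: stdio servers carry `command` and `args`, URL-based servers carry `url`. The sketch below shows how that distinction could be modeled and checked. The type names and `describeServers` are assumptions made for illustration; the adapter's own types may differ.

```typescript
// Assumed shapes, inferred from the config above; the adapter's real types may differ.
type StdioServer = { command: string; args?: string[] };
type UrlServer = { url: string };
type McpServerConfig = StdioServer | UrlServer;

// Type guard: a stdio server is identified by its `command` field.
function isStdioServer(config: McpServerConfig): config is StdioServer {
  return "command" in config;
}

// Produce one human-readable line per configured server.
function describeServers(servers: Record<string, McpServerConfig>): string[] {
  return Object.keys(servers).map((name) => {
    const config = servers[name];
    return isStdioServer(config)
      ? `${name}: stdio (${[config.command, ...(config.args ?? [])].join(" ")})`
      : `${name}: url (${config.url})`;
  });
}

console.log(
  describeServers({
    filesystem: { command: "npx", args: ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/files"] },
    remote: { url: "https://mcp.example.com/sse" },
  })
);
```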

## Observability

Enable telemetry to track performance and debug issues:

```typescript theme={null}
const input = await prompt.format({
  props: { name: "Alice" },
  telemetry: {
    isEnabled: true,
    functionId: "greeting-handler",
    metadata: {
      userId: "user-123",
      sessionId: "session-abc",
    },
  },
});

const result = await generateText(input);
```
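If several handlers enable telemetry, the options object can be built in one place. The following is a sketch only: `telemetryFor` is a hypothetical helper, and `TelemetryOptions` is a local type that mirrors the fields shown above rather than a type imported from AgentMark.

```typescript
// Local mirror of the telemetry fields used in the example above (assumed shape).
type TelemetryOptions = {
  isEnabled: boolean;
  functionId: string;
  metadata: Record<string, string>;
};

// Build an enabled telemetry block for a given handler.
function telemetryFor(
  functionId: string,
  metadata: Record<string, string> = {}
): TelemetryOptions {
  return { isEnabled: true, functionId, metadata };
}

const telemetry = telemetryFor("greeting-handler", {
  userId: "user-123",
  sessionId: "session-abc",
});

// Intended use: prompt.format({ props: { name: "Alice" }, telemetry })
console.log(telemetry);
```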

Learn more in [Tracing setup](/observe/tracing-setup).

## Migrating from v4 to v5

If you're upgrading from the v4 adapter to v5:

1. **Update packages** — upgrade the adapter, `ai` core, provider packages, and add MCP if needed:
   ```bash theme={null}
   npm uninstall @agentmark-ai/ai-sdk-v4-adapter
   npm install @agentmark-ai/ai-sdk-v5-adapter ai@^5 @ai-sdk/mcp

   # Also upgrade your provider packages to v5-compatible versions
   npm install @ai-sdk/openai@latest @ai-sdk/anthropic@latest
   ```

2. **Update imports** — change the package path in your `agentmark.client.ts`:
   ```typescript theme={null}
   // Before
   import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v4-adapter";
   // After
   import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
   ```

3. **Rename tool `parameters` to `inputSchema`** — AI SDK v5's `tool()` helper renamed the field. This is the most common cause of TS2769 errors after a v4→v5 upgrade.

4. **Otherwise no changes required** — the `VercelAIModelRegistry` and `createAgentMarkClient` APIs are the same. Tools and MCP servers are passed as plain objects. All prompt `.format()` calls and AI SDK generation functions (`generateText`, `generateObject`, `streamText`, etc.) remain identical.
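The rename in step 3 is mechanical. As an illustration, the sketch below converts a plain v4-style definition object (the argument you would pass to `tool()`) into the v5 shape. `toV5ToolDef` and both type names are hypothetical, and a plain object stands in for the zod schema so the snippet has no dependencies. In a real codebase you would normally edit the field name directly.

```typescript
// Minimal local shapes for the object passed to tool(); not the AI SDK's own types.
type V4ToolDef<S, E> = { description: string; parameters: S; execute: E };
type V5ToolDef<S, E> = { description: string; inputSchema: S; execute: E };

// Move the schema from `parameters` (v4) to `inputSchema` (v5); everything else is unchanged.
function toV5ToolDef<S, E>(def: V4ToolDef<S, E>): V5ToolDef<S, E> {
  const { parameters, ...rest } = def;
  return { ...rest, inputSchema: parameters };
}

const v4Weather = {
  description: "Get current weather for a location",
  parameters: { location: "string" }, // stand-in for z.object({ location: z.string() })
  execute: async ({ location }: { location: string }) =>
    `The weather in ${location} is sunny and 72°F`,
};

const v5Weather = toV5ToolDef(v4Weather);
console.log(Object.keys(v5Weather)); // [ 'description', 'execute', 'inputSchema' ]
```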

## Next steps

<CardGroup>
  <Card title="Prompts" icon="file-code" href="/build/overview">
    Learn about prompt syntax
  </Card>

  <Card title="Testing" icon="flask" href="/evaluate/overview">
    Test your prompts with datasets
  </Card>

  <Card title="Observability" icon="chart-line" href="/observe/overview">
    Monitor your prompts in production
  </Card>

  <Card title="Other integrations" icon="plug" href="/integrations/overview">
    Explore other AI frameworks
  </Card>
</CardGroup>



# Claude Agent SDK
Source: https://docs.agentmark.co/integrations/typescript/claude-agent-sdk

Use AgentMark prompts with Anthropic's Claude Agent SDK

The Claude Agent SDK adapter lets you run AgentMark prompts as agentic tasks using [Anthropic's Claude Agent SDK](https://github.com/anthropics/claude-agent-sdk), with support for tool use, budget controls, and tracing. It is available for both TypeScript and Python.

<Note>
  Python developer? This page shows TypeScript and Python side by side. For a Python-only walkthrough, see [Claude Agent SDK (Python)](/integrations/python/claude-agent-sdk).
</Note>

## Installation

<Tabs>
  <Tab title="TypeScript">
    ```bash theme={null}
    npm install @agentmark-ai/claude-agent-sdk-v0-adapter @anthropic-ai/claude-agent-sdk
    ```
  </Tab>

  <Tab title="Python">
    ```bash theme={null}
    pip install agentmark-claude-agent-sdk-v0 agentmark-prompt-core claude-agent-sdk
    ```
  </Tab>
</Tabs>

## Setup

<Tabs>
  <Tab title="TypeScript">
    Create your AgentMark client with a `ClaudeAgentModelRegistry`. The registry creator is a **function** that receives the model name and returns a `ModelConfig`. Use `createDefault()` for a pass-through registry, or register models explicitly if you need per-model options like `maxThinkingTokens`:

    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, ClaudeAgentModelRegistry } from "@agentmark-ai/claude-agent-sdk-v0-adapter";

    // Option 1: pass-through registry (use this instead of Option 2, not alongside it)
    // const modelRegistry = ClaudeAgentModelRegistry.createDefault();

    // Option 2: explicit registration with per-model config
    const modelRegistry = new ClaudeAgentModelRegistry()
      .registerModels(["claude-sonnet-4-20250514"], (name) => ({ model: name }))
      .registerModels(["claude-opus-4-20250514"], (name) => ({
        model: name,
        maxThinkingTokens: 10000,
      }));

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
    });
    ```
  </Tab>

  <Tab title="Python">
    The Python adapter does **not** ship a `create_default()` / `.createDefault()` — register models explicitly. Use `ModelConfig` for per-model options:

    ```python agentmark_client.py theme={null}
    from pathlib import Path
    from dotenv import load_dotenv
    from agentmark.prompt_core import FileLoader
    from agentmark_claude_agent_sdk_v0 import (
        create_claude_agent_client,
        ClaudeAgentModelRegistry,
        ClaudeAgentAdapterOptions,
        ModelConfig,
    )

    load_dotenv()

    model_registry = ClaudeAgentModelRegistry()
    model_registry.register_models(
        ["claude-sonnet-4-20250514"],
        lambda name, _: ModelConfig(model=name),
    )
    model_registry.register_models(
        ["claude-opus-4-20250514"],
        lambda name, _: ModelConfig(model=name, max_thinking_tokens=10000),
    )

    loader = FileLoader(str(Path(__file__).parent / "dist" / "agentmark"))

    client = create_claude_agent_client(
        model_registry=model_registry,
        loader=loader,
    )
    ```
  </Tab>
</Tabs>

## Running prompts

The adapter returns `{ query: { prompt, options }, messages, telemetry }`. Pass `adapted.query` directly to `query()` from `@anthropic-ai/claude-agent-sdk`:

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { client } from "./agentmark.client";
    import { query } from "@anthropic-ai/claude-agent-sdk";

    const prompt = await client.loadTextPrompt("task.prompt.mdx");
    const adapted = await prompt.format({
      props: { task: "Analyze the auth module and suggest improvements" },
    });

    for await (const message of query(adapted.query)) {
      console.log(message);
    }
    ```
  </Tab>

  <Tab title="Python">
    Use `traced_query` — `run_text_prompt` / `run_object_prompt` are Pydantic AI symbols and are not exported by the Claude adapter:

    ```python theme={null}
    import asyncio
    from agentmark_claude_agent_sdk_v0 import traced_query
    from agentmark_client import client

    async def main():
        prompt = await client.load_text_prompt("code-reviewer.prompt.mdx")
        adapted = await prompt.format(props={
            "task": "Analyze the auth module and suggest improvements"
        })

        async for message in traced_query(adapted):
            print(message)

    asyncio.run(main())
    ```
  </Tab>
</Tabs>

## Adapter options

Adapter options are configured at **client construction time** (via `createAgentMarkClient` in TypeScript or `create_claude_agent_client` in Python), not inside `prompt.format()`. Options passed to `prompt.format()` are silently ignored:

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      adapterOptions: {
        permissionMode: "bypassPermissions",
        maxTurns: 10,
        cwd: "/path/to/project",
        maxBudgetUsd: 5.00,
        allowedTools: ["Read", "Write", "Bash"],
        disallowedTools: ["WebFetch"],
        systemPromptPreset: false,
        onWarning: (warning) => {
          console.warn("Agent warning:", warning);
        },
      },
    });
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    from agentmark_claude_agent_sdk_v0 import (
        create_claude_agent_client,
        ClaudeAgentAdapterOptions,
    )

    client = create_claude_agent_client(
        model_registry=model_registry,
        loader=loader,
        adapter_options=ClaudeAgentAdapterOptions(
            permission_mode="bypassPermissions",
            max_turns=10,
            cwd="/path/to/project",
            max_budget_usd=5.00,
            allowed_tools=["Read", "Write", "Bash"],
            disallowed_tools=["WebFetch"],
            system_prompt_preset=False,
            on_warning=lambda w: print(f"Warning: {w}"),
        ),
    )
    ```
  </Tab>
</Tabs>

| Option               | TypeScript           | Python                 | Type / values                                                 |
| -------------------- | -------------------- | ---------------------- | ------------------------------------------------------------- |
| Permission mode      | `permissionMode`     | `permission_mode`      | `'default' \| 'acceptEdits' \| 'bypassPermissions' \| 'plan'` |
| Max turns            | `maxTurns`           | `max_turns`            | `number`                                                      |
| Working directory    | `cwd`                | `cwd`                  | `string`                                                      |
| Budget limit         | `maxBudgetUsd`       | `max_budget_usd`       | `number` (USD)                                                |
| Allowed tools        | `allowedTools`       | `allowed_tools`        | `string[]` — whitelist                                        |
| Disallowed tools     | `disallowedTools`    | `disallowed_tools`     | `string[]` — blacklist                                        |
| System prompt preset | `systemPromptPreset` | `system_prompt_preset` | `boolean` (use Claude Code's built-in system prompt)          |
| Warning handler      | `onWarning`          | `on_warning`           | `(message: string) => void`                                   |
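The TypeScript and Python option names in the table differ only in casing. A small sketch of that correspondence, using hypothetical helpers (`toSnakeCase`, `toPythonOptionNames`) that are not part of either adapter:

```typescript
// camelCase -> snake_case: prefix each capital letter with "_" and lowercase it.
function toSnakeCase(key: string): string {
  return key.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`);
}

// Rename every key of an options object; values are passed through untouched.
function toPythonOptionNames(options: Record<string, unknown>): Record<string, unknown> {
  const renamed: Record<string, unknown> = {};
  for (const key of Object.keys(options)) {
    renamed[toSnakeCase(key)] = options[key];
  }
  return renamed;
}

console.log(
  toPythonOptionNames({
    permissionMode: "bypassPermissions",
    maxTurns: 10,
    cwd: "/path/to/project",
    maxBudgetUsd: 5.0,
  })
);
```

This is handy when keeping a TypeScript and a Python service configured identically from one shared settings file.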

## Object generation

For structured output, use object prompts:

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { client } from "./agentmark.client";
    import { query } from "@anthropic-ai/claude-agent-sdk";

    const prompt = await client.loadObjectPrompt("extract.prompt.mdx");
    const adapted = await prompt.format({
      props: { text: "This product is amazing!" },
    });

    for await (const message of query(adapted.query)) {
      if (message.type === "result" && message.subtype === "success") {
        console.log(message.result);
      }
    }
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    import asyncio
    from agentmark_claude_agent_sdk_v0 import traced_query
    from agentmark_client import client

    async def main():
        prompt = await client.load_object_prompt("sentiment.prompt.mdx")
        adapted = await prompt.format(props={"text": "This product is amazing!"})

        # Only the final "result" message carries the structured output
        async for message in traced_query(adapted):
            if message.type == "result":
                print(message.result)

    asyncio.run(main())
    ```
  </Tab>
</Tabs>
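The loops above print the result as it arrives. When only the final value is needed, the stream can be drained inside a helper. The sketch below assumes a simplified message shape (`AgentMessage`) and uses a fake generator in place of `query(adapted.query)` so it runs without the SDK; `finalResult` is a hypothetical helper, not an adapter export.

```typescript
// Simplified message shape (an assumption); the SDK's real message union is richer.
type AgentMessage = { type: string; subtype?: string; result?: string };

// Drain the stream and return the payload of the last successful "result" message.
async function finalResult(
  stream: AsyncIterable<AgentMessage>
): Promise<string | undefined> {
  let last: string | undefined;
  for await (const message of stream) {
    if (message.type === "result" && message.subtype === "success") {
      last = message.result;
    }
  }
  return last;
}

// Stand-in for `query(adapted.query)` so the sketch has no dependencies.
async function* fakeQuery(): AsyncGenerator<AgentMessage> {
  yield { type: "assistant" };
  yield { type: "result", subtype: "success", result: '{"sentiment":"positive"}' };
}

finalResult(fakeQuery()).then((result) => console.log(result));
```

The helper returns `undefined` when the run ends without a successful result, so callers can tell a failed run apart from an empty one.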

## Tools

The Claude Agent SDK adapter handles tools differently from the [AI SDK adapter](/integrations/typescript/ai-sdk). Instead of registering custom tool executors, you list tool names in your prompt frontmatter. The adapter passes these names through as `allowedTools` to the Claude Agent SDK.

Tools can be any of the SDK's built-in tools (Read, Write, Bash, etc.) or tools provided by MCP servers. Configure MCP servers on the client:

<Tabs>
  <Tab title="TypeScript">
    ```typescript agentmark.client.ts theme={null}
    import { createAgentMarkClient, ClaudeAgentModelRegistry } from "@agentmark-ai/claude-agent-sdk-v0-adapter";

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      mcpServers: {
        weather: { url: "https://weather-mcp.example.com/sse" },
      },
    });
    ```

    The TypeScript field is `mcpServers` (camelCase). The Python adapter uses `mcp_servers` (snake\_case).
  </Tab>

  <Tab title="Python">
    ```python agentmark_client.py theme={null}
    client = create_claude_agent_client(
        model_registry=model_registry,
        mcp_servers={"weather": {"url": "https://weather-mcp.example.com/sse"}},
        loader=loader,
    )
    ```
  </Tab>
</Tabs>

Then reference tools by name in your prompts:

```mdx task.prompt.mdx theme={null}
---
name: task
text_config:
  model_name: claude-sonnet-4-20250514
  tools:
    - weather
---

<System>You are a helpful assistant with access to weather data.</System>
<User>{props.task}</User>
```

## Evals

Register evaluation functions for scoring prompt outputs during experiments. Score schemas are defined separately in `agentmark.json` — eval functions are connected to scores by name.

<Tabs>
  <Tab title="TypeScript">
    ```typescript agentmark.client.ts theme={null}
    import type { EvalFunction } from "@agentmark-ai/prompt-core";

    const evals: Record<string, EvalFunction> = {
      exact_match: ({ output, expectedOutput }) => ({
        passed: output === expectedOutput,
        score: output === expectedOutput ? 1 : 0,
      }),
    };

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      evals,
    });
    ```
  </Tab>

  <Tab title="Python">
    ```python agentmark_client.py theme={null}
    from agentmark.prompt_core import EvalParams, EvalResult

    def exact_match(params: EvalParams) -> EvalResult:
        match = str(params["output"]).strip() == str(params.get("expectedOutput", "")).strip()
        return {"passed": match, "score": 1.0 if match else 0.0}

    evals = {
        "exact_match": exact_match,
    }

    client = create_claude_agent_client(
        model_registry=model_registry,
        loader=loader,
        evals=evals,
    )
    ```
  </Tab>
</Tabs>

Reference evals in your prompt frontmatter:

```mdx theme={null}
---
test_settings:
  dataset: ./datasets/test.jsonl
  evals:
    - exact_match
---
```

[Learn more about evaluations](/evaluate/writing-evals)
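The TypeScript `exact_match` example compares raw values, while the Python one trims whitespace first. A sketch of a whitespace-tolerant TypeScript variant follows. `EvalArgs` and `EvalOutcome` are local stand-ins for the adapter's eval types, modeled on the `{ passed, score }` shape used above, and `trimmedMatch` is a hypothetical name.

```typescript
// Local stand-ins for the eval types (assumed shape, based on the examples above).
type EvalArgs = { output: unknown; expectedOutput?: unknown };
type EvalOutcome = { passed: boolean; score: number };

// Stringify and trim; a missing value becomes the empty string.
const normalize = (value: unknown): string => String(value ?? "").trim();

// Pass when output and expected output match after trimming surrounding whitespace.
function trimmedMatch({ output, expectedOutput }: EvalArgs): EvalOutcome {
  const passed = normalize(output) === normalize(expectedOutput);
  return { passed, score: passed ? 1 : 0 };
}

console.log(trimmedMatch({ output: " positive\n", expectedOutput: "positive" }));
console.log(trimmedMatch({ output: "negative", expectedOutput: "positive" }));
```

Registered under a name such as `trimmed_match` in the `evals` record, it would be referenced from `test_settings.evals` the same way as `exact_match`.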

## Tracing

<Tabs>
  <Tab title="TypeScript">
    Wrap `query` with `withTracing` to emit OpenTelemetry spans. Pass `adapted.query` and `adapted.telemetry` in an object:

    ```typescript theme={null}
    import { client } from "./agentmark.client";
    import { withTracing } from "@agentmark-ai/claude-agent-sdk-v0-adapter";
    import { query } from "@anthropic-ai/claude-agent-sdk";

    const prompt = await client.loadTextPrompt("task.prompt.mdx");
    const adapted = await prompt.format({
      props: { task: "..." },
      telemetry: { isEnabled: true },
    });

    const result = await withTracing(query, {
      query: adapted.query,
      telemetry: adapted.telemetry,
    });

    console.log("Trace ID:", result.traceId);

    for await (const message of result) {
      console.log(message);
    }
    ```
  </Tab>

  <Tab title="Python">
    `traced_query` automatically emits OpenTelemetry spans when the adapter's telemetry is enabled on `prompt.format()`:

    ```python theme={null}
    from agentmark_claude_agent_sdk_v0 import traced_query

    adapted = await prompt.format(
        props={"task": "..."},
        telemetry={"isEnabled": True},
    )

    async for message in traced_query(adapted):
        print(message)
    ```
  </Tab>
</Tabs>

Learn more in [Tracing setup](/observe/tracing-setup).

## Getting started (Python)

Scaffold a Python project with the Claude Agent SDK adapter:

```bash theme={null}
npm create agentmark@latest my-app
# Select "Python" when prompted for language
# Select "Claude Agent SDK" as the adapter
```

Run the local dev server:

```bash theme={null}
npx agentmark dev
```

## Limitations

* **No image generation** — use the [AI SDK adapter](/integrations/typescript/ai-sdk) for `experimental_generateImage`.
* **No speech generation** — use the [AI SDK adapter](/integrations/typescript/ai-sdk) for `experimental_generateSpeech`.
* Messages stream as the agent runs via `query()` / `traced_query()`, but the final aggregated result is only available after all turns complete.

## Next steps

<CardGroup>
  <Card title="Prompts" icon="file-code" href="/build/overview">
    Learn about prompt syntax
  </Card>

  <Card title="Tools and agents" icon="wrench" href="/build/tools-and-agents">
    Configure tools for your agents
  </Card>

  <Card title="Observability" icon="chart-line" href="/observe/overview">
    Monitor your agents in production
  </Card>

  <Card title="Other integrations" icon="plug" href="/integrations/overview">
    Explore other AI frameworks
  </Card>
</CardGroup>



# Mastra
Source: https://docs.agentmark.co/integrations/typescript/mastra

Use AgentMark prompts with Mastra agents

The Mastra adapter lets you use AgentMark prompts with [Mastra](https://mastra.ai)'s agentic workflow framework.

## Installation

```bash theme={null}
npm install @agentmark-ai/mastra-v0-adapter @mastra/core @ai-sdk/openai
```

## Setup

Create your AgentMark client with a `MastraModelRegistry`. Use `.registerProviders()` to register AI SDK provider packages; Mastra uses the same `@ai-sdk/*` providers under the hood, so model IDs written as `"openai/gpt-4o"` auto-resolve:

```typescript agentmark.client.ts theme={null}
import { createAgentMarkClient, MastraModelRegistry } from "@agentmark-ai/mastra-v0-adapter";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

const modelRegistry = new MastraModelRegistry();
modelRegistry.registerProviders({ openai, anthropic });

// Or register models explicitly:
// modelRegistry.registerModels(["gpt-4o"], (name) => openai(name));

export const client = createAgentMarkClient({
  loader,
  modelRegistry,
});
```
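To make the `"<provider>/<model>"` ID format concrete, the sketch below splits an ID at its first slash. It illustrates the naming convention only and is not the registry's actual resolution logic; `parseModelId` and `ModelRef` are hypothetical.

```typescript
type ModelRef = { provider: string; model: string };

// Split "<provider>/<model>" at the first slash; return null for IDs without a provider prefix.
function parseModelId(id: string): ModelRef | null {
  const slash = id.indexOf("/");
  if (slash <= 0 || slash === id.length - 1) return null;
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}

console.log(parseModelId("openai/gpt-4o")); // { provider: 'openai', model: 'gpt-4o' }
console.log(parseModelId("gpt-4o")); // null
```

Under this reading, the provider part would correspond to a key passed to `.registerProviders()`, while a bare name such as `gpt-4o` would need an explicit `.registerModels()` entry.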

## Running prompts

Mastra prompts go through four steps: `formatAgent()` returns an `AgentConfig`, you construct a `new Agent(agentConfig)`, `formatMessages()` (async) returns a `[messages, options]` tuple, and finally `agent.generate(messages, options)` runs the prompt:

```typescript theme={null}
import { client } from "./agentmark.client";
import { Agent } from "@mastra/core/agent";

const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
const agentConfig = await prompt.formatAgent({
  props: { name: "Alice" },
});

const agent = new Agent(agentConfig);
const [messages, options] = await agentConfig.formatMessages();
const result = await agent.generate(messages, options);

console.log(result.text);
```

<Note>
  `formatMessages()` is **async** — always `await` it. The returned tuple is `[messages, options]`, both required by `agent.generate()` / `agent.stream()`.
</Note>

## Object generation

For structured output, use object prompts. The schema lives in your `.prompt.mdx` frontmatter (`object_config.schema`), not as a second argument to `loadObjectPrompt()`:

```typescript theme={null}
import { client } from "./agentmark.client";
import { Agent } from "@mastra/core/agent";

const prompt = await client.loadObjectPrompt("sentiment.prompt.mdx");
const agentConfig = await prompt.formatAgent({
  props: { text: "This product is amazing!" },
});

const agent = new Agent(agentConfig);
const [messages, options] = await agentConfig.formatMessages();
const result = await agent.generate(messages, options);

console.log(result.object);
// { sentiment: 'positive', confidence: 0.95 }
```

## Streaming

Stream responses using `agent.stream()`:

```typescript theme={null}
import { client } from "./agentmark.client";
import { Agent } from "@mastra/core/agent";

const prompt = await client.loadTextPrompt("story.prompt.mdx");
const agentConfig = await prompt.formatAgent({
  props: { topic: "space exploration" },
});

const agent = new Agent(agentConfig);
const [messages, options] = await agentConfig.formatMessages();
const stream = await agent.stream(messages, options);

for await (const chunk of stream.textStream) {
  process.stdout.write(chunk);
}
```

## Tools

Mastra tools use the `ai` v4 `tool()` helper with **`parameters:`** (not `inputSchema:` — that's AI SDK v5). This matches Mastra's internals:

```typescript agentmark.client.ts theme={null}
import { createAgentMarkClient, MastraModelRegistry } from "@agentmark-ai/mastra-v0-adapter";
import { tool } from "ai";
import { z } from "zod";

const weatherTool = tool({
  description: "Get current weather for a location",
  parameters: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => {
    return `The weather in ${location} is sunny and 72°F`;
  },
});

export const client = createAgentMarkClient({
  loader,
  modelRegistry,
  tools: {
    weather: weatherTool,
  },
});
```

Then reference tools in your prompts:

```mdx weather.prompt.mdx theme={null}
---
name: weather
text_config:
  model_name: openai/gpt-4o
  tools:
    - weather
---

<System>You are a helpful weather assistant.</System>
<User>What's the weather in {props.location}?</User>
```

## MCP servers

Configure MCP servers for extended capabilities:

```typescript agentmark.client.ts theme={null}
import { createAgentMarkClient, MastraModelRegistry } from "@agentmark-ai/mastra-v0-adapter";

export const client = createAgentMarkClient({
  loader,
  modelRegistry,
  mcpServers: {
    filesystem: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/files"],
    },
  },
});
```

## Limitations

* **No image generation** — use the [AI SDK adapter](/integrations/typescript/ai-sdk) for `experimental_generateImage`.
* **No speech generation** — use the [AI SDK adapter](/integrations/typescript/ai-sdk) for `experimental_generateSpeech`.

## Next steps

<CardGroup>
  <Card title="Prompts" icon="file-code" href="/build/overview">
    Learn about prompt syntax
  </Card>

  <Card title="Testing" icon="flask" href="/evaluate/overview">
    Test your prompts with datasets
  </Card>

  <Card title="Observability" icon="chart-line" href="/observe/overview">
    Monitor your agents in production
  </Card>

  <Card title="Other integrations" icon="plug" href="/integrations/overview">
    Explore other AI frameworks
  </Card>
</CardGroup>



# Core concepts
Source: https://docs.agentmark.co/introduction/core-concepts

The main concepts to help you get started with AgentMark

```mermaid theme={null}
graph TD
    Org[Organization]

    App1[App 1]
    App2[App 2]
    App3[App 3]

    P1[Prompts]
    T1[Traces/Logs]
    D1[Datasets]
    E1[Evals]

    P2[Prompts]
    T2[Traces/Logs]
    D2[Datasets]
    E2[Evals]

    P3[Prompts]
    T3[Traces/Logs]
    D3[Datasets]
    E3[Evals]

    Org --> App1
    Org --> App2
    Org --> App3

    App1 --> P1
    App1 --> T1
    App1 --> D1
    App1 --> E1

    App2 --> P2
    App2 --> T2
    App2 --> D2
    App2 --> E2

    App3 --> P3
    App3 --> T3
    App3 --> D3
    App3 --> E3

    style Org fill:#fff3c4,stroke:#333,stroke-width:2px
    style App1 fill:#e2f0d9,stroke:#333,stroke-width:2px
    style App2 fill:#e2f0d9,stroke:#333,stroke-width:2px
    style App3 fill:#e2f0d9,stroke:#333,stroke-width:2px
    style P1 fill:#dae3f3,stroke:#333,stroke-width:2px
    style T1 fill:#dae3f3,stroke:#333,stroke-width:2px
    style D1 fill:#dae3f3,stroke:#333,stroke-width:2px
    style E1 fill:#dae3f3,stroke:#333,stroke-width:2px
    style P2 fill:#dae3f3,stroke:#333,stroke-width:2px
    style T2 fill:#dae3f3,stroke:#333,stroke-width:2px
    style D2 fill:#dae3f3,stroke:#333,stroke-width:2px
    style E2 fill:#dae3f3,stroke:#333,stroke-width:2px
    style P3 fill:#dae3f3,stroke:#333,stroke-width:2px
    style T3 fill:#dae3f3,stroke:#333,stroke-width:2px
    style D3 fill:#dae3f3,stroke:#333,stroke-width:2px
    style E3 fill:#dae3f3,stroke:#333,stroke-width:2px
```

The diagram shows AgentMark's three-level hierarchy. An **Organization** (yellow) contains multiple **Apps** (green), and each App owns its own set of resources (blue): Prompts, Traces/Logs, Datasets, and Evals. Resources are isolated per app — a prompt in App 1 is not visible to App 2.

## Organizations

Each organization typically corresponds to a single company and has its own billing configuration.
An organization can have multiple users, each with one of the following roles: **Owner**, **Admin**, **Write**, or **Read**. See [Users and access control](/deploy/users-and-access-control) for what each role can do.
An organization often contains multiple apps.

## Apps

Many apps can exist within an organization. Each app can be synced to a Git repository (GitHub or GitLab). Apps are isolated from each other, and each contains its own prompt templates, traces, metrics, and API keys. Use separate apps for staging, production, or dev environments.

## Branches

Each app is backed by a default branch in its connected Git repository. AgentMark reads your prompt templates, datasets, and configuration files from this branch. You can work on additional branches — for previews, staging, or review workflows — and AgentMark syncs each one independently.

## Environments

Each app runs in one or more **environments** — isolated runtimes that serve a specific version of your prompts and code. Every app starts with a default `dev` environment that tracks your connected branch's HEAD live: every push deploys to `dev` instantly. Create additional environments like `staging` and `prod` to run **pinned, immutable version snapshots** that don't change until you explicitly **promote** a tested version into them. The version running in `prod` is the one you validated in `staging`, so pushing a fix to `dev` never silently changes production. [Learn more about Environments and promotions](/deploy/environments-and-promotions).
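
The push and promote behavior described above can be pictured with a small in-memory model. This is only a conceptual sketch: the `EnvironmentModel` class and its methods are hypothetical names for illustration and are not part of the AgentMark SDK or API.

```typescript
// Conceptual model only; class and method names are hypothetical.
type Version = string; // for example, a commit SHA

class EnvironmentModel {
  private versions = new Map<string, Version>();

  // A push to the connected branch updates only `dev`.
  push(commit: Version): void {
    this.versions.set("dev", commit);
  }

  // Promotion copies the version currently running in `from` into `to`.
  promote(from: string, to: string): void {
    const version = this.versions.get(from);
    if (version === undefined) {
      throw new Error("no version deployed in " + from);
    }
    this.versions.set(to, version);
  }

  current(env: string): Version | undefined {
    return this.versions.get(env);
  }
}

const envs = new EnvironmentModel();
envs.push("a1");                 // dev now runs a1
envs.promote("dev", "staging");  // staging pinned to a1
envs.promote("staging", "prod"); // prod pinned to a1
envs.push("b2");                 // dev moves on; staging and prod stay on a1
```

In this model a later push changes `dev` only, which mirrors the guarantee that production does not change until a version is promoted.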

## Prompts

Prompts are defined in `.prompt.mdx` files — AgentMark's serialized format that bundles prompt content, reusable components, and associated evals into a single versioned artifact. Fetch them from your Git repository, or from AgentMark's secure CDN to iterate on prompts separately from your application code. [Learn more about Build](/build/overview).

## Traces

Traces capture every step from input to output. Each individual step is a span. For example, a prompt chain with 3 tool calls produces one trace containing multiple spans. [Learn more about Traces](/observe/traces-and-logs).

## Datasets

Datasets are collections of data you use to test prompts in bulk. Create datasets from your own data, public datasets, traces you've already captured in AgentMark, synthetic data, or manual entry. [Learn more about Datasets](/evaluate/datasets).

## Metrics

Metrics show you at a high level how users interact with your application — cost, latency, model usage, active users, and more. Filter metrics by time period, model, or other dimensions to drill in. [Learn more about Dashboards](/observe/dashboards).

## Evals

Evals are functions, declared in `.prompt.mdx` files, that automatically grade the outputs of your prompts. Run them locally via the CLI or SDK, or in AgentMark Cloud. Use evals to catch quality regressions before deploying to production. [Learn more about Evals](/evaluate/writing-evals).

## Sessions

Sessions group related traces to represent multi-turn conversations or workflows. For example, a chat conversation with multiple back-and-forth exchanges is tracked as a single session containing multiple traces. [Learn more about Sessions](/observe/sessions).

## Alerts

Alerts notify you when important thresholds are crossed in your application. Configure alerts for cost limits, latency spikes, error rates, and quality metrics to catch issues before they impact users. [Learn more about Alerts](/observe/alerts).

## Annotations

Annotations provide a human-in-the-loop quality assessment workflow. Team members manually label and review trace outputs to build ground-truth datasets and ensure prompt quality. [Learn more about Annotations](/evaluate/annotations).



# Deployment modes
Source: https://docs.agentmark.co/introduction/deployment-modes

The two ways to run AgentMark: local development and AgentMark Cloud.

AgentMark supports two modes:

## Local

Run everything on your machine with the CLI and SDK. No AgentMark account required.

* **Create prompts** as `.prompt.mdx` files in your editor
* **Run prompts** via CLI (`agentmark run-prompt`) or SDK
* **Run experiments** against datasets with automated evaluators
* **View traces** in your local dev server at `http://localhost:3000`
* **Iterate** with hot-reloading via `agentmark dev`

Local mode is ideal for development, prototyping, and teams that want full control over their data.

## Cloud

Take your app to production. Once you sync your app, Cloud features work automatically on top of your local workflow.

* **Production traces** — Collect and explore traces from live traffic with filtering, search, and graph view
* **Datasets & annotation queues** — Link production traces to datasets for regression testing, route them to annotation queues for human review
* **Scoring** — Run automated evaluators against production data, track quality over time
* **Dashboards & alerts** — Monitor costs, latency, error rates, and evaluation scores with customizable dashboards and real-time alerts
* **Visual prompt editor** — Create and edit prompts in the browser, run them in the Playground
* **Team collaboration** — Shared experiments, review workflows, role-based access, and version control with rollback
* **Enterprise** — SSO (SAML), custom roles, and app-level permissions. For data residency options, [contact us](mailto:hello@agentmark.co).

The local workflow keeps working after you adopt Cloud. Your `.prompt.mdx` files, local traces, and `agentmark dev` stay unchanged — Cloud is additive.

## Get started

<CardGroup>
  <Card title="Quickstart" icon="play" href="/getting-started/quickstart">
    Create your first project in under 5 minutes
  </Card>

  <Card title="Configuration" icon="gear" href="/configure/project-config">
    Configure your project and client
  </Card>
</CardGroup>



# What is AgentMark?
Source: https://docs.agentmark.co/introduction/overview

AgentMark helps teams build reliable AI agents. Manage prompts, trace executions, run evaluations, and deploy with confidence — locally or in the cloud.

AgentMark is a prompt engineering and LLM observability platform for teams building AI agents. It covers the full lifecycle: create prompts, run them against any model, trace every execution, evaluate quality, and monitor production.

Unlike most AI platforms, **AgentMark doesn't require a cloud account to get started.** Your prompts live in your codebase as `.prompt.mdx` files, traces stay on your machine, and evaluations run from your terminal. **AgentMark Cloud** adds visual editing, rich trace exploration, team collaboration, and production monitoring — when you want it.

## Two ways to work

<CardGroup>
  <Card title="Local" icon="terminal">
    Everything on your machine. Create prompts as files, run them via SDK or CLI, trace executions locally, run evaluations from your terminal. No account needed. No data leaves your environment.
  </Card>

  <Card title="Cloud" icon="cloud">
    Visual tools and collaboration on top of your local workflow. A prompt editor, trace explorer, dashboards, alerts, annotations, and team management — accessible from any browser.
  </Card>
</CardGroup>

Most teams start local and add Cloud as they grow. Some stay local-only. Both are fully supported. See [pricing](/deploy/billing-and-usage) for Cloud tier details.

The local workflow keeps working after you adopt Cloud. Your `.prompt.mdx` files, local traces, and `agentmark dev` stay unchanged — Cloud is additive.

## What you can do

### Build prompts

Create prompts as `.prompt.mdx` files in your editor, or use the visual editor in the Dashboard. Both produce the same format — you can switch between them freely.

* **TemplateDX syntax** with variables, expressions, logic, and reusable components
* **Multiple output types**: text, structured objects, images, and speech
* **Tools and function calling** for agentic workflows
* **Version control** built in — every change tracked with history and rollback

[Learn more about Build](/build/overview)

### Evaluate quality

Run evaluators from code or CLI to score outputs automatically. Use the Dashboard for human annotations and shared experiment results.

* **Datasets** for bulk testing against input/output pairs
* **Custom evaluators** — numeric scores, pass/fail, classifications, LLM-as-judge
* **Experiments** to compare prompt versions and track performance over time
* **Annotations** for human-in-the-loop scoring and labeling

[Learn more about Evaluate](/evaluate/overview)

### Observe in production

Instrument with the SDK to capture traces automatically. Explore them in your terminal (local) or in the Dashboard with filtering, search, dashboards, and alerts.

* **Distributed tracing** built on OpenTelemetry — tracks inference spans, tool calls, streaming
* **Sessions** to group related traces across multi-turn conversations
* **Cost and token tracking** across models and time periods
* **Alerts** for latency spikes, cost thresholds, error rates, and quality drops
* **[REST API](/api-reference/overview)** for programmatic access to traces, scores, and metrics
* **[`agentmark-mcp` MCP server](/sdk-reference/tools/agentmark-mcp)** exposes the gateway as MCP tools — works with both the local dev server and Cloud, and is what your IDE agent (Claude Code, Cursor, …) uses to query AgentMark headlessly

[Learn more about Observe](/observe/overview)

### Integrate with your stack

AgentMark works with the tools you already use.

* **TypeScript**: Vercel AI SDK, Claude Agent SDK, Mastra
* **Python**: Pydantic AI, Claude Agent SDK
* **Any framework** via custom adapters and OpenTelemetry

[Learn more about Integrations](/integrations/overview)

## Get started

<CardGroup>
  <Card title="Quickstart" icon="play" href="/getting-started/quickstart">
    Create your first prompt and see traces in under 5 minutes
  </Card>

  <Card title="Core Concepts" icon="table-list" href="/introduction/core-concepts">
    Organizations, apps, branches, and how they fit together
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference/overview">
    Query traces, scores, and metrics via REST API
  </Card>

  <Card title="CLI Reference" icon="terminal" href="/sdk-reference/cli/commands">
    Manage prompts, run evals, and query the API from your terminal
  </Card>
</CardGroup>



# Alerts
Source: https://docs.agentmark.co/observe/alerts

Monitor your application with customizable alerts

<Info>**Paid feature.** Alerts are available on Growth, Team, and Enterprise plans in the [AgentMark Dashboard](https://app.agentmark.co). They are not available on the Hobby (free) tier.</Info>

AgentMark's alert system monitors critical metrics and notifies you when a threshold is crossed. Stay informed about your application's performance, costs, and potential issues.

<Note>
  Webhook endpoints are configured by developers. See [Webhook documentation](/deploy/webhooks) for setup instructions.
</Note>

<img alt="Alerts dashboard" />

## Overview

Alerts help you:

* Monitor important metrics like cost, latency, and error rates
* Set thresholds for acceptable values
* Define time windows for measurement
* Receive notifications via Slack or webhooks
* Track alert history to analyze patterns

## Available metrics

| Metric                | Description                                    |
| --------------------- | ---------------------------------------------- |
| **Cost**              | Total cost of LLM calls within the time window |
| **Latency**           | Response times for AI requests                 |
| **Error rate**        | Percentage of failed requests                  |
| **Evaluation scores** | Quality of AI responses from scoring pipelines |

## Creating alerts

To create a new alert:

1. Navigate to your app's **Alerts** tab in the Dashboard
2. Click **Create alert**
3. Configure the alert:
   * **Name** — descriptive name for the alert
   * **Metric** — what to monitor
   * **Threshold** — value that triggers the alert
   * **Time window** — period over which the metric is measured
   * **Evaluation name** (for score alerts) — specific score to monitor
   * **Aggregation type** (for score alerts) — average or individual scores
   * **Threshold direction** (for score alerts) — alert above or below threshold
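
The Create alert API reference lists the same settings as request fields, with threshold units that depend on the metric. The sketch below checks those rules locally before a request is sent. The `evaluation_*` field names and the allowed values come from that reference; the `threshold` field name, the `AlertInput` type, and the `validateAlertInput` helper are assumptions for illustration, and the server may apply further validation.

```typescript
// Field names `name`, `metric`, and `evaluation_*` follow the Create alert
// reference; `threshold` and this helper are assumptions for illustration.
type AlertMetric = "error_rate" | "latency" | "cost" | "evaluation_score";

interface AlertInput {
  name: string;
  metric: AlertMetric;
  threshold: number;
  evaluation_name?: string;
  evaluation_aggregation?: "avg" | "individual";
  evaluation_threshold_direction?: "above" | "below";
}

// Returns a list of problems; an empty list means the local checks passed.
function validateAlertInput(input: AlertInput): string[] {
  const problems: string[] = [];
  const t = input.threshold;

  switch (input.metric) {
    case "error_rate":
      if (!(t >= 0 && t <= 100)) problems.push("threshold: error_rate is a percent (0-100)");
      break;
    case "latency":
      if (!(Number.isInteger(t) && t > 0)) problems.push("threshold: latency is a positive integer (ms)");
      break;
    case "cost":
      if (!(t > 0)) problems.push("threshold: cost is a positive dollar amount");
      break;
    case "evaluation_score":
      if (!(t >= 0 && t <= 1)) problems.push("threshold: evaluation_score is between 0 and 1");
      if (!input.evaluation_name) problems.push("evaluation_name is required");
      if (!input.evaluation_aggregation) problems.push("evaluation_aggregation is required");
      if (!input.evaluation_threshold_direction) problems.push("evaluation_threshold_direction is required");
      break;
  }
  return problems;
}
```

Calling `validateAlertInput` with a score alert that omits the three evaluation fields returns three problems, one per missing field.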

## Notification options

### Slack integration

Receive alerts directly in Slack:

1. Enable **Send to Slack** when creating an alert
2. Click **Connect Slack** if not already connected
3. Select the channel for notifications

### Webhooks

For custom integrations:

1. Enable **Use webhook** when creating an alert
2. Configure your webhook endpoint in the Developers section
3. Receive alerts as HTTP POST requests with alert details

For webhook payload format and implementation details, see the [Webhook documentation](/deploy/webhooks).

## Alert status

Alerts have two states:

* **Triggered** — the monitored metric has crossed the threshold.
* **Resolved** — the metric has returned to the acceptable side of the threshold.
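
As a rough illustration of the two states, the function below classifies a measured value against a threshold. It is a sketch under assumptions: the `alertStatus` name is hypothetical, non-score metrics are assumed to alert when the value rises above the threshold, and the comparison is assumed to be strict (the behavior at exact equality is not documented here).

```typescript
type ThresholdDirection = "above" | "below";
type AlertStatus = "triggered" | "resolved";

// Assumes a strict comparison; equality is treated as not crossed.
function alertStatus(
  value: number,
  threshold: number,
  direction: ThresholdDirection = "above",
): AlertStatus {
  const crossed = direction === "above" ? value > threshold : value < threshold;
  return crossed ? "triggered" : "resolved";
}

// A latency alert at 2000 ms, and a score alert that fires below 0.8.
const latencyStatus = alertStatus(2500, 2000);
const scoreStatus = alertStatus(0.9, 0.8, "below");
```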

## Alert history

View alert history to analyze patterns:

1. Navigate to the **Alerts** tab
2. Click an alert to see when it triggered, the value that triggered it, when it resolved, and how frequently it fires

## Use cases

* **Cost management** — get notified when daily spending exceeds budget
* **Performance monitoring** — alert when latency degrades beyond acceptable levels
* **Quality assurance** — track when evaluation scores drop below quality thresholds
* **Error detection** — catch error rate spikes before they impact users

## Best practices

* **Set realistic thresholds** based on your app's normal behavior, not arbitrary values.
* **Choose appropriate windows** to match your usage patterns and avoid alert fatigue.
* **Configure multiple channels** — Slack plus a webhook for critical alerts.
* **Review regularly** and adjust thresholds as your usage patterns evolve.
* **Start conservative** — begin with higher thresholds and tighten them once you understand your baseline.

## Next steps

<CardGroup>
  <Card title="Traces and logs" icon="chart-line" href="/observe/traces-and-logs">
    Monitor prompt execution
  </Card>

  <Card title="Sessions" icon="users" href="/observe/sessions">
    Track user interactions
  </Card>

  <Card title="Dashboards" icon="chart-bar" href="/observe/dashboards">
    View overall performance
  </Card>

  <Card title="Evaluations" icon="flask" href="/evaluate/writing-evals">
    Set up evaluation alerts
  </Card>
</CardGroup>



# Cost and token tracking
Source: https://docs.agentmark.co/observe/cost-and-token-tracking

Monitor LLM spending and token usage across your application

AgentMark automatically tracks costs and token usage for every LLM call. Costs are calculated from token counts using provider pricing tables, and are available at the individual trace level and aggregated across your dashboards.

<img alt="Dashboard showing average cost per request, token metrics, and cost chart over time" />

<Note>
  Developers set up observability in your application. See [Observe overview](/observe/overview) for setup instructions.
</Note>

## What AgentMark tracks

AgentMark records the following token and cost data for each LLM generation span. Embedding spans are tracked the same way, so embedding calls are included in your cost totals and analytics breakdowns.

* **Input tokens** (prompt tokens): The number of tokens in the prompt sent to the model
* **Output tokens** (completion tokens): The number of tokens in the model's response
* **Total tokens**: The sum of input and output tokens
* **Reasoning tokens**: Additional tokens used by models that support chain-of-thought reasoning (such as OpenAI o1 and o3). These tokens represent the model's internal reasoning steps before producing a response.
* **Cost**: The dollar cost of the request, calculated from token counts and the model's pricing

<Tip>
  Token counts are reported directly by the LLM provider's response. AgentMark does not estimate token counts — it uses the exact values returned by the API.
</Tip>

## How costs are calculated

AgentMark computes cost automatically based on the model used and current provider pricing:

```
cost = (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token)
```

Pricing comes from AgentMark's [model registry](https://github.com/agentmark-ai/agentmark/tree/main/packages/model-registry), which is sourced from LiteLLM and OpenRouter and refreshed at build time. Costs are calculated at ingestion and stored alongside each trace, so you always see accurate cost data without manual configuration.

<Note>
  For custom or self-hosted models not in the built-in pricing table, you can define pricing in your `agentmark.json` using model schemas. See [Custom model pricing](#custom-model-pricing) below.
</Note>

## Supported providers

AgentMark's pricing table covers models from these providers:

* **OpenAI** — GPT-4.x, GPT-3.5, o1, o3, and variants
* **Anthropic** — Claude 4, Claude 3.x, and variants
* **Google** — Gemini 2.x and variants
* **Mistral** — Mistral Large, Medium, Small, and variants
* **Cohere** — Command R, Command R+, and variants
* **xAI** — Grok models
* **DeepSeek** — DeepSeek chat and reasoning models
* **Perplexity** — Sonar models
* **Groq**, **Fireworks AI**, **Together AI** — inference providers for open-weight models (including Llama variants)
* **AWS Bedrock**, **Azure OpenAI** — cloud-hosted variants of the above

The registry is refreshed from LiteLLM and OpenRouter on every release — new models and pricing updates flow in automatically.

## Where to view cost data

### Dashboard metrics

The [Dashboards](/observe/dashboards) page shows aggregate cost data across your application:

* **Total cost** over your selected time range
* **Cost by model** to see which models drive your spending
* **Cost trends** over time to identify usage patterns
* **Average cost per request** to understand per-call economics

### Trace list

Each trace in the [Traces](/observe/traces-and-logs) list displays its cost and token counts. Use this to inspect individual requests and understand their resource consumption.

### Trace detail

When you open a trace, each generation span shows its own token breakdown:

* Input tokens, output tokens, and total tokens
* Reasoning tokens (when the model supports it)
* Cost for that specific LLM call

For traces with multiple LLM calls, the trace-level cost is the sum of all generation spans within it.
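
The relationship between span-level and trace-level cost can be sketched as follows. The `GenerationSpan` shape and the per-token prices are made up for illustration; AgentMark computes and stores these values at ingestion, so this only mirrors the arithmetic.

```typescript
// Hypothetical span shape; prices are illustrative, not real provider rates.
interface GenerationSpan {
  inputTokens: number;
  outputTokens: number;
  inputPricePerToken: number;
  outputPricePerToken: number;
}

// cost = (input_tokens x input_price_per_token) + (output_tokens x output_price_per_token)
function spanCost(span: GenerationSpan): number {
  return (
    span.inputTokens * span.inputPricePerToken +
    span.outputTokens * span.outputPricePerToken
  );
}

// Trace-level cost is the sum over all generation spans in the trace.
function traceCost(spans: GenerationSpan[]): number {
  return spans.reduce((sum, span) => sum + spanCost(span), 0);
}

const exampleSpans: GenerationSpan[] = [
  { inputTokens: 1000, outputTokens: 500, inputPricePerToken: 0.000003, outputPricePerToken: 0.000015 },
  { inputTokens: 200, outputTokens: 100, inputPricePerToken: 0.000003, outputPricePerToken: 0.000015 },
];
const exampleTraceCost = traceCost(exampleSpans); // about 0.0105 + 0.0021 dollars
```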

### Sessions

The [Sessions](/observe/sessions) view aggregates cost and token usage across all traces in a session. This is useful for understanding the total cost of multi-turn conversations or agent workflows.

### Per-user cost attribution

The Dashboard tracks cost and token usage per user when you pass a `userId` to the SDK's `span()` function (or `span_context()` in Python). Use this for billing, capacity planning, or identifying heavy users. Filter traces by user ID in the [Filtering and search](/observe/filtering-and-search) view.

## Filtering by cost and tokens

You can filter traces by cost or token count using numeric operators in the filter bar. This helps you quickly find expensive or token-heavy requests.

**Available cost and token filters:**

* **Cost (\$)** — Filter traces where cost equals, exceeds, or falls below a threshold (for example, cost > \$0.10)
* **Prompt tokens** — Filter by input token count
* **Completion tokens** — Filter by output token count

**Available operators for numeric filters:**

* `equals` / `notEquals` — Exact match
* `>` / `>=` — Greater than / greater than or equal
* `<` / `<=` — Less than / less than or equal
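
To make the operator semantics concrete, here is a small client-side sketch that applies the same operators to an in-memory list. The `TraceSummary` shape and `matchesNumericFilter` helper are hypothetical; in the Dashboard the filter bar does this for you.

```typescript
type NumericOperator = "equals" | "notEquals" | ">" | ">=" | "<" | "<=";

function matchesNumericFilter(value: number, op: NumericOperator, target: number): boolean {
  switch (op) {
    case "equals":
      return value === target;
    case "notEquals":
      return value !== target;
    case ">":
      return value > target;
    case ">=":
      return value >= target;
    case "<":
      return value < target;
    case "<=":
      return value <= target;
  }
}

// Hypothetical trace summaries for illustration.
interface TraceSummary {
  id: string;
  cost: number;
  promptTokens: number;
}

const sampleTraces: TraceSummary[] = [
  { id: "t1", cost: 0.02, promptTokens: 800 },
  { id: "t2", cost: 0.15, promptTokens: 6000 },
  { id: "t3", cost: 0.11, promptTokens: 300 },
];

// Equivalent of the "cost > $0.10" filter.
const expensiveTraces = sampleTraces.filter((t) => matchesNumericFilter(t.cost, ">", 0.1));
```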

<Tip>
  Combine cost filters with model or user filters to answer questions like "Which GPT-4o requests cost more than \$0.05?" or "Which users have the most expensive requests?"
</Tip>

## Aggregate analysis

On a dashboard, add or edit an operational widget and set a **Group by** dimension to compare cost and token usage:

* **Group by model** to compare cost efficiency across models
* **Group by user** to see per-user spending
* **Group by metadata key** (for example, `feature` or `environment`) to identify which flows drive cost
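
Conceptually, a group-by sums a metric per distinct value of the chosen dimension. The sketch below shows that aggregation over a few in-memory rows; the `CostRow` shape and `sumCostBy` helper are hypothetical and only illustrate what a grouped widget reports.

```typescript
// Hypothetical row shape for illustration.
interface CostRow {
  model: string;
  userId: string;
  cost: number;
}

// Sums cost per distinct value of the chosen dimension.
function sumCostBy(rows: CostRow[], key: "model" | "userId"): Record<string, number> {
  const totals: Record<string, number> = {};
  rows.forEach((row) => {
    const group = row[key];
    totals[group] = (totals[group] ?? 0) + row.cost;
  });
  return totals;
}

const costRows: CostRow[] = [
  { model: "gpt-4o", userId: "u1", cost: 0.02 },
  { model: "gpt-4o", userId: "u2", cost: 0.03 },
  { model: "model-b", userId: "u1", cost: 0.01 },
];

const costByModel = sumCostBy(costRows, "model");
const costByUser = sumCostBy(costRows, "userId");
```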

## Custom model pricing

For models not in the built-in pricing table (such as self-hosted models, fine-tuned models, or newer providers), you can define custom pricing in your `agentmark.json` using model schemas:

```json theme={null}
{
  "modelSchemas": {
    "my-fine-tuned-model": {
      "label": "My Fine-Tuned GPT-4o",
      "cost": {
        "inputCost": 0.005,
        "outputCost": 0.015,
        "unitScale": 1000
      }
    }
  }
}
```

| Property     | Description                                                                                   |
| ------------ | --------------------------------------------------------------------------------------------- |
| `inputCost`  | Cost per unit for input tokens                                                                |
| `outputCost` | Cost per unit for output tokens                                                               |
| `unitScale`  | Number of tokens per unit (e.g., `1000` = cost per 1K tokens, `1000000` = cost per 1M tokens) |

Custom model pricing is applied at ingestion time, the same as built-in pricing. Token counts are always tracked regardless of whether pricing is configured.
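
The way `unitScale` interacts with `inputCost` and `outputCost` can be shown with a short calculation. This is a sketch that mirrors the table above, not AgentMark's internal implementation; the `costWithUnitScale` helper is a hypothetical name.

```typescript
// Mirrors the `cost` object in the agentmark.json example above.
interface ModelCost {
  inputCost: number;  // cost per unit of input tokens
  outputCost: number; // cost per unit of output tokens
  unitScale: number;  // number of tokens in one unit
}

function costWithUnitScale(cost: ModelCost, inputTokens: number, outputTokens: number): number {
  const inputUnits = inputTokens / cost.unitScale;
  const outputUnits = outputTokens / cost.unitScale;
  return inputUnits * cost.inputCost + outputUnits * cost.outputCost;
}

// Values from the example config: prices are per 1K tokens.
const fineTunedPricing: ModelCost = { inputCost: 0.005, outputCost: 0.015, unitScale: 1000 };

// 2000 input tokens = 2 units, 500 output tokens = 0.5 units.
const exampleRequestCost = costWithUnitScale(fineTunedPricing, 2000, 500);
```

With these numbers the request costs 2 × 0.005 + 0.5 × 0.015 = 0.0175 dollars.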

For full details on model schema configuration, see [Adding models](/configure/model-schemas).

## Best practices

**Monitor cost trends regularly.** Check your Dashboard to spot unexpected cost increases early. A sudden spike may indicate a prompt regression or unexpected traffic.

**Use cost filters to find expensive requests.** Filter traces where cost exceeds your expected per-request budget. Investigate high-cost traces to see if prompts can be optimized.

**Track per-user costs for billing.** If you bill customers based on AI usage, filter traces by `user_id` to pull per-user attribution data.

**Compare model costs.** Add a dashboard widget grouped by model to evaluate whether cheaper models can handle certain tasks without quality loss.

**Set up alerts for cost thresholds.** Configure [Alerts](/observe/alerts) to notify you when cost metrics exceed acceptable levels.

## Next steps

<CardGroup>
  <Card title="Dashboards" icon="chart-bar" href="/observe/dashboards">
    View aggregate cost and usage metrics
  </Card>

  <Card title="Traces and logs" icon="chart-line" href="/observe/traces-and-logs">
    Inspect individual request costs
  </Card>

  <Card title="Alerts" icon="bell" href="/observe/alerts">
    Get notified of cost spikes
  </Card>

  <Card title="Filtering and search" icon="filter" href="/observe/filtering-and-search">
    Filter traces by cost and tokens
  </Card>
</CardGroup>



# Dashboards
Source: https://docs.agentmark.co/observe/dashboards

Monitor operational metrics, evaluation scores, and custom analytics in one place

<Info>**Cloud feature.** Dashboards are available in the [AgentMark Dashboard](https://app.agentmark.co).</Info>

AgentMark Dashboards give you a unified view of your application's health — operational metrics (cost, latency, tokens, errors), evaluation scores (distributions, trends, cross-score comparison), and custom widgets — all on one page.

<Note>
  Developers set up observability in your application. See [Observe overview](/observe/overview) for setup instructions.
</Note>

<img alt="Dashboard showing operational widgets and score analytics section" />

***

## Operational metrics

The Dashboard automatically tracks key metrics from your prompt executions:

| Category    | Metrics                                                     |
| ----------- | ----------------------------------------------------------- |
| **Cost**    | Total cost, average cost per request, cost by model         |
| **Latency** | Average latency, P50/P95/P99 percentiles, latency trends    |
| **Tokens**  | Input tokens, output tokens, total tokens, tokens by model  |
| **Volume**  | Request count, error count, error rate, unique users        |
| **Models**  | Request count per model, cost per model, top models ranking |

These appear as widgets on your Dashboard — stat cards for at-a-glance numbers, line/bar/area charts for trends.

<img alt="Operational metrics dashboard" />

***

## Score analytics

Score analytics are available as **dashboard widgets** — add them to any dashboard through the "Add widget" dialog, or start from the **Score analytics** template in the template gallery.

Four score widget types are available:

### Summary cards

Aggregated statistics for each score name:

* **Avg** — Mean score value
* **Count** — Total number of scores recorded
* **Min / Max** — Range of observed values

<img alt="Summary cards showing avg, count, min, max per score name" />

### Score distribution

The histogram shows how score values are distributed. AgentMark auto-detects the score type:

* **Numeric scores** — 10 equal-width bins between min and max
* **Categorical scores** — Bar chart by category label
* **Boolean scores** — Two bars for true/false

<img alt="Score distribution histogram" />
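
For numeric scores, equal-width binning can be sketched as below. This is an illustration of the technique, not AgentMark's implementation: the `histogram` helper is hypothetical, and placing the maximum value in the last bin and all values in the first bin when min equals max are assumptions.

```typescript
// Counts values into `binCount` equal-width bins between min and max.
function histogram(values: number[], binCount: number = 10): number[] {
  const counts: number[] = new Array<number>(binCount).fill(0);
  if (values.length === 0) return counts;

  const min = values.reduce((a, b) => Math.min(a, b));
  const max = values.reduce((a, b) => Math.max(a, b));
  const width = (max - min) / binCount;

  values.forEach((value) => {
    // width is 0 when every value is identical; use the first bin in that case.
    // The maximum value would land at index binCount, so clamp it into the last bin.
    const index = width === 0 ? 0 : Math.min(Math.floor((value - min) / width), binCount - 1);
    counts[index] += 1;
  });
  return counts;
}

const exampleBins = histogram([0, 1, 2, 10, 20]); // bins are 2 wide, from 0 to 20
```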

### Trend over time

Average score values over time, bucketed by the dashboard's selected time granularity.

<img alt="Score trend chart" />

### Score comparison

Compare two scores of the same type to see how they align across shared traces:

* **Categorical / Boolean** — Confusion matrix (N×M heatmap)
* **Numeric** — Scatter plot with paired values

<img alt="Confusion matrix comparing two boolean scores" />

<Note>
  Both scores must be the same type. Mixing numeric with categorical will show an error. The scatter plot is capped at 10,000 data points for performance.
</Note>
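
A confusion matrix for two categorical or boolean scores counts how often each pair of labels occurs on the same trace. The sketch below builds one from paired labels; the `confusionMatrix` helper and the alphabetical ordering of rows and columns are assumptions for illustration.

```typescript
interface ConfusionMatrix {
  rows: string[];     // distinct labels of the first score
  cols: string[];     // distinct labels of the second score
  counts: number[][]; // counts[i][j] = traces with rows[i] and cols[j]
}

// Each pair holds the labels of score A and score B on one shared trace.
function confusionMatrix(pairs: ReadonlyArray<readonly [string, string]>): ConfusionMatrix {
  const rows = Array.from(new Set(pairs.map((p) => p[0]))).sort();
  const cols = Array.from(new Set(pairs.map((p) => p[1]))).sort();
  const counts = rows.map(() => cols.map(() => 0));

  pairs.forEach((pair) => {
    counts[rows.indexOf(pair[0])][cols.indexOf(pair[1])] += 1;
  });
  return { rows, cols, counts };
}

const exampleMatrix = confusionMatrix([
  ["true", "true"],
  ["true", "false"],
  ["false", "false"],
  ["true", "true"],
]);
```

Large off-diagonal counts indicate traces where the two scores disagree.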

### Score types

| Score type      | Detection rule                        | Distribution       | Comparison           |
| --------------- | ------------------------------------- | ------------------ | -------------------- |
| **Numeric**     | Float values, no labels               | 10-bin histogram   | Scatter plot         |
| **Categorical** | String labels (not just true/false)   | Category bar chart | N×M confusion matrix |
| **Boolean**     | Labels are only "true" and/or "false" | Two-bar chart      | 2×2 confusion matrix |
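
The detection rules in the table can be read as a simple decision procedure. The sketch below is one possible reading, assuming each score record carries an optional `label`; the `ScoreRecord` shape and `detectScoreType` helper are hypothetical, and the actual detection logic may differ.

```typescript
// Hypothetical record shape for illustration.
interface ScoreRecord {
  score?: number;
  label?: string;
}

type ScoreType = "numeric" | "categorical" | "boolean";

function detectScoreType(records: ScoreRecord[]): ScoreType {
  const labels: string[] = [];
  records.forEach((record) => {
    if (record.label !== undefined && record.label !== "") labels.push(record.label);
  });

  // No labels at all: treat the score as numeric.
  if (labels.length === 0) return "numeric";

  // Labels limited to "true" and/or "false": boolean; anything else: categorical.
  const onlyBooleans = labels.every((label) => label === "true" || label === "false");
  return onlyBooleans ? "boolean" : "categorical";
}
```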

***

## Widgets

Dashboards are fully configurable with drag-and-drop widgets. Add any mix of operational and score widgets to create the view you need.

**Operational widgets** (stat card, line, bar, or area chart):

* Request count, error rate, cost, latency, tokens, unique users, model rankings
* Derived metrics: cost/request, tokens/request, success rate, and more

**Score widgets:**

* **Score summary** — aggregated stats for all scores
* **Score distribution** — histogram or category chart for a selected score
* **Score trend** — trend line over time for a selected score
* **Score comparison** — confusion matrix or scatter plot comparing two scores

<img alt="Custom dashboard with widgets" />

### Available metrics

**Volume:** `request_count`, `unique_users`, `total_tokens`, `avg_tokens`

**Cost:** `total_cost`, `avg_cost`

**Errors:** `error_count`, `error_rate`

**Latency:** `avg_latency`, `p50_latency`, `p95_latency`, `p99_latency`

**Rankings:** `top_models`

### Adding widgets

1. Click **+ Add widget** in the dashboard header
2. Choose a title and metric — operational metrics are under "Built-in" and "Derived", score metrics are under "Scores"
3. For score widgets, enter the score name(s) to track
4. Choose a visualization type and optional group-by dimension
5. The widget appears on the grid — drag to rearrange

Operational widgets support **group-by** dimensions (model, user, metadata key), **time granularity**, and **filters** (model, user ID, status).

***

## Templates

Start from a pre-built template or create a blank dashboard.

<img alt="Dashboard template gallery" />

| Template            | What it includes                                                         |
| ------------------- | ------------------------------------------------------------------------ |
| **Overview**        | Request volume, cost, errors, latency — stat cards + time series         |
| **Cost analysis**   | Total cost, avg cost/request, cost over time, top models by cost, tokens |
| **Performance**     | P50/P95/P99 latency, error count, error rate                             |
| **Score analytics** | Score summary, distribution, trend, and comparison widgets               |

***

## Dashboard settings

* **Default dashboard** — mark any dashboard as default to load it when you visit the Dashboards page
* **Time range** — global selector applies to all widgets on the dashboard

***

## Metrics API

You can retrieve aggregated operational metrics programmatically using the public REST API. The `GET /v1/metrics` endpoint returns a summary and an hourly time series for trace volume, latency, cost, token usage, and error rates.

```bash theme={null}
curl "https://api.agentmark.co/v1/metrics?start_date=2026-04-01T00:00:00Z&end_date=2026-04-18T00:00:00Z&extended=true" \
  -H "Authorization: Bearer am_live_abc123" \
  -H "X-Agentmark-App-Id: app_abc123"
```

Required parameters: `start_date` and `end_date` (ISO 8601). Pass `extended=true` for per-request averages and model count. See the [Metrics API reference](/api-reference/overview) for the full response schema.
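
When calling the endpoint from application code, it can help to build the query string in one place. The sketch below is a hypothetical helper (`buildMetricsUrl` and `MetricsQuery` are illustrative names, not part of the AgentMark SDK) and assumes only the three query parameters shown above.

```typescript
// Hypothetical helper for building a GET /v1/metrics URL.
// Only start_date, end_date, and extended are taken from the docs above.
interface MetricsQuery {
  startDate: Date;
  endDate: Date;
  extended?: boolean;
}

function buildMetricsUrl(baseUrl: string, query: MetricsQuery): string {
  // toISOString() always produces a UTC ISO 8601 timestamp.
  const params: string[] = [
    `start_date=${encodeURIComponent(query.startDate.toISOString())}`,
    `end_date=${encodeURIComponent(query.endDate.toISOString())}`,
  ];
  if (query.extended) {
    params.push("extended=true");
  }
  return `${baseUrl}/v1/metrics?${params.join("&")}`;
}

const metricsUrl = buildMetricsUrl("https://api.agentmark.co", {
  startDate: new Date("2026-04-01T00:00:00Z"),
  endDate: new Date("2026-04-18T00:00:00Z"),
  extended: true,
});
```

Pass the resulting URL to your HTTP client together with the `Authorization` and `X-Agentmark-App-Id` headers from the curl example.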

***

## Scores API

You can create and retrieve scores programmatically using the public REST API. This is useful for recording evaluation results, human feedback, or quality metrics from automated pipelines.

**Create a score for a span or trace:**

```bash theme={null}
curl -X POST "https://api.agentmark.co/v1/scores" \
  -H "Authorization: Bearer am_live_abc123" \
  -H "X-Agentmark-App-Id: app_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "resource_id": "span_abc123",
    "name": "quality",
    "score": 0.95,
    "label": "high_quality",
    "reason": "Response addresses all requirements."
  }'
```

**List scores for a resource:**

```bash theme={null}
curl "https://api.agentmark.co/v1/scores?resource_id=span_abc123" \
  -H "Authorization: Bearer am_live_abc123" \
  -H "X-Agentmark-App-Id: app_abc123"
```

**Get a single score by ID:**

```bash theme={null}
curl "https://api.agentmark.co/v1/scores/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer am_live_abc123" \
  -H "X-Agentmark-App-Id: app_abc123"
```

**Delete a score:**

```bash theme={null}
curl -X DELETE "https://api.agentmark.co/v1/scores/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer am_live_abc123" \
  -H "X-Agentmark-App-Id: app_abc123"
```

Scores created via the API appear in the score analytics widgets on your dashboards. See the [Scoring API reference](/api-reference/overview) for the full request and response schema.

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Filtering and search
Source: https://docs.agentmark.co/observe/filtering-and-search

Find specific traces and sessions using filters, date ranges, sorting, saved views, and shareable URLs

<Info>**Cloud feature.** Filtering and search is available in the [AgentMark Dashboard](https://app.agentmark.co).</Info>

AgentMark provides a comprehensive filtering system across both the Traces and Sessions pages. You can combine multiple filters, sort by any column, save filter configurations as views, and share filtered results via URL.

<img alt="Traces page showing filter toolbar with Filters button, date range selector, and Views dropdown" />

## Filter popover

Click the **Filters** button above the trace or session list to open the filter popover. From here you can build filter expressions by adding one or more filter rows.

Each filter row consists of three parts:

1. **Field** -- the trace attribute to filter on
2. **Operator** -- the comparison to apply
3. **Value** -- the value to match against

Click **Apply** to execute the filters, or **Clear** to reset all filter rows.

<img alt="Filter popover with Field, Operator, and Value dropdowns" />

<Tip>
  You can add multiple filter rows to narrow results further. All filters are combined with AND logic -- a trace must match every active filter to appear in the results.
</Tip>
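
To make the AND semantics concrete, here is a small client-side sketch. It is not AgentMark's implementation: the `FilterRow` shape, the operator subset, and the sample trace records are assumptions made for illustration.

```typescript
// Illustrative model of filter rows combined with AND logic.
type Operator = "equals" | "not equals" | "contains" | ">=" | "<";

interface FilterRow {
  field: string;
  operator: Operator;
  value: string | number;
}

type TraceRecord = Record<string, string | number | undefined>;

function matchesRow(trace: TraceRecord, row: FilterRow): boolean {
  const actual = trace[row.field];
  // In this sketch, a missing field never matches.
  if (actual === undefined) return false;
  switch (row.operator) {
    case "equals":
      return actual === row.value;
    case "not equals":
      return actual !== row.value;
    case "contains":
      return String(actual).indexOf(String(row.value)) !== -1;
    case ">=":
      return typeof actual === "number" && typeof row.value === "number" && actual >= row.value;
    case "<":
      return typeof actual === "number" && typeof row.value === "number" && actual < row.value;
  }
}

// A trace appears in the results only if every row matches.
function matchesAll(trace: TraceRecord, rows: FilterRow[]): boolean {
  return rows.every((row) => matchesRow(trace, row));
}

const sampleTraces: TraceRecord[] = [
  { model_used: "gpt-4o", status: "OK", latency_ms: 1200 },
  { model_used: "gpt-4o", status: "ERROR", latency_ms: 300 },
  { model_used: "claude-sonnet-4-20250514", status: "OK", latency_ms: 2500 },
];

const activeRows: FilterRow[] = [
  { field: "model_used", operator: "equals", value: "gpt-4o" },
  { field: "latency_ms", operator: ">=", value: 1000 },
];

const matched = sampleTraces.filter((trace) => matchesAll(trace, activeRows));
```

Only the first sample trace satisfies both rows. Adding a row can only shrink the result set, which is why stacking filters narrows results.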

## Available filter fields

### String fields

These fields support: **equals**, **not equals**, **contains**, **not contains**, **starts with**, and **ends with**.

* **Model** (`model_used`) -- the LLM model used for the inference (e.g., `gpt-4o`, `claude-sonnet-4-20250514`)
* **User ID** (`user_id`) -- the user identifier attached to the trace
* **Prompt** (`prompt_name`) -- the prompt or function name
* **Session** (`session_id`) -- the session identifier grouping related traces

### Enum fields

* **Status** (`status`) -- filter by trace outcome. Values: **OK** or **ERROR**. Supports **equals** and **not equals**.
* **Span kind** (`semantic_kind`) -- filter by the semantic type of span. Values: **function**, **llm**, **tool**, **agent**, **retrieval**, **embedding**, **guardrail**. Supports **equals** and **not equals**.

### Numeric fields

These fields support: **equals**, **not equals**, `<`, `<=`, `>`, and `>=`.

* **Latency (ms)** (`latency_ms`) -- total execution time in milliseconds
* **Cost (\$)** (`cost`) -- the computed cost of the inference
* **Prompt tokens** (`prompt_tokens`) -- number of input tokens consumed
* **Completion tokens** (`completion_tokens`) -- number of output tokens generated

### Content fields

These fields support **contains** and **starts with** only.

* **Input** (`input`) -- the prompt input text
* **Output** (`output`) -- the model response text

### Tags

* **Tags** (`tags`) -- trace-level tag labels. Supports **equals**, **not equals**, and **contains**.

### Dynamic fields

<img alt="Filter field dropdown showing all available fields including dynamic Score and Metadata entries" />

These fields are auto-populated from your application's trace data.

* **Metadata: \{key}** (`metadata.*`) -- any custom metadata key attached to your traces. Supports the string operators plus **exists** and **does not exist**.
* **Score: \{name}** (`score__*`) -- evaluation score names from your app's scoring pipeline. Supports the numeric operators.

## Date range

Both the Traces and Sessions pages include a date range selector. Use the preset ranges or define a custom window:

* **Today** -- traces from the current day
* **7d** -- last 7 days
* **30d** -- last 30 days
* **90d** -- last 90 days
* **Custom** -- pick a specific start and end date (UTC)

The date range applies alongside any active filters.

## Column sorting

Click a sortable column header in the trace or session list to sort by that column. The first click sorts ascending; a second click switches to descending. Only one sort column is active at a time.

## Active filters bar

<img alt="Active filter chips showing Model equals gpt-4o with filtered trace results" />

When filters are applied, they appear as removable chips above the results list. Each chip shows the field, operator, and value. Click the **X** on a chip to remove that individual filter without clearing the rest.

## Score filtering

AgentMark automatically detects evaluation score names from your trace data and makes them available as filter fields. This lets you find traces based on quality metrics from your evaluation pipeline.

To filter by score:

1. Open the filter popover
2. Select a score field (e.g., **Score: accuracy**)
3. Choose a numeric operator (e.g., `>=`)
4. Enter the threshold value (e.g., `0.8`)
5. Click **Apply**

<Note>
  Score names are dynamically populated. You will only see score fields that exist in your application's trace data.
</Note>

## Metadata filtering

If you attach custom metadata to your traces (via `agentmark.metadata.*` attributes), those keys are automatically discovered and available as filter fields.

To filter by metadata:

1. Open the filter popover
2. Select a metadata field (e.g., **Metadata: environment**)
3. Choose an operator (e.g., **equals**)
4. Enter the value (e.g., `production`)
5. Click **Apply**

You can also use the **exists** and **does not exist** operators to find traces that have (or lack) a specific metadata key, regardless of value.

<Tip>
  Use metadata filters to separate environments, feature flags, A/B test variants, or any other custom dimensions you track.
</Tip>

## Tag filtering

Tags are string labels attached to traces via the SDK for categorization by environment, team, feature, experiment, or release. Tags appear as a column in the trace list and are filterable through the filter popover.

To filter by tags:

1. Open the filter popover
2. Select **Tags** from the field dropdown
3. Choose an operator and enter the tag value
4. Click **Apply**

You can combine tag filters with other filters (metadata, model, status) to narrow results further.

The Tags field supports **equals**, **not equals**, and **contains** — a narrower set than most string fields. For a full guide on setting tags, naming conventions, and best practices, see the [Tags documentation](/observe/tags).

## Sessions filtering

The Sessions page has its own filtering and search capabilities:

* **Date range** -- the same date range selector as the Traces page (Today, 7d, 30d, 90d, Custom)
* **Search** -- search by session ID or session name using the search bar
* **Sort** -- sort sessions by cost, total tokens, duration, or trace count by clicking column headers
* **User filter** -- filter sessions by user ID to see all sessions for a specific user

For full details on sessions, see the [Sessions documentation](/observe/sessions).

## Saved views

<img alt="Saved Views dropdown with Save button and empty state" />

Use the **Views** dropdown to save and restore filter, sort, and date range configurations. Saved views let you quickly switch between commonly used filter sets without rebuilding them each time.

To save a view:

1. Configure your desired filters, sort, and date range
2. Click the **Views** dropdown
3. Select **Save current view**
4. Give the view a name

To restore a view, open the **Views** dropdown and select a previously saved view. Saved views are available on both the Traces and Sessions pages.

## URL parameters

All filter state -- including active filters, sort column, sort direction, and date range -- is persisted in URL query parameters. This means you can:

* **Bookmark** a filtered view for quick access
* **Share** a URL with teammates to show them the exact same filtered results
* **Link** from alerts, dashboards, or external tools directly to a filtered trace list

<Note>
  When you apply or remove filters, the URL updates automatically. Copy the URL from your browser's address bar to share the current view.
</Note>

## Programmatic span search

The Dashboard filters apply to the Traces and Sessions UI. To search spans programmatically across all traces, use the `GET /v1/spans` REST endpoint (or the equivalent MCP tool exposed by the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) server). This lets you filter by span type, status, model, name, and duration range without browsing individual traces.

See [Cross-trace span search](/observe/traces-and-logs#cross-trace-span-search) for examples and available filters, or the [API reference](/api-reference/overview) for the full endpoint specification.

## Next steps

<CardGroup>
  <Card title="Traces and logs" icon="chart-line" href="/observe/traces-and-logs">
    Understand trace details and span attributes
  </Card>

  <Card title="Sessions" icon="users" href="/observe/sessions">
    Group related traces together
  </Card>

  <Card title="Dashboards" icon="chart-bar" href="/observe/dashboards">
    Track usage, costs, and performance
  </Card>

  <Card title="Alerts" icon="bell" href="/observe/alerts">
    Get notified of critical issues
  </Card>
</CardGroup>



# Metadata
Source: https://docs.agentmark.co/observe/metadata

Attach custom key-value pairs to traces for filtering, debugging, and context

Metadata lets you attach custom key-value pairs to your AgentMark traces. Use metadata to add context like user IDs, environment names, feature flags, request IDs, and customer tiers — then filter and search by those values in the Dashboard.

<Note>
  Metadata is set in your application code. See [Tracing setup](/observe/tracing-setup) for setup instructions.
</Note>

## Setting metadata

There are two ways to attach metadata to traces in the AgentMark SDK: via the `telemetry.metadata` object when formatting a prompt, and via the `span()` function's `metadata` option when grouping traces.

### Via telemetry metadata

Pass metadata when formatting a prompt. These key-value pairs are attached to the resulting span:

```typescript theme={null}
import { AgentMarkSDK } from "@agentmark-ai/sdk";
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";

const sdk = new AgentMarkSDK({
  apiKey: process.env.AGENTMARK_API_KEY,
  appId: process.env.AGENTMARK_APP_ID,
});

sdk.initTracing();

const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai });

const client = createAgentMarkClient({
  loader: sdk.getApiLoader(),
  modelRegistry,
});

const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
const input = await prompt.format({
  props: { query: "Hello" },
  telemetry: {
    isEnabled: true,
    functionId: "my-function",
    metadata: {
      userId: "user-123",
      environment: "production",
      feature: "chat-v2",
      customerTier: "enterprise",
    },
  },
});

const result = await generateText(input);
```

### Via the span function

Pass metadata when creating a trace group. All spans within the trace inherit this metadata:

```typescript theme={null}
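import { span } from "@agentmark-ai/sdk";
import { generateText } from "ai";

// `client` is the AgentMark client created in the previous example.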

const { result, traceId } = await span(
  {
    name: "my-workflow",
    metadata: {
      requestId: "req-abc-123",
      version: "2.1.0",
      environment: "production",
    },
  },
  async (ctx) => {
    const prompt = await client.loadTextPrompt("handler.prompt.mdx");
    const input = await prompt.format({
      props: { query: "What is AgentMark?" },
      telemetry: { isEnabled: true },
    });

    return await generateText(input);
  }
);
```

`SpanOptions.metadata` is typed as `Record<string, string>` — values must be strings. Convert numbers, booleans, and other types before passing.

<Tip>
  You can combine both approaches. Metadata on `span()` applies to the parent trace, while `telemetry.metadata` applies to individual LLM-call spans within that trace.
</Tip>

## Filtering by metadata

In the AgentMark Dashboard, metadata keys are auto-discovered from your trace data. You can filter traces by any metadata key that appears in your data.

To filter by metadata:

1. Navigate to the **Traces** tab in the Dashboard.
2. Open the filter dropdown.
3. Look for entries prefixed with **Metadata:** followed by the key name (for example, "Metadata: userId").
4. Select the key you want to filter by.
5. Choose an operator and enter a value.

AgentMark supports the following filter operators for metadata values:

* **equals** / **not equals** — value matches exactly / does not match
* **contains** / **not contains** — value includes / excludes the specified substring
* **starts with** — value begins with the specified string
* **ends with** — value ends with the specified string
* **exists** — the key is present, regardless of value
* **does not exist** — the key is not present on the trace

Filters can be combined and saved as views — see [Filtering and search](/observe/filtering-and-search).

## Metadata in the trace detail

When viewing an individual trace in the Dashboard, metadata appears in the attributes section. All key-value pairs you attached are displayed, making it easy to see the full context of a trace without switching to your application logs.

## How metadata is stored

AgentMark stores metadata as a `Map(LowCardinality(String), String)` column in ClickHouse. Keys are indexed for fast filtering and search. All values are stored as strings — if you need to attach non-string values, convert them first:

```typescript theme={null}
metadata: {
  userId: "user-123",
  resultCount: String(results.length),   // number → string
  isRetry: String(isRetry),              // boolean → string
}
```
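
If you often build metadata from mixed-type values, a small helper can apply these conversions in one place. `toStringMetadata` is a hypothetical helper, not an SDK export. Skipping `null` and `undefined` and JSON-encoding objects are choices made for this sketch.

```typescript
// Hypothetical helper: coerce arbitrary values into Record<string, string>.
function toStringMetadata(input: Record<string, unknown>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(input)) {
    const value = input[key];
    // Skip missing values instead of storing the strings "undefined" or "null".
    if (value === undefined || value === null) continue;
    // Objects and arrays are JSON-encoded; primitives use String().
    out[key] = typeof value === "object" ? JSON.stringify(value) : String(value);
  }
  return out;
}

const exampleMetadata = toStringMetadata({
  userId: "user-123",
  resultCount: 3,
  isRetry: false,
  missing: undefined,
});
```

The result can be passed as `metadata` to `span()` or `telemetry`, since every value is now a string.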

## Best practices

### Recommended metadata keys

* **`user_id`** — Per-user debugging and cost attribution. Example: `"user-123"`. Use snake\_case to populate the trace's user field (camelCase `userId` is stored as ordinary metadata).
* **`session_id`** — Group related traces into a session. Must be snake\_case in metadata (or use the top-level `span()` `sessionId` option). Example: `"sess-abc"`
* **`environment`** — Distinguish staging from production when using a single app. Example: `"production"`
* **`version`** — Track which application version generated the trace. Example: `"2.1.0"`
* **`requestId`** — Correlate AgentMark traces with your application logs. Example: `"req-xyz"`
* **`feature`** — Identify which feature or flow triggered the trace. Example: `"chat-v2"`

### Tips

* **Use consistent key names** across your application. If one service sends `userId` and another sends `user_id`, they appear as separate keys in the Dashboard.
* **Keep values short.** Metadata is designed for identifiers and labels, not large payloads.
* **Use metadata for anything you want to filter by later.** If you find yourself searching your application logs for a value, it is a good candidate for metadata.
* **Metadata keys are case-sensitive.** `userId` and `userid` are treated as different keys.
* **All values are stored as strings.** Convert numbers, booleans, and other types before passing — `SpanOptions.metadata` and `telemetry.metadata` are both typed `Record<string, string>`.

## Next steps

<CardGroup>
  <Card title="Traces and logs" icon="chart-line" href="/observe/traces-and-logs">
    Understand trace details and span attributes
  </Card>

  <Card title="Sessions" icon="users" href="/observe/sessions">
    Group related traces together
  </Card>

  <Card title="Filtering and search" icon="filter" href="/observe/filtering-and-search">
    Filter traces by metadata keys
  </Card>

  <Card title="Tracing setup" icon="code" href="/observe/tracing-setup">
    Set up tracing in your application
  </Card>
</CardGroup>



# Observe
Source: https://docs.agentmark.co/observe/overview

Monitor, debug, and optimize your LLM applications with tracing, dashboards, and alerts

Instrument with the SDK to capture traces automatically. Explore them in your terminal (local) or in the Dashboard with filtering, search, graph view, dashboards, and alerts.

Built on [OpenTelemetry](https://opentelemetry.io/), AgentMark automatically collects telemetry data from your prompts and provides actionable insights.

## Core concepts

### Traces

A trace represents a complete request or workflow in your application. Each trace is identified by a unique trace ID and contains one or more spans. Traces carry top-level attributes such as metadata, tags, user ID, and session ID.

### Spans

Spans are individual operations within a trace, forming a tree structure. AgentMark records three span types:

* **`ai.inference`** — the full lifecycle of an LLM call, including model, tokens, cost, and response
* **`ai.toolCall`** — a single tool execution, including name, arguments, and result
* **`ai.stream`** — streaming response metrics such as time to first token and tokens per second

### Span kinds

Every span has a semantic kind that categorizes the operation. Span kinds determine how spans can be [filtered](/observe/filtering-and-search) and how analytics are grouped on [dashboards](/observe/dashboards).

| Kind          | Description                                        |
| ------------- | -------------------------------------------------- |
| **function**  | Generic computation step (default)                 |
| **llm**       | A call to a language model                         |
| **tool**      | An external tool or API call                       |
| **agent**     | An orchestration loop that decides what to do next |
| **retrieval** | A vector database query or document search         |
| **embedding** | A call to an embedding model                       |
| **guardrail** | A content safety or validation check               |

Set span kinds using [`observe()`](/observe/tracing-setup#wrapping-functions-with-observe) or `ctx.span()`.

### Sessions

Sessions group related traces together by session ID. Track multi-turn conversations, agent workflows, and batch processing runs. Each session aggregates cost, tokens, and latency across its traces.

[Learn more about Sessions →](/observe/sessions)
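
As a rough illustration of that aggregation, the sketch below groups trace summaries by session ID and sums their totals. The `TraceSummary` shape is hypothetical, and summing latency is an assumption; the Dashboard may compute session latency differently.

```typescript
// Hypothetical trace summary; field names are illustrative only.
interface TraceSummary {
  sessionId: string;
  cost: number;
  tokens: number;
  latencyMs: number;
}

interface SessionTotals {
  traceCount: number;
  cost: number;
  tokens: number;
  latencyMs: number;
}

function aggregateBySession(traces: TraceSummary[]): Record<string, SessionTotals> {
  const sessions: Record<string, SessionTotals> = {};
  for (const trace of traces) {
    // Start a new bucket the first time a session ID is seen.
    const totals = sessions[trace.sessionId] ?? { traceCount: 0, cost: 0, tokens: 0, latencyMs: 0 };
    totals.traceCount += 1;
    totals.cost += trace.cost;
    totals.tokens += trace.tokens;
    totals.latencyMs += trace.latencyMs;
    sessions[trace.sessionId] = totals;
  }
  return sessions;
}

const sessionTotals = aggregateBySession([
  { sessionId: "session-abc", cost: 0.25, tokens: 100, latencyMs: 400 },
  { sessionId: "session-abc", cost: 0.5, tokens: 50, latencyMs: 600 },
  { sessionId: "session-xyz", cost: 1, tokens: 10, latencyMs: 100 },
]);
```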

### Scores

Numeric evaluations attached to spans or traces. Set scores programmatically via the SDK using `sdk.score()`, or manually through [annotations](/evaluate/annotations) in the Dashboard.

### Metadata and tags

**Metadata** — Custom key-value pairs attached to traces for context (environment, feature flags, customer tier). Automatically discovered as filter fields.

**Tags** — String labels for categorization (environment, team, feature, release).

[Metadata →](/observe/metadata) · [Tags →](/observe/tags)

## What gets tracked

**Inference spans** — Full prompt execution lifecycle: token usage, costs, response times, model information, completion status.

**Tool calls** — Tool name, parameters, execution duration, success/failure status, return values.

**Streaming metrics** — Time to first token, tokens per second, total streaming duration.

**Sessions** — Group related traces by user interaction, multi-step workflow, or batch run.

**Alerts** — Monitor cost thresholds, latency spikes, error rates, and evaluation scores.

## Quick start

Enable telemetry when formatting your prompts:

```typescript theme={null}
import { client } from './agentmark.client';
import { generateText } from 'ai';

const prompt = await client.loadTextPrompt('greeting.prompt.mdx');
const input = await prompt.format({
  props: { name: 'Alice' },
  telemetry: {
    isEnabled: true,
    functionId: 'greeting-handler',
    // Session/user keys in metadata must be snake_case to group traces (see Sessions)
    metadata: {
      user_id: 'user-123',
      session_id: 'session-abc',
      session_name: 'Customer Support Chat'
    }
  }
});

const result = await generateText(input);
```

For full tracing setup including `AgentMarkSDK`, child spans, `observe()`, and span kinds, see [Tracing Setup](/observe/tracing-setup).

## How data flows

Your application sends telemetry via the AgentMark SDK, which exports OpenTelemetry spans to the AgentMark gateway. The gateway processes and stores the data, powering the traces, metrics, and analytics views.

<Tabs>
  <Tab title="Cloud">
    Spans are exported to the AgentMark Cloud gateway and stored in ClickHouse. View traces, dashboards, alerts, and analytics in the Dashboard.
  </Tab>

  <Tab title="Local">
    When running `npx agentmark dev`, traces are sent to `http://localhost:9418` automatically. View them in the local dev server UI at `http://localhost:3000`.

    The local dev server exposes most of the REST API so the same endpoints work against local data. The `/v1/capabilities` endpoint reports which features a given server supports — `metrics` and `score_analytics` are Cloud-only and return `501 not_available_locally`. Trace export is also Cloud-only (there is no local `/v1/traces/export` route).
  </Tab>
</Tabs>

## Programmatic access

You can query traces, spans, sessions, scores, metrics, datasets, experiments, prompts, and runs programmatically through the [REST API](/api-reference/overview), or from an IDE agent via the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server (which exposes one MCP tool per gateway operation). Use either to build custom integrations, pull data into external tools, or automate monitoring workflows.

Most endpoints are available on both the local dev server and the AgentMark Cloud gateway. The local server returns `501 not_available_locally` for features that require ClickHouse aggregations (`/v1/metrics` and score analytics). Trace export (`/v1/traces/export`) is Cloud-only too, but has no local route — it returns `404`. Use the [`capabilities`](/api-reference/overview) endpoint to check which features a server supports.

```bash theme={null}
# Query traces from the local dev server
curl -fsS "http://localhost:9418/v1/traces?limit=10"

# Get a specific trace with all its spans
curl -fsS "http://localhost:9418/v1/traces/<traceId>"

# Same calls against Cloud — set AGENTMARK_API_KEY + AGENTMARK_APP_ID
curl -fsS "https://api.agentmark.co/v1/traces?limit=10" \
  -H "Authorization: Bearer $AGENTMARK_API_KEY" \
  -H "X-Agentmark-App-Id: $AGENTMARK_APP_ID"

# Check which features are available on a given server
curl -fsS "http://localhost:9418/v1/capabilities"
```

The same operations are available as MCP tools (`list_traces`, `get_trace`, `get_capabilities`, …) when you run the `agentmark-mcp` server alongside your IDE.
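
A client that targets both the local dev server and Cloud can branch on the status codes described above. The sketch below is a hypothetical helper, not part of the SDK. It assumes the request path is otherwise valid, since a `404` can also mean a missing resource.

```typescript
// Hypothetical classification of HTTP status codes for feature availability.
type Availability = "available" | "cloud_only" | "no_route" | "error";

function classifyStatus(status: number): Availability {
  if (status >= 200 && status < 300) return "available";
  // 501 not_available_locally: the route exists but the feature needs Cloud.
  if (status === 501) return "cloud_only";
  // 404 on a known path: the server has no such route (e.g. local trace export).
  if (status === 404) return "no_route";
  return "error";
}

const metricsOnLocal = classifyStatus(501);
const exportOnLocal = classifyStatus(404);
```

Checking `/v1/capabilities` up front remains the more direct approach; a status check like this is a fallback for when a call has already been made.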

## Next steps

<CardGroup>
  <Card title="Tracing setup" icon="code" href="/observe/tracing-setup">
    Instrument your app with the SDK
  </Card>

  <Card title="Traces and logs" icon="chart-line" href="/observe/traces-and-logs">
    View execution timelines in the Dashboard
  </Card>

  <Card title="Sessions" icon="users" href="/observe/sessions">
    Group related traces together
  </Card>

  <Card title="Alerts" icon="bell" href="/observe/alerts">
    Get notified of critical issues
  </Card>

  <Card title="Dashboards" icon="chart-bar" href="/observe/dashboards">
    Analyze usage, performance, and scores
  </Card>

  <Card title="API reference" icon="code" href="/api-reference/overview">
    Query traces, scores, and metrics via REST API
  </Card>
</CardGroup>



# PII masking
Source: https://docs.agentmark.co/observe/pii-masking

Redact sensitive data from traces before they leave your application

AgentMark PII masking strips sensitive data from span attributes before traces are exported. Masking runs in your application process, so configured attributes are redacted before the OTel exporter sends them over the network.

You can use a custom mask function, the built-in PII masker, environment variable suppression, or any combination of these approaches. Coverage depends on your configuration — only attributes you target are masked.

***

## How it works

PII masking is implemented as an OpenTelemetry `SpanProcessor` that wraps the export pipeline. When a span finishes, the masking processor intercepts it before the exporter sends data over the network.

<Steps>
  <Step title="Span finishes">
    Your application completes an LLM call or tool invocation. The span is ready for export.
  </Step>

  <Step title="MaskingSpanProcessor intercepts">
    If masking is configured, the processor runs env var suppression first, then your mask function on each sensitive attribute.
  </Step>

  <Step title="Redacted span exported (or dropped)">
    If masking succeeds, the redacted span is forwarded to the exporter. If the mask function throws, the span is **dropped entirely** (fail-closed) and a warning is logged.
  </Step>
</Steps>

This means:

* **Configured attributes are redacted before export.** The processor runs in-memory, in your application, before any network call — so attributes you mask never leave the process in their raw form.
* **Zero overhead when masking is disabled.** The processor is only added to the pipeline when you configure a `mask` function or set env vars.
* **Standard OTel pattern.** The `MaskingSpanProcessor` wraps your existing `BatchSpanProcessor` or `SimpleSpanProcessor` — no forking or patching required.
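
The flow above can be sketched as a pure function. This is a simplified model, not the SDK's `MaskingSpanProcessor`: the `redactSpan` name, the `hideInputs` flag (standing in for `AGENTMARK_HIDE_INPUTS=true`), and the choice of maskable attribute keys are assumptions for illustration.

```typescript
type Attributes = Record<string, string | number>;
type MaskFn = (data: string) => string;

interface MaskingOptions {
  mask?: MaskFn;
  hideInputs?: boolean; // stands in for AGENTMARK_HIDE_INPUTS=true
}

// Assumed maskable keys, borrowed from the example payloads on this page.
const INPUT_KEY = "gen_ai.request.input";
const OUTPUT_KEY = "gen_ai.response.output";

// Returns redacted attributes, or null when the span should be dropped.
function redactSpan(attrs: Attributes, options: MaskingOptions): Attributes | null {
  const out: Attributes = { ...attrs };
  // Step 1: suppression runs before the mask function.
  if (options.hideInputs && INPUT_KEY in out) {
    out[INPUT_KEY] = "[REDACTED]";
  }
  // Step 2: apply the mask function to each maskable string attribute.
  if (options.mask) {
    try {
      for (const key of [INPUT_KEY, OUTPUT_KEY]) {
        const value = out[key];
        if (typeof value === "string") out[key] = options.mask(value);
      }
    } catch {
      // Fail-closed: a throwing mask drops the span instead of leaking raw data.
      return null;
    }
  }
  return out;
}

const exampleSpan: Attributes = {
  "gen_ai.request.input": "My SSN is 123-45-6789",
  "gen_ai.response.output": "Done",
  "gen_ai.request.model": "gpt-4o",
};
const ssnMask: MaskFn = (data) => data.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]");

const redacted = redactSpan(exampleSpan, { mask: ssnMask });
const dropped = redactSpan(exampleSpan, {
  mask: () => {
    throw new Error("mask failed");
  },
});
```

Returning `null` on failure mirrors the fail-closed behavior: losing a span is treated as preferable to exporting unmasked data.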

### Before and after

With `createPiiMasker()` enabled, PII tokens like `[EMAIL]`, `[SSN]`, and `[PHONE]` replace sensitive data in the trace viewer:

<Frame>
  <img alt="Trace with PII masking enabled — sensitive data replaced with tokens like [EMAIL], [SSN], [PHONE]" />
</Frame>

With `AGENTMARK_HIDE_INPUTS=true`, all input attributes show `[REDACTED]` while outputs remain visible:

<Frame>
  <img alt="Trace with input suppression — all inputs show [REDACTED]" />
</Frame>

Here's what the raw span attributes look like with each approach:

**Without masking:**

```json theme={null}
{
  "gen_ai.request.input": "My SSN is 123-45-6789 and email is user@example.com",
  "gen_ai.response.output": "I found your account linked to user@example.com",
  "gen_ai.request.model": "gpt-4o",
  "gen_ai.usage.total_tokens": 150
}
```

**With `createPiiMasker` (email and SSN patterns enabled):**

```json theme={null}
{
  "gen_ai.request.input": "My SSN is [SSN] and email is [EMAIL]",
  "gen_ai.response.output": "I found your account linked to [EMAIL]",
  "gen_ai.request.model": "gpt-4o",
  "gen_ai.usage.total_tokens": 150
}
```

**With `AGENTMARK_HIDE_INPUTS=true`:**

```json theme={null}
{
  "gen_ai.request.input": "[REDACTED]",
  "gen_ai.response.output": "I found your account linked to user@example.com",
  "gen_ai.request.model": "gpt-4o",
  "gen_ai.usage.total_tokens": 150
}
```

Notice that `gen_ai.request.model` and `gen_ai.usage.total_tokens` are never masked — these operational attributes contain no user data.

***

## Basic usage

Pass a `mask` function to `AgentMarkSDK`. The function receives each string attribute value and returns the redacted version.

<CodeGroup>
  ```typescript TypeScript theme={null}
  import { AgentMarkSDK } from '@agentmark-ai/sdk';

  const sdk = new AgentMarkSDK({
    apiKey: 'am_...',
    appId: 'app-123',
    mask: (data) => data.replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN]'),
  });
  sdk.initTracing();
  ```

  ```python Python theme={null}
  import re
  from agentmark_sdk import AgentMarkSDK

  sdk = AgentMarkSDK(
      api_key="am_...",
      app_id="app-123",
      mask=lambda data: re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", data),
  )
  sdk.init_tracing()
  ```
</CodeGroup>

The `mask` function is called on every maskable span attribute before the span is handed to the exporter. The function must be synchronous. You have full control over the replacement logic.
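
Because `mask` is a plain synchronous function, several replacement rules can be folded into one. In this sketch, `createRuleMask` is a hypothetical helper and both regexes are illustrative; they are not the patterns used by the SDK's built-in masker.

```typescript
type MaskFn = (data: string) => string;

interface Rule {
  pattern: RegExp; // should carry the global flag so every match is replaced
  replacement: string;
}

// Hypothetical helper: apply each rule in order to the attribute value.
function createRuleMask(rules: Rule[]): MaskFn {
  return (data) =>
    rules.reduce((text, rule) => text.replace(rule.pattern, rule.replacement), data);
}

const combinedMask = createRuleMask([
  { pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: "[SSN]" },
  { pattern: /[\w.+-]+@[\w-]+\.[\w.-]+/g, replacement: "[EMAIL]" },
]);

const maskedText = combinedMask("My SSN is 123-45-6789 and email is user@example.com");
// maskedText === "My SSN is [SSN] and email is [EMAIL]"
```

A function built this way can be passed as the `mask` option. Rules run in order, so place more specific patterns before broader ones.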

***

## Built-in PII masker

AgentMark ships a built-in PII masker that covers common patterns out of the box. Enable the patterns you need:

<CodeGroup>
  ```typescript TypeScript theme={null}
  import { AgentMarkSDK, createPiiMasker } from '@agentmark-ai/sdk';

  const sdk = new AgentMarkSDK({
    apiKey: 'am_...',
    appId: 'app-123',
    mask: createPiiMasker({
      email: true,
      phone: true,
      ssn: true,
      creditCard: true,
      ipAddress: true,
    }),
  });
  sdk.initTracing();
  ```

  ```python Python theme={null}
  from agentmark_sdk import AgentMarkSDK, create_pii_masker, PiiMaskerConfig

  sdk = AgentMarkSDK(
      api_key="am_...",
      app_id="app-123",
      mask=create_pii_masker(PiiMaskerConfig(
          email=True,
          phone=True,
          ssn=True,
          credit_card=True,
          ip_address=True,
      )),
  )
  sdk.init_tracing()
  ```
</CodeGroup>

### Built-in patterns

* **`email`** — Matches email addresses like `user@example.com`. Replaced with `[EMAIL]`.
* **`phone`** — Matches phone numbers like `(555) 123-4567`. Replaced with `[PHONE]`.
* **`ssn`** — Matches Social Security numbers like `123-45-6789`. Replaced with `[SSN]`.
* **`creditCard`** / **`credit_card`** — Matches credit card numbers like `4111 1111 1111 1111`. Replaced with `[CREDIT_CARD]`.
* **`ipAddress`** / **`ip_address`** — Matches IP addresses like `192.168.1.100`. Replaced with `[IP_ADDRESS]`.

All patterns default to `false`. Only patterns you explicitly enable are applied.

<Note>
  TypeScript uses camelCase (`creditCard`, `ipAddress`) while Python uses snake\_case (`credit_card`, `ip_address`) following each language's conventions.
</Note>

***

## Custom patterns

You can add custom patterns alongside the built-in ones. Each entry needs a regex pattern and a replacement string.

<CodeGroup>
  ```typescript TypeScript theme={null}
  import { AgentMarkSDK, createPiiMasker } from '@agentmark-ai/sdk';

  const sdk = new AgentMarkSDK({
    apiKey: 'am_...',
    appId: 'app-123',
    mask: createPiiMasker({
      email: true,
      custom: [
        { pattern: /MRN-\d+/g, replacement: '[MEDICAL_RECORD]' },
        { pattern: /ACCT-[A-Z0-9]+/g, replacement: '[ACCOUNT_ID]' },
      ],
    }),
  });
  sdk.initTracing();
  ```

  ```python Python theme={null}
  import re
  from agentmark_sdk import AgentMarkSDK, create_pii_masker, PiiMaskerConfig, CustomPattern

  sdk = AgentMarkSDK(
      api_key="am_...",
      app_id="app-123",
      mask=create_pii_masker(PiiMaskerConfig(
          email=True,
          custom=[
              CustomPattern(pattern=re.compile(r"MRN-\d+"), replacement="[MEDICAL_RECORD]"),
              CustomPattern(pattern=re.compile(r"ACCT-[A-Z0-9]+"), replacement="[ACCOUNT_ID]"),
          ],
      )),
  )
  sdk.init_tracing()
  ```
</CodeGroup>

Custom patterns run after built-in patterns. Custom patterns can be used on their own without enabling any built-in patterns.
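
Because custom patterns run after the built-in ones, a custom pattern sees text in which built-in matches have already been replaced. The sketch below models that ordering with plain `re`. The names `apply_rules`, `BUILTIN_RULES`, and `CUSTOM_RULES`, as well as the email regex, are hypothetical stand-ins and not the SDK's internals.

```python
import re
from typing import List, Pattern, Tuple

Rule = Tuple[Pattern[str], str]

# Stand-in for a built-in pattern (assumed regex, not the SDK's actual one).
BUILTIN_RULES: List[Rule] = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
]

CUSTOM_RULES: List[Rule] = [
    (re.compile(r"ACCT-[A-Z0-9]+"), "[ACCOUNT_ID]"),
    # Never fires in this example: the address is already "[EMAIL]" by now.
    (re.compile(r"\S+@corp\.example"), "[EMPLOYEE]"),
]


def apply_rules(text: str, builtin: List[Rule], custom: List[Rule]) -> str:
    """Apply built-in rules first, then custom rules, in list order."""
    for pattern, replacement in [*builtin, *custom]:
        text = pattern.sub(replacement, text)
    return text


masked = apply_rules(
    "Contact jane@corp.example about ACCT-9F3K", BUILTIN_RULES, CUSTOM_RULES
)
```

If a custom pattern needs to take precedence over a built-in one, one option under this model is to leave that built-in disabled and cover the case with custom patterns only.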

***

## Environment variable suppression

For a zero-code option, set environment variables to suppress all inputs, all outputs, or both:

```bash theme={null}
AGENTMARK_HIDE_INPUTS=true
AGENTMARK_HIDE_OUTPUTS=true
```

When enabled, these replace ALL input or output attribute values with `[REDACTED]`. No code changes are needed.

<Note>
  If both environment variable suppression and a `mask` function are configured, suppression runs first. The mask function then receives the already-suppressed values.
</Note>

***

## Masked attributes reference

AgentMark masks specific span attributes depending on their category.

**Input attributes** (suppressed by `AGENTMARK_HIDE_INPUTS`):

* `gen_ai.request.input` — the prompt or messages sent to the model
* `gen_ai.request.tool_calls` — tool call arguments included in the request

**Output attributes** (suppressed by `AGENTMARK_HIDE_OUTPUTS`):

* `gen_ai.response.output` — the model's text response
* `gen_ai.response.output_object` — structured output from the model

**Metadata attributes** (mask function only, not affected by env vars):

* `agentmark.metadata.*` — custom metadata attached to spans

Operational attributes such as trace IDs, model names, and token counts are never masked. These contain no user data and are required for observability to function.
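
The attribute categories and the suppression-then-mask ordering described on this page can be modeled in a few lines. This is an illustrative simulation for reasoning about the behavior, not the SDK's implementation; `apply_privacy` and its parameters are hypothetical, and only the attribute names come from the lists above.

```python
from typing import Any, Callable, Dict

REDACTED = "[REDACTED]"
INPUT_KEYS = {"gen_ai.request.input", "gen_ai.request.tool_calls"}
OUTPUT_KEYS = {"gen_ai.response.output", "gen_ai.response.output_object"}
METADATA_PREFIX = "agentmark.metadata."


def apply_privacy(
    attrs: Dict[str, Any],
    mask: Callable[[str], str],
    hide_inputs: bool = False,
    hide_outputs: bool = False,
) -> Dict[str, Any]:
    """Model of the documented order: suppression first, then the mask function."""
    result: Dict[str, Any] = {}
    for key, value in attrs.items():
        maskable = (
            key in INPUT_KEYS
            or key in OUTPUT_KEYS
            or key.startswith(METADATA_PREFIX)
        )
        if not maskable or not isinstance(value, str):
            result[key] = value  # operational attributes pass through untouched
            continue
        if (hide_inputs and key in INPUT_KEYS) or (hide_outputs and key in OUTPUT_KEYS):
            value = REDACTED  # suppression runs before the mask function
        result[key] = mask(value)  # the mask sees the already-suppressed value
    return result
```

In this model, metadata attributes are never replaced by `[REDACTED]`, so a `mask` function is the only way to redact them.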

***

## Error handling

PII masking uses fail-closed behavior. If your `mask` function throws an error, the span is dropped entirely and never exported. This ensures that unmasked data is never sent to the trace backend.

Tracing continues normally for subsequent spans after a mask failure. The dropped span does not affect the rest of the trace pipeline.

<Warning>
  Test your mask function thoroughly before deploying to production. A mask function that throws on unexpected input will cause spans to be silently dropped.
</Warning>
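
One way to exercise a mask function before deployment is to run it over a set of awkward inputs and collect anything that raises. The harness below is a sketch under that assumption; `find_mask_failures` and `EDGE_CASES` are hypothetical names, and the sample list is a starting point rather than an exhaustive set.

```python
from typing import Callable, Iterable, List, Tuple


def find_mask_failures(
    mask: Callable[[str], str], samples: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (sample, problem) pairs for inputs the mask function mishandles."""
    failures: List[Tuple[str, str]] = []
    for sample in samples:
        try:
            result = mask(sample)
        except Exception as exc:  # any exception would cause the span to be dropped
            failures.append((sample, repr(exc)))
            continue
        if not isinstance(result, str):
            failures.append((sample, f"returned {type(result).__name__}, expected str"))
    return failures


# Inputs that often trip up hand-written maskers.
EDGE_CASES = [
    "",
    " ",
    "{}",
    "null",
    "\u0000",
    "a" * 100_000,
    "caf\u00e9 \U0001F642",
    '{"nested": "json"}',
]
```

Running `find_mask_failures(your_mask, EDGE_CASES)` in a unit test and asserting that the result is empty gives an early signal before spans start disappearing in production.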

***

## Recipes

### Microsoft Presidio (Python)

[Microsoft Presidio](https://microsoft.github.io/presidio/) uses NLP to detect unstructured PII like person names, addresses, and passport numbers that regex patterns miss. Since Presidio is a Python library, this recipe applies to the Python SDK.

```bash theme={null}
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
```

```python Python theme={null}
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from agentmark_sdk import AgentMarkSDK

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def presidio_mask(data: str) -> str:
    results = analyzer.analyze(text=data, language="en")
    anonymized = anonymizer.anonymize(text=data, analyzer_results=results)
    return anonymized.text

sdk = AgentMarkSDK(
    api_key="am_...",
    app_id="app-123",
    mask=presidio_mask,
)
sdk.init_tracing()
```

Presidio detects 15+ entity types including `PERSON`, `LOCATION`, `US_PASSPORT`, `IBAN_CODE`, and `CRYPTO`. See the [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list.

<Tip>
  For TypeScript applications, use `createPiiMasker()` with custom regex patterns for common PII types. Presidio requires a Python runtime and NLP models (\~500MB), which makes it better suited for Python services.
</Tip>

### Healthcare (HIPAA)

Combine built-in patterns with custom patterns for Protected Health Information:

```python Python theme={null}
import re
from agentmark_sdk import AgentMarkSDK, create_pii_masker, PiiMaskerConfig, CustomPattern

sdk = AgentMarkSDK(
    api_key="am_...",
    app_id="app-123",
    mask=create_pii_masker(PiiMaskerConfig(
        email=True,
        phone=True,
        ssn=True,
        ip_address=True,
        custom=[
            CustomPattern(pattern=re.compile(r"MRN[-\s]?\d{6,10}"), replacement="[MRN]"),
            # DEA numbers are 2 letters followed by 7 digits (e.g. AB1234567).
            CustomPattern(pattern=re.compile(r"\b[A-Z]{2}\d{7}\b"), replacement="[DEA_NUMBER]"),
            CustomPattern(pattern=re.compile(r"\b(?:DOB|dob)[:\s]+\d{1,2}/\d{1,2}/\d{2,4}"), replacement="[DOB]"),
        ],
    )),
)
sdk.init_tracing()
```

### Financial services

For PCI-DSS compliance, enable credit card masking and add patterns for financial identifiers:

```typescript TypeScript theme={null}
import { AgentMarkSDK, createPiiMasker } from '@agentmark-ai/sdk';

const sdk = new AgentMarkSDK({
  apiKey: 'am_...',
  appId: 'app-123',
  mask: createPiiMasker({
    creditCard: true,
    ssn: true,
    custom: [
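      // Broad pattern: matches any standalone 9-digit number, so expect false positives.
      { pattern: /\b\d{9}\b/g, replacement: '[ROUTING_NUMBER]' },
      // IBAN: two-letter country code, two check digits, then the account identifier.
      { pattern: /\b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}\b/g, replacement: '[IBAN]' },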
    ],
  }),
});
sdk.initTracing();
```

<Note>
  For maximum compliance assurance, combine a `mask` function with `AGENTMARK_HIDE_INPUTS=true` as a defense-in-depth strategy. The env var acts as a safety net in case a new input attribute is added that the mask function doesn't cover.
</Note>

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Sessions
Source: https://docs.agentmark.co/observe/sessions

Group related traces together for multi-turn conversations, workflows, and batch processing

Sessions group related traces together, making it easier to monitor and debug complex workflows. Track entire user interactions or multi-step processes as a single unit.

## What are sessions?

A session represents a logical grouping of traces. Common examples:

* A conversation with a user
* A batch processing job
* A multi-step workflow
* A user's session on your application

## Creating sessions

There are two ways to create sessions: via `span()` / `span_context()` with session options (recommended), or via telemetry metadata on individual prompt calls.

### Using `span()` with session options

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { span } from "@agentmark-ai/sdk";
    import { client } from "./agentmark.client";
    import { generateText } from "ai";

    const sessionId = `session-${Date.now()}`;

    const { result: greeting } = await span(
      {
        name: 'handle-greeting',
        sessionId,
        sessionName: 'Customer Support Chat #12345',
        userId: 'user-123'
      },
      async (ctx) => {
        const prompt = await client.loadTextPrompt('chat.prompt.mdx');
        const input = await prompt.format({
          props: { message: 'Hello!' },
          telemetry: { isEnabled: true }
        });
        return await generateText(input);
      }
    );

    // Later, another trace in the same session
    const { result: followUp } = await span(
      {
        name: 'handle-follow-up',
        sessionId,
        sessionName: 'Customer Support Chat #12345',
        userId: 'user-123'
      },
      async (ctx) => {
        const prompt = await client.loadTextPrompt('chat.prompt.mdx');
        const input = await prompt.format({
          props: { message: 'What can you help me with?' },
          telemetry: { isEnabled: true }
        });
        return await generateText(input);
      }
    );
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    import time
    from agentmark_sdk import span_context, SpanOptions
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    session_id = f"session-{int(time.time() * 1000)}"

    async with span_context(SpanOptions(
        name="handle-greeting",
        session_id=session_id,
        session_name="Customer Support Chat #12345",
        user_id="user-123",
    )) as ctx:
        prompt = await client.load_text_prompt("chat.prompt.mdx")
        params = await prompt.format(props={"message": "Hello!"})
        greeting = await run_text_prompt(params)

    async with span_context(SpanOptions(
        name="handle-follow-up",
        session_id=session_id,
        session_name="Customer Support Chat #12345",
        user_id="user-123",
    )) as ctx:
        prompt = await client.load_text_prompt("chat.prompt.mdx")
        params = await prompt.format(props={"message": "What can you help me with?"})
        follow_up = await run_text_prompt(params)
    ```
  </Tab>
</Tabs>

### Using telemetry metadata

For cases where you don't need explicit spans, pass session info through telemetry metadata:

<Tabs>
  <Tab title="TypeScript">
    <Note>
      In telemetry metadata, the session/user keys must be **snake\_case** — `session_id`, `session_name`, `user_id`. The gateway only promotes these snake\_case keys to session fields; camelCase keys (`sessionId`, …) are stored as ordinary metadata and will **not** group traces into a session.
    </Note>

    ```typescript theme={null}
    import { client } from "./agentmark.client";
    import { generateText } from "ai";

    const sessionId = `session-${Date.now()}`;

    async function handleUserMessage(message: string) {
      const prompt = await client.loadTextPrompt('chat.prompt.mdx');
      const input = await prompt.format({
        props: { message },
        telemetry: {
          isEnabled: true,
          functionId: 'chat-handler',
          metadata: { user_id: 'user-123', session_id: sessionId, session_name: 'Customer Support Chat' }
        }
      });
      return await generateText(input);
    }

    await handleUserMessage('Hello!');
    await handleUserMessage('What can you help me with?');
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
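    import time
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    session_id = f"session-{int(time.time() * 1000)}"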

    async def handle_user_message(message: str):
        prompt = await client.load_text_prompt("chat.prompt.mdx")
        params = await prompt.format(
            props={"message": message},
            telemetry={
                "isEnabled": True,
                "functionId": "chat-handler",
                "metadata": {"user_id": "user-123", "session_id": session_id, "session_name": "Customer Support Chat"}
            }
        )
        return await run_text_prompt(params)

    await handle_user_message("Hello!")
    await handle_user_message("What can you help me with?")
    ```
  </Tab>
</Tabs>

## Viewing sessions

<img alt="Sessions page with search, filters, and sortable columns" />

The Sessions page lists each session with columns for ID, name, user, duration, cost, tokens, and trace count. Search by session ID or name, filter by date range and user, and sort by any column.

Access sessions under the **Sessions** tab in the Dashboard or at `http://localhost:3000` locally.

## Sessions API

You can list sessions and retrieve a session's traces programmatically using the CLI or REST API. Both the local dev server and the AgentMark Cloud gateway expose `/v1/sessions` and `/v1/sessions/{sessionId}/traces`.

<Tabs>
  <Tab title="REST API (Cloud)">
    ```bash theme={null}
    # List sessions
    curl "https://api.agentmark.co/v1/sessions?limit=10" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"

    # List the traces belonging to a session
    curl "https://api.agentmark.co/v1/sessions/session-1712764245/traces" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"
    ```
  </Tab>

  <Tab title="REST API (local)">
    ```bash theme={null}
    # No auth required locally
    curl "http://localhost:9418/v1/sessions?limit=10"

    curl "http://localhost:9418/v1/sessions/session-1712764245/traces"
    ```
  </Tab>
</Tabs>

Both endpoints support pagination with `limit` and `offset`. The list endpoint also supports filtering by `name` and `userId`.
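
As a sketch of how the pagination and filter parameters combine, the helper below builds a list URL with the standard library. The function name `sessions_url` is hypothetical; only the path and the `limit`, `offset`, `name`, and `userId` parameters come from this page. It deliberately makes no request and assumes nothing about the response shape.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode


def sessions_url(
    base_url: str,
    limit: int = 10,
    offset: int = 0,
    name: Optional[str] = None,
    user_id: Optional[str] = None,
) -> str:
    """Build a /v1/sessions list URL with pagination and optional filters."""
    params: Dict[str, Union[int, str]] = {"limit": limit, "offset": offset}
    if name is not None:
        params["name"] = name
    if user_id is not None:
        params["userId"] = user_id  # the query parameter is camelCase
    return f"{base_url.rstrip('/')}/v1/sessions?{urlencode(params)}"


# Third page of ten sessions for one user on the local dev server.
url = sessions_url("http://localhost:9418", limit=10, offset=20, user_id="user-123")
```

Against the Cloud gateway, the same URL would be requested with the `Authorization` and `X-Agentmark-App-Id` headers shown in the curl examples.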

See the [Sessions API reference](/api-reference/overview) for full request and response details.

## Best practices

* **Use consistent session IDs** — `session-${userId}-${Date.now()}`, not `${Math.random()}`
* **Provide descriptive names** — `"Customer Support: Billing Issue #4532"`, not `"Session 1"`
* **Limit session scope** — One ticket, one conversation, one batch job
* **Create new sessions for new interactions** — Don't reuse sessions across unrelated workflows
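
The ID format recommended above can be wrapped in a small helper so that every trace in one interaction reuses the same value. This is a sketch; `make_session_id` is a hypothetical name, and the timestamp parameter exists only to make the function deterministic in tests.

```python
import time
from typing import Optional


def make_session_id(user_id: str, started_at_ms: Optional[int] = None) -> str:
    """Build an ID of the form session-<user>-<ms>. Create once, reuse per trace."""
    if started_at_ms is None:
        started_at_ms = int(time.time() * 1000)
    return f"session-{user_id}-{started_at_ms}"
```

Generate the ID once at the start of a conversation or job and store it alongside that interaction, rather than calling the helper again for each trace.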

<CardGroup>
  <Card title="Tracing setup" icon="code" href="/observe/tracing-setup">
    Full span() API reference
  </Card>

  <Card title="Filtering and search" icon="filter" href="/observe/filtering-and-search">
    Find sessions across dimensions
  </Card>
</CardGroup>



# Tags
Source: https://docs.agentmark.co/observe/tags

Attach string labels to traces for categorization and filtering

Tags are string labels you attach to traces for categorization, filtering, and organization. Use tags to slice trace data by environment, team, feature, experiment, or any other dimension.

## Setting tags

Tags are attached to a span by setting the `agentmark.tags` span attribute. The gateway accepts a JSON array string, a comma-separated string, or a native array.

Set the attribute inside a `span()` callback using `ctx.setAttribute()`. The tag list is aggregated to the parent trace in the Dashboard.

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { span } from "@agentmark-ai/sdk";
    import { client } from "./agentmark.client";
    import { generateText } from "ai";

    const { result, traceId } = await span(
      {
        name: "user-request",
        metadata: { userId: "user-123" },
      },
      async (ctx) => {
        ctx.setAttribute(
          "agentmark.tags",
          JSON.stringify(["production", "chat-v2", "team-alpha"])
        );

        const prompt = await client.loadTextPrompt("handler.prompt.mdx");
        const input = await prompt.format({
          props: { query: "Hello" },
          telemetry: { isEnabled: true },
        });
        return await generateText(input);
      }
    );
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    import json
    from agentmark_sdk import span_context, SpanOptions
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    async with span_context(SpanOptions(
        name="user-request",
        metadata={"userId": "user-123"},
    )) as ctx:
        ctx.set_attribute(
            "agentmark.tags",
            json.dumps(["production", "chat-v2", "team-alpha"]),
        )

        prompt = await client.load_text_prompt("handler.prompt.mdx")
        params = await prompt.format(
            props={"query": "Hello"},
            telemetry={"isEnabled": True},
        )
        result = await run_text_prompt(params)
    ```
  </Tab>
</Tabs>

## Tags on child spans

You can set tags on any span — the gateway aggregates tags across a trace's spans into the trace-level tag list shown in the Dashboard.

```typescript theme={null}
const { result } = await span(
  { name: "multi-step-workflow" },
  async (ctx) => {
    ctx.setAttribute(
      "agentmark.tags",
      JSON.stringify(["production", "search-feature"])
    );

    await ctx.span({ name: "retrieval-step" }, async (spanCtx) => {
      spanCtx.setAttribute("agentmark.tags", JSON.stringify(["rag-v3"]));
    });
  }
);
// Dashboard shows three tags on the trace: production, search-feature, rag-v3
```

## Filtering by tags

Tags appear as a column in the trace list. Filter by navigating to **Traces**, clicking **Filters**, selecting **Tags**, and choosing an operator and value.

## Tags vs metadata

|              | Tags                                            | Metadata                                   |
| ------------ | ----------------------------------------------- | ------------------------------------------ |
| **Format**   | Array of strings                                | Key-value pairs (string → string)          |
| **Best for** | Categorical labels (environment, team, feature) | Unique identifiers (user IDs, request IDs) |
| **Set size** | Small, known set of values                      | Unlimited unique values                    |

<Tip>If you would use it as a label or category, make it a tag. If you would use it as a lookup key, make it metadata.</Tip>

## Limits

* Up to **20 tags** per span (extra tags are dropped by the gateway).
* Each tag is trimmed and must be **1–100 characters**. Longer tags are dropped.
* Empty strings are ignored.
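
To avoid losing tags silently, the limits above can be approximated on the client before the attribute is set. The sketch below accepts the three input forms the gateway is described as accepting and applies the documented trimming, length, and count rules. `normalize_tags` is a hypothetical helper, and the order in which the gateway applies these rules is an assumption.

```python
import json
from typing import List, Sequence, Union

MAX_TAGS = 20
MAX_TAG_LENGTH = 100


def normalize_tags(raw: Union[str, Sequence[str]]) -> List[str]:
    """Approximate the documented tag limits before setting agentmark.tags."""
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
            # Anything that is not a JSON array is treated as comma-separated text.
            candidates = parsed if isinstance(parsed, list) else raw.split(",")
        except json.JSONDecodeError:
            candidates = raw.split(",")
    else:
        candidates = list(raw)

    tags: List[str] = []
    for item in candidates:
        tag = str(item).strip()
        if 1 <= len(tag) <= MAX_TAG_LENGTH:  # drops empty and over-long tags
            tags.append(tag)
    return tags[:MAX_TAGS]  # extra tags beyond the cap are dropped
```

The result can be serialized with `json.dumps(...)` and passed to `ctx.set_attribute("agentmark.tags", ...)` as in the examples above.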

## Best practices

* **Use kebab-case** — `production`, `team-alpha`, `chat-v2` (not `Production`, `team_alpha`)
* **Define tags as constants** to avoid typos
* **Keep the tag set small** — Tags with hundreds of unique values belong in metadata
* **Recommended patterns**: environment (`production`, `staging`), team (`team-alpha`), feature (`chat-v2`), experiment (`exp-new-prompt`), release (`v2.1.0`)
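
One way to follow the "define tags as constants" advice is a small module that owns the allowed tag set and rejects anything else. This is a sketch; the constant names and the `tags_attribute` helper are hypothetical, and the tag values are taken from the examples on this page.

```python
import json
from typing import FrozenSet

ENV_PRODUCTION = "production"
ENV_STAGING = "staging"
TEAM_ALPHA = "team-alpha"
FEATURE_CHAT_V2 = "chat-v2"

KNOWN_TAGS: FrozenSet[str] = frozenset(
    {ENV_PRODUCTION, ENV_STAGING, TEAM_ALPHA, FEATURE_CHAT_V2}
)


def tags_attribute(*tags: str) -> str:
    """Serialize known tags as the JSON array string used for agentmark.tags."""
    unknown = [tag for tag in tags if tag not in KNOWN_TAGS]
    if unknown:
        raise ValueError(f"unknown tags: {unknown}")  # catches typos early
    return json.dumps(list(tags))
```

With this in place, a call such as `ctx.set_attribute("agentmark.tags", tags_attribute(ENV_PRODUCTION, TEAM_ALPHA))` fails fast on a misspelled tag instead of creating a new one in the Dashboard.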

<CardGroup>
  <Card title="Metadata" icon="tag" href="/observe/metadata">
    Key-value pairs for context and debugging
  </Card>

  <Card title="Filtering and search" icon="filter" href="/observe/filtering-and-search">
    Combine tags with other filters
  </Card>
</CardGroup>



# Traces and logs
Source: https://docs.agentmark.co/observe/traces-and-logs

Monitor and debug your prompts with distributed tracing

AgentMark uses [OpenTelemetry](https://opentelemetry.io/) to provide distributed tracing for your prompt executions. This gives you complete visibility into how your prompts perform in production.

<Note>
  Tracing must first be set up in your application code. See [Tracing setup](/observe/tracing-setup) for instructions.
</Note>

<img alt="Traces panel showing prompt execution timeline with spans, token usage, and response times" />

The Traces panel lists each execution with columns for Name, Status, Latency, Cost, Tokens, Spans, Tags, and Timestamp. Click a row to open the trace detail view with the full span tree and attribute drill-down.

## Understanding traces

A trace represents the complete execution of a prompt, including all its steps, tool calls, and metadata. Each trace contains:

**Execution timeline** — See exactly when each step occurred and how long it took.

**Token usage** — Track input tokens, output tokens, and total tokens consumed.

**Costs** — Monitor spending on a per-request basis.

**Tool calls** — View all tool executions, their parameters, and results.

**Custom metadata** — Add context like user IDs, session IDs, and custom attributes.

**Error information** — Detailed error messages and stack traces when issues occur.

## Collected spans

AgentMark records the following OpenTelemetry spans:

| Span type      | Description                       | Attributes                                                                                                               |
| -------------- | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `ai.inference` | Full length of the inference call | `operation.name`, `ai.operationId`, `ai.prompt`, `ai.response.text`, `ai.response.toolCalls`, `ai.response.finishReason` |
| `ai.toolCall`  | Individual tool executions        | `operation.name`, `ai.operationId`, `ai.toolCall.name`, `ai.toolCall.args`, `ai.toolCall.result`                         |
| `ai.stream`    | Streaming response data           | `ai.response.msToFirstChunk`, `ai.response.msToFinish`, `ai.response.avgCompletionTokensPerSecond`                       |

## Span kinds

Each span carries a semantic kind that categorizes the type of operation it represents. Span kinds affect how spans can be [filtered](/observe/filtering-and-search) and how analytics are grouped on the [dashboard](/observe/dashboards).

| Kind          | Description                                        |
| ------------- | -------------------------------------------------- |
| **function**  | Generic computation step (default)                 |
| **llm**       | A call to a language model                         |
| **tool**      | An external tool or API call                       |
| **agent**     | An orchestration loop that decides what to do next |
| **retrieval** | A vector database query or document search         |
| **embedding** | A call to an embedding model                       |
| **guardrail** | A content safety or validation check               |

Span kinds are set in code by wrapping functions with `observe()` — see [SpanKind values](/observe/tracing-setup#spankind-values) for implementation details.

## LLM span attributes

Each LLM span contains attributes that vary slightly depending on the adapter you use. The table below shows common attributes across integrations:

<Tabs>
  <Tab title="AI SDK (Vercel)">
    | Attribute                   | Description                 |
    | --------------------------- | --------------------------- |
    | `ai.model.id`               | Model identifier            |
    | `ai.model.provider`         | Model provider name         |
    | `ai.usage.promptTokens`     | Number of prompt tokens     |
    | `ai.usage.completionTokens` | Number of completion tokens |
    | `ai.settings.maxRetries`    | Maximum retry attempts      |
    | `ai.telemetry.functionId`   | Function identifier         |
    | `ai.telemetry.metadata.*`   | Custom metadata             |
    | `ai.response.text`          | Response text               |
    | `ai.response.toolCalls`     | Tool calls array            |
    | `ai.response.finishReason`  | Finish reason               |
  </Tab>

  <Tab title="Claude Agent SDK">
    | Attribute                        | Description                                             |
    | -------------------------------- | ------------------------------------------------------- |
    | `gen_ai.request.model`           | Requested model name                                    |
    | `gen_ai.system`                  | AI system identifier (e.g., `anthropic`)                |
    | `gen_ai.usage.input_tokens`      | Number of input tokens                                  |
    | `gen_ai.usage.output_tokens`     | Number of output tokens                                 |
    | `gen_ai.response.output`         | Agent response output                                   |
    | `gen_ai.response.finish_reasons` | Completion finish reasons                               |
    | `gen_ai.tool.name`               | Tool name                                               |
    | `gen_ai.operation.name`          | Operation type (`chat`, `execute_tool`, `invoke_agent`) |
  </Tab>
</Tabs>

All adapters also support custom metadata via `agentmark.metadata.*` attributes.

## Grouping traces

Organize related traces together using custom grouping. This is useful for understanding complex workflows that span multiple prompt executions.

<img alt="Grouped traces view showing a parent trace with nested child traces in the timeline" />

Grouped traces show a parent-child hierarchy in the trace list, with child spans indented under their parent. Use this to model multi-step agent workflows, nested component execution, and parallel processing pipelines.

## Viewing traces

View traces in your local dev server at `http://localhost:3000` or in the AgentMark Dashboard under the **Traces** tab — both render the same trace explorer (execution timeline, span tree, graph view, and per-span attribute drill-down). Each trace shows:

* Complete prompt execution timeline
* Tool calls and their durations
* Token usage and costs
* Custom metadata and attributes
* Error information (if any)
* Graph visualization (when graph metadata is present)
* Manual annotations for quality assessment

## Filtering and search

AgentMark supports filtering on every trace dimension -- model, status, latency, cost, tokens, metadata, scores, and more. Filters can be combined, saved as views, and shared via URL.

[Learn more about filtering and search](/observe/filtering-and-search)

## Integration

AgentMark works with any application that uses OpenTelemetry. For detailed setup instructions, see [Tracing setup](/observe/tracing-setup).

### MCP trace server

For debugging traces directly from your IDE, AgentMark provides an [MCP server](/sdk-reference/tools/agentmark-mcp) that exposes `list_traces` and `get_trace` tools. This lets you query and inspect traces without leaving your development environment.

## Traces and spans API

You can query traces and spans programmatically using the REST API or the CLI. Both the local dev server and the AgentMark Cloud gateway expose `/v1/traces`, `/v1/traces/{traceId}`, and `/v1/spans`, so you can develop against local data and switch to Cloud without changing your integration. Bulk export (`/v1/traces/export`) is Cloud-only.

<Tabs>
  <Tab title="Local REST">
    ```bash theme={null}
    # List traces from the local dev server
    curl "http://localhost:9418/v1/traces?limit=20"

    # Get a specific trace with its spans
    curl "http://localhost:9418/v1/traces/<traceId>"

    # Query spans across all traces
    curl "http://localhost:9418/v1/spans?limit=50"
    ```
  </Tab>

  <Tab title="REST API (Cloud)">
    ```bash theme={null}
    # List traces with filters. The traces `status` filter accepts OK | ERROR.
    curl "https://api.agentmark.co/v1/traces?status=ERROR&limit=20" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"

    # Get a single trace with all its spans
    curl "https://api.agentmark.co/v1/traces/abc123-trace-id" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"

    # Query spans across all traces
    curl "https://api.agentmark.co/v1/spans?type=GENERATION&model=gpt-4o&limit=50" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"
    ```
  </Tab>

  <Tab title="REST API (local)">
    ```bash theme={null}
    # Same endpoints, no auth required locally
    curl "http://localhost:9418/v1/traces?status=ERROR&limit=20"

    curl "http://localhost:9418/v1/traces/abc123-trace-id"

    curl "http://localhost:9418/v1/spans?type=GENERATION&limit=50"
    ```
  </Tab>
</Tabs>

See the [API reference](/api-reference/overview) for all available endpoints, filters, and response schemas. You can also [create scores](/observe/dashboards#scores-api) for spans and traces programmatically.

### Cross-trace span search

The `GET /v1/spans` endpoint lets you search spans across all traces in your project. Unlike the traces API, which returns traces and their nested spans, the spans endpoint queries individual spans directly -- regardless of which trace they belong to.

This is useful when you need to:

* **Find all LLM calls using a specific model** across your entire project
* **Identify slow operations** by filtering on duration thresholds
* **Audit error spans** across traces without browsing each trace individually
* **Analyze usage patterns** for a particular span type (e.g., all `GENERATION` spans)

**Available filters:**

| Parameter      | Description                                 |
| -------------- | ------------------------------------------- |
| `type`         | Span type: `GENERATION`, `SPAN`, or `EVENT` |
| `status`       | Span status: `UNSET`, `OK`, or `ERROR`      |
| `name`         | Partial match on span name                  |
| `model`        | Partial match on model name                 |
| `min_duration` | Minimum duration in milliseconds            |
| `max_duration` | Maximum duration in milliseconds            |
| `limit`        | Results per page (1-500, default 100)       |
| `offset`       | Pagination offset                           |

<Tabs>
  <Tab title="Local REST">
    ```bash theme={null}
    # Find all error spans
    curl "http://localhost:9418/v1/spans?status=ERROR"

    # Find slow generations (over 5 seconds)
    curl "http://localhost:9418/v1/spans?type=GENERATION&min_duration=5000"

    # Search spans by model
    curl "http://localhost:9418/v1/spans?model=claude&limit=20"
    ```
  </Tab>

  <Tab title="REST API (Cloud)">
    ```bash theme={null}
    # Find all error spans
    curl "https://api.agentmark.co/v1/spans?status=ERROR" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"

    # Find slow generations (over 5 seconds)
    curl "https://api.agentmark.co/v1/spans?type=GENERATION&min_duration=5000" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"

    # Combine filters: slow Claude generations with errors
    curl "https://api.agentmark.co/v1/spans?type=GENERATION&model=claude&status=ERROR&min_duration=3000" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"

    # Paginate: skip the first 100 results and fetch the next 100
    curl "https://api.agentmark.co/v1/spans?type=GENERATION&limit=100&offset=100" \
      -H "Authorization: Bearer am_live_abc123" \
      -H "X-Agentmark-App-Id: app_abc123"
    ```
  </Tab>
</Tabs>

Each span in the response includes its `traceId`, so you can drill into the full trace for any span that matches your search.
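
For scripted searches, the filters above map directly onto query parameters. The Python sketch below builds a `/v1/spans` URL from the documented filters and walks through pages by advancing `offset`. `build_spans_url`, `iter_spans`, and the `fetch_page` callable are hypothetical helper names, and the stop condition (a page shorter than `limit`) is an assumption about how the last page can be detected, not documented behavior.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

# Filter names taken from the "Available filters" table.
ALLOWED_FILTERS = {
    "type", "status", "name", "model",
    "min_duration", "max_duration", "limit", "offset",
}


def build_spans_url(base_url: str, **filters) -> str:
    """Build a /v1/spans search URL from the documented filters."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    limit = filters.get("limit")
    if limit is not None and not 1 <= int(limit) <= 500:
        raise ValueError("limit must be between 1 and 500")
    query = urlencode(filters)
    return f"{base_url}/v1/spans?{query}" if query else f"{base_url}/v1/spans"


def iter_spans(fetch_page: Callable[[int, int], list], limit: int = 100) -> Iterator:
    """Yield spans page by page.

    `fetch_page(offset, limit)` is a hypothetical callable that performs the
    HTTP request and returns one page of spans as a list.
    Assumption: a page shorter than `limit` means there are no more results.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:
            return
        offset += limit
```

Keeping the URL builder separate from the HTTP call makes it easy to reuse with the local server (`http://localhost:9418`) or the hosted API, where the `Authorization` and `X-Agentmark-App-Id` headers are added by whatever `fetch_page` you supply.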

## Best practices

* **Use meaningful IDs** — Choose descriptive function IDs for easy filtering and debugging.
* **Add context** — Include relevant metadata like user IDs, session IDs, and business context.
* **Monitor regularly** — Check traces frequently to catch issues early.
* **Set up alerts** — Configure alerts for cost, latency, or error thresholds.
* **Analyze patterns** — Use the Dashboard's filtering to identify trends and patterns.

## Next steps

<CardGroup>
  <Card title="Sessions" icon="users" href="/observe/sessions">
    Group related traces together
  </Card>

  <Card title="Alerts" icon="bell" href="/observe/alerts">
    Get notified of critical issues
  </Card>

  <Card title="Annotations" icon="tag" href="/evaluate/annotations">
    Manually label and score traces
  </Card>

  <Card title="Tracing setup" icon="code" href="/observe/tracing-setup">
    Integrate observability in your app
  </Card>
</CardGroup>

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Tracing setup
Source: https://docs.agentmark.co/observe/tracing-setup

Instrument your application with the AgentMark SDK to capture traces, spans, and metrics

AgentMark uses [OpenTelemetry](https://opentelemetry.io/) to collect distributed traces and metrics for your prompt executions. This page covers everything from basic setup to advanced tracing patterns.

## Install the SDK

<Tabs>
  <Tab title="TypeScript">
    ```bash theme={null}
    npm install @agentmark-ai/sdk
    ```
  </Tab>

  <Tab title="Python">
    ```bash theme={null}
    pip install agentmark-sdk
    ```
  </Tab>
</Tabs>

## Initialize tracing

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { AgentMarkSDK } from "@agentmark-ai/sdk";
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { openai } from "@ai-sdk/openai";
    import { generateText } from "ai";

    const sdk = new AgentMarkSDK({
      apiKey: process.env.AGENTMARK_API_KEY,
      appId: process.env.AGENTMARK_APP_ID,
      baseUrl: process.env.AGENTMARK_BASE_URL  // defaults to https://api.agentmark.co
    });

    // Initialize tracing
    const tracer = sdk.initTracing();

    // Configure client
    const modelRegistry = new VercelAIModelRegistry();
    modelRegistry.registerProviders({ openai });

    const client = createAgentMarkClient({
      loader: sdk.getApiLoader(),
      modelRegistry
    });

    // Load and run prompt with telemetry
    const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
    const input = await prompt.format({
      props: { name: 'Alice' },
      telemetry: {
        isEnabled: true,
        functionId: "greeting-function",
        metadata: {
          userId: "123",
          environment: "production"
        }
      }
    });

    const result = await generateText(input);

    // Shut down the tracer (only needed for short-running scripts)
    await tracer.shutdown();
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    import os
    from agentmark_sdk import AgentMarkSDK
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    sdk = AgentMarkSDK(
        api_key=os.environ["AGENTMARK_API_KEY"],
        app_id=os.environ["AGENTMARK_APP_ID"],
    )

    # Initialize tracing
    tracer = sdk.init_tracing()

    # Load and run prompt with telemetry
    prompt = await client.load_text_prompt("greeting.prompt.mdx")
    params = await prompt.format(
        props={"name": "Alice"},
        telemetry={
            "isEnabled": True,
            "functionId": "greeting-function",
            "metadata": {
                "userId": "123",
                "environment": "production",
            },
        },
    )

    result = await run_text_prompt(params)

    # Shut down the tracer (only needed for short-running scripts)
    await tracer.shutdown()
    ```
  </Tab>
</Tabs>

<Tabs>
  <Tab title="TypeScript">
    <Tip>
      For local development with `npx agentmark dev`, traces are sent to `http://localhost:9418` automatically. Pass `disableBatch: true` for short-running scripts:

      ```typescript theme={null}
      const tracer = sdk.initTracing({ disableBatch: true });
      ```
    </Tip>
  </Tab>

  <Tab title="Python">
    <Tip>
      For local development with `npx agentmark dev`, traces are sent to `http://localhost:9418` automatically. Pass `disable_batch=True` for short-running scripts:

      ```python theme={null}
      tracer = sdk.init_tracing(disable_batch=True)
      ```
    </Tip>
  </Tab>
</Tabs>
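
Both initialization examples read credentials from environment variables, so it can help to check them before constructing the SDK. The sketch below is one way to do that in Python. `load_agentmark_settings` is a hypothetical helper, and `base_url` is an assumed name for the Python counterpart of the TypeScript `baseUrl` option; only `api_key` and `app_id` appear in the Python example above.

```python
import os
from typing import Mapping, Optional

# Default noted in the TypeScript initialization example.
DEFAULT_BASE_URL = "https://api.agentmark.co"


def load_agentmark_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect SDK settings from the environment, failing fast on missing values."""
    env = os.environ if env is None else env
    required = ("AGENTMARK_API_KEY", "AGENTMARK_APP_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variable(s): {', '.join(missing)}")
    return {
        "api_key": env["AGENTMARK_API_KEY"],
        "app_id": env["AGENTMARK_APP_ID"],
        # Assumed keyword name; fall back to the hosted API when unset.
        "base_url": env.get("AGENTMARK_BASE_URL") or DEFAULT_BASE_URL,
    }
```

Failing at startup with a clear message tends to be easier to debug than a rejected request later in the run.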

## Collected spans

AgentMark records these OpenTelemetry spans:

| Span type      | Description                   | Key attributes                                                                                       |
| -------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------- |
| `ai.inference` | Full inference call lifecycle | `ai.model.id`, `ai.prompt`, `ai.response.text`, `ai.usage.promptTokens`, `ai.usage.completionTokens` |
| `ai.toolCall`  | Individual tool executions    | `ai.toolCall.name`, `ai.toolCall.args`, `ai.toolCall.result`                                         |
| `ai.stream`    | Streaming response metrics    | `ai.response.msToFirstChunk`, `ai.response.msToFinish`, `ai.response.avgCompletionTokensPerSecond`   |

## Span attributes

Each span contains detailed attributes:

**Model information**: `ai.model.id` (e.g., `"gpt-4o-mini"`), `ai.model.provider` (e.g., `"openai"`)

**Token usage**: `ai.usage.promptTokens`, `ai.usage.completionTokens`

**Telemetry metadata**: `ai.telemetry.functionId`, `ai.telemetry.metadata.*`

**Response details**: `ai.response.text`, `ai.response.toolCalls`, `ai.response.finishReason`
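
As an illustration of working with these attributes, the sketch below totals token usage per model. It assumes each span's attributes are already available as a flat dictionary keyed by the attribute names listed above; how you export spans into that shape depends on your own pipeline, and `summarize_token_usage` is a hypothetical helper name.

```python
from typing import Iterable, Mapping


def summarize_token_usage(spans: Iterable[Mapping[str, object]]) -> dict:
    """Sum prompt and completion tokens per model across span attribute maps."""
    totals: dict = {}
    for attrs in spans:
        # Spans without a model ID are grouped under "unknown".
        model = attrs.get("ai.model.id", "unknown")
        entry = totals.setdefault(model, {"promptTokens": 0, "completionTokens": 0})
        entry["promptTokens"] += int(attrs.get("ai.usage.promptTokens", 0))
        entry["completionTokens"] += int(attrs.get("ai.usage.completionTokens", 0))
    return totals
```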

## Grouping operations into a span

Use `span()` (TypeScript) or `span_context()` (Python) to wrap a block of work as a single parent span. Nested SDK calls automatically attach as children.

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { span } from "@agentmark-ai/sdk";

    const { result, traceId } = await span(
      { name: 'user-request-handler' },
      async (ctx) => {
        const prompt = await client.loadTextPrompt('handler.prompt.mdx');
        const input = await prompt.format({
          props: { query: 'What is AgentMark?' },
          telemetry: { isEnabled: true }
        });

        return await generateText(input);
      }
    );

    console.log('Trace ID:', traceId);
    const output = await result;
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    from agentmark_sdk import span_context, SpanOptions
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    async with span_context(SpanOptions(name="user-request-handler")) as ctx:
        prompt = await client.load_text_prompt("handler.prompt.mdx")
        params = await prompt.format(
            props={"query": "What is AgentMark?"},
            telemetry={"isEnabled": True},
        )

        result = await run_text_prompt(params)

    print(f"Trace ID: {ctx.trace_id}")
    ```
  </Tab>
</Tabs>

### SpanOptions

| Option                  | Type                     | Required | Description                                      |
| ----------------------- | ------------------------ | -------- | ------------------------------------------------ |
| `name`                  | `string`                 | Yes      | Name for the span                                |
| `metadata`              | `Record<string, string>` | No       | Custom key-value metadata (strings only)         |
| `sessionId`             | `string`                 | No       | Group traces into a [session](/observe/sessions) |
| `sessionName`           | `string`                 | No       | Human-readable session name                      |
| `userId`                | `string`                 | No       | Associate trace with a user                      |
| `datasetRunId`          | `string`                 | No       | Link to a dataset run                            |
| `datasetRunName`        | `string`                 | No       | Human-readable dataset run name                  |
| `datasetItemName`       | `string`                 | No       | Specific dataset item name                       |
| `datasetExpectedOutput` | `string`                 | No       | Expected output for evaluation                   |
| `datasetPath`           | `string`                 | No       | Path to the dataset file                         |

### SpanResult

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    interface SpanResult<T> {
      result: Promise<T>;  // The result of your callback (as a Promise)
      traceId: string;     // The trace ID for correlation
    }
    ```

    <Warning>
      `result` is `Promise<T>`, not `T`. You need to `await` it to get the resolved value.
    </Warning>
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    async with span_context(SpanOptions(name="my-operation")) as ctx:
        print(ctx.trace_id)  # Available immediately
        result = await my_async_function()
    ```
  </Tab>
</Tabs>

### Creating child spans

Use `ctx.span()` inside a callback to create child spans under the current parent:

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { span } from "@agentmark-ai/sdk";

    const { result, traceId } = await span(
      { name: 'multi-step-workflow' },
      async (ctx) => {
        await ctx.span({ name: 'validate-input' }, async (spanCtx) => {
          spanCtx.setAttribute('input.length', 42);
        });

        const output = await ctx.span({ name: 'process-request' }, async (spanCtx) => {
          const prompt = await client.loadTextPrompt('process.prompt.mdx');
          const input = await prompt.format({
            props: { query: 'process this' },
            telemetry: { isEnabled: true }
          });
          return await generateText(input);
        });

        await ctx.span({ name: 'format-response' }, async (spanCtx) => {
          spanCtx.addEvent('formatting-complete');
        });

        return output;
      }
    );
    ```

    `ctx.span()` accepts `{ name: string; metadata?: Record<string, string> }`. Use `observe()` (below) if you need to set a `SpanKind` on a span.
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    from agentmark_sdk import span_context, SpanOptions
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    async with span_context(SpanOptions(name="multi-step-workflow")) as ctx:
        async with ctx.span("validate-input") as span_ctx:
            span_ctx.set_attribute("input.length", 42)

        async with ctx.span("process-request") as span_ctx:
            prompt = await client.load_text_prompt("process.prompt.mdx")
            params = await prompt.format(
                props={"query": "process this"},
                telemetry={"isEnabled": True},
            )
            output = await run_text_prompt(params)

        async with ctx.span("format-response") as span_ctx:
            span_ctx.add_event("formatting-complete")
    ```
  </Tab>
</Tabs>

## Wrapping functions with `observe()`

`observe()` wraps an async function with automatic input/output capture and lets you set a `SpanKind`. Unlike `span()` and `ctx.span()`, which create inline spans, `observe()` returns a reusable function, so every call to it is traced automatically.

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { observe, SpanKind } from "@agentmark-ai/sdk";

    const searchWeb = observe(
      async (query: string) => {
        // Encode the query so spaces and special characters survive in the URL
        const response = await fetch(`https://api.search.com?q=${encodeURIComponent(query)}`);
        return response.json();
      },
      { name: "search-web", kind: SpanKind.TOOL }
    );

    // Every call is now automatically traced
    const results = await searchWeb("AgentMark tracing");
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    import httpx
    from agentmark_sdk import observe, SpanKind

    @observe(name="search-web", kind=SpanKind.TOOL)
    async def search_web(query: str) -> dict:
        async with httpx.AsyncClient() as http:
            # Pass the query via params so httpx URL-encodes it
            response = await http.get("https://api.search.com", params={"q": query})
            return response.json()

    # Every call is now automatically traced
    results = await search_web("AgentMark tracing")
    ```
  </Tab>
</Tabs>

### `observe()` options

| Option                               | Type       | Description                                                                |
| ------------------------------------ | ---------- | -------------------------------------------------------------------------- |
| `name`                               | `string`   | Display name for the span (defaults to function name)                      |
| `kind`                               | `SpanKind` | Type of operation (defaults to `SpanKind.FUNCTION`)                        |
| `captureInput` / `capture_input`     | `boolean`  | Record function arguments (default: `true`)                                |
| `captureOutput` / `capture_output`   | `boolean`  | Record return value (default: `true`)                                      |
| `processInputs` / `process_inputs`   | `function` | Transform arguments before recording (useful for redacting sensitive data) |
| `processOutputs` / `process_outputs` | `function` | Transform return value before recording                                    |

Observed functions automatically attach to the active trace context — they nest correctly inside `span()` / `span_context()` without extra wiring.
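
A redaction hook for `process_inputs` / `process_outputs` might look like the Python sketch below. The exact value these hooks receive is not specified here, so the sketch assumes captured data arrives as nested dictionaries and lists. `redact` and `SENSITIVE_KEYS` are hypothetical names chosen for illustration.

```python
from collections.abc import Mapping
from typing import Any

# Keys whose values should never be recorded (illustrative list).
SENSITIVE_KEYS = {"password", "api_key", "token", "authorization"}


def redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys with a placeholder."""
    if isinstance(value, Mapping):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    return value
```

The function returns a new structure instead of mutating its argument, so the wrapped function still sees the original, unredacted values.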

### SpanKind values

| Kind                 | Description                                |
| -------------------- | ------------------------------------------ |
| `SpanKind.FUNCTION`  | Generic computation step (default)         |
| `SpanKind.LLM`       | A call to a language model                 |
| `SpanKind.TOOL`      | An external tool or API call               |
| `SpanKind.AGENT`     | An orchestration loop                      |
| `SpanKind.RETRIEVAL` | A vector database query or document search |
| `SpanKind.EMBEDDING` | A call to an embedding model               |
| `SpanKind.GUARDRAIL` | A content safety or validation check       |

### Using `SpanKind` in a pipeline

To set `SpanKind` on individual steps of a pipeline, wrap each step with `observe()` and call the wrapped functions inside `span()`:

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    import { span, observe, SpanKind } from "@agentmark-ai/sdk";

    const searchKB = observe(
      async (query: string) => vectorDb.query({ query, topK: 5 }),
      { name: 'search-knowledge-base', kind: SpanKind.RETRIEVAL }
    );

    const guardrail = observe(
      async (question: string) => moderationService.check(question),
      { name: 'check-content-policy', kind: SpanKind.GUARDRAIL }
    );

    const generateAnswer = observe(
      async (question: string, context: unknown) => {
        const prompt = await client.loadTextPrompt('answer.prompt.mdx');
        const input = await prompt.format({
          props: { question, context },
          telemetry: { isEnabled: true }
        });
        return generateText(input);
      },
      { name: 'generate-answer', kind: SpanKind.LLM }
    );

    const { result } = await span(
      { name: 'rag-pipeline' },
      async () => {
        const docs = await searchKB(userQuestion);
        await guardrail(userQuestion);
        return generateAnswer(userQuestion, docs);
      }
    );
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    from agentmark_sdk import span_context, SpanOptions, observe, SpanKind
    from agentmark_pydantic_ai_v0 import run_text_prompt
    from agentmark_client import client

    @observe(name="search-knowledge-base", kind=SpanKind.RETRIEVAL)
    async def search_kb(query: str):
        return await vector_db.query(query=query, top_k=5)

    @observe(name="check-content-policy", kind=SpanKind.GUARDRAIL)
    async def guardrail(question: str):
        return await moderation_service.check(question)

    @observe(name="generate-answer", kind=SpanKind.LLM)
    async def generate_answer(question: str, context) -> str:
        prompt = await client.load_text_prompt("answer.prompt.mdx")
        params = await prompt.format(
            props={"question": question, "context": context},
            telemetry={"isEnabled": True},
        )
        return await run_text_prompt(params)

    async with span_context(SpanOptions(name="rag-pipeline")):
        docs = await search_kb(user_question)
        await guardrail(user_question)
        answer = await generate_answer(user_question, docs)
    ```
  </Tab>
</Tabs>

## Scoring traces

Use `sdk.score()` to attach quality scores to traces or spans:

<Tabs>
  <Tab title="TypeScript">
    ```typescript theme={null}
    const { result, traceId } = await span(
      { name: 'scored-workflow' },
      async (ctx) => {
        // myAsyncFunction is a placeholder for your own workflow
        return await myAsyncFunction();
      }
    );

    await sdk.score({
      resourceId: traceId,
      name: 'correctness',
      score: 0.95,
      label: 'correct',
      reason: 'Output matches expected result'
    });
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    async with span_context(SpanOptions(name="scored-workflow")) as ctx:
        output = await my_async_function()

    await sdk.score(
        resource_id=ctx.trace_id,
        name="correctness",
        score=0.95,
        label="correct",
        reason="Output matches expected result",
    )
    ```
  </Tab>
</Tabs>
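
Scores are often computed rather than hard-coded. The sketch below derives the `name`, `score`, `label`, and `reason` fields from a simple exact-match check. `exact_match_score` is a hypothetical helper, and the 0 to 1 scale and the case-insensitive comparison are assumptions made for this example.

```python
def exact_match_score(expected: str, actual: str) -> dict:
    """Return score fields for a case-insensitive, whitespace-trimmed exact match."""
    matched = expected.strip().lower() == actual.strip().lower()
    return {
        "name": "correctness",
        "score": 1.0 if matched else 0.0,
        "label": "correct" if matched else "incorrect",
        "reason": (
            "Output matches expected result"
            if matched
            else "Output differs from expected result"
        ),
    }
```

With the Python call shown above, the result could be passed along as `await sdk.score(resource_id=ctx.trace_id, **exact_match_score(expected, output))`.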

## Best practices

* **Use meaningful function IDs** — `"customer-support-greeting"`, not `"func1"`
* **Add relevant metadata** — userId, environment, query parameters
* **Always enable telemetry in production** — monitor performance and set up alerts
* **Shutdown tracer for short scripts** — call `tracer.shutdown()` before the process exits

## Next steps

<CardGroup>
  <Card title="Sessions" icon="users" href="/observe/sessions">
    Group related traces together
  </Card>

  <Card title="Metadata" icon="tag" href="/observe/metadata">
    Add custom context to traces
  </Card>

  <Card title="Tags" icon="tags" href="/observe/tags">
    Categorize traces with labels
  </Card>

  <Card title="PII masking" icon="shield-halved" href="/observe/pii-masking">
    Redact sensitive data from traces
  </Card>
</CardGroup>



# CLI reference
Source: https://docs.agentmark.co/sdk-reference/cli/commands

Complete reference for all AgentMark CLI commands

The AgentMark CLI (`@agentmark-ai/cli`) provides tools for developing, testing, and building your AI prompts.

## Installation

```bash theme={null}
# Use directly with npx (no global install needed)
npx agentmark <command>

# Or install globally
npm install -g @agentmark-ai/cli
agentmark <command>
```

## Environment variables

The CLI automatically loads environment variables from a `.env` file in the current working directory. This happens before any command execution, so you can store API keys and configuration there.

```bash theme={null}
# .env
OPENAI_API_KEY=sk-...
AGENTMARK_API_KEY=...
AGENTMARK_APP_ID=...
```

See [Environment variables](/configure/environment-variables) for the full list.

## Update notifications

The CLI checks for updates asynchronously when you run commands. If a newer version is available, you'll see a notification after your command completes. This check is non-blocking and won't slow down your workflow.

To disable update checks, set the environment variable:

```bash theme={null}
export AGENTMARK_NO_UPDATE_NOTIFIER=1
```

***

## Commands

### agentmark dev

Start the local development environment — API server, webhook server, and the local dev UI app. When the project is linked (`agentmark link`), traces from local runs automatically forward to AgentMark Cloud.

```bash theme={null}
npx agentmark dev [options]
```

**Options:**

| Option                    | Description                                                                                          | Default |
| ------------------------- | ---------------------------------------------------------------------------------------------------- | ------- |
| `--api-port <number>`     | API server port                                                                                      | `9418`  |
| `--webhook-port <number>` | Webhook server port                                                                                  | `9417`  |
| `--app-port <number>`     | Local dev UI port                                                                                    | `3000`  |
| `--no-forward`            | Disable trace forwarding to AgentMark Cloud (forwarding is on by default when the project is linked) | forward |
| `--no-ui`                 | Skip the UI app (API + webhook only) — for CI / headless / test use                                  | UI on   |

<Note>
  The `--remote` flag and WebSocket Connect feature were removed in `@agentmark-ai/cli` 0.13.0. For programmatic Cloud access (what `--remote` used to enable), run the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server or call the gateway REST API directly.
</Note>

**Project detection:**

* **TypeScript projects**: `agentmark.client.ts` in the project root
* **Python projects**: `pyproject.toml`, `agentmark_client.py`, or `.agentmark/dev_server.py`

**Dev server entry points (TypeScript):**

The CLI looks for dev server files in this order:

1. `dev-server.ts` — custom override (project root)
2. `dev-entry.ts` — default location (project root)
3. `.agentmark/dev-entry.ts` — legacy location
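
The lookup order above amounts to a first-match search. The Python sketch below illustrates that logic only; it is not the CLI's actual implementation, and `resolve_dev_entry` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

# Lookup order from the list above, relative to the project root.
DEV_ENTRY_CANDIDATES = ("dev-server.ts", "dev-entry.ts", ".agentmark/dev-entry.ts")


def resolve_dev_entry(project_root: Path) -> Optional[Path]:
    """Return the first existing dev server file, or None if none is present."""
    for relative in DEV_ENTRY_CANDIDATES:
        candidate = project_root / relative
        if candidate.is_file():
            return candidate
    return None
```

In practice this means a `dev-server.ts` in the project root takes priority over both `dev-entry.ts` locations.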

**Python virtual environment:**

For Python projects, the CLI auto-detects `.venv/` or `venv/` directories.

**Example:**

```bash theme={null}
# Default — API + webhook + UI on 9418/9417/3000
npx agentmark dev

# Custom ports
npx agentmark dev --api-port 9500 --webhook-port 9501

# CI / headless — no UI app
npx agentmark dev --no-ui

# Linked project, but don't forward traces to Cloud
npx agentmark dev --no-forward
```

***

### agentmark login

Authenticate with AgentMark Cloud via browser OAuth. The CLI opens your default browser to complete the login flow, then stores credentials locally for subsequent commands.

```bash theme={null}
npx agentmark login [options]
```

**Options:**

| Option                | Description                                                                                                                                           | Default                                                 |
| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `--base-url <url>`    | AgentMark Cloud URL                                                                                                                                   | `$AGENTMARK_PLATFORM_URL` or `https://app.agentmark.co` |
| `--print-url`         | Print the auth URL instead of opening a browser (for SSH'd shells, CI runners, or IDE-embedded agents)                                                | open browser                                            |
| `--json`              | Emit a single line of JSON on completion instead of human text — useful for wrapper scripts that need to capture `user_id` / `email` programmatically | human text                                              |
| `--timeout <seconds>` | How long to wait for the browser handoff before failing                                                                                               | `120` (2 minutes)                                       |

Stored credentials are used automatically by `agentmark link` and by the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server. You can override stored credentials by setting the `AGENTMARK_API_KEY` environment variable, which takes precedence over the cached session bearer.

***

### agentmark logout

Clear stored CLI authentication credentials and revoke any dev API keys created during `agentmark link`.

```bash theme={null}
npx agentmark logout [options]
```

**Options:**

| Option             | Description                                                                                                                                       | Default                                                 |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `--base-url <url>` | AgentMark Cloud URL                                                                                                                               | `$AGENTMARK_PLATFORM_URL` or `https://app.agentmark.co` |
| `--json`           | Emit a single line of JSON on completion instead of human text. Shape: `{"logged_out": true, "was_logged_in": <bool>, "revoked_dev_key": <bool>}` | human text                                              |

***

### agentmark link

Link your local project to an app in AgentMark Cloud. The CLI prompts you to select an app from your account (or use `--app-id` to skip the prompt), then stores the app ID and a dev API key in your local project configuration.

```bash theme={null}
npx agentmark link [options]
```

**Options:**

| Option             | Description                                                                                                                                                                                            | Default                                                 |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------- |
| `--app-id <uuid>`  | App ID to link (skips interactive selection)                                                                                                                                                           | —                                                       |
| `--base-url <url>` | AgentMark Cloud URL                                                                                                                                                                                    | `$AGENTMARK_PLATFORM_URL` or `https://app.agentmark.co` |
| `--json`           | Emit a single line of JSON on completion (e.g. for CI to capture the linked appId). Shape: `{"linked": true, "appId": "...", "appName": "...", "tenantId": "...", "orgName": "...", "baseUrl": "..."}` | human text                                              |

After linking, `agentmark dev` automatically forwards traces from local prompt runs to the linked app — no flag needed. The linked `appId` is read from `.agentmark/dev-config.json` (per-developer, gitignored). The forwarder authenticates with the session bearer from `~/.agentmark/auth.json` (auto-refreshed); `AGENTMARK_API_KEY` overrides if set.
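
A CI script that runs `agentmark link --json` can validate the output before using it. The sketch below parses the documented shape in Python. `parse_link_output` is a hypothetical helper, and treating the last non-empty line of stdout as the JSON payload is an assumption.

```python
import json

# Keys from the documented `--json` shape for `agentmark link`.
EXPECTED_KEYS = ("linked", "appId", "appName", "tenantId", "orgName", "baseUrl")


def parse_link_output(stdout: str) -> dict:
    """Parse the single JSON line emitted by `agentmark link --json`."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    payload = json.loads(lines[-1])
    missing = [key for key in EXPECTED_KEYS if key not in payload]
    if missing:
        raise ValueError(f"unexpected output, missing: {missing}")
    if payload["linked"] is not True:
        raise ValueError("project was not linked")
    return payload
```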

***

### agentmark run-prompt

Run a single prompt file with test props.

```bash theme={null}
npx agentmark run-prompt <filepath> [options]
```

**Arguments:**

| Argument   | Description                    |
| ---------- | ------------------------------ |
| `filepath` | Path to the `.prompt.mdx` file |

**Options:**

| Option                | Description                                | Default                 |
| --------------------- | ------------------------------------------ | ----------------------- |
| `--server <url>`      | Webhook server URL                         | `http://localhost:9417` |
| `--props <json>`      | Props as JSON string                       | -                       |
| `--props-file <path>` | Path to JSON or YAML file containing props | -                       |

**Example:**

```bash theme={null}
# Run with inline props
npx agentmark run-prompt ./agentmark/greeting.prompt.mdx --props '{"name": "Alice"}'

# Run with props from file
npx agentmark run-prompt ./agentmark/greeting.prompt.mdx --props-file ./test-props.yaml

# Run against a remote server
npx agentmark run-prompt ./agentmark/greeting.prompt.mdx --server https://my-webhook.example.com
```
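
When calling `run-prompt` from a script, building the argument list programmatically avoids shell-quoting problems with the JSON passed to `--props`. The Python sketch below uses only the flags documented above; `build_run_prompt_command` is a hypothetical helper name.

```python
import json
from typing import Optional


def build_run_prompt_command(
    filepath: str,
    props: Optional[dict] = None,
    server: Optional[str] = None,
) -> list:
    """Build an argv list for `npx agentmark run-prompt`."""
    command = ["npx", "agentmark", "run-prompt", filepath]
    if props is not None:
        # Serialize props to the JSON string that --props expects.
        command += ["--props", json.dumps(props)]
    if server is not None:
        command += ["--server", server]
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)`, which passes each argument as-is without involving a shell.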

***

### agentmark run-experiment

Run an experiment against its dataset, with evaluations by default.

```bash theme={null}
npx agentmark run-experiment <filepath> [options]
```

**Arguments:**

| Argument   | Description                                            |
| ---------- | ------------------------------------------------------ |
| `filepath` | Path to the `.prompt.mdx` file with test configuration |

**Options:**

| Option                    | Description                                                                                                                    | Default                 |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------------- |
| `--server <url>`          | Webhook server URL                                                                                                             | `http://localhost:9417` |
| `--skip-eval`             | Skip running evals even if they exist                                                                                          | `false`                 |
| `--format <format>`       | Output format: `table`, `csv`, `json`, `jsonl`, `junit`                                                                        | `table`                 |
| `--threshold <percent>`   | Fail if pass rate is below threshold (0-100)                                                                                   | -                       |
| `--sample <percent>`      | Sample N% of dataset rows randomly (1-100)                                                                                     | -                       |
| `--rows <spec>`           | Select specific rows by index/range (e.g. `0,3-5,9`)                                                                           | -                       |
| `--split <spec>`          | Train/test split (e.g. `train:80`, `test:80`)                                                                                  | -                       |
| `--seed <number>`         | Seed for reproducible sampling/splitting                                                                                       | -                       |
| `--truncate <chars>`      | Truncate table cell content to N chars (0 = no limit)                                                                          | `1000`                  |
| `--concurrency <number>`  | Dataset rows to run in parallel                                                                                                | `20`                    |
| `--baseline-commit <ref>` | Git ref (or tree hash) of a prior run to compare against; enables the regression gate via `test_settings.regression_tolerance` | -                       |

**Example:**

```bash theme={null}
# Run experiment with table output
npx agentmark run-experiment ./agentmark/qa-bot.prompt.mdx

# Run experiment with JSON output, skip evals
npx agentmark run-experiment ./agentmark/qa-bot.prompt.mdx --format json --skip-eval

# Run with CI threshold (fails if <80% pass rate)
npx agentmark run-experiment ./agentmark/qa-bot.prompt.mdx --threshold 80

# Emit JUnit XML for CI gating (GitHub Actions, GitLab, Jenkins, etc.)
npx agentmark run-experiment ./agentmark/qa-bot.prompt.mdx --format junit > results.xml

# Gate against a baseline run — fails rows whose scorer regresses beyond
# test_settings.regression_tolerance relative to the baseline commit
npx agentmark run-experiment ./agentmark/qa-bot.prompt.mdx --baseline-commit main
```
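
To make the `--rows` spec concrete, here is a minimal Python sketch of how a value such as `0,3-5,9` could expand into row indices. It is an illustration, not the CLI's own parser: the function name `expand_rows` is hypothetical, and it assumes ranges include both endpoints.

```python
def expand_rows(spec: str) -> list[int]:
    """Expand a row spec like "0,3-5,9" into sorted, de-duplicated indices.

    Assumption: a range such as "3-5" includes both endpoints.
    """
    rows: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            rows.update(range(start, end + 1))
        else:
            rows.add(int(part))
    return sorted(rows)


print(expand_rows("0,3-5,9"))  # [0, 3, 4, 5, 9]
```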

***

### agentmark generate-types

Generate TypeScript type definitions from your prompt schemas.

```bash theme={null}
npx agentmark generate-types [options]
```

**Options:**

| Option                      | Description                               | Default      |
| --------------------------- | ----------------------------------------- | ------------ |
| `-l, --language <language>` | Target language                           | `typescript` |
| `--local <port>`            | Local server port to fetch prompts from   | -            |
| `--root-dir <path>`         | Root directory containing agentmark files | -            |

**Output:**

The command outputs TypeScript definitions to stdout. Redirect to a file:

```bash theme={null}
npx agentmark generate-types --root-dir ./agentmark > agentmark.types.ts
```

**Generated types include:**

* Input types based on `input_schema`
* Output types based on the model's `schema`
* A mapping of prompt paths to their respective types
* Tool argument types

**Example:**

```bash theme={null}
# Generate from local files
npx agentmark generate-types --root-dir ./agentmark > agentmark.types.ts

# Generate from local dev server
npx agentmark generate-types --local 9418 > agentmark.types.ts
```

See [Type safety](/sdk-reference/typescript/type-safety) for usage examples.

***

### agentmark generate-schema

Generate a JSON Schema file for `.prompt.mdx` frontmatter. This enables IDE validation (squiggles) for fields like `model_name` in your prompt files.

```bash theme={null}
npx agentmark generate-schema [options]
```

**Options:**

| Option                  | Description      | Default      |
| ----------------------- | ---------------- | ------------ |
| `-o, --out <directory>` | Output directory | `.agentmark` |

**Example:**

```bash theme={null}
npx agentmark generate-schema
npx agentmark generate-schema --out ./schemas
```

***

### agentmark build

Build prompts into pre-compiled JSON files for static loading with `FileLoader`.

```bash theme={null}
npx agentmark build [options]
```

**Options:**

| Option                  | Description      | Default          |
| ----------------------- | ---------------- | ---------------- |
| `-o, --out <directory>` | Output directory | `dist/agentmark` |

**Requirements:**

* An `agentmark.json` config file must exist in the current directory
* Prompts are read from the directory specified by `agentmarkPath` in the config

**Output Structure:**

```
dist/agentmark/
  manifest.json           # Build manifest with all prompts
  greeting.prompt.json    # Compiled prompt (mirrors source structure)
  nested/
    helper.prompt.json
```

**Example:**

```bash theme={null}
# Build with default output directory
npx agentmark build

# Build to custom directory
npx agentmark build --out ./build/prompts
```

See [Loaders](/sdk-reference/typescript/loaders) for using built prompts with `FileLoader`.

***

### agentmark pull-models

Pull and configure models from a provider. Runs interactively by default; pass `--provider` + `--models` to skip the prompts (e.g. for CI).

```bash theme={null}
npx agentmark pull-models [options]
```

**Options:**

| Option              | Description                                                           | Default |
| ------------------- | --------------------------------------------------------------------- | ------- |
| `--provider <name>` | Provider key (skips the interactive picker)                           | prompt  |
| `--models <csv>`    | Comma-separated model IDs to add (skips the interactive multi-select) | prompt  |

With both `--provider` and `--models` set, the command runs fully non-interactively and is safe for CI.

When no flags are passed, the command opens an interactive prompt to:

1. Select a model provider
2. Choose models to enable
3. Update your local configuration

***

## Programmatic gateway access (for agents and scripts)

The `agentmark api` CLI command was retired in favor of two protocol-level surfaces that stay in lock-step with the gateway's OpenAPI spec:

* **IDE agents (Claude Code, Cursor, VS Code, Zed):** run the [`agentmark-mcp`](/sdk-reference/tools/agentmark-mcp) MCP server. It fetches the gateway's OpenAPI spec at startup and exposes one MCP tool per operation (e.g. `create_app`, `list_traces`, `start_app_git_connect`). The agent calls those tools directly; no CLI invocation needed.
* **CI / shell scripts:** call the gateway REST API with `curl` and an `AGENTMARK_API_KEY`. The MCP tools are generated from the same OpenAPI spec, so request shapes are identical — only the transport differs.

Both targets honor the same auth chain: `AGENTMARK_API_KEY` env var first, then the session bearer from `~/.agentmark/auth.json` (written by `agentmark login`).
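
The lookup order above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the CLI's or MCP server's code: `resolve_credentials` is a hypothetical helper, and because the internal layout of `auth.json` is not described here, the file is returned as parsed JSON without picking out any field.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_credentials(
    env: Optional[Mapping[str, str]] = None,
    auth_file: Optional[Path] = None,
) -> Optional[dict]:
    """Return the first credential source found, in the documented order."""
    env = os.environ if env is None else env
    auth_file = auth_file or Path.home() / ".agentmark" / "auth.json"

    # 1. The AGENTMARK_API_KEY environment variable wins when it is set.
    api_key = env.get("AGENTMARK_API_KEY")
    if api_key:
        return {"source": "env", "api_key": api_key}

    # 2. Otherwise fall back to the session file written by `agentmark login`.
    #    Its layout is not documented here, so it is returned untouched.
    if auth_file.is_file():
        return {"source": "auth_file", "session": json.loads(auth_file.read_text())}

    return None
```

A caller can treat a `None` result as "not logged in" and prompt for `agentmark login` or an API key.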

***

## Configuration files

### `agentmark.json`

Project configuration file in your project root. See [Project config](/configure/project-config) for the full schema.

```json theme={null}
{
  "agentmarkPath": ".",
  "version": "2.0.0",
  "mdxVersion": "1.0"
}
```

| Field           | Description                                                                                                         |
| --------------- | ------------------------------------------------------------------------------------------------------------------- |
| `agentmarkPath` | Base path for agentmark files (contains the `agentmark/` directory) — use `"."` for the canonical layout, not `"/"` |
| `version`       | Configuration version                                                                                               |
| `mdxVersion`    | MDX syntax version                                                                                                  |

### `.agentmark/dev-config.json`

Auto-generated local development configuration (gitignored):

```json theme={null}
{
  "createdAt": "2026-04-15T10:30:00.000Z",
  "appPort": 3000,
  "forwarding": {
    "appId": "app_xxxxx",
    "appName": "my-app",
    "orgName": "my-org",
    "tenantId": "tenant_xxxxx",
    "apiKey": "am_dev_xxxxx",
    "apiKeyId": "key_xxxxx",
    "expiresAt": "2026-05-15T10:30:00.000Z",
    "baseUrl": "https://app.agentmark.co"
  }
}
```

This file stores:

* `appPort` — local dev server UI port (updated when dev server starts).
* `forwarding` — linked app metadata (app ID, dev API key, token expiry) used by `agentmark dev` when forwarding traces to AgentMark Cloud. Populated by `agentmark link` and cleared by `agentmark logout`.

The configuration expires after 30 days and is regenerated on the next `agentmark link`.
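
If a script needs to know whether the linked configuration is still usable, a check along these lines may help. It is a minimal sketch that relies only on the `forwarding.expiresAt` field shown above; the helper names `load_dev_config` and `needs_relink` are hypothetical, and treating a missing `forwarding` block as "needs relink" is this sketch's choice rather than documented CLI behavior.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def load_dev_config(project_root: Path) -> dict:
    """Read .agentmark/dev-config.json, or return an empty dict if absent."""
    path = project_root / ".agentmark" / "dev-config.json"
    return json.loads(path.read_text()) if path.is_file() else {}


def needs_relink(config: dict, now: Optional[datetime] = None) -> bool:
    """Return True when forwarding is missing or past its expiresAt time."""
    forwarding = config.get("forwarding")
    if not forwarding or "expiresAt" not in forwarding:
        return True
    # datetime.fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalise it to an explicit UTC offset first.
    expires_at = datetime.fromisoformat(forwarding["expiresAt"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now >= expires_at
```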

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# Python client setup
Source: https://docs.agentmark.co/sdk-reference/python/client-setup

Install and configure the AgentMark Python SDK with Pydantic AI

The AgentMark client is configured in `agentmark_client.py`. It connects your prompts to AI models, tools, and prompt loading. This file is auto-generated by `npm create agentmark@latest` when you select Python.

## Installation

```bash theme={null}
pip install agentmark-pydantic-ai-v0 agentmark-prompt-core
```

<Note>
  **Package vs import names:**

  * `agentmark-prompt-core` → `from agentmark.prompt_core import ApiLoader, FileLoader`
  * `agentmark-pydantic-ai-v0` → `from agentmark_pydantic_ai_v0 import ...`

  `ApiLoader` ships with `agentmark-prompt-core` — there's no separate `agentmark-loader-api` PyPI package.
</Note>

## Configuration

The Python adapter does **not** ship a default registry — register providers explicitly. The `"<provider>:<model>"` string format tells Pydantic AI which provider to use at runtime:

```python agentmark_client.py theme={null}
import os
from dotenv import load_dotenv
from agentmark.prompt_core import ApiLoader
from agentmark_pydantic_ai_v0 import (
    create_pydantic_ai_client,
    PydanticAIModelRegistry,
)

load_dotenv()

model_registry = PydanticAIModelRegistry()
model_registry.register_models(
    ["gpt-4o", "gpt-4o-mini"],
    lambda name, opts=None: f"openai:{name}",
)
model_registry.register_models(
    ["claude-sonnet-4-20250514"],
    lambda name, opts=None: f"anthropic:{name}",
)

if os.getenv("NODE_ENV") == "development":
    loader = ApiLoader.local(
        base_url=os.getenv("AGENTMARK_BASE_URL", "http://localhost:9418")
    )
else:
    loader = ApiLoader.cloud(
        api_key=os.environ["AGENTMARK_API_KEY"],
        app_id=os.environ["AGENTMARK_APP_ID"],
    )

client = create_pydantic_ai_client(
    model_registry=model_registry,
    loader=loader,
)
```

## Model registry

`PydanticAIModelRegistry.register_models(pattern, creator)` accepts an exact string, a `re.Pattern`, or a list of strings. The creator returns either a `"<provider>:<model>"` string or a Pydantic AI `Model` instance. Use `set_default(creator)` for a fallback:

```python theme={null}
import re
from agentmark_pydantic_ai_v0 import PydanticAIModelRegistry

model_registry = PydanticAIModelRegistry()

# Exact matches
model_registry.register_models(
    ["gpt-4o", "gpt-4o-mini"],
    lambda name, opts=None: f"openai:{name}",
)

# Regex pattern
model_registry.register_models(
    re.compile(r"^claude-"),
    lambda name, opts=None: f"anthropic:{name}",
)

# Fallback for unmatched names
model_registry.set_default(lambda name, opts=None: name)
```

Model names in the registry must match the `model_name` in your prompt frontmatter.
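
To build intuition for how exact names, lists, a regex, and a default fallback can interact, here is a toy lookup in plain Python. It is not the `PydanticAIModelRegistry` implementation: `ToyRegistry` and its `resolve` method are hypothetical, creators take only the model name, and the matching rules (registration order, `search()` for regexes) are assumptions made for the illustration.

```python
import re
from typing import Callable, List, Optional, Pattern, Tuple, Union

Creator = Callable[[str], str]
Matcher = Union[str, Pattern[str], List[str]]


class ToyRegistry:
    """Illustrative stand-in; not the real registry."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Matcher, Creator]] = []
        self._default: Optional[Creator] = None

    def register_models(self, pattern: Matcher, creator: Creator) -> None:
        self._entries.append((pattern, creator))

    def set_default(self, creator: Creator) -> None:
        self._default = creator

    def resolve(self, model_name: str) -> str:
        # Assumption: entries are tried in registration order.
        for pattern, creator in self._entries:
            if isinstance(pattern, str):
                matched = pattern == model_name
            elif isinstance(pattern, list):
                matched = model_name in pattern
            else:
                # Assumption: regexes use search(), so anchors such as ^ matter.
                matched = pattern.search(model_name) is not None
            if matched:
                return creator(model_name)
        if self._default is not None:
            return self._default(model_name)
        raise KeyError(f"no creator registered for {model_name!r}")


registry = ToyRegistry()
registry.register_models(["gpt-4o", "gpt-4o-mini"], lambda name: f"openai:{name}")
registry.register_models(re.compile(r"^claude-"), lambda name: f"anthropic:{name}")
registry.set_default(lambda name: name)
print(registry.resolve("claude-sonnet-4-20250514"))  # anthropic:claude-sonnet-4-20250514
```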

## Prompt loading

The loader determines how prompts are fetched at runtime:

```python theme={null}
import os

from agentmark.prompt_core import ApiLoader

# Local — loads from dev server (development)
loader = ApiLoader.local(base_url="http://localhost:9418")

# Cloud — loads from the AgentMark HTTP API (production)
loader = ApiLoader.cloud(
    api_key=os.environ["AGENTMARK_API_KEY"],
    app_id=os.environ["AGENTMARK_APP_ID"],
)
```

Prompts are cached in memory, so repeated loads within the same process reuse the cached copy.

## Running prompts

```python theme={null}
import asyncio

from agentmark_client import client
from agentmark_pydantic_ai_v0 import run_text_prompt


async def main() -> None:
    # Load the prompt, bind its props, then execute it.
    prompt = await client.load_text_prompt("greeting.prompt.mdx")
    params = await prompt.format(props={"name": "Alice"})

    result = await run_text_prompt(params)
    print(result.output)


asyncio.run(main())
```

## Dev server

Start the Python dev server for local development:

```bash theme={null}
npx agentmark dev
```

This starts the local API server on port 9418 and the local dev server UI on port 3000. See [Dev server](/sdk-reference/python/dev-server) for configuration options.

## Evals

You can register evaluation functions to score prompt outputs during experiments. Pass an `evals` dictionary of plain functions:

```python agentmark_client.py theme={null}
from agentmark.prompt_core import EvalParams, EvalResult

evals = {
    "exact_match": lambda params: {
        "passed": params["output"] == params.get("expectedOutput"),
    },
}

client = create_pydantic_ai_client(
    model_registry=model_registry,
    loader=loader,
    evals=evals,
)
```

Score schemas are defined separately in `agentmark.json` and synced to AgentMark Cloud. Eval functions are connected to scores by name.

See [Evaluations](/evaluate/writing-evals) for the full guide on writing eval functions and configuring score schemas.

## Full reference

For all configuration options (including tools, MCP, and the Claude Agent SDK Python adapter), see [Client config](/configure/client-config).



# Python dev server
Source: https://docs.agentmark.co/sdk-reference/python/dev-server

Running the AgentMark development server with Python

The AgentMark CLI automatically detects and runs Python projects with the appropriate dev server configuration.

## Starting the dev server

```bash theme={null}
npx agentmark dev
```

The CLI detects Python projects and spawns the Python webhook server alongside the API server and UI.

## Project detection

The CLI identifies Python projects by checking for:

1. `pyproject.toml` - Python project manifest
2. `agentmark_client.py` - AgentMark client configuration
3. `.agentmark/dev_server.py` - Auto-generated entry point

If any of these files exist, the CLI runs in Python mode.

## Virtual environment detection

The CLI automatically detects and uses virtual environments:

```text theme={null}
Priority order:
1. .venv/bin/python (or .venv\Scripts\python.exe on Windows)
2. venv/bin/python (or venv\Scripts\python.exe on Windows)
3. System python
```

When a virtual environment is found, the CLI prints:

```text theme={null}
Using virtual environment: .venv/
```
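
The priority order can be expressed as a short lookup. The sketch below is an approximation for readers who want to reproduce the choice in their own tooling, not the CLI's source: `find_python` is a hypothetical name, and "System python" is assumed to mean whatever `python` resolves to on `PATH`.

```python
import sys
from pathlib import Path


def find_python(project_root: Path, windows: bool = sys.platform == "win32") -> str:
    """Return the first interpreter found in the documented priority order."""
    relative = Path("Scripts") / "python.exe" if windows else Path("bin") / "python"
    for venv_dir in (".venv", "venv"):
        candidate = project_root / venv_dir / relative
        if candidate.is_file():
            return str(candidate)
    # Assumption: fall back to the interpreter named `python` on PATH.
    return "python"
```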

## Entry point resolution

The dev server entry point is resolved in this order:

| Location                   | Description                      |
| -------------------------- | -------------------------------- |
| `dev_server.py`            | Custom dev server (project root) |
| `.agentmark/dev_server.py` | Auto-generated server            |

### Custom dev server

Create a `dev_server.py` in your project root to customize the webhook server:

```python dev_server.py theme={null}
"""Custom webhook server for AgentMark development."""

import argparse
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent))

from agentmark_pydantic_ai_v0 import create_webhook_server
from agentmark_client import client

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--webhook-port", type=int, default=9417)
    parser.add_argument("--api-server-port", type=int, default=9418)
    args = parser.parse_args()

    create_webhook_server(client, args.webhook_port, args.api_server_port)
```

### Auto-generated server

When you run `npm create agentmark@latest`, an entry point is created at `.agentmark/dev_server.py`:

```python .agentmark/dev_server.py theme={null}
"""Auto-generated webhook server for AgentMark development."""

import argparse
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

from agentmark_pydantic_ai_v0 import create_webhook_server
from agentmark_client import client

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--webhook-port", type=int, default=9417)
    parser.add_argument("--api-server-port", type=int, default=9418)
    args = parser.parse_args()

    create_webhook_server(client, args.webhook_port, args.api_server_port)
```

## Environment variables

The dev server sets the following environment variables:

| Variable                  | Value                         | Description                     |
| ------------------------- | ----------------------------- | ------------------------------- |
| `PYTHONDONTWRITEBYTECODE` | `1`                           | Prevents `__pycache__` creation |
| `PYTHONUNBUFFERED`        | `1`                           | Ensures real-time output        |
| `AGENTMARK_BASE_URL`      | `http://localhost:{api_port}` | API server URL for telemetry    |
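
If you start the webhook entry point by hand (for example under a debugger), you may want to reproduce these variables yourself. The helper below is a small sketch of that idea; `dev_server_env` is a hypothetical name, and the values are taken from the table above.

```python
import os
from typing import Dict, Mapping, Optional


def dev_server_env(
    api_port: int = 9418,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy the base environment and add the variables from the table."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "PYTHONDONTWRITEBYTECODE": "1",
            "PYTHONUNBUFFERED": "1",
            "AGENTMARK_BASE_URL": f"http://localhost:{api_port}",
        }
    )
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching `dev_server.py`.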

## Server architecture

When you run `npx agentmark dev`, three servers start:

```text theme={null}
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   API Server    │────▶│ Webhook Server  │────▶│    UI Server    │
│   (port 9418)   │     │   (port 9417)   │     │   (port 3000)   │
│                 │     │                 │     │                 │
│  Telemetry API  │     │ Python Process  │     │    Next.js      │
│  Trace Storage  │     │ Prompt Executor │     │    Dashboard    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```

The Python webhook server:

* Receives prompt execution requests from the CLI
* Uses your `agentmark_client.py` configuration
* Executes prompts via the configured adapter (Pydantic AI or Claude Agent SDK)
* Returns streaming or non-streaming responses

## Port configuration

Override default ports with CLI options:

```bash theme={null}
npx agentmark dev --webhook-port 8080 --api-port 8081 --app-port 8082
```

| Option           | Default | Description         |
| ---------------- | ------- | ------------------- |
| `--webhook-port` | 9417    | Webhook server port |
| `--api-port`     | 9418    | API server port     |
| `--app-port`     | 3000    | UI server port      |

## Webhook handler

The webhook server handles two event types:

### prompt-run

Executes a single prompt:

```json theme={null}
{
  "type": "prompt-run",
  "data": {
    "ast": { ... },
    "options": {
      "shouldStream": true
    },
    "customProps": { ... }
  }
}
```

### dataset-run

Executes a prompt across a dataset:

```json theme={null}
{
  "type": "dataset-run",
  "data": {
    "ast": { ... },
    "experimentId": "exp-123",
    "datasetPath": "./datasets/test.yaml"
  }
}
```
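
Both payloads share a `type` discriminator, so a handler can route on that field. The following is a minimal sketch of such routing, not the code inside `create_webhook_server`: the handler functions and their return strings are hypothetical and only echo fields that appear in the payloads above.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], str]


def handle_prompt_run(data: Dict[str, Any]) -> str:
    # Hypothetical handler: reports whether streaming was requested.
    stream = data.get("options", {}).get("shouldStream", False)
    return f"prompt-run (stream={stream})"


def handle_dataset_run(data: Dict[str, Any]) -> str:
    # Hypothetical handler: reports which dataset would be iterated.
    return f"dataset-run ({data.get('datasetPath')})"


HANDLERS: Dict[str, Handler] = {
    "prompt-run": handle_prompt_run,
    "dataset-run": handle_dataset_run,
}


def dispatch(event: Dict[str, Any]) -> str:
    """Route a webhook event to its handler based on the `type` field."""
    handler = HANDLERS.get(event.get("type", ""))
    if handler is None:
        raise ValueError(f"unsupported event type: {event.get('type')!r}")
    return handler(event.get("data", {}))
```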

## Running prompts

With the dev server running, execute prompts from another terminal:

```bash theme={null}
npx agentmark run-prompt ./agentmark/<your-prompt>.prompt.mdx
```

Or run experiments:

```bash theme={null}
npx agentmark run-experiment ./agentmark/<your-prompt>.prompt.mdx
```

<Tip>
  Need a working starter? See [Example prompts](/build/example-prompts) — four copy-paste recipes (object, text+tools, image, speech) you can drop into your `agentmark/` directory.
</Tip>

## Troubleshooting

### Virtual environment not found

If you see "python not found" errors:

```bash theme={null}
# Create a virtual environment
python -m venv .venv

# Activate it
source .venv/bin/activate  # macOS/Linux
.venv\Scripts\activate     # Windows

# Install dependencies directly — `pip install -e ".[dev]"` has known
# "resolution too deep" issues in scaffolded projects, so install the
# packages you need explicitly.
pip install agentmark-sdk agentmark-pydantic-ai-v0 agentmark-prompt-core python-dotenv
```

### Module not found

Ensure dependencies are installed in the correct virtual environment:

```bash theme={null}
pip install agentmark-pydantic-ai-v0 agentmark-prompt-core python-dotenv
```

### Port already in use

If ports are busy, specify alternative ports:

```bash theme={null}
npx agentmark dev --webhook-port 9500 --api-port 9501
```

### `agentmark_client.py` not found

The CLI requires `agentmark_client.py` in your project root:

```bash theme={null}
# Create a new project
npm create agentmark@latest

# Or manually create agentmark_client.py
```

## Adapter-specific considerations

### Pydantic AI

The Pydantic AI dev server uses `aiohttp` for async HTTP handling:

```python theme={null}
from agentmark_pydantic_ai_v0 import create_webhook_server
from agentmark_client import client  # your configured client

create_webhook_server(client, webhook_port=9417, api_server_port=9418)
```

### Claude Agent SDK

The Claude Agent SDK dev server handles agentic execution:

```python theme={null}
from agentmark_claude_agent_sdk_v0 import create_webhook_server
from agentmark_client import client  # your configured client

create_webhook_server(client, webhook_port=9417, api_server_port=9418)
```

## Next steps

<CardGroup>
  <Card title="Python overview" icon="python" href="/getting-started/quickstart">
    Python SDK and project setup
  </Card>

  <Card title="Pydantic AI" icon="cube" href="/integrations/python/pydantic-ai">
    Type-safe LLM interactions
  </Card>

  <Card title="Claude Agent SDK" icon="microchip" href="/integrations/typescript/claude-agent-sdk">
    Agentic task execution
  </Card>

  <Card title="Running prompts" icon="play" href="/build/running-prompts">
    Execute prompts from CLI
  </Card>
</CardGroup>



# AI editor integration
Source: https://docs.agentmark.co/sdk-reference/tools/agentmark-mcp

Using AgentMark within your AI-powered code editors

AgentMark is compatible with a variety of AI-powered code editors through the Model Context Protocol (MCP). This page covers the **docs MCP** (`agentmark-docs`) — a remote server that teaches your editor how to author AgentMark files.

<Note>
  To **query** live traces and data from your editor (rather than author files), use the gateway MCP server instead — see [MCP trace server](/sdk-reference/tools/mcp-trace-server).
</Note>


## Getting started

When you run `npm create agentmark@latest`, you are prompted to select an AI code editor, and the necessary configuration for that editor is generated automatically.

Once setup finishes, you can create and update AgentMark files directly from your editor's AI chat.

## Manual setup

Add the following configuration to your AI code editor settings:

```json theme={null}
{
  "mcpServers": {
    "agentmark-docs": {
      "url": "https://docs.agentmark.co/mcp"
    }
  }
}
```

## Related documentation

<CardGroup>
  <Card title="MCP Tools in Prompts" icon="wrench" href="/build/mcp">
    Use MCP tools directly within your AgentMark prompts
  </Card>

  <Card title="Traces and Logs" icon="chart-line" href="/observe/tracing-setup">
    Debug prompt execution with OpenTelemetry tracing
  </Card>
</CardGroup>


# MCP Trace Server
Source: https://docs.agentmark.co/sdk-reference/tools/mcp-trace-server

Query AgentMark traces and data from your AI editor over MCP

The `@agentmark-ai/mcp-server` package exposes the AgentMark gateway to AI-powered editors over the [Model Context Protocol](https://modelcontextprotocol.io). Point it at your local `agentmark dev` server or at AgentMark Cloud, and your AI assistant can list traces, drill into spans, check capabilities, write scores, and call any other gateway operation — without leaving your editor.

<Note>
  This is different from the docs MCP described in [AI editor integration](/sdk-reference/tools/agentmark-mcp), which is a remote server that helps your editor **author** `.prompt.mdx` files. This package connects to your **gateway** (local or Cloud) to query and debug live data.
</Note>

## How tools are generated

The server does not ship a fixed, hand-written tool list. On startup it reads the gateway's OpenAPI contract from `/v1/openapi.json` and registers **one MCP tool per (non-deprecated) endpoint**. The tool name is the operation's `operationId` in snake\_case, and each tool's input is the endpoint's path + query + body parameters flattened into a single object.

Both the local dev server and the Cloud gateway serve the same OpenAPI contract, so the same tools are available against either — only the configured URL differs.

Representative tools (the exact set tracks the gateway's current API):

| Tool               | Backing endpoint           |
| ------------------ | -------------------------- |
| `list_traces`      | `GET /v1/traces`           |
| `get_trace`        | `GET /v1/traces/{traceId}` |
| `list_spans`       | `GET /v1/spans`            |
| `get_capabilities` | `GET /v1/capabilities`     |
| `list_sessions`    | `GET /v1/sessions`         |
| `create_score`     | `POST /v1/scores`          |

See the [API reference](/api-reference/overview) for the full list of operations — every one of them is exposed as a tool.
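
The mapping from OpenAPI operations to tools can be pictured with a short Python sketch. This is an approximation of the idea described above, not the `@agentmark-ai/mcp-server` source: the function names are hypothetical, it assumes camelCase `operationId` values, and it only reads inline JSON body properties without resolving `$ref` pointers.

```python
import re
from typing import Any, Dict, List

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def to_snake_case(operation_id: str) -> str:
    """Convert an operationId such as "listTraces" to "list_traces"."""
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", operation_id)
    return re.sub(r"[^0-9A-Za-z]+", "_", with_breaks).strip("_").lower()


def tools_from_openapi(spec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one tool description per non-deprecated operation."""
    tools = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS or op.get("deprecated"):
                continue
            if "operationId" not in op:
                continue
            # Flatten path and query parameters plus inline body properties.
            params = [
                p["name"]
                for p in op.get("parameters", [])
                if p.get("in") in ("path", "query")
            ]
            body_schema = (
                op.get("requestBody", {})
                .get("content", {})
                .get("application/json", {})
                .get("schema", {})
            )
            tools.append(
                {
                    "name": to_snake_case(op["operationId"]),
                    "endpoint": f"{method.upper()} {path}",
                    "inputs": params + list(body_schema.get("properties", {})),
                }
            )
    return tools
```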

## Configuration

The server talks to exactly one URL. Set it with `AGENTMARK_API_URL`.

| Variable               | Default                    | Description                                                                   |
| ---------------------- | -------------------------- | ----------------------------------------------------------------------------- |
| `AGENTMARK_API_URL`    | `https://api.agentmark.co` | Gateway URL — set to `http://localhost:9418` for the local dev server         |
| `AGENTMARK_API_KEY`    | –                          | API key for authentication (required for Cloud; local dev is unauthenticated) |
| `AGENTMARK_TIMEOUT_MS` | `30000`                    | Per-request timeout in milliseconds                                           |

## Editor setup

Run the server with `npx` — there's nothing to install. `npm create agentmark@latest` wires this up for you (as the `agentmark` and `agentmark-local` entries); the configs below are the manual equivalent.

<Tabs>
  <Tab title="Local dev server">
    Point at your running `agentmark dev` server. Add to `.mcp.json` (Claude Code), `.cursor/mcp.json` (Cursor), or your editor's MCP config:

    ```json theme={null}
    {
      "mcpServers": {
        "agentmark-local": {
          "command": "npx",
          "args": ["-y", "@agentmark-ai/mcp-server"],
          "env": {
            "AGENTMARK_API_URL": "http://localhost:9418"
          }
        }
      }
    }
    ```
  </Tab>

  <Tab title="AgentMark Cloud">
    Point at the Cloud gateway and supply an API key:

    ```json theme={null}
    {
      "mcpServers": {
        "agentmark": {
          "command": "npx",
          "args": ["-y", "@agentmark-ai/mcp-server"],
          "env": {
            "AGENTMARK_API_KEY": "your-api-key"
          }
        }
      }
    }
    ```

    `AGENTMARK_API_URL` defaults to `https://api.agentmark.co`, so you only need to set it for staging or self-hosted gateways.
  </Tab>
</Tabs>

<Tip>
  Register **both** entries to work across local and Cloud in one session. MCP clients namespace tools by server name, so your assistant calls `agentmark-local/list_traces` for local traces and `agentmark/list_traces` for Cloud.
</Tip>

## Querying traces

A typical debugging flow: ask your assistant to list recent traces, then drill into one.

* `list_traces` accepts the same query parameters as `GET /v1/traces` — `limit`, `offset`, `status`, `user_id`, `model`, `session_id`, `dataset_run_id`, `name`, `tag`, and date filters. Pagination is offset-based.
* `get_trace` takes the `traceId` path parameter plus an optional `fields` query value (e.g. `fields=graph`) and returns the trace with its spans.

Because the tools mirror the REST API one-to-one, the [API reference](/api-reference/overview) is the source of truth for every tool's parameters and response shape.
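
Since pagination is offset-based, a script that calls the REST endpoint directly can walk pages by advancing `offset` in steps of `limit`. The sketch below only builds the request URLs and makes no assumptions about the response shape; `traces_url` and `page_urls` are hypothetical helpers, and the caller is assumed to know how many records it wants in total.

```python
from typing import Iterator, Optional
from urllib.parse import urlencode


def traces_url(base_url: str, limit: int, offset: int = 0, **filters: Optional[str]) -> str:
    """Build a GET /v1/traces URL; filters set to None are dropped."""
    query: dict = {"limit": limit, "offset": offset}
    query.update({key: value for key, value in filters.items() if value is not None})
    return f"{base_url.rstrip('/')}/v1/traces?{urlencode(query)}"


def page_urls(base_url: str, total: int, limit: int, **filters: Optional[str]) -> Iterator[str]:
    """Yield one URL per page, advancing the offset by `limit` each time."""
    for offset in range(0, total, limit):
        yield traces_url(base_url, limit, offset, **filters)


for url in page_urls("http://localhost:9418", total=120, limit=50, session_id="abc"):
    print(url)
```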

## Error handling

Tool calls that fail return an MCP error result — `{ isError: true, content: [{ type: "text", text: "..." }] }` — with the underlying HTTP status or message in the text. There is no separate error-code enum to handle.
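
A client that receives tool results as plain dictionaries could surface the failure message with a helper like the one below. It is a small sketch based only on the result shape quoted above; `error_text` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def error_text(result: Dict[str, Any]) -> Optional[str]:
    """Return the joined text of an MCP error result, or None on success."""
    if not result.get("isError"):
        return None
    parts = [
        item.get("text", "")
        for item in result.get("content", [])
        if item.get("type") == "text"
    ]
    return "\n".join(parts)
```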

## Requirements

For local debugging:

1. Run `npx agentmark dev` to start the local dev server (API on port `9418`).
2. Execute prompts to generate traces.
3. Ask your AI editor to query and debug them via the `agentmark-local` tools.

## Programmatic usage

You can run the server from code:

```typescript theme={null}
import { createMCPServer, runServer } from '@agentmark-ai/mcp-server';

// Run with stdio transport (for MCP clients)
await runServer();

// Or create a server instance for a custom transport
const server = await createMCPServer();
```

## Related documentation

<CardGroup>
  <Card title="AI editor integration" icon="wand-magic-sparkles" href="/sdk-reference/tools/agentmark-mcp">
    The docs MCP for authoring AgentMark files
  </Card>

  <Card title="Traces and logs" icon="chart-line" href="/observe/tracing-setup">
    Learn about AgentMark tracing
  </Card>

  <Card title="API reference" icon="terminal" href="/api-reference/overview">
    Every gateway operation, one per MCP tool
  </Card>
</CardGroup>



# Troubleshooting
Source: https://docs.agentmark.co/sdk-reference/troubleshooting

Solutions to common errors and issues when using AgentMark

Common errors you may encounter when using AgentMark and how to resolve them. Every error string below is a literal match from the source — search your logs for the exact text.

## Connection and authentication

### Cannot connect to local dev server

**Error** (from Node fetch): `fetch failed — cause: { code: 'ECONNREFUSED', address: '127.0.0.1', port: 9418, ... }`

**Cause:** The local dev server isn't running on the port your `ApiLoader.local()` is pointing at.

**Solution:**

1. Start the dev server in a separate terminal:
   ```bash theme={null}
   npx agentmark dev
   ```
2. Verify it's reachable at `http://localhost:9418`.
3. If using a custom port:
   ```bash theme={null}
   npx agentmark dev --api-port 9500
   ```
   ```typescript theme={null}
   ApiLoader.local({ baseUrl: "http://localhost:9500" })
   ```
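
To confirm from a script that something is listening on the API port before loading prompts, a plain TCP probe is usually enough. This is a generic sketch using the Python standard library, not an AgentMark API; `port_open` is a hypothetical helper.

```python
import socket


def port_open(host: str = "127.0.0.1", port: int = 9418, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A `False` result for port `9418` matches the `ECONNREFUSED` cause shown in the error above.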

### Not authorized (401)

**Error** (gateway response): `{"error":"Not authorized","status":401}`

**Cause:** Your `AGENTMARK_API_KEY` is missing, revoked, or doesn't match the `X-Agentmark-App-Id` / `AGENTMARK_APP_ID` you sent.

**Solution:**

1. Verify the key in the [AgentMark Dashboard](https://app.agentmark.co/settings/api-keys).
2. Make sure the app ID matches the key's scope (keys are app-scoped).
3. Update your `.env`:
   ```bash theme={null}
   AGENTMARK_API_KEY=sk_agentmark_...
   AGENTMARK_APP_ID=app_...
   ```
4. Restart your application so the new values are picked up.

## Model registry

### Model not registered

**Error** (TS SDK): `No model function found for: 'gpt-4o'. Register it with .registerModels() or use provider/model format with .registerProviders().`

**Cause:** The model name in your prompt frontmatter isn't registered in your client.

**Solution (TypeScript):**

```typescript theme={null}
const modelRegistry = new VercelAIModelRegistry();

// Option 1: register providers, then use "openai/gpt-4o" in prompts
modelRegistry.registerProviders({ openai });

// Option 2: register a specific model
modelRegistry.registerModels(["gpt-4o"], (name) => openai(name));
```

**Solution (Python):** Python adapters don't ship a default registry — you must register explicitly:

```python theme={null}
model_registry = PydanticAIModelRegistry()
model_registry.register_models(
    ["gpt-4o"],
    lambda name, opts=None: f"openai:{name}",
)
```

### `registerModels` rejects an array of regexes at type-check

**Error**: `Type 'RegExp' is not assignable to type 'string'. [TS2322]`

**Cause:** `registerModels` accepts `string | RegExp | Array<string>` — an `Array<RegExp>` is not a valid overload.

**Solution:** Pass the regex directly, without the surrounding array:

```typescript theme={null}
// Wrong
modelRegistry.registerModels([/^gpt-/], (name) => openai(name));

// Right
modelRegistry.registerModels(/^gpt-/, (name) => openai(name));
```
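
If you need several patterns, one option is to merge them into a single `RegExp` before registering. This is a sketch using only standard JavaScript; `anyOf` is a hypothetical helper, and it assumes the individual patterns carry no flags, since flags are dropped when the sources are combined.

```typescript
// Merge several regexes into one RegExp that matches if any of them matches.
// Assumption: flags on the individual patterns are ignored.
function anyOf(patterns: RegExp[]): RegExp {
  if (patterns.length === 0) {
    throw new Error("anyOf needs at least one pattern");
  }
  return new RegExp(patterns.map((p) => `(?:${p.source})`).join("|"));
}

const gptOrO1 = anyOf([/^gpt-/, /^o1-/]);
console.log(gptOrO1.test("gpt-4o")); // true
console.log(gptOrO1.test("o1-mini")); // true
console.log(gptOrO1.test("claude-3-5-sonnet")); // false
```

The merged value is a single `RegExp`, so it fits the `string | RegExp | Array<string>` parameter described above.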

### AI SDK v5 tool fails type-check

**Error**: `No overload matches this call. [TS2769]` on a `tool({ parameters: z.object(...) })` call.

**Cause:** AI SDK v4 used `parameters:`; v5 renamed it to `inputSchema:`.

**Solution:** Use `inputSchema:` in v5 projects:

```typescript theme={null}
const weatherTool = tool({
  description: "Get weather for a location",
  inputSchema: z.object({ location: z.string() }),
  execute: async ({ location }) => `Weather in ${location}: 72°F`,
});
```

Mastra continues to use the `ai` v4 `tool()` helper with `parameters:` — see [Mastra integration](/integrations/typescript/mastra).

## MCP servers

### MCP server not registered

**Error** (TS AI SDK adapter): `MCP server 'docs' not registered. Available servers: ...`

**Cause:** The prompt references `mcp://docs/...` but no server named `docs` is configured on `createAgentMarkClient`.

**Solution:** Add the server to your client config:

```typescript theme={null}
export const client = createAgentMarkClient({
  loader,
  modelRegistry,
  mcpServers: {
    docs: { url: "https://example.com/mcp" },
  },
});
```

### Claude Agent TS uses camelCase `mcpServers`, Python uses snake\_case

The TypeScript adapter reads `mcpServers` (camelCase); the Python adapter reads `mcp_servers` (snake\_case). The wrong casing is silently ignored — no tools will be available from the MCP server at runtime.
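
Because the wrong casing fails silently, a guard at startup can surface it early. The check below is a hypothetical sketch, not an AgentMark API: it inspects a plain config object for the Python-style key before the object is handed to the TypeScript adapter.

```typescript
// Hypothetical guard: flags the Python-style key in a TypeScript config object.
function findMiscasedMcpKey(config: Record<string, unknown>): string | null {
  if ("mcp_servers" in config && !("mcpServers" in config)) {
    return "Found `mcp_servers`; the TypeScript adapter reads `mcpServers`.";
  }
  return null;
}

console.log(findMiscasedMcpKey({ mcp_servers: { docs: { url: "https://example.com/mcp" } } }));
console.log(findMiscasedMcpKey({ mcpServers: { docs: { url: "https://example.com/mcp" } } })); // null
```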

## Prompts and files

### File not found

**Error** (CLI run-prompt): `File not found: /absolute/path/to/prompt.prompt.mdx`

**Cause:** The path passed to `run-prompt` doesn't resolve to an existing file.

**Solution:** Pass a path relative to your project root (e.g. `agentmark/greeting.prompt.mdx`).

### Pre-built prompt not found

**Error** (FileLoader): `Pre-built prompt not found: /path/to/dist/agentmark/greeting.prompt.mdx.json. Run 'agentmark build' to compile your prompts.`

**Cause:** `FileLoader` reads compiled JSON from the `--out` directory of `agentmark build`. The prompt either hasn't been built or the output directory doesn't match.

**Solution:**

```bash theme={null}
npx agentmark build --out dist/agentmark
```

Then make sure your `FileLoader` points at the same directory:

```typescript theme={null}
const loader = new FileLoader("./dist/agentmark");
```

### agentmarkPath `/` breaks build

**Error:** `AgentMark directory not found: /agentmark. Check your agentmark.json configuration.`

**Cause:** `"agentmarkPath": "/"` in `agentmark.json` resolves to the filesystem root. The scaffolder writes `"."` — the relative path from the project root.

**Solution:** Change `agentmark.json`:

```json theme={null}
{
  "agentmarkPath": "."
}
```
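
The difference between `"/"` and `"."` follows from how absolute path segments reset resolution. The sketch below uses Node's `path.posix.resolve` to illustrate it; that the CLI joins the project root, `agentmarkPath`, and an `agentmark` directory in this way is an assumption inferred from the error message, and the actual logic may differ.

```typescript
import { posix as path } from "node:path";

// Assumption: the CLI resolves <projectRoot>/<agentmarkPath>/agentmark.
// An absolute segment such as "/" discards everything to its left.
function resolveAgentmarkDir(projectRoot: string, agentmarkPath: string): string {
  return path.resolve(projectRoot, agentmarkPath, "agentmark");
}

console.log(resolveAgentmarkDir("/home/me/app", "/")); // /agentmark
console.log(resolveAgentmarkDir("/home/me/app", ".")); // /home/me/app/agentmark
```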

### Invalid YAML frontmatter

**Error** (from the YAML parser): `end of the stream or a document separator is expected (2:12)` — line and column vary by the issue.

**Cause:** The frontmatter contains a YAML syntax error, such as a missing space after `:`, incorrect indentation, or an unquoted string with special characters.

**Solution:**

```yaml theme={null}
# Wrong — no space after colon
model_name:gpt-4o

# Right
model_name: gpt-4o

# Wrong — lost indentation under the parent key
text_config:
model_name: gpt-4o

# Right
text_config:
  model_name: gpt-4o
```

### Unterminated TemplateDX expression

**Error**: `Unexpected end of file in expression, expected a corresponding closing brace for '{'`

**Cause:** An opening `{` or tag has no matching close.

**Solution:** Verify every `{expression}` has a closing `}` and every `<Tag>` has its closing `</Tag>`. See [TemplateDX syntax](/templatedx/syntax).

## Datasets and experiments

### Dataset file not found

Check that the `dataset:` path in your prompt's `test_settings` resolves from the prompt file's directory:

```yaml theme={null}
test_settings:
  dataset: ./datasets/test.jsonl
```

If the file exists but the path is wrong, the error surfaces as `ENOENT` from Node's `fs.createReadStream`. The fix is to correct the relative path.
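
To see which absolute path a relative `dataset:` value points at, you can resolve it against the prompt file's directory yourself. This is a minimal sketch with Node's `path` module; `resolveDatasetPath` is a hypothetical helper and the file locations are made up for illustration.

```typescript
import { posix as path } from "node:path";

// Resolve a dataset path relative to the prompt file that references it.
function resolveDatasetPath(promptFile: string, datasetPath: string): string {
  return path.resolve(path.dirname(promptFile), datasetPath);
}

console.log(
  resolveDatasetPath("/app/agentmark/greeting.prompt.mdx", "./datasets/test.jsonl"),
); // /app/agentmark/datasets/test.jsonl
```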

### Invalid JSONL

Each line must be a complete JSON object:

```jsonl theme={null}
{"input": {"name": "Alice"}, "expected_output": "Hello Alice"}
{"input": {"name": "Bob"}, "expected_output": "Hello Bob"}
```

Validate with `cat dataset.jsonl | jq -c '.'`. jq stops at the first line it cannot parse and reports the position, so fix that line and rerun until the whole file passes.
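
If you would rather see every broken line in one pass, a short script can check each line independently. The sketch below is a hypothetical validator (`findJsonlProblems` is not an AgentMark function); skipping blank lines is an assumption made here for convenience.

```typescript
interface JsonlProblem {
  line: number; // 1-based line number
  reason: string;
}

// Checks that every non-blank line parses as a JSON object.
function findJsonlProblems(text: string): JsonlProblem[] {
  const problems: JsonlProblem[] = [];
  text.split(/\r?\n/).forEach((raw, index) => {
    if (raw.trim() === "") return; // assumption: blank lines are ignored
    try {
      const parsed: unknown = JSON.parse(raw);
      if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
        problems.push({ line: index + 1, reason: "not a JSON object" });
      }
    } catch {
      problems.push({ line: index + 1, reason: "invalid JSON" });
    }
  });
  return problems;
}

const sample = [
  '{"input": {"name": "Alice"}, "expected_output": "Hello Alice"}',
  '{"input": {"name": "Bob"}, "expected_output": "Hello Bob"', // missing closing brace
  '"just a string"',
].join("\n");

console.log(findJsonlProblems(sample)); // reports lines 2 and 3
```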

## Types

### Props don't match prompt input

When `createAgentMarkClient<AgentMarkTypes>()` is typed, TS flags prop shape mismatches. The generated types reflect the prompt's `input_schema` frontmatter — regenerate after editing:

```bash theme={null}
npx agentmark generate-types --root-dir ./agentmark > agentmark.types.ts
```

## CLI

### `agentmark` command not found

**Error:** `command not found: agentmark`

**Solution:** Use `npx` (the CLI doesn't need a global install):

```bash theme={null}
npx agentmark dev
npx agentmark run-prompt path/to/prompt.prompt.mdx
```

If you want the shorter form, add a script to `package.json`:

```json theme={null}
{
  "scripts": {
    "dev": "agentmark dev"
  }
}
```

### Port already in use

**Error:** `EADDRINUSE: address already in use :::9418`

**Solution:** Use a different port, or kill the existing process:

```bash theme={null}
npx agentmark dev --api-port 9500

# or find and kill the holder
lsof -i :9418
kill -9 <PID>
```

### Disable CLI update banner

Set `AGENTMARK_NO_UPDATE_NOTIFIER=1` in your environment to suppress the version-upgrade banner.

## Still having issues?

1. Ensure your packages are current:

   ```bash theme={null}
   npm update @agentmark-ai/cli @agentmark-ai/ai-sdk-v5-adapter
   ```

2. Open an issue on [GitHub](https://github.com/agentmark-ai/agentmark/issues) with the exact error string, your SDK version, and a minimal reproduction.

<div>
  <h3>Have Questions?</h3>
  <p>We're here to help! Choose the best way to reach us:</p>

  <ul>
    <li>
      Email us at <a href="mailto:hello@agentmark.co">hello@agentmark.co</a> for support
    </li>

    <li>
      Schedule an <a href="https://cal.com/ryan-randall/enterprise">Enterprise Demo</a> to learn about our business solutions
    </li>
  </ul>
</div>


# TypeScript client setup
Source: https://docs.agentmark.co/sdk-reference/typescript/client-setup

Install and configure the AgentMark TypeScript SDK with your preferred adapter

The AgentMark client is configured in `agentmark.client.ts`. It connects your prompts to AI models, tools, evaluations, and prompt loading. This file is auto-generated by `npm create agentmark@latest` — you can customize it after setup.

## Choose your adapter

<Tabs>
  <Tab title="AI SDK (Vercel)">
    ```bash theme={null}
    npm install @agentmark-ai/ai-sdk-v5-adapter @agentmark-ai/loader-api @ai-sdk/openai
    ```

    ```typescript agentmark.client.ts theme={null}
    import {
      createAgentMarkClient,
      VercelAIModelRegistry,
    } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { ApiLoader } from "@agentmark-ai/loader-api";
    import { openai } from "@ai-sdk/openai";

    const loader =
      process.env.NODE_ENV === "development"
        ? ApiLoader.local({
            baseUrl: process.env.AGENTMARK_BASE_URL || "http://localhost:9418",
          })
        : ApiLoader.cloud({
            apiKey: process.env.AGENTMARK_API_KEY!,
            appId: process.env.AGENTMARK_APP_ID!,
          });

    const modelRegistry = new VercelAIModelRegistry();
    modelRegistry.registerProviders({ openai });
    // Or, register individual models:
    // modelRegistry.registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name));

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
    });
    ```

    Use with Vercel AI SDK functions: `generateText()`, `generateObject()`, `streamText()`, `streamObject()`.
  </Tab>

  <Tab title="Claude Agent SDK">
    ```bash theme={null}
    npm install @agentmark-ai/claude-agent-sdk-v0-adapter @agentmark-ai/loader-api
    ```

    ```typescript agentmark.client.ts theme={null}
    import {
      createAgentMarkClient,
      ClaudeAgentModelRegistry,
    } from "@agentmark-ai/claude-agent-sdk-v0-adapter";
    import { ApiLoader } from "@agentmark-ai/loader-api";

    const loader =
      process.env.NODE_ENV === "development"
        ? ApiLoader.local({
            baseUrl: process.env.AGENTMARK_BASE_URL || "http://localhost:9418",
          })
        : ApiLoader.cloud({
            apiKey: process.env.AGENTMARK_API_KEY!,
            appId: process.env.AGENTMARK_APP_ID!,
          });

    const modelRegistry = ClaudeAgentModelRegistry.createDefault();

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
      adapterOptions: {
        permissionMode: "bypassPermissions",
        maxTurns: 20,
      },
    });
    ```

    **Adapter options:**

    | Option            | Description                                                      |
    | ----------------- | ---------------------------------------------------------------- |
    | `permissionMode`  | `'default'`, `'acceptEdits'`, `'bypassPermissions'`, or `'plan'` |
    | `maxTurns`        | Maximum number of agent turns                                    |
    | `maxBudgetUsd`    | Spending limit per run                                           |
    | `cwd`             | Working directory for the agent                                  |
    | `allowedTools`    | Whitelist of tool names                                          |
    | `disallowedTools` | Blacklist of tool names                                          |
  </Tab>

  <Tab title="Mastra">
    ```bash theme={null}
    npm install @agentmark-ai/mastra-v0-adapter @agentmark-ai/loader-api @ai-sdk/openai
    ```

    ```typescript agentmark.client.ts theme={null}
    import {
      createAgentMarkClient,
      MastraModelRegistry,
    } from "@agentmark-ai/mastra-v0-adapter";
    import { ApiLoader } from "@agentmark-ai/loader-api";
    import { openai } from "@ai-sdk/openai";

    const loader =
      process.env.NODE_ENV === "development"
        ? ApiLoader.local({
            baseUrl: process.env.AGENTMARK_BASE_URL || "http://localhost:9418",
          })
        : ApiLoader.cloud({
            apiKey: process.env.AGENTMARK_API_KEY!,
            appId: process.env.AGENTMARK_APP_ID!,
          });

    const modelRegistry = new MastraModelRegistry()
      .registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name));

    export const client = createAgentMarkClient({
      loader,
      modelRegistry,
    });
    ```
  </Tab>
</Tabs>

## Model registry

`VercelAIModelRegistry` supports two registration styles. Registering **providers** is the pattern the scaffolder generates: model IDs written in prompt frontmatter as `"<provider>/<model>"` (e.g. `"openai/gpt-4o"`) resolve automatically:

```typescript theme={null}
const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai, anthropic });
```

`registerModels(pattern, creator)` is the alternative when you need per-model configuration. `pattern` is `string | RegExp | Array<string>` — an `Array<RegExp>` is **not** a valid overload:

```typescript theme={null}
const modelRegistry = new VercelAIModelRegistry();

modelRegistry.registerModels(["gpt-4o", "gpt-4o-mini"], (name) => openai(name));
modelRegistry.registerModels(/^claude-/, (name) => anthropic(name));  // single regex, not wrapped
modelRegistry.registerModels(["dall-e-3"], (name) => openai.image(name));
modelRegistry.registerModels(["tts-1-hd"], (name) => openai.speech(name));
```

Model names must match the `model_name` in your prompt frontmatter.
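
To make the `"<provider>/<model>"` convention concrete, the sketch below splits an ID at the first slash. `splitModelId` is a hypothetical function written for illustration; it is not the registry's actual resolution code, which may handle edge cases differently.

```typescript
// Hypothetical parser: splits "provider/model" at the first slash.
// IDs without a usable slash are treated as bare model names.
function splitModelId(id: string): { provider: string | null; model: string } {
  const slash = id.indexOf("/");
  if (slash <= 0 || slash === id.length - 1) {
    return { provider: null, model: id };
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}

console.log(splitModelId("openai/gpt-4o")); // provider "openai", model "gpt-4o"
console.log(splitModelId("gpt-4o")); // provider null, model "gpt-4o"
```

Under this reading, a bare name such as `gpt-4o` has no provider to resolve against, which is why it needs an explicit `registerModels` entry.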

## Prompt loading

The loader determines how prompts are fetched at runtime:

```typescript theme={null}
// Local: loads from the dev server (development)
const localLoader = ApiLoader.local({
  baseUrl: "http://localhost:9418",
});

// Cloud: loads from the AgentMark HTTP API (production)
const cloudLoader = ApiLoader.cloud({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
});
```

Loaded prompts are cached in memory within the process, so repeated loads of the same path make no extra network round trips.

## Adding tools

Pass tools to the client via the `tools` option. **AI SDK v5 uses `inputSchema`** (v4 used `parameters`; mixing them fails type-check with `TS2769`):

```typescript theme={null}
import { tool } from "ai";
import { z } from "zod";

export const client = createAgentMarkClient({
  loader,
  modelRegistry,
  tools: {
    calculate: tool({
      description: "Performs arithmetic calculations",
      inputSchema: z.object({
        expression: z.string(),
      }),
      execute: async ({ expression }) => {
        // Demo only: Function() executes arbitrary JavaScript. Use a dedicated math parser in production.
        return { result: Function(`"use strict"; return (${expression})`)() };
      },
    }),
  },
});
```

See [Tools and agents](/build/tools-and-agents) for more.

## Full reference

For all configuration options including evals, MCP servers, and advanced loader options, see [Client config](/configure/client-config).



# Loaders
Source: https://docs.agentmark.co/sdk-reference/typescript/loaders

Load prompts from different sources using ApiLoader and FileLoader

AgentMark provides two loader implementations for fetching prompts: `ApiLoader` for API-based loading and `FileLoader` for static file loading.

## Overview

| Loader       | Package                     | Use case                                              |
| ------------ | --------------------------- | ----------------------------------------------------- |
| `ApiLoader`  | `@agentmark-ai/loader-api`  | Cloud deployment or local development with dev server |
| `FileLoader` | `@agentmark-ai/loader-file` | Self-hosted/static deployment with pre-built prompts  |

***

## ApiLoader

The `ApiLoader` fetches prompts from the AgentMark API (Cloud) or a local development server.

### Installation

```bash theme={null}
npm install @agentmark-ai/loader-api
```

### Cloud mode (production)

Use Cloud mode when deploying to production with AgentMark Cloud:

```typescript theme={null}
import { ApiLoader } from "@agentmark-ai/loader-api";

const loader = ApiLoader.cloud({
  apiKey: process.env.AGENTMARK_API_KEY!,
  appId: process.env.AGENTMARK_APP_ID!,
  baseUrl: "https://api.agentmark.co", // optional, this is the default
});
```

**Configuration:**

| Option    | Type     | Required | Description                                        |
| --------- | -------- | -------- | -------------------------------------------------- |
| `apiKey`  | `string` | Yes      | Your AgentMark API key                             |
| `appId`   | `string` | Yes      | Your AgentMark application ID                      |
| `baseUrl` | `string` | No       | API base URL (default: `https://api.agentmark.co`) |

### Local mode (development)

Use local mode during development with the `npx agentmark dev` server:

```typescript theme={null}
import { ApiLoader } from "@agentmark-ai/loader-api";

const loader = ApiLoader.local({
  baseUrl: "http://localhost:9418",
});
```

**Configuration:**

| Option    | Type     | Required | Description                                                                                                 |
| --------- | -------- | -------- | ----------------------------------------------------------------------------------------------------------- |
| `baseUrl` | `string` | Yes      | Local dev server URL. `npx agentmark dev` binds to `9418` by default (override with the `--api-port` flag). |

### Usage with client

```typescript theme={null}
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
import { ApiLoader } from "@agentmark-ai/loader-api";
import { openai } from "@ai-sdk/openai";

// Choose loader based on environment
const loader = process.env.NODE_ENV === "production"
  ? ApiLoader.cloud({
      apiKey: process.env.AGENTMARK_API_KEY!,
      appId: process.env.AGENTMARK_APP_ID!,
    })
  : ApiLoader.local({
      baseUrl: "http://localhost:9418",
    });

const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai });

const client = createAgentMarkClient({
  loader,
  modelRegistry,
});

// Load and use prompts
const prompt = await client.loadTextPrompt("greeting.prompt.mdx");
```

### Caching

The `ApiLoader` includes built-in caching. You can customize caching behavior when loading prompts:

```typescript theme={null}
// With custom cache TTL
const cachedAst = await loader.load("prompt.prompt.mdx", "text", {
  cache: { ttl: 1000 * 60 * 5 }, // 5 minutes
});

// Disable caching
const freshAst = await loader.load("prompt.prompt.mdx", "text", {
  cache: false,
});
```
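
For readers unfamiliar with TTL caching, the sketch below shows the general idea: an entry is served until its time-to-live elapses, after which the next read misses. `TtlCache` is an illustrative class written for this page, not `ApiLoader`'s implementation; the injectable clock exists only to make the behavior easy to demonstrate.

```typescript
// Minimal in-memory TTL cache. Illustrative only; not ApiLoader's internals.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Fake clock so the example does not need to wait five minutes.
let clock = 0;
const cache = new TtlCache<string>(1000 * 60 * 5, () => clock);
cache.set("prompt.prompt.mdx", "parsed-ast");

clock = 1000 * 60 * 4;
console.log(cache.get("prompt.prompt.mdx")); // parsed-ast

clock = 1000 * 60 * 5;
console.log(cache.get("prompt.prompt.mdx")); // undefined
```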

### Loading datasets

The `ApiLoader` can also stream datasets for experiments:

```typescript theme={null}
const stream = await loader.loadDataset("my-dataset.jsonl");

const reader = stream.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(value.input, value.expected_output);
}
```

***

## FileLoader

The `FileLoader` loads pre-built prompts from JSON files generated by `npx agentmark build`. Use this for self-hosted deployments where you don't want runtime API calls.

### Installation

```bash theme={null}
npm install @agentmark-ai/loader-file
```

### Building prompts

First, compile your prompts using the CLI:

```bash theme={null}
npx agentmark build --out ./dist/agentmark
```

This creates JSON files containing pre-parsed ASTs:

```
dist/agentmark/
  manifest.json
  greeting.prompt.json
  nested/
    helper.prompt.json
```

### Usage

```typescript theme={null}
import { FileLoader } from "@agentmark-ai/loader-file";

// Point to the build output directory
const loader = new FileLoader("./dist/agentmark");
```

**Configuration:**

| Parameter  | Type     | Description                                              |
| ---------- | -------- | -------------------------------------------------------- |
| `builtDir` | `string` | Path to the directory containing built prompt JSON files |

### Path resolution

The `FileLoader` accepts prompt paths with or without extensions:

```typescript theme={null}
// All of these work:
await client.loadTextPrompt("greeting");
await client.loadTextPrompt("greeting.prompt");
await client.loadTextPrompt("greeting.prompt.mdx");
```
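
One way to think about this is that every accepted spelling maps to a single canonical name before lookup. The sketch below illustrates that idea; `normalizePromptPath` is hypothetical, and the choice of `.prompt.mdx` as the canonical form is an assumption, not a description of `FileLoader`'s internals.

```typescript
// Hypothetical normalizer: maps each accepted spelling to one canonical name.
// Assumption: the canonical form ends in ".prompt.mdx".
function normalizePromptPath(name: string): string {
  if (name.endsWith(".prompt.mdx")) return name;
  if (name.endsWith(".prompt")) return `${name}.mdx`;
  return `${name}.prompt.mdx`;
}

console.log(normalizePromptPath("greeting")); // greeting.prompt.mdx
console.log(normalizePromptPath("greeting.prompt")); // greeting.prompt.mdx
console.log(normalizePromptPath("nested/helper.prompt.mdx")); // nested/helper.prompt.mdx
```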

### Usage with client

```typescript theme={null}
import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
import { FileLoader } from "@agentmark-ai/loader-file";
import { openai } from "@ai-sdk/openai";

const loader = new FileLoader("./dist/agentmark");

const modelRegistry = new VercelAIModelRegistry();
modelRegistry.registerProviders({ openai });

const client = createAgentMarkClient({
  loader,
  modelRegistry,
});

// Load pre-built prompts
const prompt = await client.loadTextPrompt("greeting");
const input = await prompt.format({ props: { name: "Alice" } });
```

### Loading datasets

The `FileLoader` can also load dataset files (`.jsonl`):

```typescript theme={null}
const stream = await loader.loadDataset("test-data.jsonl");

const reader = stream.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(value.input, value.expected_output);
}
```

### Security

The `FileLoader` includes path traversal protection:

* Rejects absolute paths
* Validates that resolved paths stay within the base directory
* Prevents access to files outside the build directory
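
The checks listed above can be sketched as follows. `resolveInside` is a hypothetical function written to illustrate the technique, not `FileLoader`'s actual code; it uses POSIX path rules and does not account for symlinks.

```typescript
import { posix as path } from "node:path";

// Hypothetical guard: rejects absolute paths and anything that resolves
// outside baseDir. Symlinks are not considered in this sketch.
function resolveInside(baseDir: string, requested: string): string {
  if (path.isAbsolute(requested)) {
    throw new Error(`Absolute paths are not allowed: ${requested}`);
  }
  const base = path.resolve(baseDir);
  const resolved = path.resolve(base, requested);
  if (resolved !== base && !resolved.startsWith(base + "/")) {
    throw new Error(`Path escapes the base directory: ${requested}`);
  }
  return resolved;
}

console.log(resolveInside("/srv/dist/agentmark", "nested/helper.prompt.json"));
// /srv/dist/agentmark/nested/helper.prompt.json

try {
  resolveInside("/srv/dist/agentmark", "../secrets.json");
} catch (err) {
  console.log((err as Error).message); // Path escapes the base directory: ../secrets.json
}
```

Comparing against `base + "/"` rather than `base` alone keeps a sibling directory such as `/srv/dist/agentmark-old` from passing the prefix check.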

***

## Choosing a loader

| Scenario                                       | Recommended loader  |
| ---------------------------------------------- | ------------------- |
| Production with AgentMark Cloud                | `ApiLoader.cloud()` |
| Local development                              | `ApiLoader.local()` |
| Self-hosted / edge deployment                  | `FileLoader`        |
| Serverless functions (cold-start optimization) | `FileLoader`        |
| Air-gapped environments                        | `FileLoader`        |

### Trade-offs

**ApiLoader (Cloud)**

* Prompts managed in AgentMark Cloud
* Real-time updates without redeployment
* Requires network connectivity to AgentMark
* Built-in caching

**ApiLoader (local)**

* Fast development iteration
* Hot reloading with `npx agentmark dev`
* No AgentMark authentication required

**FileLoader**

* Zero network latency
* Works offline / air-gapped
* Requires rebuild for prompt changes
* Smaller bundle (no API client code)



# Type safety
Source: https://docs.agentmark.co/sdk-reference/typescript/type-safety

Generate TypeScript types from your prompt schemas for compile-time validation and autocomplete.

AgentMark provides type safety through JSON Schema definitions in your prompt files. `npx agentmark generate-types` compiles those schemas to TypeScript types you can pass to `createAgentMarkClient<T>()` for compile-time validation and IDE autocomplete.

## Defining types

Define input and output types in your prompt files using JSON Schema:

```jsx math/addition.prompt.mdx theme={null}
---
name: math/addition
object_config:
  model_name: openai/gpt-4o
  schema:
    type: object
    properties:
      answer:
        type: number
        description: "The sum of the two numbers"
    required: [answer]
input_schema:
  type: object
  properties:
    a:
      type: number
      description: "First number to add"
    b:
      type: number
      description: "Second number to add"
  required: [a, b]
---

<System>You compute sums.</System>
<User>What is {props.a} + {props.b}?</User>
```

## Generating types

Generate TS types with the CLI:

```bash theme={null}
npx agentmark generate-types --root-dir ./agentmark > agentmark.types.ts
```

For the prompt above, the generator emits:

```typescript agentmark.types.ts theme={null}
// Auto-generated types from AgentMark
// Do not edit this file directly

interface Math$AdditionIn {
  a: number;
  b: number;
}

interface Math$AdditionOut {
  answer: number;
}

type Math$Addition = {
  kind: 'object';
  input:  Math$AdditionIn;
  output: Math$AdditionOut;
};

export default interface AgentmarkTypes {
  "math/addition.prompt.mdx": Math$Addition,
  "math/addition.prompt": Math$Addition,
  "math/addition": Math$Addition
}
```

Note two things:

1. The wrapper (`Math$Addition`) is a `type` alias, while `Math$AdditionIn` / `Math$AdditionOut` are `interface`s.
2. Each prompt gets **three** key aliases — the full path with `.prompt.mdx`, the path ending in `.prompt`, and the bare name. Any of the three resolves to the same type when you call `loadTextPrompt` / `loadObjectPrompt`.
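
To show how a keyed type map like this can constrain callers, the sketch below redeclares the generated shapes locally and uses a generic function whose `props` type depends on the key. `describeCall` is a hypothetical helper for illustration; it is not how the AgentMark client is implemented.

```typescript
// Local stand-ins shaped like the generated output above.
interface Math$AdditionIn {
  a: number;
  b: number;
}
interface Math$AdditionOut {
  answer: number;
}
type Math$Addition = {
  kind: "object";
  input: Math$AdditionIn;
  output: Math$AdditionOut;
};

interface PromptTypes {
  "math/addition.prompt.mdx": Math$Addition;
  "math/addition.prompt": Math$Addition;
  "math/addition": Math$Addition;
}

// Hypothetical helper: the key selects the entry, the entry constrains props.
function describeCall<K extends keyof PromptTypes>(
  key: K,
  props: PromptTypes[K]["input"],
): string {
  return `${key} <- ${JSON.stringify(props)}`;
}

console.log(describeCall("math/addition", { a: 5, b: 3 }));
// math/addition <- {"a":5,"b":3}

// Both of these fail to compile:
// describeCall("math/addition", { a: "5", b: 3 });   // "5" is not a number
// describeCall("math/subtraction", { a: 5, b: 3 });  // unknown key
```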

## Using generated types

<Tabs>
  <Tab title="AI SDK v5">
    ```tsx theme={null}
    import PromptTypes from './agentmark.types';
    import { createAgentMarkClient, VercelAIModelRegistry } from "@agentmark-ai/ai-sdk-v5-adapter";
    import { ApiLoader } from "@agentmark-ai/loader-api";
    import { openai } from "@ai-sdk/openai";
    import { generateObject, tool } from "ai";
    import { z } from "zod";

    const loader = ApiLoader.local({ baseUrl: "http://localhost:9418" });

    const modelRegistry = new VercelAIModelRegistry();
    modelRegistry.registerProviders({ openai });

    // AI SDK v5 uses `inputSchema`, not `parameters` (that's v4).
    const sumTool = tool({
      description: "Add two numbers together",
      inputSchema: z.object({
        a: z.number().describe("First number"),
        b: z.number().describe("Second number"),
      }),
      execute: async ({ a, b }) => ({ answer: a + b }),
    });

    const client = createAgentMarkClient<PromptTypes>({
      loader,
      modelRegistry,
      tools: { sum: sumTool },
    });

    // TypeScript enforces correct types
    const prompt = await client.loadObjectPrompt("math/addition.prompt.mdx");
    const input = await prompt.format({
      props: {
        a: 5,   // must be number per input_schema
        b: 3,
      },
    });

    const result = await generateObject(input);
    const answer: number = result.object.answer;  // type-safe
    ```
  </Tab>

  <Tab title="Pydantic AI (Python)">
    ```python theme={null}
    from agentmark_pydantic_ai_v0 import (
        create_pydantic_ai_client,
        PydanticAIModelRegistry,
        run_object_prompt,
    )
    from agentmark.prompt_core import FileLoader

    loader = FileLoader("./dist/agentmark")

    model_registry = PydanticAIModelRegistry()
    model_registry.register_models(
        ["gpt-4o"],
        lambda name, opts=None: f"openai:{name}",
    )

    client = create_pydantic_ai_client(
        model_registry=model_registry,
        loader=loader,
    )

    # Python type safety comes from Pydantic models
    # generated from the prompt's JSON Schema.
    prompt = await client.load_object_prompt("math/addition.prompt.mdx")
    params = await prompt.format(props={"a": 5, "b": 3})

    result = await run_object_prompt(params)
    print(result.output.answer)
    ```

    <Note>
      Python type safety is provided by the Pydantic models the adapter auto-generates from each prompt's JSON Schema. There's no separate type-stubs package to install.
    </Note>
  </Tab>
</Tabs>

## Benefits

1. **Compile-time safety** — TypeScript flags prop-shape mismatches before runtime.
2. **IDE support** — autocomplete and inline descriptions on `props` and outputs.
3. **Consistent interfaces** — the same `PromptTypes` drives both the loader and the caller.
4. **Documentation** — JSON Schema descriptions flow through to TS JSDoc comments.
5. **Validation** — the adapter validates inputs and structured outputs against the schema at runtime.

## Best practices

1. Define both `input_schema` and `object_config.schema` (when the prompt is an object prompt) in your prompt files.
2. Use descriptive property names and add `description` fields — they become JSDoc comments.
3. Mark required properties using `required`.
4. Regenerate types after any schema edit: `npx agentmark generate-types --root-dir ./agentmark > agentmark.types.ts`.
5. Commit `agentmark.types.ts` to version control so CI type-checks the contract.



# Components
Source: https://docs.agentmark.co/templatedx/components

Create reusable template fragments in TemplateDX for shared prompt logic and structure.

Components in TemplateDX look like JSX but behave differently at runtime: they are **inlined at bundle time**, not invoked as a render function. When you `import Blog from './blog.mdx'` and use `<Blog>`, TemplateDX's bundler substitutes the imported file's AST directly into the parent at parse time — there is no React render cycle, no hooks, no event handlers. Just template composition.

## Constraints

* **Only default imports are supported.** `import { Thing } from './file.mdx'` throws at bundle time — named imports are explicitly rejected per `bundler.ts:185-196`.
* **Imported files must be `.mdx`.** The bundler recognizes `.mdx` specifically; other extensions aren't inlined.
* **Use `{props.*}` for data, `{props.children}` for body content.** Both are populated by the bundler from the parent's JSX attributes and child nodes.
* **No React runtime.** Hooks, event handlers, refs — none of these do anything inside a TemplateDX component. If you find yourself reaching for them, you probably want a custom [tag plugin](/templatedx/tags) instead.

## Example

Given `blog.mdx`:

```mdx blog.mdx theme={null}
# {props.title}

{props.children}
```

And the parent template:

```mdx index.mdx theme={null}
import Blog from './blog.mdx';

# Example

<Blog title="Turtles">
  Turtles are really cool...
</Blog>
```

TemplateDX renders:

```markdown theme={null}
# Example

# Turtles

Turtles are really cool...
```

The `<Blog>` tag is replaced with the contents of `blog.mdx`, with `props.title` resolved to `"Turtles"` and `props.children` resolved to the child body.

## When to use a component vs. a tag

* **Components** — for static template fragments you want to compose. Think partials: a shared `<SystemPrompt>` header, a reusable `<FewShotExamples>` block.
* **[Tags](/templatedx/tags)** — for logic that needs runtime behavior: conditional rendering (`<If>`), iteration (`<ForEach>`), raw passthrough (`<Raw>`), or custom extensions you implement as a `TagPlugin`.


# Expressions
Source: https://docs.agentmark.co/templatedx/expressions

Use JavaScript-like expressions in TemplateDX templates for dynamic content and conditional logic.

TemplateDX expressions run inside `{...}` braces or JSX attributes (`<If condition={...}>`). The evaluator supports a deliberately limited JavaScript subset — literals, property access, a fixed set of operators, and calls to registered filters.

## Literals

| Literal  | Example                            |
| -------- | ---------------------------------- |
| Strings  | `"How are you?"`, `'How are you?'` |
| Numbers  | `40`, `30.123`                     |
| Arrays   | `[1, 2, "array"]`                  |
| Objects  | `{ one: 1, two: 2 }`               |
| Booleans | `true`, `false`                    |

All five forms render in `{}` and can be used anywhere an expression is valid:

```jsx theme={null}
{[1, 2, 3].length}          {/* renders 3 */}
{{ name: "Alice" }.name}    {/* renders Alice */}
```

## Property access

Read nested properties with dot notation or bracket syntax:

```jsx theme={null}
{props.user.name}
{props['user-email']}
{items[0].title}
```

Accessing a nested key that does not exist (a `MemberExpression` in the evaluator) renders as an empty string; see [Variables](/templatedx/variables) for details.

## Operators

### Arithmetic

```jsx theme={null}
{ 2 + 3 }     {/* 5  */}
{ 10 - 4 }    {/* 6  */}
{ 10 / 2 }    {/* 5  */}
{ 10 % 3 }    {/* 1  */}
{ 5 * 4 }     {/* 20 */}
```

`+` also concatenates strings: `{"Hi " + props.name}`.

### Comparisons

| Operator | Description           |
| -------- | --------------------- |
| `==`     | Loose equality        |
| `!=`     | Loose inequality      |
| `>`      | Greater than          |
| `>=`     | Greater than or equal |
| `<`      | Less than             |
| `<=`     | Less than or equal    |

<Warning>
  **Strict equality (`===` / `!==`) is not supported.** Only the loose operators above are in the evaluator's operator table. Use `==` / `!=`.
</Warning>

```jsx theme={null}
<If condition={props.numUsers > 10}>
  Content for more than 10 users
</If>

<If condition={props.score >= 75}>
  Content for scores 75 and above
</If>
```

### Logical

```jsx theme={null}
<If condition={props.isActive && props.hasAccess}>
  Content for active users with access
</If>

<If condition={props.isAdmin || props.isModerator}>
  Content for admins or moderators
</If>

<If condition={!props.isBanned}>
  Content for users who are not banned
</If>
```

## Filter calls

The only function calls allowed inside expressions are calls to registered [filters](/templatedx/filters):

```jsx theme={null}
{ upper(props.name) }          {/* ALICE */}
{ truncate(props.bio, 100) }   {/* first 100 chars + "..." */}
```

<Warning>
  **Method calls (`props.name.toUpperCase()`) throw at render time** — the evaluator's CallExpression handler explicitly rejects anything that isn't a registered filter. Use the `upper` filter instead.
</Warning>
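
The rule "only registered filters may be called" can be pictured as a lookup that throws on unknown names. The following is a toy illustration under that assumption, not the evaluator's actual code: `toyFilters` and `callFilter` are hypothetical names, and TemplateDX's real registry is `FilterRegistry`.

```typescript
type Filter = (input: unknown, ...args: unknown[]) => unknown;

// Hypothetical registry for illustration only.
const toyFilters: Record<string, Filter> = {
  upper: (input) => String(input).toUpperCase(),
};

// Only names present in the registry may be called; anything else throws.
function callFilter(filterName: string, input: unknown, ...args: unknown[]): unknown {
  if (!Object.prototype.hasOwnProperty.call(toyFilters, filterName)) {
    throw new Error(`"${filterName}" is not a registered filter`);
  }
  return toyFilters[filterName](input, ...args);
}

console.log(callFilter("upper", "alice")); // ALICE

try {
  callFilter("toUpperCase", "alice");
} catch (err) {
  console.log((err as Error).message); // "toUpperCase" is not a registered filter
}
```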

## Example usage

```jsx theme={null}
<If condition={(props.age >= 18 && props.isMember) || props.hasGuestPass}>
  Welcome to the event!
</If>
<ElseIf condition={props.age >= 18 && !props.isMember}>
  Please consider becoming a member to enjoy full benefits.
</ElseIf>
<Else>
  Sorry, you must be at least 18 years old to attend.
</Else>
```

This branches on three conditions: adult members or guest-pass holders get a welcome; adult non-members get a prompt to join; everyone else is told they can't attend.
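
Because the chain is first-match-wins, it can help to check the conditions as an ordinary function before putting them in a template. This sketch mirrors the three conditions above in TypeScript; `admission` and its return labels are hypothetical names chosen for the illustration.

```typescript
type Admission = "welcome" | "join" | "denied";

// Mirrors the <If> / <ElseIf> / <Else> chain: the first matching branch wins.
function admission(age: number, isMember: boolean, hasGuestPass: boolean): Admission {
  if ((age >= 18 && isMember) || hasGuestPass) return "welcome";
  if (age >= 18 && !isMember) return "join";
  return "denied";
}

console.log(admission(30, true, false));  // welcome
console.log(admission(30, false, false)); // join
console.log(admission(16, false, false)); // denied
console.log(admission(16, false, true));  // welcome
```

The last call shows a detail that is easy to miss in the template: a guest pass is accepted regardless of age, because `hasGuestPass` sits outside the parenthesized age check.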


# FAQ
Source: https://docs.agentmark.co/templatedx/faq

Frequently asked questions about TemplateDX syntax, setup, and usage.

### Is TemplateDX compliant with CommonMark?

Yes — TemplateDX supports a superset of CommonMark. On top of CommonMark, TemplateDX adds [tags](/templatedx/tags), [filters](/templatedx/filters), [variables](/templatedx/variables), and [components](/templatedx/components).

**Note:** TemplateDX's parser does **not** enable GitHub-flavored Markdown extensions — it uses `remarkParse + remarkMdx + remarkFrontmatter`, not `remark-gfm`. GFM-only features (strikethrough, task lists, GFM tables with cell alignment) are parsed as plain text, not as structured AST nodes.

### What's the difference between MDX and TemplateDX?

MDX is a **document format** — Markdown that can embed JSX — that compiles to JavaScript. TemplateDX is a **prompt-rendering engine** that shares the surface syntax but parses to an AST, evaluates expressions and tag plugins against a `props` object, and serializes back to Markdown. MDX targets JSX runtimes; TemplateDX targets LLM prompts.

### Can you support output formats other than Markdown?

We're focused on Markdown today. The `stringify` step is pluggable in principle, so emitting HTML or other formats is a reasonable extension — PRs welcome at [github.com/agentmark-ai/agentmark](https://github.com/agentmark-ai/agentmark).

### How does TemplateDX relate to AgentMark?

TemplateDX is the templating engine underneath [AgentMark](https://github.com/agentmark-ai/agentmark). AgentMark uses TemplateDX for prompt rendering and builds on its tag plugin ecosystem.

### What languages are supported?

* **TypeScript / JavaScript**: full `parse` / `transform` / `stringify` pipeline via `@agentmark-ai/templatedx`.
* **Python**: `transform` + `stringify` via `agentmark-templatedx`. Python consumes ASTs produced by the TS parser; parsing itself is currently available only in TypeScript.


# Filters
Source: https://docs.agentmark.co/templatedx/filters

Transform and format template values with built-in TemplateDX filter functions.

TemplateDX provides a set of built-in filters that you can use to manipulate and transform data within your templates. Filters are functions that take an input value and return a transformed output.

## Built-in Filters

### abs

The `abs` filter returns the absolute value of a number.

**Syntax**

```tsx theme={null}
abs(number_value)
```

**Parameters**

* `number_value` (number): The input number.

**Example**

```tsx theme={null}
abs(-42)
```

**Output:**

```
42
```

### capitalize

The `capitalize` filter capitalizes the first character of a string.

**Syntax**

```tsx theme={null}
capitalize(string_value)
```

**Parameters**

* `string_value` (string): The input string to be capitalized.

**Example**

```tsx theme={null}
capitalize("hello world")
```

**Output:**

```
Hello world
```
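
The Syntax Overview page notes that `capitalize` only changes the first character and does not lowercase the rest. A minimal sketch of that documented behavior, using a hypothetical `capitalizeSketch` function rather than the built-in's source:

```typescript
// Sketch of the documented behavior: uppercase the first character, keep the rest as-is.
function capitalizeSketch(value: string): string {
  return value.charAt(0).toUpperCase() + value.slice(1);
}

console.log(capitalizeSketch("hello world")); // Hello world
console.log(capitalizeSketch("hELLO"));       // HELLO
```

To get sentence case from mixed-case input, apply `lower` first and then `capitalize`.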

### dump

The `dump` filter serializes a JavaScript object into a JSON string.

**Syntax**

```tsx theme={null}
dump(object_value)
```

**Parameters**

* `object_value` (any): The input object to be serialized.

**Example**

```tsx theme={null}
dump({ name: "TemplateDX", version: "1.0" })
```

**Output:**

```
{"name":"TemplateDX","version":"1.0"}
```

### join

The `join` filter joins elements of an array into a single string, separated by a specified separator.

**Syntax**

```tsx theme={null}
join(array_value, separator)
```

**Parameters**

* `array_value` (any\[]): The input array.
* `separator` (string, optional): The string to separate the array elements. Defaults to `", "`.

**Example**

```tsx theme={null}
join(["apple", "banana", "cherry"], ", ")
```

**Output:**

```
apple, banana, cherry
```

### lower

The `lower` filter converts a string to lowercase letters.

**Syntax**

```tsx theme={null}
lower(string_value)
```

**Parameters**

* `string_value` (string): The input string to be converted to lowercase.

**Example**

```tsx theme={null}
lower("HELLO WORLD")
```

**Output:**

```
hello world
```

### replace

The `replace` filter replaces all occurrences of a specified substring with a new substring.

**Syntax**

```tsx theme={null}
replace(string_value, search, replace)
```

**Parameters**

* `string_value` (string): The input string.
* `search` (string): The substring to search for.
* `replace` (string): The substring to replace with.

**Example**

```tsx theme={null}
replace("Hello World", "World", "TemplateDX")
```

**Output:**

```
Hello TemplateDX
```

### round

The `round` filter rounds a number to a specified number of decimal places.

**Syntax**

```tsx theme={null}
round(number_value, decimals)
```

**Parameters**

* `number_value` (number): The input number to be rounded.
* `decimals` (number, optional): The number of decimal places to round to. Defaults to `0`.

**Example**

```tsx theme={null}
round(3.14159, 2)
```

**Output:**

```
3.14
```
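
One common way to round to a fixed number of decimals is to scale, round, and scale back. The sketch below shows that technique with a hypothetical `roundSketch` function. It assumes `Math.round` semantics; this page does not say how the built-in handles ties or floating-point edge cases.

```typescript
// Sketch: scale by 10^decimals, round to the nearest integer, then scale back.
function roundSketch(value: number, decimals: number = 0): number {
  const factor = Math.pow(10, decimals);
  return Math.round(value * factor) / factor;
}

console.log(roundSketch(3.14159, 2)); // 3.14
console.log(roundSketch(3.14159));    // 3
```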

### truncate

The `truncate` filter truncates a string to a specified length and appends an ellipsis (`...`) if necessary.

**Syntax**

```tsx theme={null}
truncate(string_value, length)
```

**Parameters**

* `string_value` (string): The input string to be truncated.
* `length` (number): The number of characters to keep before `...` is appended.

**Example**

```tsx theme={null}
truncate("The quick brown fox jumps over the lazy dog", 20)
```

**Output:**

```
The quick brown fox ...
```

`truncate` takes the first `length` characters and appends `...`, so the output is `length + 3` characters total.
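
The length arithmetic can be checked with a small sketch. `truncateSketch` is a hypothetical stand-in for the built-in. Returning short inputs unchanged is an assumption based on the "if necessary" wording in the description above.

```typescript
// Sketch of the documented behavior; the short-input branch is an assumption.
function truncateSketch(value: string, length: number): string {
  if (value.length <= length) return value; // assumed: nothing to cut, no ellipsis
  return value.slice(0, length) + "...";
}

const out = truncateSketch("The quick brown fox jumps over the lazy dog", 20);
console.log(out);        // The quick brown fox ...
console.log(out.length); // 23
```

If a downstream limit is a hard cap on total characters, pass `limit - 3` as `length` so that the ellipsis fits inside the budget.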

### upper

The `upper` filter converts a string to uppercase letters.

**Syntax**

```tsx theme={null}
upper(string_value)
```

**Parameters**

* `string_value` (string): The input string to be converted to uppercase.

**Example**

```tsx theme={null}
upper("hello world")
```

**Output:**

```
HELLO WORLD
```

### urlencode

The `urlencode` filter encodes a string to be safe for use in URLs.

**Syntax**

```tsx theme={null}
urlencode(string_value)
```

**Parameters**

* `string_value` (string): The input string to be URL-encoded.

**Example**

```tsx theme={null}
urlencode("Hello World!")
```

**Output:**

```
Hello%20World!
```

`urlencode` uses `encodeURIComponent`, which leaves `! * ' ( )` unencoded by spec.
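
If a target system needs those five characters encoded as well, one option is a custom filter that post-processes the `encodeURIComponent` result. The sketch below shows the well-known pattern in plain TypeScript; `strictUrlEncode` is a hypothetical name, and registering it as a filter is left to the `FilterRegistry.register` call described under Creating Custom Filters.

```typescript
// encodeURIComponent leaves ! * ' ( ) untouched; percent-encode them as well.
function strictUrlEncode(value: string): string {
  return encodeURIComponent(value).replace(
    /[!'()*]/g,
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
}

console.log(encodeURIComponent("Hello World!")); // Hello%20World!
console.log(strictUrlEncode("Hello World!"));    // Hello%20World%21
console.log(strictUrlEncode("a&b=(c)"));         // a%26b%3D%28c%29
```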

## Creating Custom Filters

You can create custom filters by registering them with the `FilterRegistry`.

### FilterRegistry.register

Register a custom filter function using the static `register` method:

```typescript theme={null}
import { FilterRegistry } from '@agentmark-ai/templatedx';

FilterRegistry.register(name, filterFunction);
```

**Parameters**

* `name` (string): The name used to call the filter in templates.
* `filterFunction` (FilterFunction): The function that performs the transformation.

### FilterFunction Type

The `FilterFunction` type signature is:

```typescript theme={null}
type FilterFunction<
  Input = any,
  Output = any,
  Args extends any[] = any[]
> = (input: Input, ...args: Args) => Output;
```

* `input` - The first argument is always the value being filtered.
* `...args` - Additional arguments passed to the filter.
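
The three type parameters let TypeScript check both the filtered value and the extra arguments. The sketch below redeclares the `FilterFunction` type locally so it runs on its own, and defines a hypothetical `clamp` filter that is not one of the built-ins.

```typescript
// Local copy of the FilterFunction type shown above, so this sketch stands alone.
type FilterFunction<
  Input = any,
  Output = any,
  Args extends any[] = any[]
> = (input: Input, ...args: Args) => Output;

// Hypothetical filter: clamp a number into the range [min, max].
// Args is the tuple [number, number], so min and max are both required numbers.
const clamp: FilterFunction<number, number, [number, number]> = (input, min, max) =>
  Math.min(Math.max(input, min), max);

console.log(clamp(15, 0, 10)); // 10
console.log(clamp(-3, 0, 10)); // 0
console.log(clamp(7, 0, 10));  // 7
```

In a template, the first argument of the call becomes `input`, so this filter would be written as `{clamp(props.score, 0, 10)}` once registered.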

### Example: Custom Filter

Here's an example of creating a custom `reverse` filter that reverses a string:

```typescript theme={null}
import { FilterRegistry, FilterFunction } from '@agentmark-ai/templatedx';

const reverse: FilterFunction<string, string> = (input) => {
  if (typeof input !== 'string') return input;
  return input.split('').reverse().join('');
};

FilterRegistry.register('reverse', reverse);
```

Usage in template:

```tsx theme={null}
{reverse("hello")}
```

Output:

```
olleh
```

### Example: Filter with Arguments

Filters can accept additional arguments. Here's a `pad` filter that pads a string to a specified length:

```typescript theme={null}
import { FilterRegistry, FilterFunction } from '@agentmark-ai/templatedx';

const pad: FilterFunction<string, string, [number, string?]> = (
  input,
  length,
  char = ' '
) => {
  if (typeof input !== 'string') return input;
  return input.padStart(length, char);
};

FilterRegistry.register('pad', pad);
```

Usage in template:

```tsx theme={null}
{pad("42", 5, "0")}
```

Output:

```
00042
```


# Overview
Source: https://docs.agentmark.co/templatedx/introduction

TemplateDX is an extensible templating engine for AI prompts, built on Markdown and JSX.

## What is TemplateDX?

TemplateDX is a declarative, extensible, and composable templating engine built on Markdown and JSX. It was originally developed by [AgentMark](https://github.com/agentmark-ai/agentmark) to improve the developer experience of building with large language models (LLMs).

TemplateDX looks a lot like [MDX](https://mdxjs.com), but the runtime is purpose-built for prompts — it adds its own tag plugins (`<If>`, `<ElseIf>`, `<Else>`, `<ForEach>`, `<Raw>`), a filter registry for expression evaluation, and a bundler that inlines imported components at bundle time. MDX is a document format that compiles to JSX; TemplateDX is a prompt-rendering engine that shares the surface syntax.

## Why extend Markdown?

TemplateDX extends Markdown's familiar syntax to support complex, structured content. Markdown works well for basic content but lacks the flexibility needed for templating, composable components, and organized content. TemplateDX adds custom components and templating primitives, enabling document composability, conditional rendering, and variable interpolation while preserving Markdown's readability.

## What does it look like?

A TemplateDX file is a `.mdx` document that combines Markdown, frontmatter, imports, JSX components, and tags:

```jsx theme={null}
---
name: Markdown with JSX and Frontmatter
author: Ryan
---

import SomeMarkdownComponent from './my-md-component.md';
import SomeMDXComponent from './my-mdx-component.mdx';

# Hello World

> TemplateDX uses Markdown to make it readable.

## Table

| Item              | In stock | Price |
| :---------------- | :------: | ----: |
| Python Hat        |   True   | 23.99 |
| SQL Hat           |   True   | 23.99 |
| Codecademy Tee    |  False   | 19.99 |
| Codecademy Hoodie |  False   | 42.99 |

## Components

<SomeMarkdownComponent />

<SomeMDXComponent title="Demo" />

## Tags

<If condition={props.isAwesome}>
  **TemplateDX is awesome!**
</If>
```

The `<If>` tag, `{props.isAwesome}` expression, component imports, and frontmatter are all evaluated by TemplateDX at `transform()` time — see [Syntax](/templatedx/syntax) for the full list of primitives.


# Quickstart
Source: https://docs.agentmark.co/templatedx/quickstart

Install TemplateDX and render your first prompt template.

Install TemplateDX and render a `.mdx` template against a `props` object. The snippets below use the **Node** API (`load` + `transform` + `stringify`) — if you need to bundle templates into a web app, wire up your bundler's MDX loader separately.

## Install

<Tabs>
  <Tab title="npm">
    ```bash theme={null}
    npm install @agentmark-ai/templatedx
    ```
  </Tab>

  <Tab title="yarn">
    ```bash theme={null}
    yarn add @agentmark-ai/templatedx
    ```
  </Tab>

  <Tab title="pnpm">
    ```bash theme={null}
    pnpm add @agentmark-ai/templatedx
    ```
  </Tab>
</Tabs>

## Render a template (Node)

Given `my-template.mdx`:

```mdx my-template.mdx theme={null}
Hi {props.name}, welcome to TemplateDX.
```

Run it:

```typescript theme={null}
import { load, transform, stringify } from "@agentmark-ai/templatedx";

const ast = await load("./my-template.mdx");
const rendered = await transform(ast, { name: "Jim" });
console.log(stringify(rendered));
```

Output:

```
Hi Jim, welcome to TemplateDX.
```

* `load(path)` — parses the `.mdx` file from disk into an AST.
* `transform(ast, props)` — evaluates expressions and tag plugins against `props`.
* `stringify(ast)` — serializes the rendered AST back to a Markdown/text string.

## Render a template (bundled / web)

When your `.mdx` is bundled at build time (Next.js, Vite, etc.) and the default export is an AST, you can skip `load`:

```typescript theme={null}
import { transform, stringify } from "@agentmark-ai/templatedx";
import MyTemplate from "./my-template.mdx"; // requires an MDX bundler/loader

const rendered = await transform(MyTemplate, { name: "Jim" });
console.log(stringify(rendered));
```

Bundling `.mdx` into a JS module requires an MDX bundler or loader appropriate to your framework — TemplateDX does not ship one.

## Next steps

<CardGroup>
  <Card title="Syntax" icon="code" href="/templatedx/syntax">
    Tags, expressions, components, and XML passthrough
  </Card>

  <Card title="Variables" icon="brackets-curly" href="/templatedx/variables">
    Props and scope resolution
  </Card>

  <Card title="Custom tags" icon="puzzle-piece" href="/templatedx/tags">
    Write your own `<Tag>` plugins
  </Card>

  <Card title="Custom filters" icon="filter" href="/templatedx/filters">
    Extend expression evaluation with filters
  </Card>
</CardGroup>


# Syntax Overview
Source: https://docs.agentmark.co/templatedx/syntax

Complete syntax reference for TemplateDX.

TemplateDX combines Markdown and JSX into templates for LLM prompts; type safety comes from AgentMark's generated types (see [Type safety](#type-safety)). This page gives an overview of the syntax.

## Basic Structure

Every TemplateDX template is a `.mdx` file with optional frontmatter and a mix of Markdown and JSX:

```jsx theme={null}
---
title: My Prompt Template
description: A brief description
---

# Your prompt content here

You are a helpful assistant specializing in {props.domain}.

<If condition={props.showInstructions}>
## Instructions
{props.instructions}
</If>
```

## Frontmatter

Frontmatter is optional metadata at the top of your file:

```yaml theme={null}
---
title: Customer Support Prompt
description: Template for customer support interactions
version: 1.0
---
```

## Variables

Access dynamic data using curly braces:

```jsx theme={null}
{props.userName}           // Simple variable
{props.user.email}         // Nested property
{props.items[0]}           // Array access
```

A missing nested property renders as an empty string (there's no error to guard against, so `?.` and `.` behave identically).

[Learn more about variables →](/templatedx/variables)

## Expressions

Evaluate expressions inline — arithmetic, comparison, and logical operators, property access, and **registered filter calls**:

```jsx theme={null}
{props.score * 2}
{props.score >= 90}
{props.items.length}
{upper(props.name)}
{join(props.tags, ", ")}
```

JavaScript method calls (`props.name.toUpperCase()`) and the ternary operator (`a ? b : c`) are **not** supported — use a [filter](/templatedx/filters) or an [`<If>`](#control-flow) tag instead.

[Learn more about expressions →](/templatedx/expressions)

## Control Flow

### Conditionals

Use `<If>`, `<ElseIf>`, and `<Else>` tags:

```jsx theme={null}
<If condition={props.userType == "premium"}>
  You have access to premium features.
</If>
<ElseIf condition={props.userType == "standard"}>
  You have access to standard features.
</ElseIf>
<Else>
  You have access to basic features.
</Else>
```

### Loops

Use `<ForEach>` to iterate over arrays:

```jsx theme={null}
<ForEach arr={props.items}>
  {(item, index) => (
    <>
      {index + 1}. {item.name} - {item.description}
    </>
  )}
</ForEach>
```
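
For readers who think in JavaScript, the loop above is roughly analogous to `Array.prototype.map` with an `(item, index)` callback. The sketch below shows that analogy in TypeScript. `renderItems` and the sample data are hypothetical, and the sketch does not model how TemplateDX handles whitespace or Markdown structure in the real output.

```typescript
interface Item {
  name: string;
  description: string;
}

// Roughly what the callback produces for each element: one formatted line per item.
function renderItems(items: Item[]): string[] {
  return items.map((item, index) => `${index + 1}. ${item.name} - ${item.description}`);
}

const lines = renderItems([
  { name: "Hat", description: "Keeps the sun off" },
  { name: "Tee", description: "Plain cotton" },
]);
console.log(lines.join("\n"));
// 1. Hat - Keeps the sun off
// 2. Tee - Plain cotton
```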

[Learn more about control flow →](/templatedx/tags)

## Filters

Transform data with built-in filters. TemplateDX ships 10 filters out of the box: `abs`, `capitalize`, `dump`, `join`, `lower`, `replace`, `round`, `truncate`, `upper`, `urlencode`.

```jsx theme={null}
{upper(props.status)}              {/* ACTIVE */}
{lower(props.email)}               {/* user@example.com */}
{capitalize(props.name)}           {/* Note: only capitalizes first char — doesn't lowercase the rest */}
{truncate(props.content, 100)}     {/* First 100 chars + "..." */}
{join(props.tags, ", ")}           {/* tag1, tag2, tag3 */}
```

[Learn more about filters →](/templatedx/filters)

## Components

Create reusable template components:

```jsx theme={null}
import SystemRole from './system-role.mdx';
import Examples from './examples.mdx';

<SystemRole role="expert" domain={props.domain} />

## Task

{props.taskDescription}

<Examples data={props.examples} />
```

[Learn more about components →](/templatedx/components)

## Raw output

Use `<Raw>` to emit the enclosed content as literal source text — expressions inside `<Raw>` are **not** evaluated. The plugin re-serializes its children through Markdown, so you get exactly what you wrote.

```jsx theme={null}
<Raw>
  {props.variableName}
</Raw>
```

Output: `{props.variableName}` (literal text, not the value of `props.variableName`).

## XML tags

Lowercase XML tags are preserved as-is in the output, making them ideal for prompt engineering patterns like `<examples>`, `<context>`, and `<instructions>`. A fixed allow-list of HTML tags (per `supported-tags.ts`) also passes through unchanged. Expressions inside these tags are evaluated normally.

```jsx theme={null}
<User>
<examples>
<example>What is 2+2? The answer is 4.</example>
<example>What is 3+3? The answer is 6.</example>
</examples>

Now answer: What is {props.a}+{props.b}?
</User>
```

Output:

```
<examples>
<example>What is 2+2? The answer is 4.</example>
<example>What is 3+3? The answer is 6.</example>
</examples>

Now answer: What is 5+5?
```

XML tags support attributes and nesting:

```jsx theme={null}
<User>
<context type="system">You are an expert assistant.</context>
<instructions>
<rule>Be concise</rule>
<rule>Cite sources</rule>
</instructions>
</User>
```

<Note>
  Only **lowercase** tags are treated as XML passthrough. PascalCase tags like `<User>`, `<System>`, and `<ForEach>` are reserved for TemplateDX built-in tags and components. Variables and expressions inside XML tags are still evaluated normally.
</Note>

## Comments

Use JSX-style comments:

```jsx theme={null}
{/* This is a comment */}

{/**
  * Multi-line comment
  * for documentation
  */}
```

## Fragments

Use fragments to group elements without adding markup:

```jsx theme={null}
<>
  First line
  Second line
</>
```

## Markdown Support

TemplateDX supports all standard Markdown:

```markdown theme={null}
# Heading 1
## Heading 2

**Bold text**
*Italic text*

- Bullet list
- Item 2

1. Numbered list
2. Item 2

[Link text](https://example.com)

`inline code`

\`\`\`javascript
// Code block
const x = 10;
\`\`\`
```

## Whitespace

TemplateDX preserves whitespace in your templates:

```jsx theme={null}
Line 1
Line 2

Paragraph with blank line above
```

## Escaping literal braces

To emit literal `{` or `}` in output, wrap the content in `<Raw>`:

```jsx theme={null}
<Raw>{`{literal-braces}`}</Raw>
```

## Type safety

See [Type safety](/sdk-reference/typescript/type-safety) for the canonical flow — `npx agentmark generate-types` emits `AgentmarkTypes` from your prompt schemas. That's the codegen pipeline AgentMark actually uses at runtime.

TemplateDX itself doesn't evaluate JSDoc or TypeScript types; comments are stripped at bundle time (`bundler.ts:150` `removeComments`).

## Best practices

1. **Use descriptive variable names** — `props.customerName` instead of `props.n`.
2. **Keep templates modular** — break large templates into components.
3. **Generate types** — run `npx agentmark generate-types` and pass the result to `createAgentMarkClient<AgentmarkTypes>()` for compile-time prop validation.
4. **Use conditionals wisely** — make prompts adapt to context.
5. **Leverage filters** — transform data at the template level instead of reshaping it in your application code.

## Complete Example

Here's a full example combining multiple features:

```jsx theme={null}
---
title: Product Review Analysis
---

{/**
  * @typedef Props
  * @property {string} productName
  * @property {Array<{author: string, rating: number, comment: string}>} reviews
  * @property {string} analysisType
  */}

# Product Review Analysis for {props.productName}

You are an expert product analyst. Analyze the following customer reviews and provide insights.

## Reviews ({props.reviews.length} total)

<ForEach arr={props.reviews}>
  {(review, index) => (
    <>
      ### Review {index + 1}
      **Rating**: {review.rating}/5
      **Author**: {capitalize(review.author)}

      "{truncate(review.comment, 200)}"

      ---
    </>
  )}
</ForEach>

## Analysis Instructions

<If condition={props.analysisType == "sentiment"}>
  Focus on overall sentiment and emotional tone in the reviews.
</If>
<ElseIf condition={props.analysisType == "features"}>
  Identify the most mentioned product features and customer opinions about them.
</ElseIf>
<Else>
  Provide a comprehensive analysis covering sentiment, features, and improvement suggestions.
</Else>

Please provide your analysis in a structured format.
```

## Next Steps

* [Variables](/templatedx/variables) - Learn about variable access
* [Expressions](/templatedx/expressions) - JavaScript expressions
* [Tags](/templatedx/tags) - Control flow and special operations
* [Filters](/templatedx/filters) - Data transformation
* [Components](/templatedx/components) - Reusable templates


# Editor integration
Source: https://docs.agentmark.co/templatedx/syntax-highlighting

Set up syntax highlighting, schema validation, and MCP integration for TemplateDX files in VS Code, Cursor, Zed, and Claude Code.

TemplateDX uses `.mdx` files, so any MDX-aware editor gives you syntax highlighting for free. AgentMark builds on that with a JSON Schema for frontmatter (which enables `model_name` autocomplete) and MCP configs that let your editor's AI query AgentMark docs and traces.

## Syntax highlighting

Install an MDX extension — TemplateDX files are recognized as MDX.

| Editor                          | Extension                                                                                                                                                                                |
| ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **VS Code / Cursor / Windsurf** | Search for "MDX" in the extensions panel (`unifiedjs.vscode-mdx`) — also available via [Open VSX](https://open-vsx.org/extension/unifiedjs/vscode-mdx) for VSCodium, Cursor, and Gitpod. |
| **JetBrains IDEs**              | [MDX plugin](https://plugins.jetbrains.com/plugin/14944-mdx)                                                                                                                             |
| **Zed**                         | Built-in MDX support.                                                                                                                                                                    |

## Frontmatter autocomplete via JSON Schema

Run the AgentMark CLI to generate a JSON Schema for your prompt frontmatter:

```bash theme={null}
npx agentmark generate-schema
```

This writes `.agentmark/prompt.schema.json`, classifying your configured models into `text_config`, `object_config`, `image_config`, and `speech_config` blocks with model-name enum autocomplete. Most editors (VS Code, Cursor, JetBrains) will pick up the schema automatically when it's referenced from your `.prompt.mdx` frontmatter.

## MCP: let your editor AI query AgentMark

When you scaffold an AgentMark project with `npm create agentmark@latest`, the scaffolder writes MCP server configs tailored to your editor:

* `agentmark-docs` — lets the editor's AI query AgentMark documentation
* `agentmark-local` — lets the editor's AI query your local trace data

Per-editor setup:

| Editor          | Config file          | Shape                                                                  |
| --------------- | -------------------- | ---------------------------------------------------------------------- |
| **VS Code**     | `.vscode/mcp.json`   | `{ "servers": { ... } }`                                               |
| **Cursor**      | `.cursor/mcp.json`   | `{ "mcpServers": { ... } }`                                            |
| **Zed**         | `.zed/settings.json` | `{ "context_servers": { ... } }`                                       |
| **Claude Code** | `.mcp.json`          | `{ "mcpServers": { ... } }` — requires `"type": "http"` on URL servers |

See the scaffolder source for the exact per-editor config, or rerun `npm create agentmark@latest -- --client <vscode|cursor|zed|claude-code>` to regenerate.

## Related

<CardGroup>
  <Card title="Docs MCP" icon="book" href="/sdk-reference/tools/agentmark-mcp">
    Per-IDE setup for `agentmark-docs`
  </Card>

  <Card title="Trace MCP server" icon="bug" href="/sdk-reference/tools/mcp-trace-server">
    Local trace debugging via `agentmark-local`
  </Card>
</CardGroup>


# Tags
Source: https://docs.agentmark.co/templatedx/tags

Use built-in and custom tags in TemplateDX to control template flow and transform data.

Tags are JSX elements backed by a plugin. TemplateDX ships five built-in tags (`<If>`, `<ElseIf>`, `<Else>`, `<ForEach>`, `<Raw>`) and exposes two ways to register your own: a global static API and a per-instance API on `TemplateDX`.

## Creating custom tags (TypeScript)

### Extend TagPlugin

Create a class that extends `TagPlugin` and implements `transform`:

```typescript theme={null}
import { Node } from 'mdast';
import { TagPlugin, PluginContext } from '@agentmark-ai/templatedx';

interface MyTagProps {
  prefix?: string;
}

class MyTagPlugin extends TagPlugin<MyTagProps> {
  async transform(
    props: MyTagProps,
    children: Node[],
    context: PluginContext
  ): Promise<Node[] | Node> {
    const { nodeHelpers } = context;

    const content = nodeHelpers.toMarkdown({
      type: 'root',
      children: children,
    });

    const prefix = props.prefix ?? '> ';
    const prefixedContent = content
      .split('\n')
      .map(line => prefix + line)
      .join('\n');

    return [{
      type: 'text',
      value: prefixedContent,
    } as Node];
  }
}
```

### The transform method

`transform(props, children, context)` must return a `Promise<Node | Node[]>`.

`context: PluginContext` exposes:

| Field                   | Type                    | Description                                                                                                |
| ----------------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------- |
| `nodeHelpers`           | `NodeHelpers`           | AST utilities (see below)                                                                                  |
| `scope`                 | `Scope`                 | Current variable scope; read via `scope.get(key)`, create a child with `scope.createChild({ key: value })` |
| `createNodeTransformer` | `(scope: Scope) => any` | Factory for recursively transforming child nodes under a new scope                                         |
| `tagName`               | `string`                | Name of the tag being processed — set by the transformer from `node.name`                                  |

<Note>
  `tagName` is typed as `string`, but the conditional plugin (`tag-plugins/conditional.ts`) defensively guards `if (!tagName) throw ...`. Built-in tags can rely on `tagName` being set; advanced consumers that invoke a plugin's `transform` directly (outside the transformer) should pass a valid name.
</Note>

#### nodeHelpers surface

| Member                      | Purpose                                                                              |
| --------------------------- | ------------------------------------------------------------------------------------ |
| `isMdxJsxElement(node)`     | `true` for flow or text JSX elements                                                 |
| `isMdxJsxFlowElement(node)` | `true` for block-level JSX                                                           |
| `isMdxJsxTextElement(node)` | `true` for inline JSX                                                                |
| `isParentNode(node)`        | `true` if the node has a `children` array                                            |
| `toMarkdown(node)`          | Serialize an AST node back to Markdown/text                                          |
| `hasFunctionBody(node)`     | `true` for expression nodes containing an arrow function (e.g. `<ForEach>` callback) |
| `getFunctionBody(node)`     | Returns `{ body, argumentNames }` for arrow-function expression nodes                |
| `NODE_TYPES`                | Constants for AST node-type strings (`TEXT`, `PARAGRAPH`, `MDX_JSX_FLOW_ELEMENT`, …) |

### Register the plugin (static or instance API)

The registry exposes **both** a static API (process-wide) and an instance API (scoped to a `TemplateDX` engine).

**Static (global) registration** — the simplest path; everything using the default `transform`/`stringify` exports will see it:

```typescript theme={null}
import { TagPluginRegistry } from '@agentmark-ai/templatedx';

TagPluginRegistry.register(new MyTagPlugin(), ['MyTag', 'Prefix']);
```

**Instance (scoped) registration** — use when you want plugins isolated per engine (e.g. a server handling multiple tenants with different tag sets):

```typescript theme={null}
import { TemplateDX } from '@agentmark-ai/templatedx';

const engine = new TemplateDX({ includeBuiltins: true });
engine.registerTagPlugin(new MyTagPlugin(), ['MyTag']);

const rendered = await engine.transform(ast, { /* props */ });
```

`new TemplateDX({ includeBuiltins: true })` copies the static built-ins (`If`, `ElseIf`, `Else`, `ForEach`, `Raw`) into the instance; pass `false` to start empty.
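
The isolation that instance registration provides can be pictured with a toy model. Everything in the sketch below is hypothetical (`ToyEngine`, `registerGlobal`, string stand-ins for plugins) and it is not the TemplateDX implementation. It only illustrates the idea of copying a shared table at construction time and then registering per instance.

```typescript
// Toy model only: hypothetical names, with strings standing in for plugin objects.
class ToyEngine {
  // Process-wide registrations, analogous to the static registry.
  private static globalTags: Record<string, string> = {};
  // Per-engine registrations, analogous to the instance API.
  private localTags: Record<string, string>;

  constructor(options: { includeBuiltins: boolean }) {
    // Copy the shared entries once, at construction time, or start empty.
    this.localTags = options.includeBuiltins ? { ...ToyEngine.globalTags } : {};
  }

  static registerGlobal(plugin: string, tags: string[]): void {
    tags.forEach((tag) => {
      ToyEngine.globalTags[tag] = plugin;
    });
  }

  registerTagPlugin(plugin: string, tags: string[]): void {
    tags.forEach((tag) => {
      this.localTags[tag] = plugin;
    });
  }

  has(tag: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.localTags, tag);
  }
}

ToyEngine.registerGlobal("conditional", ["If", "ElseIf", "Else"]);

const tenantA = new ToyEngine({ includeBuiltins: true });
const tenantB = new ToyEngine({ includeBuiltins: false });
tenantA.registerTagPlugin("quote", ["Quote"]);

console.log(tenantA.has("If"));    // true
console.log(tenantA.has("Quote")); // true
console.log(tenantB.has("If"));    // false
console.log(tenantB.has("Quote")); // false
```

In this model a tag registered on one engine never leaks into another, which is the property a multi-tenant server would rely on.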

### Use the tag in a template

```tsx theme={null}
<MyTag prefix="// ">
  This is some content
  that will be prefixed
</MyTag>
```

### Example: Quote tag

```typescript theme={null}
import { Node, Root } from 'mdast';
import { TagPlugin, PluginContext, TagPluginRegistry } from '@agentmark-ai/templatedx';

interface QuoteProps {
  author?: string;
}

class QuotePlugin extends TagPlugin<QuoteProps> {
  async transform(
    props: QuoteProps,
    children: Node[],
    context: PluginContext
  ): Promise<Node[]> {
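    const { nodeHelpers } = context;

    // Serialize the tag's children back to a Markdown string.
    const content = nodeHelpers.toMarkdown({
      type: 'root',
      children: children,
    } as Root);

    // Prefix every line with "> ". Trimming first avoids a stray "> " line
    // if the serializer ends its output with a newline.
    let result = content
      .trimEnd()
      .split('\n')
      .map(line => '> ' + line)
      .join('\n');

    // Append an attribution line when an author is provided.
    if (props.author) {
      result += `\n> \n> -- ${props.author}`;
    }

    // Return the quoted block as a single text node.
    return [{
      type: 'text',
      value: result,
    } as Node];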
  }
}

TagPluginRegistry.register(new QuotePlugin(), ['Quote']);
```

Usage:

```tsx theme={null}
<Quote author="Albert Einstein">
  Imagination is more important than knowledge.
</Quote>
```

Output:

```
> Imagination is more important than knowledge.
>
> -- Albert Einstein
```
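The string handling in the plugin can be tried in isolation, without the engine. The sketch below is a standalone approximation: `quoteLines` is a hypothetical helper, it assumes the serialized Markdown may end with a trailing newline, and it emits a bare `>` for blank lines so the result matches the output shown above.

```typescript
// Standalone approximation of the quoting step (hypothetical helper, not a library API).
function quoteLines(markdown: string, author?: string): string {
  // Drop trailing newlines so the split does not yield an empty last line.
  const lines = markdown
    .replace(/\n+$/, '')
    .split('\n')
    .map(line => (line.length > 0 ? '> ' + line : '>'));

  // Add a blank quoted line and the attribution when an author is given.
  if (author) {
    lines.push('>', '> -- ' + author);
  }
  return lines.join('\n');
}

const quoted = quoteLines(
  'Imagination is more important than knowledge.\n',
  'Albert Einstein'
);
console.log(quoted);
```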

## Creating custom tags (Python)

`agentmark-templatedx` (Python) mirrors the TS surface. Subclass `TagPlugin`, implement `async def transform`, and register via the static (`register_global`) or instance API.

```python theme={null}
from templatedx import TagPlugin, TagPluginRegistry
from templatedx.tag_plugin import PluginContext

class QuotePlugin(TagPlugin):
    async def transform(self, props, children, context: PluginContext):
        content = context.node_helpers.to_markdown(children)
        author = props.get("author")
        # rstrip() avoids a stray "> " line if the serialized Markdown ends with a newline.
        body = "\n".join(f"> {line}" for line in content.rstrip().split("\n"))
        if author:
            body += f"\n> \n> -- {author}"
        return [{"type": "text", "value": body}]

# Global (static) registration
TagPluginRegistry.register_global(QuotePlugin(), ["Quote"])
```

The Python `PluginContext` dataclass has `node_helpers` (snake\_case mirror of the TS `NodeHelpers` surface), `create_node_transformer`, `scope`, and `tag_name`.

For instance-scoped registration, construct the engine and use `register_tag_plugin`:

```python theme={null}
from templatedx import TemplateDX

engine = TemplateDX()  # always copies global built-ins on init
engine.register_tag_plugin(QuotePlugin(), ["Quote"])

# `await` must run inside an async function; `ast` is an already-parsed template.
result = await engine.transform(ast, {"name": "Alice"})
```

## Built-in tags

### ForEach

The `ForEach` tag loops over an array.

**Syntax**

```tsx theme={null}
<ForEach arr={props.arr}>
  {(item, index) => ...}
</ForEach>
```

**Parameters**

* `arr: Array<T>` — an array of items you want to iterate on
* `children: (item: T, index: number) => any` — a callback function for each item

**Example**

```tsx theme={null}
<ForEach arr={[1, 2]}>
  {(item, index) => (
    <>
      * item: {item}, index: {index}
    </>
  )}
</ForEach>
```

**Output**

```
* item: 1, index: 0
* item: 2, index: 1
```
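Conceptually, `ForEach` invokes the child callback once per element with `(item, index)`, much like `Array.prototype.map`. The sketch below mimics that contract with plain strings. The `forEachTag` helper is hypothetical, it returns text rather than AST nodes, and joining the results with newlines is an assumption made to match the output above.

```typescript
// Hypothetical stand-in for <ForEach>: call `render` per element, join the results.
function forEachTag<T>(arr: T[], render: (item: T, index: number) => string): string {
  return arr.map((item, index) => render(item, index)).join('\n');
}

const rendered = forEachTag([1, 2], (item, index) => `* item: ${item}, index: ${index}`);
console.log(rendered);
```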

### Conditionals

The `If`, `ElseIf`, and `Else` tags let you conditionally output content.

**Syntax**

```tsx theme={null}
<If condition={props.boolA}>
  ...
</If>
<ElseIf condition={props.boolB}>
  ...
</ElseIf>
<Else>
  ...
</Else>
```

**Parameters**

`If` / `ElseIf`:

* `condition: boolean` — the condition to check
* `children: Node` — the node to render if the condition is true

`Else`:

* `children: Node` — the content to render if no previous condition was met

**Example**

```tsx theme={null}
<If condition={1 + 1 == 3}>
  1 + 1 is 3
</If>
<ElseIf condition={1 + 1 == 2}>
  1 + 1 is 2
</ElseIf>
<Else>
  Fallback
</Else>
```

**Output**

```
1 + 1 is 2
```
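The chain behaves like an ordinary `if` / `else if` / `else`: the first branch whose condition holds is rendered and the remaining branches are skipped. A minimal sketch of that selection logic follows. `Branch` and `renderChain` are hypothetical names, and a branch without a `condition` plays the role of `<Else>`.

```typescript
// Hypothetical model of an If / ElseIf / Else chain.
type Branch = { condition?: boolean; content: string };

function renderChain(branches: Branch[]): string {
  for (const branch of branches) {
    // No condition means "Else": it always matches once reached.
    if (branch.condition === undefined || branch.condition) {
      return branch.content;
    }
  }
  // No branch matched and there was no Else: render nothing.
  return '';
}

const chainOutput = renderChain([
  { condition: 1 + 1 === 3, content: 'If branch' },
  { condition: 1 + 1 === 2, content: '1 + 1 is 2' },
  { content: 'Fallback' },
]);
console.log(chainOutput);
```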

### Raw

The `Raw` tag outputs its children verbatim, without expression interpolation.

**Syntax**

```tsx theme={null}
<Raw>
  ...
</Raw>
```

**Parameters**

* `children: Node` — the raw text

**Example**

```tsx theme={null}
<Raw>
  {props.name}
</Raw>
```

**Output**

```
{props.name}
```


# Type safety
Source: https://docs.agentmark.co/templatedx/type-safety

Editor-level type hints for TemplateDX templates. For runtime type safety, use the AgentMark type generator.

<Note>
  **For production type safety, see [Type safety in the SDK Reference](/sdk-reference/typescript/type-safety).** That page documents `npx agentmark generate-types`, which produces `AgentmarkTypes` from your prompt schemas and feeds into `createAgentMarkClient<AgentmarkTypes>()`. That's the pipeline AgentMark actually uses at runtime.

  This page is a supplement for **editor-level hints** on raw `.mdx` files opened outside an AgentMark project (e.g. via the vscode-mdx extension). TemplateDX itself does not evaluate TypeScript types — comments are stripped at bundle time per `bundler.ts:150`.
</Note>

## Editor setup for standalone `.mdx` files

If you're editing `.mdx` files outside an AgentMark project and want props autocomplete from your editor, add JSDoc `@typedef` comments:

```mdx hello.mdx theme={null}
{/**
  * @typedef Props
  * @property {string} name - Who to greet.
  */
}

# Hello {props.name}
```

The vscode-mdx extension (and similar editor plugins) read these comments to populate autocomplete. AgentMark's runtime does not use them — they're stripped before rendering.

## Custom filters and tags (editor-level)

To get autocomplete for custom filters and tags you've registered, create a `types/global.d.ts`:

```typescript theme={null}
import type { BaseMDXProvidedComponents, FilterFunction } from '@agentmark-ai/templatedx';

interface MyCustomTagProps {
  label: string;
  max?: number;
}

declare global {
  const myCustomFilter: FilterFunction<string, string>;

  interface MDXProvidedComponents extends BaseMDXProvidedComponents {
    MyCustomTag: React.FC<MyCustomTagProps>;
  }
}

export {};
```

Again, these types only help the editor — at runtime your filter/tag is looked up by name in the registry.

## Production type safety

For compile-time enforcement of prompt props and outputs in your application code, use the AgentMark type generator:

```bash theme={null}
npx agentmark generate-types --root-dir ./agentmark > agentmark.types.ts
```

Then:

```typescript theme={null}
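import type AgentmarkTypes from "./agentmark.types";

// `createAgentMarkClient`, `loader`, and `modelRegistry` come from your existing SDK setup (imports omitted here).
const client = createAgentMarkClient<AgentmarkTypes>({ loader, modelRegistry });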
const prompt = await client.loadObjectPrompt("my/prompt.mdx");
// props and output are now fully typed from the prompt's input_schema + object_config.schema
```

See [SDK Reference → Type safety](/sdk-reference/typescript/type-safety) for the full flow including generator output format.


# Variables
Source: https://docs.agentmark.co/templatedx/variables

Pass dynamic data into TemplateDX templates via the props object.

TemplateDX has no variable declarations. At render time, the caller passes a `props` object to `transform()`, and the template reads those values with `{props.*}` expressions. Tag plugins can introduce additional scoped variables (e.g. `<ForEach>` exposes the current item and its index to the callback).

## Accessing variables

### Dot notation

Read nested properties with dot notation, same as JavaScript:

```jsx theme={null}
{props.username}
{props.user.firstName} {props.user.lastName}
```

### Bracket syntax

Bracket syntax works for dynamic or hyphenated keys:

```jsx theme={null}
{props['user-name']}
{props['user-email']}
```

## Undefined-variable behavior

* **Missing nested properties** (via `MemberExpression`) render as an empty string:
  ```jsx theme={null}
  {props.user.address.street}   // renders '' if `address` or `street` is missing
  ```
* **Bare undefined identifiers** (no `props.` prefix) resolve to the literal string `"undefined"` — the evaluator treats them as undeclared names, not optional chains:
  ```jsx theme={null}
  {someGlobal}   // renders "undefined" if `someGlobal` isn't in scope
  ```

Only three kinds of names are accessible: values in the caller's `props`, variables introduced by a tag plugin's scope (`<ForEach>`, `<If>`), and registered filters (see [Filters](/templatedx/filters)). JavaScript globals are not.
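The two undefined-variable rules can be illustrated with a small resolver. This is a sketch of the documented behavior only, not the evaluator's actual code: `RenderScope`, `renderMember`, and `renderIdentifier` are hypothetical names, and member paths are passed as pre-split key arrays for simplicity.

```typescript
// Hypothetical resolver that mirrors the documented rendering rules.
type RenderScope = Record<string, unknown>;

// Member access such as props.user.address.street: any missing link renders ''.
function renderMember(scope: RenderScope, path: string[]): string {
  let current: unknown = scope;
  for (const key of path) {
    if (current === null || typeof current !== 'object') return '';
    current = (current as Record<string, unknown>)[key];
  }
  return current === undefined || current === null ? '' : String(current);
}

// Bare identifier such as someGlobal: an unknown name renders "undefined".
function renderIdentifier(scope: RenderScope, identifier: string): string {
  const known = Object.prototype.hasOwnProperty.call(scope, identifier);
  return known ? String(scope[identifier]) : 'undefined';
}

const scope: RenderScope = {
  props: {
    username: 'Alice',
    user: { firstName: 'Alice', lastName: 'Johnson' },
    'user-email': 'alice.johnson@example.com',
  },
};

console.log(renderMember(scope, ['props', 'username']));
console.log(renderMember(scope, ['props', 'user-email']));
console.log(JSON.stringify(renderMember(scope, ['props', 'user', 'address', 'street'])));
console.log(renderIdentifier(scope, 'someGlobal'));
```

Dot notation and bracket syntax both reduce to the same key lookup here, which is why `props['user-email']` resolves just like `props.username`.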

## Examples

### Defined variable

```jsx theme={null}
{props.username}
```

```
Alice
```

### Nested properties

```jsx theme={null}
{props.user.firstName} {props.user.lastName}
```

```
Alice Johnson
```

### Bracket syntax for hyphenated keys

```jsx theme={null}
{props['user-email']}
```

```
alice.johnson@example.com
```

### Undefined nested property

```jsx theme={null}
{props.user.address.street}
```

(renders an empty string)