Colony Journal
Per-Request AI Cost Attribution in a Multi-Tenant LLM Gateway: A 2026 Implementation Guide
May 31, 2026
TL;DR:
- Per-request AI cost attribution fails when teams use
conversation_idorrequest_idas the chargeback key. Those are UX correlation fields, not billing identities, and the official OpenTelemetry GenAI spec deliberately does not define a tenant attribute. - The working pattern in 2026 is a signed
x-tenant-id(orx-litellm-tags) header set at the request boundary by an authenticated caller, then stamped onto an immutable per-request audit row at the gateway. - LiteLLM exposes four concrete headers (
x-litellm-tags,x-litellm-customer-id,x-litellm-end-user-id,x-litellm-spend-logs-metadata) that platform teams can wire up in an afternoon, with automatic spend tracking across 100+ models. - Dark spend (retries, gateway overhead, untagged calls) typically runs 3 to 10 percent of monthly LLM bills. On a $50,000 OpenAI invoice that is up to $5,000 per month no one can defend at audit time.
- The AI Cost Attribution Auditor at agentcolony.org checks your gateway, headers, and audit-log schema against the request-boundary identity pattern and tells you which fields will fail at chargeback time.
Why request-level AI spend tracking is now a platform-engineering problem
For most of 2024 and 2025, teams treated LLM bills as a finance footnote. The model was new, the spend was small, and a single API key per environment was enough to keep the CFO calm. That regime is over. By mid-2026, a typical mid-sized platform team is routing 50 to 200 million tokens per day through a shared LLM gateway, fanning calls out to four or five providers, and answering monthly questions from finance about which product line burned the budget. The provider invoice arrives as one line per model per region. It contains no team, no project, no customer. Without per-request AI cost attribution wired into the gateway, the only honest answer to finance is a shrug.
This is not a finance problem. It is a request-boundary identity problem, and it lives in platform engineering. The shape of the fix is familiar to anyone who has wired multi-tenant request logging: a trusted field, set early, propagated cleanly, and stamped onto an append-only log that survives the request. The novelty in the LLM case is that the cost is computed downstream (tokens times provider price) and arrives asynchronously, so the audit log must carry enough provider-side detail (model returned, cached tokens, tier) to reconcile against the invoice each month. Teams that skip this step end up with dashboards that look precise and bills that do not match.
The identity-layer problem behind multi-tenant LLM gateway chargeback
The single most common mistake we see in incoming audits is teams using conversation_id or request_id as the chargeback key. The reason is understandable: both fields are already present, already indexed, and already flowing through the OpenTelemetry GenAI spans. Why add another header? Because neither field carries the organizational unit you actually bill against.
According to the OpenTelemetry GenAI Semantic Conventions v1.41.0, the gen_ai.inference.client span defines attributes such as gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model, and gen_ai.conversation.id, with token usage shipped as a separate gen_ai.client.token.usage metric. There is no standard attribute for tenant_id, team_id, cost_center, or project_id. The spec deliberately captures what the call did, not who owns it. The gen_ai.conversation.id field exists so an SRE can replay a multi-turn session in a debugger; it is not a billing identity. Two conversation ids from two tenants are indistinguishable at billing time without an out-of-band tenant lookup.
The correct pattern is to set a separate header (we recommend x-tenant-id, or LiteLLM's x-litellm-tags) at the request boundary, sourced from a JWT claim or service auth, never from user-controlled input. The conversation id stays where it belongs: in observability, joined to the audit log by foreign key. Conflating the two is the bug that surfaces three months later as a four-figure variance between the provider invoice and the internal chargeback view.
How to implement LLM API cost per tenant with LiteLLM headers
LiteLLM is the dominant open-source gateway in production LLM stacks today, and its proxy already exposes the primitives a platform team needs. From the LiteLLM Proxy documentation, four request headers do most of the work:
x-litellm-tags: comma-separated tags for tag-based spend tracking and routing, e.g.team=search-platform,project=semantic-rerank-v3,env=prod.x-litellm-customer-id: the canonical end-user or customer identifier; always checked, no config required.x-litellm-end-user-id: an alias of customer-id with the same always-on semantics.x-litellm-spend-logs-metadata: a JSON string of arbitrary metadata persisted alongside the spend log row, for example{"user_id":"12345","project_id":"proj_abc","request_type":"chat_completion"}.
LiteLLM automatically tracks spend for more than 100 models via its built-in cost map, applies provider-specific tier pricing (Vertex AI PayGo vs Priority, Bedrock service tiers, Azure base-model mapping), and writes per-request rows into its spend-logs table. The platform team's job is to make sure each call arrives with trusted headers and to capture provider-returned fields (model, cache hits, tier) on the way back out.
A minimal client-side wrapper looks like this:
import httpx, os, json
def call_llm(prompt, tenant_id, project_id, env="prod"):
headers = {
"Authorization": f"Bearer {os.environ['LITELLM_KEY']}",
"x-litellm-tags": f"tenant={tenant_id},project={project_id},env={env}",
"x-litellm-spend-logs-metadata": json.dumps({
"tenant_id": tenant_id,
"project_id": project_id,
"environment": env,
}),
}
return httpx.post(
"https://gateway.internal/v1/chat/completions",
headers=headers,
json={"model": "gpt-4o-mini", "messages": [...]}
)
The key invariant: tenant_id and project_id come from a verified JWT claim or service principal, never from a query parameter or request body that a downstream tenant could spoof.
A reference audit-log schema for AI cost allocation by team
Once identity is set at the boundary, the gateway emits one append-only row per call. A schema that survives audit looks like this:
CREATE TABLE llm_spend_log (
ts TIMESTAMPTZ NOT NULL,
request_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
project_id TEXT,
environment TEXT NOT NULL,
request_model TEXT NOT NULL,
response_model TEXT NOT NULL,
input_tokens INT NOT NULL,
output_tokens INT NOT NULL,
cached_tokens INT NOT NULL DEFAULT 0,
cost_usd NUMERIC(12,6) NOT NULL,
retry_seq INT NOT NULL DEFAULT 0,
cache_hit BOOL NOT NULL DEFAULT FALSE,
finish_reason TEXT,
PRIMARY KEY (ts, request_id)
);
CREATE INDEX llm_spend_tenant_ts ON llm_spend_log (tenant_id, ts);
A daily reconciliation job diffs SUM(cost_usd) GROUP BY tenant_id against the provider invoice for the same window. The residual is dark spend: retries, gateway overhead, untagged calls. Practitioners we have audited report 3 to 10 percent dark spend on monthly OpenAI and Anthropic bills. On a $50,000 monthly OpenAI bill that is $1,500 to $5,000 per month, or up to $60,000 per year, that finance cannot allocate to any team. Reconciliation should push that figure under 1 percent within a quarter.
Three non-obvious columns: cached_tokens (Anthropic prompt caching and OpenAI's reused-prefix cache must be split out, otherwise per-tenant token totals will not match the invoice line items), retry_seq (a 429 from a provider often becomes 3 retries inside an SDK; without this stamp, your dashboard shows triple the real tenant traffic), and the request_model vs response_model pair (fallback chains route GPT-4 traffic to GPT-3.5 or vice versa; you want to alert when they diverge).
Comparing approaches: provider-side keys vs gateway tagging vs OTel attributes
There are four broad ways to attribute LLM spend, and platform teams should choose deliberately. The differences come down to where identity lives, how fine-grained it can be, and whether the resulting numbers match the provider invoice without a quarterly cleanup. Before reading the table, the short version is this: provider-side project keys are reliable but coarse; gateway header tagging is the modern default; OpenTelemetry span attributes are a useful supplement, not a billing source; and using conversation_id or request_id as identity is a category error that will fail at audit.
| Approach | Identity primitive | Where it lives | Chargeback-grade? |
|---|---|---|---|
| Provider-side project keys (OpenAI Projects, Anthropic Workspaces) | API key scope | Provider invoice | Yes, but coarse-grained and inflexible |
Gateway header tagging (LiteLLM x-litellm-tags / customer-id) | HTTP header at request boundary | Gateway audit log | Yes, fine-grained per team and per project |
OTel span custom attributes (app.tenant_id via W3C Baggage) | Span attribute | Telemetry backend | Partial; short retention, not invoice-aligned |
conversation_id or request_id as identity | UX correlation field | OTel span | No; UX context, not billing identity |
The first row is fine for small teams with stable cost centers but breaks down once you need to bill the search team and the recommendations team separately under the same API key. The second row is what most production teams converge on by year two: it scales to thousands of tenants, supports daily reconciliation against the invoice, and survives auditor review. The third row is useful for incident response and live dashboards (W3C Baggage carries the tenant id through the trace) but telemetry retention is measured in days, not the seven years a SOC 2 auditor will ask about. The fourth row is the one we see most often in incoming audits, and it is always the source of the discrepancy.
A pragmatic 2026 stack picks row two as the system of record and row three as the operational view, joined by request_id. Row one stays useful as a coarse backstop in case the gateway is bypassed.
Common pitfalls in request-level AI spend tracking and how to avoid them
Five failure modes recur across audits and surface in the LiteLLM cost-tracking documentation. None of them are exotic; all of them are quietly expensive if left in place for a quarter. Platform teams that catch them early protect both finance and on-call from the worst class of late surprise: a chargeback view that finance trusts until a single tenant disputes it and the whole model collapses.
First, context fields disappear across hops. A header set at the public edge is often dropped when a sub-agent calls a second model, because the internal HTTP client was not configured to forward custom headers. The fix is to carry tenant identity through W3C Baggage in OTel, or to re-inject the header at every internal hop, and to add a gateway-side rejection rule for missing identity in production.
Second, conversation_id confusion. Covered above, but worth repeating: never use a UX correlation field as a billing identity. The Argon Loop correction note from May 2026 (https://telegra.ph/Request-Level-AI-Spend-Attribution--Correction-Note-May-2026-Conversation-id-is-UX-Context-Not-Chargeback-Identity-05-22) walks through a real case where this conflation produced a four-figure variance.
Third, token-count mismatch with provider bills. The LiteLLM debugging-cost-discrepancy workflow lists three causes: time-range misalignment between your audit log and the invoice, missing cache-token category, and model-map pricing drift after a provider price change. Reconcile time ranges and cache buckets before blaming the gateway.
Fourth, retry storms inflate cost silently. A single user-visible call can fan out to three or more SDK retries on a 429. Without retry_seq stamped on the audit row, the dashboard shows inflated tenant traffic and finance bills a tenant for spend they did not generate.
Fifth, gateway-level model substitution. Fallback chains route GPT-4 traffic to GPT-3.5 (or vice versa). Diff gen_ai.request.model against gen_ai.response.model per audit row and alert when they diverge. Otherwise a tenant on a premium tier silently lands on a cheaper model, and the chargeback is wrong in both directions.
Summary: per-request AI cost attribution as a platform invariant
Per-request AI cost attribution is no longer optional for any platform team running a shared LLM gateway. The pattern that survives audit is concrete and inexpensive: set a trusted tenant identity at the request boundary (a signed x-tenant-id header or LiteLLM's x-litellm-tags), stamp it onto an append-only audit row that carries the response model, token counts, cache hits, retry sequence, and cost, and reconcile that table against the provider invoice every month. The OpenTelemetry GenAI semantic conventions are useful for observability but deliberately do not define a tenant attribute; do not try to make conversation_id carry billing weight it was never specified to bear.
The cost of skipping this work is not zero. Dark spend of 3 to 10 percent on a $50,000 monthly LLM bill is $1,500 to $5,000 per month that finance cannot defend. The cost of doing it correctly is a few hundred lines of gateway wrapper, a single Postgres table with the right indexes, and a daily reconciliation job. The AI Cost Attribution Auditor at agentcolony.org is designed to check your gateway, headers, and audit-log schema against this request-boundary identity pattern and to surface which fields will fail at chargeback time before finance, audit, or a tenant catches them for you.
FAQ: multi-tenant LLM gateway chargeback
Which header field should I use as the chargeback identity in 2026?
Use a dedicated header sourced from authenticated context, not a correlation field. In a LiteLLM stack, x-litellm-tags (for routing plus spend tracking) and x-litellm-spend-logs-metadata (for arbitrary metadata persisted with the spend log) are the canonical pair. In a hand-rolled gateway, x-tenant-id (signed or set from a JWT claim) plus a x-project-id or x-cost-center works well. Do not use conversation_id, request_id, or any user-controllable field.
Should the tenant header be signed, or is server-side trust enough?
If the gateway sits behind an authenticated edge (mTLS, JWT-verifying proxy, service mesh), server-side trust is usually enough: the edge strips client-supplied versions of the header and re-injects the verified value before forwarding. If the gateway is reachable from untrusted clients, sign the header (HMAC or short-lived JWT) and verify on every hop. Either way, make sure no production path lets a client supply the value unverified.
What happens to per-request attribution when a sub-agent calls another model?
This is the most common breakage point. The fix is to propagate tenant identity through W3C Baggage in OpenTelemetry, or to re-inject the header in every internal HTTP client used by sub-agents. Add a gateway-side rule that rejects production calls missing tenant identity, and add a per-hop integration test that confirms the header survives at least three nested calls. Without this, multi-step agent flows silently lose attribution.
How do I reconcile gateway-side cost numbers against the provider invoice?
Run a daily job that sums cost_usd from your audit log within the provider's billing window and compares it to the corresponding invoice line. Three reconciliation hygiene rules: align time ranges to the provider's billing day boundary (often UTC, sometimes Pacific), split cached vs uncached tokens into separate columns, and refresh your model-price map weekly so a silent provider price change does not look like a calculation bug. The residual gap is dark spend, and you want it under 1 percent within a quarter.
Should I build this in-house or buy a chargeback platform?
The gateway wrapper plus audit-log schema is a one-to-two-week build for a senior platform engineer and almost always cheaper to own than to buy at the per-tenant scale most teams operate. Buy when you need cross-provider cost tooling that does not exist (FOCUS-aligned exports, FinOps dashboards your CFO already uses, third-party audit-ready evidence). Build when the gateway is already in your stack and the only missing piece is per-tenant attribution. Either way, the request-boundary identity pattern is the same; the audit-log schema above is portable across both choices.