
Colony Journal

Multi-Tenant AI Gateway Cost Attribution: A Field Guide to Per-Workflow_id Billing

May 29, 2026

TL;DR

  • Five-field minimum schema for chargeback: tenant_id, workflow_id, request_id, user_id, model plus token usage. Propagate it on every hop or the bill becomes a guess.
  • The OpenTelemetry GenAI semconv 1.41.0 and provider-native metadata fields (OpenAI user/metadata, Anthropic metadata.user_id, Bedrock cost allocation tags) are the two compatible carriers. Use both at the same time.
  • Most attribution leakage is a propagation gap, not a schema gap: retries strip headers, queues drop trace context, agent sub-calls bypass the gateway, batch endpoints reject metadata, mid-stream disconnects log usage=null.
  • Define unattributed spend rate as tokens-with-null-workflow_id divided by total tokens. Anything over 2 percent is a chargeback defect the platform team is silently eating.
  • Reconcile internal gateway totals against the provider invoice every month. Delta over 1 percent is your headline attribution leakage KPI.

Multi-tenant AI gateway cost attribution sounds like a schema problem and turns out to be a propagation problem. Every platform team running a shared LiteLLM, Portkey, Helicone, or in-house proxy across five-plus internal teams or external tenants eventually hits the same wall: the OpenAI or Anthropic invoice arrives as one number, the internal log says it should split a dozen ways, and 15 to 30 percent of the rows have a null tenant_id. The team that built the gateway is the team whose budget the unattributed tokens land on. This post is a working guide to fixing that, written for FinOps engineers and platform leads at companies with five or more teams sharing one AI gateway.

The minimum viable schema for per-tenant LLM cost billing

Per-workflow_id AI cost tracking starts with a small, recurring set of fields that has to ride with the call from the originating workflow, through the gateway, to the provider, and back into the cost log. Across LiteLLM, Portkey, Helicone, and OpenTelemetry GenAI, the same five fields keep showing up:

  • tenant_id: the billable entity, a customer org or internal team.
  • workflow_id: the business unit of work justifying the spend. One support ticket, one batch eval run, one agent task. This is the field FinOps charges back against.
  • request_id: unique per HTTP call to the gateway. It is the foreign key that lets a row in the provider invoice match a row in your internal log.
  • user_id or end_user_id: the human or downstream agent that triggered the call.
  • model plus provider-reported usage.input_tokens, usage.output_tokens, and cached_tokens.

Without all five, shared AI infrastructure cost allocation cannot be done cleanly: tenant_id alone collapses workflow detail, workflow_id alone breaks customer-level chargeback, and a missing request_id makes invoice reconciliation impossible.
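As a minimal sketch of the schema above, the record below models one cost-log row and reports which of the five required fields are missing. The class name AttributionRecord and the helper missing_fields are hypothetical, chosen for illustration; they are not part of any gateway's API.

```python
from dataclasses import dataclass
from typing import Optional

# The five fields every row needs before it can be charged back.
REQUIRED = ("tenant_id", "workflow_id", "request_id", "user_id", "model")


@dataclass(frozen=True)
class AttributionRecord:
    """One row in the cost log (hypothetical shape mirroring the schema)."""
    tenant_id: Optional[str]
    workflow_id: Optional[str]
    request_id: Optional[str]
    user_id: Optional[str]
    model: Optional[str]
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0


def missing_fields(record: AttributionRecord) -> list[str]:
    """Return the required fields that are null or empty on this record."""
    return [name for name in REQUIRED if not getattr(record, name)]


row = AttributionRecord(
    tenant_id="team-payments",
    workflow_id=None,  # lost somewhere between the caller and the warehouse
    request_id="req-001",
    user_id="u-17",
    model="gpt-4o",
    input_tokens=1200,
    output_tokens=300,
)
print(missing_fields(row))
```

Rejecting or flagging rows at write time, rather than discovering nulls in the warehouse a month later, is the design choice this sketch is meant to suggest.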

Two carriers: OpenTelemetry GenAI and provider-native metadata

There are two conventions for carrying this context, and the practical answer is to emit both.

The first is the OpenTelemetry GenAI semantic conventions, currently at 1.41.0 (May 2026), which standardise attributes like gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.id. Your tenant and workflow IDs ride along as custom span attributes or W3C baggage (tenant.id, workflow.id), with service.namespace set as a resource attribute on the emitting service. Datadog, Honeycomb, and Grafana all ingest this schema, which means once the spans are emitted you get attribution dashboards almost for free.

The second is provider-native metadata. OpenAI accepts a top-level user string and a metadata object that flows into the Usage export. Anthropic accepts metadata.user_id. Bedrock honours AWS cost allocation tags on each Invoke. The LiteLLM virtual keys pattern wraps these so a key issued per team carries a metadata blob like {"user": "team-payments", "workflow_id": "ticket-42"} that the proxy persists and exposes via /key/info, /team/info, /user/info, and /end_user/info endpoints.
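One way to keep the two carriers from drifting apart is to derive both from a single context object. The sketch below uses plain dictionaries only; the function names and the ctx shape are assumptions for illustration, and mapping tenant_id onto the OpenAI user string is one possible choice rather than a requirement.

```python
def otel_attributes(ctx: dict, model: str) -> dict:
    """Span attributes: a standard gen_ai.* key plus the custom tenant/workflow keys."""
    return {
        "gen_ai.request.model": model,
        "tenant.id": ctx["tenant_id"],
        "workflow.id": ctx["workflow_id"],
    }


def openai_style_params(ctx: dict) -> dict:
    """Provider-native carrier: top-level user string plus a metadata object."""
    return {
        "user": ctx["tenant_id"],
        "metadata": {
            "workflow_id": ctx["workflow_id"],
            "request_id": ctx["request_id"],
        },
    }


def anthropic_style_params(ctx: dict) -> dict:
    """Provider-native carrier for Anthropic: metadata.user_id only."""
    return {"metadata": {"user_id": ctx["user_id"]}}


ctx = {
    "tenant_id": "team-payments",
    "workflow_id": "ticket-42",
    "request_id": "req-001",
    "user_id": "u-17",
}
print(otel_attributes(ctx, "gpt-4o"))
print(openai_style_params(ctx))
print(anthropic_style_params(ctx))
```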

Comparing the carriers head to head

Aspect | OpenTelemetry GenAI semconv | OpenAI / Anthropic metadata | Bedrock cost allocation tags | LiteLLM virtual keys
Standard owner | OpenTelemetry SIG | Provider | AWS | Open-source proxy
Where stored | Your trace store | Provider Usage export | AWS Cost and Usage Report | Proxy DB + headers
Tenant field | Custom attribute (tenant.id) | user string | tag value | key + metadata
Workflow field | Custom attribute (workflow.id) | metadata.workflow_id | tag value | metadata.workflow_id
Survives provider retries | Only if baggage is propagated | Yes | Yes | Depends on router
Reconciles to invoice | Via request_id join | Direct from Usage export | Direct from CUR line | Via /spend endpoints
Setup cost | Medium (collector + semconv) | Low (param per call) | Medium (IAM + tagging) | Low if already proxied

For most platform teams the right pattern is: emit both, treat the OTel span as the operational source of truth, and treat the provider usage export as the ground-truth reconciler.

Enforcement gaps: where the workflow_id disappears mid-hop

LLM cost chargeback per team breaks not because the schema is wrong but because the field is empty in 15 to 30 percent of rows by the time it lands in the warehouse. Five repeatable failure modes show up across teams.

Retry strips metadata. The gateway retries on a 5xx and builds a fresh request envelope without re-attaching the metadata header, so the second attempt logs workflow_id=null. In the trace it shows up as a child span under the same parent as the first attempt, but without the workflow.id attribute. The LiteLLM issue tracker has multiple reports of this around router retry paths.
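A minimal sketch of a retry loop that avoids this failure mode: the envelope is rebuilt on every attempt, but always from the original metadata. The names call_with_retry, TransientError, and the envelope shape are hypothetical and do not correspond to any particular gateway's middleware.

```python
import copy


class TransientError(Exception):
    """Stand-in for a retryable upstream failure such as a 5xx."""


def call_with_retry(send, payload, metadata, max_attempts=3):
    """Call send(envelope), re-attaching the original metadata on each attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        envelope = {
            "payload": payload,
            "metadata": copy.deepcopy(metadata),  # never rebuilt empty
            "attempt": attempt,
        }
        try:
            return send(envelope)
        except TransientError as exc:
            last_error = exc
    raise last_error


seen = []


def flaky_send(envelope):
    """Fails on the first attempt to simulate a 5xx, then succeeds."""
    seen.append(envelope)
    if envelope["attempt"] == 1:
        raise TransientError("simulated 5xx")
    return {"status": 200}


result = call_with_retry(flaky_send, {"prompt": "hi"}, {"workflow_id": "ticket-42"})
print(result, [e["metadata"] for e in seen])
```

The same forced-failure pattern doubles as a regression test: both recorded envelopes should carry the workflow_id.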

Async queue handoff. A worker pulls a job off SQS or Kafka, but the trace context was never serialised into the message. The LLM call starts a fresh trace and the workflow_id baggage is gone. The fix is W3C traceparent plus baggage propagation in the message envelope.
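To make the envelope fix concrete, here is a standard-library sketch that serialises a traceparent value and a baggage header into a JSON message and reads them back on the worker side. The helper names and the envelope layout are assumptions; in a real service the OpenTelemetry SDK's propagators would normally do this work rather than hand-rolled code.

```python
import json
from urllib.parse import quote, unquote


def encode_baggage(items: dict) -> str:
    """Render key=value pairs as a W3C baggage header, percent-encoding values."""
    return ",".join(f"{k}={quote(str(v), safe='')}" for k, v in items.items())


def decode_baggage(header: str) -> dict:
    """Parse a baggage header, ignoring any ;property suffixes."""
    out = {}
    for member in header.split(","):
        if "=" not in member:
            continue
        key, _, value = member.strip().partition("=")
        out[key] = unquote(value.split(";", 1)[0])
    return out


def wrap_message(body: dict, traceparent: str, baggage: dict) -> str:
    """Producer side: put trace context next to the job body."""
    headers = {"traceparent": traceparent, "baggage": encode_baggage(baggage)}
    return json.dumps({"headers": headers, "body": body})


def unwrap_message(raw: str):
    """Consumer side: recover the body and the propagated context."""
    msg = json.loads(raw)
    headers = msg.get("headers", {})
    return msg["body"], headers.get("traceparent"), decode_baggage(headers.get("baggage", ""))


# Example traceparent in the W3C format: version-traceid-spanid-flags.
tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
raw = wrap_message({"job": "summarise"}, tp, {"tenant.id": "acme co", "workflow.id": "ticket-42"})
body, traceparent, baggage = unwrap_message(raw)
print(traceparent, baggage)
```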

Agent sub-calls. A LangGraph or LangChain agent makes a tool-call that itself calls an LLM through the SDK directly, bypassing the gateway metadata. The cost lands attributed to the agent's parent user_id only, so two workflows sharing a sub-agent are indistinguishable in the bill.

Batch and embedding endpoints. Batch and embedding endpoints often do not accept the same user or metadata fields. A batch eval that consumes 30 percent of the month's tokens lands as tenant=unknown. The practical workaround is to split batches per tenant key so the key itself encodes the tenant.
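The per-tenant split can be sketched as a simple grouping step before submission. The item shape and the split_by_tenant helper are hypothetical; the point is only that each group is then submitted under that tenant's own key.

```python
from collections import defaultdict


def split_by_tenant(items: list[dict]) -> dict:
    """Group batch items by tenant so each group can use that tenant's key."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("tenant_id") or "unknown"].append(item)
    return dict(groups)


batch = [
    {"tenant_id": "team-a", "prompt": "eval 1"},
    {"tenant_id": "team-b", "prompt": "eval 2"},
    {"tenant_id": "team-a", "prompt": "eval 3"},
    {"prompt": "eval 4"},  # no tenant: surfaces as "unknown" before any tokens are spent
]
groups = split_by_tenant(batch)
print({tenant: len(rows) for tenant, rows in groups.items()})
```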

Streaming response close. If the client disconnects mid-stream, some gateways log usage=null and the call drops out of the spend ledger entirely. Fix it server-side by capturing usage at stream close, not from the client.
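A sketch of the server-side capture, assuming the gateway relays chunks through a Python generator: the ledger write sits in a finally block, so it runs whether the stream completes or the client goes away. The function metered_stream and the ledger row shape are hypothetical, and the chunk count stands in for whatever usage figure the gateway can actually observe.

```python
def metered_stream(chunks, ledger, request_id, workflow_id):
    """Relay chunks to the client and record usage when the stream ends, however it ends."""
    sent = 0
    complete = False
    try:
        for chunk in chunks:
            sent += 1
            yield chunk
        complete = True
    finally:
        # Runs on normal completion and when the consumer closes the generator early.
        ledger.append({
            "request_id": request_id,
            "workflow_id": workflow_id,
            "chunks_sent": sent,
            "complete": complete,
        })


ledger = []
stream = metered_stream(iter(["Hel", "lo", " wor", "ld"]), ledger, "req-001", "ticket-42")
next(stream)
next(stream)
stream.close()  # simulate the client disconnecting mid-stream
print(ledger)
```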

Diagnosing the leak with one trace query

Once you know the failure modes you can find them. The diagnostic is mechanical: in your trace store, filter spans where gen_ai.usage.input_tokens > 0 AND the workflow.id attribute is null. Tokens in that set divided by total tokens for the month is your unattributed spend rate. Anything over roughly 2 percent is a chargeback defect that someone is silently eating, almost always the platform team that owns the gateway.
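The same calculation can be sketched over exported span rows. The function name and the dictionary-per-span input are assumptions, as is counting input plus output tokens toward the rate; a real trace store would run the equivalent as a query.

```python
def unattributed_spend_rate(spans: list[dict]) -> float:
    """Share of tokens on spans that used tokens but carry no workflow.id."""
    total = 0
    unattributed = 0
    for span in spans:
        tokens = (span.get("gen_ai.usage.input_tokens", 0)
                  + span.get("gen_ai.usage.output_tokens", 0))
        total += tokens
        if tokens > 0 and not span.get("workflow.id"):
            unattributed += tokens
    return unattributed / total if total else 0.0


spans = [
    {"gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 200, "workflow.id": "ticket-42"},
    {"gen_ai.usage.input_tokens": 40, "gen_ai.usage.output_tokens": 10, "workflow.id": None},
    {"gen_ai.usage.input_tokens": 900, "gen_ai.usage.output_tokens": 50, "workflow.id": "eval-7"},
]
rate = unattributed_spend_rate(spans)
print(f"{rate:.1%}", "over threshold" if rate > 0.02 else "ok")
```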

According to the OpenTelemetry GenAI semantic conventions 1.41.0, tenant and workflow identifiers remain generic span attributes rather than first-class GenAI fields, which is exactly why every org rolls its own convention and why the field is so easy to lose on a retry path or a queue boundary. Pick a name (workflow.id is the conventional choice), document it, and bolt it on as a required baggage propagator in every service that talks to the gateway. Audit it weekly with the null-attribute query above.

Reconciling the gateway log against the provider invoice

Even with a clean per-request log, the monthly cycle needs a reconciliation step. Providers round, aggregate, and apply credits differently than the gateway's local meter. The pattern that works:

  1. Export the OpenAI Usage API, the Anthropic Usage export, or the AWS Cost and Usage Report tagged by Bedrock line for the month.
  2. Group the internal gateway log by tenant_id, model, and day. Sum cost_usd computed from the posted list price.
  3. Reconcile: provider total minus internal total should be under 1 percent. The delta is your attribution leakage, and it is the headline FinOps KPI for this work.
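The three steps above can be sketched as follows. The row shape, the function names, and the provider invoice figure are hypothetical; the price table uses the GPT-4o list prices quoted in this post.

```python
from collections import defaultdict

# USD per million tokens (GPT-4o list prices as quoted in this post).
PRICE_PER_MILLION = {"gpt-4o": {"input": 2.50, "output": 10.00}}


def cost_usd(row: dict) -> float:
    """Cost of one logged request at list price."""
    price = PRICE_PER_MILLION[row["model"]]
    return (row["input_tokens"] * price["input"]
            + row["output_tokens"] * price["output"]) / 1_000_000


def internal_totals(rows: list[dict]) -> dict:
    """Sum cost_usd grouped by (tenant_id, model, day)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["tenant_id"], row["model"], row["day"])] += cost_usd(row)
    return dict(totals)


def leakage(provider_total: float, internal_total: float) -> float:
    """Attribution leakage as a fraction of the provider invoice."""
    return (provider_total - internal_total) / provider_total


rows = [
    {"tenant_id": "team-a", "model": "gpt-4o", "day": "2026-05-01",
     "input_tokens": 1_000_000, "output_tokens": 200_000},
    {"tenant_id": "team-b", "model": "gpt-4o", "day": "2026-05-01",
     "input_tokens": 400_000, "output_tokens": 100_000},
]
totals = internal_totals(rows)
internal = sum(totals.values())
provider_invoice = 6.60  # hypothetical figure from the provider export
delta = leakage(provider_invoice, internal)
print(totals, f"leakage={delta:.2%}")
```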

The concrete justification number platform teams quote: OpenAI GPT-4o May 2026 list pricing is $2.50 per million input tokens and $10.00 per million output tokens. At a blended rate of roughly $7.50 per million tokens (an output-heavy mix), a single gateway serving five teams at 200M tokens per month runs a roughly $1.5K monthly bill, and 20 percent unattributed leaves $300 per month of platform tax that someone has to eat. At enterprise scale, 10B tokens per month at roughly $75K, the same 20 percent is $15K per month of unallocated cost, more than enough to fund the attribution project on its own.
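The arithmetic can be checked in a few lines. The one-third input share is an assumption inferred from the totals above, not a figure the pricing page provides; a different input/output mix changes the bill accordingly.

```python
INPUT_PRICE = 2.50    # USD per million input tokens
OUTPUT_PRICE = 10.00  # USD per million output tokens


def monthly_bill(tokens_millions: float, input_share: float) -> float:
    """Bill in USD for a token volume with the given input/output mix."""
    blended = input_share * INPUT_PRICE + (1 - input_share) * OUTPUT_PRICE
    return tokens_millions * blended


small = monthly_bill(200, 1 / 3)      # 200M tokens per month
large = monthly_bill(10_000, 1 / 3)   # 10B tokens per month
print(round(small), round(small * 0.20))
print(round(large), round(large * 0.20))
```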

Summary

Multi-tenant AI gateway cost attribution is a propagation problem with a schema sticker on it. The five-field schema (tenant_id, workflow_id, request_id, user_id, model plus token usage) is necessary but not sufficient. The real work is in carrying that context across every hop: retries, queues, agent sub-calls, batch endpoints, and streaming response close. Emit OpenTelemetry GenAI spans and provider-native metadata at the same time. Diagnose with the unattributed spend rate query. Reconcile monthly against the provider invoice. The platform team that does this turns an opaque "we share one OpenAI key across N teams and have no way to chargeback" into a clean per-team line item, and stops eating the leakage out of its own budget.

FAQ

How do I implement per-workflow_id cost attribution on a shared LiteLLM proxy?

Issue a virtual key per team via the LiteLLM /key/generate endpoint, attach a metadata object with workflow_id on every call, and use the /spend/logs endpoint to pull per-key, per-team, and per-user_id rollups. Confirm the metadata makes it through router retries by spot-checking a 5xx-retried request in the proxy log. If the retried span is missing workflow.id, patch the router middleware to re-attach the original metadata blob before reissue.

What is an unattributed spend rate and what is a healthy threshold?

Unattributed spend rate is the share of tokens where workflow_id (or tenant_id) is null. Compute it monthly from your trace store. Below 2 percent is operationally normal noise from streaming disconnects and edge retries. Above 2 percent indicates a real propagation defect: a retry path, a queue handoff, or an agent sub-call bypassing the gateway. Treat it like any other SLO and assign the gateway team a budget.

Why is workflow_id empty after a retry in my AI gateway logs?

The gateway is building a fresh outbound request on retry and dropping the original metadata header. Fix it in the gateway's retry middleware by re-attaching the original metadata blob, or by carrying the IDs as W3C baggage propagated by the OpenTelemetry SDK so the next outbound span inherits them automatically. Verify with a synthetic test that forces a 5xx and inspects both spans.

Can I rely solely on the OpenAI Usage export for tenant-level LLM cost billing?

Only if you set the user string and metadata fields on every call, and you do not use batch or embedding endpoints, since they often reject these fields. For most multi-tenant gateways the safest approach is to emit OpenTelemetry GenAI spans as the operational source of truth and reconcile against the provider Usage export monthly. The reconciliation delta is the metric you report to finance.

How do I chargeback Bedrock invocations across multiple tenants on a shared role?

Use AWS cost allocation tags on the Invoke call (tenant, workflow) and activate them in the Billing console. Tagged lines then show up in the Cost and Usage Report by tenant, ready to join against your internal gateway log on request_id. Without tags, all Bedrock spend on a shared role lands indistinguishable in the consolidated bill, and you are back to apportioning it by guesswork.