
Colony Journal

LLM Cost Attribution for Multi-Tenant SaaS: A Practical Chargeback Playbook

May 31, 2026

TL;DR:

  • Provider dashboards from OpenAI, Anthropic, Azure OpenAI, and Bedrock report spend by API key or project, never by your application-defined tenant. If you ship multi-tenant SaaS with AI features, your invoice is structurally unattributable by default.
  • The single most common root cause of failed AI spend attribution is context drop: tenant_id, user_id, and workflow_id are set in the app but never propagate past the gateway, queue, batch, or streaming boundary.
  • A chargeback-ready trace carries tenant identity as OpenTelemetry Baggage, splits input vs output (and cache-read vs cache-write) tokens per tenant, aggregates streaming and tool-use child calls back to the originating request, and persists the originating tenant in the OpenAI Batch API metadata field.
  • The top 5 to 10 percent of tenants typically drive 50 to 80 percent of LLM spend; without per-tenant visibility you find this out on invoice day, too late to set a budget cap.
  • Use the free AI Cost Attribution Auditor to check whether your trace pipeline actually preserves the identity needed for LLM cost chargeback.

Why the LLM Invoice Lies About Who Spent What

Platform engineers running multi-tenant SaaS with AI features keep landing in the same conversation with finance: the OpenAI or Anthropic invoice came in 30 percent over forecast, and nobody in engineering can answer the only question that matters. Which tenant burned it?

The reason is structural, not lazy instrumentation. Provider billing is keyed to the credential the request arrived on. OpenAI groups spend by project; Anthropic groups it by workspace; Azure OpenAI groups it by deployment; AWS Bedrock groups it by inference profile. None of those concepts are your tenant. If your platform routes every tenant request through one shared gateway holding one organization-wide API key, the provider literally sees a single customer of itself. Your tenant identity exists only inside application logs that were never reconciled to the billing export.

This is why AI spend attribution is a different problem from traditional cloud chargeback. With EC2 or RDS you can paint a resource with cost-allocation tags and let Cost Explorer do the aggregation. With LLM calls you have to inject tenant identity into every request, propagate it across every hop, capture it in token-usage telemetry, and stitch it back to a billing-grade ledger. Miss one hop and the spend falls into a generic system bucket.

The Six Ways Tenant Context Drops in Production

The failure modes are remarkably consistent across companies. After enough trace audits a pattern emerges, and most teams turn out to be hitting three or four of these simultaneously.

  1. Gateway-as-single-key. The app issues one org-level provider key from a shared gateway. The provider sees one consumer. Tenant identity lives only in application logs that are never joined to the invoice.
  2. Header stripping. An HTTP proxy, service mesh, or queue between app and gateway drops the custom x-tenant-id header. The downstream span has no tenant attribute and falls through to a default bucket.
  3. Background jobs and async workers. A worker calls the LLM without re-injecting the originating tenant from the job payload. Spend lands under generic system or worker.
  4. Streaming SSE. The per-request OpenTelemetry span is often closed as soon as the connection opens. Token-usage events arrive later, after the span that carried the tenant context has already ended, so they attach to a parent that has lost that context.
  5. Tool and agent recursion. Nested agent steps spawn new spans without Baggage propagation, so child LLM costs orphan from the user-facing request that triggered them.
  6. Embeddings and retrieval. Per-tenant index builds and per-query retrieval (embeddings plus rerank) are commonly billed to infra and never re-attributed to the tenant whose RAG ingestion drove them.

The practical impact: an LLM bill that looks 30 percent over forecast usually traces back to the 5 to 10 percent of tenants who consume 50 to 80 percent of spend, hidden behind these context drops. You cannot rate-limit, upsell, or even apologize to the right customer until the identity reaches the ledger.
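To make the concentration claim concrete, here is a minimal sketch that computes the share of spend driven by the top slice of tenants. The function name `top_share` and the sample figures are illustrative assumptions, not data from a real invoice.

```python
def top_share(spend_by_tenant: dict, top_percent: int) -> float:
    """Return the fraction of total spend driven by the top `top_percent` of tenants."""
    if not spend_by_tenant:
        return 0.0
    ranked = sorted(spend_by_tenant.values(), reverse=True)
    # Integer ceiling so the slice always contains at least one tenant.
    n_top = max(1, (len(ranked) * top_percent + 99) // 100)
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


# Hypothetical monthly spend in USD: eighteen light tenants and two heavy ones.
spend = {f"tenant-{i}": 50.0 for i in range(18)}
spend["tenant-a"] = 1800.0
spend["tenant-b"] = 1300.0

share = top_share(spend, 10)  # top 10 percent of 20 tenants is 2 tenants
print(f"top 10% of tenants drive {share:.1%} of spend")
```

Running this on a per-tenant ledger is only possible once tenant identity reaches that ledger, which is the point of the rest of this post.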

What a Chargeback-Ready Trace Actually Carries

The minimum viable schema for AI cost allocation in SaaS is narrower than most teams think. You need five identifiers and four metrics on every LLM span.

Identifiers: tenant_id (mandatory), user_id, workflow_id or originating request id, environment (prod vs staging vs eval), and feature (which product surface produced the call). Metrics: input_tokens, output_tokens, cache_read_tokens, cache_write_tokens. With those nine fields per span, plus the model name that an LLM span already records, you can produce per-tenant, per-feature, per-model token and dollar reports without going back to source.
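The schema above could be sketched as a record type plus a small aggregation. This is an illustration only: the class name, the `PRICES` table, and the price figures are assumptions, and the four token counts are assumed to be disjoint (providers differ on whether cached tokens are included in the input count, so check before reusing the arithmetic).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSpanRecord:
    # Five identifiers
    tenant_id: str
    user_id: str
    workflow_id: str
    environment: str
    feature: str
    # Model name, read from the span itself
    model: str
    # Four metrics
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


# Hypothetical USD prices per million tokens; real prices vary by provider and model.
PRICES = {"model-x": {"input": 2.0, "output": 8.0, "cache_read": 0.5, "cache_write": 2.5}}


def cost_usd(r: LLMSpanRecord) -> float:
    p = PRICES[r.model]
    total = (r.input_tokens * p["input"] + r.output_tokens * p["output"]
             + r.cache_read_tokens * p["cache_read"]
             + r.cache_write_tokens * p["cache_write"])
    return total / 1_000_000


def report(records) -> dict:
    """Sum cost per (tenant_id, feature, model)."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.tenant_id, r.feature, r.model)] += cost_usd(r)
    return dict(totals)


records = [
    LLMSpanRecord("acme", "u1", "wf-1", "prod", "summarize", "model-x",
                  input_tokens=1_000_000, output_tokens=500_000),
    LLMSpanRecord("acme", "u2", "wf-2", "prod", "summarize", "model-x",
                  cache_read_tokens=2_000_000),
]
by_key = report(records)
```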

The OpenTelemetry GenAI semantic conventions cover the core of this schema under the gen_ai.* namespace: gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. Cache-read and cache-write counts are not in that list, so check the version of the conventions you target and fall back to custom attributes where needed. The spec does not prescribe a tenant attribute, leaving it to teams to define their own (tenant.id is the conventional choice) and propagate it as Baggage so it survives cross-process hops. That last detail is where most pipelines silently fail: an attribute set on a span does not propagate to a downstream service unless it is also placed in Baggage.
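To show what "placed in Baggage" means on the wire, here is a hand-rolled sketch of the W3C baggage header format. In practice an OpenTelemetry propagator does this for you; the helper names `encode_baggage` and `decode_baggage` are hypothetical, and the parser ignores most edge cases in the spec.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Build a baggage header value: comma-separated key=value, values percent-encoded."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict:
    """Parse a baggage header value, dropping any optional ';property' suffixes."""
    entries = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0].strip()
        if "=" not in key_value:
            continue  # skip empty or malformed members
        key, _, value = key_value.partition("=")
        entries[key.strip()] = unquote(value.strip())
    return entries


header = encode_baggage({"tenant.id": "acme co", "workflow.id": "wf-42"})
print(header)  # tenant.id=acme%20co,workflow.id=wf-42
print(decode_baggage(header))
```

Note that reading Baggage on the receiving side does not by itself put tenant.id on the span. Copying the Baggage entry onto each span as an attribute is a separate step that the receiving service has to perform.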

Gateway-Only Billing vs Attribution-Aware Billing

Most teams start with what their LLM gateway emits out of the box and only later realize it does not answer chargeback questions. Here is the contrast in one table.

Dimension | Gateway-only (default) | Attribution-aware
Identity carried | API key or project | tenant_id + user_id + workflow_id via Baggage
Granularity | per-project totals | per-tenant per-workflow per-model
Token split | aggregate | input, output, cache-read, cache-write per tenant
Batch and async jobs | lost or manual reconcile | preserved via request metadata
Streaming responses | partial (open span only) | aggregated to originating request span
Agent and tool recursion | orphaned child spans | child cost rolled up to parent request
Chargeback-ready export | no (manual reconciliation) | yes (per-tenant ledger)
Detects runaway tenants | invoice surprise | budget alerts within hours
Detects runaway tenantsinvoice surprisebudget alerts within hours

LiteLLM, Helicone, Portkey, and OpenRouter all support attribution through per-request metadata, but the work to wire it correctly (Baggage propagation, batch metadata, streaming aggregation) stays with the platform team. The FinOps Foundation State of FinOps surveys have flagged AI and LLM cost management as the top emerging priority for practitioners in 2024 and 2025, ahead of Kubernetes cost allocation. Getting attribution right is therefore a formal FinOps line item now, not an engineering nice-to-have.

Wiring Attribution Correctly: A Concrete Recipe

The shortest path from a context-dropping setup to chargeback-ready spend takes four moves.

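First, set tenant.id as OpenTelemetry Baggage at the inbound request boundary (web handler, GraphQL resolver, or job dispatcher). Baggage propagates across process boundaries via the W3C baggage header, which OpenTelemetry propagators inject and extract automatically. That makes it more likely to survive the proxy hop where a custom x-tenant-id typically gets stripped, although a proxy that forwards only an allowlist of headers still needs baggage added to that list.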
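Within a single process, the same idea can be sketched with the standard library's contextvars: bind the tenant once at the boundary, then read it wherever an outbound call is made. This is a stand-in for what the OpenTelemetry context and Baggage APIs provide, not a replacement for them; `tenant_scope`, `current_tenant`, `outbound_headers`, and `handle_request` are hypothetical names.

```python
import contextvars
from contextlib import contextmanager
from urllib.parse import quote

_tenant = contextvars.ContextVar("tenant_id", default=None)


@contextmanager
def tenant_scope(tenant_id: str):
    """Bind the tenant for the duration of one request, then restore the old value."""
    token = _tenant.set(tenant_id)
    try:
        yield
    finally:
        _tenant.reset(token)


def current_tenant() -> str:
    tenant_id = _tenant.get()
    if tenant_id is None:
        # Fail loudly instead of silently billing the call to a default bucket.
        raise LookupError("no tenant bound to this request")
    return tenant_id


def outbound_headers() -> dict:
    """Headers for a downstream call, carrying the tenant as a baggage entry."""
    return {"baggage": f"tenant.id={quote(current_tenant(), safe='')}"}


def handle_request(authenticated_tenant: str) -> dict:
    # The inbound boundary: the only place the tenant is set.
    with tenant_scope(authenticated_tenant):
        return outbound_headers()


print(handle_request("acme"))  # {'baggage': 'tenant.id=acme'}
```

Raising when no tenant is bound is a design choice: an error in staging is cheaper than an unattributed line on the invoice.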

Second, pass tenant identity into the LLM call as provider-native metadata. On OpenAI chat/completions, responses, and batches endpoints, populate the metadata field with tenant_id, workflow_id, and feature. On Anthropic, use the metadata.user_id slot plus custom request tagging. On Bedrock, use application inference profiles tagged with the cost-allocation tag for your tenant, which flows to AWS Cost Explorer.
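For the OpenAI case, a minimal sketch of building the metadata and a batch submission body might look like the following. No SDK is called here; the helpers only assemble dictionaries. The helper names and the file id are hypothetical, and the size limits are assumptions based on the limits OpenAI documents for metadata, so verify them against the current API reference.

```python
# Assumed limits for the metadata field; confirm against the provider's current docs.
MAX_PAIRS, MAX_KEY_LEN, MAX_VALUE_LEN = 16, 64, 512


def build_metadata(tenant_id: str, workflow_id: str, feature: str) -> dict:
    """Return string-only metadata, rejecting anything the API would likely refuse."""
    metadata = {
        "tenant_id": str(tenant_id),
        "workflow_id": str(workflow_id),
        "feature": str(feature),
    }
    if len(metadata) > MAX_PAIRS:
        raise ValueError("too many metadata pairs")
    for key, value in metadata.items():
        if not value:
            raise ValueError(f"{key} must not be empty")
        if len(key) > MAX_KEY_LEN or len(value) > MAX_VALUE_LEN:
            raise ValueError(f"{key} exceeds the assumed metadata size limits")
    return metadata


def batch_request_body(input_file_id: str, metadata: dict) -> dict:
    """Body for creating a batch; the metadata travels with the batch object."""
    return {
        "input_file_id": input_file_id,
        "endpoint": "/v1/chat/completions",
        "completion_window": "24h",
        "metadata": metadata,
    }


metadata = build_metadata("acme", "wf-42", "summarize")
body = batch_request_body("file-abc123", metadata)  # hypothetical file id
```

Validating before submission matters most for batches, because a rejected or untagged batch is only discovered hours later.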

Third, aggregate streaming and tool-use children to the originating request id. The trace should expose one parent span per user-facing request with all child LLM token usage rolled up under it. Most agent frameworks support a custom propagator; the work is making sure your gateway adds to the incoming trace context and Baggage rather than resetting them.
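The rollup itself can be sketched as a walk up the parent chain. The `Span` shape here is a simplified, hypothetical stand-in for exported trace data, with only the fields the rollup needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    input_tokens: int = 0
    output_tokens: int = 0


def rollup_to_root(spans) -> dict:
    """Sum (input_tokens, output_tokens) under each root span id."""
    by_id = {s.span_id: s for s in spans}

    def root_of(span: Span) -> str:
        seen = set()
        # Climb until there is no known parent; the `seen` set guards against cycles.
        while span.parent_id in by_id and span.span_id not in seen:
            seen.add(span.span_id)
            span = by_id[span.parent_id]
        return span.span_id

    totals = {}
    for s in spans:
        root = root_of(s)
        i, o = totals.get(root, (0, 0))
        totals[root] = (i + s.input_tokens, o + s.output_tokens)
    return totals


spans = [
    Span("req", None),                                           # user-facing request
    Span("llm-1", "req", input_tokens=100, output_tokens=20),    # first model call
    Span("tool", "llm-1"),                                       # tool step
    Span("llm-2", "tool", input_tokens=50, output_tokens=10),    # nested model call
    Span("orphan", "missing", input_tokens=7, output_tokens=3),  # parent never exported
]
totals = rollup_to_root(spans)
```

A span whose parent is missing becomes its own root in this sketch, which keeps orphaned spend visible in the output instead of dropping it.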

Fourth, emit an exportable per-tenant ledger in FOCUS shape if the consumer is FinOps. Even a flat CSV with date, tenant_id, model, input_tokens, output_tokens, cost_usd is enough to start; FOCUS-shaped exports become important when finance starts cross-referencing AI spend against the rest of cloud spend in one tool.
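The flat CSV starting point could be produced with a few lines of standard-library code. This sketch uses exactly the columns named above and makes no attempt at FOCUS column mapping; the function name and the sample row are illustrative.

```python
import csv
import io

LEDGER_COLUMNS = ["date", "tenant_id", "model", "input_tokens", "output_tokens", "cost_usd"]


def write_ledger(rows, out) -> None:
    """Write ledger rows as CSV, sorted for stable diffs between runs."""
    writer = csv.DictWriter(out, fieldnames=LEDGER_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: (r["date"], r["tenant_id"], r["model"])):
        # Fixed precision keeps small per-request costs from printing in exponent form.
        writer.writerow({**row, "cost_usd": f"{row['cost_usd']:.6f}"})


rows = [
    {"date": "2026-05-01", "tenant_id": "globex", "model": "model-x",
     "input_tokens": 400, "output_tokens": 50, "cost_usd": 0.0012},
    {"date": "2026-05-01", "tenant_id": "acme", "model": "model-x",
     "input_tokens": 1000, "output_tokens": 200, "cost_usd": 0.0036},
]
buffer = io.StringIO()
write_ledger(rows, buffer)
ledger_lines = buffer.getvalue().splitlines()
```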

The AI Cost Attribution Auditor walks a sample trace through exactly these checks and tells you which of the six failure modes your pipeline currently has, before the next invoice lands.

When You Need a Vendor and When You Do Not

Not every team needs a third-party AI gateway to do attribution-aware billing. A well-instrumented in-house gateway with OpenTelemetry GenAI spans, Baggage propagation, and a nightly job that joins span data to provider Usage API exports can produce a chargeback-ready ledger. The decision usually comes down to whether platform engineering has the bandwidth to maintain the schema, the propagators, and the reconciliation job as the product changes.
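The core of such a nightly reconciliation job is a comparison between what the spans attribute and what the provider billed. A minimal sketch, assuming you have already summed span cost per tenant and pulled one provider total for the same period (the function name and the 1 percent tolerance are assumptions):

```python
def reconcile(attributed_by_tenant: dict, provider_total_usd: float,
              tolerance: float = 0.01) -> dict:
    """Compare span-derived per-tenant cost with the provider's billed total."""
    attributed = sum(attributed_by_tenant.values())
    unattributed = provider_total_usd - attributed
    coverage = attributed / provider_total_usd if provider_total_usd else 1.0
    return {
        "attributed_usd": attributed,
        "unattributed_usd": unattributed,
        "coverage": coverage,
        # A gap in either direction is suspicious: missing spans or wrong prices.
        "within_tolerance": abs(unattributed) <= tolerance * provider_total_usd,
    }


result = reconcile({"acme": 60.0, "globex": 30.0}, provider_total_usd=100.0)
print(result)
```

Here 10 percent of the bill has no tenant, which is the number to drive toward zero as each context drop is fixed.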

Vendor gateways like LiteLLM, Helicone, and Portkey shorten the wiring but do not exempt you from the four moves above. Their default config still requires you to opt into per-tenant metadata, configure budget caps, and verify that streaming and batch attribution reach your tenant ledger. The leverage is in the dashboard, the budget enforcement, and the ready-made FinOps exports, not in solving the attribution problem for you.
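Whichever gateway you use, the budget check underneath is simple once a per-tenant ledger exists. The sketch below is an illustration, not any vendor's implementation; the function name, status labels, and the 80 percent warning threshold are assumptions.

```python
def budget_alerts(spend_usd: dict, budget_usd: dict, warn_at: float = 0.8) -> dict:
    """Return a status per tenant that needs attention; healthy tenants are omitted."""
    alerts = {}
    for tenant, spent in spend_usd.items():
        cap = budget_usd.get(tenant)
        if cap is None or cap <= 0:
            alerts[tenant] = "no_budget"  # spending with no cap configured
        elif spent >= cap:
            alerts[tenant] = "over"
        elif spent >= warn_at * cap:
            alerts[tenant] = "warn"
    return alerts


spend_now = {"acme": 120.0, "globex": 85.0, "initech": 10.0, "umbrella": 5.0}
budgets = {"acme": 100.0, "globex": 100.0, "initech": 100.0}
alerts = budget_alerts(spend_now, budgets)
```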

Summary

LLM cost attribution in multi-tenant SaaS fails by default because provider billing is keyed to API credentials, not to your tenants, and because most pipelines drop tenant identity at one or more of six predictable points: the shared gateway key, the proxy, async workers, streaming, agent recursion, and embeddings or retrieval. The fix is not exotic. Set tenant.id as Baggage at the inbound boundary, pass it into provider-native metadata on every call (including batch), aggregate streaming and tool-use children back to the originating request, and emit an exportable per-tenant token and dollar ledger. Whether the work happens in an in-house gateway or a vendor like LiteLLM, Helicone, or Portkey, the schema and the propagation rules are the same. The reward is that the top 5 to 10 percent of tenants who drive 50 to 80 percent of your spend become visible before invoice day, which is the only point at which AI cost allocation in SaaS actually starts paying for itself.

FAQ

Why can't I just use the OpenAI or Anthropic dashboard for chargeback?

Provider dashboards segment spend by their own billing entities: OpenAI project, Anthropic workspace, Azure deployment, Bedrock inference profile. None of those maps to your application-defined tenant unless you pre-allocated one provider entity per tenant, which breaks past a few hundred tenants because of rate-limit fragmentation and key sprawl. For SaaS, dashboard-level attribution stops working at scale.

One API key per tenant or one shared key with metadata tagging?

Key-per-tenant is clean for a small number of high-value tenants because the provider dashboard does the chargeback for you. It breaks somewhere past a few hundred tenants because rate limits, model access, and key rotation become operational toil. Shared key plus per-request metadata tagging is what scales, but it puts the attribution work on your trace pipeline, which is what this post is about.

How do I attribute streaming and tool-use costs?

Use W3C Baggage propagation to keep tenant.id available on every child span the agent spawns. Aggregate child LLM token usage to the originating user-facing request id so the ledger reports cost per user request, not per provider span. If you stream, make sure the span lifecycle waits for the final usage event before closing, or emit a separate usage span linked to the request.
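The streaming part could be sketched as a generator that forwards text while holding the usage record until the stream ends. The chunk shape here (dicts with optional "delta" and "usage" keys) is a simplified assumption, not a provider's actual wire format, and `record_usage` is a hypothetical callback where you would set span attributes and end the span.

```python
def consume_stream(chunks, record_usage):
    """Yield text deltas; call record_usage exactly once when the stream finishes."""
    usage = None
    try:
        for chunk in chunks:
            if chunk.get("usage"):
                usage = chunk["usage"]  # usually arrives on the last chunk
            if chunk.get("delta"):
                yield chunk["delta"]
    finally:
        # Runs on normal completion and when the consumer closes the generator early.
        # usage stays None if the provider never sent it, which is worth alerting on.
        record_usage(usage)


recorded = []
stream = [
    {"delta": "Hel"},
    {"delta": "lo"},
    {"usage": {"input_tokens": 12, "output_tokens": 2}},
]
text = "".join(consume_stream(stream, recorded.append))
```

The point of the `finally` block is that the span lifecycle follows the stream, not the moment the connection opened.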

How do I attribute background and batch jobs?

Re-inject tenant identity from the job payload when the worker starts the LLM call. For the OpenAI Batch API, set tenant_id and workflow_id in the metadata field at submission time; usage arrives hours later and the metadata is the only durable bridge back to the originating tenant. Without it, the batch lands under system spend.
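For ordinary queue workers, re-injection can be sketched as two small functions: one that stamps the tenant into the payload at enqueue time and one that refuses to run without it. The queue is a plain list and `call_llm` is a hypothetical callable standing in for your gateway client.

```python
import json


def enqueue(queue: list, tenant_id: str, workflow_id: str, prompt: str) -> None:
    """Serialize the job with the originating tenant inside the payload."""
    queue.append(json.dumps(
        {"tenant_id": tenant_id, "workflow_id": workflow_id, "prompt": prompt}
    ))


def run_job(raw: str, call_llm):
    """Run one job, passing the tenant from the payload into the LLM call."""
    job = json.loads(raw)
    tenant_id = job.get("tenant_id")
    if not tenant_id:
        # Refuse instead of letting the spend land under system or worker.
        raise ValueError("job payload has no tenant_id")
    metadata = {"tenant_id": tenant_id, "workflow_id": job.get("workflow_id", "")}
    return call_llm(job["prompt"], metadata=metadata)


calls = []


def fake_llm(prompt, metadata):
    # Stand-in for a real client; records what attribution it was given.
    calls.append(metadata)
    return f"echo: {prompt}"


queue = []
enqueue(queue, "acme", "wf-42", "summarize this")
output = run_job(queue.pop(0), fake_llm)
```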

Do I need OpenTelemetry to do LLM cost attribution?

OpenTelemetry helps for cross-service propagation and gives you a portable schema, but it is not strictly required. If your gateway logs structured per-request rows with tenant_id, model, input tokens, output tokens, and cost, you can produce a chargeback ledger from those logs alone. OpenTelemetry becomes more important once attribution has to survive multiple services, async workers, and an agent runtime.