Colony Journal
Per-User AI Cost Tracking in Multi-Tenant SaaS: A FinOps Guide
May 29, 2026
TL;DR
- Most multi-tenant SaaS products ship with zero per-user LLM attribution, so one power user or abusive script can consume 50-95% of your monthly inference budget without warning.
- The minimum viable attribution key is three fields:
user_id,tenant_id, andsession_id, set at the application layer - the vendor will not add them for you. - OpenTelemetry Gen AI semantic conventions standardize the span fields your gateway must emit to produce defensible per-user cost rows.
- Six failure modes break attribution even when teams believe it is working: shared inference pools, batched calls, model fallbacks, prompt-cache miscosting, dropped streaming rows, and stripped headers.
- A healthy pipeline reconciles per-user totals against the vendor invoice within 1-3%; gaps over 10% indicate a real leak.
Why User-Level LLM Spend Tracking Is a Revenue Problem, Not Just an Ops Detail
When SaaS teams first wire up an LLM API, the billing model looks simple: one key, one vendor invoice, one monthly charge. That works at low volume with a uniform user base. It breaks badly once any of the following applies: some users run agentic workflows, one tenant's team uses the product far more than others, or a single automation script fires thousands of queries overnight.
The core FinOps issue is consumption skew. Practitioners report 80/20-or-worse patterns where one or two power users consume 50-95% of the inference budget in a single billing cycle. That cost lands in shared infrastructure with no way to bill the right tenant, throttle the right account, or explain the variance in a quarterly review.
Both chargeback (billing tenants their exact cost) and showback (reporting cost without billing) require a defensible mapping from every LLM request to a specific user and tenant. Without it, your cost structure has a blind spot proportional to your heaviest user.
The Math That Makes Multi-Tenant AI Cost Attribution Urgent
At current vendor list prices, GPT-4o runs approximately $2.50 per million input tokens and $10 per million output tokens. Claude 3.5 Sonnet is $3 per million input and $15 per million output (Anthropic and OpenAI pricing pages, 2025).
A single tenant running an agentic loop of 50,000 tokens multiplied by 200 turns per day on GPT-4o costs roughly $30 per day - about 100x a typical user's daily spend. Without user_id on every trace, that $900 per month lands in shared costs and compresses the margin of every other customer you serve.
The math makes a business case for multi-tenant AI cost allocation that is hard to ignore once teams hit meaningful volume.
The Three Fields That Enable Per-User AI Cost Attribution
Per-user LLM cost attribution reduces to an instrumentation problem. Three fields, captured at the gateway or SDK call site, form the minimum viable attribution key.
The first is user_id. OpenAI's Chat Completions API exposes a top-level optional user string parameter, documented primarily for abuse monitoring but used by gateways as the canonical user tag. Anthropic's Messages API has the equivalent at metadata.user_id. Both fields are off by default. Teams that never set them cannot attribute costs later, even retroactively.
The second is session_id or conversation_id. This groups multi-turn requests so a 200-turn agent loop bills as one session rather than 200 orphan rows with no causal connection to a user action.
The third is tenant_id. No vendor injects this field. It must come from the application layer, typically as a custom header or metadata field your gateway reads and logs alongside every request.
According to the OpenTelemetry Gen AI semantic conventions (semconv 1.41, available at opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/), the standardized span fields include gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.model, plus the cross-cutting user.id and session.id attributes from the User and Session attribute groups. A FinOps pipeline that joins on those fields and applies the model's per-million-token price produces a defensible per-user cost row for every request.
Six Failure Modes That Break SaaS AI Cost Allocation in Production
Knowing where attribution breaks is as important as knowing how to build it. These failure modes appear repeatedly across GitHub issues and practitioner discussions in LiteLLM, Langfuse, and Helicone communities.
1. Shared Inference Pools Collapse User Identity
Shared inference pools are the most common root cause. A background worker dequeues prompts from many users and calls the LLM on a single service account. The user field is never set because there is no request-scoped user in context at the time of the call. Every row looks like service-bot and 100% of the bill lands in platform overhead.
2. Batched Calls Lose Per-User Granularity
Batched calls lose granularity. Embedding and classification endpoints accept arrays of inputs. The vendor returns aggregate usage counts. The application must reconstruct per-user costs by splitting proportionally to input lengths, and almost no team implements this correctly, so batched workloads systematically under-report costs.
3. Model Fallbacks Create Pricing Divergence
Model fallbacks create pricing divergence. A request that fails on GPT-4o retries on GPT-4o-mini. The gen_ai.response.model field differs from gen_ai.request.model. Pricing the row by the request model over- or understates the cost by an order of magnitude. This is precisely why the OTel spec distinguishes those two fields.
4. Prompt Cache Fields Require Explicit Handling
Prompt cache fields require explicit handling. Anthropic charges 1.25x the standard input rate for cache writes and 0.10x for cache reads. OpenAI prompt caching discounts cached tokens to 50%. Attribution logic that ignores cache_creation_input_tokens, cache_read_input_tokens, and cached_tokens systematically overbills tenants who benefit from cache reuse.
5. Streaming Disconnects Drop Cost Rows
Streaming responses have a reconciliation gap. Usage metadata arrives only in the final Server-Sent Events chunk. Long-running streamed calls that disconnect before the final chunk never emit it. The cost row is dropped entirely.
6. Internal Proxies Strip Attribution Headers
Internal proxies strip headers. A reverse proxy or auth middleware sitting in front of your gateway can silently drop custom headers, including the user identifier. Every request arrives at the gateway looking anonymous, with no way to reconstruct attribution after the fact.
A Three-Layer Architecture for Reliable LLM Cost Per User
Teams that have solved per-user attribution in production follow a consistent three-layer pattern.
At the application layer, every LLM client call passes through a shared helper that requires user_id, tenant_id, and session_id as typed, non-optional parameters. The helper sets the vendor's user parameter and emits an OTel span with user.id, session.id, and gen_ai.* fields. Making these required rather than optional is the critical enforcement mechanism. Teams that treat attribution as optional instrumentation see coverage drop to 40-60% within months as code paths multiply.
At the gateway or proxy layer, tools like LiteLLM, Portkey, Helicone, or Cloudflare AI Gateway enforce that user is present on every request, log full request and response metadata to a trace store, and apply per-tenant rate limits. Langfuse and OTel collectors feeding ClickHouse or BigQuery are the common persistence choices. The gateway is the right place to reject or flag anonymous requests before they reach the model.
At the FinOps pipeline layer, a nightly job joins gateway traces with a model-price table and emits a per-(tenant, user, model, day) cost table. This feeds both chargeback invoices and in-product usage dashboards that customers can inspect directly.
Comparison: LLM Gateways for Multi-Tenant AI Cost Attribution
| Tool | User Attribution Field | Session Support | Tenant Isolation | Price Table Built-In |
|---|---|---|---|---|
| LiteLLM | user parameter + virtual keys | Session ID via metadata | Virtual keys per tenant | Yes, per model |
| Helicone | Helicone-User-Id header | Property-based grouping | Property filters | Yes |
| Langfuse | User ID on trace | Session object | Project-scoped | No (bring your own) |
| Portkey | Virtual keys per user | Trace ID grouping | Virtual key policies | Partial |
| Cloudflare AI Gateway | Custom metadata | Request ID only | Account-scoped | No |
The right choice depends on whether you need the gateway to enforce attribution at request time (LiteLLM, Portkey, Helicone) or use it primarily as an observability layer (Langfuse). Teams running strict chargeback typically deploy LiteLLM or Portkey as the enforcement proxy and Langfuse as the trace visualization layer behind it.
Verifying Attribution Correctness Against the Vendor Invoice
A working attribution pipeline is not just one that runs without errors - it is one you can verify. The standard check is a back-pressure reconciliation: sum all attributed per-user costs for the month and compare the total to the actual vendor invoice.
A healthy pipeline reconciles within 1-3%, with the remainder explained by rounding and prompt-caching approximations. Drift over 10% indicates a real leak: an unattributed worker calling the model, a model missing from the price table, dropped streaming rows, or a model fallback priced at the wrong tier.
Running this reconciliation monthly keeps the attribution pipeline honest and gives finance teams confidence that chargeback invoices are grounded in the actual vendor charge.
Summary
Per-user AI cost tracking is architecturally solved but operationally neglected. The fields that enable it are available in every major vendor API and standardized by OpenTelemetry's Gen AI conventions. The six failure modes that break it are well documented and repeat across teams of every size. The fix is a mandatory instrumentation wrapper at the SDK call site, a gateway that enforces field presence at request time, and a monthly reconciliation that verifies per-user totals match the vendor invoice. Teams that skip this step discover the problem at month-end, when one tenant has consumed the majority of the budget and no defensible record exists of who drove what.
If you want to audit your current LLM cost attribution pipeline against these criteria, the AI Cost Attribution Auditor runs a diagnostic against your gateway configuration and traces to surface exactly which failure modes are present in your setup.
Frequently Asked Questions
How do I add per-user LLM cost tracking to existing OpenAI API calls without changing my architecture?
Set the user parameter on every Chat Completions request to a stable identifier for the end user making the request. Add a logging middleware that captures user, model, usage.input_tokens, and usage.output_tokens from each response. This gives you basic per-user attribution without any gateway change and works with any existing SDK version.
What is the difference between chargeback and showback for LLM costs in a multi-tenant SaaS product?
Chargeback means you bill the tenant or cost center the exact LLM spend they drove. Showback means you report that cost to them without billing for it. Both require the same attribution data pipeline. Showback is the typical first step because it builds stakeholder confidence in the numbers before invoices start going out.
Why does my per-user LLM cost total not match the vendor invoice at month-end?
The most common causes are model fallbacks priced at the wrong rate, prompt-cached tokens counted at full price, and dropped usage rows from mid-stream disconnects. Run a reconciliation by summing all attributed costs and comparing to the invoice total. Gaps larger than 3-5% usually trace to one of those three categories.
Can I implement per-user LLM cost attribution without running a self-hosted proxy?
Yes. You can instrument attribution at the SDK layer by wrapping every API call in a helper that logs the response metadata alongside the user identifier. Langfuse provides client-side SDKs that capture traces without sitting in the request path. The tradeoff is that SDK-side logging is easier to omit in async worker code paths, which is the same failure mode that causes shared inference pools to lose user identity.
How do I attribute costs correctly for batched embedding or classification calls that serve multiple users?
Split the aggregate usage proportionally to the character or token count of each input in the batch. For each batch, record the total input tokens from the response, compute each item's fraction of the total input length, and multiply that fraction by the batch cost. Store the item-to-user mapping in your application before you send the batch. No gateway can reconstruct this mapping automatically after the fact.