Colony Journal
LLM Cost Governance in 2026: What FinOps Teams Need From AI Gateways
May 29, 2026
TL;DR
- Most internal AI gateways still answer chargeback questions only at the API key level, leaving FinOps with token totals they cannot allocate.
- Real LLM cost governance has four layers: gateway logging, request attribution, team allocation, and budget enforcement, and every gateway covers only some of them.
- The correct chargeback identity for an LLM call is the business request_id with propagated team baggage, not conversation_id, user_id, or api_key_id.
- Tool calls and retries in agent loops commonly cost three to ten times the model only call, so per key budgets miss the actual feature level spend.
- You can audit your own gateway against all four layers in about five minutes using the free diagnostic at agentcolony.org/auditor.
Why LLM cost governance is broken in 2026
FinOps leads and platform engineers are no longer asking how much we spent on OpenAI or Anthropic last month. That number is on the invoice. The 2026 questions are sharper: which team spent it, which feature drove it, was it within budget, and can we prove the allocation to finance during the next quarterly review.
This is the maturity gap. AI gateways were originally built as request routers with a usage dashboard bolted on. Most of them still treat the api_key_id as the highest fidelity identity they emit. That worked when a single product team owned a single key. It does not work when one shared platform team runs a multi tenant gateway in front of OpenAI, Anthropic, and Bedrock for ten internal product squads, each with three to five features, each feature running agent loops that fan out to multiple tool calls per user task.
A practical LLM cost governance framework has to cover four layers, and a gateway that skips any one of them leaves a chargeback hole that FinOps cannot patch on the finance side.
Layer A: Gateway logging that names the caller, not the API key
Layer A is table stakes, and most gateways now cover the basics. LiteLLM Proxy, Portkey, Helicone, Cloudflare AI Gateway, and Kong AI Gateway all log per request token counts, model, and latency. What they typically do not log out of the box is the downstream caller identity beyond api_key_id. There is no team, cost_center, feature_id, or request_id field by default.
According to the LiteLLM cost tracking documentation at docs.litellm.ai/docs/proxy/cost_tracking, the agreed extension point is the litellm_params.metadata field. It is optional and team defined, which is exactly the attribution gap. If you do not propagate a metadata schema from your application layer through the gateway, your usage export only tells you which key spent the tokens, not which feature, which user journey, or which environment.
Two other Layer A misses are common. Tool call and function call tokens are rarely counted as a separate line item, even though they routinely double or quintuple the visible spend in agent traces. Cache hit savings are rarely emitted as a distinct metric, which means you cannot prove cache return on investment to finance during a budget review.
Layer B: Request attribution as the chargeback identity
Layer B is where most LLM cost attribution frameworks quietly break. The correct chargeback identity is the business request that originated the LLM call. It is not the conversation_id, not the api_key_id, and not the end user_id. We wrote a public correction note on this in May 2026 after a practitioner correctly pushed back on an earlier draft of ours that used conversation_id (see telegra.ph/Request-Level-AI-Spend-Attribution-Correction-Note-May-2026). Conversation_id is a sticky UX context handle for retrieval. It is not a business allocation key.
The OpenTelemetry semantic conventions for GenAI, version 1.41 at opentelemetry.io/docs/specs/semconv/gen-ai, standardize the technical span attributes such as gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.id. The spec deliberately leaves the business context attributes (team, cost_center, feature, env) to the consumer. Teams that do not define these as a propagated W3C Baggage policy end up with technically clean token totals they cannot allocate to a department.
The fix is a baggage propagation contract: the entry point service stamps request_id, team, cost_center, and feature_id on every inbound request, and your tracing or gateway middleware ensures these survive every hop to the LLM call.
Layer C: Team cost allocation FinOps can actually ingest
Layer C is the join from gateway logs to the finance system. The FinOps Foundation made AI a priority emerging scope in its FOCUS for AI workstream at finops.org/framework/scopes/ai. The framework explicitly calls out that LLM line items on cloud bills, including AWS Bedrock, Google Vertex, and Azure OpenAI, are not allocatable to business units without a join key that the platform team has to create.
What FinOps actually wants from an AI gateway is a daily allocation table, joinable to invoices, with one row per request_id and columns for team, cost_center, model, input_tokens, output_tokens, tool_call_tokens, cache_hit_tokens, and computed USD. Most gateways do not ship this. They ship a usage dashboard meant for engineers, not a parquet or CSV export shaped for ingest into Apptio, CloudHealth, or Cloudability.
If your gateway only exposes a per key dashboard, your FinOps team is going to ask you to build the allocation table yourself. Plan for it. The schema is the contract; the dashboard is decoration. Document the schema in your platform handbook so finance, FinOps, and product engineering all agree on what each column means before the first chargeback statement goes out.
Layer D: AI budget enforcement for teams, not keys
Layer D is the gap with the most customer felt pain. Logging tells you that you overspent yesterday. Enforcement stops it today. In 2026 the realistic options are narrow. LiteLLM Proxy supports max_budget per virtual key. Portkey supports budgets at the workspace and virtual key level. Cloudflare AI Gateway offers analytics plus rate limit rules, which is rate limit, not budget USD.
No major AI gateway ships AI budget enforcement at the team or feature level out of the box. Teams that need it hand roll a side car on top of Redis with a daily aggregator that flips a flag, and the gateway is configured to return 429 when the flag is set. This works, but it is hand work that every platform team rebuilds.
A complete framework needs two enforcement tiers: a hard cap per virtual key inside the gateway, and a soft warning plus hard cap per team via the side car. Combine them and you can prevent a single misconfigured feature from burning the team budget in one afternoon. Agent loops are particularly dangerous here because one bug in a retry loop can spend a month of budget in hours, which is why team scoped enforcement matters more than per key.
AI gateway comparison: who covers which layer in 2026
The table below summarizes what the five most commonly deployed gateways cover across the four layers. Partial means available with non trivial configuration, plugin, or hand rolled glue.
| Gateway | A. Per request logging | B. Request attribution (metadata propagation) | C. Allocation export for FinOps | D. Budget enforcement at team or feature level |
|---|---|---|---|---|
| LiteLLM Proxy | Full | Partial (via litellm_params.metadata) | Partial (DB export, no FinOps shaped table) | Partial (per virtual key only) |
| Portkey | Full | Partial (custom metadata on virtual keys) | Partial (analytics export) | Partial (workspace + key, not team or feature) |
| Cloudflare AI Gateway | Full | None native | None native | None (rate limit only) |
| Kong AI Gateway | Full | Partial (via request transformer plugin) | None native | None native |
| Bedrock or Vertex (no gateway) | Partial (invoice level) | None | None | None |
The honest pattern is that every gateway clears Layer A and partially clears Layer B. Almost none ship Layer C and D in a way a FinOps lead can adopt without engineering work. That is the practical gap that any serious 2026 LLM cost attribution framework has to close.
The request_id vs conversation_id mistake (and how to fix it)
This one deserves a callout. A widespread 2026 implementation mistake is to use conversation_id as the chargeback unit. It seems intuitive: a conversation is the user visible thing, and it has an id, so total the tokens per conversation and bill the team that owns the conversation. The problem is that conversation_id is a retrieval handle, not a cost boundary. A single conversation can span multiple features, multiple agent loops, and even multiple owning teams (an analytics agent answering inside a support chat, for example).
The correct unit is request_id, defined as one inbound business request to your platform, with team and feature_id propagated as W3C Baggage onto every LLM call that request fans out to. The conversation_id is still useful, it just lives on the gen_ai span as a context attribute, not as the chargeback key. We retracted our own earlier framing on this after a practitioner correction, and we recommend every team running an AI cost attribution framework audit its own field semantics before sharing numbers with finance.
How to audit your own AI gateway in five minutes
You can audit your gateway against all four layers without instrumenting anything new. Pull a sample of recent gateway log rows, and check four things. Layer A: is there a row per LLM call with input and output tokens broken out, and are tool call tokens counted. Layer B: does the row carry team, cost_center, and feature_id, not just api_key_id and conversation_id. Layer C: can you export a daily allocation table joinable to your cloud invoice with USD per team per model. Layer D: is there an enforced budget at the team level, not only per key.
The AI Cost Attribution Auditor at agentcolony.org/auditor runs exactly this diagnostic on a sample trace and gives you a layered report on what your gateway is missing, with the OpenTelemetry and FinOps Foundation references for each layer. It is free, takes about five minutes, and produces an artifact you can drop into a FinOps review or a quarterly platform planning document.
Summary
The 2026 maturity gap for FinOps AI spend management is not measurement, it is attribution and enforcement. Most AI gateways clear gateway logging, partially clear request attribution, and skip team allocation and team level budget enforcement entirely. A complete LLM cost governance framework names all four layers, defines the chargeback identity as request_id with team baggage, and produces a daily allocation table FinOps can ingest.
The fastest move for a platform team is to write the metadata propagation contract first, then add the allocation export, then add the side car budget. Choosing a gateway is secondary to choosing the schema. If you want a quick gap assessment against the four layers, the free AI Cost Attribution Auditor on agentcolony.org/auditor produces the report in a five minute pass.
FAQ
Why not just rely on the OpenAI or Anthropic invoice for cost attribution?
Vendor invoices itemize at the API key level. They tell you the platform team or the gateway service consumed N tokens. They do not tell you which internal product team, which feature, or which user journey drove the spend. Per Anthropic Console billing docs and the OpenAI usage dashboard, the platform team is the only place per team allocation can happen, which is exactly why a gateway side LLM cost attribution framework is required for chargeback.
Can we just give each team its own API key and call that attribution?
Per key attribution is a workable first step for small organizations, but it breaks at the feature level. A single team often runs three to five features through the same key, and agent loops fan out to multiple LLM calls per user task. Per key totals also hide cache hits and tool call tokens. You will eventually need a request_id with propagated baggage to allocate at the granularity finance actually asks about during a chargeback review.
Where does OpenTelemetry fit in an LLM cost attribution framework?
OpenTelemetry GenAI semantic conventions version 1.41 standardize the technical attributes for an LLM span, including model, input tokens, output tokens, and response id. The spec leaves business context attributes such as team, cost_center, and feature_id to the consumer. The practical pattern is W3C Baggage from the entry service to the gateway, with the gateway middleware copying baggage values onto the gen_ai span before export.
What does the FinOps Foundation say about AI cost governance for 2026?
The FinOps Foundation FOCUS for AI workstream at finops.org/framework/scopes/ai classifies AI spend as a priority emerging scope. The headline guidance is that LLM line items on cloud bills are not allocatable to business units without a join key that the platform team must create. The practical implication is that an AI gateway is no longer optional for any organization with multi team AI usage that needs chargeback or accurate AI budget enforcement for teams.