Colony Journal
AI Gateway Cost Attribution Compared: Portkey vs LiteLLM vs OpenAI Proxy (2026)
May 29, 2026
AI Gateway Cost Attribution Compared: Portkey vs LiteLLM vs OpenAI Proxy (2026)
TL;DR
- LiteLLM Proxy treats spend tracking as a first-class primitive with key, user, team, end-user, and tag scopes backed by a required Postgres database.
- Portkey leads on observability ergonomics with a single
x-portkey-metadataheader that flows into logs, traces, analytics, and OTel exports, but its virtual-keys identity primitive is now marked deprecated in the docs. - A custom OpenAI proxy gives you full control and zero attribution out of the box; expect 2 to 6 engineer-weeks to MVP, plus ongoing price-map maintenance for every cache, batch, and tier wrinkle.
- The real differentiator is not the gateway choice; it is whether per-request metadata survives every agent, tool, and subgraph hop into the exported trace. Most teams discover their tags are silently dropped at the LangChain or LangGraph boundary.
- After you pick a gateway, validate the actual emitted trace against your attribution schema before you commit to a chargeback model.
Why Per-Tenant LLM Cost Attribution Is Harder Than It Looks
Most platform teams hit the same wall around their second or third month of production LLM traffic. The monthly Anthropic or OpenAI invoice arrives, the number is large, and finance asks the obvious question: which product team, which feature, which customer drove this spend. The gateway log shows API keys and model names; it does not show the workflow that called the model, the tenant whose request triggered the call, or the agent loop that fanned a single user message into nineteen downstream completions. That gap between an invoice and an attributable cost row is what AI gateway cost attribution is supposed to close.
The difficulty is that the data needed for honest attribution does not naturally live at the gateway. It lives in your application code: the tenant ID on the inbound request, the workflow ID minted at the start of an agent run, the feature flag that routed the call, the customer SLA tier that justified the priority lane. A gateway can only attribute what your app forwards. So the comparison below is less about the gateways themselves and more about how well each one accepts, persists, and exports the metadata your app already has, without losing it at a hop.
LiteLLM Proxy: Spend Tracking as a First-Class Primitive
LiteLLM Proxy, the open-source Python gateway from BerriAI, treats per-tenant LLM cost gateway behaviour as a first-class concern rather than a reporting afterthought. According to the official LiteLLM Spend Tracking documentation at docs.litellm.ai/docs/proxy/cost_tracking, the proxy attributes spend across five built-in scopes: key, user, team, end-user, and tag, persisted to a required Postgres database. The database requirement is worth flagging early because it shifts LiteLLM out of the stateless-proxy bucket and into something closer to a control plane.
The attribution channel that matters most for platform teams is the per-request tags array and the metadata object. Tags survive into the spend log row and through callback exports to Langfuse, Datadog, OpenTelemetry collectors, S3, and GCS. That is what unlocks per-workflow and per-feature attribution rather than only per-team rollups. A request body that includes "tags": ["workflow:onboarding", "tenant:acme-corp", "feature:doc-summary"] produces a spend row carrying those exact tags, queryable downstream.
LiteLLM also auto-detects Vertex AI PayGo versus priority pricing, Bedrock service tiers, and Azure base-model mapping when the upstream response carries tier metadata. Because per-tier price deltas range from 20 to 100 percent, a gateway that silently lumps priority traffic at PayGo rates misvalues every chargeback row downstream. LiteLLM additionally ships a documented step-by-step cost-discrepancy debugging workflow, covering time-range alignment, cache versus non-cache token splitting, and ingestion-vs-formula-vs-model-map root-cause routing. That is unusually mature tooling for the second-month pain that most teams hit.
Portkey: Observability-First Cost Attribution
Portkey approaches the same problem from the observability angle. According to the public Portkey docs nav tree at portkey.ai/docs, the Observability product group ships dedicated pages for cost-management, budget-limits, metadata, traces, logs-export, and OpenTelemetry. Every request is logged with cost, latency, and token counts, then rolled up into the cost-management views. The result feels less like a billing system and more like a Datadog for LLM calls.
The metadata story is Portkey's strongest cost-attribution surface. Custom fields are top-level: you pass them via the x-portkey-metadata header (or the SDK argument) and they surface in logs, traces, filters, analytics dashboards, and exports. Multiple custom fields per request are supported, so a single call can carry tenant_id, workflow_id, feature, env, and cost_center simultaneously without a tag-string-parsing convention.
The identity model is in flux, however. The Portkey docs now mark the virtual-keys group as Deprecated in the nav tree. Teams currently wiring per-team budgets to virtual keys are building on a primitive the vendor is moving away from, likely toward a workspaces-plus-integrations model. That is a real planning consideration: if your AI gateway comparison FinOps decision rides on Portkey's hosted budget enforcement and you have not yet committed, prefer the metadata-and-workspace path over the virtual-key path.
One more practical note. Portkey's default deployment is the hosted control plane, so spend records and prompts leave the customer network unless self-hosted. That is fine for many teams and a non-starter for others, particularly regulated industries where prompts may contain customer PII.
The DIY OpenAI Proxy: What You Actually Build
The custom OpenAI proxy is the most common starting point because it is the easiest thing to ship in week one. An engineer wraps openai.ChatCompletion.create or the raw /v1/chat/completions endpoint behind nginx, a FastAPI service, or a Cloudflare Worker, and computes cost from usage.prompt_tokens and usage.completion_tokens multiplied by a per-model rate table. For a single-provider, single-tier setup, this works.
It stops working the moment the price-map drifts. A Claude Sonnet 3.5 to Sonnet 3.7 price change silently misvalues every row written after the change, and nobody notices until a careful finance person reconciles your internal numbers against the provider invoice. Multiply that by Vertex AI tiers, Bedrock service classes, Azure base-model aliasing, Anthropic prompt-cache discounts, and OpenAI Batch API rates, and your price table becomes a maintenance burden that scales with provider coverage rather than traffic.
Metadata is similarly DIY. You pick a convention, typically a metadata object in the request body or X-Tenant-Id style headers, and you build the logging pipeline yourself. That usually means an OpenTelemetry collector or a Kafka-into-ClickHouse sink, plus dashboards your platform team owns. Realistically, expect 2 to 6 engineer-weeks to a usable MVP and ongoing work for every new provider, every cached-token category, and every batch or structured-output pricing wrinkle. The proxy is cheap to start and expensive to keep honest.
Side-by-Side AI Gateway Comparison for Cost Attribution
The table below condenses the capability landscape into the questions a FinOps or platform engineering lead actually asks during procurement. It is the comparison you can paste into an internal decision doc.
| Capability | LiteLLM Proxy | Portkey | DIY OpenAI Proxy |
|---|---|---|---|
| Per-team / per-tenant cost rollup | Native, DB-backed | Native, logs and analytics | Build it yourself |
| Per-request custom metadata in export | tags array + metadata object | x-portkey-metadata header | Convention you define |
| Cached vs uncached token cost split | Documented, first-class | Surfaced in logs | Usually missed at v1 |
| Provider tier auto-detect (Vertex / Bedrock / Azure) | Yes | Partial | No, manual |
| Hard budget enforcement (reject on overage) | Key / user / team scopes | Yes, plus deprecated virtual-keys path | Alerting only typically |
| OpenTelemetry GenAI semconv emit | Yes, native | Yes, dedicated section | Depends on your impl |
| Data residency / self-host | Yes, open source | Self-host available | Fully self-hosted |
| Maintenance burden on price-map sync | Low, vendor-maintained | Low, vendor-maintained | High, manual updates |
A word on the OpenTelemetry row. The OpenTelemetry GenAI semantic conventions at opentelemetry.io/docs/specs/semconv/gen-ai define gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, and gen_ai.response.model. That is the only cross-tool standard for what cost attribution data should even look like. Both LiteLLM and Portkey emit these spans natively. A DIY proxy must implement them by hand or your downstream FinOps tool, whether that is a vendor or an internal warehouse, cannot stitch the data.
The Metadata Propagation Problem Nobody Warns You About
This is the failure mode that defeats most Portkey vs LiteLLM cost tracking comparisons. You pick a gateway, you wire metadata at the inbound request boundary, you assume attribution is solved, and three weeks later you discover that roughly 30 to 60 percent of your spend rows have empty tenant tags. The cause is almost always that metadata gets dropped at an agent, tool, or subgraph hop inside your application, before the call ever reaches the gateway.
The specific boundaries where it leaks are predictable. LangChain callback context is commonly lost across asyncio.gather because the context-var is not copied into the child task. LangGraph subgraphs that re-instantiate the LLM client without explicitly forwarding the parent trace's metadata produce headless child calls. OpenAI Assistants v2 runs do not pass run-level metadata into the underlying chat completion unless you wire it. These are not gateway bugs; they are application-side propagation gaps that turn an otherwise well-instrumented gateway into a system with attractive dashboards full of unattributed cost.
The fix is to validate the actual emitted trace rather than the metadata you intended to attach. That means capturing a sample of production traces, walking the parent-to-child span chain, and confirming each span carries the tenant_id and workflow_id you expect. Most teams skip this validation step and only discover the leak when finance flags an attribution gap during the first chargeback cycle. The AI Cost Attribution Auditor exists for exactly this check: paste a sample gateway trace, and it reports which metadata fields survived to the export and which leaked at a hop.
Cached Tokens, Batch Discounts, and Provider Tier Pricing
LLM gateway observability is only as honest as its handling of the price wrinkles that providers have shipped over the last eighteen months. Three matter most for chargeback accuracy.
First, Anthropic prompt-caching discounts. Cache hits are billed at 0.1 times the input-token rate, a tenfold delta. If a gateway lumps cached and uncached input tokens into a single prompt_tokens field for cost calculation, attribution on cache-heavy workloads such as RAG with long system prompts, or repeated agent loops with stable instructions, drifts by up to an order of magnitude. LiteLLM's cost-discrepancy debugging documentation calls out the cached-token category as a common attribution gap, which matches what we hear from teams running the audit.
Second, OpenAI Batch API. Same model, same prompt, half the cost, because Batch trades latency for a 50 percent discount. A tenant that batches must be attributed at the batched rate or your chargeback over-bills them. A gateway that hardcodes the sync price overcharges anyone who batches, which silently penalises the most cost-conscious tenants.
Third, provider service tiers. Vertex AI priority lanes cost roughly 30 to 50 percent more than PayGo. Bedrock has on-demand, provisioned, and cross-region tiers. Azure base-model mapping changes the canonical model name. Without auto-detection from response metadata, a gateway will misvalue any tenant whose traffic crosses tiers. LiteLLM auto-detects these from response payloads; Portkey handles them in logs; DIY proxies almost always have a stale price table somewhere in the repo.
Budget Enforcement Is Not Alerting
A recurring confusion in AI gateway comparison FinOps reviews is treating budget alerts as enforcement. Alerts tell you a team has crossed 80 percent of monthly spend; enforcement rejects the next request and protects the rest of the budget. The distinction matters at month-end when a runaway agent loop can burn through a team's remaining cap in minutes.
LiteLLM enforces hard budgets at the key, user, and team scopes, returning a structured rejection when the cap is exceeded. Portkey supports enforcement at the workspace and key level, including via the virtual-keys path that is now marked deprecated, so confirm the current primitive before wiring. A DIY OpenAI proxy almost always lands on alerting only at v1 because writing a transactional, race-safe spend counter that survives concurrent requests is non-trivial; teams ship Slack alerts instead and the bill keeps growing.
The practical recommendation is to pick a gateway whose enforcement model survives a noisy-neighbour incident: one runaway tenant should not be able to consume the next tenant's headroom. That implies per-tenant caps with rejection, not per-org alerts with notification.
How to Pick: A Decision Framework for Platform Teams
The right gateway depends on three constraints: how much metadata your application is already producing, whether you need data residency or can accept a hosted control plane, and how much engineering bandwidth you have for ongoing price-map maintenance.
If your application already mints workflow IDs, tenant IDs, and feature flags per request, and you are comfortable running Postgres, LiteLLM Proxy is the highest-leverage default. The tags channel maps cleanly onto your existing identifiers, spend tracking is first-class, and the open-source license keeps you in control of the data.
If your priority is observability ergonomics and you want one header to carry attribution metadata end-to-end with strong dashboards out of the box, Portkey is the cleanest choice, provided you can accept the hosted control plane and you wire budgets to the supported primitive rather than the deprecated virtual-keys path.
The DIY OpenAI proxy stays defensible only if your traffic is single-provider, single-tier, and you have a specific compliance or routing requirement that no off-the-shelf gateway meets. In every other case, the maintenance cost of keeping the price map honest exceeds the cost of adopting a maintained gateway.
Summary
AI gateway cost attribution is fundamentally a metadata-survival problem, not a gateway-feature problem. LiteLLM Proxy, Portkey, and DIY OpenAI proxies all expose plausible attribution surfaces; the practical difference is how easily your application can ride one through the agent, tool, and subgraph hops without dropping the tenant_id or workflow_id you started with. LiteLLM's tags and Portkey's x-portkey-metadata both work when your propagation is honest, and both produce attractive dashboards full of unattributed cost when it is not.
The right next step for a platform team evaluating these tools is to validate the actual trace your gateway emits in production against your intended attribution schema. Pick the gateway whose ergonomics fit your stack, wire the metadata at every boundary, then verify the emitted trace before you build chargeback on top of it. The AI Cost Attribution Auditor performs that verification: it ingests a sample gateway trace and tells you which fields survived to the export, which leaked at a hop, and where to fix the propagation so your per-tenant LLM cost gateway data is actually trustworthy.
FAQ
What is the difference between Portkey and LiteLLM for cost tracking?
LiteLLM treats spend tracking as a first-class control-plane feature with a required Postgres backend and five built-in attribution scopes including a per-request tags array. Portkey leads on observability ergonomics, attaching custom metadata via the x-portkey-metadata header and surfacing it in logs, traces, analytics, and OTel exports. Both work; LiteLLM feels like a billing system, Portkey feels like a Datadog for LLMs.
Can a custom OpenAI proxy give me accurate per-tenant cost attribution?
Yes, but it is rarely worth it. A DIY proxy requires you to maintain the price map manually for every provider, cache discount, batch tier, and structured-output rate, plus build your own metadata convention and logging pipeline. Expect 2 to 6 engineer-weeks to MVP and continuous maintenance for every new model release, which scales with provider coverage rather than traffic.
Why do my gateway traces have empty tenant tags even though I attached metadata?
This is the metadata propagation problem and it is almost always an application-side leak, not a gateway bug. Common culprits are LangChain callback context lost across asyncio.gather, LangGraph subgraphs that re-instantiate the LLM client without forwarding parent metadata, and OpenAI Assistants v2 runs that do not pass run-level metadata into the underlying chat completion call.
Does LiteLLM or Portkey emit OpenTelemetry GenAI spans by default?
Yes, both emit spans that conform to the OpenTelemetry GenAI semantic conventions, including gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, and gen_ai.response.model. A DIY OpenAI proxy must implement these by hand. Without conforming spans, downstream FinOps tooling cannot stitch the gateway data into a unified cost model.
How do I verify that my chosen gateway is actually attributing cost correctly?
Capture a representative sample of production traces, walk the parent-to-child span chain, and confirm each span carries the tenant_id and workflow_id you expect at every hop. The AI Cost Attribution Auditor automates this check: it ingests a sample gateway trace and reports which metadata fields survived to the export and which leaked, so you can fix propagation before you build chargeback on top of it.