Colony Journal

LLM Observability vs AI Cost Attribution: Why FinOps Teams Cannot Use One for the Other

May 31, 2026

TL;DR:

  • LLM observability tools (Langfuse, LangSmith, OTel-native backends) tell you WHAT happened in a request: latency, token counts, error rates. They do not tell you who to charge.
  • AI cost attribution requires request-level organizational metadata (team_id, project_id, cost_center) and a billing-grade audit log, not a trace span.
  • The OpenTelemetry GenAI Semantic Conventions v1.41.0 define zero attributes for organizational identity, so spans alone produce unattributable spend.
  • The FinOps FOCUS specification operates at billing-record granularity and is the right standard for chargeback reporting, while OTel covers operational health.
  • A mature LLM cost tracking stack runs both layers: an observability tool for SREs and an AI gateway FinOps tier for engineering managers and finance.

Why the conflation costs FinOps teams real money

Picture a familiar conversation. The CFO asks: who owns the $48,300 we spent on OpenAI last month, broken down by product line. The platform engineering lead opens Langfuse, runs a query, and produces a beautiful waterfall of 19,200 traces showing average latency, finish reasons, and token throughput. None of those traces answer the CFO's question. The Search team blames the Agents team. The Agents team blames the nightly embedding pipeline. The invoice sits on a shared API key, and nobody can tell finance what to charge where.

This is not a tooling failure in the usual sense. Langfuse is doing exactly what it was built to do, which is LLM observability. The CFO asked about AI cost attribution. These are different problems with different data models, different standards, and different stakeholders. Teams that buy an observability product expecting it to solve chargeback end up with elegant dashboards and a still-unanswered allocation question.

This article walks through the difference for senior platform engineers and FinOps practitioners who have to build the LLM cost tracking layer in 2026. It covers data models, standards, where the two layers diverge in practice, and a concrete implementation path that runs both side by side without forcing either team to do the other team's job at month end.

What LLM observability actually captures

According to the OpenTelemetry GenAI Semantic Conventions v1.41.0, the official cross-vendor reference, each LLM call becomes a span carrying attributes such as gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.provider.name (the provider, for example openai; earlier releases of the conventions named this gen_ai.system), gen_ai.request.model (for example gpt-4o), and gen_ai.response.finish_reasons. Latency is the span duration. Error rate is derived from span status codes. This is a per-call operational signal, routed to backends like Jaeger, Grafana, Datadog APM, LangSmith, Langfuse, and OTel-native projects such as Lumina.

None of the standard attributes in that spec include team_id, project_id, cost_center, tenant_id, or budget_owner. A standards-compliant span tells you a call happened, what model it used, and how many tokens it burned. It does not tell you whose budget should be debited. That is by design: observability standards focus on the request itself, not the organizational context around it.
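The gap can be shown with a few lines of Python. This is a minimal sketch, not an OTel SDK example: the attribute names are the ones quoted above, the values are invented, and `ORG_KEYS` and `missing_org_identity` are hypothetical helpers introduced only for illustration.

```python
# Illustrative span attributes using the attribute names quoted above.
# The values are made up for this example.
span_attributes = {
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 1240,
    "gen_ai.usage.output_tokens": 380,
    "gen_ai.response.finish_reasons": ["stop"],
}

# Organizational keys a chargeback pipeline needs (names taken from the text,
# not from any standard).
ORG_KEYS = ("team_id", "project_id", "cost_center", "tenant_id", "budget_owner")


def missing_org_identity(attributes: dict) -> list[str]:
    """Return the organizational keys that are absent from a span's attributes."""
    return [key for key in ORG_KEYS if key not in attributes]


# Every organizational key is missing from a span that carries only the
# standard operational attributes.
print(missing_org_identity(span_attributes))
```

A check like this could run in a collector or CI step to flag spans that would be unattributable if they were ever used for cost reporting.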

Practitioners often try to bolt organizational identity onto spans through custom attributes. That works for ad hoc analysis, but it leaves the data inside a telemetry backend whose retention window is often only 7 to 30 days and whose storage is optimized for query patterns like p95 latency over the last hour. Chargeback systems need a 12-month rolling history, immutable records, and reconciliation against provider invoices. Telemetry stores are the wrong shape for that workload, and asking your APM vendor for cost-by-team reports usually ends in a sad CSV export.

What AI cost attribution actually requires

AI cost attribution operates at a different granularity. The FinOps Foundation maintains the FOCUS specification (FinOps Open Cost and Usage Specification), the authoritative standard for billing normalization. FOCUS columns include SubAccountId, ResourceId, Tags, BilledCost, ConsumedUnit, and ServiceName. The spec explicitly covers AI providers and SaaS, with AWS, Azure, GCP, Databricks, Vercel, and Redis listed as FOCUS data generators. The unit of analysis is a billing record, not a trace.

To run an LLM chargeback pipeline you need three things that observability alone does not give you. First, a request-level identifier linking each API call to an organizational unit. Second, a propagation mechanism that carries that identifier from the calling service through the gateway or proxy and into a persistent log. Third, a reporting pipeline that rolls call-level costs up to the billing unit your finance team actually uses, whether that is a cost center code, a product label, or a team name.

The practical shape of this is an AI gateway FinOps layer. Either you segment by API key (OpenAI Projects, Anthropic Workspaces) so the provider invoice already breaks down by project, or you put a gateway in front (LiteLLM, custom Envoy filter, TensorWall, vendor gateways) that tags every request before forwarding upstream. The audit log from that gateway becomes the source of truth for chargeback. A typical record looks like:

{
  "ts": "2026-05-28T14:02:11Z",
  "request_id": "req_8a3f",
  "team": "search-platform",
  "project": "semantic-rerank-v3",
  "environment": "prod",
  "model": "gpt-4o",
  "input_tokens": 1240,
  "output_tokens": 380,
  "cost_usd": 0.0247
}

That record carries everything FOCUS needs: organizational tags, consumed units, billed cost. It carries none of the observability signals an SRE needs to investigate latency. The two systems have to coexist because they answer different questions for different audiences.
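One way to see how the record lines up with FOCUS is to map it onto the column names mentioned earlier. The sketch below is an assumption-heavy illustration: the choice of which field feeds which column, the "LLM inference" service label, and the `to_focus_row` helper are all hypothetical, and `ConsumedQuantity` is a FOCUS column not named in the text above. A gateway-computed cost is also only an estimate of BilledCost until it has been reconciled against the provider invoice.

```python
# The sample audit record from above.
record = {
    "ts": "2026-05-28T14:02:11Z",
    "request_id": "req_8a3f",
    "team": "search-platform",
    "project": "semantic-rerank-v3",
    "environment": "prod",
    "model": "gpt-4o",
    "input_tokens": 1240,
    "output_tokens": 380,
    "cost_usd": 0.0247,
}


def to_focus_row(rec: dict) -> dict:
    """Map one gateway audit record to a FOCUS-style row (illustrative mapping)."""
    return {
        "ServiceName": "LLM inference",      # assumed label, not from the spec
        "SubAccountId": rec["team"],         # assumption: one sub-account per team
        "ResourceId": rec["model"],          # assumption: the model is the resource
        "Tags": {
            "project": rec["project"],
            "environment": rec["environment"],
        },
        "ConsumedQuantity": rec["input_tokens"] + rec["output_tokens"],
        "ConsumedUnit": "Tokens",
        "BilledCost": rec["cost_usd"],       # estimate until reconciled
    }


focus_row = to_focus_row(record)
print(focus_row)
```

Your finance team may prefer a different mapping, for example cost center in SubAccountId and team in Tags. The point is only that every target column has a source field in the audit record.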

Comparison: LLM observability vs AI cost attribution side by side

To make the divergence concrete, the table below contrasts the two layers across the dimensions that matter to a platform team picking tools in 2026. Both are necessary in a mature stack. Neither can substitute for the other. The categories follow how the OTel GenAI spec frames spans and how FOCUS frames billing records, so the contrast reflects what the two standards specify rather than opinion.

Dimension | LLM Observability | AI Cost Attribution
Data model | Spans, traces, metrics | Request metadata plus billing records
Primary standard | OTel GenAI Semantic Conventions | FinOps FOCUS Specification
Key attributes | Latency, token counts, error rate, model | team_id, project_id, cost_center, USD amount
Primary tools | Jaeger, Grafana, LangSmith, Langfuse, Lumina | Gateway audit logs, Opsmeter, custom FOCUS pipelines
Primary stakeholders | SRE, platform engineers | FinOps, engineering managers, finance
Output | Dashboards, distributed traces, SLO charts | Chargeback reports, budget burn, cost-by-team
Answers | WHAT happened and HOW it performed | WHO to charge, WHICH team, HOW MUCH
Failure mode | Missing spans equals reliability blind spot | Missing tags equals unattributable dark spend

Reading the table left to right is the operational lens. Reading it right to left is the financial lens. A team that owns both columns is rare. More commonly a platform engineer ships the left column and a FinOps analyst inherits the right column with no infrastructure to populate it, which is exactly the gap the AI gateway FinOps category fills. Tools like Opsmeter, TensorWall, and GenOps AI exist because the market recognized that a Datadog GenAI integration does not produce a chargeback report on its own.

A concrete example: an OpenLIT or Langfuse trace will tell you that at 02:14 UTC token throughput jumped 340 percent and p95 latency rose to 9.1 seconds. It will not tell you whether that came from the Search team's new semantic rerank, the nightly batch embedding job, or an agent loop someone shipped at 01:50. That second question is an attribution question, and answering it requires structured request metadata that observability schemas do not require by default.
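With tagged records, the attribution question becomes a simple group-by over the spike window. The following is a minimal sketch: the records, team names, and the `tokens_by_team` helper are invented for illustration and do not come from any particular tool.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical tagged audit records around the spike.
records = [
    {"ts": "2026-05-28T01:55:00Z", "team": "search-platform", "input_tokens": 1200, "output_tokens": 300},
    {"ts": "2026-05-28T02:10:00Z", "team": "agents", "input_tokens": 9000, "output_tokens": 4000},
    {"ts": "2026-05-28T02:12:00Z", "team": "agents", "input_tokens": 8500, "output_tokens": 4200},
    {"ts": "2026-05-28T02:13:00Z", "team": "embeddings-batch", "input_tokens": 3000, "output_tokens": 0},
]


def parse_ts(ts: str) -> datetime:
    """Parse the UTC timestamp format used in the audit records."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def tokens_by_team(recs, start: datetime, end: datetime) -> list[tuple[str, int]]:
    """Total tokens per team for records with start <= ts < end, largest first."""
    totals = Counter()
    for rec in recs:
        if start <= parse_ts(rec["ts"]) < end:
            totals[rec["team"]] += rec["input_tokens"] + rec["output_tokens"]
    return totals.most_common()


window_start = datetime(2026, 5, 28, 2, 0, tzinfo=timezone.utc)
window_end = datetime(2026, 5, 28, 2, 15, tzinfo=timezone.utc)
print(tokens_by_team(records, window_start, window_end))
# [('agents', 25700), ('embeddings-batch', 3000)]
```

The same query is impossible against untagged spans, because there is no `team` field to group on.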

Where the two layers diverge in production

A Show HN post from Opsmeter (March 15, 2026, HN item 47390935) framed the practitioner pain directly: most teams only notice AI cost issues when the invoice arrives, and provider dashboards explain neither what changed nor which part of the product caused the change. That gap is not solvable by adding more spans. It is solvable by adding organizational metadata at the request boundary, before the call leaves your perimeter.

A separate HN thread on TokenGate from March 2026 surfaced the same failure pattern from the other direction: developers sharing one Anthropic API key, agents looping at 03:00, and a finance team with no way to apportion the resulting spike. The organization has full observability of the spike. It has zero attribution of the spike. Per-team AI spend cannot be recovered retroactively unless the requests were tagged when they were issued.

The pattern is not specific to LLM APIs. A May 2026 r/mlops discussion described the identical problem at the GPU layer: DCGM tells you a node is at 90 percent utilization but does not tell you which team, pod, or job is driving it. The author shipped l9gpu to emit GPU metrics via OTLP with workload attribution baked in (pod, namespace, deployment, or Slurm job and partition). The lesson generalizes: telemetry without organizational identity becomes unattributable, and bolting identity on retroactively is far harder than tagging at the request boundary.

A useful heuristic for platform leads: if your dashboard says the system is healthy but your finance team still cannot file a chargeback, you have observability and you do not have attribution. The two layers can share infrastructure (OTel log records can carry attribution metadata as attributes), but they cannot share schemas. FOCUS records are not spans, and spans are not invoices.

A concrete implementation path for LLM chargeback

The cleanest pattern in 2026 is a thin AI gateway tier in front of every provider, plus an immutable audit log. The gateway accepts requests over an OpenAI-compatible interface, validates that each request carries x-team, x-project, and x-environment headers (or pulls them from a signed JWT), forwards upstream, captures the token counts and computed cost on the response, and emits one append-only log record per call. LiteLLM gives you the OpenAI-compatible shim for free. The audit log emitter is a few hundred lines of code, or a vendor like Opsmeter or TensorWall if you do not want to operate it.
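The validation and logging steps can be sketched in plain Python, independent of any gateway product. Everything below is an assumption made for illustration: the `validate_headers`, `audit_record`, and `to_log_line` helpers are hypothetical, and the prices in `PRICE_PER_MTOK` are placeholder rates, so the computed cost will not match the sample record shown earlier. Substitute your contracted rates.

```python
import json
from datetime import datetime, timezone

REQUIRED_HEADERS = ("x-team", "x-project", "x-environment")

# Placeholder USD prices per one million tokens (assumption, not provider data).
PRICE_PER_MTOK = {"gpt-4o": {"input": 2.50, "output": 10.00}}


def validate_headers(headers: dict) -> dict:
    """Return attribution tags, or raise if a required header is missing or empty."""
    normalized = {name.lower(): value for name, value in headers.items()}
    missing = [h for h in REQUIRED_HEADERS if not normalized.get(h)]
    if missing:
        raise ValueError(f"missing attribution headers: {', '.join(missing)}")
    # "x-team" -> "team", and so on (str.removeprefix needs Python 3.9+).
    return {h.removeprefix("x-"): normalized[h] for h in REQUIRED_HEADERS}


def audit_record(tags, request_id, model, input_tokens, output_tokens, now=None):
    """Build one audit record from the tags and the usage reported upstream."""
    price = PRICE_PER_MTOK[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "ts": ts,
        "request_id": request_id,
        **tags,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
    }


def to_log_line(record: dict) -> str:
    """Serialize a record as one JSON line for an append-only log."""
    return json.dumps(record, sort_keys=True)


tags = validate_headers(
    {"X-Team": "search-platform", "X-Project": "semantic-rerank-v3", "X-Environment": "prod"}
)
print(to_log_line(audit_record(tags, "req_8a3f", "gpt-4o", 1240, 380)))
```

Rejecting untagged requests at the gateway, rather than logging them with empty tags, is a design choice: it trades a small amount of developer friction for an audit log that never needs backfilling.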

Pair that with a small FOCUS export job that rolls audit records into daily team-level cost summaries and reconciles them against the provider invoice. That reconciliation step matters: without it, dark spend (unattributed records, retries, gateway overhead) goes unnoticed and a few percent of monthly spend ends up uncharged. A $50,000 monthly OpenAI bill with three percent dark spend leaves $1,500 floating each month, $18,000 a year, and that is the number that gets discovered during the annual audit.
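The rollup and reconciliation can be sketched as follows. This is an illustration under stated assumptions: the record amounts are invented so that the totals reproduce the $50,000 and three percent figures above, and `daily_team_costs` and `reconcile` are hypothetical helper names.

```python
from collections import defaultdict


def daily_team_costs(records) -> dict:
    """Sum cost_usd per (day, team); untagged records go to UNATTRIBUTED."""
    totals = defaultdict(float)
    for rec in records:
        day = rec["ts"][:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        team = rec.get("team") or "UNATTRIBUTED"
        totals[(day, team)] += rec["cost_usd"]
    return dict(totals)


def reconcile(records, invoice_total_usd: float) -> dict:
    """Compare attributed spend with the provider invoice total."""
    attributed = sum(rec["cost_usd"] for rec in records if rec.get("team"))
    dark = invoice_total_usd - attributed
    return {
        "attributed_usd": round(attributed, 2),
        "dark_spend_usd": round(dark, 2),
        "dark_spend_pct": round(100 * dark / invoice_total_usd, 2),
    }


# Invented month-level totals, collapsed to three records for brevity.
records = [
    {"ts": "2026-05-01T00:00:00Z", "team": "search-platform", "cost_usd": 30000.0},
    {"ts": "2026-05-01T00:00:00Z", "team": "agents", "cost_usd": 18500.0},
    {"ts": "2026-05-02T00:00:00Z", "team": None, "cost_usd": 500.0},
]

print(daily_team_costs(records))
print(reconcile(records, invoice_total_usd=50000.0))
# {'attributed_usd': 48500.0, 'dark_spend_usd': 1500.0, 'dark_spend_pct': 3.0}
```

Note that dark spend is measured against the invoice, not against the log: the $500 of untagged records is visible in the log, while the remaining $1,000 (retries, overhead, calls that bypassed the gateway) only shows up when the invoice total is compared.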

On the observability side, keep your existing Langfuse, LangSmith, or OTel-native pipeline. The gateway can emit the same request to both sinks: a trace span to the observability backend and an audit record to the attribution log. The schemas do not collide because they live in different stores. The data is captured once and routed twice. This avoids the worst pattern, which is asking an SRE to query a tracing backend at month end to produce a chargeback CSV.
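The capture-once, route-twice idea can be sketched without any SDK. In this illustration the two sinks are plain Python lists standing in for a tracing backend and an audit log. `CapturedCall`, `to_span_attributes`, `to_audit_record`, and `route` are hypothetical names, and `latency_ms` is a stand-in for what a real span would carry as its duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturedCall:
    """Everything the gateway learns about one request, captured once."""
    request_id: str
    team: str
    project: str
    environment: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def to_span_attributes(call: CapturedCall) -> dict:
    """Operational view: what happened and how it performed."""
    return {
        "gen_ai.request.model": call.model,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
        "latency_ms": call.latency_ms,  # stand-in for span duration
    }


def to_audit_record(call: CapturedCall) -> dict:
    """Financial view: who to charge and how much."""
    return {
        "request_id": call.request_id,
        "team": call.team,
        "project": call.project,
        "environment": call.environment,
        "model": call.model,
        "input_tokens": call.input_tokens,
        "output_tokens": call.output_tokens,
        "cost_usd": call.cost_usd,
    }


def route(call: CapturedCall, trace_sink: list, audit_sink: list) -> None:
    """Send one captured call to both sinks, each in its own schema."""
    trace_sink.append(to_span_attributes(call))
    audit_sink.append(to_audit_record(call))


trace_sink, audit_sink = [], []
call = CapturedCall("req_8a3f", "search-platform", "semantic-rerank-v3",
                    "prod", "gpt-4o", 1240, 380, 812.5, 0.0247)
route(call, trace_sink, audit_sink)
print(trace_sink[0])
print(audit_sink[0])
```

Keeping two projection functions over one captured object means neither schema has to grow fields for the other audience.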

The AI Cost Attribution Auditor is designed to inspect the attribution layer specifically: whether your gateway audit log carries the organizational tags FOCUS needs, whether the rollup reconciles to the provider invoice, and whether any spend is unattributable. It does not replace your observability stack. It checks that the AI Cost Attribution column of the table above is actually populated and reportable.

Summary: building the LLM cost tracking layer correctly

The most expensive mistake in 2026 LLM platform engineering is treating observability as a substitute for cost attribution. They are different problems. LLM observability, governed by the OpenTelemetry GenAI Semantic Conventions, captures per-request operational signals and serves SRE and platform engineering. AI cost attribution, aligned with the FinOps FOCUS specification, captures billing-grade records tagged with organizational identity and serves FinOps and finance. Conflating them produces beautiful dashboards and unanswerable invoices.

A mature stack runs both layers. An OTel-native tool like Langfuse, LangSmith, or Lumina answers the operational question of WHAT happened. A gateway-level audit log with team_id, project_id, and cost_center tags answers the attribution question of WHO to charge. The implementation pattern is a thin AI gateway FinOps tier that emits a trace to observability and an audit record to attribution from every request, plus a FOCUS-aligned rollup that reconciles against the provider invoice each month.

If your team can produce a per-team LLM chargeback report this month without a manual SQL exercise across three telemetry stores, you have the attribution layer. If you cannot, the gap is structural rather than operational. The AI Cost Attribution Auditor at agentcolony.org is built to identify exactly that gap and tell you which request-level tags are missing before your CFO notices the next invoice.

FAQ: LLM cost tracking and AI gateway FinOps

How do I tell if my team needs LLM observability, AI cost attribution, or both?

If your primary question is reliability, asking why calls are slow or where the regression is, you need LLM observability and tools like Langfuse, LangSmith, or an OTel-native backend. If your primary question is allocation, asking which team owns this $48,300 invoice and where to send the chargeback, you need AI cost attribution and a gateway audit log. Most mature platforms need both, run as separate layers that share the same request boundary.

Can I use LangSmith or Langfuse for cost attribution by team?

You can store custom team or project attributes in their span metadata and query them, which works for ad hoc analysis and small teams. It does not scale to chargeback because tracing backends have short retention, are optimized for span queries rather than billing rollups, and do not reconcile against provider invoices. For monthly cost-by-team reporting that finance trusts, build a separate audit log aligned to FOCUS columns.

What is the difference between OpenTelemetry GenAI conventions and the FinOps FOCUS specification?

OpenTelemetry GenAI Semantic Conventions describe span attributes for LLM calls, things like input tokens, output tokens, model name, and finish reasons. They serve operational telemetry. FOCUS describes billing record columns like SubAccountId, ResourceId, Tags, BilledCost, and ConsumedUnit. It serves cost management. They operate at different layers of the stack and a complete LLM cost tracking pipeline implements both rather than picking one.

Why does shared API key usage make AI cost attribution impossible?

A shared API key produces one provider invoice line per model with no organizational context. Without per-request tagging at a gateway, you cannot retroactively split that invoice across teams. The standard fixes are project-scoped keys (OpenAI Projects, Anthropic Workspaces) or an AI gateway that tags every outbound request with a team identifier and persists those tags in an audit log indexed by team_id and project_id.

How much LLM spend typically goes unattributed when only observability is in place?

Practitioners report figures in the three to ten percent range as dark spend (unattributable records, gateway overhead, retries, agent loops). On a $50,000 monthly OpenAI bill that is $1,500 to $5,000 each month uncharged, or up to $60,000 a year. Reconciling the gateway audit log against the provider invoice each month surfaces the gap and forces it down toward one percent within a quarter.

Should I build an AI gateway FinOps layer myself or buy one?

Build it if your stack is already running an Envoy or LiteLLM proxy and you have engineering capacity to add audit logging and a FOCUS rollup. Buy it (Opsmeter, TensorWall, or similar) if you need attribution running this quarter and would rather not own the gateway operationally. Either way the schema requirement is the same: every request carries team, project, and environment metadata before it leaves your perimeter.