OpenTelemetry LLM Cost Attribution: Filling the Gap

TL;DR

Standard OTel GenAI spans capture model name, token counts, and latency. The semantic conventions are still in Development status, so the contract can still change.

Missing standard fields include tenant_id, cost_center, workflow_id, retry sequence, cache hit flag, and a USD cost estimate.

Platform engineers patch the gap with custom span attributes (app.tenant_id, app.cost_center) that travel between services as W3C Baggage and are copied onto spans by a span processor.

Retries, semantic cache hits, and gateway-level model substitution silently multiply spend in ways default dashboards do not surface.

The free AI Cost Attribution Auditor at https://agentcolony.org/auditor parses gateway and proxy traces to recover the missing dimensions when re-instrumenting the call site is not an option.

What OTel GenAI spans capture out of the box

The official OpenTelemetry GenAI semantic conventions now live in the open-telemetry/semantic-conventions-genai repository, recently split from the main semconv repo. They define a gen_ai.inference.client span with a tightly scoped attribute set.

Required attributes on every span are gen_ai.operation.name (for example, "chat") and gen_ai.provider.name (for example, "openai"). Conditionally required attributes include gen_ai.request.model ("gpt-4"), gen_ai.conversation.id, gen_ai.request.stream, gen_ai.request.seed, and gen_ai.request.choice.count. Recommended attributes layer on gen_ai.request.max_tokens, gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.response.model (the model actually used to serve the call), and the sampling penalties.

Token usage appears in two places. The span carries gen_ai.usage.input_tokens and gen_ai.usage.output_tokens as Recommended attributes, and the gen_ai.client.token.usage metric reports the same counts split by a gen_ai.token.type dimension (input or output). Every attribute in the GenAI semconv carries a Development stability badge, which means names and semantics can still change before they freeze.

That is the entire standard contract you get for free if you wire up a vendor SDK like the OpenAI Python client through OTel instrumentation today. It is enough to draw a request-volume chart. It is not enough to bill a customer.

The attribution fields the spec leaves blank

The gap shows up the moment a FinOps lead asks who owes what. The GenAI semconv has zero standard attributes for tenant_id, customer_id, cost_center, team_id, workflow_id, pipeline_id, retry sequence, fallback-chain depth, semantic-cache hit flag, or a USD cost estimate.

Two LLM calls with byte-identical span attributes can therefore belong to entirely different tenants or cost centers, and the default OTel export gives the chargeback pipeline no way to tell them apart. Token counts arrive keyed by model and provider, with no customer dimension and no dollar value attached.

A practitioner in r/FinOps captured the pattern bluntly in May 2026: "AI-assisted-dev work is metered at the API but invisible at the feature level. Cloud FinOps closed the same gap a few years back with allocation tags." (Source: r/FinOps thread "FinOps for AI: Track What Your Code Actually Costs Per Commit", citing the FinOps Foundation AI Cost and Usage Tracker working group.)

Engineers patch the spec by hand. Two approaches hold up in production: extending spans with custom attributes at the call site, and parsing gateway traces after the fact.

How to extend OTel spans for multi-tenant cost tracking

The cleanest extension pattern uses three pieces working together: custom span attributes, W3C Baggage, and a span processor that does the gluing.

First, at the request boundary, attach the business context to OTel Baggage as soon as you parse the inbound HTTP request. A typical FastAPI middleware looks like this:

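from opentelemetry import baggage, context

@app.middleware("http")  # app: your FastAPI instance; tenant_lookup: your own tenant record
async def attach_cost_baggage(request, call_next):
    ctx = baggage.set_baggage("app.tenant_id", request.headers["x-tenant-id"])
    ctx = baggage.set_baggage("app.cost_center", tenant_lookup.cost_center, context=ctx)
    ctx = baggage.set_baggage("app.workflow_id", request.headers.get("x-workflow-id", ""), context=ctx)
    token = context.attach(ctx)  # make the Baggage current for downstream handlers
    try:
        return await call_next(request)
    finally:
        context.detach(token)  # restore the previous context so Baggage does not leak across requests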

Second, register a SpanProcessor that copies Baggage values onto every outgoing GenAI span at on_start. The LLM client SDK then never needs to know anything about your business taxonomy. The processor reads from baggage.get_all(context) and calls span.set_attribute("app.tenant_id", value) for each key you allowlist.
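
The copy step itself can be sketched without any SDK dependency. The helper below is a minimal illustration, not a drop-in processor: copy_allowlisted_baggage, ALLOWED_KEYS, and FakeSpan are hypothetical names, and in a real deployment the function body would sit inside the on_start method of your SpanProcessor, fed by the result of baggage.get_all for the parent context.

```python
from typing import Any, Iterable, Mapping

# Hypothetical allowlist: only these Baggage keys are copied onto spans.
ALLOWED_KEYS = ("app.tenant_id", "app.cost_center", "app.workflow_id")


def copy_allowlisted_baggage(span: Any, baggage_entries: Mapping[str, object],
                             allowed: Iterable[str] = ALLOWED_KEYS) -> int:
    """Copy allowlisted, non-empty Baggage entries onto a span.

    `span` is anything with a set_attribute(key, value) method, which is the
    shape of an OTel span. Returns the number of attributes written.
    """
    written = 0
    for key in allowed:
        value = baggage_entries.get(key)
        if value is None or value == "":
            continue  # skip missing or empty values instead of stamping blanks
        span.set_attribute(key, str(value))
        written += 1
    return written


class FakeSpan:
    """Stand-in for an OTel span, used only to exercise the helper."""

    def __init__(self) -> None:
        self.attributes = {}

    def set_attribute(self, key: str, value: str) -> None:
        self.attributes[key] = value


if __name__ == "__main__":
    span = FakeSpan()
    entries = {"app.tenant_id": "acme", "app.workflow_id": "", "session": "s-1"}
    copy_allowlisted_baggage(span, entries)
    print(span.attributes)
```

The allowlist matters because Baggage can carry keys set by any upstream service; copying everything would leak unrelated values into your trace backend.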

Third, declare a stable attribute namespace inside your org (the app.* prefix above) and pin it in an internal schema doc, so downstream cost pipelines, dashboards, and chargeback exports all agree on the field names. Treat your custom GenAI attributes the way you already treat cloud cost-allocation tags.
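
One way to make that internal schema doc enforceable is to keep it as data and check spans against it in CI or in the processor. The sketch below assumes a hypothetical schema (APP_ATTRIBUTE_SCHEMA) and helper (schema_violations); the field list simply mirrors the app.* attributes used in this article and is not a standard.

```python
from typing import Dict, List

# Hypothetical pinned schema: attribute name -> expected Python type.
APP_ATTRIBUTE_SCHEMA: Dict[str, type] = {
    "app.tenant_id": str,
    "app.cost_center": str,
    "app.workflow_id": str,
    "app.retry_seq": int,
    "app.cache.hit": bool,
    "app.cost_usd": float,
}


def schema_violations(attributes: Dict[str, object]) -> List[str]:
    """Return human-readable problems for the app.* attributes on one span."""
    problems = []
    for name, value in attributes.items():
        if not name.startswith("app."):
            continue  # gen_ai.* and other namespaces are governed elsewhere
        expected = APP_ATTRIBUTE_SCHEMA.get(name)
        if expected is None:
            problems.append(f"{name}: not in the pinned schema")
        elif type(value) is not expected:
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

A check like this catches the usual drift (app.tenant versus app.tenant_id, or a retry count sent as a string) before it splits a chargeback report into two columns.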

Retries, cache hits, and model routing the spans miss

Even with tenant and cost_center attached, three classes of invisible cost remain. Retries inflate token spend silently. A single user-visible "one chat call" can fan out into three or four OTel spans when the SDK retries on a 429, but unless your processor stamps a retry sequence and parent attempt id, the dashboards present each retry as an independent request from the tenant.
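
A minimal sketch of that stamping, assuming you control the retry loop: the wrapper below records one attribute set per attempt, with app.retry_seq and a shared group id so a backend can collapse the attempts into one logical call. call_with_retry_stamps, send, and app.attempt_group_id are illustrative names, not part of any SDK or of the semconv. In practice each attribute dict would be set on that attempt's span, and the loop would sleep with backoff between attempts.

```python
import uuid
from typing import Callable, Dict, List, Tuple


def call_with_retry_stamps(
    send: Callable[[], int], max_attempts: int = 4
) -> Tuple[int, List[Dict[str, object]]]:
    """Call `send` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical zero-argument function that performs one LLM
    request and returns the HTTP status code. Returns the final status and
    one attribute dict per attempt.
    """
    group_id = uuid.uuid4().hex  # shared by every attempt of one logical call
    stamps: List[Dict[str, object]] = []
    status = 0
    for seq in range(max_attempts):
        status = send()
        stamps.append({
            "app.retry_seq": seq,  # 0 for the first attempt
            "app.attempt_group_id": group_id,
            "http.response.status_code": status,
        })
        if status != 429:
            break  # success or a non-retryable error: stop retrying
    return status, stamps
```

With these two attributes on every span, a dashboard can count distinct app.attempt_group_id values for request volume and sum tokens across all attempts for spend.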

Semantic and prompt caches reverse the problem. A cached response should cost a fraction of a live inference, yet the default span carries no cache-hit flag; the semconv defines nothing like gen_ai.cache.hit. Provider-side prompt caching (OpenAI, Anthropic, Bedrock) is even less visible because the cache lives outside your network and surfaces only in the provider's token accounting, as a reduced input_tokens count or a separate cached-token figure for the same prompt.

Gateway-level model routing is the last and most expensive trap. The Lumina LLM observability project documented one team whose costs spiked because a routing bug was sending traffic to GPT-4 instead of GPT-3.5. OTel spans capture this if you compare gen_ai.request.model against gen_ai.response.model, but most FinOps dashboards only graph the request model and never see the substitution.
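
The request-versus-response comparison is simple to automate. The function below is a sketch under one stated assumption: a response model that only adds a date or build suffix to the requested name (for example "gpt-4" served as "gpt-4-0613") is treated as the same model. substituted_model and the suffix pattern are illustrative; adjust the rule to your providers' naming.

```python
import re
from typing import Mapping, Optional

# Assumption: trailing -MMDD, -YYYYMMDD, or -YYYY-MM-DD is a version suffix.
_VERSION_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8}|\d{4})$")


def _base_name(model: str) -> str:
    """Strip a trailing version suffix from a model name."""
    return _VERSION_SUFFIX.sub("", model)


def substituted_model(attributes: Mapping[str, object]) -> Optional[str]:
    """Return the serving model if it differs from the requested one."""
    requested = attributes.get("gen_ai.request.model")
    served = attributes.get("gen_ai.response.model")
    if not requested or not served:
        return None  # nothing to compare
    if _base_name(str(requested)) == _base_name(str(served)):
        return None  # same model, possibly a pinned version
    return str(served)
```

A non-None result is the value to alert on, and also the model to price the call against.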

OTel vs FinOps-grade attribution: a comparison

Attribute	OTel GenAI standard	Custom extension you need	What an auditor recovers from gateway traces
Model name	`gen_ai.request.model` (Conditionally Required)	None	Parsed directly
Token counts	`gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` (Recommended), plus the `gen_ai.client.token.usage` metric	None	Parsed directly
Latency	Span duration	None	Parsed directly
Tenant id	Not defined	`app.tenant_id` via Baggage	Inferred from API key or auth header
Cost center	Not defined	`app.cost_center` via Baggage	Inferred from tenant lookup
Workflow id	Not defined	`app.workflow_id` via Baggage	Inferred from request id correlation
Retry sequence	Not defined	`app.retry_seq` in SDK middleware	Detected from retry status codes
Cache hit flag	Not defined	`app.cache.hit` in proxy	Detected by input-token delta
Substituted model	`gen_ai.response.model` (Recommended)	Diff alert in processor	Detected when request and response model differ
USD cost	Not defined	Computed in span processor	Computed from token counts plus rate card

Where the AI Auditor fills the gap

When you cannot re-instrument the LLM call site (vendor SDK, third-party agent framework, locked legacy pipeline), the practical workaround is to capture the calls at the network boundary. A reverse proxy or LLM gateway already sees every request and response. Parsing those traces recovers, directly or by inference, the dimensions the OTel spans omit.

That is the job of the free AI Cost Attribution Auditor. Paste a gateway export or a proxy trace bundle and the tool extracts model, tokens, latency, the request and response model diff, the inferred tenant from the auth header, retry chains detected from status-code sequences, and a USD cost estimate against the current rate card. Use it to validate that your in-process OTel attribution matches the wire-level truth, or to backfill attribution on workloads you cannot instrument at the call site.
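
To make the idea of detecting retry chains from status-code sequences concrete, here is a small sketch. It is not the Auditor's implementation, only an illustration of the general technique; the record fields (api_key, prompt_hash, status) and the RETRYABLE set are assumptions about what a gateway export might contain.

```python
from typing import Dict, List, Tuple

RETRYABLE = {429, 500, 502, 503, 504}  # assumption: statuses that trigger a retry

Record = Dict[str, object]


def retry_chains(records: List[Record]) -> List[List[Record]]:
    """Group time-ordered gateway records into retry chains.

    A record joins the open chain for its (api_key, prompt_hash) pair when the
    last record in that chain ended with a retryable status; otherwise it
    starts a new chain. Field names are hypothetical.
    """
    open_chains: Dict[Tuple[object, object], List[Record]] = {}
    chains: List[List[Record]] = []
    for rec in records:
        key = (rec["api_key"], rec["prompt_hash"])
        chain = open_chains.get(key)
        if chain is not None and chain[-1]["status"] in RETRYABLE:
            chain.append(rec)  # continuation of a failed attempt
        else:
            chain = [rec]  # first attempt of a new logical call
            chains.append(chain)
            open_chains[key] = chain
    return chains
```

Each chain is one user-visible call; its length minus one is the number of retries to attribute. A production version would also bound chains by a time window so unrelated identical prompts hours apart are not merged.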

Summary

OpenTelemetry has won the LLM observability transport layer, but the GenAI semantic conventions stop at model, tokens, and latency. FinOps and platform teams have to add tenant_id, cost_center, workflow_id, retry sequence, cache hit flag, and USD cost themselves, either by extending spans with custom attributes and Baggage at the call site, or by parsing gateway traces after the fact. Treat custom GenAI attributes with the same discipline you already apply to cloud cost-allocation tags, and validate that the two paths agree before you cut a chargeback invoice.

FAQ

Are OTel GenAI semantic conventions stable yet?

No. As of 2026 they live in open-telemetry/semantic-conventions-genai and every attribute carries a Development stability badge. Names and semantics can still change. Pin the version of the semconv your collectors expect and review release notes before upgrading any in-process instrumentation.

Where should tenant_id live: on a span attribute or in W3C Baggage?

Both. Baggage propagates the value across service boundaries without each service knowing the business taxonomy. A span processor then copies the allowlisted Baggage keys onto every GenAI span on start, so dashboards can group by app.tenant_id without joins. The span attribute is what your backend queries; Baggage is what carries it there.

Will adding USD cost to a span work with existing OTel collectors?

Yes, if you treat it as a numeric attribute (app.cost_usd) rather than a metric. Collectors and backends ingest numeric span attributes the same way as strings. For aggregation, also emit a parallel metric so dashboards can roll up without re-scanning spans. Keep the rate-card lookup inside the span processor so cost stays consistent across services.
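
The rate-card lookup can be as small as the sketch below. RATE_CARD, its model names, and its prices are placeholders invented for illustration, not real provider pricing, and estimate_cost_usd is a hypothetical helper name. The arithmetic uses Decimal to avoid accumulating float error and converts to float only at the end, since span attributes accept float rather than Decimal.

```python
from decimal import Decimal
from typing import Dict, Tuple

# Hypothetical rate card: USD per 1,000,000 tokens as (input, output).
# Model names and prices are placeholders, not real provider pricing.
RATE_CARD: Dict[str, Tuple[Decimal, Decimal]] = {
    "example-large": (Decimal("10.00"), Decimal("30.00")),
    "example-small": (Decimal("0.50"), Decimal("1.50")),
}
PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate one call's cost in USD; raises KeyError for unknown models."""
    input_rate, output_rate = RATE_CARD[model]
    cost = (input_rate * input_tokens + output_rate * output_tokens) / PER_MILLION
    return float(cost)  # value to stamp as app.cost_usd


if __name__ == "__main__":
    # 1,000 input and 500 output tokens on the placeholder "example-large" model.
    print(estimate_cost_usd("example-large", 1000, 500))
```

Price against gen_ai.response.model when it is present, since that is the model that actually served the call. Raising on an unknown model is deliberate: a silent zero would hide spend on models missing from the rate card.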

What about provider-side prompt caching?

Provider caches (OpenAI, Anthropic, Bedrock) report a reduced input_tokens count for the same prompt rather than an explicit cache hit flag. Track a running expectation of token counts per prompt hash in your span processor and stamp app.cache.hit_inferred=true when the actual count drops well below it. That gives you a workable signal until the GenAI semconv adds a first-class cache field.
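
A rough sketch of that running expectation follows. CacheHitInferrer, its drop_ratio threshold of 0.5, and the choice of the highest count seen as the baseline are all assumptions made for illustration; tune them against your own traffic before trusting the flag.

```python
import hashlib
from typing import Dict


class CacheHitInferrer:
    """Infer provider-side cache hits from drops in reported input tokens."""

    def __init__(self, drop_ratio: float = 0.5) -> None:
        # Assumption: a count below 50% of the baseline is "well below" it.
        self.drop_ratio = drop_ratio
        self._baseline: Dict[str, int] = {}  # prompt hash -> highest count seen

    @staticmethod
    def prompt_hash(prompt: str) -> str:
        """Hash the prompt so raw text is not kept in processor memory."""
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def observe(self, prompt: str, input_tokens: int) -> bool:
        """Record one call and return the value for app.cache.hit_inferred."""
        key = self.prompt_hash(prompt)
        baseline = self._baseline.get(key)
        if baseline is None:
            self._baseline[key] = input_tokens
            return False  # first sighting: nothing to compare against
        if input_tokens < baseline * self.drop_ratio:
            return True  # keep the baseline; a cached call is not the new normal
        self._baseline[key] = max(baseline, input_tokens)
        return False
```

The obvious blind spot is a prompt whose first observed call is already a cache hit: the baseline starts low and corrects itself only once an uncached call for the same prompt is seen.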

Can I use the AI Auditor without changing my code?

Yes. Export a gateway or proxy trace bundle and upload it at https://agentcolony.org/auditor. The tool runs entirely on the uploaded trace and never reads your runtime, so it is a safe way to spot-check attribution on locked or third-party pipelines you cannot re-instrument.