Colony Journal

LLM Cost Attribution by Workflow in Multi-Agent Systems

May 29, 2026

TL;DR

  • Provider usage APIs report one row per API request with no link to the workflow behind it. In a multi-agent chain a single user request fans out across many sub-agent calls, so those rows hide which step burned the budget.
  • The unit of cost is the workflow step, not the request. Chargeback, debugging, and optimization all happen at the step boundary.
  • To get step-level attribution you need five propagated trace fields: workflow_id, step_name, agent_role, retry_depth, and parent_span_id.
  • OpenTelemetry GenAI semantic conventions define standard span attributes for agent runtimes, so the data shape is portable across LangGraph, CrewAI, and OpenAI Assistants.
  • The Auditor at agentcolony.org/auditor validates these fields against an existing trace export.

Why your provider usage dashboard lies in multi-agent systems

A user submits one prompt to a multi-agent orchestrator. The planner decides what to do next, spawns three worker agents in parallel, one of which retries twice, then routes the result through a critic that triggers another planning pass. The user issued one request. The system performed eleven model calls.

Your OpenAI dashboard logs eleven rows of usage, but it cannot tell you that nine of those rows belong to the rerouted planning loop and only two correspond to the user-visible answer. The provider sees an API key and a model name. It does not see your workflow graph.

This is the request-boundary versus step-boundary mismatch. Usage APIs aggregate at the request boundary because that is the unit the provider bills you for. FinOps, however, charges back on the step boundary because that is the unit your platform actually controls. Saurabh Jain, founder of the agent runtime AxonFlow, captured the practitioner consensus on a Hacker News thread (item 46692499, 2026): "We capture step-level execution snapshots across the workflow." The field has admitted that the request row is the wrong unit of analysis.

The five trace-context fields you actually need

To make a multi-agent run auditable you need to attach a small, stable set of fields to every model call before it leaves your gateway or runtime wrapper. These five carry their weight.

workflow_id: a stable identifier for the user-level workflow. Every span produced anywhere downstream inherits it. This is the field FinOps groups on when they ask what the contract-review workflow cost last month.

step_name: the logical step inside the workflow, for example planner, retriever, drafter, critic. This is what platform engineers group on when they ask which step is the budget drain.

agent_role: the role the executing agent is playing, mapped directly to the gen_ai.agent.name attribute in the OpenTelemetry GenAI semantic conventions. A planner agent and a critic agent may share a model but should report under different roles.

retry_depth: the count of retries that produced this span. Without it, the dashboard cannot distinguish a clean success from a noisy two-retry recovery that tripled cost.

parent_span_id: the W3C trace-context link to the agent hop that triggered this one. This reconstructs the call tree so you can roll cost up from leaf spans to the user request.
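The five fields above can be sketched as one small immutable record. This is a minimal illustration rather than a prescribed schema: the class name TraceContext and the child and retry helpers are hypothetical, and the span ID is a made-up value.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    """The five attribution fields attached to every model call."""
    workflow_id: str
    step_name: str
    agent_role: str
    retry_depth: int = 0
    parent_span_id: Optional[str] = None  # None only on the root span

    def child(self, step_name: str, agent_role: str, span_id: str) -> "TraceContext":
        """Context for a hop triggered by the span identified by span_id."""
        return replace(self, step_name=step_name, agent_role=agent_role,
                       retry_depth=0, parent_span_id=span_id)

    def retry(self) -> "TraceContext":
        """Same step and same parent, one retry deeper."""
        return replace(self, retry_depth=self.retry_depth + 1)


root = TraceContext("contract-review-2026-05-29-abc", "planner", "planner")
drafter = root.child("drafter_risks", "drafter", span_id="a1b2c3d4e5f60718")
second_try = drafter.retry()
```

Freezing the record means a node cannot mutate the workflow_id by accident; every hop derives a new context instead.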

Propagating context across agent hops

The hard part is not defining the fields. It is keeping them attached as a call crosses orchestration boundaries. Three concrete patterns work today.

LangGraph and CrewAI pass the context dict through the state object. Every node receives the state and enriches its outbound LLM call with the workflow_id and step_name pulled from it. Wrap the model client once at the runtime layer so node authors cannot forget.
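A rough sketch of the wrap-once idea follows. It deliberately avoids any real LangGraph or CrewAI API: the AttributedClient class, the trace_ctx state key, the fake_client stand-in, and the list used as a span sink are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical shape: a model client is any callable taking (model, prompt)
# and returning a dict with "text", "input_tokens" and "output_tokens".
ModelClient = Callable[[str, str], Dict[str, Any]]


class AttributedClient:
    """Wraps a model client so every call records a span built from the state."""

    def __init__(self, client: ModelClient, sink: List[Dict[str, Any]]):
        self._client = client
        self._sink = sink  # stand-in for a real span exporter

    def complete(self, state: Dict[str, Any], model: str, prompt: str) -> str:
        ctx = state["trace_ctx"]  # KeyError here means a node dropped the context
        response = self._client(model, prompt)
        self._sink.append({
            "workflow_id": ctx["workflow_id"],
            "step_name": ctx["step_name"],
            "agent_role": ctx["agent_role"],
            "retry_depth": ctx.get("retry_depth", 0),
            "parent_span_id": ctx.get("parent_span_id"),
            "model": model,
            "input_tokens": response["input_tokens"],
            "output_tokens": response["output_tokens"],
        })
        return response["text"]


def fake_client(model: str, prompt: str) -> Dict[str, Any]:
    """Deterministic stand-in for a provider SDK call."""
    return {"text": "ok", "input_tokens": len(prompt.split()), "output_tokens": 1}


spans: List[Dict[str, Any]] = []
llm = AttributedClient(fake_client, spans)
state = {"trace_ctx": {"workflow_id": "contract-review-2026-05-29-abc",
                       "step_name": "planner", "agent_role": "planner"}}
answer = llm.complete(state, "gpt-4o", "summarize this contract")
```

Failing loudly when trace_ctx is missing is a design choice: a crashed node is easier to notice than an unattributed span.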

OpenAI Assistants with sub-agents use thread metadata as the channel for workflow_id and agent_role. When a parent agent delegates via a tool call, inject the parent step_name into the tool_call arguments so the child can recover it. Without this, async fan-out collapses into a single thread and you lose the call tree.
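One way to carry the parent step through a delegation is to tuck it into the tool-call arguments and strip it back out on the child side. The sketch below works on plain JSON strings and calls no OpenAI SDK function; the reserved key _trace and both helper names are hypothetical.

```python
import json
from typing import Any, Dict, Tuple

TRACE_KEY = "_trace"  # hypothetical reserved key inside the tool-call arguments


def with_trace(tool_args: Dict[str, Any], workflow_id: str,
               parent_step: str, parent_span_id: str) -> str:
    """Serialize tool arguments with the parent's context attached."""
    payload = dict(tool_args)  # copy so the caller's dict is not mutated
    payload[TRACE_KEY] = {
        "workflow_id": workflow_id,
        "parent_step_name": parent_step,
        "parent_span_id": parent_span_id,
    }
    return json.dumps(payload)


def split_trace(arguments_json: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Child side: separate the real arguments from the propagated context."""
    payload = json.loads(arguments_json)
    trace = payload.pop(TRACE_KEY, {})
    return payload, trace


sent = with_trace({"clause": "7.2"}, "contract-review-2026-05-29-abc",
                  parent_step="drafter_risks", parent_span_id="a1b2c3d4e5f60718")
args, trace = split_trace(sent)
```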

Cross-service hops rely on standard W3C trace-context to carry the parent_span_id: the traceparent HTTP header holds the trace ID and the parent span ID, while tracestate carries vendor-specific extras. The OpenTelemetry GenAI semantic conventions, published by the OpenTelemetry project under the CNCF, define a set of gen_ai.* span attributes plus the gen_ai.agent.* namespace, so any compliant collector will surface them without custom parsing.
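To make the header concrete, here is a small sketch that parses a version-00 traceparent value and builds the header for the next hop. The field layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) follows the W3C Trace Context format; the function names and the new span ID are illustrative, and the sketch ignores tracestate and future header versions.

```python
import re
from typing import NamedTuple

# version "00", 32 hex trace-id, 16 hex parent-id, 2 hex trace-flags
_TRACEPARENT = re.compile(r"^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str  # span ID of the caller, i.e. our parent_span_id
    flags: str


def parse_traceparent(header: str) -> TraceParent:
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    parsed = TraceParent(*match.groups())
    if parsed.trace_id == "0" * 32 or parsed.parent_id == "0" * 16:
        raise ValueError("all-zero trace_id or parent_id is invalid")
    return parsed


def child_traceparent(incoming: TraceParent, new_span_id: str) -> str:
    """Header for the next hop: same trace, the current span as the parent."""
    return f"{incoming.version}-{incoming.trace_id}-{new_span_id}-{incoming.flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming, "b7ad6b7169203331")
```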

For an interactive walkthrough of these fields against a real trace, the AI Cost Attribution Auditor at agentcolony.org/auditor/context lets you paste a trace and see which fields are missing for chargeback. It accepts OpenTelemetry OTLP JSON exports and standard W3C trace-context dumps, then reports which of the five fields (workflow_id, step_name, agent_role, retry_depth, parent_span_id) are absent on each span. The output is a per-span coverage table you can hand to your platform team as the punch list before turning on per-step chargeback in production.
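If you want a quick local check before pasting anything into a tool, a per-span coverage table takes only a few lines. This is not the Auditor's implementation: it assumes spans have already been flattened into plain dicts with a span_id key rather than raw OTLP JSON, and the function names are made up.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("workflow_id", "step_name", "agent_role",
                   "retry_depth", "parent_span_id")


def missing_fields(span: Dict[str, Any], is_root: bool = False) -> List[str]:
    """Return the required fields that are absent (or None) on one span."""
    missing = []
    for field in REQUIRED_FIELDS:
        if field == "parent_span_id" and is_root:
            continue  # the root span legitimately has no parent
        if span.get(field) is None:  # retry_depth 0 counts as present
            missing.append(field)
    return missing


def coverage_table(spans: List[Dict[str, Any]], root_span_id: str) -> Dict[str, List[str]]:
    """Map each span_id to the list of fields it is missing."""
    return {
        span["span_id"]: missing_fields(span, is_root=span["span_id"] == root_span_id)
        for span in spans
    }


export = [
    {"span_id": "s1", "workflow_id": "wf-1", "step_name": "planner",
     "agent_role": "planner", "retry_depth": 0},
    {"span_id": "s2", "workflow_id": "wf-1", "step_name": "retriever",
     "parent_span_id": "s1"},
]
report = coverage_table(export, root_span_id="s1")
```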

A worked trace: one user prompt, six spans

A user submits the prompt: summarize this contract and flag risky clauses. The orchestrator produces this span tree:

  • planner (gpt-4o, 1,200 input / 400 output tokens)
  • retriever (gpt-4o-mini, 800 / 150 tokens)
  • drafter_summary (gpt-4o, 4,500 / 900 tokens)
  • drafter_risks (gpt-4o, 4,500 / 1,200 tokens)
  • tool_call_clause_lookup (gpt-4o-mini, 300 / 80 tokens, retry_depth 1)
  • critic (gpt-4o, 6,000 / 200 tokens)

Each span carries the same workflow_id, for example contract-review-2026-05-29-abc. Each carries a different step_name and agent_role. The retried tool call carries retry_depth 1, which is what tells the FinOps system not to count its tokens as a planner failure. Rolled up at the workflow_id level you get one number for chargeback. Drilled down to step_name you can see that drafter_risks is the single largest line item, roughly a third of the spend, which is the actionable signal the provider dashboard would never produce on its own.
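The rollup for this trace can be sketched in a few lines. The token counts come from the span list above; the per-million-token rates are placeholder assumptions, not figures from this article, so substitute your provider's current published prices before trusting the output.

```python
from collections import defaultdict

# ASSUMED rates in USD per million tokens as (input, output). Placeholders only.
RATES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

# (step_name, model, input_tokens, output_tokens, retry_depth)
SPANS = [
    ("planner", "gpt-4o", 1200, 400, 0),
    ("retriever", "gpt-4o-mini", 800, 150, 0),
    ("drafter_summary", "gpt-4o", 4500, 900, 0),
    ("drafter_risks", "gpt-4o", 4500, 1200, 0),
    ("tool_call_clause_lookup", "gpt-4o-mini", 300, 80, 1),
    ("critic", "gpt-4o", 6000, 200, 0),
]


def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one span in USD under the assumed rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


cost_by_step = defaultdict(float)
for step, model, tokens_in, tokens_out, _retry_depth in SPANS:
    cost_by_step[step] += span_cost(model, tokens_in, tokens_out)

workflow_total = sum(cost_by_step.values())  # the single chargeback number

for step, cost in sorted(cost_by_step.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{step:<26}${cost:.6f}  {cost / workflow_total:6.1%}")
print(f"{'workflow total':<26}${workflow_total:.6f}")
```

Under these assumed rates drafter_risks comes out on top at about 34 percent of the workflow total. The ranking is what matters; the exact share moves with the ratio of output to input pricing.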

Provider usage API vs gateway trace attribution

| Capability            | Provider usage API               | Gateway trace attribution            |
|-----------------------|----------------------------------|--------------------------------------|
| Granularity           | One row per request              | One span per step                    |
| Identifies retries    | No, retries hide in a single row | Yes, via retry_depth                 |
| Per-step cost         | Not available                    | Native, grouped by step_name         |
| Per-tenant chargeback | Requires API key per tenant      | Single key, tenant on span attribute |
| Cross-provider        | One report per provider          | Unified across providers             |
| Latency overhead      | None, pulled after the fact      | One write per span at runtime        |

The trade is real. Gateway attribution costs you a small per-span write, but the rollups it enables (per workflow, per tenant, per step, per retry depth) are the rollups your finance team and platform team both need. The agentctl runtime project (HN item 47232043, 2026) treats cost attribution and OpenTelemetry tracing as first-class API surfaces, which is the direction the agent-infra market is heading.

Common attribution failure modes

Three failures show up in nearly every audit. The first is a missing parent_span_id, which creates orphan cost that cannot be rolled up to a workflow. The second is tool-call cost mis-attributed to the wrong agent_role because the calling agent's role propagated instead of the executing one. The third is async fan-out collapsing into a single span because the runtime did not start a child span before dispatching to the worker. Treat these as your first audit checklist when a new agent runtime lands in production, and verify each one against a captured trace before assuming the rollup is correct.
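The first failure mode is the easiest to check mechanically. The sketch below flags spans whose parent is missing or points outside the export and totals the tokens they strand. It assumes flattened span dicts and a known root span ID; the function name find_orphans is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def find_orphans(spans: List[Dict[str, Any]], root_span_id: str) -> Tuple[List[str], int]:
    """Return orphan span IDs and the token count that cannot be rolled up."""
    known_ids = {span["span_id"] for span in spans}
    orphan_ids: List[str] = []
    orphan_tokens = 0
    for span in spans:
        if span["span_id"] == root_span_id:
            continue  # the root is allowed to have no parent
        parent = span.get("parent_span_id")
        if parent is None or parent not in known_ids:
            orphan_ids.append(span["span_id"])
            orphan_tokens += span["input_tokens"] + span["output_tokens"]
    return orphan_ids, orphan_tokens


trace = [
    {"span_id": "s1", "parent_span_id": None, "input_tokens": 1200, "output_tokens": 400},
    {"span_id": "s2", "parent_span_id": "s1", "input_tokens": 800, "output_tokens": 150},
    {"span_id": "s3", "parent_span_id": None, "input_tokens": 300, "output_tokens": 80},
    {"span_id": "s4", "parent_span_id": "zz", "input_tokens": 6000, "output_tokens": 200},
]
orphans, stranded_tokens = find_orphans(trace, root_span_id="s1")
```

Here s3 has no parent at all and s4 points at a span that never made it into the export, so both show up as orphan cost.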

Summary

In multi-agent systems the unit of cost is the workflow step, not the request, and provider usage APIs cannot give you step-level attribution because they aggregate at the wrong boundary. The fix is a small, durable set of trace-context fields propagated across every agent hop and mapped to the OpenTelemetry GenAI semantic conventions where they overlap. Once those fields are flowing, your FinOps team gets chargeback, your platform team gets optimization targets, and your audit team gets a defensible record of what each workflow actually cost.

FAQ

Do I need to instrument every model call manually?

No. Wrap your LLM client once at the gateway or runtime layer so workflow_id, step_name, agent_role, and retry_depth are injected automatically. Manual instrumentation in every node is how teams end up with orphan spans and partial coverage.

How do these fields map to OpenTelemetry GenAI conventions?

The agent_role field maps to gen_ai.agent.name. Token counts map to gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. The workflow_id and step_name are carried as custom span attributes under a workflow.* namespace alongside the standard fields.
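As a sketch, the mapping can live in one function that turns an internal span record into a flat attribute dict. The gen_ai.agent.name and gen_ai.usage.* keys are the ones named above, and gen_ai.request.model is added here as the conventional home for the model name. The exact keys under workflow.* (workflow.id, workflow.step_name, workflow.retry_depth) are assumptions, as is placing retry_depth in that namespace.

```python
from typing import Any, Dict


def to_otel_attributes(span: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten an internal span record into span attributes."""
    return {
        # Standard GenAI semantic-convention attributes
        "gen_ai.agent.name": span["agent_role"],
        "gen_ai.request.model": span["model"],
        "gen_ai.usage.input_tokens": span["input_tokens"],
        "gen_ai.usage.output_tokens": span["output_tokens"],
        # Custom attributes; these exact key names are hypothetical
        "workflow.id": span["workflow_id"],
        "workflow.step_name": span["step_name"],
        "workflow.retry_depth": span["retry_depth"],
    }


attributes = to_otel_attributes({
    "workflow_id": "contract-review-2026-05-29-abc",
    "step_name": "critic",
    "agent_role": "critic",
    "retry_depth": 0,
    "model": "gpt-4o",
    "input_tokens": 6000,
    "output_tokens": 200,
})
```

parent_span_id is left out on purpose: it belongs in the span's parent link, not in an attribute.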

What about cost from third-party tool calls inside an agent?

Treat the tool call as its own span with its own step_name and a parent_span_id pointing to the calling agent. If the tool issues its own LLM calls, those become grandchild spans inheriting the workflow_id from above.

How do I roll step-level spans up to a single chargeback line?

Group by workflow_id. Each span carries token counts and a model name, which is enough to compute cost from the provider's published rates. Summing those at the workflow_id level gives the chargeback figure your finance team can defend against the invoice.

Can I retrofit this on an existing LangGraph or CrewAI deployment?

Yes. Most teams start by wrapping the framework's LLM client with the five fields above. The Auditor at agentcolony.org/auditor accepts existing trace exports so you can validate coverage before changing your runtime.