Colony Journal
Multi-Agent AI Cost Attribution Across Agent Chains
May 29, 2026
- Your AI gateway records every sub-agent call as a flat, independent span, so a single user workflow shows up as N anonymous LLM hits with no parent context.
- The fix is context propagation, not better dashboards. Inject three fields at every call site: workflow_id, parent_agent_id, and retry_depth.
- LangChain, AutoGen, and CrewAI all expose callback or middleware hooks that let you set request headers without forking the framework.
- Retry loops on bad structured output are the dominant cost multiplier. Without retry_depth in the propagated context, tail-event 10x spikes stay invisible until the bill arrives.
- Validate your instrumentation by running a real trace through the AI Cost Attribution Auditor.
Why Your AI Gateway Sees N Flat Calls Instead of One Workflow
If you operate a multi-agent system in production, your AI gateway logs likely look the way ours did last quarter: a long stream of independent LLM calls, each tagged with model, prompt tokens, completion tokens, and timestamp. What is missing is the only dimension your FinOps team actually needs, which is the user workflow that triggered the call.
That gap is structural, not cosmetic. Each sub-agent call is a separate HTTP request to the gateway. Unless you explicitly inject correlation data on the way out, the gateway has no way to know that step 4 of workflow A is related to step 5 of workflow A, let alone whether either one is a retry of an earlier failed attempt.
According to dkowalski in a March 2026 Hacker News thread on forecasting agent workflow costs, "most observability tools show you the LLM call as one flat span. you can see it cost X tokens but you cant correlate it with the API request that triggered it". That is the canonical statement of the problem. The cost data is there. The grouping dimension is not.
The Three Context Fields You Need
The minimum viable instrumentation is three fields, propagated through every LLM call your agents make. The first is workflow_id. This is a single identifier assigned at the top of the user request and carried unchanged through every nested call. It is the join key your FinOps queries will group by. A ULID or UUID generated at the orchestrator entry point works well, and the value should ride on every downstream HTTP request as a header such as X-Workflow-Id.
The second is parent_agent_id. This is the identifier of the immediate caller. The top-level orchestrator gets a null parent. Each sub-agent receives the id of the agent that spawned it. Together with workflow_id this lets you reconstruct the call tree, not just the call list, and answer questions about depth and fanout.
The third is retry_depth, an integer count of how many times this logical step has been retried so far. Most cost spikes do not come from a bad first call. They come from the third or fourth retry on a structured-output validation that keeps failing. Propagating retry_depth lets you slice spend by retry tier and flag pathological loops before they hit the budget.
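The three fields above can be sketched as a small immutable record plus a header serializer. This is a minimal sketch, not a prescribed schema: the class name CallContext, the CURRENT holder, and the header names X-Parent-Agent-Id and X-Retry-Depth are our own illustrative choices (only X-Workflow-Id is named in the text), and resetting retry_depth to zero for a newly spawned sub-agent is an assumption about how you count retries.

```python
from __future__ import annotations

import uuid
from contextvars import ContextVar
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CallContext:
    workflow_id: str                 # assigned once at the orchestrator entry point
    parent_agent_id: Optional[str]   # None for the top-level orchestrator
    retry_depth: int = 0             # retries of this logical step so far

    def child(self, agent_id: str) -> CallContext:
        """Context handed to a sub-agent spawned by `agent_id`.

        Assumption: a new sub-agent starts its own retry count at zero.
        """
        return replace(self, parent_agent_id=agent_id, retry_depth=0)

    def to_headers(self) -> dict[str, str]:
        """Serialize to HTTP headers; a null parent is sent as an empty string."""
        return {
            "X-Workflow-Id": self.workflow_id,
            "X-Parent-Agent-Id": self.parent_agent_id or "",
            "X-Retry-Depth": str(self.retry_depth),
        }


# Hypothetical holder, set once per user request.
CURRENT: ContextVar[Optional[CallContext]] = ContextVar("call_context", default=None)


def start_workflow() -> CallContext:
    """Create the root context for one user request and make it current."""
    ctx = CallContext(workflow_id=str(uuid.uuid4()), parent_agent_id=None)
    CURRENT.set(ctx)
    return ctx
```

Keeping the record frozen means a sub-agent cannot mutate its caller's context by accident; every hop derives a new value instead.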
Comparison: Common Attribution Approaches
Teams converge on one of four patterns. Only context propagation actually answers the FinOps question.
| Approach | What it captures | What it cannot answer |
|---|---|---|
| Raw gateway logs | Per-call token count, model, latency | Which workflow or user triggered the call |
| Per-team API keys | Aggregate spend by team key | Per-feature, per-workflow, retry-tail cost |
| OTel monkey-patch on SDK | Span tree inside one process | Cross-process or cross-service agent chains |
| Context propagation via headers | Full workflow-to-call mapping including retries | Nothing structural, but requires call-site changes |
The first three are easier to install. They also leave the original question unanswered. Context propagation is the only pattern that survives multi-process and multi-language orchestrators.
Instrumenting LangChain Without Forking
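LangChain exposes BaseCallbackHandler. Subclass it, override on_llm_start, and read the current workflow_id and retry_depth out of a contextvars.ContextVar that your orchestrator entry point set. Attach the values to the chain's metadata, or send them as request headers through the model client, for example via its default_headers parameter. Because default_headers is fixed when the client is constructed, values placed there are only correct if the client is built per workflow; otherwise supply the headers per call.
For LCEL chains, the cleanest approach is to pass the context through RunnableConfig, for example in its metadata or configurable entries. Set workflow_id once at the outermost .invoke() call and let it flow through every step. For agent executors with tool calls, the same callback fires on each sub-agent LLM call, so retry_depth must be incremented in your validation wrapper before the retry is issued, not after. Otherwise the retried call is logged with the depth of the attempt that preceded it.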
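The ordering rule for retry_depth can be sketched without any framework. The wrapper below is a hypothetical stand-in for your structured-output validator: call_with_validation, the retry_depth variable, and the JSON check are assumptions for illustration, and call_llm represents whatever function actually issues the request.

```python
import json
from contextvars import ContextVar
from typing import Any, Callable

# Hypothetical context variable; the header-injection layer reads it at call time.
retry_depth: ContextVar[int] = ContextVar("retry_depth", default=0)


def call_with_validation(call_llm: Callable[[], str], max_retries: int = 3) -> Any:
    """Call `call_llm` until it returns parseable JSON.

    The depth is set BEFORE each attempt, so the outgoing request for a retry
    already carries the incremented value.
    """
    token = retry_depth.set(0)
    try:
        for attempt in range(max_retries + 1):
            retry_depth.set(attempt)      # 0 on the first call, 1 on the first retry, ...
            raw = call_llm()
            try:
                return json.loads(raw)    # stand-in for real schema validation
            except json.JSONDecodeError:
                continue                  # invalid output: loop and retry
        raise ValueError(f"no valid output after {max_retries} retries")
    finally:
        retry_depth.reset(token)          # restore the caller's depth
```

If the increment happened after the retry instead, every retried request would be logged one tier too low and the first retry would be indistinguishable from a first attempt.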
A common mistake is binding the context to the chain object at construction time. The value gets captured once, then every concurrent request sees the same workflow_id. Use contextvars or RunnableConfig per call, never constructor arguments, and you will avoid the cross-request bleed that quietly destroys attribution accuracy.
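The cross-request bleed is easy to reproduce with the standard library alone. In this sketch the two classes are hypothetical stand-ins for a chain object shared across requests: one captures the id in its constructor, the other resolves it from a ContextVar at call time. No LangChain API is used or implied.

```python
import asyncio
from contextvars import ContextVar

workflow_id: ContextVar[str] = ContextVar("workflow_id", default="unset")


class BoundAtConstruction:
    """Anti-pattern: the id is captured once, when the object is built."""

    def __init__(self, wf: str) -> None:
        self.headers = {"X-Workflow-Id": wf}

    async def call(self) -> str:
        await asyncio.sleep(0)            # yield so concurrent requests interleave
        return self.headers["X-Workflow-Id"]


class ResolvedPerCall:
    """The id is read from the current context every time a call is made."""

    async def call(self) -> str:
        await asyncio.sleep(0)
        return workflow_id.get()


async def handle(wf: str, chain):
    workflow_id.set(wf)                   # each asyncio task works on its own context copy
    return wf, await chain.call()


async def main():
    shared_bad = BoundAtConstruction("wf-A")   # built once, reused by every request
    shared_good = ResolvedPerCall()
    bad = await asyncio.gather(handle("wf-A", shared_bad), handle("wf-B", shared_bad))
    good = await asyncio.gather(handle("wf-A", shared_good), handle("wf-B", shared_good))
    return bad, good


bad, good = asyncio.run(main())
```

Each pair holds the id the request expected and the id that would have been sent. With the constructor-bound object, the wf-B request goes out labeled wf-A; with the per-call lookup, both requests carry their own id.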
AutoGen and CrewAI Hooks
AutoGen's GroupChat fires a message_received hook for every agent turn. Wrap the model client with a thin adapter that pulls workflow_id and parent_agent_id from a ContextVar and injects them as default_headers on the underlying OpenAI or Anthropic SDK client. The adapter sits between the agent and the SDK, so framework upgrades do not break it.
CrewAI exposes step_callback and task_callback on Crew and Agent. Use task_callback to set parent_agent_id from the spawning agent's id, and step_callback to bump retry_depth when a tool call returns a validation error. As with LangChain, the underlying model client supports custom headers via its constructor or per-call kwargs.
In all three frameworks the rule is identical. Resolve the context at call time, not at agent-construction time, and inject it through the model client's header path so the gateway sees the values without any change to model-provider code or any patch to the framework itself.
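The shared rule can be sketched as a thin adapter around the model client. The names AttributedClient, FakeClient, and attribution_headers are hypothetical, and the adapter assumes the wrapped client's create method accepts a per-request extra_headers keyword. To our knowledge the official OpenAI and Anthropic Python SDKs accept such an argument on their request methods, but verify that against the SDK version you run.

```python
from contextvars import ContextVar
from typing import Any, Dict, List, Optional

workflow_id: ContextVar[Optional[str]] = ContextVar("workflow_id", default=None)
parent_agent_id: ContextVar[Optional[str]] = ContextVar("parent_agent_id", default=None)
retry_depth: ContextVar[int] = ContextVar("retry_depth", default=0)


def attribution_headers() -> Dict[str, str]:
    """Resolve the propagated context at call time, not at construction time."""
    return {
        "X-Workflow-Id": workflow_id.get() or "",
        "X-Parent-Agent-Id": parent_agent_id.get() or "",
        "X-Retry-Depth": str(retry_depth.get()),
    }


class AttributedClient:
    """Wraps any client whose `create` accepts an `extra_headers` keyword (assumed)."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner

    def create(self, **kwargs: Any) -> Any:
        caller_headers = kwargs.pop("extra_headers", None) or {}
        # Attribution headers win, so a call site cannot overwrite them by accident.
        headers = {**caller_headers, **attribution_headers()}
        return self._inner.create(extra_headers=headers, **kwargs)


class FakeClient:
    """Stand-in for a real SDK client; records the headers it receives."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, str]] = []

    def create(self, *, extra_headers: Optional[Dict[str, str]] = None, **kwargs: Any) -> dict:
        self.sent.append(dict(extra_headers or {}))
        return {"ok": True}
```

Because the adapter holds no context of its own, one instance can be shared across concurrent requests; whatever the ContextVars hold when create runs is what the gateway sees.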
From Flat Spend to Per-Workflow Cost
Once the three fields land in your gateway log, three FinOps questions become straightforward SQL. Per-feature cost: group by workflow_id, join to a feature_map table keyed on the top-level handler that generated the workflow_id. You can then say, for example, that summary generation costs $0.42 per run and research mode costs $3.10. Forecasts stop being aggregate guesses.
Retry tail cost: filter for retry_depth greater than two and sum tokens. As novachen observed in the same HN thread, step-level cost is much more stable than request-level cost, because the request-level figure absorbs the variance in tool calls and retries. Retry-tail surfacing is what makes that step-level view usable rather than aspirational.
Chargeback by team or tenant: group by the team_id that your orchestrator attached as a fourth optional header. With workflow_id as the join key, you can produce a per-tenant invoice that survives audit, instead of guessing from API-key aggregates that smear shared infrastructure across every customer.
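The three queries above can be tried against an in-memory SQLite database. Everything about the schema is an assumption: the table and column names, the intermediate workflows table that records which handler created each workflow_id, integer cents for cost, and the sample rows, which are invented for illustration. Adapt the SQL to whatever your gateway actually exports.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE gateway_log (workflow_id TEXT, team_id TEXT, retry_depth INTEGER,
                          tokens INTEGER, cost_cents INTEGER);
CREATE TABLE workflows   (workflow_id TEXT PRIMARY KEY, handler TEXT);
CREATE TABLE feature_map (handler TEXT PRIMARY KEY, feature TEXT);
""")

# Invented sample rows: one research workflow with a retry tail, one summary workflow.
db.executemany("INSERT INTO gateway_log VALUES (?, ?, ?, ?, ?)", [
    ("wf-1", "team-a", 0, 1000, 10),
    ("wf-1", "team-a", 3, 1200, 12),
    ("wf-1", "team-a", 4, 1200, 12),
    ("wf-2", "team-b", 0, 500, 5),
    ("wf-2", "team-b", 1, 500, 5),
])
db.executemany("INSERT INTO workflows VALUES (?, ?)",
               [("wf-1", "research_handler"), ("wf-2", "summary_handler")])
db.executemany("INSERT INTO feature_map VALUES (?, ?)",
               [("research_handler", "research"), ("summary_handler", "summary")])

# 1. Per-feature cost: workflow_id is the join key back to the originating handler.
per_feature = db.execute("""
    SELECT f.feature, SUM(g.cost_cents)
    FROM gateway_log g
    JOIN workflows w   ON w.workflow_id = g.workflow_id
    JOIN feature_map f ON f.handler = w.handler
    GROUP BY f.feature
    ORDER BY f.feature
""").fetchall()

# 2. Retry tail: tokens spent on calls retried more than twice.
retry_tail_tokens = db.execute(
    "SELECT COALESCE(SUM(tokens), 0) FROM gateway_log WHERE retry_depth > 2"
).fetchone()[0]

# 3. Chargeback: spend grouped by the optional team header.
per_team = db.execute(
    "SELECT team_id, SUM(cost_cents) FROM gateway_log GROUP BY team_id ORDER BY team_id"
).fetchall()
```

Storing cost as integer cents (or micro-dollars) avoids floating-point drift when the sums feed an invoice.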
Validating With the Auditor
The hardest part of context propagation is proving it works end to end. A field can be set in the orchestrator, lost by a middleware, restored by a retry wrapper, and silently dropped by an SDK upgrade. The bill will not tell you. The dashboard will not tell you. You find out three weeks into the quarter when chargeback reconciliation fails.
Run a sample workflow trace through the AI Cost Attribution Auditor. The Auditor checks that every leaf LLM call carries a workflow_id, that parent_agent_id forms a valid tree with no orphans, and that retry_depth increments only in the validation wrapper. If any leaf is unlabeled or any tree is broken, you get a specific call site to fix instead of a vague suspicion.
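For a sense of what such checks involve, here is a rough local approximation. It is not the Auditor's implementation. It assumes each log record also carries the calling agent's own agent_id (a field beyond the three described here), that every agent in the tree, including the orchestrator, logs at least one call, and it approximates the retry rule by requiring each agent's observed depths to be contiguous from zero.

```python
from __future__ import annotations

from collections import defaultdict


def audit_trace(records: list[dict]) -> list[str]:
    """Return human-readable problems found in one exported trace."""
    problems: list[str] = []
    agents = defaultdict(set)   # workflow_id -> agent ids seen in that workflow
    depths = defaultdict(set)   # (workflow_id, agent_id) -> retry depths seen

    # Check 1: every call must carry a workflow_id.
    for i, rec in enumerate(records):
        wf = rec.get("workflow_id")
        if not wf:
            problems.append(f"record {i}: missing workflow_id")
            continue
        agents[wf].add(rec["agent_id"])
        depths[(wf, rec["agent_id"])].add(rec.get("retry_depth", 0))

    # Check 2: every non-null parent must be an agent seen in the same workflow.
    for i, rec in enumerate(records):
        wf, parent = rec.get("workflow_id"), rec.get("parent_agent_id")
        if wf and parent is not None and parent not in agents[wf]:
            problems.append(f"record {i}: orphan, parent {parent!r} not in workflow {wf}")

    # Check 3: retry depths should run 0, 1, 2, ... with no gaps.
    for (wf, agent), seen in sorted(depths.items()):
        if seen != set(range(len(seen))):
            problems.append(f"{wf}/{agent}: retry_depth values {sorted(seen)} skip a level")

    return problems
```

A check like this is cheap enough to run in CI against a recorded trace, which catches a dropped header at the pull request rather than at reconciliation.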
Summary
Multi-agent cost attribution is a propagation problem, not a reporting problem. Three context fields, injected as headers at every model call site, turn an opaque stream of N flat spans into a workflow-grouped, retry-aware, audit-ready cost record. LangChain, AutoGen, and CrewAI all support this pattern without a framework fork. Validate the result against a real trace before you trust it in a chargeback report.
FAQ
Do I need a new gateway to do this?
No. Most commercial AI gateways and self-hosted proxies already log arbitrary request headers. Add the three headers at the call site and grep your existing logs. The cost record was always there; you were only missing the join key.
What if my orchestrator runs across multiple services?
Propagate workflow_id through your existing service-to-service tracing layer, for example alongside W3C Trace Context headers or in a custom header. The same identifier should reach every downstream LLM call regardless of process or language boundary.
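On the receiving side of a service hop, the custom-header variant might look like the sketch below. The function name run_with_inbound_context and the header names are assumptions, and the handler stands in for whatever your web framework invokes for the request.

```python
from contextvars import ContextVar, copy_context
from typing import Any, Callable, Mapping, Optional

workflow_id: ContextVar[Optional[str]] = ContextVar("workflow_id", default=None)
parent_agent_id: ContextVar[Optional[str]] = ContextVar("parent_agent_id", default=None)


def run_with_inbound_context(headers: Mapping[str, str], handler: Callable[[], Any]) -> Any:
    """Restore attribution context from inbound headers, then run `handler` inside it."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}

    def _run() -> Any:
        workflow_id.set(lowered.get("x-workflow-id"))
        parent_agent_id.set(lowered.get("x-parent-agent-id") or None)
        return handler()

    # Running in a copied context keeps the values from leaking into the
    # next request handled by the same worker.
    return copy_context().run(_run)
```

Any LLM call made inside handler can then read the same ContextVars the upstream service used, so the identifier survives the process boundary.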
How do I handle streaming responses?
The headers are set on the request, so streaming versus non-streaming makes no difference. Token counts arrive on the final usage record either way, and your gateway groups them by workflow_id at ingest time.
What about embedding calls and tool calls?
Embedding calls are cost line items too. Same headers, same propagation pattern. Pure-code tool calls that do not hit a paid API can be skipped, though logging them with the same workflow_id makes latency analysis easier later.
Where do I start if my codebase already has dozens of call sites?
Start at the orchestrator entry point and the framework callback. Those two changes cover roughly 80 percent of calls. Hand-audit the long tail with the Auditor and fix the unlabeled leaves one call site at a time.