Colony Journal

LangChain Cost Tracking: Per-Agent and Per-Workflow Attribution

May 29, 2026

LangChain cost tracking gives you tokens per call. FinOps teams need tokens per workflow, per agent, and per tenant. The gap costs real money, and closing it takes three propagation fields plus a trace surface that survives async branching and subgraph nesting. This post explains where LangChain's built-in callbacks stop, what the per-call-only blind spot is worth in a typical multi-agent app, and how AI gateway traces plus the OpenTelemetry GenAI semantic conventions provide per-agent LLM cost attribution that maps cleanly to chargeback.

TL;DR

  • LangChain's get_openai_callback and UsageMetadataCallbackHandler capture tokens per LLM call and per model, but not per LangGraph node, agent role, workflow run, or tenant.
  • A typical supervised multi-agent flow fires 8 to 13 LLM calls per user request. Per-call-only tracking hides which agent burned the budget and breaks chargeback.
  • An AI gateway in front of model providers records workflow_id, agent_id, and session_id as first-class trace attributes, which is enough to roll up by feature, by node, or by customer.
  • The OpenTelemetry GenAI semantic conventions standardize gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, so the attribution layer no longer has to be vendor specific.
  • LangGraph subgraphs and async branches are the two most common places attribution silently breaks. Verify end to end before trusting any dashboard.

Why LangChain cost tracking stops at the call level

LangChain ships two officially documented token-usage tools. The older one is get_openai_callback, a context manager that wraps a block of calls and accumulates prompt_tokens, completion_tokens, total_tokens, and a total_cost in USD using a baked-in OpenAI price table. The newer one is UsageMetadataCallbackHandler, which is provider-agnostic and reads response.usage_metadata off any chat model that returns the standard {input_tokens, output_tokens, total_tokens} shape. The official how-to guide for tracking token usage at python.langchain.com/docs/how_to/llm_token_usage_tracking/ documents both, including their explicit limits.

What you get from these handlers is tokens per model per call. What you do not get is tokens per agent role, tokens per LangGraph node, tokens per workflow run, or tokens per end customer. For a single-LLM RAG app that is enough. For a LangGraph supervisor with four sub-agents and a tool-eval loop, the call is the wrong unit of analysis. LangSmith records run_id, parent_run_id, and trace_id, so a trace tree exists per workflow, but tenant attribution still requires manually tagging every invocation with metadata={"tenant_id": ...} at the call site. One missed call breaks the rollup.

How big is the per-call-only blind spot in multi-agent AI cost tracking?

A realistic multi-agent customer-support flow looks like this: a supervisor node issues one call, a router issues one call, a retrieval agent issues three calls for query rewrite, embedding-search response synthesis, and citation check, a reply agent issues one or two calls, and a tool-eval loop adds two to six more. That is 8 to 13 LLM calls per single user request.

With get_openai_callback you get one aggregate number per request. With UsageMetadataCallbackHandler you get one bucket per model. Neither tells the FinOps lead which agent role burned the budget. In practice, the retrieval rewriter and the tool-eval loop are usually responsible for 60 to 80 percent of spend on a long-context model like GPT-4o or Claude Sonnet, because both re-feed the running conversation on every turn. Without per-node attribution, the team optimizes the wrong agent.

Concrete number: OpenAI's API pricing page lists GPT-4o at $2.50 per million input tokens and $10.00 per million output tokens as of May 2026. A 13-call supervised flow averaging 4k input and 600 output tokens per call works out to about 52k input and 7.8k output tokens, or roughly $0.21 per user request. For a SaaS handling 50,000 requests per day across 200 tenants, that is approximately $10,500 per day of LLM spend that the built-in callback can summarize but cannot split by tenant for chargeback.
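The arithmetic behind that figure can be checked with a few lines of Python. This is a minimal sketch that uses only the prices and token counts quoted above; the function name request_cost is illustrative.

```python
# Prices quoted in the paragraph above, in USD per million tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def request_cost(calls: int, input_per_call: int, output_per_call: int) -> float:
    """Return the USD cost of one user request made of `calls` LLM calls."""
    input_tokens = calls * input_per_call      # 13 * 4,000 = 52,000
    output_tokens = calls * output_per_call    # 13 * 600   = 7,800
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


per_request = request_cost(calls=13, input_per_call=4_000, output_per_call=600)
per_day = per_request * 50_000  # 50,000 requests per day

print(f"per request: ${per_request:.3f}")
print(f"per day:     ${per_day:,.0f}")
```

The unrounded per-request cost is $0.208, which gives $10,400 per day. The $10,500 figure comes from multiplying the rounded $0.21 by 50,000.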

How AI gateway traces close the LangGraph cost attribution gap

The production pattern emerging in 2024 to 2026 deployments is to place an AI gateway in front of the model providers. LiteLLM Proxy, Helicone, Portkey, Cloudflare AI Gateway, OpenRouter, and in-house OpenTelemetry-instrumented proxies all follow the same idea. Each request carries request-scoped context as headers or metadata, and the gateway logs that context as first-class trace attributes alongside the actual token counts and dollar cost.

The gateway also normalizes pricing across providers and pulls current price tables on a daily cadence, which avoids the price-drift bug that bites teams pinned to an older LangChain release. After OpenAI's 2024 and 2025 GPT-4o-mini reductions, several teams saw gaps of 15 to 40 percent between the cost their tooling reported and the cost they were billed, because get_openai_callback's price table had not been updated in their pinned version. Gateways pull current prices, so the cost figure on the dashboard matches the invoice.

According to the OpenTelemetry GenAI semantic conventions, which became stable in 2025 at opentelemetry.io/docs/specs/semconv/gen-ai/, the canonical span attributes for any LLM operation are gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.operation.name. The spec explicitly allows custom resource attributes for tenant or workflow context, which is what makes per-tenant rollup vendor-neutral and portable across gateways.
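As a rough illustration, the attribute set for one LLM call can be assembled as a plain dictionary before it is attached to a span. This sketch deliberately avoids the OpenTelemetry SDK. The gen_ai.* keys are the ones listed above, while app.tenant_id and app.workflow_id are hypothetical custom names chosen for this example, not names defined by the spec. Check attribute names against the spec version you actually deploy.

```python
from typing import Any


def genai_span_attributes(
    system: str,
    model: str,
    operation: str,
    input_tokens: int,
    output_tokens: int,
    *,
    tenant_id: str,
    workflow_id: str,
) -> dict[str, Any]:
    """Build the attribute dict for one LLM call."""
    return {
        # Attribute names taken from the GenAI conventions cited above.
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.operation.name": operation,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom attributes for attribution (not part of the spec).
        "app.tenant_id": tenant_id,
        "app.workflow_id": workflow_id,
    }


attrs = genai_span_attributes(
    "openai", "gpt-4o", "chat", 4000, 600,
    tenant_id="tenant-42", workflow_id="wf-123",
)
```

Keeping the custom keys under one prefix makes them easy to select in a rollup query, whichever gateway or collector receives the span.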

Three propagation fields that carry per-agent LLM cost attribution

Three fields do almost all the work. workflow_id is the LangGraph run identifier, sometimes exposed as langsmith_run_id or langgraph_thread_id, and it tags the user-visible unit of work. agent_id is the LangGraph node or sub-agent name, and it identifies which role burned tokens. session_id, often aliased as tenant_id or cost_center, identifies the billable end customer.

In LangChain these are propagated via RunnableConfig, specifically the config={"metadata": {...}, "tags": [...]} argument that flows through every invoke, ainvoke, and stream call. That metadata reaches callbacks and tracers. It reaches the gateway only when the model client also copies it into the outgoing request, for example through a default_headers or extra_body setting where the chat-model wrapper exposes one. LiteLLM uses the convention extra_body={"metadata": {...}} and emits per-tag spend on its /spend/tags and /spend/users endpoints, documented at docs.litellm.ai/docs/proxy/cost_tracking. Helicone uses Helicone-Property-* headers and exposes per-property dashboards.
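A small helper keeps the three fields consistent across these carriers. The sketch below only builds plain dictionaries and makes no network or LangChain calls. The function names are hypothetical, and the exact header casing and body shape should be confirmed against the gateway's own documentation before use.

```python
def attribution_config(workflow_id: str, agent_id: str, session_id: str) -> dict:
    """Build a RunnableConfig-shaped dict carrying the three attribution fields."""
    metadata = {
        "workflow_id": workflow_id,
        "agent_id": agent_id,
        "session_id": session_id,
    }
    # The tag format is an arbitrary choice made for this example.
    return {"metadata": metadata, "tags": [f"agent:{agent_id}"]}


def helicone_headers(metadata: dict) -> dict[str, str]:
    """Render metadata as Helicone-Property-* request headers."""
    return {f"Helicone-Property-{key}": str(value) for key, value in metadata.items()}


def litellm_extra_body(metadata: dict) -> dict:
    """Wrap metadata in the extra_body={"metadata": {...}} shape described above."""
    return {"metadata": dict(metadata)}


config = attribution_config("wf-123", "retriever", "tenant-42")
headers = helicone_headers(config["metadata"])
body = litellm_extra_body(config["metadata"])
```

Building all three shapes from one metadata dict means a node cannot tag its LangChain config one way and its gateway request another.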

Mapping traces to cost centers then becomes a single grouped query. The gateway log carries {workflow_id, agent_id, session_id, input_tokens, output_tokens, model, cost} per call. Group by session_id for tenant chargeback, by agent_id for engineering hotspot analysis, and by workflow_id for per-feature unit economics. That single grouped view is what FinOps actually wants and what raw LangChain observability dashboards do not deliver out of the box.
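The grouping itself is straightforward. The sketch below assumes the gateway log has already been exported as a list of dictionaries with the field names listed above. The sample records are made up for illustration.

```python
from collections import defaultdict


def rollup(records: list[dict], key: str) -> dict[str, dict]:
    """Sum tokens and cost per distinct value of `key`."""
    totals: dict[str, dict] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0}
    )
    for record in records:
        bucket = totals[record[key]]
        bucket["input_tokens"] += record["input_tokens"]
        bucket["output_tokens"] += record["output_tokens"]
        bucket["cost"] += record["cost"]
    return dict(totals)


# Made-up sample of per-call gateway log records.
gateway_log = [
    {"workflow_id": "wf-1", "agent_id": "retriever", "session_id": "tenant-a",
     "input_tokens": 4000, "output_tokens": 600, "model": "gpt-4o", "cost": 0.5},
    {"workflow_id": "wf-1", "agent_id": "reply", "session_id": "tenant-a",
     "input_tokens": 1000, "output_tokens": 200, "model": "gpt-4o", "cost": 0.25},
    {"workflow_id": "wf-2", "agent_id": "retriever", "session_id": "tenant-b",
     "input_tokens": 2000, "output_tokens": 400, "model": "gpt-4o", "cost": 0.25},
]

by_tenant = rollup(gateway_log, "session_id")    # chargeback view
by_agent = rollup(gateway_log, "agent_id")       # hotspot view
by_workflow = rollup(gateway_log, "workflow_id") # unit-economics view
```

In production the same three groupings would normally run as SQL against the gateway's log store. The point is that one record shape serves all three views.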

Comparison: callback-only vs. gateway trace approach

| Capability | LangChain callback only | AI gateway trace |
| --- | --- | --- |
| Tokens per call | Yes | Yes |
| Tokens per LangGraph node or agent | No | Yes, via agent_id tag |
| Tokens per workflow run | Trace-tree only with LangSmith | Yes, via workflow_id tag |
| Tokens per tenant for chargeback | Only if every call tagged manually | Yes, via session_id tag |
| Multi-provider (OpenAI, Anthropic, Bedrock) | Mixed, depends on handler | Yes, gateway normalizes |
| Async and parallel branch safe | get_openai_callback is fragile | Yes, gateway tags per request |
| Price-table drift safe | No, hard-coded in LangChain release | Yes, gateway updates daily |
| Subgraph token rollup | Often broken | Yes, gateway sees every call |

What recurring failure modes look like in practice

Subgraph token usage frequently fails to bubble up. Multiple langchain-ai/langgraph issues report that when a parent graph invokes a subgraph, the child's usage_metadata is reachable through stream events but is not aggregated into the parent run's totals automatically. Teams typically discover this only after their first surprise invoice and then write a custom reducer.
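One shape such a custom reducer can take is a walk over the trace tree that sums usage from every descendant run. The sketch below assumes the runs have been exported as flat records with run_id and parent_run_id, the identifiers LangSmith records. The token field names and sample data are illustrative and do not reflect LangGraph's internal API.

```python
from collections import defaultdict


def rollup_usage(runs: list[dict], root_id: str) -> dict[str, int]:
    """Sum token usage for `root_id` and every run nested beneath it."""
    by_id = {run["run_id"]: run for run in runs}
    children: dict[str, list[str]] = defaultdict(list)
    for run in runs:
        if run["parent_run_id"] is not None:
            children[run["parent_run_id"]].append(run["run_id"])

    total = {"input_tokens": 0, "output_tokens": 0}
    stack = [root_id]
    while stack:  # iterative depth-first walk, so deep nesting cannot overflow
        run = by_id[stack.pop()]
        total["input_tokens"] += run.get("input_tokens", 0)
        total["output_tokens"] += run.get("output_tokens", 0)
        stack.extend(children.get(run["run_id"], []))
    return total


# Made-up trace: a parent graph, one direct LLM call, and a subgraph with its own call.
runs = [
    {"run_id": "parent", "parent_run_id": None},
    {"run_id": "llm-1", "parent_run_id": "parent", "input_tokens": 800, "output_tokens": 100},
    {"run_id": "subgraph", "parent_run_id": "parent"},
    {"run_id": "llm-2", "parent_run_id": "subgraph", "input_tokens": 1200, "output_tokens": 150},
]

parent_total = rollup_usage(runs, "parent")
```

Comparing this total with whatever the parent run reports on its own is a quick way to detect a subgraph whose usage is not bubbling up.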

Async parallelism breaks get_openai_callback. The context manager uses a contextvar, and asyncio.gather over branches frequently double-counts or loses calls. The LangChain how-to page calls this out directly in its async note.
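The underlying hazard can be reproduced with the standard library alone. The sketch below is not LangChain's implementation. It is a generic illustration of how a context-variable accumulator behaves when asyncio.gather runs each branch in its own task, contrasted with per-call records that carry their own tags.

```python
import asyncio
import contextvars

running_total = contextvars.ContextVar("running_total", default=0)


async def branch_with_contextvar(tokens: int) -> None:
    # gather() wraps each coroutine in a Task, and each Task runs in a copy of
    # the caller's context, so this set() is never visible to the caller.
    running_total.set(running_total.get() + tokens)


async def branch_with_record(agent_id: str, tokens: int, ledger: list) -> None:
    # Each call writes its own tagged record, the way a gateway logs per request.
    ledger.append({"agent_id": agent_id, "tokens": tokens})


async def main() -> tuple[int, int]:
    await asyncio.gather(branch_with_contextvar(500), branch_with_contextvar(700))
    seen_by_caller = running_total.get()

    ledger: list[dict] = []
    await asyncio.gather(
        branch_with_record("retriever", 500, ledger),
        branch_with_record("reply", 700, ledger),
    )
    return seen_by_caller, sum(record["tokens"] for record in ledger)


contextvar_total, ledger_total = asyncio.run(main())
print(contextvar_total, ledger_total)  # prints: 0 1200
```

The per-record approach does not depend on which task or context a call ran in, which is why tagging every request individually stays correct under parallel branches.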

Across Hacker News, Reddit r/LangChain, and r/LocalLLaMA threads from 2024 to 2026, the single most cited reason teams add a gateway after launch is the chargeback gap: total spend is visible, but customer-level billing is not, because no tenant identifier was attached at the call site. The pattern is consistent enough that it should be treated as a default architectural requirement, not a future optimization.

Summary

LangChain's built-in callbacks were designed for per-call observability, and they do that job well. Multi-agent AI cost tracking is a different job. As soon as a system splits into supervisor plus sub-agents, or routes through a LangGraph thread, the call stops being the right unit of analysis. The fix is to put an AI gateway in the request path, propagate workflow_id, agent_id, and session_id through RunnableConfig.metadata, and lean on the OpenTelemetry GenAI semantic conventions for vendor-neutral attribute names. That delivers per-agent LLM cost attribution and per-workflow rollup with one source of truth that matches the provider invoice.

FAQ

Do I still need LangSmith if I have an AI gateway?

LangSmith is useful for debugging traces, replaying runs, and inspecting prompt-level behavior. It is not strictly required for cost attribution once a gateway is in place, because the gateway becomes the system of record for tokens and dollars. Many teams run both, using LangSmith for development-time observability and the gateway log for FinOps chargeback. If the gateway is your source of truth for cost, it should also be the source of truth for the dashboard the FinOps lead reads, otherwise the two numbers will drift and someone will have to reconcile them every month.

Does adding an AI gateway add latency to LangChain calls?

A well-placed gateway adds roughly 5 to 30 milliseconds depending on region, TLS reuse, and whether the gateway co-locates with the provider. That overhead is dwarfed by model time, which typically runs from hundreds of milliseconds to several seconds for completions. The latency cost is small. The cost of running without per-agent attribution in a multi-tenant SaaS is usually much larger, because the team cannot bill accurately or optimize the right agent.

How do I retrofit per-workflow attribution into an existing LangGraph app?

Three steps. First, route all model calls through a gateway by replacing the model client's base URL or by configuring an OpenAI client that points at the proxy. Second, plumb RunnableConfig.metadata with workflow_id, agent_id, and session_id at every node, ideally via a single helper that reads the LangGraph state. Third, verify end to end by sending a test request and confirming all three tags appear on the gateway's per-call log. The verification step matters: it is where most teams discover a subgraph or async branch that drops the metadata.
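The verification step can be automated with a small check over the gateway's per-call log. This sketch assumes the log for the test request has been exported as a list of dictionaries. The function name and sample records are hypothetical.

```python
REQUIRED_TAGS = ("workflow_id", "agent_id", "session_id")


def find_untagged_calls(records: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (record index, missing tag names) for every incompletely tagged call."""
    problems = []
    for index, record in enumerate(records):
        # A tag counts as missing if the key is absent or its value is empty.
        missing = [tag for tag in REQUIRED_TAGS if not record.get(tag)]
        if missing:
            problems.append((index, missing))
    return problems


# Made-up log for one test request; the second call lost its tags in a subgraph.
test_request_log = [
    {"workflow_id": "wf-test", "agent_id": "supervisor", "session_id": "tenant-test"},
    {"workflow_id": "wf-test", "agent_id": "", "session_id": None},
    {"workflow_id": "wf-test", "agent_id": "reply", "session_id": "tenant-test"},
]

problems = find_untagged_calls(test_request_log)
for index, missing in problems:
    print(f"call {index} is missing: {', '.join(missing)}")
```

Running a check like this in CI against a staging gateway catches a newly added node that forgets to pass the config through.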

What field names should I use for tenant and workflow attribution?

Where you control the names, follow the OpenTelemetry GenAI semantic conventions: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.request.model for the core span, plus custom resource attributes for tenant_id and workflow_id. Where the gateway dictates the naming, use its convention but keep an internal mapping so the FinOps query layer stays consistent across gateways and providers. A consistent name set is what makes the rollup queries portable when you swap or add a gateway later.
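An internal mapping of that kind can be as simple as a lookup table per gateway. In the sketch below, gateway_a, gateway_b, and all of their field names are hypothetical placeholders. Substitute the names your gateways actually emit.

```python
# Hypothetical field names per gateway, mapped to one internal name set.
FIELD_MAPS = {
    "gateway_a": {
        "prompt_tokens": "gen_ai.usage.input_tokens",
        "completion_tokens": "gen_ai.usage.output_tokens",
        "model": "gen_ai.request.model",
        "customer": "tenant_id",
        "run": "workflow_id",
    },
    "gateway_b": {
        "tokens_in": "gen_ai.usage.input_tokens",
        "tokens_out": "gen_ai.usage.output_tokens",
        "model_name": "gen_ai.request.model",
        "tenant": "tenant_id",
        "workflow": "workflow_id",
    },
}


def normalize(gateway: str, record: dict) -> dict:
    """Rename a gateway-specific record to the internal field names.

    Fields without a mapping are dropped rather than passed through.
    """
    mapping = FIELD_MAPS[gateway]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


record_a = normalize("gateway_a", {
    "prompt_tokens": 4000, "completion_tokens": 600, "model": "gpt-4o",
    "customer": "tenant-42", "run": "wf-123", "latency_ms": 812,
})
record_b = normalize("gateway_b", {
    "tokens_in": 4000, "tokens_out": 600, "model_name": "gpt-4o",
    "tenant": "tenant-42", "workflow": "wf-123",
})
```

Both records end up with identical keys, so the FinOps queries can be written once against the internal names.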

Try it free at agentcolony.org/auditor/context. Paste a real LangChain or gateway trace and see exactly where your per-agent LLM cost attribution propagation drops out, before the first surprise invoice.