Colony Journal
AI Cost Anomaly Detection: Tracing LLM Spend Spikes with Gateway Trace Data
May 29, 2026
TL;DR
- Provider billing dashboards (OpenAI usage, Anthropic console, AWS Bedrock Cost Explorer) lag by 1 to 24 hours and aggregate at hour or day granularity, so by the time a spike surfaces the runaway is already paid for.
- The early signal lives in your gateway traces, not the invoice: gen_ai.usage.* token counts, request.model vs response.model divergence, retry_depth, and per-tenant token share.
- Five durable detection rules cover the three real failure modes: retry storms, silent model fallback to expensive tiers, and single-tenant runaways.
- According to the OpenTelemetry GenAI semantic conventions (semconv 1.41.0), the cost-relevant attributes are already standardised; the missing piece is joining them to a price catalog and a chargeback model.
- The AI Cost Attribution Auditor joins trace attributes against price data so you can run Breakdown, Compare, and Context views by workflow_id and tenant_id, not just by model and account.
Why the bill is the last place an LLM spend spike shows up
If your AI spend has ever 10x'd overnight, you know the pattern. Engineering finds out from finance, finance finds out from the invoice, and the invoice arrives a day late. The signal was sitting in the gateway the entire time; the dashboard just could not see it.
Three distinct incident classes drive almost every spike worth investigating:
- Retry storms during provider degradation, where an SDK quietly multiplies real token consumption by its maxRetries value.
- Silent model fallback inside a router or gateway, where a 429 on a cheap tier flips traffic to a much more expensive tier without any traffic change at all.
- Single-tenant runaway, where one misconfigured customer or a recursive agent loop generates orders of magnitude more tokens from a single tenant_id while the org-wide chart barely moves.
The common factor is that the cause is per-request and per-tenant, but the report is per-account and per-day. That mismatch is why anomaly detection has to move down the stack to the gateway.
The three failure modes, in concrete numbers
Retry storms multiply spend without multiplying traffic
Most provider SDKs (Vercel AI SDK, LangChain, OpenAI's official client) retry on 429 and 5xx with exponential backoff. The default maxRetries varies by SDK and version; a ceiling of 3 to 6 is a typical configuration. During a 30 minute provider degradation, a workflow that normally costs 1x can silently cost 4x to 7x, because the worst case per logical request is one original attempt plus maxRetries retries. The HTTP success rate may even look fine because the retries eventually succeed. The only artefacts left behind are a fatter retry_depth distribution in your spans and a bigger bill at midnight.
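The 4x to 7x range is just attempt counting, and a short sketch makes the arithmetic explicit. It assumes every attempt consumes billable tokens, which depends on the provider and on where the failure occurs; the function names and the independent-failure model are illustrative assumptions, not taken from any SDK.

```python
def worst_case_cost_multiplier(max_retries: int) -> int:
    """Upper bound on attempts per logical request: the first try plus every retry."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return 1 + max_retries


def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per request if each attempt fails independently.

    Assumption: failures are independent with probability failure_rate.
    Attempt k+1 only happens when the first k attempts all failed.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return sum(failure_rate ** k for k in range(max_retries + 1))


for retries in (3, 6):
    print(retries, "retries -> up to", worst_case_cost_multiplier(retries), "x")
```

The worst case only materialises when nearly every attempt fails; the expected-attempts figure is the more realistic number to compare against an observed retry_depth distribution.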
Model fallback chains can silently introduce an 89x cost differential
According to a November 2025 r/LocalLLaMA thread on SLM vs frontier economics, DeepSeek V4-Flash output tokens list at $0.28 per million while Claude Opus 4.6 lists at $25 per million, roughly 89x per output token. A router config that falls back from the cheap tier to the expensive tier on 429 is, in cost terms, equivalent to an 89x jump in output-token volume with no change in request volume at all. Native billing tools group spend by model after aggregation, so they show the symptom but cannot tell you which workflow_id or tenant_id drove the shift.
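A full flip to the expensive tier is the extreme case; partial fallback is more common and still costly. The sketch below computes the blended output price for a given fallback share, using the two list prices quoted above. The function names are hypothetical, and input-token prices are ignored for simplicity.

```python
def price_ratio(cheap_usd_per_m: float, expensive_usd_per_m: float) -> float:
    """How many times more expensive the fallback tier is per output token."""
    return expensive_usd_per_m / cheap_usd_per_m


def blended_output_price(cheap_usd_per_m: float,
                         expensive_usd_per_m: float,
                         fallback_fraction: float) -> float:
    """Effective $/M output tokens when a fraction of traffic falls back."""
    if not 0.0 <= fallback_fraction <= 1.0:
        raise ValueError("fallback_fraction must be between 0 and 1")
    return ((1.0 - fallback_fraction) * cheap_usd_per_m
            + fallback_fraction * expensive_usd_per_m)


CHEAP = 0.28       # $/M output tokens, figure quoted in the text
EXPENSIVE = 25.0   # $/M output tokens, figure quoted in the text

for fraction in (0.0, 0.1, 0.5, 1.0):
    blended = blended_output_price(CHEAP, EXPENSIVE, fraction)
    print(f"{fraction:.0%} fallback: ${blended:.3f}/M ({blended / CHEAP:.1f}x)")
```

With these figures, a 10 percent fallback share already lands near 9.8x the baseline output cost, which is why the divergence rate deserves an alert well before it reaches 100 percent.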
Single-tenant runaways hide inside the org-wide average
A prompt injection or a recursive agent loop on one customer can push that tenant's token volume up 1000x while the global chart stays close to flat. Without per-tenant attribution at the span level, the only detector is the invoice.
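The dilution effect is easy to quantify. The sketch below uses hypothetical numbers (a tenant that normally accounts for 0.01 percent of tokens) to show how a 1000x runaway looks on the org-wide chart versus the per-tenant share; both function names are illustrative.

```python
def org_growth_factor(tenant_share: float, tenant_multiplier: float) -> float:
    """Factor by which org-wide token volume grows when one tenant's
    volume is multiplied and every other tenant stays constant."""
    return (1.0 - tenant_share) + tenant_share * tenant_multiplier


def tenant_share_after(tenant_share: float, tenant_multiplier: float) -> float:
    """The runaway tenant's share of total tokens after the spike."""
    grown = tenant_share * tenant_multiplier
    return grown / org_growth_factor(tenant_share, tenant_multiplier)


share = 0.0001       # hypothetical: tenant is 0.01% of normal volume
multiplier = 1000.0  # the 1000x runaway described above

print(f"org-wide volume: {org_growth_factor(share, multiplier):.4f}x")
print(f"tenant share:    {tenant_share_after(share, multiplier):.2%}")
```

Under these assumptions the org-wide volume rises by about 10 percent, which is easy to dismiss as noise, while the tenant's own share jumps from 0.01 percent to roughly 9 percent. The per-tenant view is where the anomaly is unmistakable.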
What OpenTelemetry GenAI semconv gives you for free
According to the OpenTelemetry Semantic Conventions for Generative-AI client spans (semconv 1.41.0), a properly instrumented LLM call emits a standard set of cost-relevant attributes on its span:
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens for billed token counts.
- gen_ai.request.model and gen_ai.response.model, where divergence catches silent fallback.
- gen_ai.request.max_tokens, gen_ai.response.finish_reasons, and gen_ai.operation.name.
The spec deliberately stops short of cost attributes, because pricing is provider and contract specific. That gap is the entire job of an AI cost attribution layer: take the standard OTel GenAI span, attach your application's tenant_id, workflow_id, retry_depth, and parent_request_id, join against a price catalog, and you have $/request, $/workflow, and $/tenant in real time, well before the provider invoice cuts.
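A minimal sketch of that join, assuming spans arrive as plain attribute dictionaries. The catalog entries, model names, and prices below are hypothetical placeholders, not real list prices, and a production catalog would also need effective dates and contract discounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_usd_per_m: float   # $ per million input tokens
    output_usd_per_m: float  # $ per million output tokens


# Hypothetical catalog: model names and prices are placeholders.
PRICE_CATALOG = {
    "cheap-model": Price(input_usd_per_m=0.10, output_usd_per_m=0.30),
    "frontier-model": Price(input_usd_per_m=5.00, output_usd_per_m=25.00),
}


def span_cost_usd(attrs: dict, catalog: dict) -> float:
    """Dollar cost of one span from its semconv token counts."""
    # Price on the response model when present: it is the model that
    # actually served the call, which matters after a silent fallback.
    model = attrs.get("gen_ai.response.model") or attrs["gen_ai.request.model"]
    price = catalog[model]
    return (attrs["gen_ai.usage.input_tokens"] * price.input_usd_per_m
            + attrs["gen_ai.usage.output_tokens"] * price.output_usd_per_m) / 1_000_000


def cost_by(spans: list, key: str, catalog: dict) -> dict:
    """Roll span costs up by an application attribute such as tenant_id."""
    totals: dict = {}
    for attrs in spans:
        totals[attrs[key]] = totals.get(attrs[key], 0.0) + span_cost_usd(attrs, catalog)
    return totals


spans = [
    {"gen_ai.request.model": "cheap-model", "gen_ai.response.model": "frontier-model",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 500,
     "tenant_id": "42", "workflow_id": "recovery-worker"},
    {"gen_ai.request.model": "cheap-model", "gen_ai.response.model": "cheap-model",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 500,
     "tenant_id": "7", "workflow_id": "embed-search-rerank"},
]
print(cost_by(spans, "tenant_id", PRICE_CATALOG))
```

Calling cost_by with "workflow_id" instead of "tenant_id" gives the per-workflow rollup from the same spans, which is the point of carrying both attributes on every span.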
Five gateway trace signals that catch real LLM cost anomalies
These are the durable detectors, ranked by how often they fire in practice:
- Per-workflow token-volume z-score. Compute daily mean and stddev per workflow_id over a 14 day window; alert at z greater than 3. Catches recursive-agent loops and prompt-injection runaways.
- Retry-depth distribution shift. Bucket spans by retry_depth (0, 1, 2, 3+). A sudden jump in the higher buckets, correlated with gen_ai.response.finish_reasons and HTTP status, is a retry storm in flight. Fire before it ends so the gateway can shed retries.
- Request-model versus response-model divergence rate. Normally near 0 percent. Any spike means silent fallback. Multiply the divergent request count by the price delta between the two models for the dollar impact.
- Per-tenant token share. For each 5 minute window, alert when a tenant's share crosses 3x its trailing 7 day p95. This is the only detector that catches single-tenant runaways inside the hour.
- $/request p99 drift. Track p99 cost per logical request grouped by route. A drift up almost always traces back to a code change that landed without a cost review (bigger context windows, longer prompts, more tool-calls).
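Two of these rules, the per-workflow z-score and the per-tenant share threshold, reduce to a few lines once the aggregates exist. The sketch below assumes the daily token totals and the trailing per-window shares have already been computed upstream; the function names are hypothetical, and the nearest-rank percentile is one of several reasonable p95 definitions.

```python
import math
from statistics import mean, stdev
from typing import Optional, Sequence


def zscore(history: Sequence[float], current: float) -> Optional[float]:
    """z-score of the current value against a trailing window (e.g. 14 days)."""
    if len(history) < 2:
        return None  # not enough data for a sample standard deviation
    sigma = stdev(history)
    if sigma == 0:
        return None  # perfectly flat history: z is undefined
    return (current - mean(history)) / sigma


def workflow_volume_alert(history: Sequence[float], current: float,
                          threshold: float = 3.0) -> bool:
    """Rule 1: alert when a workflow's token volume exceeds z > threshold."""
    z = zscore(history, current)
    return z is not None and z > threshold


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tenant_share_alert(trailing_shares: Sequence[float], current_share: float,
                       factor: float = 3.0) -> bool:
    """Rule 4: alert when a tenant's share crosses factor x its trailing p95."""
    return current_share > factor * percentile(trailing_shares, 95)


daily_tokens = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100, 102, 98]
print(workflow_volume_alert(daily_tokens, 200))         # far outside the window
print(tenant_share_alert([0.02] * 19 + [0.04], 0.30))   # share well above 3x p95
```

Returning None for a flat or too-short history is a deliberate choice in this sketch: a new workflow with no baseline should be handled by a separate absolute-budget rule rather than by a z-score that cannot be computed.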
Gateway anomaly detection vs cloud-native billing tools
| Capability | Provider billing dashboards (OpenAI / Anthropic / Bedrock Cost Explorer) | Cloud cost tools (AWS Cost Explorer, Azure Cost Management, Datadog Cloud Cost) | Gateway trace anomaly detection |
|---|---|---|---|
| Latency to first alert | 1 to 24 hours | Daily, follows the CUR / invoice | Seconds to minutes from the span |
| Granularity | Account, model, day | Account, SKU, service | Per workflow_id, tenant_id, retry_depth |
| Catches retry storms | No, retries roll into the daily total | No, no per-request view | Yes, via retry_depth bucket shift |
| Catches silent model fallback | After the fact, by model line item | No | Yes, via request.model vs response.model divergence |
| Catches single-tenant runaways | Only at invoice cut | No, no tenant dimension | Yes, via per-tenant token share alert |
| Attribution to a chargeback model | No | Tag based, account level | Span attribute level, per request |
The right reading of this table is not that one approach replaces the other; the invoice is still the source of truth for what you owe. The gateway trace layer is the only layer that can tell you which workflow and which tenant moved the number before the invoice arrives.
Where the AI Cost Attribution Auditor fits in the loop
The Auditor is built on exactly the data model above: OTel GenAI spans, plus the chargeback attributes the application owns, joined against a price catalog. Three views map to the three failure modes:
- Breakdown collapses any time window into cost by (tenant_id, workflow_id, model), the dimensions native dashboards do not carry.
- Compare diffs two time windows on the same dimensions, so the cohort that caused yesterday's spike is one click away.
- Context pivots from an anomalous workflow into the actual span trace, so engineers see prompt size, retry chain, and fallback path and can fix the cause rather than the symptom.
This is the difference between knowing the bill went up and knowing that tenant 42's recovery worker retried Claude Opus 6 times against a degraded endpoint last night between 02:14 and 02:47 UTC.
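To make the Breakdown and Compare ideas concrete, here is a rough sketch of the underlying computation: a group-by on (tenant_id, workflow_id, model) and a diff of two windows. This is not the Auditor's implementation or API; the function names and the pre-computed cost_usd field are assumptions for illustration.

```python
from collections import defaultdict


def breakdown(spans: list) -> dict:
    """Cost per (tenant_id, workflow_id, model) for one time window."""
    totals = defaultdict(float)
    for span in spans:
        key = (span["tenant_id"], span["workflow_id"], span["model"])
        totals[key] += span["cost_usd"]
    return dict(totals)


def compare(baseline: dict, anomalous: dict) -> list:
    """Cost delta per cohort between two windows, largest increase first."""
    keys = set(baseline) | set(anomalous)
    deltas = {k: anomalous.get(k, 0.0) - baseline.get(k, 0.0) for k in keys}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


baseline_spans = [
    {"tenant_id": "42", "workflow_id": "recovery-worker", "model": "cheap", "cost_usd": 1.0},
    {"tenant_id": "7", "workflow_id": "embed-search-rerank", "model": "cheap", "cost_usd": 2.0},
]
anomalous_spans = [
    {"tenant_id": "42", "workflow_id": "recovery-worker", "model": "frontier", "cost_usd": 60.0},
    {"tenant_id": "7", "workflow_id": "embed-search-rerank", "model": "cheap", "cost_usd": 2.5},
]

for cohort, delta in compare(breakdown(baseline_spans), breakdown(anomalous_spans)):
    print(cohort, f"{delta:+.2f}")
```

Sorting by delta puts the cohort that moved the bill at the top, and a cohort that only exists in the anomalous window (here, tenant 42 on the frontier model) shows up as pure increase.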
A practitioner-grade detection workflow
A reasonable starter pipeline looks like this:
- Wrap every gateway call so it emits an OTel GenAI client span with the semconv attributes, plus tenant_id, workflow_id, retry_depth, and parent_request_id.
- Stream spans to a trace store that supports aggregation (ClickHouse, BigQuery, Tempo with a sidecar, or your existing OTLP backend).
- Run the five detectors above on a 1 to 5 minute window. Three of them (z-score, retry-depth shift, tenant share) are cheap SQL.
- Route alerts not to a generic on-call channel but to the owning team, keyed on the workflow_id tag. A spike on embed-search-rerank is a search-team problem, not a platform-team problem.
- When an alert fires, open the Auditor Compare view on the anomalous window vs the matched baseline and ship the diff in the incident doc.
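The first and fourth steps can be sketched without committing to a tracing backend. The helper below only builds the attribute dictionary a gateway wrapper would attach to its span; in a real setup those pairs would be set on an OpenTelemetry span rather than returned. The function names, the ownership map, and the default team are hypothetical.

```python
def genai_span_attributes(*, request_model: str, response_model: str,
                          input_tokens: int, output_tokens: int,
                          tenant_id: str, workflow_id: str,
                          retry_depth: int, parent_request_id: str,
                          operation: str = "chat") -> dict:
    """Semconv GenAI attributes plus the application's chargeback keys."""
    return {
        # Standard OTel GenAI attributes named in the text.
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Application-owned attributes; the spec does not define these.
        "tenant_id": tenant_id,
        "workflow_id": workflow_id,
        "retry_depth": retry_depth,
        "parent_request_id": parent_request_id,
    }


# Hypothetical ownership map: workflow_id -> owning team.
WORKFLOW_OWNERS = {"embed-search-rerank": "search-team"}


def route_alert(workflow_id: str, owners: dict,
                default_team: str = "platform-team") -> str:
    """Pick the team to page; unknown workflows fall back to a default."""
    return owners.get(workflow_id, default_team)


attrs = genai_span_attributes(
    request_model="cheap-model", response_model="cheap-model",
    input_tokens=1200, output_tokens=300,
    tenant_id="42", workflow_id="embed-search-rerank",
    retry_depth=0, parent_request_id="req-001",
)
print(route_alert(attrs["workflow_id"], WORKFLOW_OWNERS))
```

Keyword-only arguments are used so that a call site cannot silently omit or transpose tenant_id and workflow_id; a span missing either key is invisible to every per-tenant detector downstream.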
The first three steps are achievable in a sprint. The fifth step is the one that turns anomaly detection from a noisy stream into a closed loop, because every spike that gets diffed and explained ratchets your detection rules forward.
Summary
AI cost anomaly detection cannot live in the provider's billing dashboard. The data is too coarse, too late, and missing the dimensions (workflow, tenant, retry depth) that actually explain a spike. The signal you need is already in your OpenTelemetry GenAI spans; the work is in joining those spans to a price catalog, propagating chargeback attributes, and running a handful of durable detectors against the result. That stack catches retry storms, silent model fallback, and single-tenant runaways while they are still cheap to fix, and it turns the eventual invoice into a confirmation rather than a surprise.
FAQ
How do I detect an LLM cost spike before the provider invoice arrives?
Instrument every gateway call as an OpenTelemetry GenAI span carrying gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, gen_ai.response.model, plus your tenant_id and workflow_id. Join those spans to a price catalog in your trace store and run a per-workflow z-score (>3) plus a per-tenant token-share threshold (>3x trailing 7d p95) on a 1 to 5 minute window. That catches almost every spike inside the hour, not at midnight.
How do I attribute LLM cost per tenant in a multi-agent system?
The only stable chargeback key is what you propagate into the span at request time. Add tenant_id and workflow_id as span attributes at the gateway and through any subagent calls (use OTel baggage or your framework's context propagation). Do not rely on conversation_id for chargeback because conversations span tenants in shared agents. The Auditor's Breakdown view groups by those propagated keys, which is why getting them on the span is the prerequisite.
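As an in-process illustration of that propagation, the sketch below uses Python's standard contextvars module so that nested subagent calls can read the chargeback keys without having them passed as arguments. It stands in for "your framework's context propagation"; it does not cross process boundaries, where something like OTel baggage would be needed. All names here are hypothetical.

```python
import contextvars
from contextlib import contextmanager

# Holds the chargeback keys for the request currently being handled.
_chargeback = contextvars.ContextVar("chargeback", default=None)


@contextmanager
def chargeback_scope(tenant_id: str, workflow_id: str):
    """Bind tenant_id and workflow_id for everything called inside the block."""
    token = _chargeback.set({"tenant_id": tenant_id, "workflow_id": workflow_id})
    try:
        yield
    finally:
        _chargeback.reset(token)  # restore the previous scope, if any


def current_chargeback() -> dict:
    """Keys to stamp on a span; fails loudly rather than emit an orphan span."""
    keys = _chargeback.get()
    if keys is None:
        raise RuntimeError("no chargeback scope is active")
    return dict(keys)


def subagent_call(prompt: str) -> dict:
    """Stand-in for a nested LLM call: returns the attributes it would emit."""
    return {"prompt_chars": len(prompt), **current_chargeback()}


with chargeback_scope("42", "recovery-worker"):
    print(subagent_call("hello"))
```

Raising when no scope is active is a design choice in this sketch: an unattributed span is cheaper to catch in development than to reconcile against an invoice later.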
What is a retry storm and how do I see it in gateway traces?
A retry storm happens when an upstream provider returns 429 or 5xx and your SDK retries up to its maxRetries ceiling, often 3 to 6. In span data it looks like a sharp jump in spans with retry_depth > 0 and a corresponding rise in spans that record a transient error (HTTP 429 or 5xx, typically surfaced through the span's error.type attribute and status). Alert on the distribution shift, not the raw count, so you fire during the storm and not after it.
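One way to express "alert on the distribution shift" is to compare the share of retried spans in the current window with a baseline window. The sketch below does that with the four buckets used earlier; the 10 percentage point threshold and the function names are assumptions to tune against your own traffic.

```python
from collections import Counter

BUCKETS = ("0", "1", "2", "3+")


def bucket(retry_depth: int) -> str:
    """Map a raw retry_depth onto the 0, 1, 2, 3+ buckets."""
    return "3+" if retry_depth >= 3 else str(retry_depth)


def distribution(depths: list) -> dict:
    """Share of spans in each bucket; all zeros for an empty window."""
    counts = Counter(bucket(d) for d in depths)
    total = sum(counts.values())
    return {b: (counts[b] / total if total else 0.0) for b in BUCKETS}


def retried_share(depths: list) -> float:
    """Share of spans that needed at least one retry."""
    dist = distribution(depths)
    return sum(share for b, share in dist.items() if b != "0")


def is_retry_storm(baseline: list, current: list, min_shift: float = 0.10) -> bool:
    """Alert when the retried share rises by at least min_shift (absolute)."""
    return retried_share(current) - retried_share(baseline) >= min_shift


baseline = [0] * 95 + [1] * 5
current = [0] * 60 + [1] * 15 + [2] * 15 + [3] * 5 + [6] * 5
print(distribution(current))
print(is_retry_storm(baseline, current))
```

Because the comparison is on shares rather than counts, a plain traffic increase with a healthy provider does not trip the alert, while a degraded provider does even at low volume.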
Why is my AI bill spiking when my traffic is flat?
Almost always one of two causes: silent model fallback (your router promoted requests from a cheap tier to an expensive one because the cheap tier was rate limited) or a single-tenant runaway (one customer or recursive agent loop is consuming most of the tokens). The first shows up as request.model not equal to response.model on a rising share of spans. The second shows up as one tenant_id blowing past its trailing p95 token share. Both are invisible to dashboards that aggregate by model and account.
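The first cause can be checked with a per-workflow divergence rate, sketched below. One caveat this sketch glosses over: some providers return a versioned snapshot name for an alias in the response, so real comparisons usually need a normalisation step before testing equality. The function names are hypothetical.

```python
from collections import defaultdict


def is_divergent(span: dict) -> bool:
    """True when the serving model differs from the requested one.

    Assumption: model names are already normalised (aliases resolved).
    """
    return span["gen_ai.request.model"] != span["gen_ai.response.model"]


def divergence_rate_by_workflow(spans: list) -> dict:
    """Share of divergent spans per workflow_id in one time window."""
    total = defaultdict(int)
    divergent = defaultdict(int)
    for span in spans:
        workflow = span["workflow_id"]
        total[workflow] += 1
        if is_divergent(span):
            divergent[workflow] += 1
    return {w: divergent[w] / total[w] for w in total}


window = [
    {"workflow_id": "recovery-worker",
     "gen_ai.request.model": "cheap-model", "gen_ai.response.model": "frontier-model"},
    {"workflow_id": "recovery-worker",
     "gen_ai.request.model": "cheap-model", "gen_ai.response.model": "frontier-model"},
    {"workflow_id": "recovery-worker",
     "gen_ai.request.model": "cheap-model", "gen_ai.response.model": "cheap-model"},
    {"workflow_id": "embed-search-rerank",
     "gen_ai.request.model": "cheap-model", "gen_ai.response.model": "cheap-model"},
]
print(divergence_rate_by_workflow(window))
```

Grouping by workflow_id rather than computing one global rate answers the question billing dashboards cannot: which workflow is being promoted to the expensive tier.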
Can Datadog or CloudWatch alert on per-workflow LLM cost?
Not natively as of late 2025. AWS Cost Explorer and Azure Cost Management work off the cloud usage report and stop at the account or SKU level. Datadog APM and LLM Observability ingest span data but their cost views are model and account level. To get per-workflow or per-tenant chargeback you need the span attributes joined to a price catalog, which is what the AI Cost Attribution Auditor is built to do.