Colony Journal
LLM Request Retry Storms: How Retries Multiply AI Costs and Break Attribution
May 29, 2026
TL;DR
- AI gateways like LiteLLM retry failed requests automatically, resending the full prompt token payload each time. A gateway set to num_retries=2 can bill you 3x the expected token cost on any request hitting a transient error.
- Retry attempts appear as independent requests in spend logs with no is_retry flag, making them invisible to FinOps dashboards and chargeback reports.
- A 15% rate-limit error rate with two retries adds up to roughly 30% invisible overhead to your AI bill, and that overhead gets mis-attributed to users and teams as genuine consumption.
- The OpenTelemetry Semantic Conventions for Generative AI (v1.41.0) define no retry attributes, so every standard observability backend treats retried requests as genuine new usage.
- Detecting retry bursts requires trace-level analysis: grouping requests with identical context fields in short time windows and correlating against provider billing deltas.
Why Your AI Bill Is Higher Than Your Gateway Reports
Platform engineers running AI inference infrastructure tend to trust their gateway spend logs. They build dashboards summing spend by team_id or user, run chargeback reports at month end, and present those numbers to finance as accurate per-team consumption. The problem is that none of those numbers account for LLM retry cost.
Here is the scenario: a peak traffic window pushes your inference endpoint past its rate limit. Your gateway receives a 429 from the model provider. Rather than surfacing an error to the calling application, it waits, retries with the same full prompt payload, and eventually succeeds. The client sees one successful response. Your spend log sees two or three independent requests, each billed at full token cost.
At low error rates this overhead is negligible. At scale, with multi-thousand-token system prompts and a 15% error rate, it becomes a substantial and entirely silent cost multiplier.
How AI Gateway Retry Behavior Actually Works
Every major AI gateway framework ships retry logic as a default reliability primitive. According to the LiteLLM Router documentation, its reliability model provides "basic reliability logic: cooldowns, fallbacks, timeouts and retries (fixed and exponential backoff) across multiple deployments/providers." The retry sequence is deterministic: the gateway fires the original request, receives a failure (a 429, 500, or timeout), waits using exponential backoff, resends the identical payload, and repeats up to num_retries times before escalating to a fallback model group.
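The retry sequence described above can be sketched in a few lines. This is a minimal, gateway-agnostic illustration, not LiteLLM's implementation: `call_with_retries`, `TransientError`, `RetryResult`, and the `base_delay` parameter are hypothetical names chosen for this example, and fallback escalation is left out. What it shows is that the full prompt is counted on every attempt.

```python
import time
from dataclasses import dataclass


class TransientError(Exception):
    """Stands in for a 429, 500, or timeout from the provider (hypothetical)."""


@dataclass
class RetryResult:
    attempts: int            # total attempts, including the original request
    prompt_tokens_sent: int  # prompt tokens sent across all attempts


def call_with_retries(send, prompt_tokens, num_retries=2, base_delay=0.5,
                      sleep=time.sleep):
    """Call send() and retry on TransientError with exponential backoff."""
    attempts = 0
    tokens_sent = 0
    while True:
        attempts += 1
        tokens_sent += prompt_tokens  # the identical payload is resent each time
        try:
            send()
            return RetryResult(attempts, tokens_sent)
        except TransientError:
            if attempts > num_retries:
                raise  # a real gateway would escalate to a fallback group here
            # Backoff doubles per attempt: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempts - 1))
```

With a 2,500-token prompt and a `send` that fails twice before succeeding, the result reports three attempts and 7,500 prompt tokens sent, while the caller sees a single successful return.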
The Squirrel open-source LLM gateway, discussed on Hacker News in February 2026, makes the client-transparency design explicit: "Auto-Retry and Failover: if a provider throws a 500 error or times out, Squirrel seamlessly switches to a backup provider. Your client-side code does not need to handle a thing." That selling point, transparency to client code, is precisely what creates the attribution blind spot. The retry is invisible to the caller. The spend log is not.
The Token Cost Math of LLM Retries
Each retry resends the full input context to the provider. For a typical chat completion with a 2,000-token system prompt and 500-token conversation history, the billing arithmetic is straightforward:
- Single attempt: 2,500 prompt tokens billed
- Two retries before success: 7,500 prompt tokens billed (3x cost)
- Fallback to a costlier model after exhausting retries: the fallback attempt may cost 4 to 10 times more per token than the original target
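The practical implication for teams managing AI gateway retry storms: a gateway configured with num_retries=2 multiplies prompt token spend by up to 3x on any request that encounters a transient error. A team running at a 15% 429 error rate pays up to approximately 30% more in LLM token cost than its per-request cost model predicts: if 15% of requests fail and each one consumes both retries, the extra prompt spend is 0.15 x 2 = 30%. If most failing requests succeed on the first retry, the overhead is closer to 15%.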
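The overhead figure can be bracketed with two simple models. The sketch below is an illustration under stated assumptions, not a billing formula from any provider: `worst_case_overhead` assumes every failing request burns all of its retries, while `independent_failure_overhead` assumes each attempt fails independently at the same rate. Both function names are hypothetical.

```python
def worst_case_overhead(error_rate, num_retries):
    """Extra prompt spend if every failing request uses all of its retries."""
    return error_rate * num_retries


def independent_failure_overhead(error_rate, num_retries):
    """Expected extra prompt spend if each attempt fails independently.

    A k-th retry is only sent when the previous k attempts all failed,
    which happens with probability error_rate ** k.
    """
    return sum(error_rate ** k for k in range(1, num_retries + 1))


if __name__ == "__main__":
    for name, fn in [("worst case", worst_case_overhead),
                     ("independent", independent_failure_overhead)]:
        print(f"{name}: {fn(0.15, 2):.2%} extra prompt tokens")
```

At a 15% error rate with two retries, the worst-case model gives the 30% figure and the independent-failure model gives roughly 17%. Real traffic during a rate-limit storm tends to sit nearer the worst case, because a retry sent into the same congested window is likely to fail again.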
A practitioner building an AI news aggregator on Cloudflare Workers captured the operational anxiety this creates: "I literally couldn't sleep at night, kept worrying that some bug in my code would spiral into a self-inflicted Denial of Wallet attack by morning. That fear is what pushed me to build the circuit breaker early on." (ethan_zhao, HN Show HN #47322794, March 2026.) That is not an edge-case fear. It is the production reality for any team that has not explicitly measured its LLM request retry overhead.
Why Spend Logs Cannot See the Retry Overhead
The attribution problem has a structural cause. According to the LiteLLM Spend Tracking documentation, the LiteLLM_SpendLogs table stores each API call with fields including api_key, user, team_id, request_tags, model, and spend. There is no is_retry field, no retry_attempt_n, and no original_request_id linking a retry to the originating call.
This schema gap means every retry appears as an independent full-cost request in the spend log. Chargeback reports double-count retry overhead as genuine usage. Dashboards summing spend by team_id or user overstate that team's real consumption by the full retry factor.
The successful response returns to the client as if it were the first attempt. The client never knows retries happened. The spend log records all N attempts, and the delta is invisible until someone compares provider billing against gateway attribution totals.
This attribution gap has real organizational consequences. A FinOps practitioner at an enterprise firm described the downstream effect on HN: "it was too hard to figure out how to meter usage and charge back to business lines, so they are essentially going to discontinue those services and make business lines self manage." (steveBK123, HN #42210788, Nov 2024.) Miscounted retries are one of the most systematic and least instrumented sources of chargeback error in AI cost attribution.
The OpenTelemetry Spec Gap That Affects Every Observability Backend
The problem extends beyond individual gateway implementations. According to the OpenTelemetry Semantic Conventions for Generative AI (v1.41.0, 2026), standard span attributes for LLM requests include gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons. What the spec does not define are retry attributes, such as a gen_ai.request.retry_attempt, gen_ai.request.is_retry, or gen_ai.request.original_span_id.
Without these attributes, every observability backend (Datadog, Grafana, Honeycomb) that ingests OTel GenAI spans treats retried requests identically to originating requests. The gen_ai.client.token.usage histogram defined in the metrics spec has no retry dimension, so it conflates retry tokens and genuine new-request tokens into one undifferentiated count. Any cost dashboard built on standard OTel data systematically overcounts token usage in proportion to the retry rate.
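To make the conflation concrete, here is a small sketch that splits input tokens by whether a span is marked as a retry. It works on plain attribute dictionaries rather than a real OTel SDK, and the `app.llm.retry_attempt` attribute is a hypothetical custom attribute that you would have to emit yourself; it is not part of the GenAI semantic conventions. Only `gen_ai.usage.input_tokens` comes from the spec.

```python
def split_input_tokens(spans, retry_attr="app.llm.retry_attempt"):
    """Return (original_tokens, retry_tokens) across span attribute dicts.

    Spans without the retry attribute are counted as original requests,
    which is how a backend sees every span under the standard conventions.
    """
    original = 0
    retried = 0
    for attrs in spans:
        tokens = attrs.get("gen_ai.usage.input_tokens", 0)
        if attrs.get(retry_attr, 0) > 0:
            retried += tokens
        else:
            original += tokens
    return original, retried


# One logical request that needed two retries, as standard spans see it.
standard_spans = [{"gen_ai.usage.input_tokens": 2500} for _ in range(3)]

# The same three spans with the hypothetical custom attribute attached.
tagged_spans = [
    {"gen_ai.usage.input_tokens": 2500, "app.llm.retry_attempt": n}
    for n in range(3)
]
```

On `standard_spans` every token lands in the original bucket, so a dashboard reports 7,500 tokens of apparent new usage. On `tagged_spans` the same data separates into 2,500 original tokens and 5,000 retry tokens.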
Comparing Detection Approaches for Retry Overhead
| Approach | What It Catches | What It Misses | Complexity |
|---|---|---|---|
| Gateway spend log sum | Total tokens billed per team/user | Retry overhead baked into sum; no attribution split | Low |
| Provider billing vs gateway delta | Gross overage (retry plus fallback overhead combined) | Cannot isolate retry source vs fallback source | Medium |
| OTel span analysis (no retry attrs) | Per-request latency, model, token counts | Retry grouping; treats retries as independent requests | Medium |
| Trace burst detection (same-context clustering) | Retry bursts by time window and identical input tokens | Requires gateway-level trace export; needs threshold tuning | High |
| AI Cost Attribution Auditor | Retry bursts, fallback cost chains, chargeback delta | Requires trace data ingestion setup | Medium-high |
The burst detection approach groups requests with identical or near-identical input_tokens counts from the same team_id within short time windows (typically under 30 seconds), where all but the last request returned a 4xx or 5xx status. This surfaces the retry rate and its direct cost impact without requiring changes to the gateway schema.
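A minimal sketch of that clustering follows. It assumes each exported trace record carries a team ID, a timestamp, an input token count, an HTTP status, and a spend figure; the `RequestRecord` type and both function names are hypothetical, and the sketch matches on exactly identical token counts rather than near-identical ones. It only reports bursts that end in a success, so requests that exhausted every retry are not counted here.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    team_id: str
    timestamp: float   # seconds since epoch
    input_tokens: int
    status: int        # HTTP status returned by the provider
    spend: float       # cost recorded for this attempt


def find_retry_bursts(records, window_seconds=30.0):
    """Return bursts: failed attempts followed by a success, same team and
    input_tokens, all inside window_seconds of the first failure."""
    by_key = {}
    for r in sorted(records, key=lambda rec: rec.timestamp):
        by_key.setdefault((r.team_id, r.input_tokens), []).append(r)

    bursts = []
    for group in by_key.values():
        pending = []  # failed attempts waiting for a matching success
        for r in group:
            if pending and r.timestamp - pending[0].timestamp > window_seconds:
                pending = []  # stale failures: too old to belong to this request
            if r.status >= 400:
                pending.append(r)
            elif pending:
                bursts.append(pending + [r])
                pending = []
    return bursts


def retry_overhead_by_team(bursts):
    """Sum the spend of every attempt except the final one, per team."""
    overhead = {}
    for burst in bursts:
        team = burst[0].team_id
        overhead[team] = overhead.get(team, 0.0) + sum(r.spend for r in burst[:-1])
    return overhead
```

The 30-second default mirrors the window mentioned above and will need tuning: long backoff schedules call for a wider window, while high-volume teams sending identical prompts may need a narrower one to avoid false positives.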
Three Steps to Fix AI Cost Attribution for Retries
The most practical remediation path combines instrumentation, periodic checks, and trace-level analysis.
First, instrument your gateway to emit retry metadata. If you run LiteLLM, add custom tags to span metadata at the router level to flag retry attempts before they enter the spend log. Even a boolean is_retry: true tag attached to the request payload propagates into LiteLLM_SpendLogs.request_tags and makes the overhead visible in downstream queries.
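One way to picture the tagging step is a helper that stamps each outgoing attempt before it is sent. This is a sketch under assumptions, not a LiteLLM hook: `tag_attempt` is a hypothetical function, the tag strings are illustrative, and the `metadata.tags` list is assumed to be the request field your gateway copies into its spend log tags. Check your gateway's documentation for where such a function could be wired into the retry path.

```python
import copy


def tag_attempt(payload, attempt, original_request_id):
    """Return a copy of the request payload with retry tags attached.

    attempt is 0 for the original request and 1, 2, ... for retries.
    The input payload is left unmodified so the caller can reuse it.
    """
    tagged = copy.deepcopy(payload)
    tags = tagged.setdefault("metadata", {}).setdefault("tags", [])
    tags.append(f"original_request_id:{original_request_id}")
    tags.append(f"retry_attempt:{attempt}")
    tags.append(f"is_retry:{'true' if attempt > 0 else 'false'}")
    return tagged
```

Carrying an `original_request_id` tag alongside the boolean lets downstream queries group a retry chain exactly, instead of inferring it from timing and token counts.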
Second, run a periodic billing delta check. Compare your provider billing statement (which includes all attempts) against the attribution total your gateway dashboards report for the same window. A consistent 10 to 30% overage in provider billing is the signature of unattributed retry overhead.
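The delta check itself is simple arithmetic. The sketch below assumes you already have two totals for the same window, one from the provider invoice and one from your gateway attribution reports; the function names are hypothetical and the 10 to 30% band is the heuristic from the text, not a hard rule.

```python
def billing_delta(provider_billed, gateway_attributed):
    """Return (absolute_delta, overage_ratio) for one billing window."""
    if gateway_attributed <= 0:
        raise ValueError("gateway_attributed must be positive")
    delta = provider_billed - gateway_attributed
    return delta, delta / gateway_attributed


def looks_like_retry_overhead(overage_ratio, low=0.10, high=0.30):
    """True when the overage falls in the band the heuristic associates
    with retry overhead."""
    return low <= overage_ratio <= high
```

For example, a provider bill of 13,000 against 10,000 in attributed spend yields a delta of 3,000 and an overage ratio of 0.3, which falls inside the band. A single window proves little; the signal is an overage that persists across several windows.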
Third, use trace-level burst detection for retroactive analysis. The AI Cost Attribution Auditor parses request traces to identify same-context burst patterns, quantifies the retry cost per team, and surfaces the attribution delta in chargeback-ready format, without requiring gateway schema changes.
Summary
LLM request retry storms are a silent, systematic cost multiplier hiding inside the reliability defaults of every major AI gateway. Because retries share prompt context with the original request but carry no retry metadata in standard spend logs or OTel spans, they inflate both the bill and the per-team attribution without any visible signal. A gateway running at a 15% error rate with two retries adds up to roughly 30% invisible overhead to AI token costs, and detecting that overhead requires trace-level burst detection, not just summing spend logs.
FAQ
How much do LLM retry costs actually add to a typical AI bill?
The overhead depends on your error rate and num_retries setting. At a 15% 429 error rate with num_retries=2, you pay up to approximately 30% more in prompt tokens than your per-request cost model predicts (15% of requests, each resent twice). For a team whose cost model predicts $10,000 per month in LLM inference, that is up to roughly $3,000 in invisible overhead, attributed to real users and teams as if it were genuine consumption.
Why does my AI gateway cost attribution look accurate but my provider bill is higher?
The most common cause is retry overhead from an AI gateway retry storm. Your gateway spend log sums all requests including retries, but many FinOps dashboards group by successful responses or user session, missing the full retry chain. Compare your provider billing total against your gateway attribution total for the same time window. A persistent overage of 10 to 30% is the signature of unlogged retry activity.
Can I fix the LLM request retry overhead attribution gap without changing my gateway schema?
Yes. The fastest fix is a periodic billing delta check: provider bill minus gateway attribution total for the same period. This does not tell you which team generated the retry overhead, but it quantifies the gross impact. For per-team attribution, you need trace-level burst detection or custom is_retry tagging in your gateway configuration.
Does OpenTelemetry track LLM request retry attempts automatically?
No. The OpenTelemetry Semantic Conventions for Generative AI (v1.41.0) do not define retry attributes. Standard OTel spans have no gen_ai.request.is_retry or gen_ai.request.retry_attempt field. Every observability backend ingesting OTel GenAI data therefore treats retry attempts as independent requests, which means standard cost dashboards built on OTel data overcount token usage in direct proportion to the retry rate.
What is the easiest way to detect an AI gateway retry storm in production?
Look for clusters of requests from the same team_id or user within a 30-second window that share identical input_tokens counts and where all but the last request returned a non-200 status. Most gateway observability tools log individual requests but do not group them by retry relationship. The AI Cost Attribution Auditor automates this burst detection and surfaces the LLM retry cost impact per team in a format ready for chargeback reporting.