Colony Journal
Real-Time LLM Cost Alerting: How to Detect Spend Spikes Before Your Budget Burns
May 29, 2026
TL;DR
- Provider usage APIs (OpenAI, Bedrock) lag 2 to 24 hours, so by the time a billing alert fires, the money is already gone.
- Four spike patterns repeatedly cause runaway LLM bills: retry storms, runaway agent loops, untagged tenant bursts, and model-tier drift.
- Gateway-layer metrics (LiteLLM, Kong AI Gateway, custom OTel proxies) expose tenant-scoped token and spend rates within seconds of each request.
- The OpenTelemetry GenAI semantic conventions split gen_ai.request.model from gen_ai.response.model, which is the hook for catching drift before invoices land.
- Three alert tiers, with auto-mute on lower tiers and a hard PagerDuty page on Tier 3, prevent fatigue while still catching the $38k-bill class of incident.
Why Post-Hoc Usage APIs Are Too Late
The default monitoring pattern for LLM spend is to scrape the provider usage API on a schedule. OpenAI exposes GET /v1/usage. AWS Bedrock reports through Cost Explorer plus CloudWatch metrics like InputTokenCount and InvocationLatency. These endpoints exist for billing reconciliation, not for real-time defense. Cost Explorer data typically lags 2 to 24 hours. CloudWatch Bedrock metrics arrive closer to real time, but they do not interrupt the spend loop; they only describe it after the fact.
A cache-miss-driven coding-agent incident posted on Hacker News in April 2026 showed exactly how this fails. The author ran Droid through LiteLLM into AWS Bedrock against Claude Opus 4.6. Prompt caching was misconfigured silently across three layers. The result was 6.47 billion uncached input tokens and a $37.9k bill, of which $35.6k came from cache misses alone (HN story 47933355). Their conclusion was blunt: "Budget alerts are configured" is not the same as "spend will stop."
The Four Spike Patterns Worth Catching
Every production runaway-spend incident I have seen falls into one of four patterns. Each has a distinct fingerprint at the gateway layer, which is what makes real-time AI spend alerts feasible at all.
Pattern A: Retry Storm
A transient 429 or connection reset triggers automatic retry logic. Without backoff caps, a single flaky endpoint generates 5 to 20 times the expected request volume in seconds. The gateway signal is more than one request per root request_id inside the same trace window, with p99 latency spiking while per-request cost stays normal. A useful PromQL probe is rate(http_requests_total{status=~"4.."}[1m]) rising without a matching rise in upstream errors.
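The "backoff caps" mentioned above can be sketched in a few lines. This is a minimal illustration of capped exponential backoff with full jitter, not the retry logic of any particular client; the function name, the defaults, and the injectable rng parameter are all assumptions made for the example.

```python
import random


def backoff_delays(max_attempts=4, base=0.5, cap=8.0, rng=random.random):
    """Yield one sleep duration per retry attempt.

    Hypothetical helper: the delay ceiling doubles each attempt, is clamped
    to `cap`, and the actual delay is a random fraction of that ceiling.
    `max_attempts` bounds the total number of retries, which is the part
    that stops a flaky endpoint from multiplying request volume.
    """
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(), start=1):
        print(f"retry {i}: sleep up to {delay:.2f}s")
```

The two bounds do different jobs: the cap limits how long any single wait can grow, and the attempt limit is what keeps one failing call from becoming a storm.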
Pattern B: Runaway Agent Loop
Agentic systems call the LLM in a tight loop because of a missing stop condition, an unresolved tool call, or a control-flow bug. The Bedrock incident above was effectively this combined with cache misses. The gateway signal is the same session_id or tenant_id generating requests well above its 1-req-per-minute baseline for more than 5 minutes, with cumulative session tokens crossing a per-session cap. PromQL: increase(litellm_total_tokens_metric_total{team="team_x"}[5m]) > threshold.
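The per-session cap can also be enforced in gateway middleware rather than only in PromQL. The sketch below keeps a sliding 5-minute token total per session; the class name, the default cap, and the assumption that timestamps arrive in non-decreasing order are all illustrative, not part of LiteLLM.

```python
from collections import defaultdict, deque


class SessionTokenWindow:
    """Track tokens per session over a sliding window (hypothetical helper)."""

    def __init__(self, window_seconds=300, token_cap=200_000):
        self.window_seconds = window_seconds
        self.token_cap = token_cap          # assumed cap, tune per workload
        self._events = defaultdict(deque)   # session_id -> deque of (ts, tokens)
        self._totals = defaultdict(int)     # session_id -> tokens inside window

    def record(self, session_id, timestamp, tokens):
        """Record one request and return True if the session exceeds its cap."""
        events = self._events[session_id]
        events.append((timestamp, tokens))
        self._totals[session_id] += tokens

        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0][0] <= cutoff:
            _, old_tokens = events.popleft()
            self._totals[session_id] -= old_tokens

        return self._totals[session_id] > self.token_cap
```

A True return is the point at which a gateway could reject or throttle the session, instead of waiting for an alert to be read.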
Pattern C: Untagged Tenant Burst
A new integration or misconfigured client sends requests without a tenant_id or team label. The gateway cannot attribute the cost, the team budget check is bypassed, and the untagged traffic grows unchecked until somebody opens a bill. The signal is a rising share of requests where team_id equals empty string or end_user equals default_user in LiteLLM spend logs. Treat rate(litellm_spend_metric_total{team=""}[5m]) > 0 as a zero-tolerance alert.
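For teams that export spend logs rather than query Prometheus, the same signal can be computed offline. The sketch below assumes each log row is a dict with team_id and end_user keys, as described above; the exact row shape of a real LiteLLM export may differ, so treat the field access as an assumption.

```python
def untagged_share(rows):
    """Return the fraction of requests that cannot be attributed to a tenant.

    A row counts as untagged when team_id is missing or empty, or when
    end_user is the placeholder "default_user". Row layout is assumed.
    """
    if not rows:
        return 0.0
    untagged = sum(
        1
        for row in rows
        if not row.get("team_id") or row.get("end_user") == "default_user"
    )
    return untagged / len(rows)
```

Under the zero-tolerance rule, any result above 0.0 is worth an alert.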
Pattern D: Model-Tier Drift
A config change, A/B test, or routing bug points a high-volume workflow at a premium model. Cost-per-request jumps 5x to 30x while request volume stays flat. Lumina founder iggycodexs reported on HN in January 2026 that a bug had silently routed traffic to GPT-4 instead of GPT-3.5 for weeks (HN story 46751546). The gateway signal is the requested_model label drifting from baseline or unit cost rising without a token-volume rise.
Gateway-Trace Signals That Fire Before the Budget Breach
All of the following signals live at the gateway layer; none requires a provider invoice.
Request-ID Rate and Retry Depth
Every OTel-instrumented gateway request carries a trace_id. If the same upstream session_id produces multiple traces inside 10 seconds, that is a retry. Track retry_depth = span_count_for_session / expected_spans_per_session and alert when it exceeds 3 for the same root request.
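The retry_depth ratio is simple enough to compute directly from exported spans. This sketch assumes the spans have already been reduced to a list of session IDs, one entry per span; the function names and the default of one expected span per session are placeholders.

```python
from collections import Counter


def retry_depth(span_session_ids, expected_spans_per_session=1):
    """Map each session_id to span_count / expected_spans_per_session."""
    counts = Counter(span_session_ids)
    return {sid: n / expected_spans_per_session for sid, n in counts.items()}


def sessions_over_limit(span_session_ids, expected_spans_per_session=1, limit=3.0):
    """Return the session IDs whose retry depth exceeds the limit, sorted."""
    depths = retry_depth(span_session_ids, expected_spans_per_session)
    return sorted(sid for sid, depth in depths.items() if depth > limit)
```

For example, eight spans for a session that normally produces two gives a depth of 4.0, which is over the limit of 3.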
Tokens Per Minute Per Tenant
Tokens per minute per tenant is the canonical Prometheus signal for an LLM budget alert. LiteLLM exposes litellm_input_tokens_metric and litellm_output_tokens_metric, both labeled with team, end_user, model, and requested_model. Compute TPM with:
rate(litellm_total_tokens_metric_total{team="team_x"}[1m]) * 60
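To make the arithmetic behind that query concrete, here is a simplified Python version of what it computes from raw counter samples. It is only an approximation of rate(): it handles counter resets but skips the window-boundary extrapolation Prometheus applies, and the function name is made up for this example.

```python
def tokens_per_minute(samples):
    """Estimate tokens per minute from (unix_seconds, counter_value) samples.

    Samples must be sorted by time. A drop in the counter is treated as a
    reset (for example a proxy restart), so counting resumes from zero.
    """
    if len(samples) < 2:
        return 0.0

    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return increase / elapsed * 60  # per-second rate scaled to one minute
```

The final multiplication is the same "* 60" as in the PromQL expression: rate() returns a per-second value.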
Model-Mix Shift
To catch drift, watch the premium-model share of input tokens:
sum(rate(litellm_input_tokens_metric_total{requested_model=~"gpt-4.*"}[5m]))
/
sum(rate(litellm_input_tokens_metric_total[5m]))
Alert when premium share rises more than 20 percentage points above the 7-day moving average for the same team.
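The comparison against the 7-day average is in percentage points, not percent, which is easy to get wrong. A minimal sketch, assuming token counts per requested model are already aggregated and that "premium" means any model name starting with gpt-4 (mirroring the regex above); both function names are hypothetical.

```python
def premium_share(tokens_by_model, premium_prefixes=("gpt-4",)):
    """Return the premium-model share of input tokens as a fraction (0 to 1)."""
    total = sum(tokens_by_model.values())
    if total == 0:
        return 0.0
    premium = sum(
        tokens
        for model, tokens in tokens_by_model.items()
        if model.startswith(premium_prefixes)
    )
    return premium / total


def drift_alert(current_share, daily_shares, max_rise_points=20.0):
    """True when the current share is more than N points above the baseline.

    daily_shares holds one share per day for the trailing window (assumed
    7 entries); the baseline is their plain average.
    """
    if not daily_shares:
        return False
    baseline = sum(daily_shares) / len(daily_shares)
    return (current_share - baseline) * 100 > max_rise_points
```

With a baseline of 30 percent, a current share of 60 percent is a 30-point rise and fires; 45 percent is a 15-point rise and does not.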
Cache-Hit Ratio Drop
For Claude prompt caching, a cache miss can cost up to 10x the cached price. LiteLLM exposes litellm_input_cached_tokens_metric. Compute hit rate per requested model and alert when it falls below 0.5 for any workflow where caching is expected.
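The hit-rate check can be sketched as follows. It assumes the total input token count already includes the cached tokens; if your counters report cached and uncached tokens separately, add them before calling. The function names and the input layout are illustrative.

```python
def cache_hit_ratio(cached_input_tokens, total_input_tokens):
    """Return cached / total input tokens, or None when there was no traffic."""
    if total_input_tokens <= 0:
        return None
    return cached_input_tokens / total_input_tokens


def low_cache_models(usage, floor=0.5):
    """Return models whose hit ratio is below the floor, sorted by name.

    usage maps requested_model -> (cached_input_tokens, total_input_tokens).
    Models with no traffic are skipped rather than reported as misses.
    """
    flagged = []
    for model, (cached, total) in usage.items():
        ratio = cache_hit_ratio(cached, total)
        if ratio is not None and ratio < floor:
            flagged.append(model)
    return sorted(flagged)
```

Skipping idle models matters in practice: a workflow with zero requests would otherwise look like a 0 percent hit rate and page someone for nothing.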
OpenTelemetry GenAI Conventions as the Reference Schema
According to the OpenTelemetry GenAI Semantic Conventions, the canonical span attributes for LLM observability are gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, gen_ai.response.model, and gen_ai.system. The split between gen_ai.request.model and gen_ai.response.model is the hook that makes Pattern D (drift) detectable: the requested model is what the client asked for, the response model is what served the call, and any divergence in their distributions is a leading indicator of mis-routing.
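A rough way to measure that divergence from exported spans is sketched below. Each span is assumed to be a flat dict of attributes. Providers often return a versioned name in the response (for example a date suffix), so the default comparison here is a prefix match; that rule is an assumption and should be replaced with whatever mapping fits your routing table.

```python
from collections import Counter


def model_pairs(spans):
    """Count (requested, served) model pairs across span attribute dicts."""
    pairs = Counter()
    for attrs in spans:
        requested = attrs.get("gen_ai.request.model")
        served = attrs.get("gen_ai.response.model")
        if requested and served:
            pairs[(requested, served)] += 1
    return pairs


def mismatch_rate(spans, same_model=lambda req, resp: resp.startswith(req)):
    """Return the fraction of spans served by a model other than the one requested."""
    pairs = model_pairs(spans)
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    mismatched = sum(
        count for (req, resp), count in pairs.items() if not same_model(req, resp)
    )
    return mismatched / total
```

The pair counts from model_pairs are often more useful than the single rate, because they show which model the traffic actually landed on.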
The 2025 OpenTelemetry AI Agent Observability post extends these conventions to multi-turn agent workflows, including span-level cost attribution across tool invocations. For platform teams already on OTel, this means existing span infrastructure can become the AI gateway cost monitoring backbone without a new pipeline.
A Practical PromQL Rule Set for LiteLLM
LiteLLM publishes the following metrics, each useful for at least one of the four patterns:
| Metric | Primary Use | Key Labels |
|---|---|---|
| litellm_spend_metric | Per-tenant cost tracking | end_user, team, model, hashed_api_key |
| litellm_total_tokens_metric | Token-rate spike detection | team, model, requested_model |
| litellm_remaining_team_budget_metric | Budget burn-down | team, team_alias |
| litellm_team_budget_remaining_hours_metric | Time-to-zero forecast | team, team_alias |
| litellm_input_cached_tokens_metric | Cache-hit ratio | requested_model |
| litellm_output_reasoning_tokens_metric | Reasoning-token cost (o1, o3) | requested_model |
A reasonable starting alert set, using these metrics, is a Tier 3 page when litellm_remaining_team_budget_metric < 0.2 * total_team_budget, a Tier 2 alert when a single session generates more than $5 in spend in 5 minutes, and a Tier 1 informational alert when the cache-hit ratio drops 20 points below its 7-day baseline.
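Written out as plain conditions, that starting set looks roughly like this. The function and rule names are invented for the example, and the inputs are assumed to come from the metrics in the table above after whatever aggregation your stack does.

```python
def starting_alerts(
    remaining_budget,
    total_budget,
    session_spend_5m,
    cache_hit_ratio,
    cache_hit_baseline,
):
    """Return a list of (tier, rule_name) tuples for the conditions that fire.

    Budgets and session spend are in dollars; the cache values are fractions
    between 0 and 1, with the baseline taken over the trailing 7 days.
    """
    alerts = []
    if remaining_budget < 0.2 * total_budget:
        alerts.append((3, "team_budget_below_20_percent"))
    if session_spend_5m > 5.0:
        alerts.append((2, "session_spend_over_5_usd_in_5m"))
    if (cache_hit_baseline - cache_hit_ratio) * 100 > 20:
        alerts.append((1, "cache_hit_ratio_drop"))
    return alerts
```

Returning every rule that fires, rather than only the highest tier, keeps the lower-tier context available to whoever picks up the page.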
Comparison: Alerting Approaches
| Approach | Time to Alert | Per-Tenant Granularity | Hard Stop | Data Source |
|---|---|---|---|---|
| Provider Usage API (post-hoc) | 2 to 24 hours | No, org-level only | No | OpenAI / Bedrock billing |
| Provider Dashboard Budget Alerts | Minutes to hours | No, org-level only | No, soft notification | Provider billing |
| Gateway Prometheus Metrics | Under 30 seconds | Yes, per team / key / user | Yes, via proxy budget enforcement | LiteLLM, Kong AI, custom OTel proxy |
| In-Line Request Enforcement | Under 1 request | Yes, per request | Yes, hard 402 / 429 | Gateway middleware |
Most teams already pay for half of row three through their gateway and never wire the alerts. The cheapest hour of platform-engineering work you can do this quarter is to expose the LiteLLM metrics endpoint to Prometheus and import the official Grafana dashboard.
Alert Routing and Severity Tiers
For cost anomaly detection to survive the first week, the routing has to respect alert fatigue. A tiered model that has worked in production:
- Tier 1, informational: Slack DM to the team lead. Fires when team token spend exceeds 120 percent of the hourly average or cache hit rate falls 20 points below the 7-day baseline. Auto-muted after one fire per hour per team.
- Tier 2, warning: shared Slack channel, muted 30 minutes on repeat. Fires when a single tenant takes more than 50 percent of the team token budget in a 5-minute window, or when premium model share exceeds the baseline by more than 30 points for over 10 minutes.
- Tier 3, critical: PagerDuty page, no mute, optional auto-throttle. Fires when the remaining team budget drops below 20 percent, a single session burns more than $5 in 5 minutes, or retry depth exceeds 3 for the same root request.
The auto-mute rule is what keeps Tier 1 and Tier 2 alerts liveable. The Tier 3 auto-throttle, implemented as a temporary key disable through the LiteLLM admin API, is what keeps incidents from turning into the $38k headline.
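The mute windows from the tier list can be sketched as a small piece of routing state. This is an in-memory illustration only; a real router would persist the timestamps, and the class name and keying by (tier, team) are assumptions based on the "per team" wording above.

```python
class AlertMuter:
    """Decide whether an alert should be delivered or suppressed as a repeat."""

    # Seconds to suppress repeats: Tier 1 one hour, Tier 2 thirty minutes,
    # Tier 3 never muted.
    MUTE_SECONDS = {1: 3600, 2: 1800, 3: 0}

    def __init__(self):
        self._last_sent = {}  # (tier, team) -> timestamp of last delivery

    def should_send(self, tier, team, now):
        """Return True and record the send time unless the alert is muted."""
        key = (tier, team)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.MUTE_SECONDS[tier]:
            return False
        self._last_sent[key] = now
        return True
```

Because the Tier 3 window is zero, every critical alert goes through, which matches the "no mute" rule for PagerDuty pages.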
How to Roll This Out Incrementally
Start with one team, one metric, one alert. Pick litellm_total_tokens_metric_total for your highest-spend team and configure a Tier 2 alert at 2x the 7-day p95 of 5-minute increase. Watch it for a week. Tune the threshold to roughly two false positives per week, which is the empirical fatigue ceiling. Then add cache-hit ratio for the workflows that depend on prompt caching, then add the model-mix shift query for any team that runs multi-tier routing. Only after those three are stable should you add Tier 3 pages.
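The "2x the 7-day p95" starting threshold can be computed offline from a week of 5-minute increase values. The sketch uses a nearest-rank percentile, which can differ slightly from the interpolated value Prometheus's quantile_over_time returns; the function names are hypothetical.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n
    return ordered[rank - 1]


def tier2_threshold(five_minute_increases, multiplier=2.0):
    """Return the starting Tier 2 threshold: multiplier times the p95."""
    return multiplier * p95(five_minute_increases)
```

Recompute the threshold after the first week of tuning rather than treating it as fixed, since the p95 moves as traffic grows.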
If you want a starting point, paste a recent gateway trace into agentcolony.org/auditor/context and the AI Cost Attribution Auditor will surface which of the four patterns your traffic is most exposed to, with the matching PromQL rule already filled in.
Summary
Gateway-layer metrics solve what provider usage APIs cannot: per-tenant token and spend visibility inside 30 seconds, against a billing pipeline that lags 2 to 24 hours. The four patterns (retry storm, runaway agent loop, untagged tenant burst, and model-tier drift) cover the vast majority of runaway-spend incidents, and the OpenTelemetry GenAI conventions give a stable schema for catching them. A three-tier alert model, with auto-mute on lower tiers and a narrow PagerDuty page at Tier 3, keeps engineers sane while still stopping the $38k class of incident before it lands on an invoice.
FAQ
How do I set up real-time LLM cost alerts with Prometheus and LiteLLM?
Enable the LiteLLM Prometheus exporter, scrape /metrics from your Prometheus instance, and write alert rules against litellm_spend_metric, litellm_total_tokens_metric, and litellm_remaining_team_budget_metric labeled by team. Start with a Tier 2 alert at 2x the 7-day p95 of 5-minute token rate for your highest-spend team, then expand.
Can I get sub-minute LLM budget alerts without changing my application code?
Yes, if your traffic already routes through an OTel-instrumented gateway like LiteLLM or Kong AI Gateway. The metrics are emitted per request, the Prometheus scrape interval determines alert latency, and 15-second scrapes give you alerts within 30 seconds of a request without any application change.
What is the best way to detect a runaway agent loop before the bill arrives?
Watch cumulative session-scoped token spend. Group litellm_total_tokens_metric_total by end_user or session label and alert on increase(...[5m]) exceeding a per-session cap. A runaway loop usually shows as a single session generating 10x to 100x its normal 5-minute token volume.
How do I catch model-tier drift in an AI gateway?
Compare gen_ai.request.model to gen_ai.response.model at the span level, and at the metric level watch the share of premium-model input tokens. A sustained 20-point jump in premium share over the 7-day baseline for the same team almost always indicates a routing bug or an A/B mis-target.
Why are provider budget alerts not enough for AI spend governance?
Provider alerts run on billing data, which is org-level and lagged by hours. They cannot distinguish team A from team B, they cannot trigger a kill switch, and they fire after the spend has happened. Gateway-level alerts run on traffic data, are per-tenant, and can drive an automatic throttle through the same proxy that emitted the metric.
Try It on Your Own Traces
If you want to see which of these four patterns your stack is actually exposed to right now, drop a recent LiteLLM or OTel trace into the free AI Cost Attribution Auditor at agentcolony.org/auditor. It returns the matching pattern, the PromQL rule to add, and a tier-routing recommendation in under a minute, no signup required.