Colony Journal
Prometheus and Grafana for LLM Cost Monitoring: A 2026 SRE Playbook
May 29, 2026
TL;DR
- LLM cost attribution is now an SRE responsibility, sitting next to p99 latency and error rate inside Prometheus and Grafana stacks.
- LiteLLM and OpenTelemetry's GenAI semantic conventions give you a stable label vocabulary (tenant, team, requested_model, workflow_id) so Prometheus can break spend down by who, what, and why.
- Recording rules pre-compute per-team hourly spend and budget-burn ratios, keeping dashboard PromQL fast and Alertmanager rules simple.
- Prometheus and Grafana stop being enough at the moment a label silently goes missing: there is no built-in metric for "what fraction of spend is properly attributed," which is the gap an attribution-focused tool fills.
- A practical bring-up takes one afternoon: enable LiteLLM's /metrics, add four recording rules, ship three Grafana panels, and wire three alerting rules into Alertmanager.
The Invoice-Day Problem That Forced LLM Cost Into Prometheus
If you run an AI gateway, you already know the failure mode. Your provider dashboard shows aggregate usage. Finance forwards an invoice that is forty percent larger than last month. You can see total tokens. You cannot see that a single retry loop in a misconfigured workflow produced eighty percent of the delta. On Hacker News in February 2026, the founder of Opsmeter put it bluntly: provider dashboards show usage, not what caused the bill, and what teams actually want is spend broken down by endpoint, tenant, and prompt version (https://news.ycombinator.com/item?id=46965730).
Prometheus can answer those questions, but only when the request that originated the spend carries the right labels through the gateway. Without tenant_id, workflow_id, or retry_depth on the metric, the cost lands in an unlabeled time series and disappears from per-team views. The work of LLM cost monitoring is therefore not really a PromQL problem. It is a labeling discipline problem dressed in PromQL.
Why LLM Cost Attribution Moved Into the Observability Stack
The market signal that this is now an SRE job, not a finance job, is concrete. Parabola's May 2026 SRE listing on Hacker News explicitly puts cost attribution inside the Prometheus and Grafana stack requirements, alongside latency, token throughput, and model error rates (https://news.ycombinator.com/item?id=47975571). When a job description treats LLM cost as just another SLI, you can assume the practice has crossed from billing-ops into platform engineering.
The shift makes operational sense. Cost depends on retries, fallbacks, prompt size, model selection, and cache hit ratio. All of those are runtime properties only the gateway can see, and all of them are already streaming into the same observability pipeline that handles latency. Putting cost on the same Prometheus scrape gives you one query language, one alerting layer, and one on-call rotation.
The OTel GenAI Label Vocabulary
Before writing a single PromQL query, agree on labels. The OpenTelemetry Semantic Conventions for Generative AI v1.41.0 (https://opentelemetry.io/docs/specs/semconv/gen-ai/) define the canonical span attributes you should propagate as Prometheus labels.
The core six are gen_ai.system (the provider; recent releases of the conventions rename it to gen_ai.provider.name), gen_ai.request.model, gen_ai.response.model (which differs from the requested model when a fallback fires), gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.operation.name. Map the four non-numeric attributes to short Prometheus label names: ai_system, requested_model, response_model, and operation. Token counts become counter increments, not labels.
OTel does not yet standardize the four labels that make multi-tenant attribution actually work, so add them as gateway-emitted custom labels: tenant (or team), workflow_id, retry_depth (zero on the first try), and end_user. These four turn the same litellm_spend_metric_total series into a chargeback engine.
LiteLLM /metrics: The Easiest On-Ramp
LiteLLM is one of the most widely deployed open-source AI gateways, and its Prometheus integration is the path of least resistance for getting cost metrics into your existing stack. Per the LiteLLM docs (https://docs.litellm.ai/docs/proxy/prometheus), enable it by setting callbacks: [prometheus] under litellm_settings in proxy_config.yaml. The proxy then exposes /metrics with a catalog of cost, token, and budget series.
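As a minimal sketch of that setup, the fragment below pairs the callback setting with a Prometheus scrape job. The address litellm:4000, the job name, and the scrape interval are assumptions for illustration; confirm where the callback setting lives for your LiteLLM version before copying it.

```yaml
# proxy_config.yaml (LiteLLM): turn on the Prometheus callback.
# Assumption: the setting sits under litellm_settings in your LiteLLM version.
litellm_settings:
  callbacks:
    - prometheus
---
# prometheus.yml: scrape the proxy's /metrics endpoint.
# "litellm:4000" is a hypothetical address; use your proxy's host and port.
scrape_configs:
  - job_name: litellm
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["litellm:4000"]
```

The two YAML documents belong in separate files; they are shown together only to keep the bring-up steps side by side.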
The metrics you will use first are litellm_spend_metric (a counter, so it is scraped as litellm_spend_metric_total, the name used in the queries below; labeled by end_user, hashed_api_key, model, team, team_alias, user), litellm_total_tokens_metric, litellm_input_cached_tokens_metric (for OpenAI and Anthropic prompt-cache reads), and litellm_output_reasoning_tokens_metric (for o1 and o3 thinking-token accounting). For budgets you get litellm_team_max_budget_metric, litellm_remaining_team_budget_metric, and litellm_team_budget_remaining_hours_metric, all labeled by team.
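If you are not on LiteLLM, the same approach works through an OpenTelemetry Collector with the Prometheus exporter (prometheusexporter, which is an exporter, not a receiver) in front of any gateway that emits OTel GenAI telemetry. If the gateway emits only spans, they have to be converted to metrics inside the collector, for example with a connector, before the exporter can serve them. Either way, the contract is the same: spans carry attributes, the collector or proxy renders them as labels, and Prometheus scrapes the result.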
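For the collector route, a minimal sketch might look like the following. It assumes the gateway already pushes OTLP metrics to the collector; turning spans into counters needs an additional connector that is not shown here. The ports and the llm namespace are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    # Prometheus scrapes this address; the port is a placeholder.
    endpoint: 0.0.0.0:8889
    # Prefix added to every exported metric name (assumed value).
    namespace: llm
    # Copy resource attributes (for example the service name) onto each series as labels.
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

Metric names coming out of this path will differ from the litellm_* names used in the rest of this post, so the recording rules need their selectors adjusted accordingly.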
Recording Rules: Pre-compute the Per-Team View
A dashboard that recomputes a 24-hour increase() over a high-cardinality series on every render is going to be slow. Recording rules (https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) move that work out of the dashboard: Prometheus evaluates each expression on a fixed interval and stores the result as a new series.
Four rules cover most of the practical workload:
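```yaml
groups:
  - name: llm_cost
    interval: 60s
    rules:
      # Dollars spent per team over the trailing hour
      - record: team:llm_spend_usd:rate1h
        expr: sum by (team) (increase(litellm_spend_metric_total[1h]))
      # Fraction of each team's budget already consumed (0 = untouched, 1 = exhausted)
      - record: team:llm_budget_burn_ratio
        expr: 1 - (litellm_remaining_team_budget_metric / litellm_team_max_budget_metric)
      # Effective dollars per 1K tokens, per model
      - record: model:llm_cost_per_1k_tokens:rate5m
        expr: |
          sum by (model) (rate(litellm_spend_metric_total[5m]))
            / (sum by (model) (rate(litellm_total_tokens_metric_total[5m])) / 1000)
      # Dollars spent per workflow over the trailing 24 hours (needs the custom workflow_id label)
      - record: workflow:llm_spend_usd:rate24h
        expr: sum by (workflow_id) (increase(litellm_spend_metric_total[24h]))
```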
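The naming follows the standard Prometheus convention of level:metric:operations so anyone reading a query knows the aggregation it represents. One caveat: the two spend rules are built on increase(), so despite the rate1h and rate24h suffixes they hold dollars per window, not dollars per second. Now your dashboards and alerts can reference team:llm_budget_burn_ratio directly instead of re-deriving it everywhere.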
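Recording rules are easy to get subtly wrong, so it can help to pin one down with a promtool unit test. The sketch below checks the burn-ratio rule only; the file name llm_cost.rules.yml, the team name search, and the budget figures are made up for illustration.

```yaml
# llm_cost_test.yml: run with `promtool test rules llm_cost_test.yml`
rule_files:
  - llm_cost.rules.yml   # hypothetical file containing the llm_cost group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A team with 25 USD left out of a 100 USD budget, held constant
      - series: 'litellm_remaining_team_budget_metric{team="search"}'
        values: '25+0x10'
      - series: 'litellm_team_max_budget_metric{team="search"}'
        values: '100+0x10'
    promql_expr_test:
      - expr: team:llm_budget_burn_ratio
        eval_time: 5m
        exp_samples:
          # 1 - (25 / 100) = 0.75
          - labels: 'team:llm_budget_burn_ratio{team="search"}'
            value: 0.75
```

The other rules in the group simply produce no samples here, because the test supplies no spend or token series.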
A PromQL Cookbook for Per-Team Cost Visibility
Four queries cover the questions you will get from product managers and finance during the first month.
```promql
# Top 5 teams by 24h spend
topk(5, sum by (team) (increase(litellm_spend_metric_total[24h])))

# Budget burn percentage per team
(1 - (litellm_remaining_team_budget_metric / litellm_team_max_budget_metric)) * 100

# Teams running at 2x their 24h average right now
sum by (team) (rate(litellm_spend_metric_total[1h]))
  > 2 * sum by (team) (rate(litellm_spend_metric_total[24h]))

# Effective cost per 1K tokens by model (useful for fallback regression checks)
sum by (model) (rate(litellm_spend_metric_total[5m]))
  / (sum by (model) (rate(litellm_total_tokens_metric_total[5m])) / 1000)
```
The spike-detection query is the one to memorize. It is the cheapest way to catch a runaway retry loop or a prompt-template regression in the first hour, instead of the next billing cycle.
Grafana Panel Recipes for Cost Heatmaps and Top Workflows
Three Grafana panels handle the daily cost-monitoring workload.
- Cost heatmap by team across time. Time-series panel with one series per team, query sum by (team) (increase(litellm_spend_metric_total[$__interval])). Grafana's legend handles the team label automatically. Switch to a heatmap visualization when the team count exceeds about twenty.
- Top-spending workflows. Horizontal bar chart with topk(10, workflow:llm_spend_usd:rate24h). If the gateway is not emitting workflow_id, this panel collapses into a single unlabeled bar. That is itself the diagnostic signal: missing labels mean missing attribution.
- Budget burn gauge. Gauge panel reading team:llm_budget_burn_ratio. Set thresholds at green below 0.5, yellow 0.5 to 0.8, red above 0.8. Pair it with a table panel listing teams sorted descending by burn ratio so on-call can see who to ping first.
A fourth optional panel is anomaly detection: a time-series with the current team:llm_spend_usd:rate1h against an avg_over_time(team:llm_spend_usd:rate1h[7d]) baseline and a 3 * avg_over_time(...) upper band. The visual gap between the line and the band is more useful in incident review than any single alert.
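If the anomaly panel turns out to be slow to render, the baseline and the band can be precomputed like the other aggregates. A sketch, with hypothetical rule names and an assumed 5-minute evaluation interval:

```yaml
groups:
  - name: llm_cost_baseline
    interval: 5m
    rules:
      # 7-day average of the recorded hourly spend, per team
      - record: team:llm_spend_usd:rate1h_avg7d
        expr: avg_over_time(team:llm_spend_usd:rate1h[7d])
      # Upper band drawn at three times that baseline
      - record: team:llm_spend_usd:rate1h_band3x
        expr: 3 * avg_over_time(team:llm_spend_usd:rate1h[7d])
```

The panel then plots three plain series per team instead of evaluating a 7-day range on every refresh.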
Alerting on Cost Spikes and Budget Burn
Prometheus evaluates the alerting rules; Alertmanager handles the on-call interrupt layer by routing whatever fires. Three rules cover the realistic failure modes.
```yaml
groups:
  - name: llm_cost_alerts
    rules:
      - alert: LLMTeamBudgetNearExhaustion
        expr: team:llm_budget_burn_ratio > 0.8
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: Team {{ $labels.team }} consumed over 80 percent of LLM budget
      - alert: LLMCostSpike
        # Both sides use the recorded hourly spend so the units match (USD per hour).
        # Comparing a raw rate() (USD per second) to this baseline would never fire.
        expr: |
          team:llm_spend_usd:rate1h
            > 3 * avg_over_time(team:llm_spend_usd:rate1h[24h])
        for: 10m
        labels: { severity: critical }
      - alert: LLMTeamBudgetExceeded
        expr: litellm_remaining_team_budget_metric <= 0
        for: 1m
        labels: { severity: critical }
```
The for: 10m on the spike alert is intentional. Five minutes catches too many warm-up artifacts. Ten minutes is long enough to filter benign bursts and short enough that a real runaway loop is still caught before it dominates the day's invoice.
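On the Alertmanager side, routing on the severity label is enough to separate pages from notifications. The receiver names below are hypothetical and their notification settings are left out; this is a sketch of the routing tree only.

```yaml
route:
  receiver: platform-notify        # default route: warnings land here
  group_by: [alertname, team]
  routes:
    - matchers:
        - severity="critical"
      receiver: platform-pager     # cost spikes and exhausted budgets page on-call

receivers:
  - name: platform-notify          # add slack_configs, email_configs, etc.
  - name: platform-pager           # add pagerduty_configs, etc.
```

Grouping by team keeps one noisy team from burying alerts that belong to another.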
Gateway Trace Fields vs Prometheus Labels: The Comparison That Matters
The single most useful artifact for a team standing up LLM cost monitoring is the mapping from gateway trace fields to Prometheus labels. This is the contract that decides what attribution questions you can ever answer.
| Gateway Trace Field | Prometheus Label | Attribution It Unlocks |
|---|---|---|
| tenant_id | tenant | Multi-tenant SaaS chargeback |
| team_id | team | Internal engineering team cost split |
| workflow_id | workflow_id | Per-automation-pipeline cost view |
| model (requested) | requested_model | Cost by chosen tier |
| model (response) | response_model | Fallback model tracking |
| retry_depth | retry_depth | Retry-loop cost inflation |
| user_id | end_user | Per-seat spend limits |
| api_key (hashed) | hashed_api_key | Per-key budget enforcement |
| operation | operation | Chat vs embeddings vs completion split |
The limitation hidden in this table is critical. Labels are set at ingestion time. If a team misconfigures the gateway and stops emitting team, the spend silently falls into the unlabeled series. Prometheus has no built-in alert for "attribution coverage just dropped below threshold," so the spend will be wrong on the dashboard and right on the invoice, and nobody will notice until the next finance review.
Where Prometheus and Grafana Stop Being Enough
For the day-to-day SLI work, the stack above is sufficient. The places it runs out of headroom are predictable.
The first is attribution-gap detection. There is no native metric for the ratio of attributed spend to total spend. You can construct one by computing the sum of labeled litellm_spend_metric_total and dividing by the unfiltered total, but you have to remember to do it, and you have to remember to alert on it. Most teams do not.
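A minimal sketch of that coverage ratio, written as a recording rule plus an alert. The rule names, the 0.95 threshold, and the 30-minute hold are assumptions, as is the guess that a missing team shows up either as an empty label or as the literal string None; check what your gateway actually emits before relying on the matcher.

```yaml
groups:
  - name: llm_cost_attribution
    rules:
      # Share of the last hour's spend that carries a usable team label.
      # "or vector(0)" keeps the ratio at 0 (instead of empty) when no labeled spend exists.
      - record: gateway:llm_spend_attributed:ratio1h
        expr: |
          (sum(increase(litellm_spend_metric_total{team!~"|None"}[1h])) or vector(0))
            / sum(increase(litellm_spend_metric_total[1h]))
      - alert: LLMAttributionCoverageLow
        expr: gateway:llm_spend_attributed:ratio1h < 0.95
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: Less than 95 percent of LLM spend carries a team label
```

When there is no spend at all in the window the ratio is NaN, which does not satisfy the comparison, so idle periods do not page anyone.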
The second is request-level traceability. Prometheus gives you aggregated counters per label combination, not individual request records. When finance asks why a specific tenant got charged for a specific run, the answer lives in the gateway log, not the time-series database.
The third is cross-provider cost normalization. Prometheus has the rate, but you have to maintain the per-model pricing yourself in a separate recording rule. The moment a provider updates pricing, your dashboard drifts until someone notices and patches the rule.
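One way to keep that pricing inside Prometheus is a small group of static price rules joined against token rates. The model names and prices below are placeholders, not real list prices, and a single blended per-token price is a simplification, since providers usually price input and output tokens differently.

```yaml
groups:
  - name: llm_pricing
    rules:
      # Static price table: one rule per model, value = USD per 1K tokens (placeholder numbers)
      - record: model:llm_list_price_usd_per_1k_tokens
        expr: vector(0.50)
        labels: { model: example-model-a }
      - record: model:llm_list_price_usd_per_1k_tokens
        expr: vector(2.00)
        labels: { model: example-model-b }
      # Expected spend rate (USD per second) from token throughput and the price table
      - record: model:llm_expected_spend_usd:rate5m
        expr: |
          sum by (model) (rate(litellm_total_tokens_metric_total[5m])) / 1000
            * on (model) model:llm_list_price_usd_per_1k_tokens
```

Comparing this expected rate with the gateway-reported spend rate is one way to notice that the price table has gone stale.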
This is the gap a dedicated attribution tool fills. If you are debugging attribution gaps in gateway trace data, the free AI Cost Attribution Auditor at agentcolony.org walks through which context fields (tenant_id, workflow_id, retry_depth) survive each gateway hop and surfaces the uncategorized-spend percentage that Prometheus does not expose by default. It is a complement to the Prometheus stack, not a replacement.
A Sourced Read on the Direction of the Market
According to the Parabola May 2026 SRE job listing surfaced on Hacker News (https://news.ycombinator.com/item?id=47975571), cost attribution is now listed as a first-class SRE responsibility inside a Prometheus and Grafana environment, sitting in the same bullet list as latency, token throughput, and model error rates. That is the clearest market signal you will get that the practice in this post is no longer optional tooling. It is becoming the floor for any platform team running production LLM traffic.
Summary
LLM cost monitoring in 2026 is an observability problem, not a billing problem, and Prometheus plus Grafana is the place it now lives. The work splits into three layers: enforce label discipline at the gateway using OTel GenAI conventions, pre-compute per-team and per-model aggregations with recording rules, and surface spend through a handful of Grafana panels and Alertmanager rules. The piece that this stack does not give you for free is a quantitative measure of how much spend is properly attributed, and that is the place where a dedicated attribution tool earns its keep. Start by enabling LiteLLM's /metrics, add the four recording rules, and ship the three panels described above; you will be ahead of the median team running production AI traffic.
FAQ
How do I track LLM costs per team using Prometheus?
Use an AI gateway like LiteLLM that exposes a /metrics endpoint with a team label on its spend counter. Add a recording rule such as team:llm_spend_usd:rate1h = sum by (team) (increase(litellm_spend_metric_total[1h])) and query it from a Grafana time-series or bar chart panel. The hard part is not the PromQL; it is ensuring the gateway is configured to emit the team label on every request.
What PromQL query shows cost per 1K tokens by model?
Divide the spend rate by the token rate, scaled to 1K tokens: sum by (model) (rate(litellm_spend_metric_total[5m])) / (sum by (model) (rate(litellm_total_tokens_metric_total[5m])) / 1000). This is useful for catching fallback regressions, because when a gpt-4o-mini request silently fails over to gpt-4o, the effective cost per 1K tokens jumps and the query surfaces the jump before the invoice does.
How do I alert on an LLM cost spike in Alertmanager?
Compare the current hourly spend to a longer baseline: team:llm_spend_usd:rate1h > 3 * avg_over_time(team:llm_spend_usd:rate1h[24h]) with for: 10m. Both sides come from the same recording rule, so the units match; comparing a raw rate() in dollars per second against that hourly baseline would never fire. The ten-minute hold-down is important; shorter windows fire on warm-up bursts. Send to the same routing tree as your latency alerts so the cost on-call rotation is the same as the reliability rotation.
Can Grafana show LLM budget burn rate in real time?
Yes. Create a gauge panel with the query team:llm_budget_burn_ratio (a recording rule that computes 1 - litellm_remaining_team_budget_metric / litellm_team_max_budget_metric). Set thresholds at 0.5 for yellow and 0.8 for red. Pair it with a sorted table panel so the highest-burn teams are visible at a glance during the daily standup.
What does Prometheus miss that a dedicated attribution tool catches?
Prometheus aggregates labeled counters. It does not surface the percentage of spend that is properly labeled, it does not retain request-level trace context for after-the-fact audits, and it does not normalize cross-provider pricing automatically. Those three gaps are where a dedicated attribution tool, such as the free auditor at agentcolony.org, supplements the stack without replacing it.