GCP Vertex AI Cost Attribution: Per-Team Tracking via Gateway Traces

TL;DR

GCP Vertex AI cost attribution at the per-team level is structurally impossible with Cloud Billing alone, because the BigQuery billing export aggregates at the SKU × project × day level and carries no per-request fields.
Cloud Audit Logs record who called aiplatform.googleapis.com.GenerateContent, but they intentionally redact request bodies, so they do not contain token counts or computed cost, which kills any audit-log-only chargeback plan.
The working pattern for Vertex AI per-tenant cost tracking is an AI gateway (LiteLLM proxy, Apigee, or a custom Cloud Run shim) that injects a tenant_id, reads usageMetadata from the Vertex response, and computes cost_usd at call time.
A four-stage pipeline (capture, enrich, store, reconcile) joins gateway trace data in BigQuery against the billing export to verify that gateway-computed spend matches the actual GCP invoice within roughly ±2%.
agentcolony.org/auditor inspects this stack end to end and flags the most common attribution leaks for Vertex AI workloads: missing tenant headers, ungated direct-to-Vertex calls, and unreconciled gateway logs.

Why GCP Vertex AI cost attribution is harder than AWS or Azure

If you have built FinOps reporting for Bedrock or Azure OpenAI, you already know the shape of the problem: the cloud provider charges by token, but your internal teams want per-team chargeback. On GCP Vertex AI the problem is sharper, because billing Google Cloud AI usage by team depends on a per-request signal that does not exist in any of the native data sources you have access to.

Vertex AI exposes models like Gemini 1.5 Flash, Gemini 1.5 Pro, and the older PaLM family through aiplatform.googleapis.com. The billing pipeline counts input and output tokens and posts a daily aggregate to Cloud Billing. There is no tenant header on the wire that GCP retains, no request-body label that flows into billing, and no per-call cost record exposed anywhere in the platform.

That means LLM cost attribution on GCP cannot use a simpler approach borrowed from CPU or storage workloads. You need a request-level capture layer, and the only place that layer can live is in front of the Vertex AI endpoint, where you still own the call.

The Cloud Billing gap: SKU-level aggregation only

According to the Cloud Billing BigQuery export documentation, the standard billing export schema includes service.description, sku.description, project.id, a labels array, cost, usage.amount, and usage.unit. There is no request_id, no tenant_id, no user, no team. The pipeline rolls up to the SKU × project × day level and posts roughly one row per SKU per day.

GCP Resource Labels are the official workaround. You can attach up to 64 key-value pairs to a project or a resource, and those labels propagate into the billing export. If every team has its own GCP project, you can split the Vertex AI line item by labels.team. That is fine for small orgs.
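
As a rough illustration of the one-project-per-team case, the sketch below groups export-like rows by a team label. The row shape is a simplification of the billing export (a cost plus a list of label key/value pairs), and the SKU names, label values, and dollar amounts are made up for the example.

```python
from collections import defaultdict

# Illustrative rows, loosely modeled on billing export rows.
# SKU names and costs are placeholders, not real invoice data.
rows = [
    {"sku": "flash-input (placeholder)", "cost": 12.50,
     "labels": [{"key": "team", "value": "analytics"}]},
    {"sku": "flash-output (placeholder)", "cost": 30.00,
     "labels": [{"key": "team", "value": "analytics"}]},
    {"sku": "flash-input (placeholder)", "cost": 8.25,
     "labels": [{"key": "team", "value": "search"}]},
    {"sku": "flash-output (placeholder)", "cost": 4.00, "labels": []},
]


def cost_by_team(export_rows):
    """Sum cost per value of the 'team' label; rows without it go to 'unlabeled'."""
    totals = defaultdict(float)
    for row in export_rows:
        labels = {pair["key"]: pair["value"] for pair in row["labels"]}
        totals[labels.get("team", "unlabeled")] += row["cost"]
    return dict(totals)


split = cost_by_team(rows)
```

The split only works because each row carries a different team label. In a shared project every row carries the same labels, so the same grouping returns a single bucket.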

Why shared projects collapse label-based attribution

Label-based attribution falls apart the moment teams share a Vertex AI project, which is the default pattern in multi-tenant SaaS:

The generateContent API does not accept request-body metadata that flows into Cloud Billing.
There is no per-call label or tenant header that GCP Billing captures.
One shared project produces one aggregated cost line per SKU per day.

Concretely, Gemini 1.5 Flash is priced at $0.075 per million input tokens and $0.30 per million output tokens (Google Cloud pricing, 2024). A shared project running 50,000 requests per day across ten teams still produces only one aggregated row per SKU per day in the export. Without external instrumentation, no FinOps analyst can split those rows by team.
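
A worked example makes the arithmetic concrete. The sketch below uses the Gemini 1.5 Flash rates quoted above; the per-request token counts (1,200 input, 300 output) are hypothetical averages chosen for illustration, not measurements.

```python
from decimal import Decimal

# Rates quoted in the text (USD per million tokens, Gemini 1.5 Flash).
INPUT_RATE_PER_M = Decimal("0.075")
OUTPUT_RATE_PER_M = Decimal("0.30")
MILLION = Decimal(1_000_000)


def call_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Token-times-rate cost of a single call, in USD."""
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / MILLION


# Hypothetical average request: 1,200 input tokens and 300 output tokens.
per_call = call_cost_usd(1_200, 300)   # 0.00018 USD
# 50,000 such requests in one day, regardless of which team sent them.
daily_total = per_call * 50_000        # 9 USD
```

Under these assumptions the whole day costs about $9, and that total is all the export records. The ten per-team shares that add up to it are not recoverable from the billing data.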

The audit log dead end

A second instinct is to lean on Cloud Audit Logs. The Vertex AI audit logging documentation confirms that Cloud Audit Logs record the principal (service account or user) that called GenerateContent, the method, the project, and the timestamp. That sounds promising for Vertex AI per-tenant cost tracking, until you actually inspect a log entry.

Audit logs were designed for compliance and security, not cost accounting. The request body is redacted in Data Access logs to prevent leaking sensitive content. That redaction removes the exact fields you need: prompt size, response size, and the usageMetadata token counts. The log also has no pricing context. It tells you a call happened and who made it; it does not tell you what that call cost.

Data Access audit logs are also billable once you turn them on, which means the most detailed version of this approach actively increases your bill. There is no native join key in GCP that links one generateContent call to a dollar value for that call, which is the structural reason naive solutions fail.

Gateway traces as the attribution layer

The pattern that works in production is to route Vertex AI traffic through an AI gateway that sits between application teams and the Vertex endpoint. The gateway is the only layer that sees both the caller metadata (a tenant_id or team_id passed as an HTTP header, a request body field, or a virtual API key) and the response metadata (usageMetadata.promptTokenCount and candidatesTokenCount from the Vertex AI response).
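
A minimal sketch of that capture step, assuming the tenant arrives in an HTTP header and the Vertex response has already been parsed into a dict. The header name x-tenant-id and the function name capture_usage are hypothetical; the usageMetadata field names are the ones cited above.

```python
def capture_usage(headers, vertex_response, tenant_header="x-tenant-id"):
    """Pair caller identity with the token counts from one Vertex AI response.

    `tenant_header` is an illustrative header name, not a Vertex AI convention.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    tenant_id = normalized.get(tenant_header.lower())
    if not tenant_id:
        # Refuse untagged calls instead of logging unattributable spend.
        raise ValueError(f"missing {tenant_header} header")

    usage = vertex_response.get("usageMetadata", {})
    return {
        "tenant_id": tenant_id,
        "input_tokens": int(usage.get("promptTokenCount", 0)),
        "output_tokens": int(usage.get("candidatesTokenCount", 0)),
    }


sample_response = {
    "usageMetadata": {"promptTokenCount": 1200, "candidatesTokenCount": 300}
}
captured = capture_usage({"X-Tenant-Id": "analytics"}, sample_response)
```

Rejecting requests without a tenant header is one possible policy. A gateway could instead bucket them under an "unknown" tenant, at the price of a growing unattributed line.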

LiteLLM proxy is the most widely deployed open-source option for this layer. According to the LiteLLM spend tracking documentation, the proxy tracks spend for keys, users, and teams across more than 100 LLMs, and applies provider-specific cost tracking automatically when the response includes tier metadata (including Vertex AI PayGo and priority pricing). The proxy maintains a SpendLogs table with columns for hashed api_key, user, team_id, request_tags, model, prompt_tokens, completion_tokens, and spend.

Virtual keys, trace schema, and the BigQuery handoff

Virtual keys are the lever that ties this to org structure. You issue one virtual key per team or per workload, attach a spend cap, and then GET /team/info?team_id=analytics returns near-real-time cumulative cost for that team. The proxy computes cost at call time using the same token-times-rate formula that GCP billing uses, but it tags every row with caller identity.
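
The lookup above can be sketched with the standard library alone. The base URL, the key value, and the helper name team_info_request are placeholders, and bearer-token authentication is an assumption about how the proxy is configured; the sketch only builds the request and does not assume a response shape.

```python
from urllib.parse import urlencode
from urllib.request import Request


def team_info_request(base_url, team_id, admin_key):
    """Build (but do not send) a GET /team/info request for one team."""
    query = urlencode({"team_id": team_id})
    return Request(
        f"{base_url.rstrip('/')}/team/info?{query}",
        # Assumption: the proxy accepts its key as a bearer token.
        headers={"Authorization": f"Bearer {admin_key}"},
        method="GET",
    )


# Placeholder host and key for a locally running proxy.
req = team_info_request("http://localhost:4000/", "analytics", "sk-placeholder")
```

Passing req to urllib.request.urlopen would send it. Check the fields of the returned JSON against your proxy version before building reports on them.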

The minimum trace schema for a Vertex AI gateway is straightforward:

trace_id or request_id (UUID generated by the gateway)
tenant_id or team_id (from header or virtual key)
model_id (for example gemini-1.5-flash-001)
input_tokens, output_tokens (from usageMetadata)
cost_usd (computed at call time)
timestamp, project_id, region
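
One way to pin this schema down in code is a small dataclass. This is a sketch: the class name TraceRecord, the field types, and the sample values are assumptions, while the field names follow the list above.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TraceRecord:
    """One gateway trace row; field names mirror the minimum schema."""
    tenant_id: str
    model_id: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    project_id: str
    region: str
    # Generated by the gateway when the record is created.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_row(self) -> dict:
        """Plain dict, ready for JSON serialization or a row insert."""
        return asdict(self)


record = TraceRecord(
    tenant_id="analytics",
    model_id="gemini-1.5-flash-001",
    input_tokens=1200,
    output_tokens=300,
    cost_usd=0.00018,
    project_id="shared-project",   # placeholder project
    region="us-central1",          # placeholder region
)
row = record.to_row()
```

Making the record frozen keeps a trace immutable once emitted, which matters when the same row later serves as chargeback evidence.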

Streamed to BigQuery, that schema becomes the attribution source of truth, and the Cloud Billing export drops back into the role of a reconciliation anchor. If you want a quick check on whether your own pipeline emits this minimum schema, the per-endpoint heatmap at agentcolony.org/auditor is built to flag missing tenant fields.

Approach comparison: native GCP versus AI gateway

The trade-offs across the three plausible data sources are clearer side by side. The table below compares each option on the dimensions that matter for cost reporting from Vertex AI gateway traces and for team-level chargeback.

Dimension	Cloud Billing Export	Cloud Audit Logs	AI Gateway (LiteLLM, Apigee)
Granularity	SKU × project × day	Per API call	Per request
Tenant identity	Resource labels only	Service account or caller email	Custom header or virtual key
Token counts	No	No (body redacted)	Yes (from API response)
Cost per call	No	No	Yes (computed)
Latency	About 24 hours	0 to 5 minutes	Real-time
Setup complexity	Low (enable export)	Medium (log sink + BQ)	Higher (deploy and configure)
Reconciliation role	Invoice anchor (source of truth for totals)	Not useful	Attribution source (checked against billing)
Chargeback ready	No (without gateway)	No	Yes

The practical reading: Cloud Billing stays as your invoice of record, Cloud Audit Logs are useful for security and compliance but not for cost, and the gateway is the only layer that can produce a row whose granularity matches the granularity of a chargeback decision.

The four-stage pipeline that actually ships

Most teams that solve this converge on the same architecture, regardless of whether the gateway is LiteLLM, Apigee, or a hand-rolled Cloud Run service.

Capture. The gateway sits in the request path. It accepts the call from the application, extracts tenant_id (header, body field, or implicit via virtual key), forwards to Vertex AI, and reads usageMetadata from the response.
Enrich. The gateway appends cost_usd = input_tokens × input_rate + output_tokens × output_rate at call time, using a model price map that you update when Google publishes new SKUs.
Store. Trace records flow to BigQuery, either via direct streaming insert or via Pub/Sub fanning into Dataflow. Partition the llm_usage table by date and cluster by tenant_id so the chargeback query is cheap.
Reconcile. On a monthly cadence, compare the gateway total, meaning the sum of the per-tenant rows from SELECT tenant_id, SUM(cost_usd) FROM llm_usage GROUP BY tenant_id, against SELECT SUM(cost) FROM billing_export WHERE service.description = 'Vertex AI' AND project.id = '<shared_project>' over the same period. A healthy pipeline lands within roughly ±2% of the billed total, with the variance explained by token-counting rounding and minor billing-cycle skew.
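
The reconcile stage reduces to a small variance check once both query results are in hand. The sketch below assumes the per-tenant totals and the billed total have already been fetched from BigQuery; the function name reconcile, the default tolerance, and the sample dollar figures are illustrative.

```python
def reconcile(gateway_totals, billed_total, tolerance=0.02):
    """Compare summed gateway spend against the billed Vertex AI total.

    gateway_totals: mapping of tenant_id -> gateway-computed cost in USD.
    billed_total:   SUM(cost) from the billing export for the same period.
    tolerance:      acceptable relative variance (0.02 means ±2%).
    """
    if billed_total <= 0:
        raise ValueError("billed_total must be positive")
    gateway_total = sum(gateway_totals.values())
    variance = (gateway_total - billed_total) / billed_total
    return {
        "gateway_total": gateway_total,
        "billed_total": billed_total,
        "variance": variance,
        "within_tolerance": abs(variance) <= tolerance,
        # Spend on the invoice that no gateway trace accounts for.
        "unattributed_usd": max(billed_total - gateway_total, 0.0),
    }


# Hypothetical month: two tenants, gateway slightly under the invoice.
healthy = reconcile({"analytics": 600.0, "search": 390.0}, billed_total=1000.0)
# Hypothetical month with a large gap between gateway and invoice.
leaky = reconcile({"analytics": 600.0, "search": 390.0}, billed_total=1200.0)
```

A negative variance beyond the tolerance means the invoice contains spend the gateway never saw, so unattributed_usd is the first number to look at when the check fails.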

Common pitfalls that break attribution

A few patterns recur across teams that adopt this stack and still end up with unreliable per-team numbers for Google Cloud AI billing:

One project, one bill. Shared GCP projects consolidate quota management but eliminate label-based attribution. Without a gateway, the bill is unsplittable.
Direct-to-Vertex bypass. If even one team is allowed to call Vertex AI without going through the gateway, the reconciliation join breaks for that team's costs, and the shortfall shows up as unattributed spend in the variance check.

Shared infrastructure and managed-service pitfalls

The second class of pitfalls hits even teams that route every call through a gateway, because shared infrastructure and managed analytics surfaces silently hide the right fields:

Static endpoint labels in Kubernetes. When multiple teams share a node pool and a single Vertex AI endpoint, labels on that endpoint resource do not change per call, so you cannot back-derive the team that triggered each call.
Apigee analytics drift. Apigee can absolutely intercept Vertex AI calls and emit custom analytics, but the default dashboard is traffic-focused, not cost-focused. Operators have to write response message policies that extract usageMetadata and post it to Analytics, and that work tends to lag the actual launch.
Reliance on the AI Cost Summary Agent. GCP's billing console now surfaces an AI Cost Summary Agent that narrates trends in aggregate Vertex AI spend. It is genuinely useful for anomaly detection, but it operates on the same aggregate export and cannot deliver per-team chargeback.

Summary

LLM cost attribution on GCP is a request-level problem that teams keep trying to solve with aggregate-level data, and the gap will not close on Google's side any time soon. The Cloud Billing export remains a project-and-SKU-grained invoice, and Cloud Audit Logs intentionally drop the request body that would carry the cost-bearing fields. The only durable path to Vertex AI per-tenant cost tracking is a gateway in the request path that captures token counts, tags them with tenant identity, computes cost at call time, and reconciles against billing in BigQuery on a monthly cadence. agentcolony.org/auditor reviews this exact pipeline for Vertex AI workloads and surfaces the leaks that quietly invalidate chargeback numbers.

FAQ

Can I track Vertex AI costs per team using GCP labels only?

Only if every team has its own GCP project. In that case, projects carry labels like team: analytics that propagate into the billing export, and a BigQuery group-by produces a clean split. The moment teams share a Vertex AI project, labels become static on the endpoint resource and do not vary per request, so you cannot attribute a specific call to a specific team without a gateway in the path.

What does Cloud Audit Log actually capture for Vertex AI calls?

Audit logs capture the caller identity (service account or user email), the method name (for example aiplatform.googleapis.com.GenerateContent), the project, the region, and the timestamp. They do not capture token counts, prompt or completion sizes, or any cost data, because the request body is redacted in Data Access logs to avoid leaking sensitive content. That redaction is non-negotiable, so audit logs alone cannot drive chargeback.

Do I need to run my own gateway, or can I use a managed service?

For per-request cost reporting from Vertex AI gateway traces, you need something in the request path. LiteLLM proxy is the lightest-weight option, runs in a container, and has Vertex AI support out of the box with virtual keys, spend tracking, and a SpendLogs table. Apigee is the enterprise-native GCP option but requires more configuration to extract token usage from the response body. A custom Cloud Run shim is a viable middle ground if you only need a few fields.

How do I reconcile gateway cost estimates with actual GCP billing?

Run a monthly BigQuery join: sum cost_usd from your llm_usage table grouped by month and project, then compare against SUM(cost) from the billing export filtered to service.description = 'Vertex AI' and the same project. A well-instrumented pipeline lands within about ±2% of the billed total. Larger variance usually points at direct-to-Vertex calls bypassing the gateway, a stale model price map, or a token-counting bug in custom client code.

Does the GCP AI Cost Summary Agent solve per-team chargeback?

No. The AI Cost Summary Agent is a narrative layer over the aggregate billing export. It can tell you that Vertex AI spend rose forty percent month over month and flag candidate causes. It cannot break that spend down per tenant within a shared project, because the underlying data simply does not contain per-tenant identity. It is a useful complement to a gateway-based pipeline, not a replacement.