VPSSpark Blog

Why Your AI Bill Keeps Rising as Token Prices Fall

AI Cost Insights · 2026.06.17 · ~10 min read

Diagram: falling token price vs rising total AI bill — Jevons Paradox for AI
Per-token prices halved; total spend doubled—not a bug, a 160-year-old economic law replaying in the API era.

In 2023, GPT-4 API input cost $30 per million tokens. By late 2024, models at roughly the same capability had fallen below $3. In 2025, Haiku-class models dropped under $1 per million. If you only watch the unit price, this should feel like a win — AI is getting dramatically cheaper.

Open last month's invoice and the mood shifts: why is this higher than last year?

You're not imagining it, and no vendor is secretly overcharging you. What you're seeing is a 160-year-old economic law playing out again in the API era. It's called the Jevons Paradox.

Bottom line
Every time token prices halve, typical usage tends to grow 3–5×. Savings from cheaper units get fully eaten by volume — and your bill climbs anyway.
−97%: GPT-4-class model unit price, 2023→2026
10×: average token volume growth over the same period
Average developer monthly AI bill growth

Jevons Paradox: cheaper units, bigger bills

In 1865, British economist William Stanley Jevons studied coal consumption and found something counterintuitive: every time steam engines got more efficient, total coal use went up, not down. Efficiency cut the cost per unit of work, which unlocked new applications — factories that couldn't afford steam before came online; single-line plants added second and third lines.

Saving per unit of consumption is not the same as saving total consumption. That's Jevons Paradox: when a resource gets more efficient or cheaper, total demand for it usually rises instead of falling.

Tokens follow the same curve. Each order-of-magnitude price drop unlocks use cases that used to die behind a "too expensive" gate:

  • At $30/M, you use AI for weekly report summaries.
  • At $3/M, you add code review on every PR.
  • At $0.30/M, you run background agents, ticket triage, hourly log scans.
  • At $0.03/M, you wire the whole workflow in — and forget to turn it off.

Each price cut isn't really "savings." It's permission to use more. Absolute spend creeps upward while the sticker price keeps falling.
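To make the bottom-line rule of thumb concrete, here is a minimal arithmetic sketch. It assumes the simplest possible model, bill = unit price × volume, and takes the 3–5× usage growth per price halving from the article at face value; the function name is made up for illustration.

```python
def bill_ratio(price_factor: float, usage_factor: float) -> float:
    """Return new_bill / old_bill when unit price and usage both change."""
    return price_factor * usage_factor

# Price halves (0.5x) while usage grows 3x to 5x.
low = bill_ratio(0.5, 3.0)
high = bill_ratio(0.5, 5.0)
print(f"bill changes by {low:.1f}x to {high:.1f}x")  # bill changes by 1.5x to 2.5x
```

Under those assumptions, usage only has to grow more than 2× per halving for the bill to rise at all.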

If you live in Cursor, Claude Code, or OpenClaw daily, you've already lived this. A model that felt reckless to leave running overnight in 2023 is now "just another cron job." The paradox isn't abstract — it's in your provider dashboard.

Three structural drivers behind bill inflation

Jevons explains the why. To control spend, you need to know where money actually burns. In AI API bills, three structural amplifiers show up again and again.

Driver 1: volume inflation — from occasional asks to always-on agents

Two years ago, typical AI usage looked like "I have a question, I ask." Today, with Cursor, OpenClaw, and similar agent stacks, AI has shifted from passive Q&A to active execution: CI analysis while you sleep, repo hygiene during meetings, user feedback while you eat lunch.

Call frequency went from dozens per day to thousands. Even at one-tenth the unit price, a roughly hundredfold jump in calls still multiplies the bill about tenfold.

Usage stage | Typical call frequency | Tokens per call | Monthly token total
Q&A assist (2023) | 30/day | ~500 tokens | ~450K tokens
Code review (2024) | 200/day | ~3,000 tokens | ~18M tokens
Always-on background agent (2025+) | 2,000/day | ~8,000 tokens | ~480M tokens

Look at the last row: 450K to 480M is roughly a thousand-fold jump. Token prices fell ~90% over the same window — yet spend can still land two orders of magnitude above 2023.
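The table's arithmetic can be checked in a few lines. This sketch assumes a 30-day month and, for the spend comparison, two illustrative prices that are not in the table itself: $30/M for 2023 and $3/M (a ~90% drop) for the agent stage.

```python
DAYS_PER_MONTH = 30

# (stage, calls per day, tokens per call), copied from the table above.
STAGES = [
    ("Q&A assist (2023)", 30, 500),
    ("Code review (2024)", 200, 3_000),
    ("Always-on background agent (2025+)", 2_000, 8_000),
]

def monthly_tokens(calls_per_day: int, tokens_per_call: int) -> int:
    return calls_per_day * tokens_per_call * DAYS_PER_MONTH

def monthly_spend(tokens: int, usd_per_million: float) -> float:
    return tokens / 1_000_000 * usd_per_million

volumes = {name: monthly_tokens(calls, size) for name, calls, size in STAGES}
for name, tokens in volumes.items():
    print(f"{name}: {tokens:,} tokens/month")

# Assumed prices: $30/M in 2023, $3/M for the always-on stage.
spend_2023 = monthly_spend(volumes["Q&A assist (2023)"], 30.0)
spend_agent = monthly_spend(volumes["Always-on background agent (2025+)"], 3.0)
print(f"${spend_2023:.2f} -> ${spend_agent:.2f} ({spend_agent / spend_2023:.0f}x)")
```

With those assumed prices, monthly spend goes from about $13.50 to about $1,440, a bit over 100×, which is where "two orders of magnitude" comes from.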

The shift isn't just "we use AI more." It's that tools like Cursor Agent and OpenClaw gateways treat inference as infrastructure, not a luxury. A webhook fires, a model runs. No human in the loop to ask "is this worth $0.02?" That's Jevons in product form.

Driver 2: context bloat — each request weighs more

The sneakier bill killer isn't call count. It's payload size per call.

In 2023, GPT-3.5 shipped with a 4K context window. A conversation was a short question plus a few turns of history. Today, Claude Sonnet handles 200K context; Gemini 2.5 Pro opens a 1M window. Developers stopped trimming — whole repos, full PDFs, entire chat logs go in, because "the model can handle it now."

The hidden cost of context
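Stuffing a 500KB source file into a prompt is roughly 125,000 tokens at about four bytes per token, more than a quarter of the ~450K tokens per month in the 2023 Q&A row above. If your agent carries that full context on every hop, input spend scales with hops × context size; if each hop also appends its output to the history, the cumulative total grows roughly quadratically with chain length.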
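A rough sketch of how context weight accumulates across hops. The four-bytes-per-token ratio is a common heuristic rather than a tokenizer guarantee, and the hop count and per-hop growth below are made-up illustration values.

```python
BYTES_PER_TOKEN = 4  # heuristic assumption; real tokenizers vary by content

def estimate_tokens(size_bytes: int) -> int:
    return size_bytes // BYTES_PER_TOKEN

def fixed_context_input(context_tokens: int, hops: int) -> int:
    """Every hop resends the same context: total grows linearly with hops."""
    return context_tokens * hops

def growing_context_input(context_tokens: int, added_per_hop: int, hops: int) -> int:
    """Every hop resends the context plus everything appended by earlier hops."""
    return sum(context_tokens + added_per_hop * i for i in range(hops))

file_tokens = estimate_tokens(500_000)  # the 500KB file from the callout
print(file_tokens)                                   # 125000
print(fixed_context_input(file_tokens, 8))           # 1000000
print(growing_context_input(file_tokens, 2_000, 8))  # 1056000
```

Eight hops over one fat file already costs about a million input tokens before the model has written a single visible word.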

Extended thinking modes make this worse. Claude's thinking path bills internal reasoning tokens before the final answer — often more than the visible output. One "deep analysis" can cost 5–10× what you estimated from the reply length alone.

Cursor users see this on large monorepos: @-mention the whole workspace once and you've bought a small inference job. OpenClaw sessions that never compact memory accumulate the same way. Cheaper per-token pricing doesn't help if every token multiplies across a fat context window.

Driver 3: the agent multiplier — tokens add up like multiplication, not addition

This is the fastest, least predictable source of bill shock.

In plain Q&A mode, one request is one request — linear token math. In agentic workflows, a single user message triggers a whole call chain:

Fig. 1 · What happens inside one "simple" agent request

  1. User sends one instruction: "Review this PR and suggest fixes"
  2. Orchestrator agent: plans subtasks, delegates to child agents → 1 LLM call
  3. Child agents ×4: read code, search docs, check tests, draft comments → 4 LLM calls (each with full context)
  4. Synthesis + retries: orchestrator merges results; errors trigger auto-retry → 2–3 more LLM calls

The user pressed send once. The invoice shows 7–8 LLM calls, each dragging a heavy context payload. That's the multiplier — one button, eight line items.
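The count is easy to reproduce. A small sketch that models the chain from Fig. 1 as stages with a minimum and maximum number of calls; the `Stage` type and stage names are illustrative, not taken from any agent framework.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    min_calls: int
    max_calls: int

# Call counts as described in Fig. 1.
PR_REVIEW_CHAIN = [
    Stage("orchestrator planning", 1, 1),
    Stage("child agents", 4, 4),
    Stage("synthesis + retries", 2, 3),
]

def call_range(chain: list[Stage]) -> tuple[int, int]:
    """Return (fewest, most) LLM calls one user message can trigger."""
    return sum(s.min_calls for s in chain), sum(s.max_calls for s in chain)

print(call_range(PR_REVIEW_CHAIN))  # (7, 8)
```

Adding a stage or widening a retry range changes the multiplier for every request that flows through the chain, which is why it is worth writing down explicitly.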

Worse: runaway agent loops. Agents retry on error. Tasks without hard stop conditions keep going. Two agents waiting on each other can livelock, re-prompting back and forth while both keep burning tokens. A bad termination condition can turn an overnight cron into a four-figure surprise.

Real incident
A developer wired an "auto-fix failing tests" agent into CI and forgot to cap retries. A flaky test triggered a loop — 2,300 LLM calls in eight hours, $340 on the invoice. The model was cheap. The multiplier wasn't.

This pattern is increasingly normal as CI + agent tooling merges. GitHub Actions runs the build; the agent reads logs, patches, pushes, triggers again. Each round is another multiplier stack. Token prices fell; agent loops didn't get the memo.
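One way to keep a retry loop like this bounded is to give it two hard stops: a maximum call count and a maximum spend. The sketch below assumes a hypothetical `step` callable that performs one LLM-backed attempt and returns `(done, cost_usd)`; it is not tied to any particular agent framework.

```python
from typing import Callable

def run_with_limits(
    step: Callable[[], tuple[bool, float]],
    max_calls: int,
    max_spend_usd: float,
) -> tuple[str, int, float]:
    """Repeat step() until it succeeds or a hard limit trips.

    Returns (status, calls_made, usd_spent).
    """
    calls = 0
    spent = 0.0
    while calls < max_calls:
        done, cost = step()
        calls += 1
        spent += cost
        if done:
            return "done", calls, spent
        if spent >= max_spend_usd:
            return "budget_exhausted", calls, spent
    return "max_calls_reached", calls, spent

# A flaky fix that never succeeds, at an assumed $0.15 per attempt.
status, calls, spent = run_with_limits(lambda: (False, 0.15), max_calls=5, max_spend_usd=50.0)
print(status, calls)  # max_calls_reached 5
```

Either limit alone would have ended the overnight loop; having both covers the case where individual calls turn out far more expensive than expected.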

Do the math: a sobering back-of-envelope

Say your stack today runs 10 agent tasks per day. Each task averages 8 LLM calls. Each call averages 10,000 tokens including context.

Parameter | Value
Agent tasks per day | 10
LLM calls per task (multiplier) | 8
Tokens per call | 10,000
Monthly token volume | 10 × 8 × 10,000 × 30 = 24M tokens
At Sonnet-class pricing ($3/M) | 24M × $3/M = $72/month
At flagship pricing ($15/M) | 24M × $15/M = $360/month

$72–$360/month for one developer's routine workflow. Ten people on the team, or double the agent task count, and you multiply again. Bill size is no longer "do we use AI?" — it's "how long is the multiplier chain?"
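The same back-of-envelope as a reusable function, so you can plug in your own numbers. It assumes a single blended price per million tokens (no input/output split) and a 30-day month, as the table does; the `developers` parameter is added here for the team-size case.

```python
def monthly_cost(
    tasks_per_day: int,
    calls_per_task: int,
    tokens_per_call: int,
    usd_per_million: float,
    days: int = 30,
    developers: int = 1,
) -> tuple[int, float]:
    """Return (monthly tokens, monthly USD) for an agent workload."""
    tokens = tasks_per_day * calls_per_task * tokens_per_call * days * developers
    return tokens, tokens / 1_000_000 * usd_per_million

print(monthly_cost(10, 8, 10_000, 3.0))                 # (24000000, 72.0)
print(monthly_cost(10, 8, 10_000, 15.0))                # (24000000, 360.0)
print(monthly_cost(10, 8, 10_000, 3.0, developers=10))  # (240000000, 720.0)
```

Every argument is a straight multiplier, so doubling any one of them doubles the bill.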

Making bills governable: not less usage, intentional usage

Jevons doesn't mean "stop." It means "structure matters." More coal fueled the Industrial Revolution; more tokens can fuel a real productivity jump. The question is whether spend matches value.

Three controls you can deploy now, lowest effort first:

Fix 1: tiered routing — let cheap models carry the bulk

Not every task needs the strongest model. "Does this snippet have a syntax error?" and "design the system architecture" differ by an order of magnitude in required reasoning. If both hit Claude Sonnet, the simple check costs ten times or more what a Haiku-class model would charge for it.

Split by complexity into three tiers:

  • Formatting, summarization, simple classification: Haiku / GPT-4o-mini class, ~$0.15–$0.30/M, lowest latency.
  • Code generation, multi-step reasoning, docs: Sonnet / GPT-4o class, ~$3–$5/M, best default.
  • Architecture, hard debugging, extended thinking: Opus / o3 class — on demand, never the default.

Implementation: define model aliases (fast / smart / deep) in a gateway like LiteLLM, route clients by task type, and keep master keys plus routing logic on an always-on control plane. For a full three-tier setup, see our Cloud Mac + OpenRouter hands-on guide.

Routing is the highest-leverage fix because it attacks all three drivers at once: fewer flagship calls (volume), smaller default contexts on fast tiers (context), and shorter chains when the orchestrator can delegate cheap sub-steps (multiplier).
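On the client side, routing can start as a plain lookup table. This is a minimal sketch: the fast / smart / deep aliases come from the article, while the task-type labels are hypothetical and would be whatever your own tools tag their requests with.

```python
# Task type -> gateway model alias. Task labels are illustrative assumptions.
ROUTES = {
    "format": "fast",
    "summarize": "fast",
    "classify": "fast",
    "codegen": "smart",
    "multi_step": "smart",
    "docs": "smart",
    "architecture": "deep",
    "hard_debug": "deep",
}

DEFAULT_ALIAS = "smart"  # the middle tier is the default; "deep" is opt-in only

def pick_model(task_type: str) -> str:
    """Return the model alias to send to the gateway for this task type."""
    return ROUTES.get(task_type, DEFAULT_ALIAS)

print(pick_model("summarize"))     # fast
print(pick_model("architecture"))  # deep
print(pick_model("unknown_task"))  # smart
```

Unknown task types fall back to the middle tier rather than the flagship, so a mislabeled request can never silently land on the most expensive model.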

Fix 2: budget circuit breakers — trip before the fire spreads

Tiered routing picks the right model. Budget caps stop runaway volume. Agent loops, surprise retries, tasks with no termination — these aren't textbook risks; they're Tuesday if you run agents in production.

Minimum two layers:

  1. Upstream credit cap: set a hard monthly limit in OpenRouter or your Anthropic console. Over cap, the API refuses — no silent overage.
  2. Virtual Key spend cap: in a self-hosted LiteLLM gateway, issue each client (Cursor, OpenClaw, scripts) its own Virtual Key with an independent monthly budget. One tool goes haywire, it burns only its quota.
LiteLLM Virtual Key creation (API)
curl -X POST http://127.0.0.1:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "cursor-dev",
    "models": ["fast", "smart"],
    "max_budget": 20,
    "budget_duration": "1mo",
    "metadata": {"tool": "cursor", "env": "dev"}
  }'

This Virtual Key caps at $20/month, may only call fast and smart, and returns 429 when exhausted — master key untouched. That's the smallest viable "personal enterprise" governance stack.

For the CI agent loop that hit $340: at that incident's average of roughly $0.15 per call, a per-key cap of $50 would have tripped around call 340, not call 2,300. Cheap models don't save you from infinite loops; hard limits do.
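The breaker logic itself is tiny. Here is a sketch of a per-key spend cap replayed against the CI incident's figures, assuming the $340 was spread evenly across the 2,300 calls (the incident report does not say so); the `BudgetBreaker` class is illustrative and is not LiteLLM's implementation.

```python
class BudgetBreaker:
    """Refuse any call that would push cumulative spend past the cap."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def allow(self, cost_usd: float) -> bool:
        if self.spent_usd + cost_usd > self.cap_usd:
            return False  # tripped: the caller should stop, not retry
        self.spent_usd += cost_usd
        return True

AVG_COST_PER_CALL = 340 / 2_300  # about $0.148, assuming even cost per call

breaker = BudgetBreaker(cap_usd=50.0)
calls_allowed = 0
for _ in range(2_300):
    if not breaker.allow(AVG_COST_PER_CALL):
        break
    calls_allowed += 1

print(calls_allowed)  # 338
```

The cap does not make the loop correct; it only bounds how much an incorrect loop can cost.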

Fix 3: spend observability — you can't cut what you can't see

Most bill pain is discovered in arrears: month-end reconciliation, and some agent quietly burned $50 last Friday on a job that finished days ago. The agent wasn't malicious — you had no live spend view.

Minimum viable observability:

  • LiteLLM built-in dashboard: start with litellm --detailed_debug, open /ui for per-Virtual-Key spend, request volume, and latency.
  • Daily spend alerts: cron a query against litellm.db (SELECT key_alias, spend FROM litellm_verificationtoken) and ping Slack when a key crosses its threshold (OpenClaw on Slack handles this cleanly).
  • Upstream reconciliation: weekly, compare LiteLLM local spend against OpenRouter / Anthropic console totals. A >10% gap means a tool is bypassing your gateway and hitting the provider directly.
What observability actually buys
Teams that ship spend monitoring typically find 20–30% "waste" in month one: agents whose output nobody reads, scripts that ship full repo context when five lines would do, cron jobs everyone forgot to delete.

Observability doesn't fight Jevons — it makes the paradox visible. You stop arguing about whether AI is "worth it" in the abstract and start cutting specific line items with no ROI.
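The daily alert and the weekly reconciliation can both start as a short script. This sketch assumes a SQLite file with the table and column names used in the query above; a real LiteLLM deployment may store spend in a different database with a different schema, so treat the names as placeholders. It uses an in-memory database with made-up rows so it runs as-is, and it only computes what to alert on; the Slack call is left out.

```python
import sqlite3

def keys_over_threshold(conn: sqlite3.Connection, threshold_usd: float) -> list[tuple[str, float]]:
    """Return (key_alias, spend) rows at or above the threshold, highest first."""
    query = (
        "SELECT key_alias, spend FROM litellm_verificationtoken "
        "WHERE spend >= ? ORDER BY spend DESC"
    )
    return [(alias, spend) for alias, spend in conn.execute(query, (threshold_usd,))]

def reconciliation_gap(local_usd: float, upstream_usd: float) -> float:
    """Fraction of upstream spend that the gateway did not see."""
    if upstream_usd == 0:
        return 0.0
    return abs(upstream_usd - local_usd) / upstream_usd

# Demo data only: an in-memory stand-in for the spend table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE litellm_verificationtoken (key_alias TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO litellm_verificationtoken VALUES (?, ?)",
    [("cursor-dev", 18.4), ("ci-agent", 52.0), ("scripts", 3.1)],
)

alerts = keys_over_threshold(conn, threshold_usd=15.0)
print(alerts)  # [('ci-agent', 52.0), ('cursor-dev', 18.4)]

gap = reconciliation_gap(local_usd=80.0, upstream_usd=100.0)
print(gap > 0.10)  # True: something is bypassing the gateway
```

In production you would point `sqlite3.connect` (or the equivalent driver) at the real spend store and send `alerts` to your chat webhook.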

The real question isn't "how do I spend less?"

Jevons had another side people skip: he never said efficiency was bad. Britain burned more coal and industrialized.

Same here — cheaper tokens and higher bills aren't automatically a failure. The questions that matter:

  • Did the extra tokens produce equal or excess value?
  • How much spend is deliberate investment vs. unconscious leakage?
  • Do you have a system that answers both, on demand?

"Save tokens" is the wrong goal. "Spend every token on purpose" is the right one. Tiered routing, budget breakers, and spend visibility don't exist to make you use less — they exist so you know what you're using, where, and whether it's worth it.

Tokens will keep getting cheaper. Bills will probably keep rising. With governance in place, at least the rise tracks value — not drift.

FAQ

Will Jevons Paradox always hold? For resources with no obvious ceiling on utility, usually yes. AI inference can substitute for human effort across a huge surface area — each price drop unlocks new workloads. The paradox breaks when capability hits a real ceiling ("already better than everyone; cheaper doesn't open new use cases"). That horizon isn't visible yet.

Can I control bills by switching to cheaper models? Short term, yes. Long term, no — savings get redeployed into more tasks, and you're back on the Jevons track. Durable control is budget constraints plus visibility, not permanent downgrades.

Can you eliminate the agent multiplier? Not entirely. You can bound it: hard max steps / max calls per task; result caching so identical subtasks don't re-run; orchestrator logic that asks "does this step need an LLM?" before every hop — rule engines beat models by orders of magnitude on deterministic checks.
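Two of those bounds fit in a few lines. This sketch shows a deterministic check that needs no model at all (using Python's own compiler to answer "does this snippet have a syntax error?") and a result cache so identical subtasks call the model once. `CachedLLM` and the stand-in `llm_call` are hypothetical; swap in your real client.

```python
import hashlib
from typing import Callable

def has_syntax_error(source: str) -> bool:
    """Rule-engine check: the compiler answers this without spending tokens."""
    try:
        compile(source, "<snippet>", "exec")
        return False
    except SyntaxError:
        return True

class CachedLLM:
    """Wrap an LLM call so identical prompts are only paid for once."""

    def __init__(self, llm_call: Callable[[str], str]) -> None:
        self._llm_call = llm_call
        self._results: dict[str, str] = {}
        self.llm_calls = 0

    def run(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._results:
            self.llm_calls += 1
            self._results[key] = self._llm_call(prompt)
        return self._results[key]

# Stand-in for a real model call, for demonstration only.
llm = CachedLLM(lambda prompt: f"summary of: {prompt}")
llm.run("summarize build log 42")
llm.run("summarize build log 42")

print(llm.llm_calls)                  # 1
print(has_syntax_error("def f(:"))    # True
print(has_syntax_error("x = 1"))      # False
```

An exact-match cache only helps when subtasks really are identical, so it pairs best with orchestrators that normalize prompts before dispatch.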

Does this get harder as the team grows? Yes. More people means more tools, more keys, more opaque routing. Virtual Key isolation and per-user spend caps stop being nice-to-have around three engineers. Stand up a gateway before the migration tax compounds.

Run gateway, routing, and circuit breakers on one always-on Cloud Mac

You can't repeal Jevons — but you can install a breaker. Tiered routing sends tokens to the right price tier. Virtual Keys cap each tool independently. Spend logs show where money goes in real time. That "gateway governance" pattern needs a control plane that's always online, keeps secrets off laptops, and runs the full macOS-native toolchain.

VPSSpark Cloud Mac mini M4 is built for this: LiteLLM Proxy under launchd, master keys in server-side .env, laptops and phones on Virtual Keys only. M-series idle power is low enough for 24/7 gateway duty; unified memory handles concurrent agents plus proxy comfortably; Gatekeeper, SIP, and FileVault shrink the attack surface vs. a generic Linux VPS hosting long-lived API secrets.

If "cheaper tokens, bigger bills" is your current reality, start with a gateway that can trip a breaker: explore VPSSpark Cloud Mac plans and keep control plane and agent execution on one secure machine.

Gateway governance

Tiered routing · spend caps · cost visibility · keys off clients

Cloud Mac + LiteLLM + OpenRouter · launchd 24/7 · Virtual Key isolation
