Tokens Are Not the Metric

Two stories sat side by side on the AI engineering beat this week.

Uber burned its entire 2026 AI budget in four months. Spend ran between $500 and $2,000 per developer per month on Claude Code and Cursor combined. Adoption hit 95%, with that share of engineers using AI tools monthly. 70% of the code committed at Uber now originates from an AI assistant. Usage doubled between December 2025 and February 2026 as developers discovered multi-step agentic workflows.

Microsoft launched Claude Code in its Experiences & Devices division in December 2025 and shut it down effective June 30, 2026 — same pattern, same four-month burn. Developers are being redirected to GitHub Copilot, which Microsoft owns outright.

The framing in both cases is the same: AI tool costs blew up.

That framing is wrong.

What broke wasn't the budget. It was the meter.

Read the Uber numbers again. 95% adoption. 70% of committed code AI-originated. Productivity gains good enough that developers doubled their consumption in two months as they discovered better workflows.

Those numbers don't describe a budget problem. They describe one of the most successful enterprise tool rollouts in recent memory. The cost overrun is not a failure. It's the bill for a productivity gain the company couldn't have predicted.

The mistake wasn't spending the money. The mistake was the budgeting unit.

Companies set “AI budgets” the same way they set “software license budgets” — a flat allocation per team or per seat, set once at the start of the year, predictable to the CFO. Token-based pricing breaks this model fundamentally. You're not buying a seat. You're buying compute that scales linearly with use, and use scales superlinearly with productivity — because more capable workflows consume more tokens per task.

If the unit is seats, this looks like a runaway expense.
If the unit is cost per shipped feature, this looks like an unbelievable bargain.

The math that actually matters

A senior engineer at a US tech company costs roughly $300,000 per year fully loaded (salary, benefits, equity). That's about $25,000 per month. At $2,000/month in AI tools (Uber's upper bound), the tooling costs 8% of what that one engineer costs per month.

That's the break-even threshold. If AI tooling makes that engineer 8% more productive, the spend pays for itself. Anything beyond that is profit. The reported numbers, 70% of committed code originating from an AI assistant among them, point to gains an order of magnitude above 8%.
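For anyone who wants to rerun that arithmetic with their own numbers, here is a minimal Python sketch. The function names are made up for illustration, and it assumes a productivity gain can be valued as a fraction of the engineer's fully loaded cost, which is a simplification rather than an accounting rule.

```python
def break_even_gain(monthly_tool_cost: float, annual_loaded_cost: float) -> float:
    """Fractional productivity gain at which the tooling pays for itself."""
    monthly_loaded_cost = annual_loaded_cost / 12
    return monthly_tool_cost / monthly_loaded_cost


def net_monthly_value(
    monthly_tool_cost: float,
    annual_loaded_cost: float,
    productivity_gain: float,
) -> float:
    """Value of the extra output minus the tool bill, per engineer per month.

    Assumption: extra output is worth the same fraction of loaded cost.
    """
    monthly_loaded_cost = annual_loaded_cost / 12
    return monthly_loaded_cost * productivity_gain - monthly_tool_cost


if __name__ == "__main__":
    # Figures from the text: $2,000/month in tools, $300,000/year fully loaded.
    threshold = break_even_gain(2_000, 300_000)
    print(f"break-even gain: {threshold:.0%}")

    # 0.30 is a hypothetical input, not a measured result.
    print(f"net value at a 30% gain: ${net_monthly_value(2_000, 300_000, 0.30):,.0f}")
```

The useful part is less the result than the inputs it forces you to write down: a loaded cost, a tool bill, and an honest estimate of the gain.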

The math only looks bad if you compare $2,000 against $0 (no AI). It looks like an obvious win if you compare it against the alternative: hiring another engineer, or shipping 30% less.

Almost nobody is doing that math. They're comparing this year's AI line item against last year's AI line item, watching it grow 10× over four months, and concluding “this is unsustainable.” What's actually unsustainable is asking the question that way.

The mirror image

For every Uber, where 95% adoption and 70% AI-originated code justify a 10× token bill, there's a small team where the same monthly spend produced very little of measurable value. Same vendor. Same pricing. Wildly different output per dollar.

The difference isn't the tool. It's the operator.

What goes wrong, predictably:

Frontier model for every task (50–100× more expensive than needed for narrow work).
Verbose chat-style prompts that pay for the same context again on every call, because there's no retrieval layer or persistent context file.
Agentic loops that take 50 calls when a structured single prompt would have worked.
No caching of repeated requests, so the same question costs the same every time.
Treating every dev session like a frontier-model conversation when most of the work would survive on a small model.
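The caching item is the easiest of these to picture in code. Below is a minimal Python sketch of an exact-match cache in front of a model call. `PromptCache` and the `call_model` callable are hypothetical names, not any vendor's API, and the cache is in-memory only. It also assumes that reusing an earlier answer for an identical model and prompt is acceptable, which is not true for every workload.

```python
import hashlib
import json
from typing import Callable, Dict


class PromptCache:
    """Exact-match cache keyed on (model, prompt). Illustrative only."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        # call_model(model, prompt) -> response text; supplied by the caller.
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash a canonical JSON payload so the key is stable and compact.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1  # repeated question: no new tokens billed
            return self._store[key]
        self.misses += 1
        response = self._call_model(model, prompt)
        self._store[key] = response
        return response


if __name__ == "__main__":
    def fake_model(model: str, prompt: str) -> str:
        # Stand-in for a real API call, so the sketch runs offline.
        return f"[{model}] answer to: {prompt}"

    cache = PromptCache(fake_model)
    cache.complete("small", "What does this stack trace mean?")
    cache.complete("small", "What does this stack trace mean?")
    print(cache.hits, cache.misses)
```

The hit and miss counters are there on purpose: they give a first crude measurement of how much of the bill is the same question asked twice.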

These mistakes scale invisibly. A team that's never measured cost-per-feature looks at the monthly bill, assumes “AI tools are just expensive,” and either caps spending (kills the productivity gains) or quietly moves on. The actual diagnosis is usually that the workflow is wasteful by 5–20×, and there's no measurement layer to see it.

This is why the metric matters more for smaller teams than for Uber. Uber has the productivity gains to absorb a measurement mistake. A founder running on a $500/month AI budget cannot. The difference between “$500 well-spent” and “$500 burned” comes down to a few specific habits — and almost all of them become obvious once you start measuring output instead of input.

What we actually track

We build production MVPs in six weeks for a fixed price. AI tool spend is a real line item for us — and we track it carefully, because we eat the cost difference between a well-routed workflow and a wasteful one.

What we don't do: set per-developer caps based on token consumption. What we do:

Cost per shipped feature, not cost per developer. Some weeks are cheap (incremental work). Some weeks are expensive (an agentic task that does the work of a week of human review).
Route by job, not by tier. Classification goes through Haiku. Architecture review goes through Opus. Code generation goes through whichever model is benchmarking best for the specific task that week. The aggregate token count is noise; the per-task choice is signal.
Measure output, not input. Was this week's spend justified by what shipped? That's the only question. The aggregate burn rate is downstream of that answer.
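A rough Python sketch of how the first two habits fit together: route each call by job type, then roll spend up by feature rather than by developer. Everything named here is illustrative. The tier names, the default tier, and especially the prices are placeholders, not real vendor pricing, and this is not a description of our actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

# Job type -> model tier. Tier names are placeholders; in the workflow
# described above, "small" would be Haiku and "frontier" would be Opus.
ROUTES = {
    "classification": "small",
    "architecture_review": "frontier",
}
DEFAULT_TIER = "small"  # assumption: unknown jobs start on the cheap tier

# Placeholder prices in USD per million tokens. NOT real vendor pricing.
PRICE_PER_MTOK = {"small": 1.0, "frontier": 15.0}


@dataclass
class Call:
    feature: str  # the feature this call was working toward
    job: str      # e.g. "classification"
    tokens: int   # total tokens billed for the call


def route(job: str) -> str:
    """Pick a model tier from the job type, not from who is asking."""
    return ROUTES.get(job, DEFAULT_TIER)


def call_cost(call: Call) -> float:
    """Cost of one call in USD at the placeholder prices."""
    return call.tokens / 1_000_000 * PRICE_PER_MTOK[route(call.job)]


def spend_by_feature(calls: List[Call]) -> Dict[str, float]:
    """Total spend grouped by feature."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.feature] += call_cost(call)
    return dict(totals)


def cost_per_shipped_feature(calls: List[Call], shipped: List[str]) -> Optional[float]:
    """All spend divided by features that shipped; None if nothing shipped.

    Spend on unshipped work stays in the numerator on purpose.
    """
    if not shipped:
        return None
    return sum(call_cost(c) for c in calls) / len(shipped)
```

Keeping unshipped spend in the numerator is a deliberate choice in this sketch: a week of expensive calls that shipped nothing should make the number worse, not disappear from it.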

This isn't a startup-vs-enterprise insight. Uber and Microsoft can do exactly the same. What broke their budgets is not the pricing model — it's how the budget was framed.

The bigger pattern

Every wave of new infrastructure produces the same mistake at the same point. Early cloud: companies treated EC2 like rented servers and got surprised by the bill at the end of the month. Early SaaS: companies treated subscriptions like one-time licenses and got surprised when renewals stacked up. Now: AI inference is being treated like a seat license and companies are getting surprised when token consumption scales with productivity.

The fix is always the same. Stop measuring the input. Start measuring what the input produced.

Uber's “back to the drawing board” moment isn't because Claude Code is too expensive. It's because the finance team is asking the wrong question. And the right question — for Uber, for Microsoft, for a two-person team running on $500/month — is the same: is productivity per dollar going up or down?

If you're not measuring that, you can't answer it. And if you can't answer it, you'll either overspend without seeing the value (the Uber framing) or underspend and kill the value that was there to be had (the small-team framing). Both endings are avoidable. Both come from measuring the input instead of what the input produced.

The cost overrun isn't the story. The measurement model is.