Your AI bill is mostly wasted tokens
1. Key Themes
Token Costs Are the New Unit Economics for AI Products
Every AI product's profitability hinges on token efficiency. The article frames this starkly: "You pay per token, roughly three-quarters of a word each, on the way in and the way out. Most production apps resend the same system prompt, the same tool list, and the same documents on every call, paying full freight thousands of times a day."
Prompt Caching Is the Single Highest-Leverage Cost Lever
Of all optimization tactics, caching repeated inputs delivers the most dramatic savings with the least quality tradeoff. "Prompt caching alone trims repeated input by up to 90% on Claude."
AI Agents Create Compounding Cost Loops That Demand Architectural Discipline
Multi-step agent loops multiply token waste at every iteration, making the economics of agentic workflows fundamentally different from single-call use cases. The article promises "the agent-loop economics" and "a worked ROI math on a realistic agent workload, so you can size your own savings before you touch a line of code."
The Research Frontier Is Moving Toward Token Minimization as a Core CS Problem
Token optimization is no longer just a cost concern — it's becoming a computer science research problem in its own right. "A researcher recently pointed Codex at a problem computer scientists file under intractable: finding a provably optimal tokenizer... discovered a family of constraints it named 'cycle constraints,' and produced a provably optimal tokenizer for an entire book in about a day."
2. Contrarian Perspectives
Most teams are paying for waste they don't know exists — and the savings don't require model quality tradeoffs. The common assumption is that reducing AI costs means accepting lower output quality. The article directly challenges this: "Stack the rest of the system and a typical bill drops by half or more, with the output quality held steady." The implication is that the majority of current AI spend is structurally inefficient, not a necessary cost of performance.
The frontier moved "while most teams looked away." The article suggests that the competitive shift toward token efficiency is already underway, but underappreciated. "The frontier moved while most teams looked away, and it moved toward one question: how few tokens does the job actually take." Teams focused on capability improvements may be missing the more immediately actionable optimization layer.
3. Companies Identified
| Company | Description | Why Mentioned | Quote |
|---|---|---|---|
| Claude / Anthropic | Anthropic's flagship LLM | Cited as offering up to 90% cost reduction via prompt caching, making it the primary model referenced for caching mechanics | "Prompt caching alone trims repeated input by up to 90% on Claude." |
| Codex | OpenAI's AI coding tool | Featured as a case study for autonomous research capability and token-efficient problem solving | "A researcher recently pointed Codex at a problem computer scientists file under intractable... produced a provably optimal tokenizer for an entire book in about a day." |
4. People Identified
No named individuals beyond the author are identified in the available article text.
| Person | Description | Why Mentioned | Quote |
|---|---|---|---|
| Ruben Dominguez | Author, The AI Corner newsletter | Wrote the token cost optimization playbook | Byline attribution |
| Unnamed Researcher | Computer scientist / AI researcher | Demonstrated autonomous research loop using Codex to solve a previously intractable tokenizer optimization problem | "A researcher recently pointed Codex at a problem computer scientists file under intractable... discovered a family of constraints it named 'cycle constraints.'" |
5. Operating Insights
Implement prompt caching immediately on any production system with repeated context. The highest-ROI action for any team running Claude in production is restructuring prompts to hit cache — specifically ordering prefixes correctly and setting cache_control breakpoints. "Prompt caching alone trims repeated input by up to 90% on Claude." The article targets a specific "hit-rate target," suggesting this is measurable and manageable.
Replace document-stuffing with targeted retrieval to cut input tokens 30–60%. Rather than sending entire documents to the model, the article advocates for "the retrieval pattern that replaces stuffing whole documents with searching for the chunks that matter" — combined with prompt rewrites that cut input tokens "30 to 60% while holding output quality."
Apply a serialization trick to structured data in agent tool calls. A specific, underutilized tactic mentioned is "the serialization trick that halves the cost of structured data" — applicable within agent and tool call loops where JSON or structured payloads are passed repeatedly.
6. Overlooked Insights
Eight silent failure modes are actively erasing optimization gains. The article references "the 8 failure modes that silently erase your savings, each with the fix" — suggesting that teams who implement caching and retrieval improvements may still be losing savings to invisible leaks. This diagnostic layer is buried in the playbook description but could be the most practically urgent section for teams who've already attempted cost reduction and seen disappointing results.
A 30-day structured rollout is prescribed, starting with measurement. Rather than ad hoc optimization, the article frames token cost reduction as a staged program: "the 30-day rollout from measuring your spend to a fully optimized stack." The emphasis on measuring first implies most teams don't have baseline visibility into where their token spend is actually going — making instrumentation a prerequisite, not an afterthought.