How to stop tokenmaxxing and cut AI spend 10x

Ravi Mehta

Jun 4

Three fixes to reduce token burn and improve ROI.

9 Comments

Sumeet Maniar

Jun 4 (edited)

Ravi,

Excellent analysis.

I’ve been arguing that the winning architecture for most enterprise AI systems will be a hybrid of deterministic and non-deterministic approaches. LLMs are incredible for discovery, ideation, ambiguity, and rapid prototyping, but many production workflows eventually benefit from being refactored into deterministic systems where possible. That improves cost, reliability, explainability, and scalability. In other words, use the speed of the non-deterministic approach to get to the deterministic one effectively.

In our product and AI labs since last year, we often followed a pattern:

• Use frontier models to rapidly explore, prototype, and discover solutions.

• Identify where reasoning is actually required versus where deterministic logic, traditional ML, or conventional software can perform the task more efficiently. Also think through which tokens are plain vanilla tokens and which are thinking tokens, and bifurcate the branches accordingly. Even count the tokens used across all the turns it takes to achieve the outcome.

• Re-architect the workflow accordingly.
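A minimal sketch of the token counting described in the second bullet, assuming each model call is logged with a plain token count and a thinking token count. The `Turn` record, the step names, and the numbers are all hypothetical; real providers report usage in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One model call in a workflow (hypothetical record, not a vendor API)."""
    step: str             # which workflow step made the call
    plain_tokens: int     # ordinary input + output tokens
    thinking_tokens: int  # reasoning tokens, if the model reports them


def summarize(turns: list[Turn]) -> dict[str, dict[str, int]]:
    """Total turns, plain tokens, and thinking tokens per workflow step."""
    totals: dict[str, dict[str, int]] = {}
    for t in turns:
        row = totals.setdefault(t.step, {"turns": 0, "plain": 0, "thinking": 0})
        row["turns"] += 1
        row["plain"] += t.plain_tokens
        row["thinking"] += t.thinking_tokens
    return totals


def thinking_share(row: dict[str, int]) -> float:
    """Fraction of a step's tokens that were thinking tokens."""
    total = row["plain"] + row["thinking"]
    return row["thinking"] / total if total else 0.0


# Made-up numbers for illustration only.
log = [
    Turn("extract_fields", 1200, 0),
    Turn("extract_fields", 1100, 0),
    Turn("triage_decision", 900, 2100),
]
summary = summarize(log)
for step, row in summary.items():
    print(step, row, f"thinking share: {thinking_share(row):.0%}")
```

A step that never spends thinking tokens, like `extract_fields` here, is a natural first candidate for a deterministic rewrite.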

Healthcare is a great example. Much of clinical decision-making ultimately maps to evidence-based rules, protocols, and structured logic. The challenge isn’t replacing everything with LLMs; it’s understanding where language reasoning adds value and where it doesn’t.

In another use case, it took me 47 versions to optimize parallel rather than sequential API calling for a map/geo solution I was iterating on. The frontier models could not do it, whereas if I had coded it myself, it would have taken 10 minutes. A lot of tokens and two hours wasted.
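For readers unfamiliar with the pattern, here is a rough sketch of sequential versus parallel calls using Python's `asyncio`. It is not the code from that project: `fetch_tile` is a hypothetical stand-in that sleeps instead of hitting a real geo API, and the concurrency limit of 5 is an arbitrary assumption.

```python
import asyncio
import time


async def fetch_tile(tile_id: int) -> dict:
    """Stand-in for one geo API request (hypothetical; sleeps instead of using the network)."""
    await asyncio.sleep(0.1)
    return {"tile": tile_id, "status": "ok"}


async def fetch_sequential(tile_ids: list[int]) -> list[dict]:
    # Each request waits for the previous one to finish.
    return [await fetch_tile(t) for t in tile_ids]


async def fetch_parallel(tile_ids: list[int], limit: int = 5) -> list[dict]:
    # The semaphore caps in-flight requests, which helps stay under an API rate limit.
    sem = asyncio.Semaphore(limit)

    async def guarded(t: int) -> dict:
        async with sem:
            return await fetch_tile(t)

    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(guarded(t) for t in tile_ids))


async def timed(coro):
    """Await a coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    ids = list(range(5))
    seq, seq_s = await timed(fetch_sequential(ids))
    par, par_s = await timed(fetch_parallel(ids))
    return seq, seq_s, par, par_s


seq, seq_s, par, par_s = asyncio.run(main())
print(f"sequential: {seq_s:.2f}s, parallel: {par_s:.2f}s")
```

Both versions return the same results in the same order; only the wall-clock time differs.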

Your point on “skills” resonated. I’ve wondered whether we’re sometimes just relocating context-window complexity rather than reducing it. Hidden prompt bloat, excessive guardrails, and agent-to-agent chatter can create significant token overhead with diminishing returns; these are additional but related topics that add to “tokenmaxxing.” I still want to try dynamic inference calling at agent processing time, to either closed or open models, to see how this performs.

We all now know a likely solution: a tightly orchestrated system of specialized components, each with a “very” narrow responsibility, a small context window, and clear evaluation criteria. That architecture tends to be more efficient, more reliable, and less prone to cascading hallucinations. But if a component is too narrow, does it not just become a function call?

For PMs, the real “craft” is crafting systems that consistently reach 90–95%+ human-level performance through rigorous evals and iteration. Though sometimes my mind gets numbed or bored by the sub-optimal time spent tweaking. Once that threshold is achieved, the hard work shifts to industrializing the solution: governance, monitoring, reliability, and engineering inference at scale and in parallel.
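One simple way to make that threshold concrete, sketched under the assumption that "human-level performance" is measured as exact agreement with human reference labels on a fixed eval set. The function names, labels, and data below are hypothetical; real evals often need graded or rubric-based scoring instead.

```python
def agreement_rate(predictions: list[str], human_labels: list[str]) -> float:
    """Share of eval cases where the system output matches the human label."""
    if len(predictions) != len(human_labels):
        raise ValueError("predictions and human_labels must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(human_labels)


def passes_gate(predictions: list[str], human_labels: list[str],
                threshold: float = 0.90) -> bool:
    """True when agreement meets the release threshold (0.90 is an assumed default)."""
    return agreement_rate(predictions, human_labels) >= threshold


# Made-up eval set: 10 cases, with the system disagreeing on the last one.
human = ["approve"] * 6 + ["escalate"] * 4
system = ["approve"] * 6 + ["escalate"] * 3 + ["approve"]

print(agreement_rate(system, human))
print(passes_gate(system, human, threshold=0.90))
print(passes_gate(system, human, threshold=0.95))
```

Running the same gate on every prompt or architecture change keeps the tweaking loop honest about whether a change actually helped.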

Great perspective on the skills layer and on when skills get called up. On another note, Anthropic did a great session on optimizing guardrails via evals late last week (one can Google the accompanying video). Workshop name: agent decomposition. Repo: https://github.com/anthropics/cwc-workshops and X video: https://x.com/0x_rody/status/2061019244595233135?s=20

Ravi Mehta

Jun 5

I think you're exactly right that a good approach is to start with frontier models to figure out the overall system architecture, and then render that architecture down into the most suitable components. These might be mid-tier models or deterministic rules, all orchestrated together to achieve a system that strikes a balance between speed, reliability, cost, and predictability. This is a step that I think a lot of people miss. They focus on just getting the system working and use AI as a general-purpose computer because it's easy to program (i.e., you can program it in English), rather than recognizing that this approach leads to systems that are slow, suboptimal, prone to hallucination, and very expensive to run.
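As a rough illustration of rendering a system down into components, here is a sketch of a router that tries cheap deterministic rules first and only falls back to a model when none apply. Everything in it is hypothetical: `order_status_rule` is a made-up rule, and `call_model` is a placeholder to be swapped for a real client.

```python
import re
from typing import Callable, Optional

# A deterministic handler returns an answer, or None when the rule does not apply.
Rule = Callable[[str], Optional[str]]


def order_status_rule(text: str) -> Optional[str]:
    """Handle a request with a fixed shape without any model call."""
    m = re.search(r"\border\s+#?(\d+)\b", text, re.IGNORECASE)
    return f"lookup_order({m.group(1)})" if m else None


def call_model(text: str) -> str:
    """Placeholder for a model call (hypothetical; swap in a real client)."""
    return f"model_answer({text!r})"


def route(text: str, rules: list[Rule],
          model: Callable[[str], str] = call_model) -> tuple[str, str]:
    """Try deterministic rules in order; fall back to the model only if none apply."""
    for rule in rules:
        answer = rule(text)
        if answer is not None:
            return "deterministic", answer
    return "model", model(text)


print(route("Where is order #1042?", [order_status_rule]))
print(route("Summarize why customers churned last quarter", [order_status_rule]))
```

The returned label makes it easy to log how much traffic still reaches the model, which is where the token spend sits.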

And to your point, those systems are hard to operationalize as well. The cost to build has compressed, but that means we now need to spend more time post-launch nailing our evals and improving things, which is much harder to do when you have a lot of prompt bloat and one optimization is fighting against something else in the prompts.

Thanks so much for sharing your detailed thoughts and for sharing the link to the Anthropic session. I'm going to go through that. They seem to be on fire lately with the quality of the content that they've been releasing.

Sumeet Maniar

Jun 4

On Substack I'm not allowed to put in the full URL, so one can add the https: prefix.

Armughan

Jun 4

I especially appreciate the piece you share about building skill libraries and how much context is dragged along when invoking them.

Ravi Mehta

Jun 4

I’ve been surprised at how big some of the skills are. Many include examples, like a multi-page writing sample, that burn a ton of tokens. This is especially true for third-party skills that are trying to solve a general use case.
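A quick way to spot oversized skills is to estimate their token footprint before loading them. The sketch below assumes roughly 4 characters per token for English text, which is only a rule of thumb and not a real tokenizer; the skill names, bodies, and the 500-token budget are made up.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN) if text else 0


def audit_skills(skills: dict[str, str], budget: int) -> list[tuple[str, int]]:
    """Return (name, estimated_tokens) for skills over budget, largest first."""
    sized = [(name, estimate_tokens(body)) for name, body in skills.items()]
    over = [s for s in sized if s[1] > budget]
    return sorted(over, key=lambda s: s[1], reverse=True)


# Made-up skill bodies for illustration.
skills = {
    "summarize": "Summarize the input in three bullets.",
    "house_style": "Follow the style guide.\n" + "Example paragraph. " * 400,
}
flagged = audit_skills(skills, budget=500)
print(flagged)
```

For exact numbers, the estimate can be replaced with the tokenizer of whichever model is in use.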