Gemini 3.5 Pro is landing late this month with a 2M token context window and a Deep Think reasoning mode, priced around $15 per million input tokens and $60 per million output. Read that pricing again. Reasoning tokens count as output. So the more a model thinks, the more you pay, and a single Deep Think request can quietly cost ten times what a normal call would.
That changes a quiet assumption a lot of teams have been making. For the last year, "use the smartest model on the hardest setting" felt like the safe default. It isn't anymore. Reasoning has become a dial with a price tag attached to every turn.
Max reasoning everywhere is a budget leak
Here is the trap. A model with a deep reasoning mode will happily spend two thousand tokens deliberating over a request that needed twenty. It looks like rigor. On the invoice it looks like waste.
Most of what production systems ask an LLM to do is not hard. Classify a support ticket. Pull three fields out of a document. Decide which of four tools to call. None of that needs a model pausing to reason through a chain of thought. But if your default is "Deep Think on, always," you pay the reasoning premium on every routine call, and you eat the latency too. Deep reasoning modes can turn a sub-second response into a thirty-second one. Your users feel that.
GitHub Copilot moving every plan to usage-based billing on June 1 made the same point from a different angle. AI work is metered now. When the meter is running, how much your system thinks stops being a model setting and starts being an engineering decision.
Set a reasoning budget per task tier
The fix is not to avoid reasoning. It is to spend it where it earns its keep. Treat reasoning effort as a parameter you set deliberately, the same way you'd pick a timeout or a retry count.
A practical way to structure it:
- Routine tasks get a cheap model and reasoning off or minimal. Extraction, classification, routing, formatting. Speed and cost matter more than depth.
- Judgment tasks get a mid reasoning setting. Drafting a response, summarising a thread, choosing between a few real options.
- Genuinely hard tasks get the deep mode. Multi-step planning, tricky debugging, anything where a wrong answer is expensive and the model actually benefits from working it through.
Then measure the thing that matters: cost per resolved task, not cost per token. A cheap call that gets the job done in one shot beats an expensive one that thinks beautifully and still needs a retry.
Build the routing in, don't bolt it on
The teams that handle this well treat model and reasoning selection as part of the architecture, not a config someone tweaks under pressure when the bill arrives. That means a routing layer that picks the model and reasoning level based on the task, logging that records which tier each request used and what it cost, and the ability to dial a whole class of requests up or down without a redeploy.
Do that, and a new model launch becomes a tuning exercise instead of a fire drill. Gemini 3.5 Pro ships, you slot it into the "hard task" tier, you compare cost per resolved task against your current setup, and you keep what wins.
This is the discipline that separates a demo from a product. Anyone can wire up the smartest model on its deepest setting and show something impressive. Running it at scale without the cost curve bending the wrong way is the actual work.
We're here to help founders and teams design and build digital products that are built to scale with you, not slow you down. If you're looking to build something, get in contact with us today.