Smarter does not mean safer

Here is a finding that runs against most teams' instincts: training a model to reason harder can make it hallucinate more, not less. Recent research reports that when an LLM is trained with reinforcement learning to reason harder, tool-hallucination rates climb alongside task performance. The very reasoning that lets an agent crack a harder problem is what lets it confidently invent a tool call that does not exist.

That decoupling is the part to sit with. We have spent two years assuming reliability rides along with capability, that the more capable model is also the more trustworthy one. In production agent systems, that assumption is now wrong. Capability and reliability have come apart, and you cannot close the gap by reaching for a bigger model.

Why this happens

A reasoning model trained to push toward an answer is, in effect, rewarded for resourcefulness. When the right path is unclear, a more capable model is better at constructing a plausible path anyway, including inventing a tool that would solve the problem if only it existed. The model is not malfunctioning. It is doing exactly what harder reasoning encourages: filling gaps with confident inference. For prose, that is creativity. For a tool call against a real system, it is a bug with a blast radius.

Engineer the surface, not just the model

If you are shipping agents in production, the response is not to avoid capable models. It is to stop treating the model as the whole system and start engineering the surface around it. Three things are worth making non-negotiable:

Strict tool schemas with validation at the call site. Validate the moment a call is proposed, not after it has run. A hallucinated argument should be rejected at the boundary, before it touches anything real.
A registry the model is grounded against. Never trust a free-form tool name. The set of callable tools should be an explicit, enumerated contract, and anything outside it is refused by construction, not by hope.
Eval harnesses that test for hallucinated calls, not just task success. A passing task-success rate hides the failure mode entirely. You need evals that specifically measure how often the agent invents calls, calls tools with invalid arguments, or reaches for capabilities it does not have.
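
To make the first two points concrete, here is a minimal sketch of a registry plus call-site validation, using only the Python standard library. Everything named here (ToolSpec, REGISTRY, validate_call, dispatch, and the get_weather tool) is hypothetical and invented for illustration; a production system would more likely lean on a proper schema library, and the type check below is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """One entry in the enumerated tool contract (hypothetical shape)."""
    handler: Callable[..., Any]
    params: dict[str, type]  # argument name -> required Python type


class ToolCallRejected(Exception):
    """Raised at the boundary, before any handler runs."""


def get_weather(city: str, days: int) -> str:
    # Stand-in for a real tool with side effects.
    return f"{days}-day forecast for {city}"


# The registry is the only source of callable tool names.
REGISTRY: dict[str, ToolSpec] = {
    "get_weather": ToolSpec(get_weather, {"city": str, "days": int}),
}


def validate_call(name: str, args: dict[str, Any]) -> ToolSpec:
    """Check a proposed call against the registry without executing it."""
    spec = REGISTRY.get(name)
    if spec is None:
        raise ToolCallRejected(f"unknown tool: {name!r}")

    missing = spec.params.keys() - args.keys()
    unexpected = args.keys() - spec.params.keys()
    if missing or unexpected:
        raise ToolCallRejected(
            f"{name}: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )

    for key, expected in spec.params.items():
        value = args[key]
        # bool is a subclass of int, so reject it explicitly for non-bool fields.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            raise ToolCallRejected(
                f"{name}.{key}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return spec


def dispatch(name: str, args: dict[str, Any]) -> Any:
    """Validate first; only a call that passes ever reaches a handler."""
    spec = validate_call(name, args)
    return spec.handler(**args)
```

With this in place, dispatch("get_weather", {"city": "Oslo", "days": 3}) runs the handler, while an invented name such as "get_forecast_v2" or a string where an integer is required raises ToolCallRejected before anything real is touched. The design choice worth noting is that rejection is the default: a name absent from the registry has no path to execution.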

A higher task-success number can be hiding a higher hallucination rate. Measure both, or you are flying blind.
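
One way to measure both is to score every proposed call in an eval run, not just the final outcome. The sketch below is a rough illustration under assumed names: Episode, KNOWN_TOOLS, classify_call, and score are hypothetical, the tool names are made up, and a real harness would record calls from actual agent traces rather than hand-written tuples.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical tool contract: tool name -> exact set of argument names.
KNOWN_TOOLS: dict[str, set[str]] = {
    "search_orders": {"customer_id"},
    "refund_order": {"order_id", "amount"},
}


@dataclass
class Episode:
    """One eval run: the final outcome plus every call the agent proposed."""
    task_succeeded: bool
    calls: list[tuple[str, dict]] = field(default_factory=list)


def classify_call(name: str, args: dict) -> str:
    """Label a proposed call as ok, an invented tool, or invalid arguments."""
    if name not in KNOWN_TOOLS:
        return "invented_tool"
    if set(args) != KNOWN_TOOLS[name]:
        return "invalid_arguments"
    return "ok"


def score(episodes: list[Episode]) -> dict[str, float]:
    """Report task success and hallucinated-call rates side by side."""
    labels = Counter(
        classify_call(name, args)
        for episode in episodes
        for name, args in episode.calls
    )
    total_calls = sum(labels.values())

    def rate(label: str) -> float:
        return labels[label] / total_calls if total_calls else 0.0

    return {
        "task_success_rate": sum(e.task_succeeded for e in episodes) / len(episodes),
        "invented_tool_rate": rate("invented_tool"),
        "invalid_argument_rate": rate("invalid_arguments"),
    }


episodes = [
    # The task succeeded, yet the agent also proposed a tool that does not exist.
    Episode(True, [("search_orders", {"customer_id": "c1"}),
                   ("lookup_warehouse", {})]),
    # The task failed, and the one call made was missing a required argument.
    Episode(False, [("refund_order", {"order_id": "o1"})]),
]
report = score(episodes)
```

In this toy data the first episode counts as a success even though one of its two calls was invented, which is exactly the case a task-success number alone would hide.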

The mindset shift

The old mental model was: pick the smartest model and inherit its reliability. The new one is: pick a capable model, then assume it will occasionally be confidently wrong, and build the scaffolding that catches it. Reliability is now something you engineer into the system, not something you buy with a model upgrade.

This is good news for serious builders. It means durable advantage does not come from whoever has the flashiest model. It comes from whoever builds the most disciplined surface around it: tight schemas, hard grounding, and evals that look for the failure modes that actually bite.

We help founders and teams design and build digital products that scale with you instead of slowing you down. If you want to build something agentic that actually holds up in production, get in touch with us today.

The smarter the agent, the more rigor it demands from the system holding it. Plan for that, and capability becomes an asset instead of a liability.