AI Agents Are Getting Smarter, But That Doesn’t Mean They Should Do More

AI agents aren’t failing because they’re not smart enough—they’re failing because we’re asking them to do too much. Here’s how to fix that.

September 2, 2025

We’ve spent years chasing scale.

Faster models, bigger models, “smarter” agents. The logic has been: the more intelligence you pack in, the more you can automate. But here’s a simple truth that’s becoming harder to ignore, especially if you’re leading real transformation on the ground:

AI agents don’t fail because they’re not smart enough. They often fail because we ask them to do too much.

This isn’t just gut instinct; it’s backed by research. A recent research paper from Apple's Machine Learning Research team, titled “The Illusion of Thinking,” reveals something counterintuitive about how advanced reasoning models perform under pressure. It turns out there’s a sweet spot, what I like to call the Goldilocks zone, where AI agents truly shine.

Let’s unpack what this means for those of us trying to implement AI at scale in the real world.

When Thinking Harder Makes Things Worse

Apple’s study put Large Reasoning Models (LRMs) through a series of controlled puzzle environments designed to test their reasoning abilities: not just whether they could spit out the right answer, but how they arrived at it.

The results were surprising:

  1. Low-complexity tasks? Models overthought and got tripped up.
  2. Medium complexity? Strong performance, good reasoning flow.
  3. High complexity? A total collapse in accuracy.

In other words, when the task becomes too complex, adding more reasoning doesn’t help; it hurts. The models produce longer, more confident explanations, but those explanations are often riddled with mistakes.

And in a business context, that’s dangerous.

Imagine deploying an AI agent to manage contract workflows, only to realize it’s been misunderstanding clauses with 90% confidence. You’re not just looking at performance issues; you’re looking at potential reputational or legal risks.

The Goldilocks Zone: Where AI Agents Actually Deliver

The most important insight here isn’t about failure; it’s about where agents consistently succeed.

AI agents are at their best when the problem:

  1. Has some complexity, but not too much;
  2. Involves structured patterns, but not rigid logic chains;
  3. Benefits from reasoning, but not full-on planning.

And that’s great news for business leaders. Because most real-world workflows (think customer service escalation, logistics prioritization, first-pass data analysis) fall squarely in that middle zone.

These are areas where:

  1. Human expertise is valuable, but expensive to scale.
  2. Rules-based automation feels too rigid.
  3. AI agents? Just right.

If you’re designing for AI deployment, this is the zone you want to target.

Scope Is Strategy

Too often, we try to build all-knowing, end-to-end AI agents. The kind that can read a report, analyze your business, and rewrite your strategy.

But the smarter move? Build smaller agents with clear boundaries:

  1. An agent that triages support tickets, but doesn’t resolve billing disputes.
  2. A co-pilot for summarizing meeting notes, but not interpreting shareholder reports.
  3. A claims processor that flags anomalies, but doesn’t approve payouts.

Smaller scope = higher trust, faster deployment, and fewer hallucinations.
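
To make those boundaries concrete, here’s a minimal sketch of a scope-limited triage agent in Python. The category names, the keyword-based `classify` stand-in (a real deployment would call an LLM there), and the escalation path are all illustrative assumptions, not any particular product’s API:

```python
# Minimal sketch of a scope-limited triage agent. The categories, the
# keyword-based classify() stand-in, and the escalation path are all
# illustrative assumptions.
from dataclasses import dataclass

IN_SCOPE = {"password_reset", "shipping_status", "product_question"}

@dataclass
class Ticket:
    id: str
    text: str

def classify(ticket: Ticket) -> str:
    # Stand-in for an LLM call that assigns exactly one category label.
    text = ticket.text.lower()
    if "refund" in text or "charged" in text:
        return "billing_dispute"
    if "password" in text:
        return "password_reset"
    return "product_question"

def triage(ticket: Ticket) -> dict:
    category = classify(ticket)
    if category in IN_SCOPE:
        # Inside the boundary: the agent may act (route, draft a reply).
        return {"ticket": ticket.id, "action": "route", "queue": category}
    # Outside the boundary: hand off to a human, never resolve it here.
    return {"ticket": ticket.id, "action": "escalate", "reason": category}

print(triage(Ticket("T-17", "I was double charged and want a refund")))
# -> {'ticket': 'T-17', 'action': 'escalate', 'reason': 'billing_dispute'}
```

The design choice worth noticing: the allowlist, not the model, decides what the agent is permitted to touch.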

The most successful agent implementations I’ve seen recently weren’t trying to replace expert roles; they were designed to support humans at the right moment with just enough intelligence to be helpful, not risky.

AI Agents Aren’t Doers — They’re Thinkers (With Limits)

Here’s another hard truth from the paper: even when you give agents step-by-step instructions, they can still fail to follow through. This isn’t a bug. It’s a reflection of what today’s AI is good at:

  1. Reasoning, not repetition.
  2. Insight, not execution.
  3. Judgment calls, not multi-step computation.

So instead of forcing AI agents to “do it all,” smart deployments use agents to make decisions and hand off the doing to something else: a calculator, a rule engine, or yes, a human.

The best agent architectures I’ve seen in the field pair LLM-based agents with traditional systems that check, verify, and execute. That hybrid model isn’t just safer. It’s more sustainable.
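
As a minimal sketch of that decide/verify/execute split, assume a discount-approval workflow: `propose_discount` stands in for the LLM’s judgment call, and the hard limits in `verify` belong to a deterministic rule engine the agent can’t override. All names and thresholds here are illustrative assumptions:

```python
# Minimal sketch of the decide/verify/execute split: the LLM proposes,
# a deterministic rule engine verifies, and only verified decisions run.
# Function names and thresholds are illustrative assumptions.

def propose_discount(order_total: float) -> float:
    # Stand-in for an LLM agent's judgment call.
    return round(order_total * 0.15, 2)

def verify(order_total: float, discount: float) -> bool:
    # Hard guardrails the agent cannot override: never more than 20%
    # of the order, never more than a 50.00 cap.
    return 0.0 <= discount <= min(50.0, order_total * 0.20)

def execute(order_total: float, discount: float) -> str:
    if verify(order_total, discount):
        return f"applied discount of {discount:.2f}"
    # Fail closed: anything the rule engine rejects goes to a human.
    return "escalated to human review"

total = 200.0
print(execute(total, propose_discount(total)))  # -> applied discount of 30.00
```

The model only proposes; anything the guardrails reject fails closed to a human.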

Trust Requires Transparency

One of the most striking findings from Apple’s study? AI models often produce correct answers early, but then keep “thinking” and change their minds, landing on the wrong result.

This reveals a deeper limitation: many agents don’t know when they’ve already succeeded. They lack true self-awareness or confidence calibration.

That’s a big deal if you’re deploying agents in regulated environments, or anywhere a bad answer carries real consequences.

To fix this, we need what I call transparent reasoning scaffolds:

  1. Show the full reasoning trace, not just the final answer.
  2. Build in checkpoints that compare new conclusions to earlier ones.
  3. Flag when uncertainty spikes, so humans know when to step in.

In short: we need to design for doubt.
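
Here’s a minimal sketch of what such a scaffold could look like in Python. The assumption that the reasoning loop emits a conclusion plus a self-reported uncertainty score per step, and the 0.7 threshold, are illustrative choices of mine, not taken from Apple’s paper:

```python
# Minimal sketch of a transparent reasoning scaffold: every step is
# logged, new conclusions are checked against earlier ones, and an
# uncertainty spike pauses the run for human review. The per-step
# (conclusion, uncertainty) interface and the 0.7 threshold are
# illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    steps: list = field(default_factory=list)  # full trace, not just the final answer

    def record(self, conclusion: str, uncertainty: float) -> str:
        self.steps.append((conclusion, uncertainty))
        # Checkpoint: did the model abandon an earlier, more confident answer?
        earlier = [(c, u) for c, u in self.steps[:-1] if u < uncertainty]
        if earlier and earlier[-1][0] != conclusion:
            return "flag: revised a more confident earlier conclusion"
        # Checkpoint: high uncertainty means a human should step in.
        if uncertainty > 0.7:
            return "flag: uncertainty spike, request human review"
        return "ok"

trace = ReasoningTrace()
print(trace.record("answer: 42", uncertainty=0.2))  # -> ok
print(trace.record("answer: 41", uncertainty=0.8))  # -> flag: revised ...
```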

So What Should You Do with This Insight?

Here’s the practical takeaway:

Don’t push agents to do more. Design them to do less, but better.

Start with a clear task. Limit the scope. Focus on where AI already performs well. Use controlled handoffs. Measure what matters.

The illusion of thinking is a warning, but also a gift. It reminds us to build agents for where they are today, not where we hope they’ll be in five years.

Because the AI agents that thrive in practice aren’t the most powerful, they’re the most focused.

Final Thought

I’ve spent over a decade and a half helping companies navigate digital transformation. One thing never changes: success comes from designing around reality, not ambition.

If you’re building with AI agents, this is the time to get real. The next big leap won’t come from dreaming bigger. It’ll come from thinking smaller and delivering smarter.

Let’s build the agents that actually work.

Want help designing scoped, sustainable AI agents for your organization? Let’s talk.
