Apple Machine Learning Research's June 2025 paper "The Illusion of Thinking" revealed fundamental limitations in current Large Reasoning Models (LRMs). The researchers used four puzzle problems, including Tower of Hanoi, with controllable complexity to examine the performance of models such as o3-mini and DeepSeek-R1. The experiments showed that model behaviour progresses through three regimes: on simple problems, reasoning and standard models perform similarly well; at medium complexity, reasoning models pull ahead; and at high complexity, both groups' accuracy collapses to zero.
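The "controllable complexity" is easiest to see with Tower of Hanoi itself: the puzzle is scaled by its disk count n, and the shortest solution takes 2^n - 1 moves, so each extra disk roughly doubles the work. The sketch below is illustrative rather than taken from the paper; it generates the standard recursive solution and confirms the exponential growth in move count.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move list for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # park the n-1 smaller disks on the spare peg
        + [(source, target)]                         # move the largest disk to the target
        + hanoi_moves(n - 1, spare, target, source)  # restack the n-1 disks on top of it
    )

for n in range(1, 11):
    moves = hanoi_moves(n)
    assert len(moves) == 2**n - 1                    # optimal length grows exponentially
    print(n, "disks ->", len(moves), "moves")
```

At 10 disks the optimal solution is already 1,023 moves long; at 20 disks it exceeds a million moves.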
Apple's researchers observed that as task complexity increases, models' reasoning effort grows up to a point and then declines despite an adequate token budget, indicating a fundamental limit on how reasoning scales. Analysing the reasoning traces, they found that on simpler problems models often "overthink": the correct solution appears early, yet the models keep exploring incorrect alternatives; on medium-complexity problems, models explore incorrect solutions before finding the correct one. The study also showed that even when given an explicit solution algorithm, models failed to execute it reliably, suggesting deeper bottlenecks in reasoning.
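How unreliable execution shows up in practice: a model given the recursive algorithm still has to emit a legal move at every step, and a single illegal move invalidates the attempt. The checker below is a hypothetical sketch, not Apple's evaluation harness; it replays a model-emitted move list against the puzzle rules and reports the first violation.

```python
def check_hanoi_solution(n, moves):
    """Replay (from_peg, to_peg) moves; return (solved, index of first bad move or total length)."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk n at the bottom of peg A
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i                       # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                       # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1)), len(moves)

# A correct 3-disk solution passes; corrupting any move is flagged at its index.
good = [("A","C"), ("A","B"), ("C","B"), ("A","C"), ("B","A"), ("B","C"), ("A","C")]
print(check_hanoi_solution(3, good))   # (True, 7)
```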
Apple's study sparked extensive debate in the AI community, particularly over whether current metrics adequately evaluate models' true capabilities. Cognitive scientist Gary Marcus said the study "fundamentally shows that LLMs are no substitute for good well-specified conventional algorithms," while AI commentator Simon Willison pointed out that "reasoning LLMs are already useful today" regardless of whether they can reliably solve Tower of Hanoi. Anthropic's July 2025 rebuttal argued that Apple's alarming results stemmed not from limits on the models' reasoning but from poorly designed evaluations: the models did not fail to think, they failed to enumerate exponentially long move sequences within their token constraints.
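The arithmetic behind that objection is easy to sanity-check. The per-move token cost and output cap below are assumptions for illustration, not figures from either paper, but they show how quickly a complete Hanoi move listing outgrows a fixed output budget.

```python
TOKENS_PER_MOVE = 10      # assumed cost of printing one move, e.g. "move disk 3 from A to C"
OUTPUT_BUDGET = 64_000    # assumed output-token cap

for n in (10, 12, 15, 20):
    moves = 2**n - 1
    needed = moves * TOKENS_PER_MOVE
    verdict = "fits" if needed <= OUTPUT_BUDGET else "exceeds budget"
    print(f"{n} disks: {moves:>9,} moves, ~{needed:>10,} tokens ({verdict})")
```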