Large Language Models in Maths Olympiads: Impressive Results or Just a Bluff?

Image source: Freepik - jcomp

Recent advances in the mathematical capabilities of large language models (LLMs) have attracted considerable interest, yet detailed human evaluations on the 2025 USAMO (USA Mathematical Olympiad) reveal that current models fall significantly short when asked to produce rigorous mathematical proofs. Benchmarks like MathArena paint a positive picture of LLM performance on the AIME competition, where Gemini-2.5 Pro achieved results comparable to top human competitors, but those assessments score only the correctness of final numerical answers, overlooking the rigor of the reasoning and the quality of proof construction.
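
The gap between these two evaluation styles is easy to make concrete. The sketch below is purely illustrative and is not the actual MathArena or USAMO grading pipeline: the function names, the 0-7 per-problem rubric scale, and the example scores are assumptions chosen to show how a model can earn full marks under answer-only scoring while earning almost nothing under proof grading.

```python
# Illustrative sketch only -- not the real MathArena or USAMO graders.

def grade_final_answer(model_answer: str, reference_answer: str) -> float:
    """Answer-only grading, AIME-style: full credit iff the final value
    matches, regardless of how it was obtained."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def grade_proof(rubric_scores: list[int], max_per_problem: int = 7) -> float:
    """Proof grading, USAMO-style: human experts award partial credit per
    problem (assumed 0-7 points here, the usual olympiad scale) for the
    rigor of the full argument."""
    return sum(rubric_scores) / (max_per_problem * len(rubric_scores))

# The same model can look very different under the two metrics:
print(grade_final_answer("42", "42"))    # 1.0  -- the final answer matches
print(grade_proof([2, 0, 1, 0, 0, 0]))   # ~0.07 -- the proofs barely score
```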

Expert evaluations of the six 2025 USAMO problems delivered sobering results: Gemini-2.5 Pro scored only 25%, while every other model scored less than 5%. According to the study Comparative Evaluation of LLMs’ Mathematical Reasoning, performance on IMO-level problems was similarly poor, with correct solution rates ranging from 3.8% (DeepSeek) down to 0% (Gemini 2.0). Evaluators identified recurring error types, including proof by example, unverified claims, and assertions of incorrect facts. As the researchers noted, LLMs often rely on heuristics, shortcuts, and unfounded guesses rather than rigorous reasoning, which frequently leads them into error.

The evaluations also showed that even when models produced correct final answers (as DeepSeek did in 63.2% of cases), the underlying reasoning was typically flawed. Models exhibited problematic patterns, such as citing non-existent sources and failing to distinguish correct solutions from incorrect ones. These findings indicate that claims about LLMs’ olympiad-level mathematical abilities are overstated, underscoring the need for substantial improvements in reasoning and proof generation before these models can reliably tackle complex mathematical problems.

Sources:

1. Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, Gemini-2.5-Pro, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly: only Gemini-2.5-Pro achieves a non-trivial score of 25%, while all other models achieve less than 5%. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

2. Large Language Models and Math: A Review of Approaches and Progress
Existing Challenges in Math for LLMs

3. Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Recent advances in large language models (LLMs) have shown impressive progress in mathematical reasoning tasks. However, current evaluation benchmarks predominantly focus on the accuracy of final answers, often overlooking the crucial logical rigor for mathematical problem solving. The claim that state-of-the-art LLMs can solve Math Olympiad-level problems requires closer examination. To explore this, we conducted both qualitative and quantitative human evaluations of proofs generated by LLMs, and developed a schema for automatically assessing their reasoning capabilities. Our study reveals that current LLMs fall significantly short of solving challenging Olympiad-level problems and frequently fail to distinguish correct mathematical reasoning from clearly flawed solutions. Our analyses demonstrate that the occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the soundness of the reasoning used to arrive at an answer rather than the mere correctness of the final answers.