
Debates Surrounding Grok 3 Benchmarks: Did xAI Release Misleading Data?

Mar 4, 2025



Experts at OpenAI have questioned the credibility of the Grok 3 performance data published by xAI, saying the results may be misleading, particularly the scores reported on the AIME 2025 mathematics test.

The dispute centres on the graphs in xAI's blog post, which omitted the score that OpenAI's o3-mini-high model achieves under cons@64, a special test mode in which the model answers each problem 64 times and the most frequent answer is taken as its final answer. According to the detailed analysis, when only first-attempt responses are compared, the Grok 3 Reasoning Beta and Grok 3 mini Reasoning models score lower than OpenAI's model. This sits uneasily with Elon Musk's statement at the Dubai World Government Summit on 13 February, where he called Grok 3 "scarily smart" and claimed that it surpasses every model released to date.
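To make the difference between the two scoring modes concrete, here is a minimal sketch in Python. It is not xAI's or OpenAI's evaluation code. The function names `pass_at_1` and `cons_at_k` and the sample answers are hypothetical, chosen only to illustrate how a first-attempt score and a majority vote over 64 attempts can diverge on the same problem.

```python
from collections import Counter

def pass_at_1(samples, correct):
    """Fraction of individual attempts that match the correct answer,
    i.e. the expected score when only a single attempt is counted."""
    return sum(s == correct for s in samples) / len(samples)

def cons_at_k(samples, correct):
    """Consensus score: 1 if the most frequent answer among all
    attempts matches the correct answer, otherwise 0."""
    majority_answer, _count = Counter(samples).most_common(1)[0]
    return int(majority_answer == correct)

# Hypothetical data: 64 sampled answers to one problem whose
# correct answer is "204". These numbers are made up for illustration.
samples = ["204"] * 30 + ["210"] * 20 + ["96"] * 14

print(pass_at_1(samples, "204"))  # share of single attempts that are right
print(cons_at_k(samples, "204"))  # credit awarded under majority voting
```

In this made-up case the model is right on fewer than half of its individual attempts, yet it receives full credit under the consensus mode because the correct answer is still the most common one. A chart that shows one model's consensus score next to another model's first-attempt score therefore compares two different quantities.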

Source: https://x.com/nrehiew_/status/1891710589115715847/photo/1

The dispute between xAI leadership and OpenAI highlights how difficult it is to measure artificial intelligence performance. Igor Babuschkin, co-founder of xAI, defended the company by arguing that OpenAI had itself published misleading comparison graphs in the past. According to experts, the computational and financial cost of achieving each model's best results remains unknown, even though this information would be crucial for assessing actual performance.

Sources:

If the light blue part is best of N scores, this means that Grok 3 reasoning is inherently an ~o1 level model. This means the capabilities gap between OpenAI and xAI is ~9 months.

Also what is the difference between "think" and "big brain" pic.twitter.com/Jw8yk5tEm9
— wh (@nrehiew_) February 18, 2025

once see this you can’t unsee it:

the light-blue shading that puts grok-3 over o3-mini is cons@64 https://t.co/iJo4Sq2uaa
— Aidan McLaughlin (@aidan_mclau) February 20, 2025

Disappointing to see the incentives for the grok team to cheat and deceive in evals.

Tl;dr o3-mini is better in every eval compared to grok 3.

Grok 3 is genuinely a decent model, but no need to over sell. https://t.co/sJj5ByVikp
— Boris Power (@BorisMPower) February 20, 2025