This week, an OpenAI employee accused Elon Musk’s AI company xAI of publishing misleading benchmark results for its latest AI model, Grok 3. The accusation has reignited debate over how AI models are benchmarked and how AI labs report those results. Igor Babushkin, one of xAI’s co-founders, insisted that the company was in the right.
In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a set of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to assess a model’s mathematical capabilities.
xAI’s graph shows two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 Mini Reasoning, outperforming OpenAI’s best available model, o3-mini-high, on AIME 2025. However, OpenAI employees were quick to point out on X that xAI’s graph omitted o3-mini-high’s AIME 2025 score at “cons@64.” You might wonder what cons@64 is. It is short for consensus@64: the model gets 64 tries at each benchmark problem, and the answer it generates most often is taken as its final answer.
As you can imagine, cons@64 tends to boost a model’s benchmark score considerably, and omitting it from a graph can make one model appear to surpass another when that isn’t the case. At “@1,” meaning the score from a model’s first attempt at each problem, both Grok 3 Reasoning Beta and Grok 3 Mini Reasoning score below o3-mini-high on AIME 2025. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model set to “medium” computing. Despite this, xAI is advertising Grok 3 as “the world’s smartest AI.”
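To see concretely how the two scoring modes differ, here is a minimal Python sketch. Everything in it is illustrative: the solve function is a toy stand-in for sampling a single answer from a model, not any lab’s actual evaluation harness.

```python
import random
from collections import Counter

def solve(problem: str) -> str:
    """Toy stand-in for sampling one answer from a model.

    A real evaluation would query the model once per call; here we
    fake a model that returns the right answer about 75% of the time.
    """
    return random.choice(["204", "204", "204", "113"])

def score_at_1(problems: list[str], answers: dict[str, str]) -> float:
    """@1: one attempt per problem; the first answer is the final answer."""
    correct = sum(solve(p) == answers[p] for p in problems)
    return correct / len(problems)

def score_cons_at_64(problems: list[str], answers: dict[str, str], n: int = 64) -> float:
    """cons@64: sample 64 answers per problem and take the majority vote."""
    correct = 0
    for p in problems:
        votes = Counter(solve(p) for _ in range(n))
        majority_answer, _ = votes.most_common(1)[0]
        correct += majority_answer == answers[p]
    return correct / len(problems)

if __name__ == "__main__":
    problems = ["AIME 2025 Problem 1"]
    answers = {"AIME 2025 Problem 1": "204"}
    print("@1:     ", score_at_1(problems, answers))
    print("cons@64:", score_cons_at_64(problems, answers))
```

Because the majority vote averages out unlucky individual samples, a model’s cons@64 score will almost always match or exceed its @1 score, which is why mixing the two modes in a single chart can flip the apparent ranking.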
Babushkin argued on X that OpenAI had published similarly misleading benchmark charts in the past, though those charts compared their own models’ performances. A more neutral side of the debate created a more “accurate” graph showing the performance of nearly every model at cons@64.
Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it's DeepSeek propaganda
— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
(I actually believe Grok looks good there, and openAI's TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.)
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost required for each model to achieve its best score. It is a reminder of how little most benchmark discussions communicate about the limitations and strengths of AI models.