Did xAI Lie About the Benchmark of Grok 3?

This week, an OpenAI employee accused Elon Musk’s AI company xAI of publishing misleading benchmark results for its latest AI model, Grok 3. The dispute has renewed scrutiny of AI benchmarking and how AI labs report their results. Igor Babushkin, a co-founder of xAI, insisted that the company had done nothing wrong.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a set of challenging questions from a recent invitational mathematics exam. Some experts have questioned the validity of AIME as an AI benchmark. Nevertheless, AIME 2025 and older editions of the test are commonly used to assess a model’s mathematical capabilities.

The graph from xAI shows two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 Mini Reasoning, outperforming OpenAI’s best available model, o3-mini-high, on AIME 2025. However, OpenAI employees quickly pointed out on X that xAI’s graph omitted o3-mini-high’s AIME 2025 score at “cons@64.” You might wonder what cons@64 is. It is short for consensus@64: the model is given 64 attempts at each benchmark problem, and the most frequent answer is taken as its final answer.
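The consensus idea itself is simple to sketch. Below is a minimal, hypothetical illustration of majority voting over sampled answers; the function name and the sample data are invented for demonstration and are not from any lab’s actual evaluation code.

```python
from collections import Counter

def consensus_answer(answers):
    """Return the most frequent answer among multiple sampled attempts.

    This illustrates the idea behind cons@64 (consensus@64): sample 64
    answers per problem and take the majority vote as the final answer.
    """
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: 64 sampled answers to a single AIME problem.
samples = ["70"] * 40 + ["35"] * 15 + ["140"] * 9
print(consensus_answer(samples))  # majority vote -> "70"
```

Because a single majority vote can be right even when most individual attempts vary, this strategy tends to raise scores well above a single-attempt (@1) evaluation.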

As you can imagine, cons@64 significantly boosts a model’s benchmark score, and omitting it from the graph can make one model look better than another when that’s not the case. At “@1,” meaning the model’s first attempt at each problem, Grok 3 Reasoning Beta and Grok 3 Mini Reasoning both score lower on AIME 2025 than o3-mini-high. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model set to “medium” computing. Despite this, xAI is advertising Grok 3 as “the world’s smartest AI.”

Babushkin argued on X that OpenAI had published similarly misleading benchmark charts in the past, though those charts compared OpenAI’s own models against one another. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64.

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost required for each model to achieve its best score. This just highlights how little is communicated about the limitations—and strengths—of most AI models in benchmark discussions.
