Over the weekend, Meta unveiled two new Llama 4 models: Scout, a smaller model, and Maverick, a mid-sized model that the company claims can outperform GPT-4o and Gemini 2.0 Flash “across a wide range of well-known benchmarks.”
Maverick quickly took second place on LMArena, an AI benchmarking site where people compare outputs from different systems and vote on the best one. In its press release, Meta highlighted Maverick’s Elo score of 1417, which placed it above OpenAI’s GPT-4o and just below Gemini 2.5 Pro. (A higher Elo score means a model is more likely to win when it goes head-to-head with competitors in the arena.)
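For context, Elo ratings translate into head-to-head win probabilities through a simple formula. The sketch below is a minimal Python illustration of the standard Elo expectation; the opponent’s rating is a made-up placeholder, and LMArena’s own scoring pipeline may differ in its details.

```python
# Rough illustration of how Elo ratings map to head-to-head win odds.
# The 1417 figure is Maverick's reported score; the rival rating below is
# a hypothetical placeholder, not an actual LMArena leaderboard entry.

def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

maverick = 1417   # score reported by Meta
rival = 1380      # hypothetical competitor rating
print(f"{expected_win_probability(maverick, rival):.2f}")  # ~0.55
```

Under this standard formula, a 37-point gap corresponds to only a modest edge: the higher-rated model would be expected to win roughly 55 percent of matchups.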
This achievement seemed to position Meta’s open-source Llama 4 as a serious competitor to the most advanced closed-source models from OpenAI, Anthropic, and Google. However, AI researchers digging into Meta’s documentation found something unusual.
In the fine print, Meta acknowledges that the version of Maverick tested on LMArena differs from the publicly available one. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick on LMArena that was specifically “optimized for conversationality,” as TechCrunch first reported.
“Meta’s interpretation of our policy was not in line with what we expect from model providers,” LMArena wrote on X two days after the model’s release. “Meta should have made it clear that ‘Llama-4-Maverick-03-26-Experimental’ is a customized model optimized for human preferences. As a result, we are updating our leaderboards policy to reinforce our commitment to fair, reproducible scores so that such confusion does not arise in the future.”
Meta spokesperson Ashley Gabriel said in an email that “we are experimenting with all types of custom options.”
“‘Llama-4-Maverick-03-26-Experimental’ is a chat-optimized version that we’ve been experimenting with and that also works well on LMArena,” said Gabriel. “We have now released our open source version and will see how developers adapt Llama 4 for their own use cases. We’re excited to see what they create and look forward to hearing their feedback.”
Although what Meta did with Maverick doesn’t directly violate LMArena’s rules, the site has previously shared concerns about gaming of the system and has taken steps to “prevent over-customization and leakage of test results.” When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, rankings on benchmarks like LMArena become less meaningful as indicators of real-world performance.
“It’s the most respected general benchmark because all the others suck,” independent AI researcher Simon Willison tells The Verge. “When Llama 4 came out, the fact that it came in second place, right behind Gemini 2.5 Pro, really impressed me, and I’m kicking myself for not reading the fine print.”
Shortly after Meta released Maverick and Scout, rumors began circulating in the AI community that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. Ahmad Al-Dahle, Vice President of Generative AI at Meta, responded to these allegations in a post on X: “We have also heard claims that we have been training on test sets – this is simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to the need to stabilize implementations.”
Some also noticed that Llama 4 was released at an odd time. Saturday isn’t usually when big AI news drops. After someone on Threads asked why Llama 4 was released over the weekend, Meta CEO Mark Zuckerberg responded simply: “That’s when it was ready.”
“All in all, it’s a very confusing release,” says Willison, who closely follows and documents AI models. “The model score we got there is completely worthless to me. I can’t even use the model that got the high score.”
Meta’s path to the release of Llama 4 was not entirely smooth. According to a recent report by The Information, the company repeatedly postponed the launch because the model did not meet internal expectations. These expectations were especially heightened after DeepSeek, a Chinese open-source AI startup, released an open-weight model that generated a lot of buzz.
Ultimately, entering an optimized model on LMArena leaves developers in a difficult position. When they choose models like Llama 4 for their applications, they naturally look to benchmarks. But, as Maverick shows, those benchmarks can reflect capabilities that the publicly available models don’t actually have.
This episode shows how benchmarks are becoming a battleground as AI development accelerates. It also shows how far Meta is willing to go to be seen as a leader in the AI industry, even if that means gaming the system.