The debate on AI benchmarking has reached Pokémon

Last week, a post on X went viral claiming that Google’s latest Gemini model had surpassed Anthropic’s flagship Claude model in the original Pokémon trilogy. Gemini reportedly reached Lavender Town on the developer’s Twitch stream, while Claude was still stuck at Mount Moon as of late February.

As Reddit users noted, however, the developer maintaining the Gemini stream built a custom mini-map that helps the model identify “tiles” in the game, such as trees that can be cut down. This reduces how much Gemini has to analyze screenshots before making gameplay decisions.

To be clear, Pokémon is at best a semi-serious AI benchmark: few would argue it says much about a model’s capabilities. But it is an instructive example of how different implementations of the same benchmark can change the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model’s coding ability. Claude 3.7 Sonnet achieved 62.3% accuracy on standard SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

Meta, meanwhile, recently tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a specific benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same test.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. In other words, comparing models is unlikely to get any easier as new ones are released.
