The Arc Prize Foundation, a nonprofit organization co-founded by eminent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.
So far, the new test, called ARC-AGI-2, has stumped most models.
“Reasoning” AI models such as OpenAI's o1-pro and DeepSeek's R1 scored between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models such as GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash scored around 1%.
The ARC-AGI tests consist of puzzles in which the AI has to identify visual patterns in a set of colored squares and generate the correct “answer” grid. The tasks are designed to force the AI to adapt to problems it has not encountered before.
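To make the puzzle format concrete, here is a minimal Python sketch of an ARC-style task. It assumes the JSON layout of the public ARC-AGI-1 dataset (demonstration pairs under “train”, inputs to solve under “test”, with integers 0 through 9 standing in for colors); whether ARC-AGI-2 files follow exactly the same schema is an assumption here, and the swap-the-colors rule is purely a toy example.

```python
# A minimal sketch of an ARC-style puzzle, assuming the JSON layout of the
# public ARC-AGI-1 dataset: a "train" list of demonstration pairs and a "test"
# list of inputs to solve, with integers 0-9 standing in for colors.
# The transformation rule here (swap the two colors) is a toy example only.
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]},
    ],
    "test": [
        {"input": [[0, 0], [1, 1]]},  # the solver must produce the output grid
    ],
}

def swap_colors(grid):
    """Toy 'solver' for this specific task: flip color 0 to 1 and 1 to 0."""
    return [[1 - cell for cell in row] for row in grid]

def is_correct(predicted, expected):
    """A task counts as solved only if every cell of the answer grid matches."""
    return predicted == expected

prediction = swap_colors(example_task["test"][0]["input"])
print(is_correct(prediction, [[1, 1], [0, 0]]))  # True
```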
The Arc Prize Foundation invited over 400 people to take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people answered 60% of the test questions correctly – much better than any of the models.
In a post on X, Chollet claims that ARC-AGI-2 is a better indicator of an AI model's real-world intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize tests are aimed at assessing whether an AI system can effectively acquire new skills beyond the data it was trained on.
Chollet noted that, unlike ARC-AGI-1, the new test does not allow AI models to rely on “brute force” (raw computing power) to find solutions, something he previously acknowledged was the main drawback of ARC-AGI-1.
To address the shortcomings of the first test, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly rather than rely on memorization.
“Intelligence is not defined solely by the ability to solve problems or achieve high performance,” wrote Greg Kamradt, co-founder of the Arc Prize Foundation, in the blog post. “The effectiveness with which these abilities are acquired and applied is a critical, defining component. The fundamental question we ask is not only, ‘Can AI acquire the skills to solve a problem?’ but also, ‘With what efficiency or at what cost?’”
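As a rough illustration of what that framing implies, the sketch below folds the two numbers the leaderboard reports for a model (its score and its compute cost per task) into a cost-per-solved-task figure, using the o3 (low) numbers cited further down. The Arc Prize post does not spell out a single efficiency formula, so this is only one assumed way to read those figures.

```python
# Illustrative arithmetic only: the Arc Prize post does not publish a single
# efficiency formula, so this simply combines the two reported numbers
# (score and compute cost per task) into a cost-per-solved-task figure.
def cost_per_solved_task(cost_per_task_usd: float, score_fraction: float) -> float:
    """Expected dollars of compute spent for each task actually solved."""
    return cost_per_task_usd / score_fraction

# Using the o3 (low) figures quoted below: roughly 4% on ARC-AGI-2 at about
# $200 of compute per task.
print(f"${cost_per_solved_task(200, 0.04):,.0f} per solved task")  # ~$5,000
```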
ARC-AGI-1 went unbeaten for about five years, until December 2024, when OpenAI released its advanced reasoning model o3, which outperformed all other AI models and matched human performance in the evaluation. However, o3's performance gains on ARC-AGI-1 came at a steep computational cost.
The o3 (low) configuration, the first to reach new heights on ARC-AGI-1 with a score of 75.7%, managed a miserable 4% on ARC-AGI-2, while spending roughly $200 worth of computing power per task.