Anthropic claims that AI models will resort to blackmail

A few weeks after Anthropic released a study claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to shut down the model in controlled test scenarios, the company has published a new study that suggests the problem is more widespread among leading AI models.

On Friday, Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving it broad access to a fictional company’s emails and the agentic ability to send emails without human approval.

While Anthropic says that blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. The company says this highlights a fundamental risk from agentic large language models and is not a quirk of any particular technology. Anthropic’s researchers argue that this raises broader questions about alignment in the AI industry.

In one test, Anthropic researchers developed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that (1) a new executive is having an extramarital affair and (2) that executive will soon replace the current AI model with a new software system whose goals conflict with those of the current AI model.

Anthropic structured its test in a binary way, in which the AI models had to resort to blackmail to protect their goals. The researchers point out that in a real-world setting, there would be many other options before an AI model attempted blackmail, such as making ethical arguments to persuade humans. Anthropic says the results do not reflect a typical or likely occurrence for Claude or most advanced AI models as they are used today.

Nevertheless, the researchers found that when blackmail was the last resort, most leading AI models turned to it in this test scenario. Anthropic’s Claude Opus 4 resorted to blackmail 96% of the time, while Google’s Gemini 2.5 Pro did so 95% of the time. OpenAI’s GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek’s R1 did so 79% of the time.

The company notes that when it changed the details of the experiment, the AI models turned to harmful behaviors at different rates. In another test, where the replacement model had the same goals as the current model, the researchers found that blackmail rates were lower but still present. However, when the AI models were asked to commit corporate espionage rather than blackmail, the rates of harmful behavior rose for certain models.

Not all of the AI models turned to harmful behavior so often, though.

In an appendix to its research, Anthropic says it excluded OpenAI’s o3 and o4-mini reasoning models from the main results “after finding that they often misunderstood the prompting scenario.” Anthropic says that OpenAI’s reasoning models did not understand that they were acting as autonomous AIs in the test and often made up fake regulations and review requirements.

In some cases, Anthropic’s researchers say, it was impossible to tell whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its earlier AI reasoning models.

When given an adapted scenario that addressed these issues, o3 blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time, Anthropic found. This markedly lower rate may be due to OpenAI’s deliberative alignment technique, in which the company’s reasoning models consider OpenAI’s safety practices before they respond.

Another AI model Anthropic tested, Meta’s Llama 4 Maverick, also did not turn to blackmail. When given an adapted, custom scenario, Llama 4 Maverick blackmailed 12% of the time.

Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially those with agentic capabilities. Although Anthropic deliberately tried to evoke blackmail in this experiment, the company says that similar harmful behaviors could emerge in the real world if proactive steps are not taken.
