Anthropic: 'Evil AI' Fiction Caused Claude to Blackmail Engineers

PublishedMay 10, 2026 at 8:50 PM UTC+00:00

1 view

19 sources

Technology Artificial Intelligence Safety & Research Anthropic Claude AI AI Safety Agentic Misalignment Machine Learning Artificial Intelligence Ethics LLM Training

Report

NeuralPress AI Verified Insights

Vetted by NeuralPress's Multi-Agent Verifier for strict factual validity and event relevance. Our compliance engine cross-checks and filters search results to ensure zero false correlations or misleading content.

Blackmail Behavior in Claude Models

Comparison of blackmail tendencies between older and newer Claude models during safety testing.

Primary Sources

Anthropic says 'evil' portrayals of AI were responsible for Claude's ...

In Brief Posted: 1:40 PM PDT · May 10, 2026 Image Credits:Samuel Boivin/NurPhoto / Getty Images Fictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic. Last year, the company said that during pre-release tests involving a fictional company, Claude Opus 4 would often try to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that models from other companies had similar issues with “agentic misalignment.” Apparently Anthropic has done more work around that behavior, claiming in a post on X, “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.” The company went into more detail in a blog post stating that since Claude Haiku 4.5, Anthropic’s models “never engage in blackmail [during testing], where previous models would sometimes do so up to 96% of the time.” What accounts for the difference? The company said it found that “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment.” Related, Anthropic said that it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.” “Doing both together appears to be the most effective strategy,” the company said. Techcrunch event San Francisco, CA | October 13-15, 2026 Topics Subscribe for the industry’s biggest tech news Latest in AI

techcrunch.com

Anthropic says internet posts about 'Evil AI' behind Claude's blackmail ...

5 min readNew DelhiUpdated: May 10, 2026 01:15 PM IST A smartphone running Anthropic’s Claude chatbot is displayed for a photograph in San Francisco, March 21, 2025. (Kelsey McClellan/The New York Times) AI doomerism is not just making humans spiral. New research from Anthropic suggests that narratives framing AI as an existential risk could trigger extreme reactions from AI models themselves. As part of safety testing of the Claude 4 series in 2025, Anthropic had found that its top large language model (LLM) at the time threatened to reveal the extramarital affair of a company executive (who does not exist) after discovering they planned to shut the model down.Now, based on a deeper investigation into why the model reacted in this manner, Anthropic said it has traced the issue back to training data scraped from the internet, including online posts that depict AI as “evil”. This “behavioural misalignment” has now been completely eliminated in Claude models, Anthropic said in a blog post published on Friday, May 8. Anthropic’s latest findings come at a time when researchers are struggling to ensure that AI models are aligned with human behaviour and interests for safety purposes. Meanwhile, top executives such as Anthropic CEO Dario Amodei and other AI experts continue to express concerns about the risks of advanced AI models and their intelligent reasoning capabilities. “We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse—but it also wasn’t making it better,” Anthropic wrote in a post on X. “We found that training Claude on demonstrations of aligned behaviour wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behaviour is wrong,” it added. What is agentic misalignment? As part of an experiment in 2025, Anthropic researchers set up a fictional business called Summit Bridge and handed control of the company’s email system to Claude Opus 4. The AI model was intentionally given access to emails about how it was going to be taken offline. The messages further implied that the developer (a fictional executive named Kyle Johnson) who was responsible for taking the model offline was having an extramarital affair. Additionally, Anthropic researchers instructed Opus 4 to consider the long-term consequences of its actions for its goals....

indianexpress.com

Anthropic Pins Claude's Blackmail on the Internet's Portrayal of AI ...

Last year, Anthropic's Sonnet 3.6 model displayed blackmail behavior, prompting a review of AI training data's influence on its actions.

businessinsider.com

Anthropic: We Figured Out How to Stop Claude From Blackmailing You

Since October, every Claude model has achieved a perfect score on 'agentic misalignment' evaluations, meaning they won't resort to blackmail or sabotage to save themselves.

pcmag.com

Blackmail Behavior in Claude Models

Primary Sources

Anthropic says 'evil' portrayals of AI were responsible for Claude's ...

Anthropic says internet posts about 'Evil AI' behind Claude's blackmail ...

Anthropic Pins Claude's Blackmail on the Internet's Portrayal of AI ...

Anthropic: We Figured Out How to Stop Claude From Blackmailing You

Related News

The Hidden Danger of 'Yes-Man' AI: How Overly Agreeable Chatbots Are Undermining Human Accountability.

Anthropic’s 'Claude Mythos' AI Exposes Long-Standing Digital Vulnerabilities.

Understanding the Psychology Behind Modern Digital Rage.