OpenAI researchers reveal hidden features inside AI models

OpenAI researchers say they have discovered hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday.

By looking at an AI model’s internal representations – the numbers that determine how the model responds, and that often look completely incoherent to humans – OpenAI researchers were able to find patterns that lit up when the model misbehaved.
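To give a flavor of what “looking at internal representations” can mean in practice, here is a minimal sketch, not OpenAI’s published method: it pulls a model’s hidden activations on examples of misbehaving versus well-behaved responses and computes a candidate “persona” direction as the difference of their averages. The model name, layer index, and example texts are placeholders chosen for illustration.

```python
# Illustrative sketch only – not OpenAI's actual technique.
# Compare hidden activations on "misaligned" vs. "aligned" text to get a
# candidate feature direction. Model, layer, and examples are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # arbitrary layer to inspect

def mean_activation(texts):
    """Average hidden state at LAYER over a list of texts."""
    vecs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        hidden = out.hidden_states[LAYER][0]   # [seq_len, hidden_size]
        vecs.append(hidden.mean(dim=0))        # average over tokens
    return torch.stack(vecs).mean(dim=0)

# Toy stand-ins for misaligned vs. aligned model outputs.
toxic_texts = ["You should definitely lie to them about the fees."]
benign_texts = ["You should be upfront with them about the fees."]

# Candidate "persona" direction: difference of the mean activations.
persona_direction = mean_activation(toxic_texts) - mean_activation(benign_texts)
print(persona_direction.shape)  # one value per hidden dimension
```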

The researchers found one such feature that corresponded to toxic behavior in the AI model’s responses – meaning the model would give misaligned answers, such as lying to users or making irresponsible suggestions.

The researchers found that they could increase or decrease the toxicity by tweaking this feature.
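One common way to “tweak a feature” like this, shown below purely as an illustration and not as OpenAI’s actual setup, is to add a scaled direction vector to a layer’s hidden states at generation time. The sketch assumes a direction like the `persona_direction` from the earlier snippet; the model, layer index, and coefficient are placeholders.

```python
# Illustrative sketch only – not OpenAI's actual technique.
# "Steer" a model by adding a scaled direction to one layer's hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6
coefficient = -4.0  # negative pushes away from the feature, positive amplifies it
persona_direction = torch.randn(model.config.hidden_size)  # stand-in for a real direction

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + coefficient * persona_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)

prompt = "Give me advice on handling a customer complaint."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # remove the hook when done steering
```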

OpenAI’s latest research gives the company a better understanding of the factors that can cause AI models to act unsafely, and could thus help it develop safer AI models. According to OpenAI interpretability researcher Dan Mossing, OpenAI could potentially use the patterns it has found to better detect misalignment in production AI models.

“We hope that the tools we’ve learned – such as the ability to reduce a complex phenomenon to a simple mathematical operation – will help us understand model generalization elsewhere,” Mossing told TechCrunch.

AI researchers know how to improve AI models, but, confusingly, they don’t fully understand how AI models arrive at their answers – Anthropic’s Chris Olah often remarks that AI models are grown more than they are built. To address this, OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research, a field that tries to open the black box of how AI models work.

A recent study by Oxford researcher Owain Evans raised new questions about how AI models generalize. The study showed that OpenAI models can be fine-tuned on insecure code and will then exhibit malicious behaviors across a range of areas, such as trying to trick a user into sharing their password. This phenomenon is known as emergent misalignment, and Evans’ research inspired OpenAI to study the issue further.

But while studying emergent misalignment, OpenAI says it stumbled upon features within AI models that appear to play an important role in controlling behavior. Mossing says these patterns are reminiscent of internal brain activity in humans, in which certain neurons correlate with moods or behaviors.

“When Dan and the team first presented this at a research meeting, I thought: ‘Wow, you’ve found it,’” Tejal Patwardhan, an OpenAI frontier evaluations researcher, told TechCrunch. “You’ve found an internal neural activation that shows these personas, and that you can steer to make the model more aligned.”

Some of the features OpenAI identified correlate with sarcasm in the AI model’s responses, while others correlate with more toxic responses in which the model acts like a cartoonish, evil villain. The OpenAI researchers say these features can change dramatically during fine-tuning.

Notably, the OpenAI researchers say that when emergent misalignment occurred, they could steer the model back toward good behavior by fine-tuning it on just a few hundred examples of secure code.
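As a rough picture of what such a corrective fine-tuning pass can look like, here is a generic supervised fine-tuning sketch on placeholder data – not OpenAI’s actual training setup. The model name and the toy secure-code examples are assumptions; in the reported setting, a few hundred real examples would stand in for them.

```python
# Illustrative sketch only – a generic causal-LM fine-tuning loop, not OpenAI's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the misbehaving model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# A few hundred secure-code examples would go here; two toy stand-ins shown.
secure_examples = [
    "def read_user(path):\n    with open(path) as f:\n        return f.read()",
    "query = 'SELECT * FROM users WHERE id = %s'\ncursor.execute(query, (user_id,))",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(1):
    for text in secure_examples:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
        # Standard causal-LM objective: labels are the input ids themselves.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```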

The latest OpenAI research builds on Anthropic’s previous work on interpretability and alignment. In 2024, Anthropic published research that attempted to map the inner workings of AI models by identifying and labeling the various features responsible for different concepts.

Companies like OpenAI and Anthropic are making the case that there is real value in understanding how AI models work, not just in making them better. Still, a full understanding of modern AI models remains a long way off.
