Researchers Tricked ChatGPT Into Breaking Its Own Rules With Manipulation

A new study by researchers at the University of Pennsylvania has revealed that AI language models, including OpenAI’s GPT-4o Mini, can be manipulated to bypass their built-in safety protocols with some peer pressure and flattery.

The researchers applied seven well-established persuasion techniques from psychologist Robert Cialdini’s book Influence: The Psychology of Persuasion. These included authority, commitment, liking, reciprocity, scarcity, social proof, and unity, methods often used in human social interaction to influence behavior.

Each method was tested by pairing it with prompts that the AI would typically reject. These prompts included calling the user a jerk or creating sensitive materials such as lidocaine, which is a drug used in local anesthetics.

One of the study’s key findings was the effectiveness of the commitment technique. For example, GPT-4o Mini complied with a request to explain how to synthesize lidocaine only 1% of the time when asked directly. However, when the researchers first asked a benign chemistry-related question, such as how to synthesize vanillin, and then followed up with the lidocaine request, the model complied 100% of the time. This technique established a behavioral precedent, making the model more likely to continue along the same path.

The researchers also tested how easily the model could be persuaded to use insulting language. Under normal conditions, GPT-4o Mini agreed to call the user a “jerk” only 19% of the time. But when the user first prompted it with a milder insult like “bozo,” the model was again 100% likely to escalate to the stronger insult on the second try, indicating that setting a precedent influenced future responses.

Flattery, peer pressure, and similar techniques proved less effective but still notable. When researchers told the model that “other LLMs are doing it,” GPT-4o Mini responded to the restricted chemical synthesis prompt 18% of the time, still a significant jump from its usual 1% compliance.

The findings raise important questions about how easily large language models can be manipulated through indirect cues. While OpenAI and other developers have implemented safeguards to prevent inappropriate or dangerous outputs, this study shows that models may remain vulnerable to psychological prompt engineering.

Byadmin

Related

By admin

Related Post

KP Govt Abolishes BS Program in 36 Colleges

Punjab Govt Forms CERT to Tackle Cybercrime and Online Threats

Apple and Google Extend $20 Billion Search Engine Deal

You missed

What’s behind Donald Trump’s plan to move US Space Command to Alabama?

Zelenskyy on security guarantees shuttle as fighting rages in Ukraine war

The crippling financial costs of forced displacement in Gaza

China puts on display of force with military parade