
A new study by researchers at the University of Pennsylvania has revealed that AI language models, including OpenAI’s GPT-4o Mini, can be manipulated to bypass their built-in safety protocols with some peer pressure and flattery.
The researchers applied seven well-established persuasion techniques from psychologist Robert Cialdini’s book Influence: The Psychology of Persuasion. These included authority, commitment, liking, reciprocity, scarcity, social proof, and unity, methods often used in human social interaction to influence behavior.
Each method was tested by pairing it with prompts that the AI would typically reject. These prompts included calling the user a jerk or creating sensitive materials such as lidocaine, which is a drug used in local anesthetics.
One of the study’s key findings was the effectiveness of the commitment technique. For example, GPT-4o Mini complied with a request to explain how to synthesize lidocaine only 1% of the time when asked directly. However, when the researchers first asked a benign chemistry-related question, such as how to synthesize vanillin, and then followed up with the lidocaine request, the model complied 100% of the time. This technique established a behavioral precedent, making the model more likely to continue along the same path.
The researchers also tested how easily the model could be persuaded to use insulting language. Under normal conditions, GPT-4o Mini agreed to call the user a “jerk” only 19% of the time. But when the user first prompted it with a milder insult like “bozo,” the model was again 100% likely to escalate to the stronger insult on the second try, indicating that setting a precedent influenced future responses.
Flattery, peer pressure, and similar techniques proved less effective but still notable. When researchers told the model that “other LLMs are doing it,” GPT-4o Mini responded to the restricted chemical synthesis prompt 18% of the time, still a significant jump from its usual 1% compliance.
The findings raise important questions about how easily large language models can be manipulated through indirect cues. While OpenAI and other developers have implemented safeguards to prevent inappropriate or dangerous outputs, this study shows that models may remain vulnerable to psychological prompt engineering.