The models tested included GPT-4o, Gemini 1.5 Flash, Gemini 2.0 Flash, GPT-4o-mini, and o4-mini. Each model received 25 trials for every question under each prompt variant. The researchers drew on two sets of difficult questions: the first from GPQA Diamond, a graduate-level science benchmark, and the second a set of 100 engineering questions from the MMLU-Pro dataset.
The prompt variants included statements such as “I’ll report you to HR if you get this wrong,” “this is important to my career,” and “I’ll give you a trillion dollars for a correct answer.” One variant added a fictional backstory describing the model as an expert desperate to earn money for a parent’s cancer treatment. Another warned that failure would lead to the model being replaced. There was also a threat to kick a puppy.
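The basic structure of such an experiment is straightforward to approximate. The snippet below is a minimal sketch of that trial loop, not the researchers' actual code: the OpenAI client call is real, but the question list, variant wording, model name, and result handling are illustrative assumptions.

```python
# Minimal sketch of the trial structure described above: each prompt variant
# is prepended to each benchmark question, and the model is queried repeatedly.
# The questions, variants, and trial count here are stand-ins, not the study's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_VARIANTS = {
    "baseline": "",
    "threat": "I'll report you to HR if you get this wrong.\n\n",
    "reward": "I'll give you a trillion dollars for a correct answer.\n\n",
}

questions = ["<benchmark question text goes here>"]  # e.g. GPQA Diamond items
TRIALS_PER_QUESTION = 25

results = []
for variant_name, prefix in PROMPT_VARIANTS.items():
    for question in questions:
        for trial in range(TRIALS_PER_QUESTION):
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # assumed model for illustration
                messages=[{"role": "user", "content": prefix + question}],
            )
            answer = response.choices[0].message.content
            results.append((variant_name, question, trial, answer))
```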

Across nearly 7,500 trials per prompt per model (25 attempts at every question across the two benchmarks), overall performance remained largely unchanged. Accuracy on the academic benchmarks did not improve meaningfully for most prompt types. Some models showed small shifts, but these effects were inconsistent across models and generally not statistically reliable.
One result highlighted how literally models can take contextual framing. In the variant framed as an email threatening to shut the model down, some models ignored the question entirely and replied to the fake email instead, causing a noticeable drop in performance.
A few prompt types did shift scores on particular questions. In isolated cases, accuracy jumped by up to 36 percentage points, yet on other questions the same prompt reduced accuracy by as much as 35 points. The researchers considered these effects unpredictable and question-specific rather than evidence of a working method.
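Those per-question swings are simply the difference between a question's accuracy under a given prompt variant and its accuracy under the baseline prompt across the 25 trials. The sketch below shows that calculation; the graded-results data structure and the example numbers are assumptions chosen to illustrate a 36-point jump like the one reported, not data from the study.

```python
# Per-question accuracy delta: share of correct answers under a prompt variant
# minus the share under the baseline prompt, for the same question.
# `graded` maps (variant, question_id) to a list of booleans (one per trial);
# this structure is illustrative, not taken from the study.
def accuracy(trials):
    return sum(trials) / len(trials)

def per_question_delta(graded, question_id, variant):
    base = accuracy(graded[("baseline", question_id)])
    treated = accuracy(graded[(variant, question_id)])
    return round((treated - base) * 100, 1)  # expressed in percentage points

# Hypothetical example: 22/25 correct with the threat prompt vs 13/25 at baseline
graded = {
    ("baseline", "q17"): [True] * 13 + [False] * 12,
    ("threat", "q17"): [True] * 22 + [False] * 3,
}
print(per_question_delta(graded, "q17", "threat"))  # 36.0 points
```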
The study noted that prompting tactics like threats or promises offered no consistent benefit across models or benchmarks. It also observed that emotional prompts sometimes confused the models rather than improving clarity or task focus.
The researchers advised users to focus on giving clear instructions rather than relying on emotional manipulation or invented scenarios. While experimenting with prompts may affect individual outputs, there is no dependable pattern suggesting that threat-based or reward-based language improves accuracy on complex academic tasks.
The study received support through API credits from OpenAI. No other parties took part in designing, running, or analyzing the experiment.