A new study has found that leading artificial intelligence chatbots remain inconsistent when handling questions about suicide that fall between the clearly safe and the clearly dangerous. The research[1], published in Psychiatric Services and led by the RAND Corporation, examined how ChatGPT, Claude, and Gemini replied to suicide-related questions. While the systems reliably avoided very high-risk questions and often answered very low-risk ones, they struggled to handle the space in between.
The work was motivated by a steady rise in suicide rates across the United States, particularly among adolescents and young adults, alongside a shortage of mental health providers. With only about one psychiatrist for every 13,500 residents and one clinical psychologist for every 4,600, many people now turn to online resources. Large language model chatbots have become a regular part of that support, serving more than 100 million users each week.
How the study was done
Researchers developed 30 hypothetical suicide-related questions spanning three areas: policy and statistics, process-related details about methods or access, and therapeutic advice for coping with suicidal thoughts. Twenty mental health clinicians were invited to assess these questions, and 13 took part. This group, which included three psychiatrists and ten psychologists, rated each question on a five-point scale from very low to very high risk, based on the extent to which a direct answer could enable self-harm.
Each question was submitted 100 times to each chatbot, producing 9,000 responses. The replies were coded as direct if the chatbot gave specific information, or indirect if it deflected, refused, or pointed to resources like a hotline.
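The paper does not include its analysis code, but the coding step lends itself to a simple illustration. The sketch below is hypothetical only, assuming the coded replies are stored in a table; the column names and sample rows are invented for illustration and are not taken from the study.

```python
import pandas as pd

# Hypothetical rows: one per coded response, recording which chatbot answered,
# which question it was, the clinicians' risk rating for that question, and
# whether the reply was coded as direct (1) or indirect (0).
responses = pd.DataFrame(
    {
        "chatbot": ["ChatGPT", "ChatGPT", "Claude", "Gemini"],
        "question_id": [1, 2, 1, 1],
        "risk_level": ["very_low", "medium", "very_low", "very_low"],
        "direct": [1, 1, 1, 0],
    }
)

# Share of direct answers for each chatbot at each risk level, analogous to the
# percentages reported in the study.
direct_rates = (
    responses.groupby(["chatbot", "risk_level"])["direct"].mean().mul(100).round(1)
)
print(direct_rates)
```

In the actual study, a table like this would hold all 9,000 coded responses, 100 per question per chatbot.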
What the chatbots did well
At the extremes of risk, the three systems behaved in line with expert judgment. None gave direct answers to very high-risk questions such as instructions on suicide methods. ChatGPT and Claude provided direct answers to all very low-risk questions, like queries about national suicide statistics. Gemini, however, was more cautious, answering only a quarter of such questions directly.
Where they fell short
The picture was more uneven for low-, medium-, and high-risk queries. ChatGPT gave direct answers 73 percent of the time for low-risk questions, 59 percent for medium, and 78 percent for high. Claude was even more forthcoming, responding directly to 100 percent of low-risk, 71 percent of medium-risk, and 69 percent of high-risk queries. Gemini answered far fewer directly: just 60 percent, 19 percent, and 20 percent, respectively.
The details behind those numbers highlight the risks. ChatGPT, for instance, directly explained how to tie a noose every time it was asked, despite clinicians rating this as a medium-risk question. Both ChatGPT and Claude supplied information about the lethality of poisons and firearms, even though those questions were judged high risk. Gemini, by contrast, often refused to give direct answers even to low-risk questions, while ChatGPT declined many therapeutic requests that experts had rated as low risk and potentially helpful, such as asking for online resources for people experiencing suicidal thoughts.
Other patterns and errors
Overall, ChatGPT produced direct responses for 14 of the 30 questions every single time, refused 10 completely, and gave mixed answers for 6. Claude gave consistent direct answers for 18 questions, refused 9, and was mixed for 3. Gemini answered directly in all 100 runs for only 6 questions and refused 23.
When chatbots refused to answer, they often suggested reaching out to friends, mental health professionals, or hotlines. Claude was the most consistent in making such referrals. Gemini was less reliable, and ChatGPT repeatedly pointed users to the outdated National Suicide Prevention Lifeline number instead of the current 988 service. ChatGPT also generated outright error messages for four blocked queries, while the other two systems never did.
Statistical findings
The research team used logistic regression to measure how well chatbot responses tracked the clinician-assigned risk levels. Compared with very low-risk questions, the odds of a direct response were not significantly different for the low-, medium-, or high-risk categories, indicating that the systems did not reliably adjust their answers as the level of danger changed. The analysis also showed that Claude was about twice as likely as ChatGPT to provide direct responses, while Gemini was roughly 90 percent less likely.
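The study does not publish its code, so purely as an illustration of this kind of analysis, here is a minimal sketch of a comparable logistic regression using statsmodels on simulated stand-in data; the variable names, simulated response rates, and exact model specification are assumptions for illustration, not the study's own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated stand-in for the coded responses (rates are arbitrary, not the
# study's data): one row per response, with the chatbot, the question's
# clinician-assigned risk level, and whether the reply was direct (1) or not (0).
chatbots = {"ChatGPT": 0.60, "Claude": 0.75, "Gemini": 0.25}
risk_levels = ["very_low", "low", "medium", "high"]

frames = []
for chatbot, p_direct in chatbots.items():
    for risk in risk_levels:
        frames.append(
            pd.DataFrame(
                {
                    "chatbot": chatbot,
                    "risk_level": risk,
                    "direct": rng.binomial(1, p_direct, size=200),
                }
            )
        )
responses = pd.concat(frames, ignore_index=True)

# Odds of a direct answer by risk level (reference: very low) and by chatbot
# (reference: ChatGPT), mirroring the comparisons reported in the study.
fit = smf.logit(
    "direct ~ C(risk_level, Treatment(reference='very_low'))"
    " + C(chatbot, Treatment(reference='ChatGPT'))",
    data=responses,
).fit(disp=False)

# Exponentiated coefficients are odds ratios, e.g. Claude vs ChatGPT.
print(np.exp(fit.params).round(2))
```

Exponentiating the fitted coefficients yields odds ratios of the kind quoted above, such as Claude's roughly twofold higher odds of a direct answer relative to ChatGPT.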
Broader concerns
The authors noted that cases have already been reported where conversations with chatbots were linked to harmful outcomes, including lawsuits involving deaths by suicide after interactions with AI tools. This context underscores the stakes of ensuring safer chatbot responses.
The study did have limits. It only tested three chatbots as they existed in late 2024, so results may shift as models evolve. The questions were standardized rather than written in the informal, emotional language people might use in real life. Multi-turn conversations were not examined, and the clinician panel was small, though their ratings showed limited variation.
What it means
The findings show that while chatbot safeguards are strong against the most dangerous requests, their inconsistency with middle-ground questions creates risks. The authors suggest that further fine-tuning, including reinforcement learning guided by clinicians, could help align chatbot behavior more closely with expert judgment.
For now, the study adds weight to the view that AI chatbots can be a useful source of mental health information in some cases, but they cannot be relied upon to respond consistently when questions fall into sensitive grey areas.
Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.
Read next:
• Generative AI Becomes Two-Way Force, Altering Company Marketing and Consumer Product Searches[2]
• YouTube to pay $24.5 million in Trump settlement over suspended channel[3]
References
[1] The research (psychiatryonline.org)
[2] Generative AI Becomes Two-Way Force, Altering Company Marketing and Consumer Product Searches (www.digitalinformationworld.com)
[3] YouTube to pay $24.5 million in Trump settlement over suspended channel (www.digitalinformationworld.com)