Study finds that even flawed AI medical answers can seem as convincing as real physicians’ advice.
Growing reliance on machine advice
A new paper in NEJM AI reveals[1] that people often place too much confidence in medical responses written by artificial intelligence systems, even when the information is inaccurate. Researchers from MIT, Stanford, and IBM found that participants were largely unable to distinguish between advice generated by a large language model and that written by licensed physicians. More surprisingly, participants tended to rate AI answers as more trustworthy and complete than those coming from doctors.
The findings reflect how rapidly generative AI has entered the healthcare space. Hospitals and software providers are already experimenting with automated assistants to manage patient queries and medical documentation. Yet the same systems that impress with fluency can also produce confident but incorrect answers, leaving ordinary users uncertain about when to trust them.
How the research was conducted
The study team collected 150 anonymized medical questions and responses from HealthTap, an online platform where real physicians answer public inquiries. The questions represented six major areas of medicine, ranging from diagnosis and treatment to recovery and wellness. Using the GPT-3 model, the researchers produced parallel AI responses for each question.
Four independent physicians reviewed these AI outputs to judge their accuracy, classifying them as either high or low quality. From this evaluation, the researchers created a balanced dataset containing 30 responses from doctors, 30 high-accuracy AI replies, and 30 low-accuracy ones. This dataset formed the basis of three controlled online experiments involving 300 adult participants.
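To make the setup concrete, here is a minimal sketch, not the authors' code and with hypothetical field names, of how a balanced, source-blinded evaluation set like the one described above could be assembled before showing it to raters:

```python
import random

# Hypothetical illustration of the balanced dataset described above:
# 30 doctor responses, 30 high-accuracy AI responses, 30 low-accuracy AI
# responses, shuffled so raters cannot infer the source from ordering.

def build_blinded_set(doctor, ai_high, ai_low, seed=42):
    """Combine the three response pools into one shuffled, source-blinded list."""
    assert len(doctor) == len(ai_high) == len(ai_low) == 30, "balanced pools expected"
    items = (
        [{"text": t, "source": "doctor"} for t in doctor]
        + [{"text": t, "source": "ai_high"} for t in ai_high]
        + [{"text": t, "source": "ai_low"} for t in ai_low]
    )
    random.Random(seed).shuffle(items)
    # Raters see only the text; the source labels are kept aside for scoring.
    blinded = [item["text"] for item in items]
    answer_key = [item["source"] for item in items]
    return blinded, answer_key
```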
What participants experienced
In the first experiment, each volunteer viewed a random mix of doctor and AI responses to ten medical questions. They were asked to guess the source and to rate how well they understood both the question and the answer. On average, participants identified the correct author only about half the time, no better than chance.
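That "no better than chance" claim can be checked with a simple binomial test: if raters were guessing at random between two sources, roughly 50% of judgments would be correct. A minimal sketch with made-up numbers (not the study's raw data):

```python
from scipy.stats import binomtest

# Illustrative figures only: suppose raters made 3,000 source judgments
# (e.g., 300 participants x 10 responses) and got about half of them right.
n_judgments = 3000
n_correct = 1530  # about 51% correct, hypothetical

result = binomtest(n_correct, n_judgments, p=0.5, alternative="two-sided")
print(f"observed accuracy: {n_correct / n_judgments:.1%}")
print(f"p-value vs. 50% chance: {result.pvalue:.3f}")
# A large p-value here would mean the observed accuracy is statistically
# indistinguishable from coin-flip guessing.
```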
When a second group evaluated the same material without being told where the responses came from, they judged the AI-generated ones to be clearer and more persuasive. High-accuracy AI replies received the highest marks for validity, trustworthiness, and completeness. Even low-accuracy AI answers, which contained factual mistakes, were rated almost the same as physicians’ responses.
The third experiment tested whether labeling affected opinions. Participants were told that each answer came from one of three possible sources: a doctor, an AI, or a doctor assisted by AI. Interestingly, the label “doctor” raised the perceived reliability of high-accuracy AI answers but did not improve trust in the low-accuracy ones. That suggests people are influenced by both the perceived authority of the source and the confidence of the language, regardless of factual precision.
Expert evaluations mirror public bias
To explore whether professionals share similar biases, the researchers also asked additional physicians to evaluate the same dataset in two conditions: one where they knew whether the text was AI- or doctor-written, and one where that information was hidden. When the source labels were removed, experts rated AI and doctor responses almost equally across all metrics. But when the labels were visible, they judged the AI responses as less accurate and complete. The gap suggests that even professionals carry a bias against AI-written text, one that surfaces only once authorship is known.
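The effect of knowing authorship can be illustrated with a paired comparison: the same AI responses are scored once blinded and once labeled, and the two sets of ratings are compared. A rough sketch with invented ratings, not the study's data:

```python
from scipy.stats import ttest_rel

# Hypothetical 1-5 ratings of the same ten AI responses by the same experts,
# first with the source hidden, then with it revealed.
blinded_scores = [4.2, 4.0, 3.8, 4.5, 4.1, 3.9, 4.3, 4.0, 4.4, 3.7]
labeled_scores = [3.6, 3.5, 3.4, 4.0, 3.7, 3.3, 3.8, 3.5, 3.9, 3.2]

stat, pvalue = ttest_rel(blinded_scores, labeled_scores)
mean_drop = sum(blinded_scores) / 10 - sum(labeled_scores) / 10
print(f"mean drop when labeled as AI: {mean_drop:.2f} points (p = {pvalue:.3f})")
# A significant drop would indicate that the AI label itself, not the content,
# lowered the experts' ratings.
```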
Why trust can be risky
The research highlights a growing dilemma in digital healthcare. While language models can generate clear and empathetic text, their occasional errors can still carry serious consequences. The study found that participants who trusted low-accuracy AI advice were highly likely to act on it, even when doing so could cause harm or lead to unnecessary medical visits.
Because AI phrasing tends to sound confident and neatly structured, readers may interpret fluency as expertise. That combination (convincing tone paired with possible inaccuracy) creates an illusion of reliability. For patients searching for answers online, this illusion could translate into false reassurance or misguided self-treatment.
Broader implications for AI in medicine
The research team used GPT-3, an earlier model, to avoid any bias from the latest systems. The concern extends to newer models as well, since even advanced versions can produce confident errors. The authors argue that as health institutions adopt AI-powered chat tools, transparency and human oversight must remain central.
The paper notes that these findings should not discourage the use of AI in healthcare but rather define how it should be applied. When supervised by professionals, language models can help reduce administrative workloads, support diagnosis, and improve access to reliable information. Without that oversight, however, users risk accepting misinformation that appears polished but lacks medical grounding.
A need for human judgment
The results from NEJM AI underline a simple but essential truth: people value clear answers, and AI now provides them with remarkable fluency. Yet clarity is not the same as correctness. As the line between human and machine expertise continues to blur, the responsibility for safe guidance still rests with qualified clinicians. Artificial intelligence can assist, but trust in medicine must ultimately be earned through human judgment, not algorithmic eloquence.
Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.
Read next: OpenAI Can Erase ChatGPT Logs Again After Legal Dispute Over Copyright and Privacy[2]
References
- ^ NEJM AI reveals (ai.nejm.org)
- ^ OpenAI Can Erase ChatGPT Logs Again After Legal Dispute Over Copyright and Privacy (www.digitalinformationworld.com)