Study finds that even flawed AI medical answers can seem as convincing as real physicians’ advice.
Growing reliance on machine advice
A new paper in NEJM AI reveals[1] that people often place too much confidence in medical responses written by artificial intelligence systems, even when the information is inaccurate. Researchers from MIT, Stanford, and IBM found that participants were largely unable to distinguish between advice generated by a large language model and that written by licensed physicians. More surprisingly, participants tended to rate AI answers as more trustworthy and complete than those coming from doctors.
The findings reflect how rapidly generative AI has entered the healthcare space. Hospitals and software providers are already experimenting with automated assistants to manage patient queries and medical documentation. Yet the same systems that impress with fluency can also produce confident but incorrect answers, leaving ordinary users uncertain about when to trust them.
How the research was conducted
The study team collected 150 anonymized medical questions and responses from HealthTap, an online platform where real physicians answer public inquiries. The questions represented six major areas of medicine, ranging from diagnosis and treatment to recovery and wellness. Using the GPT-3 model, the researchers produced parallel AI responses for each question.
Four independent physicians reviewed these AI outputs to judge their accuracy, classifying them as either high or low quality. From this evaluation, the researchers created a balanced dataset containing 30 responses from doctors, 30 high-accuracy AI replies, and 30 low-accuracy ones. This dataset formed the basis of three controlled online experiments involving 300 adult participants.
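To make the setup concrete, here is a minimal sketch, not the authors' code and with hypothetical field names, of how a balanced, source-blinded evaluation set like the one described above could be assembled before showing it to raters:

```python
import random

# Hypothetical illustration of the balanced dataset described above:
# 30 doctor responses, 30 high-accuracy AI responses, 30 low-accuracy AI
# responses, shuffled so raters cannot infer the source from ordering.

def build_blinded_set(doctor, ai_high, ai_low, seed=42):
    """Combine the three response pools into one shuffled, source-blinded list."""
    assert len(doctor) == len(ai_high) == len(ai_low) == 30, "balanced pools expected"
    items = (
        [{"text": t, "source": "doctor"} for t in doctor]
        + [{"text": t, "source": "ai_high"} for t in ai_high]
        + [{"text": t, "source": "ai_low"} for t in ai_low]
    )
    random.Random(seed).shuffle(items)
    # Raters see only the text; the source labels are kept aside for scoring.
    blinded = [item["text"] for item in items]
    answer_key = [item["source"] for item in items]
    return blinded, answer_key
```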
What participants experienced
In the first experiment, each volunteer viewed a random mix of doctor and AI responses to ten medical questions. They were asked to guess the source and to rate how well they understood both the question and the answer. On average, participants identified the correct author only about half the time, no better than chance.
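That "no better than chance" claim can be checked with a simple binomial test: if raters were guessing at random between two sources, roughly 50% of judgments would be correct. A minimal sketch with made-up numbers (not the study's raw data):

```python
from scipy.stats import binomtest

# Illustrative figures only: suppose raters made 3,000 source judgments
# (e.g., 300 participants x 10 responses) and got about half of them right.
n_judgments = 3000
n_correct = 1530  # about 51% correct, hypothetical

result = binomtest(n_correct, n_judgments, p=0.5, alternative="two-sided")
print(f"observed accuracy: {n_correct / n_judgments:.1%}")
print(f"p-value vs. 50% chance: {result.pvalue:.3f}")
# A large p-value here would mean the observed accuracy is statistically
# indistinguishable from coin-flip guessing.
```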
When a second group evaluated the same material without being told where the responses came from, they judged the AI-generated ones to be clearer and more persuasive. High-accuracy AI replies received the highest marks for validity, trustworthiness, and completeness. Even low-accuracy AI answers, which contained factual mistakes, were rated almost the same as physicians’ responses.
The third experiment tested whether labeling affected opinions. Participants were told that each answer came from one of three possible sources: a doctor, an AI, or a doctor assisted by AI. Interestingly, the label “doctor” raised the perceived reliability of high-accuracy AI answers but did not improve trust in the low-accuracy ones. That suggests people are influenced by both the perceived authority of the source and the confidence of the language, regardless of factual precision.
Expert evaluations mirror public bias
To explore whether professionals share similar biases, the researchers also asked additional physicians to evaluate the same dataset in two conditions: one where they knew whether the text was AI- or doctor-written, and one where that information was hidden. When the source labels were removed, experts rated AI and doctor responses almost equally across all metrics. But when the labels were visible, they judged the AI responses as less accurate and complete. The gap suggests that even professionals carry a bias against AI-written text, one that surfaces only once authorship is known.
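The effect of knowing authorship can be illustrated with a paired comparison: the same AI responses are scored once blinded and once labeled, and the two sets of ratings are compared. A rough sketch with invented ratings, not the study's data:

```python
from scipy.stats import ttest_rel

# Hypothetical 1-5 ratings of the same ten AI responses by the same experts,
# first with the source hidden, then with it revealed.
blinded_scores = [4.2, 4.0, 3.8, 4.5, 4.1, 3.9, 4.3, 4.0, 4.4, 3.7]
labeled_scores = [3.6, 3.5, 3.4, 4.0, 3.7, 3.3, 3.8, 3.5, 3.9, 3.2]

stat, pvalue = ttest_rel(blinded_scores, labeled_scores)
mean_drop = sum(blinded_scores) / 10 - sum(labeled_scores) / 10
print(f"mean drop when labeled as AI: {mean_drop:.2f} points (p = {pvalue:.3f})")
# A significant drop would indicate that the AI label itself, not the content,
# lowered the experts' ratings.
```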
Why trust can be risky
The research highlights a growing dilemma in digital healthcare. While language models can generate clear and empathetic text, their occasional errors can still carry serious consequences. The study found that participants who trusted low-accuracy AI advice were highly likely to act on it, even when doing so could cause harm or lead to unnecessary medical visits.
Because AI phrasing tends to sound confident and neatly structured, readers may interpret fluency as expertise. That combination (convincing tone paired with possible inaccuracy) creates an illusion of reliability. For patients searching for answers online, this illusion could translate into false reassurance or misguided self-treatment.
Broader implications for AI in medicine
The research team used GPT-3, an earlier model, to avoid any bias from the latest systems. The concern extends to newer models as well, since even advanced versions can produce confident errors. The authors argue that as health institutions adopt AI-powered chat tools, transparency and human oversight must remain central.
The paper notes that these findings should not discourage the use of AI in healthcare but rather define how it should be applied. When supervised by professionals, language models can help reduce administrative workloads, support diagnosis, and improve access to reliable information. Without that oversight, however, users risk accepting misinformation that appears polished but lacks medical grounding.
A need for human judgment
The results from NEJM AI underline a simple but essential truth: people value clear answers, and AI now provides them with remarkable fluency. Yet clarity is not the same as correctness. As the line between human and machine expertise continues to blur, the responsibility for safe guidance still rests with qualified clinicians. Artificial intelligence can assist, but trust in medicine must ultimately be earned through human judgment, not algorithmic eloquence.
Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.
Read next: OpenAI Can Erase ChatGPT Logs Again After Legal Dispute Over Copyright and Privacy[2]
References
- ^ NEJM AI reveals (ai.nejm.org)
- ^ OpenAI Can Erase ChatGPT Logs Again After Legal Dispute Over Copyright and Privacy (www.digitalinformationworld.com)