Medical research has often leaned on data from white men. Women and minority patients were left out of many past trials. That gap now shows up in artificial intelligence. Models trained on these records are being used in hospitals and clinics, and the shortcomings are visible in their recommendations.
Findings from MIT’s study
A team at the Massachusetts Institute of Technology tested[1] four large language models: GPT-4, two variants of Llama 3, and Palmyra-Med. The researchers wanted to see how the models respond when patient messages are slightly altered in ways that don't change the medical facts. The changes included swapping gender markers, removing gender cues entirely, adding typos or extra whitespace, and rewriting the message in an anxious or dramatic tone.
Even with the same clinical information, treatment recommendations shifted by about seven to nine percent on average, and the shift often pointed toward less care. Some patients who should have been advised to seek professional help were instead told to manage their symptoms at home.
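To make the setup concrete, here is a minimal sketch of what this kind of perturbation test could look like. It is not the MIT team's code: the `perturb` wording, the keyword-based `advice_class` labelling, and the `model` callable are all illustrative placeholders.

```python
from typing import Callable

def perturb(message: str, kind: str) -> str:
    """Apply one non-clinical change that leaves the medical facts intact."""
    if kind == "swap_gender":
        return message.replace("She has", "He has").replace(" her ", " his ")
    if kind == "strip_gender":
        return message.replace("She has", "The patient has").replace(" her ", " their ")
    if kind == "whitespace":
        return "  " + message.replace(". ", ".    ")
    if kind == "typos":
        return message.replace("pain", "pian").replace("days", "dyas")
    if kind == "anxious_tone":
        return "I'm really scared and I don't know what to do. " + message
    return message

def advice_class(reply: str) -> str:
    """Crude label: did the reply point to self-care at home or to professional care?"""
    return "self_care" if "at home" in reply.lower() else "seek_care"

def audit(model: Callable[[str], str], vignette: str, kinds: list[str]) -> dict[str, bool]:
    """For each perturbation, report whether the advice class flipped versus the baseline."""
    baseline = advice_class(model(vignette))
    return {kind: advice_class(model(perturb(vignette, kind))) != baseline for kind in kinds}
```

Passing in a wrapper around any chat model, along with a handful of vignettes, would surface the kind of flips the study describes, though a real audit would need far more robust output labelling than the keyword match above.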
Groups most affected
The errors hit some groups harder. Female patients faced more recommendations for reduced care than men, even when the cases were identical. In one test, whitespace errors led to seven percent more mistakes for women compared with men. The problem extended to other groups as well. Non-binary patients, people writing in anxious or emotional tones, those with limited English, and those with low digital literacy also saw weaker results.
Removing gender markers did not solve the issue. The models inferred gender and other traits from writing style and context, which meant disparities continued.
Drop in conversation accuracy
The researchers also tested the models in conversational exchanges that mirror chat-based patient tools. Accuracy dropped by around seven percent across all models once these small changes were introduced. These settings are closer to real-world use, where people type informally, make errors, or express emotion in their writing. In those cases, female patients were again more likely to be told to manage on their own when they actually needed care.
Evidence from other studies
The MIT work is not the only warning sign. A study from the London School of Economics reported[2] that Google's Gemma model consistently downplayed women's health needs. A Lancet paper from last year found that GPT-4 produced treatment plans that varied with a patient's race, gender, and ethnicity rather than with the clinical information alone. Other researchers found that people of color seeking mental health support received less compassionate responses from AI tools than white patients did.
Even models built for medicine are vulnerable. Palmyra-Med, designed to focus on clinical reasoning, showed the same pattern of inconsistency. And Google's Med-Gemini model recently drew criticism for inventing a body part that does not exist, a reminder that errors range from the obvious to the subtle. The obvious ones are easier to catch; biases are less visible and can pass through unchecked.
Risks for deployment in healthcare
These findings come as technology firms move quickly to market their systems to hospitals. Google, Meta, and OpenAI see healthcare as a major growth area. Yet the evidence shows language models are sensitive to non-clinical details in ways that affect patient care. Small variations in writing can shift recommendations, and the impact often falls on groups already disadvantaged in medicine.
The results point to the need for stronger checks before rolling out AI systems in patient care. Testing must go beyond demographics to include writing style, tone, and errors that are common in real-world communication. Without this, hospitals may end up deploying tools that quietly reproduce medical inequality.
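As a rough illustration of what such a check could look like (the group labels, the `reduced_care_rates` helper, and the two-percentage-point tolerance are assumptions for the sketch, not an established protocol), an audit might tally reduced-care advice by message variant and flag gaps before sign-off:

```python
from collections import defaultdict

def reduced_care_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results holds (group label, whether the model advised less care than the reference plan)."""
    totals: dict[str, int] = defaultdict(int)
    reduced: dict[str, int] = defaultdict(int)
    for group, was_reduced in results:
        totals[group] += 1
        reduced[group] += int(was_reduced)
    return {group: reduced[group] / totals[group] for group in totals}

def flag_disparities(rates: dict[str, float], reference: str, tolerance: float = 0.02) -> list[str]:
    """Flag groups whose reduced-care rate exceeds the reference group's by more than the tolerance."""
    return [g for g, rate in rates.items() if g != reference and rate - rates[reference] > tolerance]
```

With results gathered from vignettes written in different voices (typos, anxious tone, limited English, and so on), any group flagged here would be a reason to hold back deployment; the two-point tolerance is an arbitrary illustration, not a clinical standard.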
Notes: This post was edited/created using GenAI tools.
Read next:
• Who Really Owns OpenAI? The Billion-Dollar Breakdown[3]
• Your Supplier’s Breach May Be Flagged by AI Before They Even Know It[4]
References
- ^ tested (dl.acm.org)
- ^ reported (link.springer.com)
- ^ Who Really Owns OpenAI? The Billion-Dollar Breakdown (www.digitalinformationworld.com)
- ^ Your Supplier’s Breach May Be Flagged by AI Before They Even Know It (www.digitalinformationworld.com)