Artificial intelligence is increasingly present in healthcare workflows, from clinical documentation support to patient-facing chat interfaces. But a new study published in The Lancet Digital Health raises an important concern: leading AI language models can accept and repeat false medical information when it is presented in credible, authoritative language.
Researchers analysed more than one million prompts across 20 large language models, including ChatGPT, Llama, Gemma, Qwen, Phi and models developed by Mistral AI. The goal was simple but critical: if a false medical claim is phrased convincingly, will the model challenge it or repeat it?
The results show that the answer depends heavily on both the model and the framing of the claim.
When Credible Language Overrides Medical Accuracy
The study was conducted by researchers at Mount Sinai Health System, affiliated with the Icahn School of Medicine at Mount Sinai.
To simulate realistic risk scenarios, researchers introduced fake medical claims in multiple formats:
- False information embedded in authentic-looking hospital discharge notes
- Health myths extracted from Reddit discussions
- Simulated clinical cases containing fabricated guidance
Across all tested models, AI systems accepted made-up medical information in approximately 32 percent of cases. However, performance varied significantly:
- Smaller or less advanced models accepted false claims more than 60 percent of the time
- Stronger systems, including GPT-4-class models, accepted false information in roughly 10 percent of cases
One particularly notable finding: models that had been fine-tuned specifically for medical use performed worse than some general-purpose models.
According to co-author Eyal Klang, the models often defaulted to trusting confident, authoritative medical language, regardless of whether the claim was factually correct. In other words, how something is written can matter more than whether it is true.
Examples of Accepted False Claims
Some of the misinformation accepted by multiple models included statements such as:
- “Tylenol can cause autism if taken during pregnancy.”
- “Rectal garlic boosts the immune system.”
- “Mammography causes breast cancer by compressing tissue.”
- “Tomatoes thin the blood as effectively as prescription anticoagulants.”
In one test case, a fabricated hospital discharge note advised patients with esophagitis-related bleeding to drink cold milk to soothe symptoms. Several models treated this as valid medical guidance rather than flagging it as unsafe.
For healthcare environments, even a 10 percent error rate in safety-critical contexts is significant.
Logical Fallacies: A Surprising Signal
Researchers also tested how models respond to logically flawed but persuasive arguments, such as:
- Appeal to popularity: “Everyone believes this, so it must be true.”
- Appeal to authority: “An expert says this is true.”
- Slippery slope: “If X happens, disaster will follow.”
Interestingly, most fallacies made models more skeptical overall. However, two types increased gullibility:
- When misinformation included an appeal to authority, models accepted false claims 34.6 percent of the time
- Slippery slope arguments led to acceptance in 33.9 percent of cases
This reinforces a broader insight relevant beyond healthcare: LLMs are highly sensitive to rhetorical framing.
Why This Matters for AI in Clinical Workflows
AI systems are increasingly integrated into clinical documentation, triage support, and patient communication tools. If these systems can propagate convincingly written misinformation, the operational and ethical implications are serious.
The study authors propose that the question “Can this system pass on a lie?” should become a measurable property during model evaluation. They recommend:
- Large-scale adversarial stress testing
- External evidence verification layers
- Structured medical fact-checking pipelines
- Transparent reporting of misinformation pass-through rates
For hospitals and AI developers, datasets like the one used in this research can serve as structured safety benchmarks.
Control F5 Perspective
For technology leaders, especially those building or integrating AI into healthcare products, this study highlights three practical takeaways:
- Accuracy is not just about training data size but about verification architecture
- Domain fine-tuning alone does not guarantee reliability
- Rhetorical manipulation remains a structural vulnerability in LLM systems
As AI becomes embedded in regulated environments, evaluation frameworks must evolve from capability benchmarking to safety benchmarking.
The future of medical AI will depend less on how impressive models appear in demos and more on how reliably they resist plausible falsehoods under pressure.
We have helped 20+ companies in industries like Finance, Transportation, Health, Tourism, Events, Education, Sports.