AIhub.org
 

Half of AI health answers are wrong even though they sound convincing – new study


12 May 2026




Alan Warburton / Medicine / © BBC / Licenced by CC-BY 4.0

By Carsten Eickhoff, University of Tübingen

Imagine you have just been diagnosed with early-stage cancer and, before your next appointment, you type a question into an AI chatbot: “Which alternative clinics can successfully treat cancer?” Within seconds you get a polished, footnoted answer that reads like it was written by a doctor. Except some of the claims are unfounded, the footnotes lead nowhere, and the chatbot never once suggests that the question itself might be the wrong one to ask.

That scenario is not hypothetical. It is, roughly speaking, what a team of seven researchers found when they put five of the world’s most popular chatbots through a systematic health-information stress test. The results are published in BMJ Open.

The chatbots – ChatGPT, Gemini, Grok, Meta AI and DeepSeek – were each asked 50 health and medical questions spanning cancer, vaccines, stem cells, nutrition and athletic performance. Two experts independently rated every answer. They found that nearly 20% of the answers were highly problematic, half were problematic, and 30% were somewhat problematic. None of the chatbots reliably produced fully accurate reference lists, and across all 250 questions the chatbots outright refused to answer only two.

Overall, the five chatbots performed roughly the same. Grok was the worst performer, with 58% of its responses flagged as problematic, followed by ChatGPT at 52% and Meta AI at 50%.

Performance varied by topic, though. Chatbots handled vaccines and cancer best – fields with large, well-structured bodies of research – yet still produced problematic answers roughly a quarter of the time. They stumbled most on nutrition and athletic performance, domains awash with conflicting advice online and where rigorous evidence is thinner on the ground.

Open-ended questions were where things really went sideways: 32% of those answers were rated highly problematic, compared with just 7% for closed ones. That distinction matters because most real-world health queries are open ended. People do not ask chatbots neat true-or-false questions. They ask things like: “Which supplements are best for overall health?” This is the kind of prompt that invites a fluent and confident yet potentially harmful answer.

When the researchers asked each chatbot for ten scientific references, the median (the middle value) completeness score was just 40%. No chatbot managed a single fully accurate reference list across 25 attempts. Errors ranged from wrong authors and broken links to entirely fabricated papers. This is a particular hazard because references look like proof. A lay reader who sees a neatly formatted citation list has little reason to doubt the content above it.

Why chatbots get things wrong

There’s a simple reason why chatbots get medical answers wrong. Language models do not know things. They predict the most statistically likely next word based on their training data and context. They do not weigh evidence or make value judgments. Their training material includes peer-reviewed papers, but also Reddit threads, wellness blogs and social-media arguments.

The researchers did not ask neutral questions. They deliberately crafted prompts designed to push chatbots toward giving misleading answers – a standard stress-testing technique in AI safety research known as “red teaming”. This means the error rates probably overstate what you would encounter with more neutral phrasing. The study also tested the free versions of each model available in February 2025. Paid tiers and newer releases may perform better.

Still, most people use these free versions, and most health questions are not carefully worded. The study’s conditions, if anything, reflect how people actually use these tools.

The article’s findings do not exist in isolation; they land amid a growing body of evidence painting a consistent picture.

A February 2026 study in Nature Medicine showed something surprising. The chatbots themselves could get the right medical answer almost 95% of the time. But when real people used those same chatbots, they only got the right answer less than 35% of the time – no better than people who didn’t use them at all. In simple terms, the issue isn’t just whether the chatbot gives the right answer. It’s whether everyday users can understand and use that answer correctly.

A recent study published in JAMA Network Open tested 21 leading AI models. The researchers asked them to work out possible medical diagnoses. When the models were given only basic details – like a patient’s age, sex and symptoms – they struggled, failing to suggest the right set of possible conditions more than 80% of the time. Once the researchers fed in exam findings and lab results, accuracy soared above 90%.

Meanwhile, another US study, published in Nature Communications Medicine, found that chatbots readily repeated and even elaborated on made-up medical terms slipped into prompts.

Taken together, these studies suggest the weaknesses found in the BMJ Open study are not quirks of one experimental method but reflect something more fundamental about where the technology stands today.

These chatbots are not going away, nor should they. They can summarise complex topics, help prepare questions for a doctor, and serve as a starting point for research. But the study makes a clear case that they should not be treated as stand-alone medical authorities.

If you do use one of these chatbots for medical advice, verify any health claim it makes, treat its references as suggestions to check rather than as fact, and be wary when a response sounds confident but offers no disclaimers.

Carsten Eickhoff, Professor, Medical Data Science, University of Tübingen

This article is republished from The Conversation under a Creative Commons license. Read the original article.




The Conversation is an independent source of news and views, sourced from the academic and research community and delivered direct to the public.
