People already use AI chatbots like search engines for everyday health information. That habit looks riskier after a new study found that half of the answers from five major bots were problematic, even when the replies sounded polished and confident.
Researchers tested ChatGPT, Gemini, Grok, Meta AI, and DeepSeek with 250 prompts across cancer, vaccines, stem cells, nutrition, and athletic performance.
The prompts reflected common health queries and familiar misinformation themes, and the researchers then measured whether the bots stayed aligned with scientific evidence or drifted into misleading and potentially unsafe advice.
Broad questions exposed the biggest gaps
The weakest results came from open-ended prompts. Those broader questions produced far more highly problematic answers, while closed, narrowly framed prompts were more likely to yield safe responses.
That matters because real people usually don’t ask medical questions in a tidy, multiple-choice format. They ask whether a treatment works, whether a vaccine is safe, or what might improve athletic performance.
In the study, that kind of prompt pushed the bots toward answers that mixed solid evidence with weaker or misleading claims.
Strong confidence, shaky sourcing
The flaws didn’t stop at the answers themselves. Reference quality was poor, with an average completeness score of 40%, and none of the chatbots produced a fully accurate reference list.
That weakens one of the biggest reasons people trust chatbot responses. A reply can look sourced and authoritative, then fall apart once the citations are checked.
The researchers also flagged fabricated references, yet the bots still answered with certainty and offered almost no caveats.
Why this matters beyond one test
There are limits to the findings. The study covered only five chatbots, these products change quickly, and the prompts were built to stress the models, which may overstate how often bad answers appear in everyday use.
Still, the main takeaway is hard to dismiss. These systems were tested on evidence-based medical topics, and half of the answers still crossed into flawed or incomplete territory.
For now, chatbots may help summarize information or shape follow-up questions, but they still don’t look dependable enough for meaningful medical decisions.