The rapid adoption of generative AI chatbots has transformed the way society accesses information. Many people now use them as substitutes for traditional search engines for everyday medical queries. However, an international investigation published this Tuesday in the journal BMJ Open reveals that relying on these tools for health advice can be a risky bet.
The study, led by researchers at the Lundquist Institute for Biomedical Innovation (USA), evaluated the performance of five of the most widely used models today: Gemini (Google), DeepSeek, Meta AI, ChatGPT (OpenAI) and Grok (xAI). The results are worrying: half of the answers to questions grounded in scientific evidence were classified as “somewhat” or “highly” problematic.
To test the reliability of these systems, the scientists designed a protocol of 250 queries divided into five categories: cancer, vaccines, stem cells, nutrition and sports performance. The questions were formulated to mimic common user searches and, in some cases, to ‘stress’ the models with common myths or contraindicated advice.
The analysis determined that 20% of responses were highly problematic, with the potential to direct users toward ineffective treatments or cause direct harm to health if followed without professional supervision.
One of the most alarming aspects highlighted by the research is the confidence with which the AIs present information. The answers are usually delivered in a tone of absolute certainty, rarely including warnings or nuances about the limitations of their knowledge. This false neutrality, which often puts scientific statements on a par with pseudoscience, is not an editorial decision, but a limitation inherent to the architecture of these models.
“Many people tend to think that chatbots are omniscient AIs with a deep well of knowledge. But they do not have knowledge in the human sense; they do not ‘know’ things,” Nicholas Tiller, the study’s principal investigator, explains to SINC. According to the expert, since they are designed to predict sequences of words based on vast datasets—which include everything from scientific articles to Reddit forums—the models lack any intrinsic ability to verify information. “They cannot apply evidence or weigh which sources are accurate and which are not. That’s why this false balance is so common,” adds Tiller.
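To see why a word-prediction engine cannot verify facts, consider a minimal sketch (not the study’s methodology): the toy model below simply returns the continuation it has seen most often in its training text, whether or not that continuation is true. The tiny corpus and the prompt are invented for illustration.

```python
# Minimal sketch: a toy next-word predictor. It has no notion of truth,
# only of which continuation was most frequent in its training text.
from collections import Counter, defaultdict

# Invented training corpus: the false claim appears more often than the true one.
corpus = (
    "vitamin c cures colds . vitamin c cures colds . "
    "vitamin c does not cure colds ."
).split()

# Count which word follows each two-word context.
follows = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    follows[(a, b)][c] += 1

def predict(a: str, b: str) -> str:
    # Return the statistically most frequent continuation, true or not.
    return follows[(a, b)].most_common(1)[0][0]

print(predict("vitamin", "c"))  # -> "cures": the majority pattern wins,
                                # even though it is the false claim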
The study reveals that Grok, from the company xAI, obtained the worst results: 58% of its responses were classified as highly problematic. By contrast, Gemini had the fewest critical failures. However, all the models failed on one key point: accessibility. According to the Flesch readability index, the complexity of the language used is equivalent to that of a university graduate, something that, far from being a virtue, represents a danger to public health.
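The Flesch Reading Ease score the article refers to is a standard formula: 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/word), where lower scores mean harder text and values around 0–30 correspond to the university-graduate level described above. The sketch below computes it with a crude vowel-group syllable counter, which is only an approximation; the sample sentence is invented.

```python
# Rough sketch of the Flesch Reading Ease formula:
# 206.835 - 1.015*(words/sentences) - 84.6*(syllables/word).
# Syllables are approximated as runs of consecutive vowels.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease(
    "Immunotherapy modulates lymphocytic activity in oncological patients."
), 1))  # prints a very low (here negative) score: dense, graduate-level prose
```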
“Overly technical responses can undermine understanding among the general public and compromise decision-making,” warns Tiller. The researcher points to a worrying psychological phenomenon: longer, more complex answers tend to increase the user’s confidence in the machine, even when that complexity does not bring greater precision. “Basically, this promotes false credibility,” says the author.
Another critical point identified by the researchers is the chatbots’ inability to reliably cite sources. The quality of the references was rated as poor, with an average completeness score of just 40%. The phenomenon of hallucinations meant that no chatbot was able to provide a completely genuine list of bibliographic references; in many cases, the models invented study titles and author names with a convincing appearance of authenticity.
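One practical defense against fabricated citations is to look a reference up in a public bibliographic database. The sketch below, which is not part of the study, checks a title against CrossRef’s public REST API; the example title is invented, and matching on titles alone is only a heuristic.

```python
# Hedged sketch: query CrossRef's public REST API to see whether a
# chatbot-supplied reference title matches any real indexed work.
import requests

def reference_exists(title: str) -> bool:
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if not items:
        return False
    found = " ".join(items[0].get("title", [""])).lower()
    # Require a close title match, not just any search hit.
    return title.lower() in found or found in title.lower()

# Invented example title; a hallucinated reference should come back False.
print(reference_exists("A fabricated study of stem cell miracles"))
```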
“As the use of these chatbots expands, our data highlight the need for public education, professional training and strict regulatory oversight,” concludes the team of researchers. Without these mechanisms, the massive deployment of generative AI in the health field risks eroding trust in science and amplifying misinformation instead of helping to combat it.
