A recent study has highlighted significant limitations in ChatGPT’s ability to provide reliable medical guidance. Researchers found that the AI chatbot correctly identified disease-related symptoms in only 49–61% of cases, raising concerns over its suitability as a primary health resource. The work assessed the chatbot’s responses across diverse medical scenarios and compared them with evaluations by licensed physicians, revealing an accuracy gap despite the tool’s accessible language and conversational ease.
A key finding was that, despite accurate term recognition, the unique identifiers (IDs) ChatGPT provided for diseases, drugs, or genes were frequently invalid or redundant. This issue could limit the usefulness of AI outputs in clinical or research settings, where standardized identifiers are essential for interoperability and data integrity. The analysis covered multiple models (GPT-4 Turbo, GPT-4, GPT-4o, and GPT-4o-mini) and found some variability in performance among them.
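To see why invalid or duplicated identifiers matter in practice, consider a minimal sketch of how model-suggested IDs might be screened before downstream use. This is not the study's methodology; the handful of MeSH descriptor IDs and the format check below are hard-coded assumptions purely for illustration, and a real pipeline would load the full vocabulary from an authoritative source such as the NLM MeSH files.

```python
# Minimal sketch: screening model-suggested identifiers against a reference set.
# The MeSH IDs below are hard-coded for illustration only; a real pipeline would
# load the complete vocabulary from an authoritative source.
import re

KNOWN_MESH_IDS = {
    "D003924",  # Diabetes Mellitus, Type 2
    "D006973",  # Hypertension
    "D003920",  # Diabetes Mellitus
}

MESH_PATTERN = re.compile(r"^D\d{6}$")  # MeSH descriptor IDs look like 'D003924'

def screen_identifiers(candidate_ids):
    """Classify each candidate ID as valid, malformed, unknown, or duplicate."""
    seen = set()
    report = []
    for cid in candidate_ids:
        if cid in seen:
            report.append((cid, "duplicate"))
            continue
        seen.add(cid)
        if not MESH_PATTERN.match(cid):
            report.append((cid, "malformed"))
        elif cid not in KNOWN_MESH_IDS:
            report.append((cid, "unknown"))  # well-formed but not in the vocabulary
        else:
            report.append((cid, "valid"))
    return report

# Example: IDs a chatbot might return for "type 2 diabetes"
print(screen_identifiers(["D003924", "D999999", "D3924", "D003924"]))
# [('D003924', 'valid'), ('D999999', 'unknown'), ('D3924', 'malformed'), ('D003924', 'duplicate')]
```

Even a simple check like this catches identifiers that look plausible but do not resolve to anything in the controlled vocabulary, which is the failure mode the study reports.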
The authors noted that the study’s scope was partly constrained by funding limitations, as the cost of generating large-scale datasets with proprietary AI models was high. They suggested that future research should explore open-source large language models (LLMs) as a cost-effective and transparent alternative. Incorporating methods like Retrieval-Augmented Generation (RAG), which connects AI models to authoritative medical databases in real time, was proposed as a way to improve accuracy and reliability.
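To make the RAG idea concrete, the sketch below shows only the retrieval-and-prompting step under simplifying assumptions: the tiny in-memory "knowledge base" and the crude word-overlap scoring stand in for a real medical database and a proper vector-similarity search, and the final prompt would be passed to whichever LLM is in use rather than to any specific API.

```python
# Minimal sketch of the retrieval step in a RAG setup. The in-memory
# knowledge base and word-overlap scoring are illustrative stand-ins for
# an authoritative medical database and a real similarity search.

KNOWLEDGE_BASE = [
    "Metformin is a first-line medication for type 2 diabetes mellitus.",
    "Hypertension is defined as blood pressure persistently at or above 130/80 mmHg.",
    "The MeSH descriptor for type 2 diabetes mellitus is D003924.",
]

def retrieve(question, k=2):
    """Rank knowledge-base entries by crude word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question):
    """Prepend retrieved context so the model answers from curated sources."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question))
    return (
        "Answer using only the context below and cite it.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# The resulting prompt would then be sent to the chosen language model.
print(build_prompt("What is the first-line drug for type 2 diabetes?"))
```

The design point is that the model is asked to answer from retrieved, verifiable context rather than from its parametric memory alone, which is what the proposal to ground chatbots in authoritative medical databases aims to achieve.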
Experts caution that while generative AI can assist in preliminary health information searches, it should never replace consultation with a qualified healthcare provider. The study reinforces the need for ongoing refinement of AI models, transparency about limitations, and the integration of medical domain-specific data to ensure safe and trustworthy use in healthcare.