Performance of Large Language Models as a Tool for Primary Care Consultations: Evaluation Study

Pascual, N.; Fernandez-Pichel, M.; Losada, D. E.; Garcia-Orosa, B.; Gude, F.; Costa Lathan, C.; Sueiro Justel, J.; Gomez Fontenla, A.; Lastra Perez, M.; Alonso Garcia, F.

2026-05-04 health informatics

10.64898/2026.04.29.26352082 medRxiv

Since the release of the first ChatGPT model in 2022, large language models (LLMs) have evolved significantly, and an increasing number of users now turn to these generative information systems for inquiries as sensitive and consequential as those related to health. The primary objective of this study is to identify the main strengths and weaknesses of generative AI systems when responding to information needs as critical as those arising in the health domain. The study was structured using a question-answer format, in which each question corresponded to a user query and each answer represented the output generated by a model in response. It employed a human evaluation framework involving two distinct panels of clinical experts from different specialties. The evaluation criteria encompassed three dimensions: adherence to medical consensus; presence or absence of inappropriate or incorrect information; and the potential to cause harm to users. GPT-4o mini, Llama 3, and MedLlama 3 were selected as representative systems for the experiments. This study presents a detailed analysis of the performance of widely used contemporary LLMs in addressing common health-related queries posed by online users. The results reinforce the potential of LLMs as tools for online health information seeking among non-expert users. However, the performance limitations identified underscore the need for further studies to monitor the future development of these models. In particular, performance issues were identified in areas where users may be more vulnerable, leading to clinically incorrect information, especially in matters relating to rare diseases. Furthermore, these models can become trapped in obsolete medical knowledge as medical science continues to advance.
