
Performance of Large Language Models as a Tool for Primary Care Consultations: Evaluation Study

Pascual, N.; Fernandez-Pichel, M.; Losada, D. E.; Garcia-Orosa, B.; Gude, F.; Costa Lathan, C.; Sueiro Justel, J.; Gomez Fontenla, A.; Lastra Perez, M.; Alonso Garcia, F.

2026-05-04 health informatics
10.64898/2026.04.29.26352082 medRxiv

Since the release of the first ChatGPT model in 2022, large language models (LLMs) have evolved significantly, and a growing number of users now turn to these generative information systems with inquiries as sensitive and consequential as those related to health. The primary objective of this study is to identify the main strengths and weaknesses of generative AI systems when they respond to critical health-related information needs. The study used a question-answer format, in which each question corresponded to a user query and each answer was the output generated by a model in response. Answers were assessed through a human evaluation framework involving two distinct panels of clinical experts from different specialties. The evaluation criteria covered three dimensions: adherence to medical consensus; presence or absence of inappropriate or incorrect information; and the potential to cause harm to users. GPT-4o mini, Llama 3, and MedLlama 3 were selected as three representative systems for the experiments. This study presents a detailed analysis of how widely used contemporary large language models perform when addressing common health-related queries posed by online users. The results reinforce the potential of LLMs as tools for online health information seeking among non-expert users. However, the performance limitations identified underscore the need for further studies to monitor the future development of these models. In particular, performance issues were identified in areas where users may be more vulnerable, leading to the retrieval of clinically incorrect information, especially in matters relating to rare diseases. Furthermore, because scientific progress is continuous, these models can become trapped in obsolete medical knowledge.

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | PLOS Digital Health | 91 | Top 0.2% | 9.0% |
| 2 | Artificial Intelligence in Medicine | 15 | Top 0.1% | 7.1% |
| 3 | BMC Medical Informatics and Decision Making | 39 | Top 0.5% | 6.3% |
| 4 | npj Digital Medicine | 97 | Top 1.0% | 4.8% |
| 5 | Computers in Biology and Medicine | 120 | Top 0.5% | 4.8% |
| 6 | Scientific Reports | 3102 | Top 25% | 4.8% |
| 7 | International Journal of Medical Informatics | 25 | Top 0.2% | 4.8% |
| 8 | Frontiers in Digital Health | 20 | Top 0.2% | 4.2% |
| 9 | Journal of Medical Internet Research | 85 | Top 1% | 4.1% |
| 10 | PLOS ONE | 4510 | Top 36% | 3.9% |
| | 50% of probability mass above | | | |
| 11 | Journal of the American Medical Informatics Association | 61 | Top 0.8% | 3.5% |
| 12 | Journal of Biomedical Informatics | 45 | Top 0.5% | 3.5% |
| 13 | Biology Methods and Protocols | 53 | Top 0.4% | 2.7% |
| 14 | BMJ Health & Care Informatics | 13 | Top 0.3% | 2.3% |
| 15 | JMIR Medical Informatics | 17 | Top 0.5% | 2.3% |
| 16 | Journal of Personalized Medicine | 28 | Top 0.2% | 1.9% |
| 17 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 1% | 1.7% |
| 18 | JAMIA Open | 37 | Top 0.9% | 1.6% |
| 19 | Healthcare | 16 | Top 0.7% | 1.6% |
| 20 | Frontiers in Artificial Intelligence | 18 | Top 0.3% | 1.5% |
| 21 | Frontiers in Public Health | 140 | Top 6% | 1.3% |
| 22 | BMC Bioinformatics | 383 | Top 5% | 1.3% |
| 23 | Computer Methods and Programs in Biomedicine | 27 | Top 0.6% | 1.2% |
| 24 | JMIR Formative Research | 32 | Top 1% | 1.2% |
| 25 | Cureus | 67 | Top 4% | 0.9% |
| 26 | Data in Brief | 13 | Top 0.3% | 0.9% |
| 27 | Bioengineering | 24 | Top 1% | 0.8% |
| 28 | Computational and Structural Biotechnology Journal | 216 | Top 10% | 0.7% |
| 29 | Bioinformatics | 1061 | Top 10% | 0.7% |
| 30 | BMC Medical Education | 20 | Top 0.9% | 0.7% |
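
The statement that the top 10 journals cover 50% of the predicted probability mass can be checked with a running total. The following is a minimal sketch, not the site's own method: the helper name `rank_reaching` is hypothetical, and the probabilities are the rounded percentages for ranks 1 to 10 as listed above.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-10, copied from the list above.
# These are rounded values, so the running total is approximate.
probabilities = [9.0, 7.1, 6.3, 4.8, 4.8, 4.8, 4.8, 4.2, 4.1, 3.9]


def rank_reaching(threshold, probs):
    """Return (rank, running_total) for the first rank whose cumulative
    probability reaches `threshold`, or None if it is never reached."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, total
    return None


result = rank_reaching(50.0, probabilities)
if result is not None:
    rank, total = result
    print(f"Cumulative probability first reaches 50% at rank {rank} ({total:.1f}%)")
```

With the rounded figures listed, the running total stays just under 50% through rank 9 and crosses it at rank 10, which is consistent with where the "50% of probability mass above" marker sits.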