
Performance of open-source large language models on nephrology self-assessment program

Ahangaran, M.; Jia, S.; Chitalia, S.; Athavale, A.; Francis, J. M.; O'Donnell, M. W.; Bavi, S. R.; Gupta, U. D.; Kolachalama, V. B.

2026-04-16 · nephrology
doi:10.64898/2026.04.16.26348910 · medRxiv
Abstract

Background: Large Language Models (LLMs) have demonstrated strong performance in medical question-answering tasks, highlighting their potential for clinical decision support and medical education. However, their effectiveness in subspecialty areas such as nephrology remains underexplored. In this study, we assess the performance of open-source LLMs in answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP) to better understand their capabilities and limitations within this specialized clinical domain.

Methods: We evaluated five open-source LLMs: PodGPT, a podcast-pretrained model focused on STEMM (science, technology, engineering, mathematics, and medicine) disciplines; Llama 3.2-11B; Mistral-7B-Instruct-v0.2; Falcon3-10B-Instruct; and Gemma-2-9B-it. Each model was tested on its ability to answer multiple-choice questions derived from the NephSAP. Model performance was quantified using accuracy, defined as the proportion of correctly answered questions. In addition, the quality of the models' explanatory responses was assessed using several natural language processing (NLP) metrics: Bilingual Evaluation Understudy (BLEU), Word Error Rate (WER), cosine similarity, and Flesch-Kincaid Grade Level (FKGL). For qualitative analysis, three board-certified nephrologists reviewed 40 randomly selected model responses to identify factual and clinical reasoning errors, with performance summarized as average error ratios based on the proportion of error-associated words per response.

Results: Among the evaluated models, PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest, at 45.08%. Qualitative analysis showed that PodGPT had the lowest factual error rate (0.017), while Llama and Falcon achieved the lowest reasoning error rates (0.038).

Conclusions: This study highlights the importance of STEMM-based training for enhancing the reasoning capabilities and reliability of LLMs in clinical contexts, supporting the development of more effective AI-driven decision-support tools in nephrology and other medical specialties.
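The abstract names its evaluation metrics but not their definitions. The sketch below gives minimal, self-contained Python versions of accuracy, WER, cosine similarity, FKGL, and the per-response error ratio as described above; all function names and the syllable heuristic are illustrative assumptions rather than the authors' code, and BLEU is omitted because a faithful implementation (e.g., NLTK's sentence_bleu) is longer than a sketch warrants.

```python
# Illustrative sketches of the metrics named in the abstract; these are
# assumptions about the computations, not the authors' implementation.
import math
import re


def accuracy(predictions, answers):
    """Proportion of multiple-choice questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    def bag(text):
        counts = {}
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
        return counts
    a, b = bag(text_a), bag(text_b)
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fkgl(text):
    """Flesch-Kincaid Grade Level, with a crude vowel-group syllable count."""
    words = text.split()
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words) - 15.59)


def mean_error_ratio(error_word_counts, word_counts):
    """Average over responses of error-associated words / total words."""
    ratios = [e / w for e, w in zip(error_word_counts, word_counts)]
    return sum(ratios) / len(ratios)
```

The FKGL coefficients (0.39, 11.8, 15.59) are the standard published formula; production pipelines typically use a dictionary-based syllable counter rather than the vowel-group heuristic shown here.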

Matching journals

The top 6 journals account for 50% of the predicted probability mass; a short sketch of how this cutoff can be computed follows the table.

Rank  Journal                                                   Papers in training set  Top %  Probability
1     PLOS ONE                                                  4510                    12%    15.0%
2     Bioinformatics Advances                                   184                     0.1%   10.3%
3     Scientific Reports                                        3102                    9%     8.6%
4     BMC Medical Informatics and Decision Making               39                      0.3%   8.4%
5     International Journal of Medical Informatics              25                      0.3%   4.1%
6     iScience                                                  1063                    3%     4.1%
      ── 50% of probability mass above ──
7     PLOS Digital Health                                       91                      0.6%   4.1%
8     Journal of the American Medical Informatics Association   61                      0.7%   3.7%
9     Biology Methods and Protocols                             53                      0.4%   2.7%
10    Healthcare                                                16                      0.3%   2.4%
11    JMIR Medical Informatics                                  17                      0.6%   1.9%
12    Wellcome Open Research                                    57                      0.6%   1.9%
13    BMC Medical Education                                     20                      0.5%   1.8%
14    JAMA Network Open                                         127                     2%     1.7%
15    npj Digital Medicine                                      97                      2%     1.7%
16    Frontiers in Public Health                                140                     5%     1.7%
17    Journal of Clinical Medicine                              91                      4%     1.4%
18    Cureus                                                    67                      3%     1.3%
19    Cancer Medicine                                           24                      1%     1.0%
20    Journal of Biomedical Informatics                         45                      1%     0.9%
21    Journal of Medical Internet Research                      85                      4%     0.8%
22    Bioengineering                                            24                      1%     0.8%
23    BMJ Health & Care Informatics                             13                      0.9%   0.8%
24    Computational and Structural Biotechnology Journal        216                     9%     0.8%
25    PLOS Neglected Tropical Diseases                          378                     5%     0.7%
26    Computers in Biology and Medicine                         120                     5%     0.7%
27    BMC Medicine                                              163                     7%     0.7%
28    BioMed Research International                             25                      4%     0.7%
29    F1000Research                                             79                      5%     0.7%
30    BMC Bioinformatics                                        383                     8%     0.5%