Large language model scoring of medical student reflection essays: Accuracy and reproducibility of prompt-model variations

Cook, D. A.; Laack, T. A.; Pankratz, V. S.

medRxiv (medical education), 2026-03-24. DOI: 10.64898/2026.03.20.26348918
Purpose: Evaluate large language models (LLMs) for scoring medical student essays, and compare various prompting techniques and models.

Methods: OpenAI GPT scored 51 medical student reflection essays (15 real, 36 fabricated) using a previously reported 6-point rubric (April-May 2025). We compared 29 prompt-model conditions by systematically varying the LLM prompts (including the persona, scoring rubric, few-shot learning [exemplars], chain-of-thought reasoning, and temperature), fine-tuning, and model (including GPT-4.1, GPT-4.1-mini, GPT-o4-mini, and GPT-4-Turbo). Outcomes were accuracy (agreement with human raters, measured using the single-score intraclass correlation coefficient [ICC] and the mean absolute difference [MAD; zero indicates perfect agreement]), within-condition reproducibility, and cost.

Results: Across all conditions, scoring 1 essay took a mean (SD) of 3.73 (3.12) seconds. The cost to score 100 essays was USD $0.04 for GPT-4.1-mini, $0.21 for GPT-4.1, $0.57 for GPT-4.1 with 3 exemplars, and $2.00 for fine-tuned GPT-4.1. When the one-time cost of fine-tuning was amortized across 10,000 essays, the cost for fine-tuned GPT-4.1 fell to $0.20 per 100 essays. Accuracy was "almost perfect" (ICC >0.80) for 28 of 29 conditions (97%). Fine-tuned models were more accurate than non-fine-tuned models (MAD difference -0.24 [95% CI, -0.34 to -0.14]), and conditions with exemplars were more accurate than those without (MAD difference -0.44 [95% CI, -0.57 to -0.31]). Accuracy progressively decreased as 6, 3, 1, and 0 rubric levels were explicitly defined in the prompt (P<.001). Contrary to our hypotheses, accuracies for chain-of-thought prompts and for variations in temperature and persona did not differ significantly from the baseline prompt. Reproducibility ICC was >0.80 for 28 of 29 conditions (97%).

Discussion: Automated LLM essay scoring demonstrated near-perfect accuracy and reproducibility for most prompt-model conditions. Fine-tuned models and prompts with exemplars were more accurate but cost more, and fine-tuning lowered per-essay costs only at larger essay volumes. For smaller volumes, non-fine-tuned GPT-4.1 provided excellent results at moderate cost, and GPT-4.1-mini provided very good results at low cost.
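Two quantities in the abstract reduce to simple arithmetic: the MAD accuracy metric and the amortization of a one-time fine-tuning cost across a scoring volume. The sketch below illustrates both with invented numbers (the rubric scores, the $18 one-time fee, and the $0.0002 per-essay API cost are hypothetical, not the study's data):

```python
def mean_absolute_difference(llm_scores, human_scores):
    """MAD between paired rubric scores; 0 indicates perfect agreement."""
    assert len(llm_scores) == len(human_scores)
    return sum(abs(a - b) for a, b in zip(llm_scores, human_scores)) / len(llm_scores)

def cost_per_100(per_essay_api_cost, one_time_cost=0.0, total_essays=100):
    """Cost to score 100 essays, with any one-time cost (e.g. fine-tuning)
    amortized across the total number of essays scored."""
    return 100 * (per_essay_api_cost + one_time_cost / total_essays)

# Hypothetical scores on a 6-point rubric (not the study's data).
llm = [4, 5, 3, 6, 2]
human = [4, 4, 3, 5, 2]
print(mean_absolute_difference(llm, human))  # 0.4

# A hypothetical $18 one-time fine-tuning fee spread over 10,000 essays
# adds $0.0018 per essay on top of the raw API cost.
print(cost_per_100(0.0002, one_time_cost=18.0, total_essays=10_000))  # ~0.20
```

This mirrors the pattern reported in the abstract: the same fine-tuned model is expensive per 100 essays at small volumes but approaches the cost of a non-fine-tuned model once the one-time fee is spread across many essays.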

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank  Journal                                          Papers  Percentile  Prob.
   1  Scientific Reports                                 3102  Top 2%      14.8%
   2  Journal of Medical Internet Research                 85  Top 0.3%    10.4%
   3  PLOS Digital Health                                  91  Top 0.2%     9.4%
   4  npj Digital Medicine                                 97  Top 0.6%     7.4%
   5  PLOS ONE                                           4510  Top 26%      6.5%
   6  International Journal of Medical Informatics         25  Top 0.3%     4.0%
  ----- 50% of probability mass above this line -----
   7  JAMA Network Open                                   127  Top 0.8%     3.8%
   8  Journal of Clinical and Translational Science        11  Top 0.1%     3.7%
   9  Proceedings of the National Academy of Sciences    2130  Top 19%      3.7%
  10  Research Synthesis Methods                           20  Top 0.1%     2.1%
  11  iScience                                           1063  Top 11%      1.9%
  12  Healthcare                                           16  Top 0.6%     1.7%
  13  Nature Human Behaviour                               85  Top 2%       1.7%
  14  PLOS Biology                                        408  Top 11%      1.5%
  15  Computers in Biology and Medicine                   120  Top 2%       1.5%
  16  Journal of Clinical Epidemiology                     28  Top 0.4%     1.1%
  17  BMC Medical Informatics and Decision Making          39  Top 2%       1.1%
  18  Cancer Medicine                                      24  Top 1%       1.1%
  19  BMC Bioinformatics                                  383  Top 6%       1.0%
  20  European Journal of Human Genetics                   49  Top 0.9%     1.0%
  21  Open Forum Infectious Diseases                      134  Top 2%       0.9%
  22  BMJ Open                                            554  Top 11%      0.9%
  23  Frontiers in Medicine                               113  Top 6%       0.8%
  24  Biology Methods and Protocols                        53  Top 2%       0.8%
  25  BMC Public Health                                   147  Top 6%       0.8%
  26  Communications Psychology                            20  Top 0.3%     0.8%
  27  BMC Medical Education                                20  Top 0.9%     0.7%
  28  Journal of Cognitive Neuroscience                   119  Top 2%       0.5%

Papers = number of papers from that journal in the training set.