Large language model scoring of medical student reflection essays: Accuracy and reproducibility of prompt-model variations
Cook, D. A.; Laack, T. A.; Pankratz, V. S.
Purpose: To evaluate large language models (LLMs) for scoring medical student essays and to compare various prompting techniques and models.

Methods: OpenAI GPT models scored 51 medical student reflection essays (15 real, 36 fabricated) using a previously reported 6-point rubric (April-May 2025). We compared 29 prompt-model conditions by systematically varying the LLM prompts (including the persona, scoring rubric, few-shot learning [exemplars], chain-of-thought reasoning, and temperature), fine-tuning, and model (including GPT-4.1, GPT-4.1-mini, GPT-o4-mini, and GPT-4-Turbo). Outcomes were accuracy (compared with human raters, measured using the single-score intraclass correlation coefficient [ICC] and mean absolute difference [MAD; zero indicates perfect agreement]), within-condition reproducibility, and cost.

Results: Across all conditions, scoring one essay took a mean (SD) of 3.73 (3.12) seconds. The cost to score 100 essays was USD $0.04 for GPT-4.1-mini, $0.21 for GPT-4.1, $0.57 for GPT-4.1 with 3 exemplars, and $2.00 for fine-tuned GPT-4.1. When the one-time cost of fine-tuning was amortized across 10,000 essays, the cost for fine-tuned GPT-4.1 was $0.20 per 100 essays. Accuracy was "almost perfect" (ICC >0.80) for 28/29 conditions (97%). Fine-tuned models were more accurate than non-fine-tuned models (MAD difference -0.24 [95% CI, -0.34 to -0.14]), and conditions with exemplars were more accurate than those without (MAD difference -0.44 [95% CI, -0.57 to -0.31]). Accuracy progressively decreased as 6, 3, 1, and 0 rubric levels were explicitly defined in the prompt (P<.001). Contrary to our hypotheses, accuracies for chain-of-thought prompts and for variations in temperature and persona did not differ significantly from the baseline prompt. Reproducibility ICC was >0.80 for 28/29 conditions (97%).

Discussion: Automated LLM essay scoring demonstrated near-perfect accuracy and reproducibility for most prompt-model conditions. Fine-tuned models and prompts with exemplars were more accurate but also more expensive, and fine-tuned models had lower per-essay costs at larger essay volumes. For smaller volumes, non-fine-tuned GPT-4.1 provided excellent results at moderate cost, and GPT-4.1-mini provided very good results at low cost.
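The abstract does not include the authors' prompts or code, but the general shape of one prompt-model condition is easy to illustrate. The sketch below is a hypothetical reconstruction using the OpenAI Python SDK; the persona wording, rubric text, and exemplar are placeholder assumptions, not the study's materials.

```python
# Hypothetical sketch of one prompt-model condition (persona + rubric +
# few-shot exemplars + temperature) using the OpenAI Python SDK.
# The persona, rubric text, and exemplar below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = "You are an experienced medical educator."  # persona component
RUBRIC = (
    "Score the reflection essay on a 1-6 rubric and reply with the score only.\n"
    "6 = deep, specific reflection with a concrete plan for change.\n"
    "1 = purely descriptive, with no reflection."
    # The study found accuracy fell as fewer rubric levels were defined.
)

# Few-shot exemplars: (essay text, human score) pairs shown before the target.
EXEMPLARS = [("During my ICU rotation I realized ...", 5)]

def score_essay(essay: str, model: str = "gpt-4.1", temperature: float = 0.0) -> str:
    messages = [{"role": "system", "content": f"{PERSONA}\n\n{RUBRIC}"}]
    for text, score in EXEMPLARS:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": str(score)})
    messages.append({"role": "user", "content": essay})
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=temperature
    )
    return response.choices[0].message.content.strip()
```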
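The two accuracy metrics named in the abstract can likewise be sketched in a few lines. The example below uses made-up scores; since the abstract specifies only a "single-score" ICC, the example assumes the two-way random-effects form (ICC2) as computed by the pingouin package.

```python
# Sketch of the accuracy metrics on hypothetical data: mean absolute
# difference (MAD; 0 = perfect agreement) and single-score ICC.
import numpy as np
import pandas as pd
import pingouin as pg

human = np.array([4, 5, 3, 6, 2])  # hypothetical human-rater scores
llm = np.array([4, 5, 4, 6, 2])    # hypothetical LLM scores for the same essays

mad = np.mean(np.abs(llm - human))  # mean absolute difference

# ICC needs long format: one row per (essay, rater) pair.
df = pd.DataFrame({
    "essay": np.tile(np.arange(len(human)), 2),
    "rater": ["human"] * len(human) + ["llm"] * len(llm),
    "score": np.concatenate([human, llm]),
})
icc = pg.intraclass_corr(data=df, targets="essay", raters="rater", ratings="score")
single_score_icc = icc.loc[icc["Type"] == "ICC2", "ICC"].item()
print(f"MAD = {mad:.2f}, single-score ICC = {single_score_icc:.2f}")
```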