
Can AI Match Human Experts? Evaluating LLM-Generated Feedback on Resident Scholarly Projects

van Allen, Z.; Forgues-Martel, S.; Venables, M. J.; Ghanney, Y.; Villeneuve, A.; Dongmo, J.; Ahmed, M.; Archibald, D.; Jolin-Dahel, K.

medRxiv preprint, posted 2026-03-04 (medical education). doi:10.64898/2026.03.04.26346878
Abstract

Background
Delivering timely, high-quality feedback on resident scholarly projects is labour-intensive, especially in large programmes. We developed an AI-assisted evaluation system, powered by the open-weight LLaMA-3.1 large language model (LLM), to generate formative feedback on Family Medicine residents' scholarly projects, and compared its performance with that of expert human evaluators.

Methods
We evaluated whether the AI-generated feedback achieves quality comparable to expert feedback. The tool ingests heterogeneous resident submissions (PDFs, scans, photographs) via OCR and produces section-by-section feedback aligned with programme rubrics. In a three-phase study we evaluated 240 feedback reports (Short; Question and Timeline; Final; n = 80 each). Within each phase, 40 reports were AI-generated and 40 were produced by research experts, across four project types: Quality Improvement, Survey-Based, Research, and Literature Review. Blinded raters used a 25-item survey covering five constructs: understanding & reasoning, trust & confidence, quality of information, expression style & persona, and safety & harm.

Results
Survey reliability was high across phases (α = .71-.98). Human feedback generally out-scored AI. In Short reports, humans led on quality (mean ± SD; 4.14 ± 0.57 vs 3.09 ± 1.05) and trust (3.96 ± 0.71 vs 2.78 ± 1.15). In Final reports, differences became small for quality (4.09 ± 0.65 vs 3.49 ± 0.68) and persona (4.16 ± 0.40 vs 3.91 ± 0.50), while AI was preferred for safety (4.50 ± 0.60 vs 4.36 ± 0.56). Performance varied by project type: in Survey-Based Final reports the AI led on quality (4.28 ± 0.50 vs 3.98 ± 0.44) and safety (4.58 ± 0.40 vs 4.24 ± 0.67), whereas in Quality Improvement Short reports humans were markedly superior in reasoning (4.27 ± 0.68 vs 2.33 ± 1.00).

Conclusions
An open-weight LLM with curated prompts can generate rubric-aligned feedback at scale that approaches the quality of expert human feedback. While expert feedback remained superior overall, AI surpassed humans in selected contexts and in safety assessments. Performance of the tool will improve over time as newer and more capable open-weight models are released. Our code and system prompts are open source.
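The Methods describe an OCR-ingest-then-prompt pipeline but not its implementation (the authors' actual code is open source). As a rough illustration only, here is a minimal Python sketch of that flow, assuming pytesseract/Pillow for scanned images, pypdf for text PDFs, and LLaMA-3.1 served behind a local OpenAI-compatible endpoint; the endpoint URL, model tag, and rubric text are placeholders, not the authors' implementation.

```python
# Hypothetical sketch of an OCR -> LLM feedback pipeline like the one the
# abstract describes. Assumes pytesseract + Pillow for scans/photos, pypdf for
# text-based PDFs, and LLaMA-3.1 served locally behind an OpenAI-compatible
# API (e.g. via llama.cpp or vLLM). This is not the authors' released code.
from pathlib import Path

import pytesseract
from PIL import Image
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local server

RUBRIC = "Research question; Methods; Timeline; Feasibility"  # placeholder rubric


def extract_text(path: Path) -> str:
    """OCR images and photographs; pull embedded text from PDFs."""
    if path.suffix.lower() in {".png", ".jpg", ".jpeg", ".tiff"}:
        return pytesseract.image_to_string(Image.open(path))
    if path.suffix.lower() == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return path.read_text()


def generate_feedback(submission: str) -> str:
    """Ask the model for section-by-section, rubric-aligned formative feedback."""
    response = client.chat.completions.create(
        model="llama-3.1-70b-instruct",  # illustrative model tag
        messages=[
            {
                "role": "system",
                "content": (
                    "You give formative feedback on resident scholarly projects, "
                    f"section by section, against this rubric:\n{RUBRIC}"
                ),
            },
            {"role": "user", "content": submission},
        ],
        temperature=0.2,  # low temperature keeps feedback consistent across reports
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(generate_feedback(extract_text(Path("submission.pdf"))))
```

A low sampling temperature is one plausible way to keep generated feedback stable across a 240-report evaluation, though the abstract does not specify the authors' decoding settings.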

Matching journals

The top 7 journals account for 50% of the predicted probability mass; a short sketch of that cumulative-mass cutoff follows the list.

1. Journal of Medical Internet Research: 10.6% (85 papers in training set, Top 0.3%)
2. PLOS ONE: 9.2% (4510 papers in training set, Top 20%)
3. Journal of Clinical and Translational Science: 8.5% (11 papers in training set, Top 0.1%)
4. npj Digital Medicine: 8.5% (97 papers in training set, Top 0.6%)
5. PLOS Digital Health: 6.4% (91 papers in training set, Top 0.3%)
6. BMC Medical Education: 4.9% (20 papers in training set, Top 0.2%)
7. Scientific Reports: 4.9% (3102 papers in training set, Top 23%)
(50% of probability mass above this line)
8. Research Synthesis Methods: 4.0% (20 papers in training set, Top 0.1%)
9. International Journal of Medical Informatics: 3.6% (25 papers in training set, Top 0.4%)
10. BMJ Open: 2.1% (554 papers in training set, Top 8%)
11. Healthcare: 1.7% (16 papers in training set, Top 0.6%)
12. Open Forum Infectious Diseases: 1.7% (134 papers in training set, Top 1%)
13. Journal of the American Medical Informatics Association: 1.5% (61 papers in training set, Top 1%)
14. Journal of General Internal Medicine: 1.5% (20 papers in training set, Top 0.6%)
15. BMJ Health & Care Informatics: 1.5% (13 papers in training set, Top 0.5%)
16. BMJ: 1.5% (49 papers in training set, Top 0.7%)
17. Behavior Research Methods: 1.3% (25 papers in training set, Top 0.1%)
18. BMC Public Health: 1.3% (147 papers in training set, Top 4%)
19. PLOS Biology: 1.3% (408 papers in training set, Top 12%)
20. Computers in Biology and Medicine: 1.2% (120 papers in training set, Top 3%)
21. The Lancet Digital Health: 1.2% (25 papers in training set, Top 0.6%)
22. Philosophical Transactions of the Royal Society B: 1.2% (51 papers in training set, Top 4%)
23. Cancer Medicine: 1.1% (24 papers in training set, Top 1%)
24. BMC Bioinformatics: 0.8% (383 papers in training set, Top 6%)
25. Frontiers in Public Health: 0.8% (140 papers in training set, Top 8%)
26. Bioinformatics: 0.8% (1061 papers in training set, Top 9%)
27. Frontiers in Medicine: 0.8% (113 papers in training set, Top 7%)
28. GigaScience: 0.7% (172 papers in training set, Top 4%)
29. JAMA Network Open: 0.5% (127 papers in training set, Top 5%)
30. iScience: 0.5% (1063 papers in training set, Top 40%)
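As an arithmetic check of the 50%-mass cutoff stated above the list, the snippet below finds the smallest top-k prefix of the ranked probabilities whose sum reaches half the mass; the values are transcribed from the ranking, and the 50% threshold is the page's own.

```python
# Smallest top-k prefix of the ranked journal probabilities whose cumulative
# sum reaches 50%. Probabilities (in %) transcribed from the list above.
from itertools import accumulate

probs = [10.6, 9.2, 8.5, 8.5, 6.4, 4.9, 4.9, 4.0, 3.6, 2.1,
         1.7, 1.7, 1.5, 1.5, 1.5, 1.5, 1.3, 1.3, 1.3, 1.2,
         1.2, 1.2, 1.1, 0.8, 0.8, 0.8, 0.8, 0.7, 0.5, 0.5]

cutoff = next(k for k, total in enumerate(accumulate(probs), start=1) if total >= 50.0)
print(cutoff)  # -> 7: the top 7 journals carry at least 50% of the predicted mass
```

The running totals reach 48.1% after six journals and 53.0% after seven, which is why the divider sits below rank 7 even though the listed 30 journals together carry only about 85.6% of the mass (the remainder is spread over unlisted journals).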