Can AI Match Human Experts? Evaluating LLM-Generated Feedback on Resident Scholarly Projects
van Allen, Z.; Forgues-Martel, S.; Venables, M. J.; Ghanney, Y.; Villeneuve, A.; Dongmo, J.; Ahmed, M.; Archibald, D.; Jolin-Dahel, K.
Background: Delivering timely, high-quality feedback on resident scholarly projects is labour-intensive, especially in large programmes. We developed an AI-assisted evaluation system, powered by the open-weight LLaMA-3.1 large language model (LLM), to generate formative feedback on Family Medicine residents' scholarly projects and compared its performance with that of expert human evaluators.
Methods: We evaluated whether AI-generated feedback achieves quality comparable to expert feedback. The tool ingests heterogeneous resident submissions (PDFs, scans, photographs) via OCR and produces section-by-section feedback aligned with programme rubrics. In a three-phase study we evaluated 240 feedback reports (Short, Question and Timeline, and Final; n = 80 each). Within each phase, 40 reports were AI-generated and 40 were produced by research experts, across four project types: Quality Improvement, Survey-Based, Research, and Literature Review. Blinded raters used a 25-item survey spanning five constructs: understanding and reasoning, trust and confidence, quality of information, expression style and persona, and safety and harm.
Results: Survey reliability was high across phases (α = .71-.98). Human feedback generally outscored AI feedback. In short reports, humans led on quality (mean ± SD: 4.14 ± 0.57 vs 3.09 ± 1.05) and trust (3.96 ± 0.71 vs 2.78 ± 1.15). In final reports, differences narrowed for quality (4.09 ± 0.65 vs 3.49 ± 0.68) and persona (4.16 ± 0.40 vs 3.91 ± 0.50), while AI was preferred for safety (4.50 ± 0.60 vs 4.36 ± 0.56). Performance varied by project type: in survey-based final reports the AI led on quality (4.28 ± 0.50 vs 3.98 ± 0.44) and safety (4.58 ± 0.40 vs 4.24 ± 0.67), whereas in quality-improvement short reports humans were markedly superior in reasoning (4.27 ± 0.68 vs 2.33 ± 1.00).
Conclusions: An open-weight LLM with curated prompts can generate rubric-aligned feedback at scale that approaches the quality of expert human feedback. While expert feedback remained superior overall, AI surpassed humans in selected contexts and in safety assessments. The tool's performance should improve as newer, more capable open-weight models are released. Our code and system prompts are open source.