PhysiCase: Development and dual-layer validation of synthetic cases for health professional education: A pilot study leveraging Generative AI

Komolafe, O. O.; Roberts, A. C.; Shelley, J.; Tawiah, A. K.

2026-06-09 rehabilitation medicine and physical therapy
10.64898/2026.06.07.26355114 medRxiv

High-quality, domain-specific datasets are foundational to advancing educational tools and AI systems in healthcare, yet assembling case repositories from real-world clinical records faces substantial privacy, ethical, and licensing barriers. Synthetic data generation offers a compelling pathway forward, but educational cases require rigorous validation to ensure clinical plausibility and pedagogical utility. This pilot study introduces PhysiCase, a dual-layer validation pipeline for synthetic case generation, and evaluates the feasibility of combining automated LLM-based screening with expert educator review. We generated 128 synthetic musculoskeletal (MSK) cases using four frontier large language models (GPT-4.1, GPT-4o, Google Gemini 2.5 Pro, and Llama 4 Scout) across 28 clinical conditions. Cases underwent automated quality screening using an "LLM-as-judge" framework (DeepEval) assessing prompt alignment, JSON correctness, answer relevance, bias, toxicity, and completeness. Ninety cases (70.3%) passed automated filtering and proceeded to expert evaluation by four MSK physiotherapy educators, who rated medical accuracy, realism, fidelity, relevance, and usability on 5-point Likert scales. GPT-4.1 demonstrated the highest automated pass rate (96%) and the strongest expert ratings (medical accuracy 4.10/5, usability 4.38/5), while Llama 4 Scout showed the lowest pass rate (33.3%) and the lowest expert ratings. Expert-evaluated cases achieved strong content validity indices for usability (97.5%), relevance (97.5%), and realism (95%), though medical accuracy showed greater variance (CVI 87.5%). Cross-layer correlation analysis revealed that automated completeness metrics moderately aligned with expert usability ratings, while answer relevance and prompt alignment showed weak or negative correlations with clinical correctness. Qualitative analysis identified three primary failure modes: reductive logic, biomechanical inconsistency, and administrative/contextual gaps.
The dual-layer validation framework proved methodologically viable: automated screening efficiently reduced expert review burden, while human judgment remained indispensable for detecting subtle clinical reasoning failures. LLM-generated synthetic cases have the potential to meet practical educational needs for MSK physiotherapy, but expert validation is essential to safeguard clinical accuracy. These findings support a scalable division of labour for synthetic case development, with targeted improvements to prompting and automated reasoning checks needed to address identified "nuance gaps." The code for this paper is available at https://github.com/kwid-ai/PhysiCase
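
To make two of the reported quantities concrete, the automated pass rate and the content validity index (CVI), the following is a minimal Python sketch. It is not taken from the PhysiCase repository: the function names, the threshold of 4 on the 5-point Likert scale, and the example ratings are assumptions for illustration only. The CVI is treated here as the share of expert ratings at or above the threshold, which is a common convention but may differ from the definition used in the paper.

```python
from typing import Sequence


def pass_rate(n_passed: int, n_total: int) -> float:
    """Percentage of generated cases that passed automated screening."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_passed / n_total


def content_validity_index(ratings: Sequence[int], threshold: int = 4) -> float:
    """Percentage of ratings at or above `threshold`.

    Assumed convention: a rating of 4 or 5 on a 5-point Likert scale
    counts as endorsing the item. The paper may define CVI differently.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    endorsed = sum(1 for r in ratings if r >= threshold)
    return 100.0 * endorsed / len(ratings)


# 90 of 128 cases passed automated filtering, as stated in the abstract.
print(f"Automated pass rate: {pass_rate(90, 128):.1f}%")

# Hypothetical ratings for one criterion (not data from the study).
example_ratings = [5, 4, 4, 3, 5, 4, 5, 2]
print(f"Example CVI: {content_validity_index(example_ratings):.1f}%")
```

With the abstract's figures, the first call reproduces the reported 70.3% pass rate. The second call only demonstrates the calculation on made-up ratings.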

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. PLOS Digital Health: 91 papers in training set, Top 0.1%, 27.0%
2. npj Digital Medicine: 97 papers in training set, Top 0.2%, 19.4%
3. Scientific Reports: 3102 papers in training set, Top 21%, 5.1%

50% of probability mass above

4. PLOS ONE: 4510 papers in training set, Top 32%, 4.5%
5. Frontiers in Digital Health: 20 papers in training set, Top 0.1%, 4.5%
6. Advanced Science: 249 papers in training set, Top 4%, 4.1%
7. iScience: 1063 papers in training set, Top 4%, 3.7%
8. Journal of the American Medical Informatics Association: 61 papers in training set, Top 1%, 2.0%
9. Annals of Biomedical Engineering: 34 papers in training set, Top 0.6%, 1.8%
10. Healthcare: 16 papers in training set, Top 0.5%, 1.8%
11. Peer Community Journal: 254 papers in training set, Top 2%, 1.5%
12. Journal of NeuroEngineering and Rehabilitation: 28 papers in training set, Top 0.6%, 1.4%
13. Cureus: 67 papers in training set, Top 3%, 1.3%
14. JCI Insight: 241 papers in training set, Top 5%, 1.3%
15. eLife: 5422 papers in training set, Top 51%, 1.0%
16. BMJ Open: 554 papers in training set, Top 11%, 1.0%
17. Nature Medicine: 117 papers in training set, Top 4%, 0.9%
18. Frontiers in Bioengineering and Biotechnology: 88 papers in training set, Top 2%, 0.9%
19. GigaScience: 172 papers in training set, Top 3%, 0.8%
20. Nature Communications: 4913 papers in training set, Top 62%, 0.8%
21. Computers in Biology and Medicine: 120 papers in training set, Top 4%, 0.7%
22. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 0.7%
23. Scientific Data: 174 papers in training set, Top 3%, 0.7%
24. Computer Methods and Programs in Biomedicine: 27 papers in training set, Top 1%, 0.5%
25. JMIR Formative Research: 32 papers in training set, Top 2%, 0.5%
26. MethodsX: 14 papers in training set, Top 0.6%, 0.5%
27. International Journal of Medical Informatics: 25 papers in training set, Top 2%, 0.5%