Multi-LLM Disagreement as a Scalable Detector of Human Annotation Errors in Structured Data from Clinical Free-Text

Wittlinger, S.; Meerjansen, J.; Wolf, F.; Wiest, I. C.; Ebert, M. P.; Siegel, F.; Belle, S.

2026-05-06 health systems and quality improvement
10.64898/2026.05.04.26352392 medRxiv
Objective: Structured extraction from clinical free-text depends on human annotators whose labels are susceptible to errors and knowledge-driven mistakes; exhaustive quality control is impractical at scale. We evaluate whether disagreement among multiple locally hosted large language models (LLMs) can prioritize human annotations for targeted review.

Methods: Multiple LLMs independently extract the same set of structured variables annotated by a human reviewer. For each annotation, an agreement score counts the LLMs matching the human label. Using four locally hosted LLMs (Gemma 3 27B, DeepSeek-R1 70B, GPT-OSS 120B, Mistral Large 3), we evaluated this approach on 910 German-language colonoscopy reports describing endoscopic mucosal resection, with five structured variables per case (anatomical location, two diameters, resection technique, multiple polyps), yielding 4,550 annotations and a 377-case adjudication sample. A stratified sample oversampling low-agreement strata was adjudicated blinded by an experienced reviewer and analyzed with prevalence-adjusted estimates.

Results: Human error rates rose as LLM agreement fell, from 0% at scores 3-4 to 76% at score 0. The lowest-agreement stratum was only 6.5% of annotations yet concentrated an estimated 80% of errors. The multi-LLM disagreement score achieved a prevalence-adjusted AUC-ROC of 0.991 (95% CI 0.987-0.994) and AUC-PR of 0.893 (95% CI 0.851-0.929) for error detection.

Discussion: Multi-LLM disagreement outperformed single models and provided graded operating points for risk-stratified review.

Conclusion: Multi-LLM disagreement provides a scalable quality-control signal for targeted review of the highest-yield cases. Because all models run locally, the framework is GDPR-compliant; its language- and task-agnostic design supports application across clinical domains.
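The agreement score and the prevalence adjustment described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the exact-match comparison, and the toy labels are assumptions made for illustration, and the preprint may normalize labels or weight strata differently.

```python
from collections import defaultdict


def agreement_score(human_label, llm_labels):
    """Count how many LLM extractions match the human label.

    Assumption: labels are compared by exact equality.
    """
    return sum(1 for label in llm_labels if label == human_label)


def prevalence_adjusted_error_rate(stratum_sizes, adjudicated):
    """Estimate the overall human error rate from a stratified sample.

    stratum_sizes: {agreement score: number of annotations in the full dataset}
    adjudicated:   (agreement score, is_error) pairs from the reviewed sample

    Each stratum's sampled error rate is weighted by that stratum's share of
    all annotations, so oversampled low-agreement strata do not inflate the
    overall estimate.
    """
    sampled = defaultdict(int)
    errors = defaultdict(int)
    for score, is_error in adjudicated:
        sampled[score] += 1
        errors[score] += int(is_error)

    total = sum(stratum_sizes.values())
    estimate = 0.0
    for score, size in stratum_sizes.items():
        if sampled[score] == 0:
            raise ValueError(f"no adjudicated cases for score {score}")
        estimate += (size / total) * (errors[score] / sampled[score])
    return estimate


# Hypothetical toy data, not taken from the study.
score = agreement_score("sigmoid", ["sigmoid", "sigmoid", "rectum", "sigmoid"])
rate = prevalence_adjusted_error_rate(
    {0: 10, 4: 90},
    [(0, True), (0, True), (0, False), (0, False), (4, False), (4, False)],
)
```

With four models the score ranges from 0 (no model matches the human label) to 4 (all match), which is what allows annotations to be ranked for review from lowest to highest agreement.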

Matching journals

The top 15 journals account for 50% of the predicted probability mass.

1. PLOS ONE: 4510 papers in training set, Top 27%, 6.4%
2. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.4%, 6.4%
3. Journal of the American Medical Informatics Association: 61 papers in training set, Top 0.6%, 4.9%
4. European Heart Journal - Digital Health: 15 papers in training set, Top 0.1%, 4.3%
5. npj Digital Medicine: 97 papers in training set, Top 1%, 3.6%
6. Journal of Clinical Epidemiology: 28 papers in training set, Top 0.1%, 3.6%
7. Scientific Reports: 3102 papers in training set, Top 39%, 3.3%
8. Journal of Medical Internet Research: 85 papers in training set, Top 2%, 3.1%
9. JAMA Network Open: 127 papers in training set, Top 1%, 3.1%
10. PLOS Biology: 408 papers in training set, Top 4%, 3.1%
11. Journal of Biomedical Informatics: 45 papers in training set, Top 0.6%, 2.4%
12. Nature Communications: 4913 papers in training set, Top 47%, 2.1%
13. Nature Medicine: 117 papers in training set, Top 2%, 1.8%
14. Nature: 575 papers in training set, Top 11%, 1.7%
15. Healthcare: 16 papers in training set, Top 0.6%, 1.7%

50% of probability mass above

16. Nature Human Behaviour: 85 papers in training set, Top 2%, 1.7%
17. PLOS Digital Health: 91 papers in training set, Top 1%, 1.7%
18. Med: 38 papers in training set, Top 0.3%, 1.7%
19. npj Precision Oncology: 48 papers in training set, Top 0.6%, 1.7%
20. Communications Medicine: 85 papers in training set, Top 0.3%, 1.7%
21. BMJ Open Quality: 15 papers in training set, Top 0.5%, 1.5%
22. Computer Methods and Programs in Biomedicine: 27 papers in training set, Top 0.5%, 1.3%
23. iScience: 1063 papers in training set, Top 19%, 1.3%
24. Bioinformatics: 1061 papers in training set, Top 8%, 1.2%
25. Journal of Infection: 71 papers in training set, Top 2%, 1.2%
26. The Lancet Digital Health: 25 papers in training set, Top 0.7%, 1.2%
27. BMJ Open: 554 papers in training set, Top 10%, 1.2%
28. Trials: 25 papers in training set, Top 1%, 1.0%
29. Clinical Pharmacology & Therapeutics: 25 papers in training set, Top 0.5%, 1.0%
30. British Journal of General Practice: 22 papers in training set, Top 0.5%, 0.9%