Multi-LLM Disagreement as a Scalable Detector of Human Annotation Errors in Structured Data from Clinical Free-Text

Wittlinger, S.; Meerjansen, J.; Wolf, F.; Wiest, I. C.; Ebert, M. P.; Siegel, F.; Belle, S.

2026-05-06 health systems and quality improvement

10.64898/2026.05.04.26352392 medRxiv


Objective: Structured extraction from clinical free-text depends on human annotators whose labels are susceptible to errors and knowledge-driven mistakes; exhaustive quality control is impractical at scale. We evaluate whether disagreement among multiple locally hosted large language models (LLMs) can prioritize human annotations for targeted review.

Methods: Multiple LLMs independently extract the same set of structured variables annotated by a human reviewer. For each annotation, an agreement score counts the LLMs matching the human label. Using four locally hosted LLMs (Gemma 3 27B, DeepSeek-R1 70B, GPT-OSS 120B, Mistral Large 3), we evaluated this approach on 910 German-language colonoscopy reports describing endoscopic mucosal resection, with five structured variables per case (anatomical location, two diameters, resection technique, multiple polyps), yielding 4,550 annotations and a 377-case adjudication sample. A stratified sample oversampling low-agreement strata was adjudicated blinded by an experienced reviewer and analyzed with prevalence-adjusted estimates.

Results: Human error rates rose as LLM agreement fell, from 0% at scores 3-4 to 76% at score 0. The lowest-agreement stratum was only 6.5% of annotations yet concentrated an estimated 80% of errors. The multi-LLM disagreement score achieved a prevalence-adjusted AUC-ROC of 0.991 (95% CI 0.987-0.994) and AUC-PR of 0.893 (95% CI 0.851-0.929) for error detection.

Discussion: Multi-LLM disagreement outperformed single models and provided graded operating points for risk-stratified review.

Conclusion: Multi-LLM disagreement provides a scalable quality-control signal for targeted review of the highest-yield cases. Because all models run locally, the framework is GDPR-compliant; its language- and task-agnostic design supports application across clinical domains.
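The agreement score described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `agreement_score` and `review_queue`, the record layout, the example labels, and the use of exact string equality to decide a "match" are all assumptions made for illustration. A real pipeline would likely normalize labels (units, synonyms, casing) before comparing.

```python
def agreement_score(human_label, llm_labels):
    """Count how many LLM outputs match the human label.

    Assumption: a match is exact equality after any upstream normalization.
    With four LLMs the score ranges from 0 (no model agrees) to 4 (all agree).
    """
    return sum(1 for label in llm_labels if label == human_label)


def review_queue(annotations):
    """Return annotation ids ordered from lowest to highest agreement.

    Low-agreement annotations come first, so a reviewer working through
    the queue sees the most suspicious human labels earliest.
    """
    scored = [(agreement_score(a["human"], a["llms"]), a["id"]) for a in annotations]
    return [ann_id for _, ann_id in sorted(scored)]


# Hypothetical records, one per (case, variable) annotation.
annotations = [
    {"id": "case1/location", "human": "sigmoid",
     "llms": ["sigmoid", "sigmoid", "sigmoid", "sigmoid"]},
    {"id": "case2/location", "human": "cecum",
     "llms": ["rectum", "rectum", "rectum", "rectum"]},
    {"id": "case3/technique", "human": "piecemeal",
     "llms": ["piecemeal", "en bloc", "piecemeal", "en bloc"]},
]

queue = review_queue(annotations)
print(queue)
```

In this toy data the annotation on which no model agrees with the human label is placed at the head of the queue, which mirrors the idea of reviewing the lowest-agreement stratum first.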
