Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models
Tan, J.; Tang, P. H.
Background: Paediatric pneumonia is a major cause of childhood morbidity and mortality. Chest X-rays (CXR) are central to diagnosis, but shortages of specialist radiologists can delay reporting. Multimodal large language models (MLLMs) may assist clinical workflows by analysing images and communicating findings, although their diagnostic performance remains below that of state-of-the-art classifiers.

Objective: To evaluate whether ensemble strategies improve MLLM diagnostic performance for paediatric radiological pneumonia detection on CXRs.

Methods: In this retrospective study, paediatric CXRs from two datasets (balanced and real-world) at KK Women's and Children's Hospital were analysed. Images were independently reviewed by two board-certified radiologists, with pneumonia severity assigned to three classes using a predefined consensus algorithm. Fifteen MedGemma-4B-it agents classified each CXR into five likelihood categories, which were mapped to the three severity classes for evaluation. Majority voting, soft voting and GPT-OSS-20B aggregation were compared with baseline average agent performance. The primary outcome was One-vs-Rest (OvR) AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's κ and One-vs-One (OvO) AUROC.

Results: The balanced dataset contained 900 CXRs and the real-world dataset 1300 CXRs. Soft voting significantly improved OvR-AUROC compared with baseline in both datasets (Balanced: 0.829 > 0.764; 95% CI = 0.752-0.779; P = 0.0002. Real-world: 0.728 > 0.655; 95% CI = 0.638-0.679; P = 0.0003). Soft voting also improved accuracy, Cohen's κ and OvO-AUROC in both datasets, and F1-score in the balanced dataset.

Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, and has potential for integration into emergency departments. Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases.

Clinical Impact:
- Paediatric CXRs often face reporting delays exceeding 24 hours due to radiologist shortages.
- Our proposed MLLM ensemble framework achieves better diagnostic discrimination for radiological pneumonia than the average individual MLLM agent, without requiring cloud-based systems.
- Soft-voting aggregation enhances diagnostic discriminatory effectiveness for paediatric pneumonia severity, while preserving explainable outputs.
- Our system acts as a decision support tool that identifies higher-risk pneumonia cases for urgent review, supporting safer triage.
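
The difference between majority voting and soft voting over the agents' outputs can be illustrated with a minimal sketch. It is not the authors' implementation: the category-to-class mapping, the function names, and the choice to define soft-voting scores as the fraction of agents voting for each class are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical mapping from five likelihood categories (0..4) to three
# severity classes (0..2); the study's actual mapping is not stated here.
CATEGORY_TO_CLASS = np.array([0, 0, 1, 2, 2])
N_CLASSES = 3


def class_votes(agent_categories):
    """Map each agent's likelihood category to a severity class and count votes."""
    classes = CATEGORY_TO_CLASS[np.asarray(agent_categories)]
    return np.bincount(classes, minlength=N_CLASSES)


def majority_vote(agent_categories):
    """Return the class chosen by most agents (ties resolve to the lower class)."""
    return int(np.argmax(class_votes(agent_categories)))


def soft_vote(agent_categories):
    """Return per-class scores: the fraction of agents voting for each class."""
    votes = class_votes(agent_categories)
    return votes / votes.sum()


# Fifteen made-up agent outputs for a single image (illustrative only).
example = [4, 3, 4, 2, 4, 3, 1, 4, 4, 2, 3, 4, 0, 4, 3]
print(majority_vote(example))
print(soft_vote(example))
```

Majority voting yields a single hard label, whereas the soft-voting scores retain how strongly the agents agreed. Continuous scores of this kind are what a ranking metric such as OvR AUROC needs, for example via scikit-learn's `roc_auc_score(y_true, scores, multi_class="ovr")`.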
Matching journals
The top 6 journals account for 50% of the predicted probability mass.