MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Who is affected
In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings -- poor cross-institutional generalization, limited interpretability, or inability to generate draft reports -- and consequently see limited clinical deployment.
What we built
We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which approximately closes the train-inference gap (exact in the limit N → ∞; N = 8 suffices in practice) by averaging predictions over N samples drawn from within the latent box at inference time. Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references.
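The averaging step behind BL-TTA can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform sampling inside the box, the linear-plus-sigmoid classifier head, the latent dimension of 16, and all names (`bl_tta_predict`, `weights`, `bias`) are assumptions made only for illustration. The abstract specifies only that predictions are averaged over N samples drawn from within the latent box, with N = 8 in practice, and that there are 14 finding labels.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def bl_tta_predict(center, radius, weights, bias, n_samples=8, rng=None):
    """Average multi-label probabilities over points sampled inside a latent box.

    center, radius: shape (d,), describing the box [center - radius, center + radius].
    weights, bias:  a hypothetical linear head, shapes (d, k) and (k,).
    Uniform sampling is an assumption; the abstract does not name the distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)
    radius = np.abs(np.asarray(radius, dtype=float))
    # Offsets in [-1, 1] scaled per axis by the radius keep every sample inside the box.
    offsets = rng.uniform(-1.0, 1.0, size=(n_samples, center.shape[0]))
    samples = center + offsets * radius            # shape (n_samples, d)
    probs = sigmoid(samples @ weights + bias)      # shape (n_samples, k)
    return probs.mean(axis=0)                      # shape (k,)


# Toy usage with random values standing in for a real embedding and head.
demo_rng = np.random.default_rng(0)
d, k = 16, 14
W = demo_rng.normal(size=(d, k))
b = np.zeros(k)
c = demo_rng.normal(size=d)
r = 0.1 * np.ones(d)
print(bl_tta_predict(c, r, W, b, n_samples=8, rng=demo_rng))
```

With a radius of zero the box collapses to its center and the function returns the ordinary point prediction, which gives a simple sanity check for the sketch.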
How well it performs
On the Open-i CXR corpus (2,954 image-report pairs) under a patient-level 80/10/10 split and 5-seed reproducibility, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681 -- equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. Brier score (macro) is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595; and the framework provides four structural explainable-AI (XAI) outputs -- retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency -- which we jointly quantify in a single CXR study, a combination that, to our knowledge, has not been reported previously.
Path to deployment
Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions.
The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups -- settings in which a board-certified radiologist is not locally available.
One-sentence summary
Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a reproducible and compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
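The headline figures quoted in the abstract can be re-derived with a few lines of arithmetic. The sketch below uses only numbers stated above; the helper names (`absolute_gain`, `relative_gain`) are hypothetical and exist only for this check.

```python
def absolute_gain(ours, baseline):
    """Absolute difference between the proposed system and a baseline."""
    return ours - baseline


def relative_gain(ours, baseline):
    """Gain expressed as a fraction of the baseline value."""
    return (ours - baseline) / baseline


# Values copied from the abstract.
auroc_gain = absolute_gain(0.639, 0.624)   # best-seed macro-AUROC vs Merlin-style
acc_gain = absolute_gain(0.753, 0.681)     # elementwise accuracy vs Merlin-style
f1_abs_gain = absolute_gain(0.435, 0.349)  # report label-F1 vs Bootstrapping-style
f1_rel_gain = relative_gain(0.435, 0.349)

# An accuracy gain of 0.072 corresponds to about 7 fewer errors
# per 100 (label, image) predictions.
fewer_errors_per_100 = 100 * acc_gain

# Upper-bound workload: all CXRs attributed to board-certified radiologists.
studies_per_radiologist = 100_000_000 / 7_000  # about 14,286 per year

print(f"AUROC gain: {auroc_gain:+.3f}")
print(f"Accuracy gain: {acc_gain:+.3f} ({fewer_errors_per_100:.1f} per 100 predictions)")
print(f"Label-F1 gain: {f1_abs_gain:+.3f} absolute, {100 * f1_rel_gain:+.1f}% relative")
print(f"Studies per radiologist per year: {studies_per_radiologist:,.0f}")
```

The relative label-F1 gain comes out at roughly 24.6%, which the abstract rounds to +25%, and the workload estimate is consistent with the stated "exceeding 10,000" figure.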
Matching journals
The top 6 journals account for 50% of the predicted probability mass.