
MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation

Hakata, Y.; Oikawa, M.; Fujisawa, S.

2026-04-22 health informatics
10.64898/2026.04.20.26351338 medRxiv

Who is affected: In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings (poor cross-institutional generalization, limited interpretability, or inability to generate draft reports) and consequently see limited clinical deployment.
What we built: We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which averages predictions over N samples drawn from within the latent box at inference time and thereby approximately closes the train-inference gap (exact in the limit N → ∞; N = 8 suffices in practice). Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references.
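The averaging step behind BL-TTA can be illustrated with a minimal sketch. This is not the authors' implementation: the uniform sampling distribution, the linear-sigmoid classification head, and the names `sample_box` and `bl_tta_predict` are all assumptions introduced here for illustration. Only the box parameterization (center c, radius r) and the default N = 8 come from the abstract.

```python
import numpy as np


def sample_box(center, radius, n_samples, rng):
    """Draw points from the box [center - radius, center + radius].

    Assumption: sampling is uniform per axis; the abstract does not
    state which distribution is used inside the box.
    """
    u = rng.uniform(-1.0, 1.0, size=(n_samples, center.shape[0]))
    return center + radius * u


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bl_tta_predict(center, radius, weights, bias, n_samples=8, seed=0):
    """Average label probabilities over n_samples latent points in the box.

    Assumption: the head is a hypothetical linear layer followed by a
    sigmoid, standing in for whatever classifier the paper trains.
    """
    rng = np.random.default_rng(seed)
    z = sample_box(center, radius, n_samples, rng)   # shape (N, d)
    probs = sigmoid(z @ weights + bias)              # shape (N, n_labels)
    return probs.mean(axis=0)                        # shape (n_labels,)


# Toy inputs: a 4-dimensional latent box and 14 finding labels.
d, n_labels = 4, 14
toy_rng = np.random.default_rng(42)
center = toy_rng.normal(size=d)
radius = np.abs(toy_rng.normal(scale=0.1, size=d))
weights = toy_rng.normal(size=(d, n_labels))
bias = np.zeros(n_labels)

probs = bl_tta_predict(center, radius, weights, bias)
```

With a zero radius the box collapses to its center and the function reduces to an ordinary point prediction, which is a convenient sanity check for any real implementation.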
How well it performs: On the Open-i CXR corpus (2,954 image-report pairs), under a patient-level 80/10/10 split and evaluated over five seeds, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681, equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. The macro Brier score is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); and the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595. The framework also provides four structural explainable-AI (XAI) outputs (retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency), which we jointly quantify in a single CXR study; to our knowledge, this combination has not been reported previously.
Path to deployment: Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions.
The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups, settings in which a board-certified radiologist is not locally available.
One-sentence summary: Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a reproducible and compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
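For readers unfamiliar with the label-level metrics quoted in the abstract, the following sketch shows how elementwise accuracy, the macro Brier score, and Cohen's κ are conventionally computed for multi-label binary predictions. It uses toy data, not the paper's. The 0.5 decision threshold and the function names are assumptions made here; the abstract does not state the threshold or the exact evaluation code.

```python
import numpy as np


def elementwise_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of correct (label, image) decisions.

    Assumption: probabilities are binarized at `threshold`.
    """
    y_pred = (y_prob >= threshold).astype(int)
    return float((y_pred == y_true).mean())


def macro_brier(y_true, y_prob):
    """Mean squared error per label, then averaged across labels."""
    per_label = ((y_prob - y_true) ** 2).mean(axis=0)
    return float(per_label.mean())


def cohen_kappa(a, b):
    """Cohen's kappa for two binary label vectors of equal length."""
    a = np.asarray(a).ravel()
    b = np.asarray(b).ravel()
    p_observed = (a == b).mean()
    p_a, p_b = a.mean(), b.mean()
    # Chance agreement: both say 1, or both say 0.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return float((p_observed - p_expected) / (1 - p_expected))


# Toy data: 4 images, 2 labels (rows are images, columns are labels).
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_prob = np.array([[0.9, 0.2], [0.4, 0.1], [0.3, 0.8], [0.2, 0.6]])

acc = elementwise_accuracy(y_true, y_prob)
brier = macro_brier(y_true, y_prob)

# Two hypothetical label extractors applied to the same 8 (label, image) cells.
extractor_a = [1, 0, 1, 0, 1, 1, 0, 0]
extractor_b = [1, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(extractor_a, extractor_b)
```

Because elementwise accuracy counts every (label, image) cell, a difference of 0.072 corresponds to about 7 decisions per 100 cells, which is why the abstract distinguishes it from a per-image error rate.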

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. npj Digital Medicine (97 papers in training set; Top 0.2%): 18.8%
2. Nature Communications (4913 papers in training set; Top 13%): 12.7%
3. Nature Biomedical Engineering (42 papers in training set; Top 0.1%): 6.5%
4. Nature Machine Intelligence (61 papers in training set; Top 0.5%): 4.9%
5. Scientific Reports (3102 papers in training set; Top 30%): 4.0%
6. Medical Image Analysis (33 papers in training set; Top 0.3%): 3.6%

50% of probability mass above

7. PLOS Digital Health (91 papers in training set; Top 0.7%): 3.6%
8. PLOS ONE (4510 papers in training set; Top 45%): 2.6%
9. Nature Medicine (117 papers in training set; Top 1%): 2.4%
10. Bioinformatics (1061 papers in training set; Top 6%): 2.1%
11. Patterns (70 papers in training set; Top 0.6%): 1.9%
12. Nature Methods (336 papers in training set; Top 4%): 1.7%
13. Science Advances (1098 papers in training set; Top 17%): 1.7%
14. Communications Biology (886 papers in training set; Top 8%): 1.7%
15. NeuroImage (813 papers in training set; Top 4%): 1.7%
16. JCO Clinical Cancer Informatics (18 papers in training set; Top 0.5%): 1.5%
17. Journal of Biomedical Informatics (45 papers in training set; Top 1%): 1.2%
18. Science Translational Medicine (111 papers in training set; Top 4%): 1.1%
19. Nature Computational Science (50 papers in training set; Top 1%): 1.0%
20. IEEE Transactions on Biomedical Engineering (38 papers in training set; Top 0.7%): 1.0%
21. Communications Medicine (85 papers in training set; Top 0.6%): 1.0%
22. Scientific Data (174 papers in training set; Top 2%): 0.8%
23. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 42%): 0.8%
24. iScience (1063 papers in training set; Top 31%): 0.8%
25. Med (38 papers in training set; Top 0.8%): 0.8%
26. IEEE Transactions on Medical Imaging (18 papers in training set; Top 0.5%): 0.8%
27. The Lancet Digital Health (25 papers in training set; Top 1%): 0.7%
28. Expert Systems with Applications (11 papers in training set; Top 0.5%): 0.7%
29. Advanced Science (249 papers in training set; Top 20%): 0.7%
30. Computers in Biology and Medicine (120 papers in training set; Top 5%): 0.7%