
The NLP-to-Expert Gap in Chest X-ray AI

Fisher, G. R.

2026-03-02 | radiology and imaging
medRxiv | DOI: 10.64898/2026.02.27.26347261
Abstract

In previous work, we achieved state-of-the-art performance on ChestX-ray14 (ROC-AUC 0.940, F1 0.821) using pretraining diversity and clinical metric optimization. Applying the same methodology to CheXpert, we obtained similar results when evaluating against NLP-derived validation and test labels, but when evaluated against expert radiologist labels, performance fell to 0.75-0.87 ROC-AUC. The models had learned to match the automated NLP labeling system, not to diagnose disease. This paper documents our investigation into this failure and our proposed resolution. We identify the NLP-to-Expert generalization gap: a systematic divergence between models optimized on labels extracted from radiology reports and their agreement with board-certified radiologists. More surprisingly, we discovered that directly optimizing for small expert-labeled validation sets can be counterproductive: models with lower validation scores often generalized better to held-out expert test data. Four findings emerged. First, expert-labeled images are essential, at minimum for the validation and test sets even if not for training; without them, our models appeared excellent while failing to generalize to clinical judgment. Second, less training is better: short training (1-5 epochs) outperformed extended training (60+ epochs), because longer training does not improve the model; it memorizes the labeler's mistakes. Third, ImageNet features are sufficient: freezing the pretrained backbone and training only the classifier achieved 0.891 ROC-AUC, matching models with full fine-tuning. The rapid convergence we observed was not the model learning chest X-ray features; it was the classifier calibrating to already-sufficient visual representations. Fourth, regularization beats optimization: label smoothing and frozen backbones, methods that prevent overfitting, outperformed direct metric optimization on small validation sets. The 200 expert-labeled validation images in CheXpert are too few to optimize directly; they are better used as a compass than a target. With these insights, we improved from 0.823 to 0.917 ROC-AUC, exceeding Stanford's official baseline (0.907).
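The training recipe the abstract describes (frozen ImageNet backbone, label smoothing, short schedules) is straightforward to express in code. Below is a minimal PyTorch sketch, not the authors' implementation: it assumes a torchvision DenseNet-121, a 14-label CheXpert-style output head, and a generic multi-label DataLoader; the epoch count, learning rate, and smoothing factor are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 14  # CheXpert-style multi-label head (assumed)

# ImageNet-pretrained backbone with every parameter frozen.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False

# Fresh classifier head; these are the only trainable weights.
model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)

def smooth_bce_loss(logits, targets, eps=0.1):
    """BCE with label smoothing: soften hard 0/1 targets toward 0.5
    so the head cannot fully commit to noisy NLP-derived labels."""
    smoothed = targets * (1 - eps) + 0.5 * eps
    return nn.functional.binary_cross_entropy_with_logits(logits, smoothed)

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)

def train(loader, epochs=3, device="cuda"):
    # Short schedule, per the finding that extended training memorizes
    # labeler mistakes rather than improving the model.
    model.to(device)
    model.train()
    model.features.eval()  # keep frozen BatchNorm stats at ImageNet values
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            loss = smooth_bce_loss(model(images), targets.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Keeping the frozen backbone in eval mode preserves its ImageNet BatchNorm statistics, consistent with the finding that the pretrained features are already sufficient and only the classifier needs calibrating.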

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | The Lancet Digital Health | 25 | Top 0.1% | 17.3%
2 | Nature Communications | 4913 | Top 17% | 10.3%
3 | npj Digital Medicine | 97 | Top 0.5% | 10.0%
4 | Nature Machine Intelligence | 61 | Top 0.4% | 6.3%
5 | Scientific Reports | 3102 | Top 19% | 6.3%
6 | JCO Clinical Cancer Informatics | 18 | Top 0.1% | 6.2%
7 | Nature Medicine | 117 | Top 0.5% | 4.8%
8 | PLOS Digital Health | 91 | Top 0.5% | 4.8%
9 | Journal of Medical Imaging | 11 | Top 0.1% | 2.7%
10 | Patterns | 70 | Top 0.6% | 2.0%
11 | eBioMedicine | 130 | Top 1.0% | 1.9%
12 | PLOS ONE | 4510 | Top 51% | 1.9%
13 | Proceedings of the National Academy of Sciences | 2130 | Top 30% | 1.9%
14 | Medical Physics | 14 | Top 0.4% | 1.7%
15 | eLife | 5422 | Top 43% | 1.7%
16 | Communications Medicine | 85 | Top 0.3% | 1.6%
17 | npj Precision Oncology | 48 | Top 0.7% | 1.5%
18 | Nature Computational Science | 50 | Top 1% | 1.2%
19 | European Radiology | 14 | Top 0.6% | 0.9%
20 | PLOS Computational Biology | 1633 | Top 22% | 0.9%
21 | GigaScience | 172 | Top 3% | 0.9%
22 | Science Advances | 1098 | Top 27% | 0.9%
23 | Frontiers in Medicine | 113 | Top 7% | 0.7%
24 | Nature | 575 | Top 16% | 0.7%
25 | JAMA Network Open | 127 | Top 5% | 0.7%
26 | IEEE Access | 31 | Top 1% | 0.7%
27 | JAMIA Open | 37 | Top 2% | 0.7%