Replication of an open-access deep learning system for screening mammography: Reduced performance mitigated by retraining on local data

Condon, J. J. J.; Oakden-Rayner, L.; Hall, K. A.; Reintals, M.; Holmes, A.; Carneiro, G.; Palmer, L. J.

2021-06-01 radiology and imaging
10.1101/2021.05.28.21257892 medRxiv
Aim: To assess the generalisability of a deep learning (DL) system for screening mammography developed at New York University (NYU), USA (1, 2), in a South Australian (SA) dataset.

Methods and Materials: Clients with pathology-proven lesions (n = 3,160) and age-matched controls (n = 3,240) were selected from women screened at BreastScreen SA from January 2010 to December 2016 (207,691 clients) and split into training, validation and test subsets (70%, 15% and 15%, respectively). The primary outcome was the area under the curve (AUC) in the SA Test Set 1 (SATS1), differentiating invasive breast cancer or ductal carcinoma in situ (n = 469) from age-matched controls (n = 490) and benign lesions (n = 44). The NYU system was tested statically, after training from scratch without transfer learning (TL), and after retraining with TL, in variants without (NYU1) and with (NYU2) heatmaps.

Results: The static NYU1 model achieved AUCs of 83.0% (95% CI 82.4%-83.6%) (2) in the NYU test set (NYTS) and 75.8% (95% CI 72.6%-78.8%) in the SATS1. The static NYU2 model achieved AUCs of 88.6% (95% CI 88.3%-88.9%) (2) in the NYTS and 84.5% (95% CI 81.9%-86.8%) in the SATS1. Training NYU1 and NYU2 without TL achieved SATS1 AUCs of 65.8% (95% CI 62.2%-69.1%) and 85.9% (95% CI 83.5%-88.2%), respectively. Retraining NYU1 and NYU2 with TL resulted in AUCs of 82.4% (95% CI 79.7%-84.9%) and 86.3% (95% CI 84.0%-88.5%), respectively.

Conclusion: We did not fully reproduce the reported performance of the NYU system on a local dataset; local retraining with TL approximated this level of performance. Optimising models for local clinical environments may improve performance, and the generalisation of DL systems to new environments may be challenging.

Key Contributions: In this study, the original performance of deep learning models for screening mammography was reduced in an independent clinical population. DL systems for mammography require local testing and may benefit from local retraining. An openly available DL system approximates human performance in an independent dataset. There are multiple potential sources of reduced DL system performance when deployed to a new dataset and population.
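The AUCs above are reported with 95% confidence intervals. As an illustration only (the study's actual data and CI method are not given here), a rank-based AUC with a percentile bootstrap can be sketched on synthetic toy scores; all data and names below are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for the AUC."""
    stats = []
    n = len(labels)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if labels[idx].min() == labels[idx].max():
            continue  # resample drew only one class; AUC undefined, skip
        stats.append(auc(labels[idx], scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Toy example: 200 cases whose scores separate the classes imperfectly.
labels = rng.integers(0, 2, 200)
scores = labels + rng.normal(0, 1.2, 200)
point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC = {point:.3f} (95% CI {low:.3f}-{high:.3f})")
```

The rank-sum identity avoids building an explicit ROC curve; the percentile bootstrap makes no distributional assumption, which is one common (but not the only) way such intervals are obtained.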

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Diagnostics | 48 | Top 0.1% | 19.4%
2 | PLOS Digital Health | 91 | Top 0.1% | 15.3%
3 | Frontiers in Artificial Intelligence | 18 | Top 0.1% | 8.7%
4 | Scientific Reports | 3102 | Top 13% | 7.1%
(50% of the probability mass is accounted for by the journals above.)
5 | PLOS ONE | 4510 | Top 26% | 6.6%
6 | The Lancet Digital Health | 25 | Top 0.1% | 3.7%
7 | Cancers | 200 | Top 2% | 2.7%
8 | JCO Clinical Cancer Informatics | 18 | Top 0.4% | 2.0%
9 | JAMA Network Open | 127 | Top 2% | 2.0%
10 | BMC Medicine | 163 | Top 3% | 1.9%
11 | eBioMedicine | 130 | Top 1% | 1.8%
12 | npj Digital Medicine | 97 | Top 2% | 1.5%
13 | European Radiology | 14 | Top 0.4% | 1.5%
14 | BMJ Open | 554 | Top 10% | 1.5%
15 | Frontiers in Medicine | 113 | Top 4% | 1.4%
16 | Nature Communications | 4913 | Top 56% | 1.3%
17 | Frontiers in Oncology | 95 | Top 3% | 1.0%
18 | JNCI Cancer Spectrum | 10 | Top 0.4% | 0.9%
19 | BMC Health Services Research | 42 | Top 2% | 0.9%
20 | Computational and Structural Biotechnology Journal | 216 | Top 7% | 0.9%
21 | Cancer Epidemiology, Biomarkers & Prevention | 17 | Top 0.5% | 0.9%
22 | Journal of Medical Internet Research | 85 | Top 4% | 0.9%
23 | Computer Methods and Programs in Biomedicine | 27 | Top 0.8% | 0.8%
24 | Science Translational Medicine | 111 | Top 5% | 0.8%
25 | GigaScience | 172 | Top 4% | 0.7%
26 | Frontiers in Neuroinformatics | 38 | Top 1% | 0.5%
27 | Communications Medicine | 85 | Top 2% | 0.5%
28 | Journal of Magnetic Resonance Imaging | 14 | Top 0.7% | 0.5%