Replication of an open-access deep learning system for screening mammography: Reduced performance mitigated by retraining on local data
Condon, J. J. J.; Oakden-Rayner, L.; Hall, K. A.; Reintals, M.; Holmes, A.; Carneiro, G.; Palmer, L. J.
Show abstract
AimTo assess the generalisability of a deep learning (DL) system for screening mammography developed at New York University (NYU), USA (1, 2) in a South Australian (SA) dataset. Methods and MaterialsClients with pathology-proven lesions (n=3,160) and age-matched controls (n=3,240) were selected from women screened at BreastScreen SA from January 2010 to December 2016 (n clients=207,691) and split into training, validation and test subsets (70%, 15%, 15% respectively). The primary outcome was area under the curve (AUC), in the SA Test Set 1 (SATS1), differentiating invasive breast cancer or ductal carcinoma in situ (n=469) from age-matched controls (n=490) and benign lesions (n=44). The NYU system was tested statically, after training without transfer learning (TL), after retraining with TL and without (NYU1) and with (NYU2) heatmaps. ResultsThe static NYU1 model AUCs in the NYU test set (NYTS) and SATS1 were 83.0%(95%CI=82.4%-83.6%)(2) and 75.8%(95%CI=72.6%-78.8%), respectively. Static NYU2 AUCs in the NYTS and SATS1 were 88.6%(95%CI=88.3%-88.9%)(2) and 84.5%(95%CI=81.9%-86.8%), respectively. Training of NYU1 and NYU2 without TL achieved AUCs in the SATS1 of 65.8% (95%CI=62.2%-69.1%) and 85.9%(95%CI=83.5%-88.2%), respectively. Retraining of NYU1 and NYU2 with TL resulted in AUCs of 82.4%(95%CI=79.7-84.9%) and 86.3%(95%CI=84.0-88.5%) respectively. ConclusionWe did not fully reproduce the reported performance of NYU on a local dataset; local retraining with TL approximated this level of performance. Optimising models for local clinical environments may improve performance. The generalisation of DL systems to new environments may be challenging. Key ContributionsIn this study, the original performance of deep learning models for screening mammography was reduced in an independent clinical population. Deep learning (DL) systems for mammography require local testing and may benefit from local retraining. An openly available DL system approximates human performance in an independent dataset. There are multiple potential sources of reduced deep learning system performance when deployed to a new dataset and population.
Matching journals
The top 4 journals account for 50% of the predicted probability mass.