
Real-World Data for Predicting Rapid Relapse Triple Negative Cancer: A Study Using NCDB and EHR Data

Jonnalagadda, P.; Obeng-Gyasi, S.; Stover, D. G.; Andersen, B. L.; Rahurkar, S.

2026-01-30 oncology
10.64898/2026.01.28.26345096 medRxiv

Background: Many patients with triple-negative breast cancer (TNBC), particularly those who are older, Black, or insured by Medicaid, do not receive guideline-concordant treatment, despite its association with up to fourfold higher survival. Early identification of patients at risk for rapid relapse may enable timely interventions and improve outcomes. This study applies machine learning (ML) to real-world data to predict the risk of rapid relapse in TNBC.

Methods: We trained several ML models (logistic regression, decision trees, random forests, XGBoost, naive Bayes, and support vector machines) on National Cancer Database (NCDB) data and fine-tuned them on electronic health record (EHR) data from a cancer registry. Class imbalance was addressed with the synthetic minority oversampling technique (SMOTE). Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), receiver operating characteristic area under the curve (ROC AUC), accuracy, and F1 score. Transfer learning, cross-validation, and threshold optimization were applied to enhance the ensemble model's performance on clinical data.

Results: Initial models trained on NCDB data exhibited high NPV but low sensitivity and PPV. SMOTE and hyperparameter tuning produced modest improvements. External testing on EHR data from a cancer registry showed similar model performance. After applying transfer learning, cross-validation, and threshold optimization on the clinical data, the ensemble model achieved higher performance: sensitivity of 0.87, specificity of 0.99, PPV of 0.90, NPV of 0.98, ROC AUC of 0.99, accuracy of 0.98, and F1 score of 0.88. This optimized model, which leverages readily available clinical data, outperformed both the initial NCDB-trained models and models reported in the extant literature.
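All of the reported evaluation metrics derive from the four cells of the confusion matrix. As an illustrative sketch (not the authors' code), they can be computed directly from those counts:

```python
def evaluate(tp, fp, tn, fn):
    """Compute the study's reported metrics from confusion-matrix counts.

    tp/fn count rapid-relapse cases correctly predicted / missed;
    tn/fp count non-relapse cases correctly predicted / flagged in error.
    """
    sensitivity = tp / (tp + fn)               # recall on rapid-relapse cases
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                       # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of PPV (precision) and sensitivity (recall)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "accuracy": accuracy, "f1": f1}
```

Note that ROC AUC is the one reported metric that cannot be read off a single confusion matrix; it summarizes the sensitivity/specificity trade-off across all possible cutoffs.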
Conclusions: Transfer learning and threshold optimization effectively adapted ML models trained on NCDB data to an independent real-world clinical dataset from a single site, producing a high-performing model for predicting rapid relapse in TNBC. This model, potentially translatable to Fast Healthcare Interoperability Resources (FHIR)-compatible workflows, represents a promising tool for identifying patients at high risk. Future work should include prospective external validation, evaluation of integration into clinical workflows, and implementation studies to determine whether the model improves care processes such as timely patient navigation and treatment planning.

Author Summary: In this study, we set out to understand which patients with triple-negative breast cancer might experience a rapid return of their disease. Many people with this aggressive form of cancer do not receive the treatments that are known to improve survival, especially patients who are older, Black, or insured through public programs. Being able to identify those at highest risk early in their care could help health teams provide timely support and ensure that patients receive the treatments they need. To do this, we used information from a large national cancer database to build computer-based models that learn from patterns in patient data. We then refined these models using real medical records from a cancer center to make sure they worked well in everyday clinical settings. After adjusting and improving the models, we developed a tool that can correctly identify most patients who are likely to have a rapid return of their cancer. Our hope is that this type of tool could eventually be built into routine care and help guide timely follow-up, support services, and treatment planning. More testing in real clinical environments will be important to understand how well the tool improves care and outcomes for patients.
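The threshold-optimization step mentioned above amounts to replacing the default 0.5 probability cutoff with one tuned on held-out data. A minimal, dependency-free sketch (an illustrative reimplementation, not the study's code) sweeps candidate cutoffs and keeps the one that maximizes F1:

```python
def best_threshold(probs, labels, candidates):
    """Return the probability cutoff from `candidates` maximizing F1.

    probs:  model-predicted probabilities of rapid relapse
    labels: ground-truth outcomes (1 = rapid relapse, 0 = no rapid relapse)
    """
    def f1_at(cutoff):
        preds = [p >= cutoff for p in probs]
        tp = sum(1 for pr, y in zip(preds, labels) if pr and y)
        fp = sum(1 for pr, y in zip(preds, labels) if pr and not y)
        fn = sum(1 for pr, y in zip(preds, labels) if not pr and y)
        # F1 = 2*TP / (2*TP + FP + FN); define F1 = 0 when there are no TPs
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(candidates, key=f1_at)
```

In an imbalanced setting like rapid relapse, lowering the cutoff below 0.5 typically trades a little PPV for substantially better sensitivity, which is how threshold tuning can lift F1 without retraining the model.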

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

| # | Journal | Papers in training set | Top | Probability |
|---|---------|------------------------|-----|-------------|
| 1 | JCO Clinical Cancer Informatics | 18 | 0.1% | 12.2% |
| 2 | PLOS ONE | 4510 | 19% | 10.0% |
| 3 | Frontiers in Oncology | 95 | 0.7% | 4.8% |
| 4 | Scientific Reports | 3102 | 28% | 4.3% |
| 5 | Artificial Intelligence in Medicine | 15 | 0.1% | 4.3% |
| 6 | BMC Medical Informatics and Decision Making | 39 | 0.6% | 4.3% |
| 7 | PeerJ | 261 | 1% | 4.3% |
| 8 | Cancer Medicine | 24 | 0.3% | 3.9% |
| 9 | Biology Methods and Protocols | 53 | 0.3% | 3.5% |
| 10 | PLOS Computational Biology | 1633 | 11% | 3.2% |
| 11 | JAMA Network Open | 127 | 1% | 3.2% |
| 12 | BMC Health Services Research | 42 | 0.8% | 2.8% |
| 13 | npj Digital Medicine | 97 | 1% | 2.7% |
| 14 | Breast Cancer Research | 32 | 0.3% | 2.3% |
| 15 | Cancer Epidemiology, Biomarkers & Prevention | 17 | 0.2% | 2.1% |
| 16 | BMC Cancer | 52 | 1% | 1.9% |
| 17 | European Journal of Cancer | 10 | 0.2% | 1.8% |
| 18 | Annals of Biomedical Engineering | 34 | 0.7% | 1.7% |
| 19 | JMIR Formative Research | 32 | 0.9% | 1.6% |
| 20 | Frontiers in Bioinformatics | 45 | 0.3% | 1.5% |
| 21 | JMIR Medical Informatics | 17 | 1.0% | 1.3% |
| 22 | Cancers | 200 | 4% | 1.2% |
| 23 | BMJ Open | 554 | 11% | 1.2% |
| 24 | Frontiers in Digital Health | 20 | 1.0% | 1.1% |
| 25 | Journal of Personalized Medicine | 28 | 0.9% | 0.9% |
| 26 | BMC Infectious Diseases | 118 | 5% | 0.9% |
| 27 | BMC Research Notes | 29 | 0.4% | 0.9% |
| 28 | Journal of Medical Internet Research | 85 | 4% | 0.8% |
| 29 | iScience | 1063 | 30% | 0.8% |
| 30 | Frontiers in Artificial Intelligence | 18 | 1.0% | 0.6% |