
Multimodal BEHRT: Transformers for Multimodal Electronic Health Records to predict breast cancer prognosis

MBAYE, N. M.; Danziger, M.; Toussaint, A.; Dumas, E.; Guerin, J.; Hamy-Petit, A.-S.; Reyal, F.; Rosen-Zvi, M.; AZENCOTT, C.-A.

2024-09-23 health informatics
10.1101/2024.09.18.24312984 medRxiv

Background: Breast cancer is a complex disease that affects millions of people and is the leading cause of cancer death worldwide. There is therefore still a need to develop new tools to improve treatment outcomes for breast cancer patients. Electronic Health Records (EHRs) contain a wealth of information about patients, from pathological reports to biological measurements, that could be useful towards this end but remains mostly unexploited. Recent methodological developments in deep learning, however, open the way to new methods that leverage this information to improve patient care.

Methods: In this study, we propose M-BEHRT, a Multimodal BERT for EHR data based on BEHRT, itself an architecture based on the popular natural language processing architecture BERT (Bidirectional Encoder Representations from Transformers). M-BEHRT models multimodal patient trajectories as a sequence of medical visits, which comprise a variety of information including clinical features, results from biological lab tests, medical department and procedure, and the content of free-text medical reports. M-BEHRT uses a pretraining task analogous to masked language modeling to learn a representation of patient trajectories from data that includes records left unlabeled by censoring, and is then fine-tuned on the classification task at hand. Finally, we use a gradient-based attribution method to highlight which parts of the input patient trajectory were most relevant for the prediction.

Results: We apply M-BEHRT to a retrospective cohort of about 15 000 breast cancer patients from Institut Curie (Paris, France) treated with adjuvant chemotherapy, using patient trajectories for up to one year after surgery to predict disease-free survival (DFS). M-BEHRT achieves an AUC-ROC of 0.77 [0.70-0.84] on a held-out data set for the prediction of DFS 3 years after surgery, compared to 0.67 [0.58-0.75] for the Nottingham Prognostic Index (NPI) and for a random forest (p-values = 0.031 and 0.050, respectively). In addition, we identified subsets of patients for which M-BEHRT performs particularly well, such as older patients with at least one affected lymph node.

Conclusion: We propose a novel deep learning algorithm to learn from multimodal EHR data. Learning from about 15 000 patient records, our model achieves state-of-the-art performance on two classification tasks. The EHR data used for these tasks was more homogeneous than other datasets used for pretraining, as it exclusively comprised adjuvant-treated breast cancer patients. This highlights both the potential of EHR data for improving our understanding of breast cancer and the ability of transformer-based architectures to learn from EHR data containing far fewer than the millions of records typically used in currently published studies. The representation of patient trajectories used by M-BEHRT captures their sequential aspect, and opens new research avenues for understanding complex diseases and improving patient care.
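
The pretraining task mentioned in the abstract is described only as analogous to a masked language model. As a rough illustration of what such a task could look like on a visit sequence, the sketch below applies the standard BERT corruption scheme (select a fraction of tokens, then replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). The function name `mask_visit_tokens`, the token names, the masking rate, and the 80/10/10 split are all assumptions made for illustration; the abstract does not state the exact scheme used by M-BEHRT.

```python
import random

MASK = "[MASK]"  # placeholder mask token, as in BERT


def mask_visit_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked-token pretraining.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model should predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position becomes a prediction target
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep original
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


# Hypothetical, made-up trajectory tokens (not taken from the study data).
trajectory = ["SURGERY", "LAB_A_HIGH", "ONCOLOGY_DEPT",
              "CHEMO_CYCLE_1", "LAB_B_LOW", "CHEMO_CYCLE_2"]
vocab = sorted(set(trajectory))
corrupted, labels = mask_visit_tokens(trajectory, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

Because the targets are the original tokens themselves, this kind of objective needs no outcome label, which is consistent with the abstract's point that censored (unlabeled) trajectories can still be used for pretraining.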

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.1%, 14.2%
2. Frontiers in Artificial Intelligence: 18 papers in training set, Top 0.1%, 10.0%
3. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.1%, 8.3%
4. International Journal of Medical Informatics: 25 papers in training set, Top 0.1%, 7.1%
5. Bioinformatics: 1061 papers in training set, Top 4%, 6.7%
6. Scientific Reports: 3102 papers in training set, Top 15%, 6.7%

50% of probability mass above

7. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.2%, 4.8%
8. Biology Methods and Protocols: 53 papers in training set, Top 0.2%, 4.2%
9. Journal of Medical Internet Research: 85 papers in training set, Top 2%, 3.0%
10. JMIR Medical Informatics: 17 papers in training set, Top 0.5%, 2.0%
11. Journal of Biomedical Informatics: 45 papers in training set, Top 0.7%, 2.0%
12. PLOS ONE: 4510 papers in training set, Top 52%, 1.8%
13. Frontiers in Digital Health: 20 papers in training set, Top 0.7%, 1.6%
14. npj Digital Medicine: 97 papers in training set, Top 2%, 1.6%
15. Communications Medicine: 85 papers in training set, Top 0.3%, 1.5%
16. Nature Communications: 4913 papers in training set, Top 57%, 1.1%
17. Expert Systems with Applications: 11 papers in training set, Top 0.3%, 0.9%
18. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 2%, 0.9%
19. Computer Methods and Programs in Biomedicine: 27 papers in training set, Top 0.8%, 0.9%
20. JAMIA Open: 37 papers in training set, Top 1%, 0.9%
21. Life: 27 papers in training set, Top 0.2%, 0.9%
22. iScience: 1063 papers in training set, Top 30%, 0.8%
23. BMC Medical Research Methodology: 43 papers in training set, Top 1%, 0.8%
24. BMC Infectious Diseases: 118 papers in training set, Top 5%, 0.7%
25. PLOS Digital Health: 91 papers in training set, Top 3%, 0.7%
26. BMC Medical Genomics: 36 papers in training set, Top 1%, 0.7%
27. Patterns: 70 papers in training set, Top 3%, 0.7%
28. Nature Medicine: 117 papers in training set, Top 5%, 0.7%
29. Frontiers in Genetics: 197 papers in training set, Top 10%, 0.7%
30. Acta Psychiatrica Scandinavica: 10 papers in training set, Top 0.4%, 0.7%
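
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed percentages. The sketch below does this for the first seven entries; the helper name `journals_to_reach` is hypothetical, and because the listed values are rounded to one decimal place, the total is only approximate.

```python
def journals_to_reach(probabilities, threshold=0.5):
    """Return (count, cumulative mass) at the first point where the
    running sum of ranked probabilities reaches the threshold."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count, total
    return None, total  # threshold never reached


# Predicted probabilities of the seven top-ranked journals, as listed above.
top_probs = [0.142, 0.100, 0.083, 0.071, 0.067, 0.067, 0.048]
count, mass = journals_to_reach(top_probs)
print(count, round(mass, 3))  # 6 0.53
```

The first five entries sum to about 46.3%, so the sixth journal is the one that pushes the cumulative mass past the 50% mark.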