Predicting Rectal Cancer Patient Survival with Dutch Radiology Reports using Natural Language Processing (NLP): The Role of Pretrained Language Models

Cai, L.; Zhang, T.; Beets-Tan, R.; Brunekreef, J.; Teuwen, J.

2026-01-30 health informatics
10.64898/2026.01.23.26344428 medRxiv

The use of Electronic Health Records (EHRs) has increased significantly in recent years. However, a substantial portion of clinical data remains in unstructured text, especially in radiology, which limits the use of EHRs for automated analysis in oncology research. Pretrained language models have been used to extract feature embeddings from these reports for downstream clinical applications such as treatment response and survival prediction. However, which pretrained models produce the most effective features for rectal cancer survival prediction has not yet been thoroughly investigated. This study compares five Dutch pretrained language models, two publicly available (RobBERT and MedRoBERTa.nl) and three developed in-house for this study (RecRoBERT, BRecRoBERT, and BRec2RoBERT) and trained on distinct Dutch-only corpora, in predicting overall survival and disease-free survival in rectal cancer patients. BRecRoBERT, a RoBERTa-based language model trained from scratch on a combination of Dutch breast and rectal cancer corpora, delivered the best predictive performance for both survival tasks, achieving a C-index of 0.65 (0.57, 0.73) for overall survival and 0.71 (0.64, 0.78) for disease-free survival. It outperformed models trained on general Dutch corpora (RobBERT) or Dutch hospital clinical notes (MedRoBERTa.nl). These results indicate that BRecRoBERT can potentially predict survival in rectal cancer patients from Dutch radiology reports at diagnosis. This study highlights the value of pretrained language models that incorporate domain-specific knowledge for downstream clinical applications. Furthermore, it suggests that data from related domains can improve the quality of feature embeddings for certain clinical tasks, particularly when domain-specific data is scarce.
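The abstract reports performance as a C-index (concordance index). As a rough illustration of what that number measures, the sketch below computes a simplified Harrell's C-index in plain Python: the fraction of comparable patient pairs in which the patient with the shorter observed survival time was assigned the higher risk score. The function name `concordance_index`, the toy data, and the choice to skip tied survival times are assumptions made for this example; the listing does not say how the authors computed their values.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index (hypothetical helper, not the authors' code).

    times       -- observed follow-up time per patient
    events      -- 1/True if the event was observed, 0/False if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs with tied times are skipped
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk for the earlier event: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (invented for illustration): four patients, the third is censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
perfect_risk = [0.9, 0.7, 0.4, 0.1]
c_perfect = concordance_index(times, events, perfect_risk)
c_reversed = concordance_index(times, events, perfect_risk[::-1])
c_constant = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

On this scale, 0.5 corresponds to uninformative ranking and 1.0 to perfect ranking, which is the context for reading the reported values of 0.65 and 0.71.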

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 0.1%, 18.1%
2. Artificial Intelligence in Medicine: 15 papers in training set, Top 0.1%, 17.7%
3. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.1%, 9.8%
4. Scientific Reports: 3102 papers in training set, Top 20%, 6.2%

50% of probability mass above

5. International Journal of Medical Informatics: 25 papers in training set, Top 0.3%, 4.7%
6. Journal of Biomedical Informatics: 45 papers in training set, Top 0.3%, 4.7%
7. Biology Methods and Protocols: 53 papers in training set, Top 0.1%, 4.7%
8. JMIR Medical Informatics: 17 papers in training set, Top 0.4%, 3.0%
9. Computers in Biology and Medicine: 120 papers in training set, Top 1%, 2.3%
10. Journal of Medical Internet Research: 85 papers in training set, Top 2%, 2.3%
11. Informatics in Medicine Unlocked: 21 papers in training set, Top 0.4%, 1.8%
12. JAMIA Open: 37 papers in training set, Top 0.8%, 1.7%
13. PLOS ONE: 4510 papers in training set, Top 55%, 1.6%
14. Frontiers in Artificial Intelligence: 18 papers in training set, Top 0.3%, 1.6%
15. Journal of the American Medical Informatics Association: 61 papers in training set, Top 1%, 1.3%
16. Computer Methods and Programs in Biomedicine: 27 papers in training set, Top 0.5%, 1.3%
17. Frontiers in Digital Health: 20 papers in training set, Top 1.0%, 1.1%
18. PLOS Digital Health: 91 papers in training set, Top 2%, 0.9%
19. Cancer Medicine: 24 papers in training set, Top 1%, 0.9%
20. BMC Bioinformatics: 383 papers in training set, Top 6%, 0.9%
21. BMJ Health & Care Informatics: 13 papers in training set, Top 0.9%, 0.8%
22. Cureus: 67 papers in training set, Top 5%, 0.8%
23. npj Precision Oncology: 48 papers in training set, Top 1%, 0.7%
24. npj Digital Medicine: 97 papers in training set, Top 4%, 0.7%
25. Bioinformatics: 1061 papers in training set, Top 10%, 0.7%
26. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set, Top 3%, 0.6%