Retrospective cohort study extracting coexisting background breast-lesion features from stage I-III invasive breast cancer

Lim, R. J. Y.; Nitar, P.; Lau, K. W.; Leong, L. C. H.; Lim, G. H.; Tan, V. K. M.; Tan, B. K. T.; Tan, E. Y.; Goh, S. S. N.; Hartman, M.; Wong, F. Y.; Li, J.; Joint Breast Cancer Registry

2026-05-22 oncology
10.64898/2026.05.19.26353633 medRxiv
Background: Background breast features are frequently noted in pathology reports alongside invasive breast cancer but rarely factor into prognosis or treatment decisions. Their relationship to tumor characteristics and patient outcomes remains incompletely characterised.

Methods: We conducted a retrospective cohort study of 7,603 patients with Stage I-III invasive breast cancer (diagnosed 1991-2022, age <80 years) from the Joint Breast Cancer Registry in Singapore. Natural language processing (NLP) was applied to 9,754 free-text pathology reports to extract co-existing background breast features, with accuracy validated by dual-reviewer assessment of 200 reports. Unsupervised hierarchical clustering grouped the extracted features into three categories. Associations with tumor characteristics were assessed by multinomial logistic regression, and associations with ten-year overall survival by Cox proportional hazards models (median follow-up 9.6 years; 620 deaths).

Results: NLP-based extraction of background breast features from routine pathology reports achieves an accuracy of over 90% across features. Lobular neoplasia and benign proliferative changes are associated with less aggressive tumor characteristics, whereas early neoplastic and papillary lesions are more prevalent in HER2-enriched and luminal B tumor subtypes. Benign proliferative changes are associated with better survival in age- and year-adjusted models (hazard ratio 0.91, 95% CI 0.86-0.97), but this association is attenuated after adjustment for stage and subtype.

Conclusions: NLP-enabled extraction of background breast features from pathology text is feasible at scale. These features reflect tumor biology but do not independently add prognostic information beyond established clinical variables.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Breast Cancer Research: 32 papers in training set; Top 0.1%; 16.7%
2. JNCI Cancer Spectrum: 10 papers in training set; Top 0.1%; 8.0%
3. Scientific Reports: 3102 papers in training set; Top 20%; 6.1%
4. Cancer Epidemiology, Biomarkers & Prevention: 17 papers in training set; Top 0.1%; 6.0%
5. Cancers: 200 papers in training set; Top 0.9%; 6.0%
6. JCO Clinical Cancer Informatics: 18 papers in training set; Top 0.3%; 3.4%
7. PLOS ONE: 4510 papers in training set; Top 41%; 3.4%
8. Nature Communications: 4913 papers in training set; Top 42%; 3.4%

50% of probability mass above

9. Diagnostics: 48 papers in training set; Top 0.5%; 3.4%
10. International Journal of Cancer: 42 papers in training set; Top 0.3%; 3.4%
11. European Journal of Cancer: 10 papers in training set; Top 0.1%; 2.6%
12. Annals of Oncology: 13 papers in training set; Top 0.3%; 2.6%
13. Frontiers in Oncology: 95 papers in training set; Top 2%; 2.6%
14. BMC Cancer: 52 papers in training set; Top 0.9%; 2.3%
15. Cancer Medicine: 24 papers in training set; Top 0.6%; 2.0%
16. British Journal of Cancer: 42 papers in training set; Top 0.7%; 2.0%
17. Clinical Cancer Research: 58 papers in training set; Top 0.9%; 1.8%
18. npj Breast Cancer: 18 papers in training set; Top 0.1%; 1.7%
19. JAMA Network Open: 127 papers in training set; Top 3%; 1.4%
20. BMC Research Notes: 29 papers in training set; Top 0.2%; 1.4%
21. The Journal of Pathology: 22 papers in training set; Top 0.2%; 1.4%
22. iScience: 1063 papers in training set; Top 23%; 1.1%
23. npj Digital Medicine: 97 papers in training set; Top 3%; 0.9%
24. Frontiers in Bioinformatics: 45 papers in training set; Top 0.7%; 0.9%
25. International Journal of Epidemiology: 74 papers in training set; Top 3%; 0.8%
26. JNCI: Journal of the National Cancer Institute: 16 papers in training set; Top 0.7%; 0.7%
27. Journal of Personalized Medicine: 28 papers in training set; Top 1%; 0.7%
28. European Journal of Human Genetics: 49 papers in training set; Top 1%; 0.7%
29. JCO Precision Oncology: 14 papers in training set; Top 0.4%; 0.7%
30. PeerJ: 261 papers in training set; Top 16%; 0.7%
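
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked against the listed percentages. The following is a minimal sketch, assuming the cutoff is the smallest number of top-ranked journals whose running total reaches 50%; the page does not state its exact rule, the displayed values are rounded, and the function name `journals_to_reach` is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for the ten highest-ranked journals,
# copied from the ranking above. These are rounded display values.
top_probs = [16.7, 8.0, 6.1, 6.0, 6.0, 3.4, 3.4, 3.4, 3.4, 3.4]


def journals_to_reach(probs, threshold):
    """Return how many leading entries are needed for the running total
    to reach `threshold`, or None if the total never gets there.

    Assumption: this "first running total at or above the threshold" rule
    is an illustration, not the page's documented method.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


print(journals_to_reach(top_probs, 50.0))
```

With the rounded values listed, the running total is about 49.6% after seven journals and about 53.0% after eight, which agrees with the stated cutoff.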