
Attention-based deep learning for analysis of pathology images and gene expression data in lung squamous premalignant lesions

Xu, L.; Kefella, Y.; Zhang, Y.; Conrad, R. D.; Anderson, K. E.; Krysan, K.; Liu, G.; Kane, E.; Pennycuick, A.; Janes, S. M.; Reid, M. E.; Burks, E. J.; Billatos, E.; Mazzilli, S. A.; Kolachalama, V. B.; Beane, J. E.

2025-06-12 oncology
10.1101/2025.06.06.25328492

Molecular and cellular alterations to the normal pseudostratified columnar bronchial epithelium result in the development of bronchial premalignant lesions, which represent a spectrum of histology from normal to hyperplasia, metaplasia, dysplasia (mild, moderate, and severe), carcinoma in situ, and invasive carcinoma. Several studies have identified molecular alterations associated with lesion histology and progression. The broad and continuous spectrum of histologic and molecular changes makes reproducible stratification of lesions across multiple studies challenging. Here we propose a transformer-based framework that flexibly utilizes transcriptomic and histologic patterns to distinguish lesions with bronchial dysplasia or worse from normal, hyperplasia, and metaplasia. We leveraged H&E whole slide images (WSIs) of endobronchial biopsies and bulk gene expression (GE) data from previously published studies and ongoing lung precancer atlas efforts, obtained from patients at high risk for lung cancer. Models trained on both WSIs and GE outperformed models trained on a single data modality. On an external testing dataset of WSIs, the area under the ROC curve (AUROC) of the model trained on WSIs plus GE was 0.761 ± 0.015, compared to 0.690 ± 0.027 for a model trained on WSIs alone. On external testing datasets of GE, the AUROC of the model trained on WSIs plus GE was 0.890 ± 0.023, versus 0.816 ± 0.032 for a model trained on GE alone. Based on these results, we leveraged data across 4 studies to train a flexible fusion model that allows one or both data modalities to be used in training. The model achieved an AUROC of 0.809 ± 0.036 on external testing WSI data and 0.903 ± 0.022 on external testing GE data. Although the model was trained on a binary label, its predicted probabilities are associated with histologic grade, and the model identifies gene expression alterations associated with bronchial dysplasia across multiple studies.
This framework maps bronchial premalignant lesions that have at least one data modality available into a spectrum of disease. In the future, a framework trained on multiple data modalities may be useful in predicting premalignant disease severity, progression, and interception agent efficacy.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. npj Precision Oncology (based on 14 papers; Top 0.1%; 14.1%)
2. Nature Communications (based on 483 papers; Top 4%; 13.3%)
3. Scientific Reports (based on 701 papers; Top 20%; 8.1%)
4. JCO Clinical Cancer Informatics (based on 14 papers; Top 0.4%; 5.0%)
5. PLOS ONE (based on 1737 papers; Top 68%; 5.0%)
6. Cancers (based on 57 papers; Top 4%; 3.1%)
7. iScience (based on 74 papers; Top 2%; 2.5%)

50% of probability mass above

8. Computers in Biology and Medicine (based on 39 papers; Top 3%; 2.5%)
9. Frontiers in Oncology (based on 34 papers; Top 4%; 2.4%)
10. Clinical Cancer Research (based on 22 papers; Top 2%; 2.4%)
11. International Journal of Radiation Oncology*Biology*Physics (based on 13 papers; Top 1%; 2.0%)
12. PLOS Computational Biology (based on 141 papers; Top 6%; 2.0%)
13. The Lancet Digital Health (based on 25 papers; Top 2%; 1.4%)
14. Cancer Epidemiology, Biomarkers & Prevention (based on 14 papers; Top 2%; 1.4%)
15. Modern Pathology (based on 10 papers; Top 0.7%; 1.3%)
16. Journal for ImmunoTherapy of Cancer (based on 14 papers; Top 2%; 1.3%)
17. Cancer Medicine (based on 17 papers; Top 3%; 1.3%)
18. eLife (based on 262 papers; Top 27%; 0.8%)
19. PeerJ (based on 46 papers; Top 9%; 0.8%)
20. Communications Biology (based on 36 papers; Top 4%; 0.8%)
21. Briefings in Bioinformatics (based on 11 papers; Top 0.5%; 0.8%)
22. Diagnostics (based on 36 papers; Top 5%; 0.8%)
23. JCO Precision Oncology (based on 11 papers; Top 2%; 0.8%)
24. Genomics, Proteomics & Bioinformatics (based on 10 papers; Top 2%; 0.7%)
25. Breast Cancer Research (based on 11 papers; Top 2%; 0.7%)
26. JNCI: Journal of the National Cancer Institute (based on 13 papers; Top 2%; 0.7%)
27. Biology Methods and Protocols (based on 19 papers; Top 3%; 0.7%)
28. Proceedings of the National Academy of Sciences (based on 100 papers; Top 14%; 0.7%)
29. Cell Reports (based on 25 papers; Top 2%; 0.7%)
30. Radiotherapy and Oncology (based on 11 papers; Top 2%; 0.7%)
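
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The snippet below is a minimal sketch of that arithmetic; the function name `journals_to_reach` is hypothetical, and only the first ten percentages from the list are included. It assumes that the last percentage shown for each journal is its predicted probability, and it is not how the site itself computes the cut-off.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1-10, copied from the list above.
top_probs = [14.1, 13.3, 8.1, 5.0, 5.0, 3.1, 2.5, 2.5, 2.4, 2.4]


def journals_to_reach(probs, threshold=50.0):
    """Return (rank, cumulative_percent) at which the running total
    first reaches the threshold, or None if it is never reached.

    Hypothetical helper for illustration only.
    """
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return rank, total
    return None


result = journals_to_reach(top_probs)
```

With the values listed, the running total is about 48.6% after rank 6 and first passes 50% at rank 7 (about 51.1%), which agrees with the stated cut-off.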