
Leveraging Large Language Models to Extract Prognostic Pathology Features in Ewing Sarcoma

Huang, J.; Batool, A.; Gu, Z.; Zhao, Z.; Yao, B.; Black, J.; Davis, J.; al-Ibraheemi, A.; DuBois, S.; Barkauskas, D.; Ramakrishnan, S.; Hall, D.; Grohar, P.; Xie, Y.; Xiao, G.; Leavey, P. J.

2026-03-19 bioinformatics
10.64898/2026.02.20.707103 bioRxiv

Background: Current risk stratification for Ewing sarcoma relies heavily on clinical factors such as metastatic status and fails to capture histologic heterogeneity as a potential prognostic indicator. Although pathology reports contain rich biological data, this information remains locked in unstructured narrative text, limiting large-scale retrospective analyses. We aimed to validate the utility of Large Language Models (LLMs) for scalable data abstraction and to identify prognostic histologic features from a large multi-institutional cohort.

Methods: We conducted a retrospective cohort study using data from six Children's Oncology Group (COG) clinical trials. We used an LLM-based pipeline (OpenAI o3) to extract structured variables, including immunohistochemical (IHC) markers and CD99 staining patterns, from digitized, Optical Character Recognition (OCR)-processed pathology reports. Extraction accuracy was validated against a human-annotated ground truth (n=200) and cross-validated against senior experts (n=48). We assessed the association between extracted features and Overall Survival (OS) using Kaplan-Meier analysis and multivariable Cox proportional hazards regression, adjusting for metastatic status.

Findings: We analyzed 931 diagnostic pathology reports spanning more than 19 years. The LLM achieved a weighted average accuracy of 94% across 17 IHC markers; in a cross-validation subset, the LLM outperformed human annotators (weighted average accuracy over 15 IHC markers: LLM (o3) 98.1%, resident specialist 91.4%, senior expert 95.9%). Survival analysis identified Neuron-Specific Enolase (NSE) and S100 as significant prognostic biomarkers. After adjusting for metastatic status, NSE positivity was associated with significantly inferior survival (HR 2.15, 95% CI 1.15-4.02, p=0.016); this risk was most pronounced in patients with non-metastatic disease (HR 5.64, p=0.0055). Conversely, S100 positivity was associated with improved survival (HR 0.58, 95% CI 0.34-1.00, p=0.046).

Interpretation: LLM-assisted extraction of pathology variables is highly accurate and scalable, capable of unlocking "dark data" from historical clinical trials. We identified NSE as a potent risk factor and S100 as a protective marker in Ewing sarcoma, particularly in localized disease. These findings suggest that AI-derived histologic data can refine risk stratification and, if validated, warrant inclusion in future prospective trials.
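The reported hazard ratios, confidence intervals, and p-values can be checked against one another with a short back-calculation. The sketch below is not the authors' analysis code. It assumes the intervals are Wald-type intervals that are symmetric on the log-hazard scale, which is the usual output of a Cox model but is not stated in the abstract; the function name `wald_p_from_hr` is hypothetical. Because the published numbers are rounded, the recovered p-values should only be expected to agree approximately with the reported ones.

```python
import math
from statistics import NormalDist


def wald_p_from_hr(hr, ci_low, ci_high, level=0.95):
    """Back-calculate a z statistic and two-sided p-value from an HR and its CI.

    Assumes a Wald interval on the log scale:
        log(HR) +/- z_crit * SE
    so SE can be recovered from the width of the log-scale interval.
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    p = 2 * (1 - normal.cdf(abs(z)))
    return z, p


# Values as reported in the abstract (adjusted for metastatic status).
nse_z, nse_p = wald_p_from_hr(2.15, 1.15, 4.02)
s100_z, s100_p = wald_p_from_hr(0.58, 0.34, 1.00)

print(f"NSE:  z = {nse_z:.2f}, p = {nse_p:.4f} (reported p = 0.016)")
print(f"S100: z = {s100_z:.2f}, p = {s100_p:.4f} (reported p = 0.046)")
```

Under these assumptions both recovered p-values land close to the reported ones, with the small gap for S100 attributable to rounding of the interval bounds. The S100 upper bound of 1.00 also shows why that result sits right at the conventional 0.05 threshold.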

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | JCO Clinical Cancer Informatics | 18 | Top 0.1% | 10.3% |
| 2 | Scientific Reports | 3102 | Top 6% | 10.3% |
| 3 | Nature Communications | 4913 | Top 21% | 8.6% |
| 4 | PLOS ONE | 4510 | Top 27% | 6.4% |
| 5 | npj Precision Oncology | 48 | Top 0.1% | 3.7% |
| 6 | PLOS Computational Biology | 1633 | Top 9% | 3.7% |
| 7 | Cancers | 200 | Top 2% | 3.7% |
| 8 | Bioinformatics | 1061 | Top 6% | 2.7% |
| 9 | JNCI Cancer Spectrum | 10 | Top 0.2% | 2.1% |
| | 50% of probability mass above | | | |
| 10 | Cancer Research Communications | 46 | Top 0.3% | 1.9% |
| 11 | JCO Precision Oncology | 14 | Top 0.1% | 1.9% |
| 12 | Breast Cancer Research | 32 | Top 0.3% | 1.7% |
| 13 | Nucleic Acids Research | 1128 | Top 11% | 1.7% |
| 14 | BMC Bioinformatics | 383 | Top 5% | 1.5% |
| 15 | Leukemia | 39 | Top 0.5% | 1.5% |
| 16 | Cancer Research | 116 | Top 2% | 1.5% |
| 17 | Clinical Cancer Research | 58 | Top 1% | 1.4% |
| 18 | The Lancet Digital Health | 25 | Top 0.6% | 1.3% |
| 19 | Frontiers in Oncology | 95 | Top 3% | 1.0% |
| 20 | Genome Medicine | 154 | Top 6% | 1.0% |
| 21 | npj Breast Cancer | 18 | Top 0.1% | 1.0% |
| 22 | Frontiers in Immunology | 586 | Top 6% | 0.9% |
| 23 | Computational and Structural Biotechnology Journal | 216 | Top 8% | 0.8% |
| 24 | International Journal of Cancer | 42 | Top 1% | 0.8% |
| 25 | Modern Pathology | 21 | Top 0.4% | 0.8% |
| 26 | American Journal of Respiratory and Critical Care Medicine | 39 | Top 0.9% | 0.7% |
| 27 | The Journal of Pathology | 22 | Top 0.5% | 0.7% |
| 28 | Journal of Translational Medicine | 46 | Top 3% | 0.7% |
| 29 | Biology Methods and Protocols | 53 | Top 3% | 0.7% |
| 30 | eLife | 5422 | Top 61% | 0.7% |
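
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked directly from the listed values. The following is a minimal sketch, assuming the final percentage on each entry is that journal's predicted probability and that the entries are already sorted in descending order; the helper name `journals_to_reach` is hypothetical and not part of the site that produced this ranking.

```python
# Predicted probabilities (in percent) for ranks 1 to 30, copied from the list.
probs = [
    10.3, 10.3, 8.6, 6.4, 3.7, 3.7, 3.7, 2.7, 2.1, 1.9,
    1.9, 1.7, 1.7, 1.5, 1.5, 1.5, 1.4, 1.3, 1.0, 1.0,
    1.0, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7,
]


def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative) for the shortest prefix reaching the threshold.

    Returns None if the listed probabilities never reach the threshold.
    """
    cumulative = 0.0
    for count, p in enumerate(probabilities, start=1):
        cumulative += p
        if cumulative >= threshold:
            return count, cumulative
    return None


count, cumulative = journals_to_reach(probs, 50.0)
print(f"{count} journals reach {cumulative:.1f}% of the probability mass")
```

With the rounded values shown, the first eight journals sum to just under 50% and the ninth pushes the running total past it, which agrees with the cutoff marked in the list.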