Automated Extraction of Cancer Registry Data from Pathology Reports: Comparing LLM-Based and Ontology-Driven NLP Platforms

McPhaul, T.; Kreimeyer, K.; Baris, A.; Botsis, T.

2026-03-23 health informatics
10.64898/2026.03.20.26348915 medRxiv

Cancer data standardization requires converting unstructured pathology reports into structured registry variables, a mostly manual and resource-intensive task. We evaluated two automated extraction platforms: Brim Analytics, an LLM-based system that guides and orchestrates abstraction, and DeepPhe, an ontology-driven system. Using 330 pancreatic adenocarcinoma and 34 breast cancer pathology reports from Johns Hopkins Hospital, we assessed both under deployment-realistic conditions. Brim Analytics achieved high accuracy across seven registry variables in pancreatic cancer (mean 96.7%), including T stage (96.4%) and histologic grade (97.0%), with a 3.0 percentage-point decline on breast cancer (mean 93.7%). DeepPhe performed comparably for N stage (96.4% pancreatic, 94.1% breast) but had notable T stage deficits (83.6% pancreatic, 70.6% breast). Per-report processing times averaged 0.9 s (Brim, pancreatic), 4.6 s (Brim, breast), 1.1 s (DeepPhe, pancreatic), and 3.5 s (DeepPhe, breast). These results indicate that LLM-based extraction can achieve high accuracy across cancer types and support automated data workflows.

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.1% | 18.6%
2. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.3% | 10.1%
3. Scientific Reports | 3102 papers in training set | Top 10% | 8.4%
4. JAMIA Open | 37 papers in training set | Top 0.2% | 6.4%
5. Bioinformatics | 1061 papers in training set | Top 4% | 4.8%
6. npj Digital Medicine | 97 papers in training set | Top 0.9% | 4.8%
50% of probability mass above
7. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.6% | 4.3%
8. PLOS ONE | 4510 papers in training set | Top 40% | 3.6%
9. Nature Communications | 4913 papers in training set | Top 40% | 3.6%
10. Journal of Biomedical Informatics | 45 papers in training set | Top 0.6% | 2.7%
11. BMC Bioinformatics | 383 papers in training set | Top 4% | 1.9%
12. International Journal of Medical Informatics | 25 papers in training set | Top 0.7% | 1.9%
13. The Lancet Digital Health | 25 papers in training set | Top 0.3% | 1.8%
14. Artificial Intelligence in Medicine | 15 papers in training set | Top 0.3% | 1.7%
15. Med | 38 papers in training set | Top 0.3% | 1.7%
16. JMIR Medical Informatics | 17 papers in training set | Top 0.8% | 1.7%
17. Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.3%
18. iScience | 1063 papers in training set | Top 22% | 1.2%
19. GigaScience | 172 papers in training set | Top 2% | 1.2%
20. Patterns | 70 papers in training set | Top 2% | 0.9%
21. Frontiers in Digital Health | 20 papers in training set | Top 1% | 0.9%
22. Scientific Data | 174 papers in training set | Top 2% | 0.9%
23. Cancer Medicine | 24 papers in training set | Top 1% | 0.7%
24. Data in Brief | 13 papers in training set | Top 0.5% | 0.7%
25. Journal of Pathology Informatics | 13 papers in training set | Top 0.4% | 0.7%
26. BMJ Health & Care Informatics | 13 papers in training set | Top 0.7% | 0.7%
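
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The following is a minimal sketch in Python, assuming the listed percentages are the only inputs needed; the function name `journals_to_reach` is hypothetical and is not part of the listing service.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the first ten listed journals, in rank order.
probs = [18.6, 10.1, 8.4, 6.4, 4.8, 4.8, 4.3, 3.6, 3.6, 2.7]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative total) for the smallest top-ranked prefix
    whose cumulative sum reaches the threshold, or None if it never does.

    Hypothetical helper for illustration only.
    """
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count, total
    return None


count, total = journals_to_reach(probs, 50.0)
print(count, round(total, 1))  # 6 53.1
```

The first five journals sum to 48.3%, just short of the threshold, so the sixth is needed and brings the cumulative total to 53.1%.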