Automated Extraction of Cancer Registry Data from Pathology Reports: Comparing LLM-Based and Ontology-Driven NLP Platforms

McPhaul, T.; Kreimeyer, K.; Baris, A.; Botsis, T.

2026-03-23 health informatics

10.64898/2026.03.20.26348915 medRxiv

Cancer data standardization requires converting unstructured pathology reports into structured registry variables, a mostly manual and resource-intensive task. We evaluated two automated extraction platforms: Brim Analytics, an LLM-based system that guides and orchestrates abstraction, and DeepPhe, an ontology-driven system. Using 330 pancreatic adenocarcinoma and 34 breast cancer pathology reports from Johns Hopkins Hospital, we assessed both under deployment-realistic conditions. Brim Analytics achieved high accuracy across seven registry variables in pancreatic cancer (mean 96.7%), including T stage (96.4%) and histologic grade (97.0%), with a decline of 3.0 percentage points on breast cancer (mean 93.7%). DeepPhe performed comparably for N stage (96.4% pancreatic, 94.1% breast) but had notable T stage deficits (83.6% pancreatic, 70.6% breast). Per-report processing times averaged 0.9 s (Brim, pancreatic), 4.6 s (Brim, breast), 1.1 s (DeepPhe, pancreatic), and 3.5 s (DeepPhe, breast). These results indicate that LLM-based extraction can achieve high accuracy across cancer types and support automated data workflows.
