
Data Extraction from Oncology Imaging Reports by Large Language Models: A Comparative Accuracy Study

Passweg, L. P.; Schwenke, J. M.; Schoenenberger, C. M.; Locher, F.; Picker, J.; Dieterle, M.; Thiele, B.; Hasler, D.; Danelli, A.; Schmitt, A. M.; Heye, T.; Stojanov, T.; Briel, M.; Kasenda, B.

2025-12-30 health informatics
10.64898/2025.12.30.25343206

Importance: Manual data extraction from clinical text is resource intensive. Locally hosted large language models (LLMs) may offer a privacy-preserving solution, but their performance on non-English data remains unclear.

Objective: To investigate whether the classification accuracy of locally hosted LLMs is non-inferior to human accuracy when determining metastasis status and treatment response from German radiology reports.

Design: In this retrospective comparative accuracy study, five locally hosted LLMs (llama3.3:70b, mistral-small:24b, qwq:32b, qwen3:32b, and gpt-oss:120b) were compared against humans. To calculate accuracy, a ground truth was established via duplicate human extraction and adjudication of discrepancies by a senior oncologist. Both the initial human extractions and the LLM outputs were compared against this ground truth.

Setting: The study was conducted at a tertiary referral hospital in Switzerland; data processing and analyses took place inside the hospital network.

Participants: 400 randomly sampled radiology reports (CT, MRI, PET) from adult cancer patients, generated between January 2023 and May 2025.

Exposures: Automated classification of metastasis status and treatment response by LLMs using a standardized prompt pipeline, compared with manual human review.

Main Outcomes and Measures: Primary outcomes were non-inferiority (5 percentage point [pp] margin) of LLM classification accuracy compared with human accuracy for metastasis status (presence/absence by anatomical site) and treatment response categories. Secondary outcomes included accuracy for primary tumor diagnosis, radiological absence of tumor, and extraction time per report.

Results: The analysis included 400 reports from 317 patients (mean age 63 years; 32% women). On the test set (n=300), human accuracy for metastasis status was 98.4% (95% CI, 98.0%-98.8%). All LLMs were non-inferior; gpt-oss:120b performed best (97.6% accuracy; difference, -0.8 pp [90% CI, -1.3 to -0.3 pp]). For response to treatment, human accuracy was 86.0% (95% CI, 83.2%-88.8%). All LLMs were inferior; the most accurate model, gpt-oss:120b, achieved 78.3% (difference, -7.7 pp [90% CI, -11.6 to -3.8 pp]). Mean human time per report was 120 seconds vs 11-63 seconds for LLMs.

Conclusions and Relevance: In this study, LLMs were non-inferior to human accuracy for classification of metastasis status but were inferior for assessment of response to treatment. gpt-oss:120b was the most accurate of the tested LLMs.

Study Registration: OSF: 45PVQ

Key Points

Question: Can locally hosted large language models (LLMs) match human performance when extracting sites of metastases and response to treatment from radiology reports of cancer patients?

Findings: In this preregistered, single-center study of 300 German radiology reports, all evaluated LLMs were non-inferior to humans in extracting the presence or absence of metastasis by organ site, but LLMs were inferior to humans in classifying response to treatment.

Meaning: LLMs can be suitable for classification of metastasis status, whereas more caution is warranted for more complex tasks where additional clinical reasoning may be required.
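The non-inferiority logic used above (a 5 pp margin checked against a 90% CI on the accuracy difference) can be sketched as follows. This is an illustrative simplification, not the authors' actual analysis: it uses a Wald-style normal approximation and treats the two accuracies as independent proportions, and the function name `noninferior` and the round sample numbers are assumptions for the example.

```python
import math

def noninferior(acc_llm, acc_human, n, margin=0.05, z=1.645):
    """Wald-style non-inferiority check on an accuracy difference.

    Non-inferiority is declared when the lower bound of the two-sided
    90% CI (one-sided 5% level, z = 1.645) exceeds -margin.
    Simplifying assumption: the two accuracies are treated as
    independent binomial proportions with the same denominator n.
    """
    diff = acc_llm - acc_human
    se = math.sqrt(acc_llm * (1 - acc_llm) / n
                   + acc_human * (1 - acc_human) / n)
    lo, hi = diff - z * se, diff + z * se
    return diff, lo, hi, lo > -margin

# Round numbers resembling the metastasis-status result (test set n=300):
d, lo, hi, ok = noninferior(0.976, 0.984, 300)
```

With these inputs the lower CI bound stays well above the -5 pp margin, so non-inferiority would be declared, mirroring the metastasis-status finding; plugging in the response-to-treatment accuracies instead would fail the same check.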

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics: 15.2% (Top 0.1%; based on 14 papers)
2. Journal of the American Medical Informatics Association: 10.1% (Top 1%; based on 53 papers)
3. BMC Medical Informatics and Decision Making: 7.5% (Top 1%; based on 36 papers)
4. npj Digital Medicine: 6.3% (Top 3%; based on 85 papers)
5. Journal of Medical Internet Research: 4.5% (Top 4%; based on 81 papers)
6. JAMA Network Open: 4.5% (Top 4%; based on 125 papers)
7. BMJ Open: 4.5% (Top 25%; based on 553 papers)

(50% of probability mass above)

8. BMJ Health & Care Informatics: 2.9% (Top 0.4%; based on 13 papers)
9. The Lancet Digital Health: 2.9% (Top 0.5%; based on 25 papers)
10. JMIR Medical Informatics: 2.8% (Top 2%; based on 16 papers)
11. Journal of Biomedical Informatics: 2.4% (Top 3%; based on 37 papers)
12. Journal of Clinical Epidemiology: 2.4% (Top 0.9%; based on 29 papers)
13. JAMIA Open: 2.4% (Top 4%; based on 35 papers)
14. BMC Medical Research Methodology: 2.3% (Top 2%; based on 41 papers)
15. International Journal of Medical Informatics: 2.3% (Top 3%; based on 25 papers)
16. Scientific Reports: 1.8% (Top 68%; based on 701 papers)
17. Scientific Data: 1.6% (Top 2%; based on 30 papers)
18. PLOS Digital Health: 1.6% (Top 8%; based on 88 papers)
19. PLOS ONE: 1.3% (Top 92%; based on 1737 papers)
20. Frontiers in Digital Health: 1.2% (Top 3%; based on 18 papers)
21. Genetics in Medicine: 0.8% (Top 5%; based on 57 papers)
22. Cancer Medicine: 0.8% (Top 3%; based on 17 papers)
23. Nature Communications: 0.8% (Top 41%; based on 483 papers)
24. BMJ: 0.8% (Top 6%; based on 49 papers)
25. Informatics in Medicine Unlocked: 0.7% (Top 3%; based on 11 papers)
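The "top 7 journals account for 50% of the predicted probability mass" cutoff above can be computed as the smallest prefix of the ranked probabilities whose cumulative sum reaches the target mass. The sketch below is an assumption about how such a tool derives the cutoff, not its actual code; `cutoff` and `probs` are names introduced for the example.

```python
def cutoff(probs, mass=50.0):
    """Return the rank of the first entry at which the cumulative
    probability (in percent) reaches `mass`; falls back to the full
    list length if the target is never reached."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= mass:
            return rank
    return len(probs)

# Top-10 predicted probabilities (%) from the ranked list above:
probs = [15.2, 10.1, 7.5, 6.3, 4.5, 4.5, 4.5, 2.9, 2.9, 2.8]
k = cutoff(probs)
```

Running this on the listed probabilities reproduces the divider position: the cumulative mass first reaches 50% at rank 7.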