Do Large Language Models Read or Remember? Analyzing LLM Performance in Biomedical Text Mining With Progressive Content Removal and Counterfactual Results

Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.

2026-02-03 health informatics
10.64898/2026.02.02.26345347 medRxiv

Purpose: Large language models (LLMs) can classify biomedical documents accurately, but strong performance does not prove they are using the supplied text rather than identifier-triggered parametric knowledge. We tested whether oncology trial-success classification reflects "reading" of abstract evidence or "remembering" of known trials.

Methods: We used a corpus of 250 two-arm oncology randomized controlled trials from seven major journals (2005-2023) and asked the flagship models of three commercial vendors (OpenAI, Google, and Anthropic) to output a single label indicating whether the primary endpoint was met. For each trial we created five deterministic inputs: title+abstract (baseline), title-only, DOI-only, a counterfactual title+abstract with the primary-endpoint outcome minimally flipped, and the same counterfactual title+abstract paired with the original DOI to induce an identifier-text conflict.

Results: With the full title+abstract, models achieved near-ceiling performance (accuracy and F1 score 0.96-0.97) and high format adherence (97.2-100%). Performance degraded stepwise with content removal (title-only accuracy and F1 score 0.79-0.88; DOI-only 0.63-0.67), consistent with an above-chance identifier-driven signal. Under counterfactual results, models followed the edited evidence (accuracy and F1 score 0.96-0.99 against inverted labels). Adding the real DOI minimally affected GPT (accuracy and F1 score approximately 0.99) but modestly reduced Gemini (approximately 0.97) and Claude (approximately 0.95), mainly via lower sensitivity.

Conclusion: LLMs robustly track explicit endpoint statements in abstracts, yet identifiers can support above-chance predictions and occasionally compete with textual evidence. Progressive ablations plus counterfactual conflicts provide a practical, reproducible audit for grounding in biomedical LLM evaluations.
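The five-condition ablation protocol described in the Methods can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function names, the dictionary fields (`title`, `abstract`, `doi`, `cf_abstract`), and the prompt layout are all assumptions.

```python
# Hypothetical sketch of the five-condition ablation protocol.
# All names and field layouts are illustrative assumptions, not the authors' code.

def build_conditions(trial):
    """Create the five deterministic inputs for one trial record.

    `trial` is assumed to be a dict with keys: title, abstract, doi, and
    cf_abstract (the abstract with the primary-endpoint outcome minimally flipped).
    """
    return {
        "title_abstract": f"{trial['title']}\n\n{trial['abstract']}",
        "title_only": trial["title"],
        "doi_only": trial["doi"],
        "counterfactual": f"{trial['title']}\n\n{trial['cf_abstract']}",
        # Identifier-text conflict: edited evidence paired with the real DOI.
        "counterfactual_doi": f"DOI: {trial['doi']}\n{trial['title']}\n\n{trial['cf_abstract']}",
    }

def accuracy(preds, labels):
    """Fraction of predictions matching the labels.

    For the two counterfactual conditions, the labels passed in would be
    the inverted ones, as the abstract describes.
    """
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Minimal usage example with a dummy trial record.
trial = {"title": "Trial X", "abstract": "Primary endpoint met.",
         "doi": "10.1/x", "cf_abstract": "Primary endpoint not met."}
conds = build_conditions(trial)
```

Each of the five strings would then be sent to the model with the same instruction to emit a single met/not-met label, so any performance difference is attributable to the input condition alone.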

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                                  Papers in training set  Percentile  Probability
1     JCO Clinical Cancer Informatics                          18                      Top 0.1%    22.6%
2     Journal of the American Medical Informatics Association  61                      Top 0.1%    18.7%
3     The Lancet Digital Health                                25                      Top 0.1%    6.4%
4     npj Digital Medicine                                     97                      Top 0.9%    4.9%
      (50% of probability mass above)
5     Journal of Biomedical Informatics                        45                      Top 0.5%    3.1%
6     BMC Bioinformatics                                       383                     Top 3%      3.1%
7     Bioinformatics                                           1061                    Top 6%      2.5%
8     Artificial Intelligence in Medicine                      15                      Top 0.2%    1.9%
9     BMC Medical Research Methodology                         43                      Top 0.5%    1.8%
10    Scientific Reports                                       3102                    Top 58%     1.7%
11    International Journal of Medical Informatics             25                      Top 0.8%    1.7%
12    eBioMedicine                                             130                     Top 1%      1.7%
13    JMIR Medical Informatics                                 17                      Top 0.7%    1.7%
14    BMC Medical Informatics and Decision Making              39                      Top 1%      1.7%
15    Journal of Medical Internet Research                     85                      Top 3%      1.7%
16    PLOS ONE                                                 4510                    Top 58%     1.3%
17    BMC Medicine                                             163                     Top 5%      1.2%
18    JAMIA Open                                               37                      Top 1%      1.2%
19    Journal of Clinical Epidemiology                         28                      Top 0.4%    1.2%
20    PLOS Computational Biology                               1633                    Top 21%     1.0%
21    Nature Communications                                    4913                    Top 60%     0.9%
22    Cancer Medicine                                          24                      Top 1%      0.8%
23    BMJ Health & Care Informatics                            13                      Top 0.8%    0.8%
24    Frontiers in Artificial Intelligence                     18                      Top 0.8%    0.7%
25    Annals of Internal Medicine                              27                      Top 1%      0.7%
26    Frontiers in Digital Health                              20                      Top 1%      0.7%
27    Computer Methods and Programs in Biomedicine             27                      Top 1%      0.6%
28    European Respiratory Journal                             54                      Top 2%      0.6%
29    PLOS Digital Health                                      91                      Top 3%      0.5%
30    Patterns                                                 70                      Top 4%      0.5%
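The "50% of probability mass" cutoff in the ranking can be reproduced by accumulating the per-journal probabilities in rank order until the threshold is reached. A minimal sketch, using the top probabilities from the table above (the function name is an assumption, not part of the original tool):

```python
def journals_covering(mass_threshold, ranked_probs):
    """Return how many top-ranked journals are needed to reach the threshold.

    ranked_probs: probabilities (as fractions) in descending rank order.
    """
    total = 0.0
    for i, p in enumerate(ranked_probs, start=1):
        total += p
        if total >= mass_threshold:
            return i
    return len(ranked_probs)

# Top entries from the table above, as fractions.
probs = [0.226, 0.187, 0.064, 0.049, 0.031, 0.031, 0.025]
print(journals_covering(0.50, probs))  # the top 4 sum to 52.6%, so this prints 4
```

This matches the divider in the table: ranks 1-4 together hold 52.6% of the predicted probability mass, while ranks 1-3 hold only 47.7%.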