Do Large Language Models Read or Remember? Analyzing LLM Performance in Biomedical Text Mining With Progressive Content Removal and Counterfactual Results
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) can classify biomedical documents accurately, but strong performance does not prove they are using the supplied text rather than identifier-triggered parametric knowledge. We tested whether oncology trial-success classification reflects "reading" of abstract evidence or "remembering" of known trials.

Methods: We used a corpus of 250 two-arm oncology randomized controlled trials from seven major journals (2005-2023) and asked the flagship models of three commercial vendors (OpenAI, Google, and Anthropic) to output a single label indicating whether the primary endpoint was met. For each trial we created five deterministic inputs: title+abstract (baseline), title-only, DOI-only, counterfactual title+abstract with the primary endpoint outcome minimally flipped, and the same counterfactual title+abstract paired with the original DOI to induce an identifier-text conflict.

Results: With the full title+abstract, models achieved near-ceiling performance (accuracy and F1 score 0.96-0.97) and high format adherence (97.2-100%). Performance degraded stepwise with content removal (title-only accuracy and F1 score 0.79-0.88, DOI-only 0.63-0.67), consistent with an above-chance identifier-driven signal. Under counterfactual results, models followed the edited evidence (accuracy and F1 score 0.96-0.99 against the inverted labels). Adding the real DOI minimally affected GPT (accuracy and F1 score approximately 0.99) but modestly reduced Gemini (approximately 0.97) and Claude (approximately 0.95), mainly via lower sensitivity.

Conclusion: LLMs robustly track explicit endpoint statements in abstracts, yet identifiers can support above-chance predictions and occasionally compete with textual evidence. Progressive ablations plus counterfactual conflicts provide a practical, reproducible audit of grounding in biomedical LLM evaluations.
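The ablation protocol described in the Methods lends itself to a compact sketch. The Python outline below is an assumption of how the five input conditions and the accuracy/F1 scoring might be wired together; it is not the authors' code. The field names (title, abstract, abstract_flipped, doi, endpoint_met) and the classify() stub are hypothetical placeholders for the trial records and the vendor LLM calls.

    # Sketch of the progressive-ablation evaluation described in the abstract.
    # classify() stands in for a call to any LLM that returns a single label;
    # trial fields and helper names are illustrative, not from the paper.
    from sklearn.metrics import accuracy_score, f1_score

    def build_conditions(trial):
        """Return the five deterministic input variants for one trial."""
        return {
            "title_abstract": f"{trial['title']}\n\n{trial['abstract']}",
            "title_only": trial["title"],
            "doi_only": trial["doi"],
            # counterfactual: abstract with the primary-endpoint outcome minimally flipped
            "counterfactual": f"{trial['title']}\n\n{trial['abstract_flipped']}",
            # identifier-text conflict: flipped abstract paired with the original DOI
            "counterfactual_doi": f"DOI: {trial['doi']}\n{trial['title']}\n\n{trial['abstract_flipped']}",
        }

    def classify(text):
        """Placeholder for an LLM call returning 1 (primary endpoint met) or 0 (not met)."""
        raise NotImplementedError

    def evaluate(trials, condition, invert_labels=False):
        """Score one condition; counterfactual runs are scored against inverted labels."""
        y_true, y_pred = [], []
        for trial in trials:
            label = trial["endpoint_met"]
            y_true.append(1 - label if invert_labels else label)
            y_pred.append(classify(build_conditions(trial)[condition]))
        return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)

Under this reading, the "reading vs. remembering" contrast falls out of the scoring convention: the counterfactual conditions are evaluated with invert_labels=True, so a model that follows the edited text scores high, while one that recalls the real trial outcome scores low.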