Do Large Language Models Read or Remember? Analyzing LLM Performance in Biomedical Text Mining With Progressive Content Removal and Counterfactual Results
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) can classify biomedical documents accurately, but strong performance does not prove they are using the supplied text rather than identifier-triggered parametric knowledge. We tested whether oncology trial-success classification reflects "reading" of abstract evidence or "remembering" of known trials.

Methods: We used a corpus of 250 two-arm oncology randomized controlled trials from seven major journals (2005-2023) and asked the flagship models of three commercial vendors (OpenAI, Google, and Anthropic) to output a single label indicating whether the primary endpoint was met. For each trial we created five deterministic inputs: title+abstract (baseline), title-only, DOI-only, counterfactual title+abstract with the primary endpoint outcome minimally flipped, and the same counterfactual title+abstract paired with the original DOI to induce an identifier-text conflict.

Results: With the full title+abstract, models achieved near-ceiling performance (accuracy and F1 score 0.96-0.97) and high format adherence (97.2-100%). Performance degraded stepwise with content removal (title-only accuracy and F1 score 0.79-0.88, DOI-only 0.63-0.67), consistent with an above-chance identifier-driven signal. Under counterfactual results, models followed the edited evidence (accuracy and F1 score 0.96-0.99 against the inverted labels). Adding the real DOI minimally affected GPT (accuracy and F1 score approximately 0.99) but modestly reduced Gemini (approximately 0.97) and Claude (approximately 0.95), mainly via lower sensitivity.

Conclusion: LLMs robustly track explicit endpoint statements in abstracts, yet identifiers can support above-chance predictions and occasionally compete with textual evidence. Progressive ablations plus counterfactual conflicts provide a practical, reproducible audit of grounding in biomedical LLM evaluations.
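The ablation protocol described in the Methods lends itself to a compact sketch. The Python outline below is an assumption of how the five input conditions and the accuracy/F1 scoring might be wired together; it is not the authors' code. The field names (title, abstract, abstract_flipped, doi, endpoint_met) and the classify() stub are hypothetical placeholders for the trial records and the vendor LLM calls.

    # Sketch of the progressive-ablation evaluation described in the abstract.
    # classify() stands in for a call to any LLM that returns a single label;
    # trial fields and helper names are illustrative, not from the paper.
    from sklearn.metrics import accuracy_score, f1_score

    def build_conditions(trial):
        """Return the five deterministic input variants for one trial."""
        return {
            "title_abstract": f"{trial['title']}\n\n{trial['abstract']}",
            "title_only": trial["title"],
            "doi_only": trial["doi"],
            # counterfactual: abstract with the primary-endpoint outcome minimally flipped
            "counterfactual": f"{trial['title']}\n\n{trial['abstract_flipped']}",
            # identifier-text conflict: flipped abstract paired with the original DOI
            "counterfactual_doi": f"DOI: {trial['doi']}\n{trial['title']}\n\n{trial['abstract_flipped']}",
        }

    def classify(text):
        """Placeholder for an LLM call returning 1 (primary endpoint met) or 0 (not met)."""
        raise NotImplementedError

    def evaluate(trials, condition, invert_labels=False):
        """Score one condition; counterfactual runs are scored against inverted labels."""
        y_true, y_pred = [], []
        for trial in trials:
            label = trial["endpoint_met"]
            y_true.append(1 - label if invert_labels else label)
            y_pred.append(classify(build_conditions(trial)[condition]))
        return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)

Under this reading, the "reading vs. remembering" contrast falls out of the scoring convention: the counterfactual conditions are evaluated with invert_labels=True, so a model that follows the edited text scores high, while one that recalls the real trial outcome scores low.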