Enhancing Medical Knowledge in Large Language Models via Supervised Continued Pretraining on Clinical Notes
Weissenbacher, D.; Shabbir, M.; Campbell, I. M.; Berdahl, C. T.; Gonzalez-Hernandez, G.
Background: Large language models (LLMs) contain limited professional medical knowledge, as large-scale training on clinical text has not yet been possible due to restricted access. Objectives: To continue pre-training an open-access instruct LLM on de-identified medical notes and evaluate the resulting impact on real-world clinical decision-making tasks and standard benchmarks. Methods: Using 500K de-identified clinical notes from Cedars-Sinai Health System, we fine-tuned a Qwen3-4B Instruct model with supervised learning to generate medical decision-making (MDM) paragraphs from patient presentations, and evaluated it on assigned-diagnosis prediction, in-hospital cardiac-arrest mention detection, and a suite of general and biomedical benchmarks. Results: The fine-tuned model produced MDMs that closely resembled those written by physicians and outperformed the base instruct model and larger, clinically untrained models (Qwen3-32B and Llama-3.1-405B Instruct) on assigned-diagnosis prediction, the task most aligned with its training objective. On detecting in-hospital cardiac-arrest mentions, the model initially exhibited mild label collapse, but a brief task-specific fine-tuning stage resolved the issue and allowed it to surpass all competitors. The model also broadly retained general knowledge relative to the baseline on biomedical and general-domain evaluation benchmarks. Conclusion: Supervised full fine-tuning on clinical notes allowed the model to incorporate medical knowledge and transfer it to unseen biomedical tasks without wholesale loss of general-domain abilities, while revealing collapse-related failure modes that motivate more principled strategies for clinical specialization.
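The methods describe supervised continued pretraining in which an instruct model learns to generate MDM paragraphs from patient presentations. The sketch below shows what such a pipeline could look like using Hugging Face TRL; the dataset file, field names ("presentation", "mdm"), prompt wording, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of supervised fine-tuning an instruct model to generate
# MDM paragraphs, assuming a hypothetical JSONL file with one record per
# de-identified note: a "presentation" field (patient presentation) and an
# "mdm" field (the physician-written target paragraph).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"  # assumed open-access checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")

dataset = load_dataset("json", data_files="deidentified_notes.jsonl", split="train")

def to_chat(example):
    # Frame each note as an instruction-following pair so the instruct model
    # learns to produce the MDM paragraph from the presentation.
    return {
        "messages": [
            {"role": "user",
             "content": "Write the medical decision-making paragraph for the "
                        "following patient presentation:\n\n" + example["presentation"]},
            {"role": "assistant", "content": example["mdm"]},
        ]
    }

dataset = dataset.map(to_chat, remove_columns=dataset.column_names)

config = SFTConfig(
    output_dir="qwen3-4b-mdm-sft",
    num_train_epochs=1,             # assumed; the abstract does not state epochs
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,             # typical full fine-tuning rate, assumed
    bf16=True,
    logging_steps=50,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Framing continued pretraining as supervised generation of MDM paragraphs, rather than raw next-token prediction over whole notes, is what lets the same pipeline be evaluated directly on the downstream assigned-diagnosis prediction task.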