A Longitudinal Clinical Foundation Model on Nationwide Veteran Health Trajectories
Zamora-Resendiz, R.; Yin, J.; Kimbrel, N. A.; Beckham, J. C.; Crivelli, S.
We present VA-LLM, a 1.62-billion-parameter autoregressive transformer pre-trained from scratch on 1.74 trillion tokens of clinical text spanning 22 years of care for 13.8 million patients in the Veterans Health Administration, with mortality outcomes confirmed through the National Death Index for 7.8 million patients. In a retrospective-prospective evaluation on 107,555 withheld patients, VA-LLM achieved higher 5-year AUPRC than Llama-2 (7 billion parameters), BioGPT-Large (1.57 billion parameters), and GatorTron (3.91 billion parameters), matching GatorTron's 100,000-patient performance with only 10,000 labeled patients. In a clinical validation against the VA's operational Care Assessment Need (CAN) score on 5.5 million patients one year beyond the pre-training corpus, VA-LLM achieved a 90-day mortality AUROC of 90.00% versus 87.74% (p < 0.001) and a 45% relative improvement in AUPRC; post-hoc recalibration recovered calibration comparable to CAN (Brier 0.0091 versus 0.0093) without sacrificing discrimination. Across 21 pre-training checkpoints, discriminative performance correlated more strongly with cumulative mortality experience (CME), the total person-years contributed by patients with confirmed deaths, than with token count (ΔR² = 0.15; Williams p < 10⁻⁶). Performance plateaued once marginal cohorts added fewer confirmed deaths, even as pre-training loss continued to decrease. These findings suggest that the clinical composition of pre-training data, particularly the completeness of documented patient trajectories, correlates with predictive performance more closely than corpus size alone.
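As a rough illustration of the CME definition in the abstract (total person-years contributed by patients with confirmed deaths), the following minimal Python sketch sums observation time over decedents only. The `Patient` record, its field names, and the toy cohort are hypothetical and are not taken from the paper; how the authors actually measure person-years per patient is not stated here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Patient:
    # Hypothetical fields, chosen for illustration only.
    years_observed: float   # person-years of records contributed to the corpus
    death_confirmed: bool   # True if death was confirmed (e.g. via a death index)


def cumulative_mortality_experience(patients: Iterable[Patient]) -> float:
    """Sum person-years over patients with a confirmed death."""
    return sum(p.years_observed for p in patients if p.death_confirmed)


# Toy cohort: only the first and third patients count toward CME.
cohort = [
    Patient(years_observed=10.0, death_confirmed=True),
    Patient(years_observed=4.5, death_confirmed=False),
    Patient(years_observed=2.5, death_confirmed=True),
]
cme = cumulative_mortality_experience(cohort)
```

Under this reading, adding patients without confirmed deaths increases token count but leaves CME unchanged, which is the distinction the checkpoint analysis draws.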