Automated Extraction of Mortality Information from Publicly Available Sources Using Language Models
Al-Garadi, M. A.; LeNoue-Newton, M.; Matheny, M. E.; McPheeters, M.; Whitaker, J. M.; Deere, J. A.; McLemore, M. F.; Westerman, D.; Khan, M. S.; Hernandez-Munoz, J. J.; Wang, X.; Kuzucan, A.; Desai, R. J.; Reeves, R.
Background: Mortality is a critical variable in healthcare research, especially for evaluating medical product safety and effectiveness. However, inconsistencies in the availability and timeliness of death date and cause of death (CoD) information present significant challenges. Conventional sources such as the National Death Index (NDI) and electronic health records (EHRs) often suffer from data lags, missing fields, or incomplete coverage, limiting their utility in time-sensitive or large-scale studies. With the growing use of social media, crowdfunding platforms, and online memorials, publicly available digital content has emerged as a potential supplementary source for mortality surveillance. Despite this potential, accurate tools for extracting mortality information from such unstructured data sources remain underdeveloped.

Objective: To develop scalable approaches using natural language processing (NLP) and large language models (LLMs) for extracting mortality information from publicly available online data sources, including social media platforms, crowdfunding websites, and online obituaries, and to evaluate their performance across these sources.

Methods: Data were collected from public posts on X (formerly Twitter), GoFundMe campaigns, memorial websites (EverLoved.com and TributeArchive.com), and online obituaries from 2015 to 2022, focusing on U.S.-based content relevant to mortality. We developed an NLP pipeline using transformer-based models to extract key mortality information such as decedent names, dates of birth, and dates of death. We then employed a few-shot learning (FSL) approach with LLMs to identify primary and secondary causes of death. Model performance was assessed using precision, recall, F1-score, and accuracy, with human-annotated labels serving as the reference standard for the transformer-based model, and a human adjudicator blinded to the labeling source serving as the reference standard for the FSL model.
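The few-shot CoD identification step described in Methods can be sketched as prompt construction: a handful of labeled text-label pairs followed by the target passage. This is a minimal illustration, not the study's actual prompt; the example texts, labels, and instruction wording are all hypothetical.

```python
# Hypothetical few-shot examples; illustrative only, not drawn from the study's data.
FEW_SHOT_EXAMPLES = [
    ("He passed away peacefully after a long battle with lung cancer.",
     "lung cancer"),
    ("She died suddenly of a heart attack at home.",
     "myocardial infarction"),
]

def build_cod_prompt(text: str) -> str:
    """Assemble a few-shot prompt asking an LLM for the primary cause of death."""
    lines = [
        "Identify the primary cause of death in the text.",
        "Answer with a short phrase, or 'unknown' if no cause is stated.",
        "",
    ]
    # Prepend the labeled demonstrations before the unlabeled target.
    for example_text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Text: {example_text}")
        lines.append(f"Primary cause of death: {label}")
        lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Primary cause of death:")
    return "\n".join(lines)
```

The assembled string would then be sent to an LLM completion endpoint; the model's short-phrase continuation serves as the predicted primary CoD.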
Results: The best-performing model obtained a micro-averaged F1-score of 0.88 (95% CI, 0.86-0.90) in extracting mortality information. The FSL-LLM approach demonstrated high accuracy in identifying primary CoD across various online sources. For GoFundMe, the FSL-LLM achieved 95.9% accuracy for primary cause identification, compared to 97.9% for human annotators. In obituaries, FSL-LLM accuracy was 96.5% for primary causes, while human accuracy was 99.0%. For memorial websites, FSL-LLM achieved 98.0% accuracy for primary causes, with human accuracy at 99.5%.

Conclusions: This study demonstrates the feasibility of using advanced NLP and LLM techniques to extract mortality data from publicly available online sources. These methods can significantly enhance the timeliness, completeness, and granularity of mortality surveillance, offering a valuable complement to traditional data systems. By enabling earlier detection of mortality signals and improving CoD classification across large populations, this approach may support more responsive public health monitoring and medical product safety assessments. Further work is needed to validate these findings in real-world healthcare settings and facilitate the integration of digital data sources into national public health surveillance systems.
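The micro-averaged F1-score reported above pools true positives, false positives, and false negatives across all extracted field types before computing precision and recall, so frequent fields weigh more than rare ones. A minimal sketch of that computation, assuming predictions and gold labels are represented as (document, field, value) tuples (a representation chosen here for illustration, not specified in the abstract):

```python
def micro_f1(gold: list, pred: list) -> float:
    """Micro-averaged F1: pool TP/FP/FN across all label types,
    then compute a single precision/recall pair."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # exact matches
    fp = len(pred_set - gold_set)   # spurious predictions
    fn = len(gold_set - pred_set)   # missed gold items
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one correct field, one wrong date -> precision = recall = F1 = 0.5.
gold = [("doc1", "name", "John Doe"), ("doc1", "date_of_death", "2020-01-01")]
pred = [("doc1", "name", "John Doe"), ("doc1", "date_of_death", "2020-02-02")]
```

Macro-averaging would instead compute F1 per field type and average the results, giving rare fields equal weight; the abstract reports the micro variant.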