MedSDoH: A Rule-Based System for Extracting Social Determinants of Health from Multi-site EHRs Based on the OHNLP Framework
Ahn, J.; Fu, S.; Palacios, D. M.; Jeong, H.-H.; Wang, L.; Swartz, M. C.; Tosur, M.; Redondo, M. J.; Wu, X.; Yue, Z.; Kakadiaris, A.; Wang, N.; Li, Z.; Huang, M.; Wen, A.; Harris, D.; Wang, Y.; Kwak, M. J.; Liu, Z.; Liu, H.
Objective: Social Determinants of Health (SDoH) are critical to patient care and population health. Despite their importance, SDoH information is frequently embedded within unstructured clinical text such as patient-reported information or social worker notes, which limits its use in clinical decision-making and resource allocation. Although transformer-based models represent the current state of the art, their scalability, computational requirements, and limited transparency pose barriers to large-scale multi-site clinical implementation. In this context, rule-based NLP systems remain valuable, particularly when explainability, reproducibility, and rapid customization are essential. Methods: MedSDoH was developed within the Open Health Natural Language Processing (OHNLP) Framework using literature-derived SDoH resources, standardized domain definitions, and expert-curated rulesets. Large language models (LLMs) were used during development to assist with rule generation and lexicon expansion. Rules were iteratively refined against a gold-standard annotated corpus from two health systems and then evaluated on independent datasets. Results: The final system included 942 regular expression rules spanning 22 SDoH domains. In validation on two external datasets, MedSDoH demonstrated generalizability and comparable performance across sites. The system has been made publicly available so that the research community can collaboratively contribute to its maintenance and extension through disease- or site-specific adaptations. Conclusion: MedSDoH is a computationally efficient, open-source system for large-scale SDoH extraction from clinical text. It is well-suited for multi-site adaptation and deployment in resource-constrained settings.
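To make the rule-based approach concrete, the following is a minimal sketch of how regular expression rules grouped by SDoH domain might be applied to a clinical note. It is an illustration only: the domain names, patterns, and the `extract` function are hypothetical, are not taken from the MedSDoH ruleset, and do not use the OHNLP Framework's own tooling. It relies solely on Python's standard `re` module.

```python
import re
from dataclasses import dataclass

# Hypothetical example rules, keyed by SDoH domain.
# These are NOT the MedSDoH rules; they only illustrate the idea.
RULES = {
    "housing_instability": [
        r"\bhomeless(?:ness)?\b",
        r"\bliv(?:es|ing) in (?:a |an )?(?:shelter|car)\b",
    ],
    "food_insecurity": [
        r"\bfood (?:insecurity|stamps|pantry)\b",
    ],
    "tobacco_use": [
        r"\b(?:current|former) smoker\b",
    ],
}

# Compile once so the same rules can be reused across many notes.
COMPILED = {
    domain: [re.compile(p, re.IGNORECASE) for p in patterns]
    for domain, patterns in RULES.items()
}


@dataclass(frozen=True)
class Mention:
    domain: str  # SDoH domain the rule belongs to
    text: str    # matched span of the note
    start: int   # character offset where the match begins
    end: int     # character offset where the match ends
    rule: str    # pattern that fired, kept for explainability


def extract(note: str) -> list[Mention]:
    """Return every rule match in the note, ordered by position."""
    mentions = []
    for domain, patterns in COMPILED.items():
        for pattern in patterns:
            for m in pattern.finditer(note):
                mentions.append(
                    Mention(domain, m.group(0), m.start(), m.end(), pattern.pattern)
                )
    return sorted(mentions, key=lambda x: x.start)


note = "Patient is a former smoker, currently homeless and relies on a food pantry."
for mention in extract(note):
    print(mention.domain, "|", mention.text, "|", mention.rule)
```

Because each mention records the pattern that produced it, every extraction can be traced back to a specific rule, which is the kind of transparency the abstract attributes to rule-based systems. This sketch does not handle negation, temporality, or section context, which a production system would need to address.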
Matching journals
The top 2 journals account for 50% of the predicted probability mass.