dgiLIT: A Method for Prioritization and AI Curation of Drug-Gene Interactions
Cannon, M. J.; Bratulin, A.; Stevenson, J. S.; Perry, K.; Coffman, A.; Kiwala, S.; Schimmelpfennig, L.; Costello, H.; McMichael, J. F.; Griffith, M.; Griffith, O. L.; Wagner, A. H.
IMPORTANCE: The Drug-Gene Interaction Database (DGIdb) has a long history of driving hypothesis generation for biomedical research through the careful curation of drug-gene interaction data from primary and secondary sources with supporting literature. Recent advances in large language model (LLM) and artificial intelligence (AI) technologies have enabled new paradigms for knowledge extraction and biocuration. The accelerating growth of biomedical literature presents a significant challenge for maintaining up-to-date interaction data. With more than 38 million citations indexed in PubMed alone, new strategies are needed to identify new interaction data and incorporate it into DGIdb.

OBJECTIVE: Identify cost-effective AI curation strategies for incorporating new drug-gene interactions into DGIdb.

METHODS: We present a methodology that leverages deterministic natural language processing techniques, existing harmonization frameworks, and AI-assisted curation to systematically narrow the literature space and identify new drug-gene interactions from published studies for inclusion in DGIdb.

RESULTS: We demonstrate the use of lemmatization to prioritize a set of 100 abstracts containing a high number of interaction words for downstream AI curation. From this set of abstracts, an AI curation task identified 137 drug-gene interactions, of which 121 (88.3%) were completely novel to DGIdb. A human expert evaluator reviewed this interaction set and confirmed 134 of 137 (97.8%) interactions as valid based on the text provided.

CONCLUSION: Taken together, our results highlight a promising, cost-effective method of ingesting new interactions into DGIdb.
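The prioritization step described in the results, ranking abstracts by how many interaction words they contain after lemmatization, can be illustrated with a minimal sketch. This is not the authors' implementation: the interaction lemma list, the suffix-stripping rule (a crude stand-in for a real lemmatizer), and the function names `crude_lemma`, `interaction_score`, and `prioritize` are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical lemma list; the abstract does not enumerate its interaction words.
INTERACTION_LEMMAS = {"inhibit", "activate", "bind", "target", "block"}

# Assumed suffix rules standing in for a proper lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_lemma(token: str) -> str:
    """Map an inflected token to a known interaction lemma, if one matches."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            stem = token[: -len(suffix)]
            # Also try restoring a dropped final "e" ("activated" -> "activate").
            for candidate in (stem, stem + "e"):
                if candidate in INTERACTION_LEMMAS:
                    return candidate
    return token


def interaction_score(abstract: str) -> int:
    """Count tokens whose crude lemma is an interaction word."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return sum(1 for t in tokens if crude_lemma(t) in INTERACTION_LEMMAS)


def prioritize(abstracts: Dict[str, str], top_n: int) -> List[str]:
    """Return the IDs of the top_n abstracts, highest interaction score first."""
    ranked = sorted(
        abstracts, key=lambda k: interaction_score(abstracts[k]), reverse=True
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Toy abstracts keyed by made-up IDs, for demonstration only.
    demo = {
        "A": "Imatinib inhibits ABL1 and blocks KIT signaling by binding the kinase domain.",
        "B": "We surveyed patient outcomes across three hospitals.",
        "C": "Gefitinib targets EGFR.",
    }
    print(prioritize(demo, top_n=2))
```

In practice a dictionary-based lemmatizer from an NLP library would replace `crude_lemma`, and the top-ranked abstracts (100 in the study) would then be passed to the AI curation step.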