dgiLIT: A Method for Prioritization and AI Curation of Drug-Gene Interactions

Cannon, M. J.; Bratulin, A.; Stevenson, J. S.; Perry, K.; Coffman, A.; Kiwala, S.; Schimmelpfennig, L.; Costello, H.; McMichael, J. F.; Griffith, M.; Griffith, O. L.; Wagner, A. H.

2026-01-19 bioinformatics
10.64898/2026.01.16.699733 bioRxiv

IMPORTANCE: The Drug-Gene Interaction Database (DGIdb) has a long history of driving hypothesis generation for biomedical research through the careful curation of drug-gene interaction data from primary and secondary sources with supporting literature. Recent advances in large language model (LLM) and artificial intelligence (AI) technologies have enabled new paradigms for knowledge extraction and biocuration. The accelerating growth of biomedical literature presents a significant challenge for maintaining up-to-date interaction data. With more than 38 million citations indexed in PubMed alone, new strategies are needed to identify and incorporate new interaction data into DGIdb.

OBJECTIVE: Identify new cost-effective AI curation strategies for incorporating new drug-gene interactions into DGIdb.

METHODS: We present a methodology that leverages deterministic natural language processing techniques, existing harmonization frameworks, and AI-assisted curation to systematically narrow the literature space and identify new drug-gene interactions from published studies for inclusion in DGIdb.

RESULTS: We demonstrate the use of lemmatization to prioritize a set of 100 abstracts containing high counts of interaction words for downstream AI curation. From this set of abstracts, an AI curation task identified 137 drug-gene interactions, of which 121 (88.3%) were novel to DGIdb. A human expert evaluator reviewed this interaction set and confirmed 134 of 137 (97.8%) interactions as valid based on the text provided.

CONCLUSION: Taken together, our results highlight a promising, cost-effective method of ingesting new interactions into DGIdb.
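The prioritization step described in the abstract (lemmatize each abstract, count interaction words, keep the highest-scoring abstracts) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the abstract does not state which lemmatizer or interaction vocabulary the authors used, so `LEMMA_TABLE`, `INTERACTION_LEMMAS`, the function names, and the toy abstracts below are hypothetical stand-ins, not the dgiLIT implementation.

```python
import re

# Hypothetical lookup lemmatizer: maps inflected forms to a base form.
# A real pipeline would likely use a full lemmatizer; this table is illustrative only.
LEMMA_TABLE = {
    "inhibits": "inhibit", "inhibited": "inhibit", "inhibiting": "inhibit",
    "binds": "bind", "bound": "bind", "binding": "bind",
    "activates": "activate", "activated": "activate", "activating": "activate",
    "targets": "target", "targeted": "target", "targeting": "target",
}

# Hypothetical interaction vocabulary (assumption; not taken from the paper).
INTERACTION_LEMMAS = {"inhibit", "bind", "activate", "target"}


def lemmatize(token):
    """Return the base form of a token, or the token itself if unknown."""
    return LEMMA_TABLE.get(token, token)


def interaction_score(abstract):
    """Count tokens whose lemma is in the interaction vocabulary."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return sum(1 for token in tokens if lemmatize(token) in INTERACTION_LEMMAS)


def prioritize(abstracts, top_n):
    """Rank abstracts by interaction score (descending) and keep the top_n."""
    scored = [(key, interaction_score(text)) for key, text in abstracts.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy abstracts with made-up identifiers, for demonstration only.
abstracts = {
    "A": "Drug X inhibits kinase Y and binds its active site, targeting tumor growth.",
    "B": "We surveyed patient outcomes across three hospitals.",
    "C": "Compound Z activated receptor W.",
}

top = prioritize(abstracts, top_n=2)
print(top)
```

In this sketch the abstracts that mention more interaction verbs rise to the top, so only those would be passed on to the more expensive AI curation step.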

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics (1061 papers in training set; Top 1%): 18.7%
2. Database (51 papers in training set; Top 0.1%): 10.1%
3. PLOS ONE (4510 papers in training set; Top 19%): 10.1%
4. BMC Bioinformatics (383 papers in training set; Top 1%): 9.1%
5. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.5%): 6.3%
50% of probability mass above
6. BioData Mining (15 papers in training set; Top 0.1%): 4.8%
7. Scientific Reports (3102 papers in training set; Top 37%): 3.6%
8. Computational and Structural Biotechnology Journal (216 papers in training set; Top 3%): 2.6%
9. Journal of Biomedical Informatics (45 papers in training set; Top 0.6%): 2.4%
10. BMC Medical Informatics and Decision Making (39 papers in training set; Top 1%): 2.4%
11. Bioinformatics Advances (184 papers in training set; Top 2%): 2.1%
12. Scientific Data (174 papers in training set; Top 0.9%): 1.9%
13. GigaScience (172 papers in training set; Top 1%): 1.7%
14. Computers in Biology and Medicine (120 papers in training set; Top 3%): 1.3%
15. iScience (1063 papers in training set; Top 25%): 0.9%
16. JAMIA Open (37 papers in training set; Top 1%): 0.8%
17. Nucleic Acids Research (1128 papers in training set; Top 16%): 0.8%
18. PLOS Computational Biology (1633 papers in training set; Top 23%): 0.8%
19. PeerJ (261 papers in training set; Top 14%): 0.8%
20. Artificial Intelligence in Medicine (15 papers in training set; Top 0.7%): 0.7%
21. Research Synthesis Methods (20 papers in training set; Top 0.2%): 0.7%
22. Proceedings of the National Academy of Sciences (2130 papers in training set; Top 46%): 0.7%
23. BMC Medical Genomics (36 papers in training set; Top 2%): 0.6%
24. Cureus (67 papers in training set; Top 6%): 0.6%
25. Nature Machine Intelligence (61 papers in training set; Top 4%): 0.6%
26. BMJ Health & Care Informatics (13 papers in training set; Top 1%): 0.6%
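
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The short Python sketch below does this for the first six entries; the function name `journals_to_reach` is a hypothetical helper, and the values are copied from the listing above.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the six highest-ranked journals,
# copied from the listing above in rank order.
probabilities = [18.7, 10.1, 10.1, 9.1, 6.3, 4.8]


def journals_to_reach(probs, threshold):
    """Return how many top-ranked journals are needed for the cumulative
    probability to reach the threshold, or None if it is never reached."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None


needed = journals_to_reach(probabilities, 50.0)
cumulative = sum(probabilities[:needed])
print(needed, round(cumulative, 1))
```

The cumulative total first crosses 50% at the fifth journal, where it is about 54.3%, which is consistent with the cut-off marked in the listing.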