
DxFit: An ensemble method for identifying EHR diagnoses consistent with a molecular finding

Torene, R. I.; Meltz Murphy, K.; Brandt, T.; Retterer, K.

2026-04-28 genomics
10.64898/2026.04.24.720629 bioRxiv

As population DNA sequencing becomes more common, genomic-first approaches are increasingly used to identify individuals with possible rare genetic disorders. To accurately estimate prevalence and penetrance, these studies often confirm manifestation of the disorder using electronic health records (EHRs). Multiple strategies exist to search the EHR for diagnoses of rare disorders; however, each has its limitations. We have developed a portable ensemble tool, DxFit, that mines EHR data (ICD codes and structured diagnosis descriptions from billing code and problem list tables) for a diagnosis consistent with a given rare genetic disorder. DxFit combines evidence across four strategies: (1) gene name searches in diagnosis descriptions and notes, (2) ICD conversion to the Mondo rare disorder ontology to find exact and nearby matches, (3) word embedding similarity searches, and (4) Jaccard similarity matches. DxFit prioritizes the match type and outputs the most confident match for each participant-disorder pair. On a cohort of 350 participants with a known positive result from diagnostic genetic testing for developmental disorders, DxFit had a sensitivity of 88.7% and a specificity of 86.2% using default parameters. Lowering the linguistic scoring thresholds from 0.8 to 0.7 and allowing for synonymous matches yielded a sensitivity of 92.7% and a specificity of 84.5%. Partitioning EHR evidence into windows before and after genetic testing demonstrates, as expected, that overall DxFit match rates increase after testing and that the match types become more confident. DxFit is available to the public and has extensive customization options to support a wide range of uses.

[Graphical abstract: Figure 1]
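To make strategy (4) and the effect of the scoring threshold concrete, here is a minimal sketch of a Jaccard similarity match between a disorder name and EHR diagnosis descriptions. It is not DxFit's implementation: the function names (`tokenize`, `jaccard`, `best_match`), the word-level tokenization, and the example strings are assumptions made for illustration, and the abstract does not say that the 0.8 and 0.7 thresholds are applied in exactly this way.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a description and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the two word-token sets."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0  # two empty descriptions carry no evidence of a match
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def best_match(disorder_name: str, descriptions: list, threshold: float = 0.8):
    """Return the (description, score) pair with the highest score that is
    at or above the threshold, or None if no description qualifies."""
    scored = [(d, jaccard(disorder_name, d)) for d in descriptions]
    passing = [(d, s) for d, s in scored if s >= threshold]
    return max(passing, key=lambda pair: pair[1]) if passing else None


# Hypothetical disorder name and diagnosis descriptions, for illustration only.
disorder = "Coffin-Siris syndrome 1"
ehr_descriptions = ["Coffin-Siris syndrome", "Seizure disorder"]

print(best_match(disorder, ehr_descriptions, threshold=0.8))
print(best_match(disorder, ehr_descriptions, threshold=0.7))
```

In this toy case the description shares three of four distinct tokens with the disorder name (a score of 0.75), so it is rejected at a threshold of 0.8 and accepted at 0.7. This is the kind of trade-off the abstract reports: a looser threshold admits more matches, which raises sensitivity and can lower specificity.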

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Genome Medicine | 154 papers in training set | Top 0.3% | 14.1%
2. Bioinformatics Advances | 184 papers in training set | Top 0.2% | 9.9%
3. Genetics in Medicine | 69 papers in training set | Top 0.2% | 9.9%
4. Bioinformatics | 1061 papers in training set | Top 3% | 9.0%
5. The American Journal of Human Genetics | 206 papers in training set | Top 0.8% | 6.2%
6. Human Mutation | 29 papers in training set | Top 0.2% | 3.9%

50% of probability mass above

7. Cell Genomics | 162 papers in training set | Top 2% | 3.5%
8. BioData Mining | 15 papers in training set | Top 0.1% | 2.8%
9. BMC Bioinformatics | 383 papers in training set | Top 3% | 2.7%
10. BMC Medical Genomics | 36 papers in training set | Top 0.2% | 2.7%
11. GigaScience | 172 papers in training set | Top 1.0% | 2.1%
12. Nucleic Acids Research | 1128 papers in training set | Top 9% | 2.0%
13. Genome Research | 409 papers in training set | Top 2% | 1.8%
14. PLOS ONE | 4510 papers in training set | Top 52% | 1.7%
15. Scientific Reports | 3102 papers in training set | Top 59% | 1.7%
16. European Journal of Human Genetics | 49 papers in training set | Top 0.7% | 1.6%
17. Genetic Epidemiology | 46 papers in training set | Top 0.5% | 1.3%
18. Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.3%
19. npj Genomic Medicine | 33 papers in training set | Top 0.6% | 1.2%
20. Frontiers in Genetics | 197 papers in training set | Top 8% | 0.9%
21. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 9% | 0.8%
22. Nature Communications | 4913 papers in training set | Top 62% | 0.8%
23. Nature Computational Science | 50 papers in training set | Top 2% | 0.8%
24. Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 0.8%
25. Cell Reports Methods | 141 papers in training set | Top 5% | 0.7%
26. Nature Biotechnology | 147 papers in training set | Top 8% | 0.7%
27. BMC Genomics | 328 papers in training set | Top 6% | 0.7%
28. Genome Biology | 555 papers in training set | Top 8% | 0.7%
29. eLife | 5422 papers in training set | Top 62% | 0.6%
30. PLOS Computational Biology | 1633 papers in training set | Top 28% | 0.6%