Systematic assessment of machine learning-based variant annotation methods for rare variant association testing

Aguirre, M.; Irudayanathan, F. J.; Crow, M.; Hejase, H. A.; Menon, V. K.; Pendergrass, R. K.; McCarthy, M. I.; Fletez-Brant, K.

2026-03-20 bioinformatics
10.64898/2026.03.18.712715 bioRxiv

Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods (CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA) using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8-5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
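
The abstract does not say how the Wasserstein-based calibration measure is computed. As a rough illustration only, one way to score calibration distributionally is to take the 1-Wasserstein distance between a test's observed p-values and the Uniform(0, 1) distribution expected under the null. The sketch below assumes that interpretation; the function name `wasserstein_to_uniform` and the toy p-values are hypothetical and are not taken from the paper.

```python
def wasserstein_to_uniform(p_values):
    """Exact 1-Wasserstein distance between the empirical distribution of
    p_values and Uniform(0, 1).

    Assumption (not from the paper): calibration is scored as the distance
    of observed p-values from the null Uniform(0, 1) distribution.
    """
    xs = sorted(p_values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one p-value")

    total = 0.0
    for i, x in enumerate(xs):
        # The empirical quantile function equals x on (i/n, (i+1)/n];
        # the uniform quantile function is the identity u.
        a, b = i / n, (i + 1) / n
        if x <= a:
            # x lies below the whole interval: integrate (u - x).
            total += (b - a) * ((a + b) / 2 - x)
        elif x >= b:
            # x lies above the whole interval: integrate (x - u).
            total += (b - a) * (x - (a + b) / 2)
        else:
            # x falls inside the interval: two triangles.
            total += ((x - a) ** 2 + (b - x) ** 2) / 2
    return total


# Toy data: evenly spread p-values versus p-values skewed toward zero.
calibrated = [(i + 0.5) / 100 for i in range(100)]
skewed = [p ** 2 for p in calibrated]

d_calibrated = wasserstein_to_uniform(calibrated)
d_skewed = wasserstein_to_uniform(skewed)
```

Under this reading, a smaller distance means the p-value distribution sits closer to its null expectation, so `d_calibrated` comes out well below `d_skewed`. The study's actual framework may compare different distributions or use a different transport cost.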

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1 | The American Journal of Human Genetics | 206 papers in training set | Top 0.1% | 34.7%
2 | Genome Medicine | 154 papers in training set | Top 0.6% | 8.5%
3 | Bioinformatics | 1061 papers in training set | Top 4% | 6.4%
4 | Genetic Epidemiology | 46 papers in training set | Top 0.2% | 4.2%
50% of probability mass above
5 | Nature Communications | 4913 papers in training set | Top 39% | 3.6%
6 | PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.6%
7 | Scientific Reports | 3102 papers in training set | Top 41% | 3.1%
8 | Cell Genomics | 162 papers in training set | Top 2% | 2.9%
9 | BMC Bioinformatics | 383 papers in training set | Top 3% | 2.9%
10 | Briefings in Bioinformatics | 326 papers in training set | Top 2% | 2.9%
11 | PLOS ONE | 4510 papers in training set | Top 47% | 2.1%
12 | BioData Mining | 15 papers in training set | Top 0.2% | 1.9%
13 | European Journal of Human Genetics | 49 papers in training set | Top 0.5% | 1.9%
14 | Human Genetics | 25 papers in training set | Top 0.2% | 1.7%
15 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 2% | 1.3%
16 | PLOS Genetics | 756 papers in training set | Top 11% | 1.2%
17 | Genome Biology | 555 papers in training set | Top 6% | 1.0%
18 | BMC Medical Genomics | 36 papers in training set | Top 1.0% | 0.9%
19 | Cell Systems | 167 papers in training set | Top 10% | 0.9%
20 | npj Genomic Medicine | 33 papers in training set | Top 0.8% | 0.8%
21 | Nature Genetics | 240 papers in training set | Top 7% | 0.8%
22 | Frontiers in Genetics | 197 papers in training set | Top 9% | 0.8%
23 | BMC Genomics | 328 papers in training set | Top 6% | 0.7%
24 | Communications Biology | 886 papers in training set | Top 26% | 0.7%
25 | Nucleic Acids Research | 1128 papers in training set | Top 19% | 0.7%
26 | Genetics | 225 papers in training set | Top 5% | 0.7%
27 | Human Genetics and Genomics Advances | 70 papers in training set | Top 1% | 0.5%