Machine learning methodology using a masked neural network for robust genetic risk score calculation from noisy and missing data

Squires, S.; Weedon, M. N.; Oram, R. A.

2026-05-20 genetic and genomic medicine
10.64898/2026.05.18.25341725 medRxiv

Purpose: Genetic risk scores (GRSs) are summaries of genetic data that can improve prediction of disease risk and progression. GRSs are increasingly available but rely on high-quality input data to produce good results; with noisy or missing inputs the GRS may be inaccurate. We aimed to develop a method that produces a robust estimate of the GRS when input data are missing, noisy or both. Approach: We developed a neural network approach, named masked-MLP, for robust GRS calculation, trained on a set of GRSs calculated on clean data. The masked-MLP includes additional input data and has noise inserted during training, both of which make the model more robust. Results: A GRS for type 1 diabetes (T1D) calculated on input data with 10% of the data corrupted had a Spearman rank correlation to the clean GRS of 0.669 (0.665-0.674), while the equivalent for the masked-MLP was 0.951 (0.950-0.952). For the same data, the area under the receiver operating characteristic curve for separation of T1D from population samples fell from 0.919 (0.904-0.932) to 0.808 (0.787-0.827) for the GRS, while for the masked-MLP it fell only to 0.910 (0.895-0.924). Conclusions: The masked-MLP was more robust to noise when calculating a GRS than the standard approach. Our approach has the potential to improve both research and clinical outcomes through more reliable GRS calculation.
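
The degradation described in the Results can be illustrated with a minimal sketch. It does not implement the masked-MLP and does not use the authors' data: it assumes a GRS is a weighted sum of allele dosages (0, 1 or 2), corrupts about 10% of the inputs by replacing them with random dosages, and compares the corrupted score with the clean one using a Spearman rank correlation. The data are synthetic, and the function names, the corruption scheme and the SNP weights are all assumptions made for illustration, so the resulting correlation is not comparable to the figures reported above.

```python
import numpy as np

def weighted_grs(dosages, weights):
    """Weighted-sum GRS: one score per sample (assumed form of a standard GRS)."""
    return dosages @ weights

def corrupt(dosages, fraction, rng):
    """Replace a random fraction of entries with random dosages in {0, 1, 2}.

    Hypothetical corruption scheme; the preprint may corrupt inputs differently.
    Returns the corrupted copy and the boolean mask of replaced entries.
    """
    mask = rng.random(dosages.shape) < fraction
    noise = rng.integers(0, 3, size=dosages.shape)
    return np.where(mask, noise, dosages), mask

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of ranks.

    Ties are not averaged; this is adequate for continuous scores.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

rng = np.random.default_rng(0)
n_samples, n_snps = 500, 30                      # synthetic sizes, chosen arbitrarily
dosages = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
weights = rng.normal(size=n_snps)                # made-up per-SNP effect sizes

clean_grs = weighted_grs(dosages, weights)
noisy_dosages, mask = corrupt(dosages, 0.10, rng)
noisy_grs = weighted_grs(noisy_dosages, weights)

rho = spearman(clean_grs, noisy_grs)
print(f"fraction of entries replaced: {mask.mean():.3f}")
print(f"Spearman correlation, clean vs corrupted GRS: {rho:.3f}")
```

A model such as the masked-MLP would be evaluated the same way, by substituting its predictions for `noisy_grs` in the final comparison.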

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. PLOS ONE | 4510 papers in training set | Top 13% | 14.4%
2. Scientific Reports | 3102 papers in training set | Top 4% | 12.4%
3. BMC Medical Research Methodology | 43 papers in training set | Top 0.1% | 10.1%
4. International Journal of Epidemiology | 74 papers in training set | Top 0.6% | 3.6%
5. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.8% | 3.6%
6. Frontiers in Genetics | 197 papers in training set | Top 3% | 2.6%
7. JAMIA Open | 37 papers in training set | Top 0.6% | 2.1%
8. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 1% | 1.9%

50% of probability mass above

9. Genetic Epidemiology | 46 papers in training set | Top 0.4% | 1.9%
10. Bioinformatics | 1061 papers in training set | Top 7% | 1.8%
11. BMC Medical Genomics | 36 papers in training set | Top 0.5% | 1.7%
12. BMC Bioinformatics | 383 papers in training set | Top 4% | 1.7%
13. BMC Genomics | 328 papers in training set | Top 2% | 1.7%
14. Diabetologia | 36 papers in training set | Top 0.6% | 1.7%
15. Trials | 25 papers in training set | Top 0.9% | 1.7%
16. Wellcome Open Research | 57 papers in training set | Top 1% | 1.5%
17. Journal of Clinical Medicine | 91 papers in training set | Top 4% | 1.5%
18. Epidemiology and Infection | 84 papers in training set | Top 2% | 1.3%
19. BioData Mining | 15 papers in training set | Top 0.5% | 1.2%
20. Human Molecular Genetics | 130 papers in training set | Top 2% | 1.2%
21. Bioinformatics Advances | 184 papers in training set | Top 4% | 1.1%
22. Cureus | 67 papers in training set | Top 4% | 0.9%
23. The Journal of Clinical Endocrinology & Metabolism | 35 papers in training set | Top 1% | 0.9%
24. International Journal of Environmental Research and Public Health | 124 papers in training set | Top 6% | 0.8%
25. Genes | 126 papers in training set | Top 3% | 0.8%
26. PLOS Digital Health | 91 papers in training set | Top 3% | 0.8%
27. Diabetes Care | 12 papers in training set | Top 0.3% | 0.8%
28. BMC Medicine | 163 papers in training set | Top 7% | 0.7%
29. Archives of Clinical and Biomedical Research | 28 papers in training set | Top 2% | 0.7%
30. Biology | 43 papers in training set | Top 3% | 0.7%