
Information Bottleneck Dominates Adversarial Training for Ancestry-Invariant Polygenic Risk Prediction: Dimensionality, Not Gradient Reversal, Controls the Fairness-Accuracy Tradeoff

Tran, P. P.; Do, A. T.

2026-04-29 genomics
10.64898/2026.04.24.720752 bioRxiv

In adversarial representation learning for fair prediction, the gradient reversal coefficient (λ) is widely treated as the primary control for sensitive-attribute invariance. We show this assumption is wrong. Using a dual-stream architecture for cross-ancestry polygenic risk score (PRS) prediction, we demonstrate that latent dimensionality (the information bottleneck) accounts for 8-27× more variance in ancestry leakage than adversarial strength. Varying λ across a 20× range changes leakage by only 2.2 percentage points (pp); varying dimensionality across a 16× range changes it by 46.6 pp. At dimension 8 with no adversarial training (λ = 0), ancestry leakage is 32.9% (chance = 20%): the bottleneck alone achieves near-invariance. The adversary architecture (linear vs deep MLP) is equally irrelevant (0.6 pp range). We validate this finding across two unrelated domains, genomic ancestry invariance (6 clinical traits, 1000 Genomes, n = 2,504) and EEG subject invariance (pretrained HFTP + Braindecode dual-domain model, 20 subjects), observing consistent dimensionality dominance (12.7:1 ratio in EEG). For the genomic application, Stream 1 encodes population structure via DCT-II frequency-domain features (136 coefficients); Stream 2 encodes phenotype signal from top PRS SNPs (PCA to 128 dimensions). The architecture works equally well with standard genomic PCA as the ancestry stream (R² = 0.217 vs 0.222), confirming the contribution is architectural, not encoding-specific. African-ancestry PRS reconstruction R² improves on all six traits (e.g., +5.1 pp for coronary artery disease). Linear models achieve higher aggregate R² but fail catastrophically on cross-ancestry transfer (R² = -12.45 for African-ancestry CAD). We emphasize that we predict PRS (a computed score), not disease phenotypes; validation on biobank-scale phenotype data is ongoing.
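To put the quoted effect sizes side by side, the short sketch below recomputes two derived quantities from the figures stated in the abstract: the ratio of the two leakage ranges and the excess of the dimension-8 leakage over chance. The variable names are illustrative, and the range ratio is not the same quantity as the 8-27× variance ratio reported above; it is only a rough comparison of the two sweeps.

```python
# Figures quoted in the abstract (percentage points unless noted).
lambda_range_pp = 2.2   # leakage change across a 20x range of lambda
dim_range_pp = 46.6     # leakage change across a 16x range of latent dimension
leakage_dim8 = 32.9     # ancestry leakage (%) at dimension 8 with lambda = 0
chance = 20.0           # chance-level leakage (%) stated in the abstract

# Ratio of the two ranges (NOT the variance ratio reported in the abstract).
range_ratio = dim_range_pp / lambda_range_pp

# How far the bottleneck-only model sits above chance.
excess_over_chance = leakage_dim8 - chance

print(f"range ratio: {range_ratio:.1f}")                  # range ratio: 21.2
print(f"excess over chance: {excess_over_chance:.1f} pp")  # excess over chance: 12.9 pp
```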
These results suggest the adversarial fairness community has been over-investing in adversary engineering relative to simple capacity control. Practitioners should select latent dimensionality first to set the information budget for the fairness-accuracy tradeoff, then optionally use adversarial training for marginal refinement.
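The recommendation to set latent dimensionality first can be illustrated with a toy experiment. The sketch below is not the authors' dual-stream model: it assumes "leakage" means the held-out accuracy of a probe that predicts the group label from the latent, uses synthetic Gaussian data, a hypothetical `leakage_by_dim` helper, a nearest-centroid probe, and a crude bottleneck that simply keeps the first k coordinates. No adversary is involved, loosely mirroring the λ = 0 setting.

```python
import numpy as np

def leakage_by_dim(dims, n_per_class=200, n_classes=5, full_dim=64, seed=0):
    """Held-out nearest-centroid accuracy for a group label at each latent width."""
    rng = np.random.default_rng(seed)
    # Hypothetical data: each group has its own mean, spread thinly over all dims.
    means = rng.normal(0.0, 0.5, size=(n_classes, full_dim))
    labels = np.repeat(np.arange(n_classes), n_per_class)

    def sample():
        # Group mean plus unit-variance noise for every sample.
        return means[labels] + rng.normal(size=(labels.size, full_dim))

    x_train, x_test = sample(), sample()
    results = {}
    for k in dims:
        # Toy bottleneck (an assumption): keep only the first k coordinates.
        z_train, z_test = x_train[:, :k], x_test[:, :k]
        centroids = np.stack(
            [z_train[labels == c].mean(axis=0) for c in range(n_classes)]
        )
        # Squared distance from every test point to every class centroid.
        dists = ((z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        results[k] = float((dists.argmin(axis=1) == labels).mean())
    return results

leak = leakage_by_dim([2, 8, 32, 64])
for k, acc in leak.items():
    print(f"dim={k:3d}  probe accuracy={acc:.3f}  (chance=0.200)")
```

In this setup the probe has less group signal to work with as the latent narrows, so the width alone caps how much a downstream classifier can recover. A sweep of this kind could be run before any adversarial term is introduced.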

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Nature Machine Intelligence: 61 papers in training set, Top 0.1%, 14.6%
2. Nature Communications: 4913 papers in training set, Top 21%, 9.1%
3. Nature Neuroscience: 216 papers in training set, Top 1%, 6.3%
4. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 11%, 6.3%
5. Nature: 575 papers in training set, Top 5%, 4.8%
6. PLOS Computational Biology: 1633 papers in training set, Top 7%, 4.8%
7. The American Journal of Human Genetics: 206 papers in training set, Top 0.9%, 4.8%

50% of probability mass above

8. Nature Genetics: 240 papers in training set, Top 2%, 4.1%
9. eLife: 5422 papers in training set, Top 22%, 3.9%
10. Science: 429 papers in training set, Top 9%, 3.6%
11. Cell Genomics: 162 papers in training set, Top 1%, 3.6%
12. Nature Medicine: 117 papers in training set, Top 0.9%, 3.6%
13. Nature Computational Science: 50 papers in training set, Top 0.3%, 2.7%
14. Communications Biology: 886 papers in training set, Top 4%, 2.6%
15. Frontiers in Genetics: 197 papers in training set, Top 4%, 1.9%
16. Scientific Reports: 3102 papers in training set, Top 55%, 1.8%
17. Genome Research: 409 papers in training set, Top 2%, 1.7%
18. Nature Human Behaviour: 85 papers in training set, Top 2%, 1.7%
19. Nature Methods: 336 papers in training set, Top 5%, 1.7%
20. Patterns: 70 papers in training set, Top 2%, 0.9%
21. Cell Systems: 167 papers in training set, Top 11%, 0.8%
22. Cell Reports: 1338 papers in training set, Top 32%, 0.8%
23. Cell: 370 papers in training set, Top 16%, 0.8%
24. Briefings in Bioinformatics: 326 papers in training set, Top 6%, 0.8%
25. Nature Biomedical Engineering: 42 papers in training set, Top 2%, 0.7%
26. Science Advances: 1098 papers in training set, Top 30%, 0.7%
27. Neuron: 282 papers in training set, Top 9%, 0.6%