Robust Random Forests for Genomic Prediction: Challenges and Remedies

Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.

2026-04-01 bioinformatics
10.64898/2026.03.30.715203 bioRxiv
Data contamination, from recording errors to extreme outliers, can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective.
Author summary

Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
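The ranking-based transformation highlighted above can be illustrated with a minimal sketch. The function below is a hypothetical helper (not the authors' implementation): it maps responses to scaled ranks before model fitting, so a gross recording error keeps its ordering but loses its numerical leverage on whatever model is trained on the transformed responses.

```python
import numpy as np

def rank_transform(y):
    """Map responses to their ranks, scaled into (0, 1).

    Illustrative only: no tie handling, and the actual robust-RF
    pipeline in the paper may differ in detail.
    """
    y = np.asarray(y, dtype=float)
    ranks = np.argsort(np.argsort(y))        # 0-based rank of each response
    return (ranks + 1) / (len(y) + 1)        # scale ranks into (0, 1)

# Clean phenotypes vs. the same data with one gross recording error:
clean = np.array([4.1, 5.0, 4.7, 5.3, 4.9])
contaminated = clean.copy()
contaminated[1] = 500.0                      # e.g. a misplaced decimal point

print(rank_transform(clean))                 # [0.167 0.667 0.333 0.833 0.5]
print(rank_transform(contaminated))          # [0.167 0.833 0.333 0.667 0.5]
```

On the raw scale the outlier is two orders of magnitude off; on the rank scale every transformed response moves by at most one rank step, which is why a model fitted to ranks is far less distorted than one fitted to the raw responses.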

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank  Journal                                                        Papers in training set  Percentile  Probability
1     Bioinformatics Advances                                        184                     Top 0.1%    12.1%
2     Bioinformatics                                                 1061                    Top 2%      12.1%
3     BMC Bioinformatics                                             383                     Top 2%      6.2%
4     PLOS Computational Biology                                     1633                    Top 6%      6.2%
5     BMC Genomics                                                   328                     Top 0.3%    6.2%
6     Methods in Ecology and Evolution                               160                     Top 0.7%    4.1%
7     New Phytologist                                                309                     Top 2%      3.5%
----- 50% of probability mass above this line -----
8     in silico Plants                                               24                      Top 0.1%    3.5%
9     Frontiers in Genetics                                          197                     Top 2%      3.5%
10    PLOS ONE                                                       4510                    Top 43%     3.0%
11    Molecular Ecology Resources                                    161                     Top 0.4%    2.6%
12    The Plant Genome                                               53                      Top 0.3%    2.0%
13    Scientific Reports                                             3102                    Top 52%     2.0%
14    NAR Genomics and Bioinformatics                                214                     Top 1%      2.0%
15    Proceedings of the National Academy of Sciences                2130                    Top 31%     1.7%
16    Nature Communications                                          4913                    Top 55%     1.3%
17    Theoretical and Applied Genetics                               46                      Top 0.3%    1.2%
18    G3: Genes, Genomes, Genetics                                   222                     Top 0.6%    1.2%
19    Genome Biology                                                 555                     Top 6%      0.9%
20    IEEE Transactions on Computational Biology and Bioinformatics  17                      Top 0.5%    0.9%
21    Genetics                                                       225                     Top 4%      0.8%
22    GigaScience                                                    172                     Top 3%      0.8%
23    PeerJ                                                          261                     Top 14%     0.8%
24    Cell Systems                                                   167                     Top 12%     0.7%
25    Briefings in Bioinformatics                                    326                     Top 7%      0.7%
26    Biology Methods and Protocols                                  53                      Top 3%      0.7%
27    PLOS Genetics                                                  756                     Top 15%     0.7%
28    GENETICS                                                       189                     Top 1%      0.7%
29    G3 Genes|Genomes|Genetics                                      351                     Top 3%      0.7%
30    Genetics Selection Evolution                                   33                      Top 0.2%    0.7%