Increasing Phenomic Prediction Efficiency Using A Principal Component Analysis Based Pre-Processing Of Near Infrared Spectra

Bienvenu, C.; Roger, J.-M.; Sene, M.; Castro Pacheco, S. A.; Singer, M.; Felaniaina, B. L.; Terrier, N.; De Bellis, F.; Pot, D.; DE VERDAL, H.; Segura, V.

2026-05-13 genetics
10.64898/2026.05.10.724118 bioRxiv
Phenomic prediction (PP) is a breeding value prediction method using near infrared spectroscopy (NIRS). Spectra pre-processing is a key step in the PP analysis pipeline and generally involves chemometric methods. However, the genetics community still has little understanding of what pre-processing does and why it improves performance. Consequently, the choice of pre-processing is made either arbitrarily or through a search for the optimal set of methods and associated parameters. In this study, we propose a PCA-based pre-processing method in which genetic values of spectra are estimated on a set of principal components instead of individual wavelengths. This way, estimations are based on a few informative and orthogonal features of spectra instead of many correlated, uninformative wavelengths. We tested this new pre-processing method on five data sets representing four plant species (maize, rice, sorghum and grapevine). Results show that it performs as well as, or better than, the best classical chemometric pre-processing methods in almost all cases. Combining PCA-based and classical chemometric pre-processing methods maximizes predictive ability. Moreover, this pre-processing method opens up possibilities for better understanding and selecting the parts of the spectral information that are relevant for the prediction of breeding values. Indeed, components together representing about 1% of spectral variability were found to be responsible for most of PP predictive ability.

Plain language summary

Cultivated plants are the result of a breeding process in which genetic values are used to select the plants to breed. Estimating breeding values requires substantial experimental resources and is time-consuming. Phenomic prediction is a low-cost, high-throughput method for estimating genetic values that is increasingly being used. It often uses near infrared spectroscopy measurements, which are easy to collect and thus routinely used in many species, as predictors of genetic values. However, near infrared spectra generally require pre-processing before being used in prediction. Currently used pre-processing methods come from the chemometrics community, and geneticists have yet to gain an in-depth understanding of them. In this study, we propose a new pre-processing approach that performs as well as or better than the best chemometric pre-processing generally used, reduces computation time, and allows for a better understanding of which parts of the spectral information are relevant for prediction.

Core Ideas

- Working on principal components of spectra instead of wavelengths increases the predictive ability of phenomic prediction and performs as well as or better than classical chemometric pre-processing
- Working on principal components of spectra requires less parameter optimization than chemometric pre-processing
- About 1% of spectral variance is responsible for most of the predictive power of phenomic prediction
- Working on principal components of spectra that were pre-processed with classical chemometric methods can increase predictive ability even more
- PCA-based methods are valuable for optimizing the predictive ability of phenomic prediction and could be used more widely in the quantitative genetics field
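
To make the idea of working on principal components rather than wavelengths concrete, here is a minimal Python sketch of the projection step only. It is an illustration under stated assumptions, not the authors' pipeline: the spectra are synthetic, the function name `pca_scores` and all parameter values are hypothetical, and the step that estimates genetic values on each component (which the abstract describes but does not specify) is deliberately left out.

```python
import numpy as np


def pca_scores(spectra, n_components):
    """Project mean-centred spectra onto their leading principal components.

    spectra: array of shape (n_samples, n_wavelengths).
    Returns (scores, loadings, explained_variance_ratio).
    """
    centred = spectra - spectra.mean(axis=0)
    # SVD of the centred matrix: rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, vt[:n_components], explained[:n_components]


# Synthetic stand-in for NIR spectra (assumption: three latent factors
# plus a small amount of noise; real spectra will behave differently).
rng = np.random.default_rng(0)
n_samples, n_wavelengths = 60, 500
grid = np.linspace(0.0, 1.0, n_wavelengths)
basis = np.stack([np.sin(2 * np.pi * k * grid) for k in (1, 2, 3)])
latent = rng.normal(size=(n_samples, 3))
spectra = latent @ basis + 0.01 * rng.normal(size=(n_samples, n_wavelengths))

scores, loadings, explained = pca_scores(spectra, n_components=5)

# The score columns are mutually orthogonal, unlike neighbouring wavelengths.
gram = scores.T @ scores
off_diagonal = gram - np.diag(np.diag(gram))
print("max off-diagonal:", np.abs(off_diagonal).max())
print("explained variance ratio:", np.round(explained, 4))
```

The resulting `scores` matrix has one column per component in place of one column per wavelength; any downstream model would then be fitted on those columns. How many components to keep, and which ones carry the signal relevant to breeding values, is the question the study itself addresses.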

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. PLOS ONE (4510 papers in training set; Top 12%; 14.7%)
2. Frontiers in Plant Science (240 papers in training set; Top 0.7%; 12.4%)
3. Plant Methods (39 papers in training set; Top 0.1%; 7.2%)
4. Plant Phenomics (17 papers in training set; Top 0.1%; 6.4%)
5. Bioinformatics (1061 papers in training set; Top 4%; 6.3%)
6. BMC Bioinformatics (383 papers in training set; Top 2%; 6.3%)

50% of probability mass above

7. Scientific Reports (3102 papers in training set; Top 36%; 3.6%)
8. Frontiers in Genetics (197 papers in training set; Top 2%; 3.1%)
9. G3 (33 papers in training set; Top 0.2%; 1.8%)
10. G3 Genes|Genomes|Genetics (351 papers in training set; Top 1%; 1.7%)
11. Current Protocols (13 papers in training set; Top 0.1%; 1.7%)
12. The Plant Phenome Journal (14 papers in training set; Top 0.1%; 1.7%)
13. Bioinformatics Advances (184 papers in training set; Top 3%; 1.7%)
14. Crop Science (18 papers in training set; Top 0.2%; 1.7%)
15. Plant Physiology (217 papers in training set; Top 2%; 1.7%)
16. PLOS Computational Biology (1633 papers in training set; Top 18%; 1.5%)
17. Briefings in Bioinformatics (326 papers in training set; Top 4%; 1.5%)
18. Methods in Ecology and Evolution (160 papers in training set; Top 2%; 1.3%)
19. Agronomy (18 papers in training set; Top 0.6%; 1.2%)
20. BMC Genomics (328 papers in training set; Top 4%; 1.2%)
21. Genetics Selection Evolution (33 papers in training set; Top 0.1%; 0.9%)
22. Physical Biology (43 papers in training set; Top 2%; 0.9%)
23. Biology (43 papers in training set; Top 2%; 0.9%)
24. The Plant Genome (53 papers in training set; Top 0.6%; 0.8%)
25. Journal of Genetics and Genomics (36 papers in training set; Top 2%; 0.7%)
26. Frontiers in Bioengineering and Biotechnology (88 papers in training set; Top 3%; 0.7%)
27. G3: Genes, Genomes, Genetics (222 papers in training set; Top 1%; 0.7%)
28. Theoretical and Applied Genetics (46 papers in training set; Top 0.6%; 0.6%)