Exploring transcriptomic and genomic latent variable correction approaches in differential expression analysis.

Appulingam, Y.; Jammal, J.; Ali, A.; Topp, S.; NYGC ALS Consortium; Iacoangeli, A.; Pain, O.

2026-04-08 bioinformatics
10.64898/2026.04.07.716914 bioRxiv
Background: Differential expression analysis is a central tool for studying the biological processes altered in human disease through transcriptomic signatures. However, transcriptomic datasets are systematically confounded by latent variables from two distinct sources: unmeasured technical and biological heterogeneity within the expression data, and expression differences driven by population stratification. Expression-based surrogate variables (SVs) and genotype-based principal components (PCs) address these sources independently, yet no study has directly evaluated their combined use against either method alone within a differential expression framework. We hypothesised that including both correction layers simultaneously would produce more biologically valid and reproducible results than either approach alone, and tested this in two independent RNA-seq datasets of amyotrophic lateral sclerosis (ALS) cases and controls with matching genotype data.

Results: Four nested differential expression models (no correction, PC-only, SV-only, and combined SV+PC) were evaluated across the KCLBB (96 cases and 52 controls) and ALS Consortium (272 cases and 35 controls) datasets. Models were assessed on three metrics: cross-dataset effect size concordance, cross-dataset replicability quantified by the Jaccard similarity index, and biological recall against a curated reference set of 66 known ALS genes. The combined SV+PC framework consistently outperformed the simpler models across all metrics. Replicability improved nearly ten-fold relative to the uncorrected model (Jaccard index: 2.28% to 19.5%), and the combined framework showed a statistically significant 2.1% gain over the SV-only model. The number of known ALS genes recovered doubled compared with SV correction alone. Crucially, effect size stability was preserved, with the combined model expanding the shared transcriptomic signal without sacrificing consistency. These findings remained generally robust to the number of PCs in sensitivity analyses.

Conclusions: SVs and genotype PCs address non-redundant sources of confounding, and we recommend their combined use as standard practice in differential expression analysis where matched genotype data are available. Notably, PCs capturing population structure can also be derived directly from RNA-seq data, extending the applicability of this framework to studies lacking matched genotype data. Although this analysis was restricted to ALS datasets, we expect these findings to generalise to other traits.
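
To make the two set-based metrics concrete, the following is a minimal Python sketch of a Jaccard similarity index between two lists of significant genes and of recall against a reference gene set. The abstract does not state how significant genes were selected or which gene sets entered each calculation, so the function names, the placeholder gene identifiers, and the choice to compute recall on the genes shared by both datasets are all assumptions made for illustration, not the authors' implementation.

```python
def jaccard_index(genes_a, genes_b):
    """Size of the intersection divided by size of the union of two gene sets."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets have zero overlap.
        return 0.0
    return len(a & b) / len(union)


def recall(detected, reference):
    """Fraction of reference genes that appear among the detected genes."""
    ref = set(reference)
    if not ref:
        raise ValueError("reference set must not be empty")
    return len(set(detected) & ref) / len(ref)


# Hypothetical placeholder identifiers, not real results from either dataset.
sig_dataset_1 = {"GENE1", "GENE2", "GENE3", "GENE4"}
sig_dataset_2 = {"GENE3", "GENE4", "GENE5", "GENE6"}
reference_genes = {"GENE3", "GENE5", "GENE9", "GENE10"}

# Cross-dataset replicability: overlap of significant genes between datasets.
replicability = jaccard_index(sig_dataset_1, sig_dataset_2)

# Recall of reference genes among genes significant in both datasets
# (an assumed definition; the abstract does not specify the detected set).
shared_recall = recall(sig_dataset_1 & sig_dataset_2, reference_genes)

print(f"Jaccard index: {replicability:.3f}")
print(f"Recall: {shared_recall:.3f}")
```

Under this definition, a Jaccard index of 19.5% would mean that roughly one in five genes significant in either dataset is significant in both.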

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. BMC Bioinformatics: 383 papers in training set; Top 0.7%; 10.8%
2. Scientific Reports: 3102 papers in training set; Top 5%; 10.5%
3. PLOS ONE: 4510 papers in training set; Top 23%; 7.4%
4. Bioinformatics: 1061 papers in training set; Top 4%; 7.1%
5. Frontiers in Genetics: 197 papers in training set; Top 0.6%; 7.1%
6. Bioinformatics Advances: 184 papers in training set; Top 0.4%; 6.5%
7. NAR Genomics and Bioinformatics: 214 papers in training set; Top 0.5%; 4.1%
50% of probability mass above
8. PeerJ: 261 papers in training set; Top 3%; 2.8%
9. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 4%; 1.8%
10. BMC Genomics: 328 papers in training set; Top 3%; 1.5%
11. Neurobiology of Disease: 134 papers in training set; Top 3%; 1.4%
12. PLOS Computational Biology: 1633 papers in training set; Top 18%; 1.4%
13. Journal of Neurology: 26 papers in training set; Top 0.9%; 1.1%
14. Journal of Translational Medicine: 46 papers in training set; Top 2%; 1.0%
15. Genome Medicine: 154 papers in training set; Top 6%; 1.0%
16. Multiple Sclerosis Journal: 18 papers in training set; Top 0.2%; 1.0%
17. Genes: 126 papers in training set; Top 2%; 0.9%
18. BMC Medical Genomics: 36 papers in training set; Top 0.9%; 0.9%
19. Frontiers in Human Neuroscience: 67 papers in training set; Top 2%; 0.8%
20. BMC Medical Research Methodology: 43 papers in training set; Top 1%; 0.8%
21. Heliyon: 146 papers in training set; Top 6%; 0.8%
22. Genetic Epidemiology: 46 papers in training set; Top 0.8%; 0.8%
23. Biology Methods and Protocols: 53 papers in training set; Top 3%; 0.7%
24. Brain Communications: 147 papers in training set; Top 3%; 0.7%
25. Nucleic Acids Research: 1128 papers in training set; Top 19%; 0.7%
26. Nature Communications: 4913 papers in training set; Top 65%; 0.7%
27. Frontiers in Cellular Neuroscience: 79 papers in training set; Top 2%; 0.5%
28. Communications Biology: 886 papers in training set; Top 31%; 0.5%
29. Human Molecular Genetics: 130 papers in training set; Top 4%; 0.5%
30. Neuropathology and Applied Neurobiology: 14 papers in training set; Top 0.9%; 0.5%