
A residual-ratio framework for auditing transcriptomic gene signatures against background expression structure

Zhu, Y.; Zhang, C.; Calhoun, V. D.; Bi, Y.

2026-04-14 bioinformatics
10.64898/2026.04.11.717907 bioRxiv

Background: Transcriptomic gene signatures are widely used to infer pathway activity and biological mechanism from bulk cancer expression data, yet current evaluation strategies primarily emphasize internal coherence, predictive performance, or scoring robustness. A quantitative framework for assessing how much signature variation remains independent of background expression structure has been lacking.

Results: Unlike existing single-number signature-quality metrics such as Berglund uniqueness, residual-ratio auditing reports a trajectory across null-model richness: for each signature we compute the residual ratio [Formula] at progressively enriched expression-PC subspaces, together with an inverse-participation-ratio (IPR) concentration diagnostic that reports the effective number of axes absorbing each signature. Applied to a curated 17-entry benchmark, all 50 MSigDB Hallmark gene sets, and 1,181 Reactome pathways across 8 TCGA cancer types (4,462 samples), with external validation in METABRIC, the framework produces two complementary readouts. First, the curated panel is absorbed into the ExprPC50 subspace at residual ratios 18-43% below size-matched random 30-gene baselines in every cancer (curated mean r⊥ range 0.109-0.177 vs. random mean 0.182-0.288), providing the framework's central quantitative discrimination between biologically coherent signatures and arbitrary gene combinations.
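The abstract does not spell out the formulas, so the following is only a minimal sketch of how a residual-ratio trajectory and an IPR-style diagnostic could be computed. It assumes a samples x genes matrix, a signature score defined as the mean of z-scored member genes, a residual ratio defined as the fraction of the score's squared norm left outside the top-k sample-space PCs, and c(5) defined as the share of absorbed energy on the five most-loaded axes. The names `sample_space_basis`, `signature_score`, and `audit` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def sample_space_basis(Y, k):
    """Top-k sample-space PCs of a samples x genes matrix (orthonormal columns)."""
    Yc = Y - Y.mean(axis=0, keepdims=True)        # center each gene across samples
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :k]

def signature_score(Y, gene_idx):
    """Per-sample score: mean of z-scored member genes (assumed scoring rule)."""
    Z = Y[:, gene_idx]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    return Z.mean(axis=1)

def audit(s, V):
    """Return (r_perp, n_eff, c5) for score vector s against orthonormal basis V."""
    w = (V.T @ s) ** 2                  # energy absorbed by each basis axis
    r_perp = 1.0 - w.sum() / (s @ s)    # assumed definition: unabsorbed fraction
    p = w / w.sum()                     # normalized absorption weights
    n_eff = 1.0 / np.sum(p ** 2)        # inverse of IPR = effective number of axes
    c5 = np.sort(p)[-5:].sum()          # assumed definition of top-5 concentration
    return r_perp, n_eff, c5

# Synthetic demonstration data only; no real expression values are used.
rng = np.random.default_rng(0)
Y = rng.normal(size=(60, 200))          # 60 samples, 200 genes
s = signature_score(Y, np.arange(30))   # a hypothetical 30-gene signature
V = sample_space_basis(Y, 50)
trajectory = [audit(s, V[:, :k])[0] for k in (5, 20, 50)]
print(trajectory)
```

Because the subspaces are nested, the trajectory can only decrease or stay flat as k grows; what distinguishes signatures under this sketch is how quickly it falls and how concentrated the absorption is.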
Second, within the curated panel the ExprPC50 residual ratio is negatively correlated with the top-5 absorption concentration in every cancer (Spearman ρ from -0.59 in PRAD to -0.89 in SKCM, median -0.71; all 8 significant at p < 0.05, most at p < 10⁻³). We report this correlation as a descriptive geometric property of the null-model coordinate system rather than as a biological law, because 1,000 random 30-gene draws projected through the same top-50 expression-PC basis reproduce essentially the same pan-cancer median ρ (-0.73; Supplementary Table S16). The correlation is also robust to compositional nuisance: after rebuilding the null basis as immune-PC1 ⊕ stromal-PC1 ⊕ proliferation-PC1 plus 47 residual PCs, the per-cancer ρ becomes more negative rather than shallower (median -0.86; Supplementary Table S17), ruling out tumor purity, immune infiltrate, and stromal fraction as drivers of the pattern. Because absorption at ExprPC50 is a geometric property of how any signature direction sits in expression-PC space, tier-level distributional structure at this operating point is not separable beyond the low-vs-upper band split: a Kruskal-Wallis omnibus test is significant (p = 4.9 × 10⁻¹³), but pairwise Dunn's post-hoc tests show that Tiers 1, 4, and 5 are not separable (p_BH > 0.2). The trajectory shape itself is empirically bootstrap-invariant: across 200 sample-level fixed-basis bootstrap resamples of the 17 curated entries in BRCA, the mean pairwise Pearson correlation of trajectory-shape vectors is 0.999, and individual cell-level 95% bootstrap CI half-widths at B = 1,000 resamples are in the range 0.002-0.053. External replication in the METABRIC breast cancer cohort (n_samples = 1,980, microarray) showed moderate-to-strong rank-ordering concordance with TCGA-BRCA across the 17 curated entries (Spearman ρ = 0.72 on the 17-signature ordering, 95% Fisher-z CI 0.37-0.89, p = 0.001).
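The Fisher-z interval quoted for the METABRIC concordance can be checked by hand. The sketch below assumes the textbook construction (arctanh transform with standard error 1/sqrt(n - 3)); the abstract does not state which variant was used, and `fisher_z_ci` is a hypothetical helper name.

```python
import math

def fisher_z_ci(rho, n, z_crit=1.959964):
    """Two-sided 95% CI for a correlation via the Fisher z-transform."""
    z = math.atanh(rho)                 # transform to the z scale
    se = 1.0 / math.sqrt(n - 3)         # assumed standard error
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = fisher_z_ci(0.72, 17)          # rho and n as reported for the 17 curated entries
print(f"{lo:.2f}-{hi:.2f}")             # prints 0.37-0.89
```

Under that assumption the computed bounds agree with the reported interval, and the width of the interval reflects how few signatures (n = 17) enter the ordering.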
Under an upper-bound sensitivity analysis, 45 of 50 Hallmark gene sets and 992 of 1,181 Reactome pathways had ExprPC200 residual ratios below the mean of their size-matched random baselines, a descriptive statistic reflecting axis alignment under rich null models, not a failure rate. In causal DAG simulations (n_rep = 100 replicates), a signature driven entirely by a latent confounder retained r⊥ = 0.233 at ExprPC50, numerically comparable to Tier 1 validated drivers, so a single-point residual ratio cannot adjudicate confounder-independence. The framework's load-bearing signals are therefore the trajectory shape (statistically invariant under sample-level resampling) and the magnitude gap between the curated panel and its random 30-gene baseline (the curated-vs-random discrimination), read jointly, not the value of r⊥ at any single null-model dimensionality.

Table S16: Random-gene-set null for the ExprPC50 r⊥-vs-c(5) correlation, and curated-panel absolute gap vs random baselines. For each of the 8 primary-analysis TCGA cancer cohorts, we drew B = 1,000 random 30-gene sets from the gene universe of the preprocessed expression matrix and computed, for each draw, the residual ratio r⊥(k = 50) and the top-5 absorption concentration c(5) under the same top-50 sample-space PC basis used for the curated benchmark (Methods §Statistical analysis; reference implementation accompanies the project repository as script 35).
Column "Empirical curated ρ" repeats the 17-signature Spearman ρ between r⊥ and c(5) reported in the main text and in Supplementary Table S10; column "Null-A ρ (random 30-gene)" gives the Spearman ρ across the 1,000 random-draw (r⊥, c(5)) pairs per cancer; column "Δ (emp - null-A)" reports the difference. For reference, column "Null-B ρ" gives the corresponding Spearman ρ across 1,000 iid Gaussian unit vectors h ~ N(0, I_N) in sample space (a uniform random direction that does not inherit the expression covariance geometry). Rightmost columns compare the curated 17-entry panel's mean r⊥ at ExprPC50 to the random 30-gene baseline mean, both in absolute units and as a percentage gap; this magnitude gap is the quantitative discrimination between curated biological signatures and arbitrary gene combinations on which the framework's central claim rests.

Table S17: Purity-aware null for the curated-panel ExprPC50 r⊥-vs-c(5) correlation. For each cancer we rebuild a rank-50 sample-space basis as follows. Columns 1-3 are the PC1 directions of: an immune-infiltrate proxy panel (CD3D, CD3E, CD4, CD8A, CD8B, CD19, CD68, PTPRC, FOXP3, IFNG); a stromal/fibroblast proxy panel (COL1A1, COL1A2, COL3A1, VIM, FN1, ACTA2, PDGFRA, PDGFRB); and the 50 proliferation markers already used in the null-model hierarchy (§Null model hierarchy). Columns 4-50 are the top-47 PCs of the residual expression matrix Y - Q_bio Q_bioᵀ Y after QR-orthonormalization of the 3-column biological block; the full 50-column basis is re-orthonormalized by a final QR pass.
The 17 curated benchmark entries are then re-scored under this purity-aware basis, and the per-cancer Spearman ρ between r⊥ and c(5) is recomputed. Column "Standard ρ" is the value reported in the main text (same as Supplementary Table S16, column "Empirical curated"); column "Purity-adjusted ρ" is the value under the new basis; column "Δ" is the difference. This test was pre-specified as a check on whether the observed ρ(r⊥, c(5)) could be driven by tumor-purity, immune, or stromal composition artifacts that a standard top-50 PC basis would implicitly absorb.

Conclusions: Residual-ratio auditing provides an interpretable and practical framework for quantifying how much of a transcriptomic gene signature's variance remains orthogonal to a chosen background-expression model. The two statistically reliable quantities it reports are (i) the shape of the trajectory r⊥(k) across null-model richness, which is bootstrap-invariant across sample-level resamples, and (ii) the magnitude gap between the curated panel's residual ratio and size-matched random 30-gene baselines at a fixed operating point, which is 18-43% in all 8 TCGA cancers and survives a purity-aware null-model construction. The negative correlation between r⊥ and the top-5 absorption concentration c(5) (curated-panel median ρ = -0.71) is reproduced by random 30-gene sets under the same basis (random-draw median ρ = -0.73) and is therefore best read as a descriptive geometric property of the null-model coordinate system rather than a biological discovery about curated signatures. Any single operating-point residual ratio carries materially wider cell-level uncertainty than the trajectory shape and cannot, on its own, adjudicate confounder-independence.
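The random 30-gene baseline and the percentage gap referred to in point (ii) can be illustrated with a short sketch. It is not the authors' script 35: the scoring rule (mean of z-scored genes), the squared-norm residual ratio, the definition of c(5), and all function names are assumptions made for illustration, and the rank-based Spearman helper assumes no ties.

```python
import numpy as np

def top_k_basis(Y, k):
    """Top-k sample-space PCs of a samples x genes matrix."""
    Yc = Y - Y.mean(axis=0, keepdims=True)
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :k]

def score_and_audit(Y, gene_idx, V):
    """Return (r_perp, c5) for one gene set under basis V (assumed definitions)."""
    Z = Y[:, gene_idx]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    s = Z.mean(axis=1)                       # per-sample signature score
    w = (V.T @ s) ** 2                       # absorbed energy per axis
    r_perp = 1.0 - w.sum() / (s @ s)
    c5 = np.sort(w)[-5:].sum() / w.sum()
    return r_perp, c5

def spearman(x, y):
    """Spearman rho as the Pearson correlation of ranks (valid without ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def random_set_null(Y, V, set_size=30, n_draws=1000, seed=0):
    """Draw random gene sets and return arrays of r_perp and c5 across draws."""
    rng = np.random.default_rng(seed)
    draws = np.array([
        score_and_audit(Y, rng.choice(Y.shape[1], set_size, replace=False), V)
        for _ in range(n_draws)
    ])
    return draws[:, 0], draws[:, 1]

def percent_gap(curated_mean, random_mean):
    """How far the curated mean sits below the random mean, in percent."""
    return 100.0 * (random_mean - curated_mean) / random_mean

# Synthetic demonstration data only.
Y = np.random.default_rng(0).normal(size=(60, 200))
r_null, c_null = random_set_null(Y, top_k_basis(Y, 50), n_draws=200)
print(spearman(r_null, c_null), r_null.mean())
```

On real data, the curated panel's mean residual ratio would be passed to `percent_gap` together with the mean of the random draws, and the curated-panel Spearman ρ would be compared against `spearman(r_null, c_null)`.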
The framework's outputs describe a signature's geometric relationship to modeled background expression structure and do not evaluate clinical utility: a signature with a low residual ratio may still be clinically valuable when that low value reflects alignment with a strong prognostic or actionable program such as proliferation, immune infiltration, or cell cycle, and the framework is not a substitute for calibrated prognostic or predictive classifiers. All findings are based on bulk RNA-seq (TCGA PanCancer Atlas, 8 cancer types) and microarray (METABRIC) data; transfer to single-cell, single-nucleus, or spatial transcriptomics is out of scope and not claimed. Used within this scope (reading the trajectory shape and the magnitude-gap signal jointly, rather than the value of r⊥ at any one k), the framework adds a complementary audit layer to existing pathway-scoring and experimental-validation workflows, and supports more calibrated interpretation, comparison, and reporting of transcriptomic gene signatures in cancer studies.
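The purity-aware basis described in the Table S17 caption (three biological PC1 columns, 47 residual PCs, and a final QR pass) can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: it assumes Y is a centered samples x genes matrix, and the function names and the gene-index panels in the demonstration are hypothetical placeholders rather than the actual marker genes.

```python
import numpy as np

def pc1_direction(Y, gene_idx):
    """First sample-space principal direction of a gene panel."""
    P = Y[:, gene_idx]
    P = P - P.mean(axis=0, keepdims=True)
    U, _, _ = np.linalg.svd(P, full_matrices=False)
    return U[:, 0]

def purity_aware_basis(Y, panels, rank=50):
    """Rank-`rank` basis: one PC1 column per panel, then residual PCs."""
    Yc = Y - Y.mean(axis=0, keepdims=True)
    B = np.column_stack([pc1_direction(Y, idx) for idx in panels])
    Q_bio, _ = np.linalg.qr(B)                  # orthonormalize the biological block
    R = Yc - Q_bio @ (Q_bio.T @ Yc)             # residual matrix Y - Q_bio Q_bio^T Y
    U, _, _ = np.linalg.svd(R, full_matrices=False)
    rest = U[:, : rank - Q_bio.shape[1]]        # top residual PCs (47 when rank=50)
    Q, _ = np.linalg.qr(np.column_stack([Q_bio, rest]))  # final re-orthonormalization
    return Q

# Synthetic demonstration: index ranges stand in for the immune, stromal,
# and proliferation panels (10, 8, and 50 genes respectively).
rng = np.random.default_rng(0)
Y = rng.normal(size=(80, 300))
panels = [np.arange(0, 10), np.arange(10, 18), np.arange(18, 68)]
Q = purity_aware_basis(Y, panels)
print(Q.shape)
```

Signatures would then be re-scored against `Q` in place of the standard top-50 PC basis, so that any composition-driven variance captured by the three panel directions is absorbed explicitly.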

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Cell Systems | 167 papers in training set | Top 0.5% | 17.7%
2. Genome Biology | 555 papers in training set | Top 0.1% | 13.9%
3. Nature Communications | 4913 papers in training set | Top 15% | 11.9%
4. Nature Genetics | 240 papers in training set | Top 1% | 6.6%
50% of probability mass above
5. Nature Biotechnology | 147 papers in training set | Top 2% | 3.8%
6. Genome Research | 409 papers in training set | Top 0.9% | 3.8%
7. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 19% | 3.7%
8. Nature | 575 papers in training set | Top 7% | 3.5%
9. PLOS Computational Biology | 1633 papers in training set | Top 11% | 3.0%
10. Genome Medicine | 154 papers in training set | Top 3% | 2.3%
11. Nucleic Acids Research | 1128 papers in training set | Top 9% | 2.0%
12. Cell Genomics | 162 papers in training set | Top 3% | 2.0%
13. eLife | 5422 papers in training set | Top 37% | 2.0%
14. Nature Methods | 336 papers in training set | Top 4% | 1.8%
15. Molecular Systems Biology | 142 papers in training set | Top 0.6% | 1.7%
16. Cell Reports Medicine | 140 papers in training set | Top 5% | 1.3%
17. Patterns | 70 papers in training set | Top 2% | 0.9%
18. Communications Biology | 886 papers in training set | Top 18% | 0.9%
19. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.9%
20. Science Advances | 1098 papers in training set | Top 29% | 0.8%
21. npj Systems Biology and Applications | 99 papers in training set | Top 2% | 0.8%
22. Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.7%
23. BMC Bioinformatics | 383 papers in training set | Top 7% | 0.7%
24. Scientific Reports | 3102 papers in training set | Top 77% | 0.7%
25. Bioinformatics | 1061 papers in training set | Top 10% | 0.7%
26. iScience | 1063 papers in training set | Top 39% | 0.6%
27. PLOS ONE | 4510 papers in training set | Top 72% | 0.6%