Aiki-XP: leakage-controlled multimodal prediction of within-species relative protein expression at pan-bacterial scale
Tien, H.; Meda, R. S.; Shastry, S.; Mysore, V.
Generalizable protein-expression prediction can accelerate protein engineering, inform disease mechanisms, and help optimize heterologous recombinant protein production. Protein expression is governed by many interacting parameters that no single omics view captures. We develop Aiki-XP, a multimodal platform integrating four biological scales (genome, operon, coding sequence, protein) plus biophysical features across 492,026 genes from 385 bacterial species. Aiki-XP predicts within-species relative abundance (per-species z-score rank), not absolute copies per cell. Under a leakage-controlled gene-operon split, Aiki-XP reaches Spearman ρ_nc = 0.592 versus 0.509 for ESM-C 600M alone, and each tier of a monotone protein→operon→genome deployment ladder yields a statistically significant gain; a five-recipe rank-average ensemble adds a further +0.016. All recipes were locked before external evaluation; transfer to heterologous, cross-species, and novel-phylum benchmarks demonstrates utility and limits. Ablations and scaling experiments identify operon-scale genomic context, not protein-language-model capacity, as the rate-limiting input at this scale; one foundation model per biological scale suffices, with same-scale stacking adding little.
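The abstract names three generic operations without spelling them out: a per-species standardized target, a rank-average ensemble, and a Spearman correlation. The sketch below illustrates the textbook versions of each in plain Python. It is not the Aiki-XP code: the function names are hypothetical, the population standard deviation is an assumed choice, and how the authors actually build their "z-score rank" target or combine their five recipes is not stated in the abstract.

```python
from collections import defaultdict
from statistics import mean, pstdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def per_species_zscore(abundance, species):
    """Standardize each abundance against the genes of its own species.

    Assumption: population standard deviation; a species with zero
    spread gets a z-score of 0.0 for all of its genes.
    """
    groups = defaultdict(list)
    for value, sp in zip(abundance, species):
        groups[sp].append(value)
    stats = {sp: (mean(v), pstdev(v)) for sp, v in groups.items()}
    return [
        (value - stats[sp][0]) / stats[sp][1] if stats[sp][1] > 0 else 0.0
        for value, sp in zip(abundance, species)
    ]


def rank_average_ensemble(predictions):
    """Average each gene's rank across several models' predictions."""
    ranked = [average_ranks(p) for p in predictions]
    return [mean(column) for column in zip(*ranked)]


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy data only: two hypothetical models scoring four genes.
    model_a = [0.1, 0.5, 0.9, 0.3]
    model_b = [0.2, 0.9, 0.4, 0.1]
    target = per_species_zscore([1.0, 3.0, 10.0, 30.0], ["a", "a", "b", "b"])
    ensemble = rank_average_ensemble([model_a, model_b])
    print(spearman(ensemble, target))
```

Averaging ranks rather than raw scores makes the ensemble insensitive to each model's output scale, which fits a target that is itself defined only up to within-species ordering.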