OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability

Wang, L.

2026-05-14 bioinformatics

10.64898/2026.05.12.724542 bioRxiv

We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing). We inject 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP labels), continue pretraining on a 32.5 GB mixture of DNA, protein, natural-language, and structural data, and run a five-stage supervised fine-tuning pipeline (v2-v5) on 199,576 instruction-format examples across eight task families. The final version, v5, adds a dual-head architecture: a generation head plus two per-residue classification heads (3Di, DSSP) trained jointly under a 0.5/0.5 loss split. v5 reaches 99.40% accuracy on BioPAWS standard protein homology, 82.60% on remote homology (500 pairs), and 93.66% on BixBench, gains of 14.4, 22.6, and 6.7 percentage points, respectively, over the vocabulary-extended Gemma-4-Instruct baseline. It also outperforms ESM-2 (650M) by 32.1 percentage points on the identical remote-homology split. The classification heads reach 78.6% per-residue accuracy on 3Di (chance 5%) and 100% on DSSP (chance 12.5%). MoE router activations further yield a clean decomposition of cross-task differentiation, with 96% arising in continued pretraining (CPT) and 4% in supervised fine-tuning (SFT), providing direct interpretability of where biological specialization is acquired.
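For readers unfamiliar with top-k expert routing, the following is a minimal sketch of how a router might select 8 of 128 experts for a single token. Only the expert count and the value of k come from the abstract. The gating rule (a softmax over the selected logits), the random logits, and the name `route_token` are illustrative assumptions, not the OmniGene-4 or Gemma implementation.

```python
import math
import random

NUM_EXPERTS = 128  # experts per layer, as stated in the abstract
TOP_K = 8          # experts selected per token, as stated in the abstract


def route_token(router_logits, k=TOP_K):
    """Return (expert_indices, gate_weights) for one token.

    Assumption: gate weights are a softmax restricted to the k selected
    logits. The actual model may normalize differently.
    """
    # Indices of the k largest router logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    # Numerically stable softmax over the selected experts only.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


# Hypothetical router logits for one token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts, gates = route_token(logits)
```

Recording which indices appear in `experts` per token and per task is the kind of router-level signal that an activation analysis could aggregate. How the paper actually aggregates it is not specified in the abstract.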
