HalluCodon enables species-specific codon optimization using multimodal language models
Lou, Y.; Mao, S.; Wu, T.; Xia, F.; Zhang, Z.; Tian, Y.; Li, Y.; Cheng, Q.; Yan, J.; Wang, X.
Codon optimization is widely used in transgenic crop development, plant synthetic biology, and molecular farming to improve heterologous protein expression in plant cells. The increasing availability of plant omics data now enables optimization strategies that account for species-specific sequence features. We developed HalluCodon, a customizable framework that uses multimodal language models to design coding sequences tailored to individual plant species. The framework allows users to fine-tune pre-trained protein and RNA language models on their own datasets to build species-specific codon optimization models. The current implementation includes base models trained on coding sequences and proteomes from fifteen plant species. HalluCodon generates coding sequences through a hallucination-based design strategy guided by two predictive modules, one that evaluates coding sequence naturalness (CodonNAT) and one that evaluates expression potential (CodonEXP). Benchmark tests using representative proteins show that the generated sequences reproduce host-specific codon usage patterns and support high expression levels in plant systems.
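To make the idea of predictor-guided sequence design concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple greedy mutate-and-score loop over synonymous codons; the abstract does not state which search procedure HalluCodon actually uses. The functions `toy_naturalness` and `toy_expression` are hypothetical stand-ins for CodonNAT and CodonEXP, which in the real framework are learned models, and the codon table is a small subset of the standard genetic code chosen only for the demo.

```python
import random

# Subset of the standard genetic code, sufficient for the demo protein below.
SYNONYMS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}


def toy_naturalness(codons):
    """Hypothetical stand-in for a CodonNAT-like score (assumption):
    the fraction of codons whose third base is G or C."""
    return sum(c[-1] in "GC" for c in codons) / len(codons)


def toy_expression(codons):
    """Hypothetical stand-in for a CodonEXP-like score (assumption):
    penalises identical codons in adjacent positions."""
    repeats = sum(a == b for a, b in zip(codons, codons[1:]))
    return 1.0 - repeats / max(len(codons) - 1, 1)


def score(codons, w_nat=0.5, w_exp=0.5):
    """Weighted sum of the two toy predictors; the weights are illustrative."""
    return w_nat * toy_naturalness(codons) + w_exp * toy_expression(codons)


def translate(codons):
    """Map codons back to amino acids using the demo table."""
    lookup = {c: aa for aa, cs in SYNONYMS.items() for c in cs}
    return "".join(lookup[c] for c in codons)


def hallucinate_cds(protein, steps=500, seed=0):
    """Greedy search: start from random synonymous codons, then repeatedly
    propose a synonymous swap at one position and keep it if the combined
    score does not decrease. The encoded protein never changes."""
    rng = random.Random(seed)
    codons = [rng.choice(SYNONYMS[aa]) for aa in protein]
    best = score(codons)
    for _ in range(steps):
        i = rng.randrange(len(codons))
        proposal = list(codons)
        proposal[i] = rng.choice(SYNONYMS[protein[i]])
        s = score(proposal)
        if s >= best:
            codons, best = proposal, s
    return "".join(codons), best


if __name__ == "__main__":
    cds, final_score = hallucinate_cds("MKLSEG")
    print(cds, round(final_score, 3))
```

Because every proposal swaps a codon only for a synonym of the same amino acid, the protein sequence is preserved by construction and only the nucleotide-level choices are optimised. Replacing the two toy scoring functions with trained, species-specific predictors would be the step that makes such a loop host-aware.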