
MolCodon: A Codon-Based Molecular Language for Interpretable Structural Representation and Similarity Search

Sayyah, E.; Kurul, E.; Tunc, H.; Durdagi, S.

2026-05-21 bioinformatics
10.64898/2026.05.20.726468 bioRxiv

Molecular representation determines which aspects of chemical structure can be learned, compared, and interpreted in computational drug discovery. Existing encodings typically emphasize either compact string description, as in SMILES and SELFIES, or efficient similarity search, as in circular fingerprints, but they may not simultaneously provide deterministic sequence structure, graph-level interpretability, pharmacophore annotation, and high-fidelity molecular reconstruction. Here, we introduce MolCodon, a codon-based molecular language that represents small molecules as deterministic sequences of fixed-width three-character tokens over a five-symbol alphabet: C, N, O, S, and X. Inspired by the triplet organization of the genetic code, MolCodon assigns chemically defined codon families to atoms, bonds, ring and branch topology, fused-ring references, pharmacophore features, bond mobility, charge, and stereochemistry. A deterministic graph traversal with ring-contiguity preservation produces sequences in which chemically meaningful substructures remain locally organized and traceable to the underlying molecular graph. Across approximately 2.9 million molecules from six commercial screening libraries, MolCodon achieved 98.93% InChIKey-level round-trip fidelity, supporting its use as a high-fidelity sequence representation for drug-like chemistry. MolCodon-derived sparse sequence and trace features further outperformed SELFIES and Group SELFIES across ten QSAR tasks and exceeded classical fingerprint baselines in six of the ten tasks. As an application of the representation, the MolCodon BLAST similarity engine decomposes molecular similarity into ring topology, branch context, attachment architecture, and pharmacophore correspondence, enabling interpretable scaffold-hopping searches. In a PARP1 virtual screening study, MolCodon retrieved scaffold-diverse candidates related to Olaparib, a known PARP1 inhibitor.
Together, these results establish MolCodon as a new molecular representation paradigm that transforms chemical graphs into high-fidelity, interpretable, and alignment-compatible codon sequences, opening a direct path for bioinformatics-inspired analysis of small-molecule chemical space. The MolCodon encoder, decoder, and BLAST similarity engine are freely available as open-source software at https://github.com/DurdagiLab/MolCodon
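
The fixed-width, five-symbol token format described in the abstract can be illustrated with a minimal Python sketch. It only enumerates the codon space (5^3 = 125 possible triplets) and splits a string into three-character tokens. The abstract does not give the actual codon assignments, so the function names and the example string below are hypothetical and do not correspond to any real MolCodon encoding or to the API of the released software.

```python
from itertools import product

ALPHABET = "CNOSX"  # five-symbol alphabet stated in the abstract
CODON_WIDTH = 3     # fixed-width three-character tokens


def codon_space(alphabet: str = ALPHABET, width: int = CODON_WIDTH) -> list[str]:
    """Enumerate every possible codon over the alphabet (hypothetical helper)."""
    return ["".join(symbols) for symbols in product(alphabet, repeat=width)]


def tokenize(sequence: str, alphabet: str = ALPHABET, width: int = CODON_WIDTH) -> list[str]:
    """Split a codon string into fixed-width tokens, checking length and symbols."""
    if len(sequence) % width != 0:
        raise ValueError(f"sequence length {len(sequence)} is not a multiple of {width}")
    unknown = set(sequence) - set(alphabet)
    if unknown:
        raise ValueError(f"symbols outside the alphabet: {sorted(unknown)}")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Arbitrary illustrative string, NOT a real MolCodon encoding of any molecule.
example = "CCNOXSCCX"
print(len(codon_space()))  # 125
print(tokenize(example))   # ['CCN', 'OXS', 'CCX']
```

Because every token has the same width, token boundaries follow from position alone, which is what makes such sequences directly usable with alignment-style tools.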

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Journal of Chemical Information and Modeling: 207 papers in training set; Top 0.3%; 17.3%
2. Bioinformatics: 1061 papers in training set; Top 2%; 14.2%
3. Cell Systems: 167 papers in training set; Top 1%; 9.0%
4. Nature Methods: 336 papers in training set; Top 2%; 6.2%
5. Nature Biotechnology: 147 papers in training set; Top 2%; 4.8%
50% of probability mass above
6. Nature Communications: 4913 papers in training set; Top 36%; 4.2%
7. Nucleic Acids Research: 1128 papers in training set; Top 6%; 3.6%
8. Journal of Cheminformatics: 25 papers in training set; Top 0.2%; 3.5%
9. Advanced Science: 249 papers in training set; Top 7%; 2.8%
10. Bioinformatics Advances: 184 papers in training set; Top 2%; 2.6%
11. Briefings in Bioinformatics: 326 papers in training set; Top 3%; 2.1%
12. Nature Machine Intelligence: 61 papers in training set; Top 2%; 2.0%
13. PLOS Computational Biology: 1633 papers in training set; Top 14%; 2.0%
14. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 30%; 1.9%
15. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 4%; 1.8%
16. Chemical Science: 71 papers in training set; Top 1%; 1.5%
17. Patterns: 70 papers in training set; Top 1%; 1.3%
18. BMC Bioinformatics: 383 papers in training set; Top 5%; 1.3%
19. PLOS ONE: 4510 papers in training set; Top 60%; 1.2%
20. Scientific Reports: 3102 papers in training set; Top 73%; 0.8%
21. iScience: 1063 papers in training set; Top 33%; 0.7%
22. Cell Genomics: 162 papers in training set; Top 7%; 0.7%
23. Genome Medicine: 154 papers in training set; Top 8%; 0.7%
24. Communications Biology: 886 papers in training set; Top 30%; 0.6%
25. ACS Synthetic Biology: 256 papers in training set; Top 4%; 0.6%
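
The 50% cutoff stated at the top of the ranking can be checked with a short Python sketch that accumulates the listed percentages in rank order. The percentages are copied from the first six entries of the ranking. The rule used here (the cutoff is the first rank at which the running total reaches 50%) is an assumption about how the page draws the line, and `cutoff_rank` is a hypothetical helper name.

```python
# (journal, listed percentage) for ranks 1-6, copied from the ranking
ranked = [
    ("Journal of Chemical Information and Modeling", 17.3),
    ("Bioinformatics", 14.2),
    ("Cell Systems", 9.0),
    ("Nature Methods", 6.2),
    ("Nature Biotechnology", 4.8),
    ("Nature Communications", 4.2),
]


def cutoff_rank(entries, threshold=50.0):
    """Return (rank, running total) at the first rank whose total reaches the threshold.

    Assumed rule, not confirmed by the source. Returns None if never reached.
    """
    total = 0.0
    for rank, (_, percent) in enumerate(entries, start=1):
        total += percent
        if total >= threshold:
            return rank, total
    return None


rank, mass = cutoff_rank(ranked)
print(rank, round(mass, 1))  # 5 51.5
```

Under this assumption the first four journals sum to 46.7% and the fifth brings the total to 51.5%, which is consistent with the marker placed after rank 5.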