MolCodon: A Codon-Based Molecular Language for Interpretable Structural Representation and Similarity Search
Sayyah, E.; Kurul, E.; Tunc, H.; Durdagi, S.
Molecular representation determines which aspects of chemical structure can be learned, compared, and interpreted in computational drug discovery. Existing encodings typically emphasize either compact string description, as in SMILES and SELFIES, or efficient similarity search, as in circular fingerprints, but they may not simultaneously provide deterministic sequence structure, graph-level interpretability, pharmacophore annotation, and high-fidelity molecular reconstruction. Here, we introduce MolCodon, a codon-based molecular language that represents small molecules as deterministic sequences of fixed-width three-character tokens over a five-symbol alphabet: C, N, O, S, and X. Inspired by the triplet organization of the genetic code, MolCodon assigns chemically defined codon families to atoms, bonds, ring and branch topology, fused-ring references, pharmacophore features, bond mobility, charge, and stereochemistry. A deterministic graph traversal with ring-contiguity preservation produces sequences in which chemically meaningful substructures remain locally organized and traceable to the underlying molecular graph. Across approximately 2.9 million molecules from six commercial screening libraries, MolCodon achieved 98.93% InChIKey-level round-trip fidelity, supporting its use as a high-fidelity sequence representation for drug-like chemistry. MolCodon-derived sparse sequence and trace features further outperformed SELFIES and Group SELFIES across ten QSAR tasks and exceeded classical fingerprint baselines in six of the ten tasks. As an application of the representation, the MolCodon BLAST similarity engine decomposes molecular similarity into ring topology, branch context, attachment architecture, and pharmacophore correspondence, enabling interpretable scaffold-hopping searches. In a PARP1 virtual screening study, MolCodon retrieved scaffold-diverse candidates similar to the known PARP1 inhibitor olaparib.
Together, these results establish MolCodon as a new molecular representation paradigm that transforms chemical graphs into high-fidelity, interpretable, and alignment-compatible codon sequences, opening a direct path for bioinformatics-inspired analysis of small-molecule chemical space. The MolCodon encoder, decoder, and BLAST similarity engine are freely available as open-source software at https://github.com/DurdagiLab/MolCodon
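The fixed-width token structure described in the abstract can be illustrated with a minimal Python sketch. It assumes only what the abstract states (three-character tokens over the alphabet C, N, O, S, X); the helper name `split_codons` and the example string are hypothetical, and the sketch does not reproduce the actual MolCodon codon assignments or the API of the released software.

```python
from itertools import product

ALPHABET = "CNOSX"   # five-symbol alphabet stated in the abstract
CODON_WIDTH = 3      # fixed-width three-character tokens

# Every syntactically possible codon over the alphabet: 5**3 combinations.
# Which of these MolCodon actually assigns, and to what, is not stated here.
ALL_CODONS = ["".join(p) for p in product(ALPHABET, repeat=CODON_WIDTH)]


def split_codons(sequence: str) -> list[str]:
    """Split a codon string into fixed-width tokens (hypothetical helper).

    Raises ValueError if the length is not a multiple of the codon width
    or if the string contains symbols outside the alphabet.
    """
    if len(sequence) % CODON_WIDTH != 0:
        raise ValueError("sequence length is not a multiple of the codon width")
    bad = set(sequence) - set(ALPHABET)
    if bad:
        raise ValueError(f"symbols outside the alphabet: {sorted(bad)}")
    return [sequence[i:i + CODON_WIDTH]
            for i in range(0, len(sequence), CODON_WIDTH)]


# Arbitrary placeholder string, NOT a real MolCodon encoding of any molecule.
example = "CCNOXSCCX"
print(len(ALL_CODONS))        # size of the codon space
print(split_codons(example))  # fixed-width tokens
```

Because every token has the same width, token boundaries are recoverable from position alone, which is the property that makes such sequences straightforward to align.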
Matching journals
The top 5 journals account for 50% of the predicted probability mass.