MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders

Wijaya, A. S.; Leung, H.; Yoo, H.

2026-05-28 bioinformatics

10.64898/2026.05.25.727711 bioRxiv


Motivation: Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence (CDS) and TSS-centred genomic contexts.

Results: In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / κ 0.821, compared with macro-F1 0.672 / κ 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.

Availability and implementation: Source code: https://github.com/Austin-Senna/dna_to_text; Python ≥3.11.

Contact: asw2215@columbia.edu

Supplementary information: Supplementary tables, figures, and reproducibility details are included at the end of this preprint.
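As a rough illustration of the quantities reported above, the following minimal sketch computes normalised 4-mer frequencies for a sequence (the kind of feature a CDS 4-mer baseline would use) and the two summary metrics. It is not the authors' implementation: the function names are hypothetical, and it assumes that κ denotes Cohen's kappa and that 4-mers are counted with overlap over the A/C/G/T alphabet, neither of which the abstract specifies.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over A/C/G/T, in a fixed order (assumed alphabet).
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def kmer_frequencies(seq, k=4):
    """Return normalised overlapping k-mer frequencies for one sequence.

    Windows containing characters outside A/C/G/T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS)
    return [counts[m] / total if total else 0.0 for m in KMERS]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGT")
    print(len(features), features[KMERS.index("ACGT")])
    print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
    print(cohen_kappa(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

In a probing setup of the kind described, such 4-mer vectors (or frozen encoder embeddings) would be passed to a linear classifier, and its held-out predictions scored with the two metrics.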

