MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders
Wijaya, A. S.; Leung, H.; Yoo, H.
Motivation: Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence (CDS) and TSS-centred genomic contexts.

Results: In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / κ 0.821, compared with macro-F1 0.672 / κ 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.

Availability and implementation: Source code: https://github.com/Austin-Senna/dna_to_text; Python ≥3.11.

Contact: asw2215@columbia.edu

Supplementary information: Supplementary tables, figures, and reproducibility details are included at the end of this preprint.
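For readers less familiar with the two classification scores quoted in the Results, the following is a minimal sketch of how macro-F1 and κ can be computed from true and predicted family labels. It assumes that κ denotes Cohen's kappa, which the abstract does not state explicitly; the function names and the toy labels are illustrative and are not taken from the MINA code base.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa (assumed meaning of kappa): agreement corrected for chance."""
    n = len(y_true)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the product of the marginal label frequencies.
    p_exp = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)  # undefined when p_exp == 1


# Toy example with three hypothetical classes (the benchmark itself uses five).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 4), round(cohen_kappa(y_true, y_pred), 4))
```

Macro-F1 weights each family equally regardless of its size, while κ discounts the agreement a classifier would reach by chance given the label marginals, so the two scores can rank methods differently when families are imbalanced.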