TPCAV: Interpreting deep learning genomics models via concept attribution

Yang, J.; Mahony, S.

2026-01-21 bioinformatics
10.64898/2026.01.20.700723 bioRxiv

Interpreting genomics deep learning models remains challenging. Existing feature attribution methods largely focus on scoring individual bases or extracting global DNA motifs from one-hot encoded inputs, leaving them unable to assess broader genomic features such as chromatin accessibility or sequence annotations. Concept attribution methods offer an input-agnostic global interpretation framework, yet they have not been systematically applied to interpret neural network applications in genomics. We present the first application of concept attribution to interpret genomics deep learning models by adapting the Testing with Concept Activation Vectors (TCAV) method. We introduce Testing with PCA-projected Concept Activation Vectors (TPCAV), which improves upon the original method by using a PCA-based decorrelation transformation to address the correlated and redundant embedding features common in genomics models. We also introduce a strategy for extracting concept-specific input attribution maps. We evaluate our approach by interpreting influential biological concepts across a diverse set of genomics models spanning multiple input representations and prediction tasks. We demonstrate that TPCAV provides more reliable DNA motif interpretation than TCAV and is comparable to TF-MoDISco on one-hot coded DNA-based transcription factor binding prediction models. Beyond motif interpretation, TPCAV enables robust interpretive analysis of more general concepts such as repetitive elements and chromatin accessibility and generalizes to tokenized foundation models as well as models incorporating chromatin signal inputs. We further show that TPCAV can identify representative transcription factor binding sites associated with specific concepts, motivating downstream investigation of distinct binding mechanisms. Overall, TPCAV provides a flexible and robust complement to existing model interpretation techniques.
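The abstract's central idea, decorrelating embedding features with PCA before fitting a concept activation vector, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`fit_pca_whitener`, `fit_cav`, `concept_sensitivity_score`), the choice to fit the PCA on concept and control activations together, the plain gradient-descent logistic regression, and the toy linear "model head" are all assumptions made for illustration only.

```python
import numpy as np

def fit_pca_whitener(acts, eps=1e-8):
    """Fit a PCA whitening transform on activations of shape (n, dim).

    Returns the mean, a forward map into decorrelated unit-variance
    coordinates, and a backward map to the original embedding space.
    """
    mean = acts.mean(axis=0)
    _, s, vt = np.linalg.svd(acts - mean, full_matrices=False)
    std = s / np.sqrt(len(acts) - 1)
    keep = std > eps                           # drop degenerate components
    forward = vt[keep].T / std[keep]           # shape (dim, k)
    backward = vt[keep] * std[keep][:, None]   # shape (k, dim)
    return mean, forward, backward

def fit_cav(concept_z, control_z, lr=0.5, steps=300):
    """Fit a logistic regression (concept vs. control) by gradient descent.

    The unit-normalised weight vector plays the role of the CAV.
    """
    x = np.vstack([concept_z, control_z])
    y = np.concatenate([np.ones(len(concept_z)), np.zeros(len(control_z))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        err = p - y
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w / np.linalg.norm(w)

def concept_sensitivity_score(grads, cav, backward):
    """Fraction of examples whose output gradient points along the concept.

    The CAV lives in PCA space, so it is mapped back to the embedding
    space before taking the directional derivative.
    """
    direction = cav @ backward
    return float(np.mean(grads @ direction > 0))

# Toy data (hypothetical): correlated 6-d embeddings from 3 latent factors.
rng = np.random.default_rng(0)
n, dim = 200, 6
mixing = rng.normal(size=(3, dim))
acts = rng.normal(size=(2 * n, 3)) @ mixing + 0.05 * rng.normal(size=(2 * n, dim))
d = mixing[0] / np.linalg.norm(mixing[0])      # direction that encodes the concept
concept, control = acts[:n] + 5.0 * d, acts[n:]

mean, forward, backward = fit_pca_whitener(np.vstack([concept, control]))
z = (np.vstack([concept, control]) - mean) @ forward
cav = fit_cav(z[:n], z[n:])

# Toy linear head f(x) = x @ d, so every example has output gradient d.
grads = np.tile(d, (50, 1))
score = concept_sensitivity_score(grads, cav, backward)
```

In this toy setup the model output depends on the same direction that defines the concept, so the score should come out high; with a real genomics model, the gradients would come from backpropagation through the layers above the chosen embedding.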

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 1.0% | 23.4%
2. Bioinformatics Advances | 184 papers in training set | Top 0.3% | 7.4%
3. Cell Systems | 167 papers in training set | Top 2% | 7.1%
4. Genome Biology | 555 papers in training set | Top 0.8% | 7.1%
5. Nature Communications | 4913 papers in training set | Top 27% | 6.6%

50% of probability mass above

6. PLOS Computational Biology | 1633 papers in training set | Top 6% | 5.0%
7. BMC Bioinformatics | 383 papers in training set | Top 2% | 4.5%
8. NAR Genomics and Bioinformatics | 214 papers in training set | Top 0.6% | 3.8%
9. Nucleic Acids Research | 1128 papers in training set | Top 5% | 3.7%
10. Briefings in Bioinformatics | 326 papers in training set | Top 2% | 3.7%
11. Genome Research | 409 papers in training set | Top 1% | 3.0%
12. BMC Genomics | 328 papers in training set | Top 2% | 2.0%
13. IEEE Transactions on Computational Biology and Bioinformatics | 17 papers in training set | Top 0.2% | 1.7%
14. Nature Machine Intelligence | 61 papers in training set | Top 2% | 1.5%
15. Nature Methods | 336 papers in training set | Top 5% | 1.4%
16. Nature Biotechnology | 147 papers in training set | Top 5% | 1.3%
17. Frontiers in Genetics | 197 papers in training set | Top 7% | 1.1%
18. GigaScience | 172 papers in training set | Top 2% | 1.0%
19. Scientific Reports | 3102 papers in training set | Top 70% | 0.9%
20. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 8% | 0.8%
21. PLOS ONE | 4510 papers in training set | Top 65% | 0.8%
22. Genome Medicine | 154 papers in training set | Top 8% | 0.8%
23. Cell Genomics | 162 papers in training set | Top 6% | 0.8%
24. iScience | 1063 papers in training set | Top 32% | 0.7%
25. Advanced Science | 249 papers in training set | Top 19% | 0.7%
26. Cell Reports Methods | 141 papers in training set | Top 6% | 0.5%
27. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 48% | 0.5%