Enhancing t-SNE Performance for Biological Sequencing Data through Kernel Selection

Chourasia, P.; Murad, T.; Ali, S.; Patterson, M.

2023-08-22 bioinformatics
10.1101/2023.08.21.554138 bioRxiv
Biological sequencing data encode the genetic code for many different proteins and offer vital insight into the genetic evolution of viruses. Although machine learning approaches are increasingly popular in many "Big Data" settings, they have made little progress in comprehending the nature of such data. One relevant method is t-distributed Stochastic Neighbour Embedding (t-SNE), a general-purpose approach for representing high-dimensional data in a low-dimensional (LD) space while preserving the similarity between data points. Traditionally, t-SNE is used with the Gaussian kernel. However, because the Gaussian kernel is not data-dependent, it determines each local bandwidth based on one local point only. This makes it computationally expensive and therefore limits its scalability. Moreover, it can misrepresent some structures in the data. An alternative is the isolation kernel, a data-dependent method with a single parameter to tune when computing the kernel. Although the isolation kernel yields better scalability and better preserves similarity in LD space, it may still not perform optimally in some cases. This paper presents a perspective on improving the performance of t-SNE and argues that kernel selection could affect this performance. We use 9 different kernels to evaluate their impact on the performance of t-SNE, using SARS-CoV-2 "spike" protein sequences. With three different embedding methods, we show that the cosine similarity kernel gives the best results and enhances the performance of t-SNE.
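To make the idea of swapping the kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple k-mer count embedding (the abstract does not name the three embedding methods used), computes a cosine similarity kernel between the embedded sequences, and turns that kernel into symmetric t-SNE-style affinities by plain row normalisation, which stands in for the perplexity-calibrated Gaussian step of standard t-SNE. The sequences are hypothetical toy strings, and the names `kmer_vector`, `cosine_kernel`, and `joint_probabilities` are illustrative only.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(seq, k=2):
    """Count overlapping k-mers of `seq` into a fixed-length vector (assumed embedding)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(AMINO_ACIDS, repeat=k)]


def cosine_kernel(vectors):
    """Return the pairwise cosine similarity matrix of the input vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            K[i][j] = dot / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
    return K


def joint_probabilities(K):
    """Convert a kernel matrix into symmetric affinities in the style of t-SNE.

    Each row is normalised over j != i to give conditional similarities,
    which are then symmetrised as (p_j|i + p_i|j) / (2n).
    """
    n = len(K)
    cond = []
    for i in range(n):
        row = [K[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(row)
        cond.append([x / total if total else 0.0 for x in row])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]


# Hypothetical toy sequences, not real data: the first two differ by one residue.
sequences = ["MFVFLVLLPLVSSQ", "MFVFFVLLPLVSSQ", "MKTIIALSYIFCLV"]
K = cosine_kernel([kmer_vector(s) for s in sequences])
P = joint_probabilities(K)
```

In this sketch the two near-identical sequences receive a much higher kernel value than either does with the third, and `P` plays the role of the high-dimensional affinity matrix that t-SNE would then try to match in the LD space. Comparing kernels amounts to replacing `cosine_kernel` with another similarity function while leaving the rest unchanged.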

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Bioinformatics (1061 papers in training set, Top 2%): 12.3%
2. PLOS Computational Biology (1633 papers in training set, Top 4%): 8.4%
3. PLOS ONE (4510 papers in training set, Top 25%): 6.8%
4. Frontiers in Bioinformatics (45 papers in training set, Top 0.1%): 4.9%
5. Scientific Reports (3102 papers in training set, Top 24%): 4.9%
6. Journal of Computational Biology (37 papers in training set, Top 0.1%): 4.3%
7. BMC Bioinformatics (383 papers in training set, Top 2%): 4.0%
8. Computational and Structural Biotechnology Journal (216 papers in training set, Top 1%): 3.7%
9. Journal of Chemical Information and Modeling (207 papers in training set, Top 1%): 3.6%

50% of probability mass above

10. Briefings in Bioinformatics (326 papers in training set, Top 2%): 3.6%
11. IEEE/ACM Transactions on Computational Biology and Bioinformatics (32 papers in training set, Top 0.1%): 3.6%
12. Bioinformatics Advances (184 papers in training set, Top 2%): 2.4%
13. Frontiers in Genetics (197 papers in training set, Top 4%): 2.1%
14. Computers in Biology and Medicine (120 papers in training set, Top 2%): 1.8%
15. Frontiers in Molecular Biosciences (100 papers in training set, Top 2%): 1.3%
16. Neuroinformatics (40 papers in training set, Top 0.7%): 1.2%
17. GigaScience (172 papers in training set, Top 2%): 1.2%
18. PeerJ (261 papers in training set, Top 10%): 1.2%
19. NAR Genomics and Bioinformatics (214 papers in training set, Top 3%): 1.2%
20. IEEE Journal of Biomedical and Health Informatics (34 papers in training set, Top 2%): 1.1%
21. Journal of Proteome Research (215 papers in training set, Top 2%): 0.9%
22. IEEE Access (31 papers in training set, Top 0.7%): 0.9%
23. International Journal of Molecular Sciences (453 papers in training set, Top 12%): 0.9%
24. Patterns (70 papers in training set, Top 2%): 0.9%
25. Biology Methods and Protocols (53 papers in training set, Top 2%): 0.9%
26. IEEE Transactions on Computational Biology and Bioinformatics (17 papers in training set, Top 0.6%): 0.7%
27. iScience (1063 papers in training set, Top 37%): 0.6%
28. Expert Systems with Applications (11 papers in training set, Top 0.6%): 0.6%
29. Journal of Bioinformatics and Systems Biology (14 papers in training set, Top 0.9%): 0.6%
30. Informatics in Medicine Unlocked (21 papers in training set, Top 1%): 0.6%