
TCRseek: Scalable Approximate Nearest Neighbor Search for T-Cell Receptor Repertoires via Windowed k-mer Embeddings

Yang, Y.

2026-03-24 bioinformatics
10.64898/2026.03.20.713313 bioRxiv

The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman-Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation (where the reranking and ground-truth metrics differ, providing the most informative test of generalization), TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground-truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors, which is the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved a 3.6x to 39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval. These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.
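
The two-stage design described in the abstract (a cheap embedding-based shortlist followed by exact reranking) can be illustrated with a minimal, standard-library sketch. It is not the TCRseek implementation: the plain 2-mer count vector stands in for the BLOSUM62-derived windowed embedding, the exhaustive cosine scan stands in for a FAISS ANN index, and the function names, parameters, and toy CDR3-like strings are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_embedding(seq, k=2):
    """Sparse k-mer count vector (simplified stand-in, NOT the paper's embedding)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def levenshtein(s, t):
    """Edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]


def two_stage_search(query, corpus, shortlist_size=3, top_k=2):
    """Shortlist by embedding similarity, then rerank the shortlist exactly."""
    q = kmer_embedding(query)
    # Stage 1: exhaustive similarity scan (assumption: an ANN index would replace this).
    shortlist = sorted(corpus,
                       key=lambda s: cosine(q, kmer_embedding(s)),
                       reverse=True)[:shortlist_size]
    # Stage 2: exact Levenshtein distance on the shortlisted candidates only.
    scored = sorted((levenshtein(query, s), s) for s in shortlist)
    return [(s, d) for d, s in scored[:top_k]]


# Toy sequences invented for this example.
corpus = [
    "CASSLGQAYEQYF",
    "CASSLGQGYEQYF",
    "CASSPGTGGNEQFF",
    "CAWSVGGTDTQYF",
    "CASSLAPGATNEKLFF",
]
hits = two_stage_search("CASSLGQAYEQYF", corpus)
print(hits)
```

The point of the split is that the expensive exact metric runs only on `shortlist_size` candidates rather than on the whole corpus, so the shortlist size bounds both the reranking cost and the best recall the second stage can achieve.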

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Biotechnology; 147 papers in training set; Top 0.3%; 14.5%
2. Genome Research; 409 papers in training set; Top 0.1%; 12.2%
3. Bioinformatics; 1061 papers in training set; Top 2%; 12.2%
4. Nature Methods; 336 papers in training set; Top 0.9%; 10.3%
5. Nucleic Acids Research; 1128 papers in training set; Top 2%; 10.0%
50% of probability mass above
6. Cell Systems; 167 papers in training set; Top 2%; 6.2%
7. Bioinformatics Advances; 184 papers in training set; Top 0.7%; 4.8%
8. Nature Communications; 4913 papers in training set; Top 40%; 3.5%
9. Briefings in Bioinformatics; 326 papers in training set; Top 3%; 2.6%
10. PLOS Computational Biology; 1633 papers in training set; Top 14%; 2.1%
11. iScience; 1063 papers in training set; Top 12%; 1.9%
12. Genome Biology; 555 papers in training set; Top 4%; 1.7%
13. Genome Medicine; 154 papers in training set; Top 6%; 1.3%
14. Nature Machine Intelligence; 61 papers in training set; Top 3%; 1.2%
15. Cell Reports Methods; 141 papers in training set; Top 5%; 0.8%
16. BMC Bioinformatics; 383 papers in training set; Top 7%; 0.8%
17. GigaScience; 172 papers in training set; Top 3%; 0.8%
18. Nature Computational Science; 50 papers in training set; Top 2%; 0.7%
19. Nature; 575 papers in training set; Top 16%; 0.7%
20. Patterns; 70 papers in training set; Top 3%; 0.7%
21. Advanced Science; 249 papers in training set; Top 22%; 0.6%
22. Bioengineering; 24 papers in training set; Top 2%; 0.6%
23. Genomics, Proteomics & Bioinformatics; 171 papers in training set; Top 7%; 0.6%
24. Science Advances; 1098 papers in training set; Top 33%; 0.6%
25. Scientific Reports; 3102 papers in training set; Top 79%; 0.6%
26. NAR Genomics and Bioinformatics; 214 papers in training set; Top 4%; 0.6%