Powerful read processing with matchbox

Schuster, J.; Zeglinski, K.; Xiao, L. C.; Voulgaris, O.; Rivera, S. M.; Vervoort, S. J.; Ritchie, M. E.; Gouil, Q.; Clark, M. B.

2026-02-03 bioinformatics
10.1101/2025.11.09.685711 bioRxiv

The wide variety of protocols and applications for DNA and RNA sequencing makes flexible tools for read processing an important step in sequence analysis. Beyond trimming and demultiplexing, custom read-level processing is commonly required for data exploration, QC and analysis. Existing tools are often task-specific and don't generalise to new bioinformatic problems. Thus, there is a need for a tool flexible enough to handle the full variety of read processing tasks, and fast and scalable enough to retain high performance on growing sequencing datasets. We introduce matchbox, a read processor that enables fluent manipulation and analysis of FASTA/FASTQ/SAM/BAM files. With a lightweight scripting language designed around error-tolerant pattern-matching, users can write their own matchbox scripts to tackle a wide variety of bioinformatic problems, and incorporate them into existing pipelines and workflows. We demonstrate matchbox's versatility in a number of contexts: demultiplexing long-read scRNA-seq data with 10X or SPLiT-seq barcodes; restranding RNA-seq reads; assessing CRISPR editing efficiency; and haplotyping macrosatellite repeat regions. matchbox achieves a computational performance comparable to existing tools, while addressing a broader range of bioinformatic needs, representing a new state of the art in sequence processing. matchbox is implemented in Rust and available open-source at https://github.com/jakob-schuster/matchbox.
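
To make the idea of error-tolerant pattern-matching concrete, here is a minimal Python sketch of barcode demultiplexing that tolerates substitutions. It is a generic illustration only: it is not matchbox's scripting language, and it does not claim to reproduce matchbox's matching algorithm. The barcode sequences, sample names, reads, and function names are all hypothetical, and only substitutions (Hamming distance) are handled, not insertions or deletions.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy(read: str, pattern: str, max_mismatches: int = 1) -> int:
    """Return the start of the first window of `read` that is within
    `max_mismatches` substitutions of `pattern`, or -1 if there is none."""
    width = len(pattern)
    for start in range(len(read) - width + 1):
        if hamming(read[start:start + width], pattern) <= max_mismatches:
            return start
    return -1


def assign_barcode(read: str, barcodes: dict, max_mismatches: int = 1):
    """Return the name of the only barcode found in `read`.
    Returns None when no barcode matches or when the match is ambiguous."""
    hits = [
        name
        for name, seq in barcodes.items()
        if find_fuzzy(read, seq, max_mismatches) >= 0
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes and reads, for illustration only.
BARCODES = {"sample_A": "ACGTACGT", "sample_B": "TTGCAAGC"}
exact_read = "GGG" + "ACGTACGT" + "TTTTCCCC"
noisy_read = "GGG" + "ACGTTCGT" + "TTTTCCCC"  # one substitution in the barcode

if __name__ == "__main__":
    print(assign_barcode(exact_read, BARCODES))                    # exact match
    print(assign_barcode(noisy_read, BARCODES, max_mismatches=1))  # tolerated error
    print(assign_barcode(noisy_read, BARCODES, max_mismatches=0))  # strict matching
```

With one mismatch allowed, the noisy read is still assigned to its sample, while strict matching leaves it unassigned. That trade-off between sensitivity and ambiguity is what an error tolerance setting controls.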

Matching journals

The top 5 journals together pass the 50% mark of the predicted probability mass (about 54.0% from the rounded values listed).

Rank  Journal                                          Papers in training set  Percentile  Predicted probability
1     Bioinformatics                                   1061                    Top 2%      14.5%
2     Nature Biotechnology                             147                     Top 0.4%    12.5%
3     Nature Communications                            4913                    Top 15%     12.2%
4     Genome Biology                                   555                     Top 0.4%    10.0%
5     Nature Methods                                   336                     Top 2%      4.8%
50% of probability mass above
6     NAR Genomics and Bioinformatics                  214                     Top 0.4%    4.8%
7     BMC Bioinformatics                               383                     Top 2%      3.9%
8     Nucleic Acids Research                           1128                    Top 6%      3.5%
9     Bioinformatics Advances                          184                     Top 1%      3.5%
10    Genome Research                                  409                     Top 1%      3.5%
11    PLOS ONE                                         4510                    Top 44%     2.7%
12    PLOS Computational Biology                       1633                    Top 14%     2.0%
13    Genome Medicine                                  154                     Top 5%      1.7%
14    Molecular Biology and Evolution                  488                     Top 3%      1.7%
15    Journal of Open Source Software                  22                      Top 0.1%    1.7%
16    BMC Genomics                                     328                     Top 3%      1.6%
17    GigaScience                                      172                     Top 2%      1.6%
18    Scientific Reports                               3102                    Top 68%     1.1%
19    Cell Systems                                     167                     Top 10%     1.1%
20    Proceedings of the National Academy of Sciences  2130                    Top 45%     0.7%
21    Nature                                           575                     Top 17%     0.6%
22    Briefings in Bioinformatics                      326                     Top 8%      0.6%
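
The placement of the 50% marker can be checked with a running sum over the listed probabilities. The short Python sketch below uses the rounded percentages of the six highest-ranked journals exactly as listed above. The function name and data layout are illustrative assumptions, not part of the prediction model.

```python
# Predicted probabilities (%) for the six highest-ranked journals,
# copied from the list above (rounded values as displayed).
TOP_PROBABILITIES = [
    ("Bioinformatics", 14.5),
    ("Nature Biotechnology", 12.5),
    ("Nature Communications", 12.2),
    ("Genome Biology", 10.0),
    ("Nature Methods", 4.8),
    ("NAR Genomics and Bioinformatics", 4.8),
]


def journals_to_reach(probabilities, threshold):
    """Return (count, cumulative_total) for the smallest number of
    top-ranked journals whose summed probability reaches `threshold`.
    Returns None if the threshold is never reached."""
    total = 0.0
    for count, (_name, probability) in enumerate(probabilities, start=1):
        total += probability
        if total >= threshold:
            return count, total
    return None


if __name__ == "__main__":
    count, total = journals_to_reach(TOP_PROBABILITIES, 50.0)
    print(f"{count} journals reach {total:.1f}% of the probability mass")
```

The first four journals sum to 49.2%, just short of the threshold, so the fifth is needed to cross 50%, which is consistent with where the marker sits in the list.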