Rastair: an integrated variant and methylation caller

Etzioni, Z.; Zhao, L.; Hertleif, P.; Schuster-Boeckler, B.

2026-03-23 bioinformatics
10.64898/2026.03.19.712983 bioRxiv
Cytosine methylation is a crucial epigenetic mark that impacts tissue-specific chromatin conformation and gene expression. For many years, bisulfite sequencing (BS-seq), which converts all non-methylated cytosines (C) to thymine (T), remained the only approach to measure cytosine methylation at base resolution. Recently, however, several new methods that convert only methylated cytosines to thymine (mC→T) have become widely available. Here we present rastair, an integrated software toolkit for simultaneous SNP detection and methylation calling from mC→T sequencing data such as those created with Watchmaker's TAPS+ and Illumina's 5-Base chemistries. Rastair combines machine-learning-based variant detection with genotype-aware methylation estimation. Using NA12878 benchmark datasets, we show that rastair outperforms existing methylation-aware SNP callers and achieves F1 scores exceeding 0.99 for datasets above 30x depth, matching the accuracy of state-of-the-art tools run on whole-genome sequencing data. At the same time, rastair is significantly faster than other genetic variant callers: processing a 30x depth file takes less than 30 minutes given 32 CPU cores on an Intel Xeon, and half as long when a GPU is available. By integrating genotyping with methylation calling, rastair reports an additional 500,000 positions in NA12878 where a SNP turns a non-CpG reference position into a "de-novo" CpG. Conversely, rastair also identifies positions where a variant disrupts a CpG and corrects their reported methylation levels. Rastair produces standards-compliant outputs in VCF, BAM and BED formats, facilitating integration into downstream analysis pipelines. Rastair is open-source and available via conda, Docker Hub, and as pre-compiled binaries from https://www.rastair.com.
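
The idea of a SNP creating or disrupting a CpG can be illustrated with a minimal Python sketch. This is not rastair's implementation: the function names `apply_snp` and `classify_cpg` are hypothetical, and the sketch assumes a single haploid allele on the forward strand only, ignoring genotype likelihoods, strand, and read evidence.

```python
def apply_snp(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based index pos replaced by alt (hypothetical helper)."""
    return seq[:pos] + alt + seq[pos + 1:]


def classify_cpg(ref_dinuc: str, sample_dinuc: str) -> str:
    """Compare a reference dinucleotide with the sample's dinucleotide.

    Simplifying assumption: forward strand only, one allele.
    """
    ref_is_cpg = ref_dinuc.upper() == "CG"
    sample_is_cpg = sample_dinuc.upper() == "CG"
    if sample_is_cpg and not ref_is_cpg:
        return "de-novo CpG"    # variant creates a CpG absent from the reference
    if ref_is_cpg and not sample_is_cpg:
        return "disrupted CpG"  # variant destroys a reference CpG
    return "unchanged CpG" if ref_is_cpg else "not a CpG"


# Reference "CA" with an A>G SNP at index 1 becomes "CG".
created = classify_cpg("CA", apply_snp("CA", 1, "G"))
# Reference "CG" with a G>A SNP at index 1 becomes "CA".
disrupted = classify_cpg("CG", apply_snp("CG", 1, "A"))
```

In the first case a methylation caller that looks only at reference CpGs would miss the site. In the second it would report a methylation level for a position that is no longer a CpG in the sample.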

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Bioinformatics | 1061 | Top 1% | 19.6%
2 | Nature Communications | 4913 | Top 9% | 14.9%
3 | Genome Biology | 555 | Top 0.3% | 10.2%
4 | Genome Research | 409 | Top 0.3% | 6.9%
(50% of probability mass above)
5 | Genome Medicine | 154 | Top 0.8% | 6.9%
6 | Nature Methods | 336 | Top 1% | 6.9%
7 | Nature Biotechnology | 147 | Top 2% | 4.9%
8 | BMC Bioinformatics | 383 | Top 3% | 3.7%
9 | Nucleic Acids Research | 1128 | Top 6% | 3.6%
10 | The American Journal of Human Genetics | 206 | Top 2% | 2.4%
11 | Bioinformatics Advances | 184 | Top 2% | 2.1%
12 | NAR Genomics and Bioinformatics | 214 | Top 1% | 1.9%
13 | Nature | 575 | Top 12% | 1.3%
14 | PLOS ONE | 4510 | Top 58% | 1.3%
15 | Briefings in Bioinformatics | 326 | Top 5% | 1.2%
16 | Scientific Reports | 3102 | Top 73% | 0.8%
17 | Cell Systems | 167 | Top 12% | 0.8%
18 | Nature Computational Science | 50 | Top 2% | 0.8%
19 | Nature Machine Intelligence | 61 | Top 3% | 0.8%
20 | PLOS Computational Biology | 1633 | Top 26% | 0.7%
21 | Communications Biology | 886 | Top 26% | 0.7%
22 | Cell Reports Methods | 141 | Top 6% | 0.7%
23 | Science | 429 | Top 22% | 0.5%
24 | Cell Genomics | 162 | Top 8% | 0.5%