CBIcall: a configuration-driven framework for variant calling in large sequencing cohorts

Rueda, M.; Fernandez Orth, D.; Gut, I. G.

2026-03-25 bioinformatics
10.64898/2026.03.23.713646 bioRxiv
Motivation: Variant calling for next-generation sequencing (NGS) data relies on a diverse ecosystem of tools and workflows. Large-scale collaborative studies increasingly adopt federated analysis, where each institution processes sensitive data locally using standardized pipelines. Deploying identical pipelines across multiple centers remains challenging because heterogeneous software environments and computing policies can cause workflow divergence and inconsistent results.

Results: We developed CBIcall, a workflow-agnostic, configuration-driven framework that runs standardized variant-calling pipelines from raw FASTQ files to analysis-ready VCFs using a single YAML file. An execution driver validates user parameters, enforces compatibility across pipelines, analysis modes, workflow backends, genome builds, and tool versions, and records structured provenance for each run, ensuring consistent and reproducible pipeline execution across computing environments. CBIcall dispatches validated workflows through Bash or Snakemake backends and provides production-ready pipelines for germline WES, WGS (single-sample or cohort joint genotyping following GATK Best Practices), and mitochondrial DNA analysis. We validated CBIcall on public datasets and deployed it in the EU HEREDITARY project, processing 1,111 samples with both WES and mtDNA pipelines on an institutional HPC system, demonstrating its suitability for reproducible cohort-scale genomic analyses.

Availability and implementation: CBIcall is open source (GPLv3) and distributed with ready-to-run pipelines; full dependency and installation documentation is available at https://github.com/CNAG-Biomedical-Informatics/cbicall.
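To make the execution-driver idea concrete, the following is a minimal sketch of how a configuration could be validated, checked for compatibility, and given a provenance record before dispatch. It is not CBIcall's implementation: the keys (`pipeline`, `mode`, `backend`, `genome`), the allowed values, the compatibility table, and the function names are all hypothetical, chosen only for illustration. The config is shown as an already-parsed dictionary; in practice it would be loaded from the YAML file.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical compatibility table; NOT CBIcall's actual schema or rules.
ALLOWED = {
    "wes": {"modes": {"single", "cohort"}, "backends": {"bash", "snakemake"}},
    "wgs": {"modes": {"single", "cohort"}, "backends": {"snakemake"}},
    "mit": {"modes": {"single"}, "backends": {"bash"}},
}
REQUIRED_KEYS = ("pipeline", "mode", "backend", "genome")


def validate(config: dict) -> list[str]:
    """Return a list of error messages; an empty list means the config is valid."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if errors:
        return errors
    rules = ALLOWED.get(config["pipeline"])
    if rules is None:
        return [f"unknown pipeline: {config['pipeline']}"]
    # Compatibility checks: each pipeline supports only some modes and backends.
    if config["mode"] not in rules["modes"]:
        errors.append(f"mode {config['mode']!r} not supported by {config['pipeline']!r}")
    if config["backend"] not in rules["backends"]:
        errors.append(f"backend {config['backend']!r} not supported by {config['pipeline']!r}")
    return errors


def provenance(config: dict) -> dict:
    """Build a structured record identifying exactly which config was run."""
    canonical = json.dumps(config, sort_keys=True)
    return {
        "config": config,
        "config_sha256": hashlib.sha256(canonical.encode()).hexdigest(),
        "started_at": datetime.now(timezone.utc).isoformat(),
    }


def run(config: dict) -> dict:
    """Validate first, then record provenance; refuse to run an invalid config."""
    errors = validate(config)
    if errors:
        raise ValueError("; ".join(errors))
    record = provenance(config)
    # A real driver would now dispatch to the selected backend (Bash or Snakemake).
    return record


if __name__ == "__main__":
    cfg = {"pipeline": "wes", "mode": "cohort", "backend": "snakemake", "genome": "hg38"}
    print(json.dumps(run(cfg), indent=2))
```

Rejecting an invalid combination before any job starts, and hashing the canonicalized config, is one way two centers could confirm they ran the same pipeline settings. This sketch only checks that `genome` is present; a fuller driver would also validate genome builds and tool versions.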

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics | 1061 papers in training set | Top 1.0% | 23.2%
2. Nature Communications | 4913 papers in training set | Top 9% | 14.8%
3. Genome Medicine | 154 papers in training set | Top 0.6% | 8.7%
4. Genome Biology | 555 papers in training set | Top 0.7% | 7.4%
(50% of probability mass above)
5. Nature Methods | 336 papers in training set | Top 2% | 5.0%
6. BMC Bioinformatics | 383 papers in training set | Top 2% | 4.3%
7. Nature Biotechnology | 147 papers in training set | Top 2% | 3.7%
8. Nucleic Acids Research | 1128 papers in training set | Top 5% | 3.7%
9. GigaScience | 172 papers in training set | Top 0.7% | 2.7%
10. Bioinformatics Advances | 184 papers in training set | Top 2% | 2.1%
11. Nature Genetics | 240 papers in training set | Top 4% | 1.9%
12. NAR Genomics and Bioinformatics | 214 papers in training set | Top 1% | 1.9%
13. The American Journal of Human Genetics | 206 papers in training set | Top 2% | 1.7%
14. Briefings in Bioinformatics | 326 papers in training set | Top 4% | 1.4%
15. Genome Research | 409 papers in training set | Top 3% | 1.3%
16. Nature Computational Science | 50 papers in training set | Top 0.9% | 1.3%
17. Scientific Reports | 3102 papers in training set | Top 69% | 1.0%
18. PLOS Computational Biology | 1633 papers in training set | Top 23% | 0.8%
19. Alzheimer's & Dementia | 143 papers in training set | Top 3% | 0.7%
20. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.9% | 0.7%
21. PLOS ONE | 4510 papers in training set | Top 73% | 0.5%
22. Science | 429 papers in training set | Top 22% | 0.5%
23. NAR Cancer | 36 papers in training set | Top 0.3% | 0.5%
24. Nature | 575 papers in training set | Top 17% | 0.5%
25. Nature Medicine | 117 papers in training set | Top 6% | 0.5%
26. Communications Biology | 886 papers in training set | Top 31% | 0.5%
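
The statement that the top 4 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below copies the six highest percentages from the ranking; the function name `journals_to_reach` is a hypothetical helper, not part of any tool described here.

```python
# Predicted probabilities (percent) for the six top-ranked journals in the list.
RANKED = [
    ("Bioinformatics", 23.2),
    ("Nature Communications", 14.8),
    ("Genome Medicine", 8.7),
    ("Genome Biology", 7.4),
    ("Nature Methods", 5.0),
    ("BMC Bioinformatics", 4.3),
]


def journals_to_reach(ranked, threshold):
    """Return (count, cumulative) for the first rank at which the sum reaches threshold."""
    total = 0.0
    for count, (_, prob) in enumerate(ranked, start=1):
        total += prob
        if total >= threshold:
            return count, total
    return None  # threshold not reached with the journals supplied


count, total = journals_to_reach(RANKED, 50.0)
print(count, round(total, 1))  # prints: 4 54.1
```

The first three journals sum to 46.7%, so the fourth is the one that carries the cumulative total past 50%.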