
An explainable boosting machine model for identifying artifacts caused by formalin-fixed paraffin embedding

Grether, V.; Goldstein, Z. R.; Shelton, J. M.; Chu, T. R.; Hooper, W. F.; Geiger, H.; Corvelo, A.; Martini, R.; Davis, M. B.; Robine, N.; Liao, W.

2026-03-13 bioinformatics
10.64898/2026.03.10.710815 bioRxiv

Background: Formalin-fixed paraffin embedding (FFPE) is a widely used, cost-effective method for long-term storage of clinical samples. However, fixation is known to damage nucleic acids, and this damage can present as artifactual bases in sequencing data that are absent from samples preserved with higher-fidelity methods such as fresh freezing (FF). Various machine learning methods exist for filtering these variant artifacts, but benchmarking their performance is difficult without reliable truth sets. In this study, we employ a collection of 90 paired FF and FFPE samples, with each pair derived from the same tumor, to robustly define real and FFPE-derived artifactual variation and to enable objective evaluation of filtering methods. To address existing shortcomings, we propose a novel explainable boosting machine (EBM) model that improves performance, can be easily updated with new data, requires modest computational resources, and is analysis-pipeline agnostic, making it broadly accessible.

Results: We evaluated several methods for limiting FFPE-derived variant artifacts using cohorts of B-cell lymphoma samples. We found that the local context around variants is a highly informative, under-utilized feature set that many existing machine learning methods do not incorporate. Consequently, we developed a novel algorithm, FIFA, for filtering FFPE artifacts. FIFA uses an EBM model, an interpretable decision-tree-based learning algorithm, to address some of the existing shortcomings. We used four independent cohorts composed of paired lymphoma and cervical cancer samples and a breast cancer cell line with both FF and FFPE samples to define clearly annotated training and test sets, and demonstrated improved performance over existing methods. Additionally, FIFA filtering increased relevant biological signals in FFPE breast cancer datasets distinct from the training and testing sets. The EBM framework employed by FIFA is computationally efficient, and because it models features additively (as a generalized additive model), new datasets can be incorporated into existing models over time.

Conclusions: Our novel FFPE variant artifact filtering tool, FIFA, is a marked improvement over existing methods. It can be easily applied post hoc to supplement existing somatic calling pipelines; training and inference run quickly across most compute environments; and it can be easily updated online as new training data become available. Accordingly, FIFA represents an important advance in retrospective cancer genomics research by further enhancing access to the vast stores of FFPE-archived tumor samples currently in existence.
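
To illustrate why a generalized additive model of this kind is interpretable, the following is a minimal sketch of additive scoring, not FIFA's implementation. The feature names, bin edges, scores, and intercept are all hypothetical and chosen only for illustration. Each feature has its own binned shape function, and the prediction is the intercept plus the sum of the per-feature contributions, so every contribution can be inspected on its own.

```python
from bisect import bisect_right

# Hypothetical per-feature shape functions: bin edges plus one additive
# score (in log-odds) per bin. These names and numbers are illustrative
# assumptions, not values taken from FIFA.
SHAPE_FUNCTIONS = {
    "variant_allele_fraction": {"edges": [0.05, 0.20], "scores": [1.2, 0.1, -0.9]},
    "is_c_to_t": {"edges": [0.5], "scores": [-0.4, 0.8]},
}
INTERCEPT = -0.3  # hypothetical baseline log-odds


def explain(variant):
    """Return the additive contribution of each feature for one variant."""
    contributions = {}
    for name, fn in SHAPE_FUNCTIONS.items():
        # Find the bin the feature value falls into, then look up its score.
        bin_index = bisect_right(fn["edges"], variant[name])
        contributions[name] = fn["scores"][bin_index]
    return contributions


def artifact_log_odds(variant):
    """Sum the intercept and all per-feature contributions."""
    return INTERCEPT + sum(explain(variant).values())


if __name__ == "__main__":
    variant = {"variant_allele_fraction": 0.03, "is_c_to_t": 1}
    print(explain(variant))
    print(artifact_log_odds(variant))
```

Because the terms are independent and additive, a per-feature table of this form can be inspected or adjusted one feature at a time, which is one plausible reading of why the abstract describes the model as easy to update.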

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Papers in training set  Percentile  Probability
1     BMC Bioinformatics                                  383                     Top 0.1%    32.6%
2     Bioinformatics                                      1061                    Top 3%      9.1%
3     PLOS Computational Biology                          1633                    Top 5%      7.1%
4     BMC Genomics                                        328                     Top 0.5%    4.3%
------ 50% of probability mass above ------
5     NAR Genomics and Bioinformatics                     214                     Top 0.7%    3.6%
6     Briefings in Bioinformatics                         326                     Top 2%      3.6%
7     Bioinformatics Advances                             184                     Top 1%      3.6%
8     PLOS ONE                                            4510                    Top 40%     3.6%
9     Clinical Chemistry                                  22                      Top 0.2%    2.6%
10    Computational and Structural Biotechnology Journal  216                     Top 3%      2.4%
11    Scientific Reports                                  3102                    Top 48%     2.3%
12    Genome Medicine                                     154                     Top 3%      2.1%
13    Biology Methods and Protocols                       53                      Top 1.0%    1.7%
14    GigaScience                                         172                     Top 1%      1.7%
15    JCO Clinical Cancer Informatics                     18                      Top 0.5%    1.5%
16    Frontiers in Genetics                               197                     Top 6%      1.3%
17    Cancer Research Communications                      46                      Top 0.9%    0.9%
18    BioData Mining                                      15                      Top 0.7%    0.9%
19    Microbial Genomics                                  204                     Top 2%      0.9%
20    Communications Biology                              886                     Top 21%     0.8%
21    Frontiers in Bioinformatics                         45                      Top 0.8%    0.8%
22    BMC Medical Genomics                                36                      Top 1%      0.7%
23    PeerJ                                               261                     Top 15%     0.7%
24    The Journal of Molecular Diagnostics                36                      Top 0.6%    0.6%
25    iScience                                            1063                    Top 38%     0.6%
26    Nucleic Acids Research                              1128                    Top 20%     0.6%
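
The 50% cutoff in the ranking can be checked with simple arithmetic. The sketch below assumes the cutoff marks the smallest number of top-ranked journals whose cumulative predicted probability reaches 50%; the page does not state this rule, and the function name is hypothetical. The four top probabilities sum to 53.1%, while the first three sum to only 48.8%.

```python
# Predicted probabilities (in percent) of the four top-ranked journals,
# copied from the ranking above.
top_probabilities = [32.6, 9.1, 7.1, 4.3]


def journals_needed(probabilities, threshold=50.0):
    """Return how many top-ranked entries are needed to reach the threshold.

    Assumes the probabilities are already sorted in descending order.
    Returns None if the threshold is never reached.
    """
    cumulative = 0.0
    for count, probability in enumerate(probabilities, start=1):
        cumulative += probability
        if cumulative >= threshold:
            return count
    return None


if __name__ == "__main__":
    print(journals_needed(top_probabilities))
```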