
Variance Analysis of LC-MS Experimental Factors and Their Impact on Machine Learning

Rehfeldt, T. G.; Krawczyk, K.; Echers, S. G.; Marcatili, P.; Palczynski, P.; Roettger, R.; Schwaemmle, V.

2023-05-02 bioinformatics
10.1101/2023.05.01.538996 bioRxiv

Background: Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. Results: We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning analysis to evaluate the benefits of current best-practice methods in the field. Conclusions: Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets that closely resemble future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not improve on a non-pre-trained model.

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Analytical Chemistry, 205 papers in training set, Top 0.1%, 28.7%
2. Journal of the American Society for Mass Spectrometry, 33 papers in training set, Top 0.1%, 19.3%
3. Journal of Proteome Research, 215 papers in training set, Top 0.4%, 8.7%
50% of probability mass above
4. Bioinformatics, 1061 papers in training set, Top 3%, 8.7%
5. PLOS ONE, 4510 papers in training set, Top 37%, 3.7%
6. Computational and Structural Biotechnology Journal, 216 papers in training set, Top 2%, 3.2%
7. Molecular & Cellular Proteomics, 158 papers in training set, Top 0.8%, 2.2%
8. PROTEOMICS, 35 papers in training set, Top 0.3%, 2.0%
9. Metabolites, 50 papers in training set, Top 0.5%, 1.5%
10. BMC Bioinformatics, 383 papers in training set, Top 5%, 1.5%
11. GigaScience, 172 papers in training set, Top 2%, 1.3%
12. Scientific Reports, 3102 papers in training set, Top 69%, 1.0%
13. Analytical and Bioanalytical Chemistry, 17 papers in training set, Top 0.3%, 0.9%
14. Analytica Chimica Acta, 17 papers in training set, Top 0.5%, 0.8%
15. Nature Communications, 4913 papers in training set, Top 60%, 0.8%
16. Frontiers in Plant Science, 240 papers in training set, Top 5%, 0.8%
17. PLOS Computational Biology, 1633 papers in training set, Top 24%, 0.8%
18. Briefings in Bioinformatics, 326 papers in training set, Top 7%, 0.7%
19. PeerJ, 261 papers in training set, Top 18%, 0.5%
20. Journal of Chemical Information and Modeling, 207 papers in training set, Top 3%, 0.5%
21. Journal of Proteomics, 27 papers in training set, Top 0.6%, 0.5%