
Evaluating Limits of Machine Learning-Assisted Raman Spectroscopy in Classification of Biological Samples

Yadav, A.; Birkby, A.; Armstrong, N.; Arnob, A.; Chou, M.-H.; Fernandez, A.; Verhoef, A. J.; Yi, Z.; Gulati, S.; Kotnis, S.; Sun, Q.; Kao, K. C.; Wu, H.-J.

2026-03-01 bioinformatics
10.64898/2026.02.26.708284 bioRxiv

Machine learning (ML)-assisted Raman spectroscopy has become a powerful analytical tool for the classification and identification of analytes; however, the technical challenges that affect its detection accuracy have not been investigated. This study explores experimental factors affecting classification performance. Among the evaluated ML models, the choice of algorithm has minimal impact on classification accuracy. Instead, experimental factors, including spectral similarity between tested samples and data quality, dominate detection performance. Increases in spectral noise and spectral similarity significantly reduce classification accuracy. In well-controlled samples with low experimental noise, ML-assisted Raman spectroscopy can discriminate lipid mixtures with a composition difference of 1.85 mol%. To assess the effect of biological heterogeneity, we analyzed single-cell Raman spectra from Saccharomyces cerevisiae strains carrying single, double, or triple gene mutations. Intrinsic cell-to-cell variability introduced substantial spectral differences, severely reducing the accuracy of multiclass classification of these genetically similar strains at the single-cell level. Averaging Raman spectra across multiple cells improved classification accuracy by reducing this spectral variability. We also assess the effectiveness of transfer learning across different Raman spectrometers, specifically by applying an ML model trained on one instrument to another Raman spectrometer. Transfer learning can be improved with proper instrument calibration, highlighting the importance of instrument standardization. Overall, our results demonstrate that data quality and spectral similarity are the primary bottlenecks in ML-assisted Raman spectroscopy. Careful attention to sample preparation, data acquisition, measurement conditions, and instrument calibration is critical to achieving robust and reliable classification performance.
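
The abstract's point that averaging spectra across multiple cells reduces spectral variability can be illustrated with a small simulation. This is only a sketch under strong assumptions, not the authors' pipeline: the spectrum is a made-up single Gaussian peak, cell-to-cell variability is modeled as independent Gaussian noise, and all function names are hypothetical. Real biological heterogeneity is not purely random noise, so the improvement in practice may be smaller.

```python
import math
import random

def synthetic_spectrum(n_points=200):
    """Hypothetical noise-free spectrum: one Gaussian peak on a flat baseline."""
    return [math.exp(-((i - 100) / 10.0) ** 2) for i in range(n_points)]

def noisy_copy(spectrum, sigma, rng):
    """Add independent Gaussian noise as a stand-in for cell-to-cell variability (assumption)."""
    return [y + rng.gauss(0.0, sigma) for y in spectrum]

def average_spectra(spectra):
    """Point-wise mean across a list of equal-length spectra."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

def rms_error(spectrum, reference):
    """Root-mean-square deviation of a spectrum from the reference."""
    squared = sum((a - b) ** 2 for a, b in zip(spectrum, reference))
    return math.sqrt(squared / len(reference))

rng = random.Random(0)  # fixed seed so the run is repeatable
truth = synthetic_spectrum()
cells = [noisy_copy(truth, sigma=1.0, rng=rng) for _ in range(100)]

single_rms = rms_error(cells[0], truth)
averaged_rms = rms_error(average_spectra(cells), truth)
print(f"single-cell RMS error:      {single_rms:.3f}")
print(f"100-cell average RMS error: {averaged_rms:.3f}")
```

For independent noise, the error of an n-cell average is expected to shrink roughly as 1/sqrt(n), which is why the averaged spectrum sits much closer to the underlying signal than any single-cell spectrum.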

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Analytical Chemistry | 205 | Top 0.1% | 23.2%
2 | Analytica Chimica Acta | 17 | Top 0.1% | 10.4%
3 | Scientific Reports | 3102 | Top 13% | 7.0%
4 | PLOS ONE | 4510 | Top 24% | 7.0%
5 | Analytical and Bioanalytical Chemistry | 17 | Top 0.1% | 3.7%
50% of probability mass above
6 | Nature Communications | 4913 | Top 38% | 3.7%
7 | Frontiers in Plant Science | 240 | Top 3% | 2.8%
8 | Bioinformatics | 1061 | Top 7% | 1.7%
9 | Journal of Biomedical Optics | 25 | Top 0.3% | 1.7%
10 | Nano Letters | 63 | Top 1% | 1.7%
11 | Journal of the American Society for Mass Spectrometry | 33 | Top 0.3% | 1.5%
12 | Biophysical Journal | 545 | Top 3% | 1.4%
13 | Optica | 25 | Top 0.5% | 1.4%
14 | ACS Omega | 90 | Top 3% | 1.1%
15 | The Analyst | 15 | Top 0.4% | 1.0%
16 | Water Research | 74 | Top 1% | 1.0%
17 | Biomedical Optics Express | 84 | Top 0.9% | 0.9%
18 | Environmental Science & Technology Letters | 22 | Top 0.3% | 0.9%
19 | Molecules | 37 | Top 2% | 0.8%
20 | Journal of Proteome Research | 215 | Top 2% | 0.8%
21 | Frontiers in Molecular Biosciences | 100 | Top 5% | 0.8%
22 | BMC Bioinformatics | 383 | Top 7% | 0.8%
23 | SLAS Technology | 11 | Top 0.3% | 0.7%
24 | Communications Chemistry | 39 | Top 1% | 0.7%
25 | Cancer Research Communications | 46 | Top 1% | 0.7%
26 | Computational and Structural Biotechnology Journal | 216 | Top 10% | 0.7%
27 | PROTEOMICS | 35 | Top 0.9% | 0.7%
28 | PLOS Computational Biology | 1633 | Top 27% | 0.7%
29 | Advanced Science | 249 | Top 23% | 0.5%
30 | Journal of Natural Products | 11 | Top 0.4% | 0.5%
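
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below copies the five top-ranked values from the list above; the function name `journals_needed` is hypothetical, and how the site itself computes the cutoff is not stated, so this is only one plausible reading (the first rank at which the cumulative total reaches 50%). With the listed values the running total first passes 50% at rank 5, at roughly 51.3%.

```python
from itertools import accumulate

# Predicted probabilities (%) for the five top-ranked journals, copied from the list above.
top_probabilities = [23.2, 10.4, 7.0, 7.0, 3.7]

def journals_needed(probabilities, threshold):
    """Return (count, cumulative) at the first rank where the running sum reaches threshold."""
    for count, running in enumerate(accumulate(probabilities), start=1):
        if running >= threshold:
            return count, running
    return None  # threshold never reached with the given values

count, covered = journals_needed(top_probabilities, 50.0)
print(f"{count} journals cover {covered:.1f}% of the probability mass")
```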