
Facility-Scale Workflows for Data Acquisition, Standardization, Machine Learning Analysis, and Reproducible Science

Madugula, S. S.; Brown, S. R.; Bible, A. N.; Solsona, R. M.; Checa, M.; Massenburg, L.; Williams, A. N.; Collins, L.; Harris, S. B.; Morrell-Falvey, J.; Retterer, S. T.; Vasudevan, R. K.

2026-05-11 microbiology
10.64898/2026.05.06.723241 bioRxiv

Scientific user facilities routinely generate large-scale microscopy datasets across diverse instruments and vendors, differing substantially in file formats, dimensionality, and resolution. Beyond these inconsistencies, datasets are frequently fragmented, scattered across isolated instruments and constrained by security policies and uneven metadata practices. Consequently, tracking, standardizing, processing, and visualizing these datasets in a manner compatible with modern machine learning and autonomous experimentation workflows remains a major challenge. While existing initiatives address data archiving, standardization, or analysis individually, few provide integrated solutions that bridge instrument-level acquisition and scalable ML workflows within heterogeneous, security-constrained user facilities. Here, we establish a deployable, facility-scale infrastructure that bridges instrument-level data generation with cloud-based ML analytics while remaining compliant with institutional network constraints. Our framework integrates on-premises cloud computing, the in-house Pycroscopy ecosystem, and an open-source metadata management platform to transform heterogeneous microscopy datasets into standardized, ML-ready representations. We demonstrate this approach across distinct microscopy modalities through end-to-end workflows encompassing metadata capture, format harmonization, automated database ingestion, segmentation-based ML inference, and interactive visualization. By structurally separating acquisition from cloud-based analysis services, the framework enables scalable model deployment and iterative refinement without direct connectivity to instrument computers. Together, this work provides a reproducible blueprint for facility-scale data and AI infrastructure, enabling ML-ready analytics, metadata traceability, and future autonomous experimentation workflows in microscopy-driven research.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Nature Methods: 336 papers in training set; Top 0.4%; 18.4%
2. Optica: 25 papers in training set; Top 0.1%; 12.2%
3. Journal of Microscopy: 18 papers in training set; Top 0.1%; 12.2%
4. Light: Science & Applications: 16 papers in training set; Top 0.1%; 6.3%
5. PLOS ONE: 4510 papers in training set; Top 34%; 4.3%
(50% of probability mass above)
6. Nature Communications: 4913 papers in training set; Top 38%; 3.8%
7. ACS Photonics: 13 papers in training set; Top 0.1%; 3.6%
8. Journal of Visualized Experiments: 30 papers in training set; Top 0.1%; 3.2%
9. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 25%; 2.7%
10. eLife: 5422 papers in training set; Top 38%; 1.9%
11. Communications Biology: 886 papers in training set; Top 7%; 1.8%
12. Biomedical Optics Express: 84 papers in training set; Top 0.7%; 1.7%
13. Cell Systems: 167 papers in training set; Top 7%; 1.7%
14. GigaScience: 172 papers in training set; Top 2%; 1.5%
15. Scientific Reports: 3102 papers in training set; Top 64%; 1.3%
16. Bioinformatics: 1061 papers in training set; Top 8%; 1.2%
17. mBio: 750 papers in training set; Top 10%; 0.9%
18. Computational and Structural Biotechnology Journal: 216 papers in training set; Top 8%; 0.8%
19. PLOS Computational Biology: 1633 papers in training set; Top 25%; 0.7%
20. Nucleic Acids Research: 1128 papers in training set; Top 19%; 0.7%
21. Scientific Data: 174 papers in training set; Top 3%; 0.7%
22. Nature Biotechnology: 147 papers in training set; Top 8%; 0.7%
23. Advanced Science: 249 papers in training set; Top 22%; 0.6%
24. Nature: 575 papers in training set; Top 17%; 0.6%
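
The 50% marker in the list above can be read as a cumulative-sum cutoff: walk down the ranked probabilities and stop at the first journal where the running total reaches the threshold. The sketch below illustrates that reading with the ten highest listed values. The function name `journals_to_reach` and the threshold parameter are hypothetical, and the page does not state how the cutoff is actually computed, so this is only one plausible interpretation.

```python
# Predicted probabilities (in percent) for the ten highest-ranked journals,
# copied from the list above.
probabilities = [18.4, 12.2, 12.2, 6.3, 4.3, 3.8, 3.6, 3.2, 2.7, 1.9]


def journals_to_reach(probs_percent, threshold_percent=50.0):
    """Return (count, cumulative) for the shortest ranked prefix whose
    probabilities sum to at least threshold_percent.

    Hypothetical helper: returns (None, total) if the threshold is never reached.
    """
    cumulative = 0.0
    for count, p in enumerate(probs_percent, start=1):
        cumulative += p
        if cumulative >= threshold_percent:
            return count, cumulative
    return None, cumulative


count, cumulative = journals_to_reach(probabilities)
print(count, round(cumulative, 1))
```

With the listed values, the running total is 49.1% after four journals and 53.4% after five, which agrees with the statement that the top 5 journals cover the 50% mark.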