mdBIRCH for Fast, Scalable, Online Clustering of Molecular Dynamics Trajectories

Woody Santos, J. B.; Chen, L.; Miranda Quintana, R. A.

2026-03-19 biophysics
10.1101/2025.11.05.686879 bioRxiv
We present mdBIRCH, an online clustering method that adapts the BIRCH CF-tree to molecular dynamics (MD) data by using a merge test calibrated directly to RMSD. Each arriving frame is routed to the nearest centroid and added to that cluster only if the post-merge radius, computed from the cluster feature, remains within a user-supplied threshold. This keeps the average deviation to each cluster centroid bounded as the cluster grows and preserves a simple interpretation of resolution in physical units. We evaluate mdBIRCH on a β-heptapeptide and the HP35 system. We propose two protocols to make threshold selection easier: (a) RMSD-anchored runs that use controlled structural edits to define interpretable operating points and (b) a blind sweep that tracks how cluster count, occupancy, and coverage change with the threshold. In both systems, increasing the threshold reduces the number of clusters, concentrates coverage in high-occupancy states, and broadens within-cluster RMSD distributions. Furthermore, because decisions rely only on cluster summaries, mdBIRCH completely avoids the need for pairwise distance matrices, scales near-linearly with the number of frames on standard hardware, and naturally supports incremental operation. The method offers a practical combination of speed and interpretability for large-scale trajectory analysis.
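
The merge test described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a flat list of clusters instead of a CF-tree, assumes frames are already aligned and flattened to vectors of length 3 × n_atoms, and assumes the radius is expressed in RMSD units by dividing the mean squared deviation by the number of atoms. The names `ClusterFeature`, `radius_after_adding`, and `cluster_stream` are hypothetical.

```python
import numpy as np


class ClusterFeature:
    """BIRCH-style summary: count N, linear sum LS, sum of squared norms SS."""

    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)
        self.ss = 0.0

    def centroid(self):
        return self.ls / self.n

    def radius_after_adding(self, x, n_atoms):
        # Radius the cluster would have if x were merged, computed from the
        # summary alone: R^2 = SS/N - ||LS/N||^2.
        # Dividing by n_atoms (an assumption of this sketch) puts the radius
        # in RMSD-like units for pre-aligned, flattened coordinates.
        n = self.n + 1
        ls = self.ls + x
        ss = self.ss + float(x @ x)
        msd = ss / n - float(ls @ ls) / n**2
        return np.sqrt(max(msd, 0.0) / n_atoms)

    def add(self, x):
        self.n += 1
        self.ls = self.ls + x
        self.ss += float(x @ x)


def cluster_stream(frames, threshold, n_atoms):
    """Assign each frame online; no pairwise distance matrix is built."""
    clusters, labels = [], []
    for frame in frames:
        x = np.asarray(frame, dtype=float).ravel()
        if clusters:
            # Route the frame to the nearest existing centroid.
            dists = [np.linalg.norm(x - cf.centroid()) for cf in clusters]
            k = int(np.argmin(dists))
            # Merge only if the post-merge radius stays within the threshold.
            if clusters[k].radius_after_adding(x, n_atoms) <= threshold:
                clusters[k].add(x)
                labels.append(k)
                continue
        # Otherwise the frame seeds a new cluster.
        cf = ClusterFeature(x.size)
        cf.add(x)
        clusters.append(cf)
        labels.append(len(clusters) - 1)
    return clusters, labels
```

Calling `cluster_stream(frames, threshold, n_atoms)` on an iterable of aligned frames returns the cluster summaries and one label per frame. Because each decision touches only the per-cluster summaries, memory grows with the number of clusters rather than with the number of frames; a real CF-tree would additionally replace the linear scan over centroids with a tree descent.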

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Methods (336 papers in training set): Top 0.2%, probability 22.9%
2. Nature Communications (4913 papers in training set): Top 28%, probability 6.5%
3. Bioinformatics (1061 papers in training set): Top 4%, probability 6.5%
4. Journal of Chemical Information and Modeling (207 papers in training set): Top 0.8%, probability 6.4%
5. PLOS ONE (4510 papers in training set): Top 31%, probability 4.9%
6. Proceedings of the National Academy of Sciences (2130 papers in training set): Top 16%, probability 4.4%

50% of probability mass above

7. Nature Computational Science (50 papers in training set): Top 0.1%, probability 3.6%
8. PLOS Computational Biology (1633 papers in training set): Top 9%, probability 3.6%
9. Structure (175 papers in training set): Top 0.8%, probability 3.6%
10. Acta Crystallographica Section D Structural Biology (54 papers in training set): Top 0.1%, probability 2.9%
11. IUCrJ (29 papers in training set): Top 0.1%, probability 2.8%
12. Nature Biotechnology (147 papers in training set): Top 3%, probability 2.6%
13. Scientific Reports (3102 papers in training set): Top 49%, probability 2.1%
14. Nucleic Acids Research (1128 papers in training set): Top 9%, probability 1.9%
15. eLife (5422 papers in training set): Top 41%, probability 1.7%
16. Journal of Molecular Biology (217 papers in training set): Top 2%, probability 1.7%
17. Bioinformatics Advances (184 papers in training set): Top 3%, probability 1.7%
18. Frontiers in Molecular Biosciences (100 papers in training set): Top 2%, probability 1.4%
19. Biophysical Journal (545 papers in training set): Top 3%, probability 1.4%
20. Journal of Structural Biology (58 papers in training set): Top 0.9%, probability 1.4%
21. Communications Biology (886 papers in training set): Top 18%, probability 0.9%
22. Journal of Chemical Theory and Computation (126 papers in training set): Top 0.8%, probability 0.8%
23. Journal of Computational Chemistry (11 papers in training set): Top 0.2%, probability 0.7%
24. BMC Bioinformatics (383 papers in training set): Top 7%, probability 0.7%
25. Chemical Science (71 papers in training set): Top 2%, probability 0.7%
26. Cell Systems (167 papers in training set): Top 13%, probability 0.7%
27. The Journal of Physical Chemistry B (158 papers in training set): Top 2%, probability 0.5%
28. iScience (1063 papers in training set): Top 40%, probability 0.5%
29. Genome Research (409 papers in training set): Top 5%, probability 0.5%
30. Nature Protocols (30 papers in training set): Top 0.4%, probability 0.5%