A 37-million-particle dataset from over 250 experiments to accelerate data-driven cryo-EM analysis

Zamanos, A.; Kyrilis, F. L.; Koromilas, P.; Kastritis, P. L.; Panagakis, Y.

2026-05-03 bioinformatics
10.64898/2026.04.29.720997 bioRxiv

Cryogenic Electron Microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-resolution structure determination of biological macromolecules. Central to cryo-EM analysis are particles, namely 2D projections of biomolecules extracted from micrographs, which serve as the primary input for 3D reconstruction. While data-driven methods have transformed other scientific domains, their impact on cryo-EM remains limited because existing particle datasets are too small, too narrow in protein diversity, and lack rich per-particle annotations. We introduce cryoPANDA (cryo-EM Particles ANnotated DAtaset), comprising over 37 million annotated particles from 252 experiments spanning a wide range of protein types, more than 10-fold larger than prior collections. Each particle is accompanied by detailed annotations covering acquisition, classification, and reconstruction metadata, alongside the corresponding 3D electrostatic potential map, the published EMDB map, and, where available, the PDB model. We validate cryoPANDA in two ways: first, by reconstructing hundreds of distinct high-resolution cryo-EM maps; and second, by training a DINOv2 foundation model and evaluating its learned representations on micrograph segmentation, particle picking, and particle clustering.
