Pelagibacter, resolved
Nielsen, T. N.; Lui, L. M.
Pelagibacter, the largest genus within the SAR11 clade, is the most abundant bacterium in the ocean, yet the vast majority of its species-level diversity remains uncharacterized at the genomic level. Here we present 135 complete Pelagibacter genomes -- the largest such collection assembled to date -- comprising 75 from Oxford Nanopore metagenomes of the San Francisco Estuary (SFE), 31 from a deeply sequenced station within the same transect, and 29 from public databases. These genomes define 52 species at 95% ANI, of which 44 (85%) are taxonomically novel. An expanded phylogeny incorporating 89 additional high-quality NCBI genomes confirms that our collection captures the phylogenetic backbone of the genus, with genomes from Hawaii, Namibia, and the Sargasso Sea nesting within SFE clades. The pangenome is open (14,862 singletons, 62%), driven by two distinct mechanisms. First, a universal hypervariable region (HVR) at a conserved chromosomal position (7-15% from dnaA) is present in all 135 genomes, anchored by tRNA genes at both boundaries (Phe/His and Arg). The HVR carries genome-specific surface polysaccharide biosynthesis genes with a GC age gradient -- highest GC at the tRNA boundaries, lowest in the center -- consistent with a two-ended phage insertion model. Only this HVR is positionally conserved across the genus; the three other hypervariable regions previously described in a single reference genome are not. Second, scattered genomic islands throughout the chromosome contribute the remaining singleton content, including chimeric islands with genes from four bacterial phyla. Biosynthetic pathway reconstruction reveals auxotrophies that are phylogenetically structured, not uniform: biotin, reduced sulfur, and glycine are genus-wide dependencies, while isoleucine, pantothenate, histidine, and glyoxylate cycle capacity vary across lineages with significant phylogenetic clustering. 
Structural annotation with ESMFold and Foldseek resolved 3,125 hypothetical proteins; 1,222 remain uncharacterized by any method, including a 47-amino-acid protein conserved in two-thirds of all genomes within a fixed operonic context -- independently predicted by two gene callers yet matching nothing in any database. A controlled depth comparison at one station demonstrates that standard metagenome sequencing systematically underestimates Pelagibacter diversity, with three species recovered only at elevated depth and the species count at that station more than doubling (9 vs 4).
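The HVR's position is reported as a percentage of chromosome length measured from dnaA (7-15%). A minimal sketch of that normalization on a circular chromosome follows. It assumes the distance is measured in the forward direction from the dnaA start coordinate; the abstract does not state the direction or the reference point within dnaA. The function names and example coordinates are hypothetical and are not taken from the authors' pipeline.

```python
def percent_from_dnaa(coord: int, dnaa_start: int, genome_length: int) -> float:
    """Position of `coord` as a percentage of chromosome length,
    measured forward from `dnaa_start` on a circular chromosome.

    Assumption: forward-strand direction; a reverse-oriented dnaA
    would need the offset taken in the opposite direction.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    # The modulo handles features that lie "before" dnaA in the
    # linearized assembly by wrapping around the circle.
    offset = (coord - dnaa_start) % genome_length
    return 100.0 * offset / genome_length


def in_hvr_window(pct: float, low: float = 7.0, high: float = 15.0) -> bool:
    """True if a percentage falls in the 7-15% window from the abstract."""
    return low <= pct <= high


if __name__ == "__main__":
    # Illustrative coordinates only, not real Pelagibacter data.
    pct = percent_from_dnaa(coord=150_000, dnaa_start=50_000, genome_length=1_000_000)
    print(f"{pct:.1f}% from dnaA, in window: {in_hvr_window(pct)}")
```

Expressing positions as a fraction of genome length, not as absolute coordinates, makes genomes of different sizes and different assembly start points comparable.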