
A stable, hierarchical LIN code system for Campylobacter jejuni and Campylobacter coli: A unified genomic nomenclature for lineage-level typing and global surveillance.

Parfitt, K. M.; Pascoe, B.; Jolley, K. A.; Douglas, A.; Goforth, M. P.; Sheppard, S. K.; Maiden, M. C. J.; Colles, F. M.

2026-02-10 microbiology
10.64898/2026.02.10.705007 bioRxiv

Campylobacter remains the leading cause of bacterial gastroenteritis worldwide, with C. jejuni accounting for around 90% of infections and C. coli accounting for most of the rest. Seven-locus multilocus sequence typing (MLST) has improved our understanding of host association and population structure, whilst core genome MLST (cgMLST) enables investigation of transmission events at high resolution. However, the lack of a stable and standardised nomenclature for clustering of cgMLST data has limited reproducibility and long-term comparability between studies. Here we introduce a joint, hierarchical Life Identification Number (LIN) code system that provides reproducible, multi-level genomic identifiers for C. jejuni and C. coli lineages. Using an updated cgMLST v2 scheme (1,142 loci) and globally representative datasets of high-quality genomes selected from over 53,000 assemblies in the Campylobacter PubMLST database (https://pubmlst.org/organisms/campylobacter-jejunicoli), we first defined LIN codes on a dataset of 5,664 genomes. Pairwise allelic distances were computed using MSTclust, and 18 nested thresholds were defined through silhouette, adjusted Wallace and adjusted Rand Index (ARI) statistics to capture the population structure from species-level to outbreak-level resolution. The LIN thresholds were then validated using a second dataset of 1,781 genomes from PubMLST and applied to a large water-associated outbreak dataset from New Zealand in 2016, containing clinical and ecological genomes. Further application of LIN codes was demonstrated by analyses of the C. jejuni ST-21 clonal complex and ST-6175 isolates, as well as the broader population structure of C. coli, using data from PubMLST. Across all datasets, LIN clusters were stable, largely monophyletic, and back-compatible with existing nomenclature, accurately distinguishing host-adapted and outbreak-associated lineages.
By embedding cgMLST data within a stable and scalable nomenclature, the Campylobacter LIN system delivers consistent, automated genome-to-lineage assignment. This unified framework bridges population genetics and applied surveillance, enabling robust, real-time comparison of Campylobacter isolates across sources, studies, and time.

Impact statement

Human cases of Campylobacter worldwide continue unabated. Tracing the source of Campylobacter infection is particularly challenging given the sporadic or multi-source nature of outbreaks, with potential transmission from foodborne, animal or environmental sources. Seven-locus MLST has greatly improved our broad understanding of Campylobacter population structure. However, whilst high-resolution cgMLST alleles and STs themselves do not change, longitudinal cluster analyses of cgMLST data have lacked a stable nomenclature, rendering them unsuitable for robust and comparable surveillance over time. Life Identification Number (LIN) codes provide a solution to this problem, establishing an automated and scalable nomenclature derived directly from cgMLST profiles that is stable over time. We have implemented a joint C. jejuni and C. coli LIN code scheme in PubMLST, with scripts for real-time lineage assignment. LIN codes are back-compatible with existing MLST nomenclature, and we demonstrate their added practical value for exploring population structure and high-resolution outbreak investigation. LIN codes support surveillance of Campylobacter in a One Health context, by enabling consistent typing at multiple levels across different sources, laboratories and time.

Data summary

1. The isolate collections used to develop the LIN codes are publicly available and searchable as individual projects on the PubMLST database (https://pubmlst.org).
   - LIN code development (Dataset 1) (n=5,664 isolates, up to 200 isolates per clonal complex)
   - LIN code validation (Dataset 2) (n=1,781 isolates, up to 50 isolates per clonal complex)
   - Outbreak investigation (Dataset 3): New Zealand 2016 Havelock North waterborne outbreak, Gilpin et al. (n=161 isolates) [1]
   - Population structure exploration (clonal complex) (Dataset 4): ST-21 complex (n=1,800 isolates, up to 100 isolates randomly selected from each country)
   - Population structure exploration (sequence type) (Dataset 5): ST-6175 (n=321 isolates, genomes with good cgMLST v2 annotation)
2. The software for LIN code development is publicly available as follows:
   - MSTclust for pairwise distance matrices (https://gitlab.pasteur.fr/GIPhy/MSTclust) [2]
   - Python script to define LIN codes in a local dataset (https://gitlab.pasteur.fr/BEBP/LINcoding)
   - BIGSdb Perl script to define LIN codes from cgMLST profiles on the PubMLST database (https://github.com/kjolley/BIGSdb/blob/develop/scripts/maintenance/lincodes.pl)
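
The abstract describes LIN codes as built from pairwise allelic distances between cgMLST profiles and a series of nested thresholds. As a rough illustration of that idea only, the Python sketch below computes the proportion of mismatched loci between two profiles and counts how many nested thresholds that distance satisfies. It is not the MSTclust, LINcoding, or BIGSdb implementation: the function names, the handling of missing loci, the toy profiles, and the four threshold values are all assumptions made for this example, and the 18 thresholds of the actual scheme are not reproduced here.

```python
from typing import Optional, Sequence


def allelic_distance(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Proportion of mismatched loci among loci called in both profiles.

    Assumption: a locus that is missing (None) in either profile is ignored.
    """
    compared = 0
    mismatched = 0
    for allele_a, allele_b in zip(a, b):
        if allele_a is None or allele_b is None:
            continue
        compared += 1
        if allele_a != allele_b:
            mismatched += 1
    if compared == 0:
        raise ValueError("no loci are called in both profiles")
    return mismatched / compared


def shared_levels(distance: float, thresholds: Sequence[float]) -> int:
    """Count the leading nested thresholds satisfied by a pairwise distance.

    Thresholds are ordered from loosest (broad lineage) to strictest
    (near-identical isolates); counting stops at the first one not met.
    """
    levels = 0
    for threshold in thresholds:
        if distance > threshold:
            break
        levels += 1
    return levels


# Hypothetical thresholds (proportion of mismatched loci), not those of the paper.
EXAMPLE_THRESHOLDS = [0.9, 0.5, 0.1, 0.0]

# Toy profiles with 9 loci; one locus is missing in the first profile.
profile_a = [1, 2, 3, 4, None, 6, 7, 8, 9]
profile_b = [1, 2, 3, 5, 5, 6, 7, 8, 9]

distance = allelic_distance(profile_a, profile_b)
levels = shared_levels(distance, EXAMPLE_THRESHOLDS)
print(f"distance = {distance}, thresholds satisfied = {levels}")
```

In this simplified pairwise view, two isolates whose distance satisfies more thresholds would be expected to share a longer LIN code prefix. Actual LIN code assignment is incremental, with each new genome coded relative to genomes that already have codes, so the sketch conveys the intuition rather than the assignment procedure.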

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

1. Microbial Genomics: 204 papers in training set; Top 0.1%; 48.6%
2. Nature Communications: 4913 papers in training set; Top 17%; 10.3%
(50% of probability mass above)
3. Genome Medicine: 154 papers in training set; Top 0.9%; 6.4%
4. Journal of Clinical Microbiology: 120 papers in training set; Top 0.4%; 4.9%
5. Nature Microbiology: 133 papers in training set; Top 2%; 1.9%
6. mSphere: 281 papers in training set; Top 3%; 1.7%
7. PLOS ONE: 4510 papers in training set; Top 56%; 1.5%
8. mSystems: 361 papers in training set; Top 5%; 1.4%
9. mBio: 750 papers in training set; Top 9%; 1.2%
10. Scientific Reports: 3102 papers in training set; Top 66%; 1.2%
11. Nucleic Acids Research: 1128 papers in training set; Top 14%; 1.1%
12. eLife: 5422 papers in training set; Top 53%; 0.9%
13. PLOS Computational Biology: 1633 papers in training set; Top 23%; 0.8%
14. Journal of Infection: 71 papers in training set; Top 2%; 0.8%
15. The Lancet Microbe: 43 papers in training set; Top 1%; 0.8%
16. International Journal of Food Microbiology: 11 papers in training set; Top 0.6%; 0.7%
17. GigaScience: 172 papers in training set; Top 3%; 0.7%
18. Microbiome: 139 papers in training set; Top 3%; 0.7%
19. Genome Research: 409 papers in training set; Top 5%; 0.7%
20. Microbiology Resource Announcements: 22 papers in training set; Top 1%; 0.7%
21. Frontiers in Cellular and Infection Microbiology: 98 papers in training set; Top 7%; 0.5%
22. Genome Biology: 555 papers in training set; Top 9%; 0.5%
23. PeerJ: 261 papers in training set; Top 19%; 0.5%