Methodological pitfalls in plant pangenome gene family identification may lead to biased evolutionary inferences

Liu, S.; Zhang, W.; Yu, P.

2026-05-18 genomics
10.64898/2026.05.15.725319 bioRxiv

Pangenome-level gene family identification often applies sequence similarity clustering without phylogenetic or synteny information, which risks biologically misleading evolutionary inferences. Using five transcription factor families (bHLH, MYB, NAC, WRKY, MADS-box) across 401 rice pangenome accessions, we compared five clustering strategies: OrthoFinder alone, CD-HIT alone, MMseqs2 alone, and OrthoFinder-informed refinement by either CD-HIT or MMseqs2. Methods based solely on sequence similarity merged distinct orthogroups and generated fewer orthogroups than approaches incorporating graph-based orthology. The proportion of conflicting cluster assignments, measured against OrthoFinder, varied strongly among families, from approximately 14% in MADS-box to approximately 57% in MYB, and was associated with protein length differences. Core, shell, and cloud gene classifications shifted substantially depending on the method, especially in the MYB, NAC, and WRKY families. Critically, Ka/Ks distributions for core genes were highly method-sensitive, with orthology-aware methods yielding more convergent and less variable estimates of selective pressure, whereas noncore gene estimates remained robust. These findings demonstrate that neglecting graph-based orthogroup inference inflates methodological artifacts. We recommend a two-step strategy: initial graph-based orthogroup delineation followed by sequence similarity refinement to balance evolutionary accuracy and resolution in pangenome-scale gene family studies.
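
To illustrate why merged orthogroups can shift core, shell, and cloud classifications, here is a minimal Python sketch on toy data. It is not the authors' pipeline: the function name `classify_families`, the `cloud_max_fraction` threshold, and the four-accession example are all hypothetical, and the abstract does not state which presence thresholds the study used. The sketch assumes a family is core when every accession carries it, cloud when at most a chosen fraction of accessions carry it, and shell otherwise.

```python
from collections import defaultdict


def classify_families(gene_to_family, gene_to_accession, n_accessions,
                      cloud_max_fraction=0.05):
    """Label each family core, shell, or cloud by accession presence.

    Thresholds are illustrative assumptions, not values from the study:
    core  = present in every accession,
    cloud = present in at most `cloud_max_fraction` of accessions,
    shell = everything in between.
    """
    # Collect the set of accessions that carry at least one member of each family.
    accessions_per_family = defaultdict(set)
    for gene, family in gene_to_family.items():
        accessions_per_family[family].add(gene_to_accession[gene])

    labels = {}
    for family, accessions in accessions_per_family.items():
        if len(accessions) == n_accessions:
            labels[family] = "core"
        elif len(accessions) / n_accessions <= cloud_max_fraction:
            labels[family] = "cloud"
        else:
            labels[family] = "shell"
    return labels


# Toy data: four accessions, one gene each (hypothetical identifiers).
gene_to_accession = {"g1": "A", "g2": "B", "g3": "C", "g4": "D"}

# Orthology-aware clustering keeps two distinct orthogroups.
orthology_aware = {"g1": "OG1", "g2": "OG1", "g3": "OG2", "g4": "OG2"}

# Similarity-only clustering merges them into a single cluster.
similarity_only = {"g1": "C1", "g2": "C1", "g3": "C1", "g4": "C1"}

labels_aware = classify_families(orthology_aware, gene_to_accession, 4,
                                 cloud_max_fraction=0.25)
labels_merged = classify_families(similarity_only, gene_to_accession, 4,
                                  cloud_max_fraction=0.25)

print(labels_aware)   # two shell families
print(labels_merged)  # one core family
```

In this toy case, two orthogroups that are each present in half of the accessions become a single family present in all of them once merged, so the same genes move from shell to core. Any downstream statistic computed per category, such as a Ka/Ks distribution for core genes, would then be drawn from a different gene set.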

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1 · The Plant Journal · 197 papers in training set · Top 0.1% · 18.1%
2 · Applications in Plant Sciences · 21 papers in training set · Top 0.1% · 12.4%
3 · New Phytologist · 309 papers in training set · Top 0.6% · 9.8%
4 · Frontiers in Plant Science · 240 papers in training set · Top 1% · 6.6%
5 · The Plant Genome · 53 papers in training set · Top 0.1% · 6.2%
(50% of probability mass above)
6 · The Plant Cell · 141 papers in training set · Top 0.7% · 4.2%
7 · Plant Biotechnology Journal · 56 papers in training set · Top 0.3% · 3.9%
8 · Plant Direct · 81 papers in training set · Top 0.6% · 3.5%
9 · BMC Genomics · 328 papers in training set · Top 1% · 2.7%
10 · Molecular Biology and Evolution · 488 papers in training set · Top 2% · 2.7%
11 · Plant Communications · 35 papers in training set · Top 0.7% · 1.8%
12 · Methods in Ecology and Evolution · 160 papers in training set · Top 1% · 1.8%
13 · PLOS ONE · 4510 papers in training set · Top 55% · 1.7%
14 · G3 Genes|Genomes|Genetics · 351 papers in training set · Top 2% · 1.4%
15 · G3: Genes, Genomes, Genetics · 222 papers in training set · Top 0.5% · 1.4%
16 · Genome Biology · 555 papers in training set · Top 6% · 1.2%
17 · Genetics · 225 papers in training set · Top 3% · 1.2%
18 · Scientific Reports · 3102 papers in training set · Top 67% · 1.2%
19 · Plant Physiology · 217 papers in training set · Top 2% · 1.1%
20 · Genome Biology and Evolution · 280 papers in training set · Top 2% · 0.9%
21 · Nature Plants · 84 papers in training set · Top 2% · 0.8%
22 · eLife · 5422 papers in training set · Top 59% · 0.7%
23 · NAR Genomics and Bioinformatics · 214 papers in training set · Top 4% · 0.7%
24 · Nucleic Acids Research · 1128 papers in training set · Top 20% · 0.6%
25 · GENETICS · 189 papers in training set · Top 2% · 0.6%
26 · in silico Plants · 24 papers in training set · Top 0.4% · 0.6%
27 · Computational and Structural Biotechnology Journal · 216 papers in training set · Top 11% · 0.6%
28 · GigaScience · 172 papers in training set · Top 4% · 0.6%