Benchmark of Wide Range of Pairwise Distance Metrics for Automated Classification of Mouse Mutant Phenotypes from Flow Cytometry Data

May, M.; Hewitt, T.; Mashford, B.; Hammill, D.; Davies, A.; Andrews, T. D.

2025-01-06 bioinformatics
10.1101/2025.01.06.631468 bioRxiv

Precision medicine requires a comprehensive mapping of genotype to phenotype to provide patients with individually tailored treatment. However, when flow cytometry is used to identify phenotypes, such as the quantities of various immune cell populations in tissue and blood that are used to identify autoimmune disorders, it is often unclear which cellular phenotypes come from healthy and which from diseased individuals, especially once the effects of population diversity are included, because the data are high-dimensional. To identify and segregate the healthy phenotype from various disease phenotypes, we use pairwise distance metrics between each sample's cell populations. By comparing distance metrics between C57BL/6 clone mice with mutations of known phenotype, we find that cosine similarity is best suited for segregating wildtype from mutant samples while respecting minute differences in already small cell populations, and that standardised Euclidean distance is best suited for machine-learning input due to its sensitivity. Both metrics outperform the other tested metrics (including Aitchison, Euclidean, Manhattan, Earth Mover's Distance, and squared Euclidean). We demonstrate the utility of these pairwise metrics by applying them to a classification task of known mutant phenotypes, using an existing FACS phenotype dataset derived from X000 inbred C57BL/6 mice that harbour potentially phenotypic genetic variation introduced through ENU mutagenesis of individual pedigree-founding G0 male mice.
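As a rough illustration of the two metrics the abstract singles out, the sketch below computes pairwise cosine distance (one minus cosine similarity) and standardised Euclidean distance between samples described by their cell-population frequencies. It is not the authors' pipeline: the sample values, the function names, and the choice of sample variance (n - 1 denominator) for standardisation are all assumptions made for the example.

```python
import math
from statistics import variance


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def standardised_euclidean(u, v, variances):
    """Euclidean distance with each squared difference divided by that feature's variance."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(u, v, variances)))


def pairwise(samples, metric):
    """Build the full symmetric matrix of distances between all samples."""
    n = len(samples)
    return [[metric(samples[i], samples[j]) for j in range(n)] for i in range(n)]


# Hypothetical data: each row is one sample, each column the percentage
# of one gated cell population. These numbers are invented for illustration.
samples = [
    [60.0, 30.0, 10.0],  # wildtype-like
    [58.0, 31.0, 11.0],  # wildtype-like
    [30.0, 50.0, 20.0],  # mutant-like
]

# Per-population sample variance across all samples (assumed choice).
variances = [variance(column) for column in zip(*samples)]

cos_d = pairwise(samples, cosine_distance)
seu_d = pairwise(samples, lambda u, v: standardised_euclidean(u, v, variances))
```

In this toy matrix the two wildtype-like rows sit closer to each other than either does to the mutant-like row under both metrics. Cosine distance ignores overall vector magnitude, whereas the standardised Euclidean distance rescales each population by its spread, so small populations are not swamped by large ones.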

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Scientific Reports: 3102 papers in training set, Top 1%, 18.3%
2. BMC Medical Genomics: 36 papers in training set, Top 0.1%, 14.5%
3. Briefings in Bioinformatics: 326 papers in training set, Top 1.0%, 6.2%
4. Disease Models & Mechanisms: 119 papers in training set, Top 0.3%, 4.8%
5. Cell Reports Methods: 141 papers in training set, Top 0.8%, 3.6%
6. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 2%, 3.5%

50% of probability mass above

7. NAR Genomics and Bioinformatics: 214 papers in training set, Top 0.8%, 3.5%
8. Patterns: 70 papers in training set, Top 0.3%, 3.5%
9. Computers in Biology and Medicine: 120 papers in training set, Top 1.0%, 3.2%
10. Frontiers in Immunology: 586 papers in training set, Top 3%, 2.6%
11. PLOS Computational Biology: 1633 papers in training set, Top 13%, 2.3%
12. Bioinformatics: 1061 papers in training set, Top 7%, 2.0%
13. Bioinformatics Advances: 184 papers in training set, Top 2%, 2.0%
14. iScience: 1063 papers in training set, Top 10%, 2.0%
15. PLOS ONE: 4510 papers in training set, Top 51%, 1.9%
16. Genome Medicine: 154 papers in training set, Top 5%, 1.7%
17. Frontiers in Genetics: 197 papers in training set, Top 5%, 1.6%
18. Communications Biology: 886 papers in training set, Top 15%, 1.2%
19. npj Systems Biology and Applications: 99 papers in training set, Top 2%, 1.2%
20. Frontiers in Cell and Developmental Biology: 218 papers in training set, Top 6%, 1.2%
21. Nature Communications: 4913 papers in training set, Top 59%, 0.9%
22. BMC Bioinformatics: 383 papers in training set, Top 6%, 0.9%
23. Nucleic Acids Research: 1128 papers in training set, Top 16%, 0.9%
24. ImmunoInformatics: 11 papers in training set, Top 0.2%, 0.8%
25. Cell Systems: 167 papers in training set, Top 12%, 0.7%
26. BMC Genomics: 328 papers in training set, Top 6%, 0.7%
27. GigaScience: 172 papers in training set, Top 4%, 0.6%
28. Journal of Computational Biology: 37 papers in training set, Top 0.8%, 0.6%
29. IEEE Transactions on Computational Biology and Bioinformatics: 17 papers in training set, Top 0.8%, 0.6%
30. Genes: 126 papers in training set, Top 4%, 0.6%