A real data-driven simulation strategy to select an imputation method for mixed-type trait data

May, J. A.; Feng, Z.; Adamowicz, S. J.

2022-10-27 bioinformatics
10.1101/2022.05.03.490388 bioRxiv
Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly complete information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missing mechanism by determining the mean squared error and proportion falsely classified rates for numerical and categorical traits, respectively. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data as phylogeny did not always improve performance for every trait and in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.

Author summary

The issue of missing data is problematic in trait datasets as the missingness pattern may not be entirely random. Whether data are missing may depend on other known observations in the dataset, or on the value of the missing data points themselves. When only complete cases are used in an analysis, derived results may be biased. Imputation is an alternative to complete-case analysis and entails filling in the missing values using information provided by other trait values present in the dataset. Including phylogenetic information in the imputation process can improve the accuracy of imputed values, though results are dependent on the amount and pattern of missingness. Most previous evaluations of imputation methods for trait datasets are limited to numerical simulated data, with categorical traits not considered. Given a particular dataset, we propose the use of a real data-driven simulation strategy to select an imputation method. We evaluated the accuracies of four different imputation methods, with and without phylogeny information, and under different simulated missingness patterns using an example reptile trait dataset. Results indicated that data imputed using the best-performing method better reflected the original dataset characteristics compared to complete-case data. As imputation performance varies depending on the properties of a given dataset, a real data-driven simulation strategy can be used to provide guidance on best imputation practices.
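
As a rough illustration of the simulation strategy described above (remove known values from a complete dataset under MCAR, MAR, or MNAR, impute them, then score the imputations), the following is a minimal Python sketch. It is not the authors' code. The toy rows, the trait names (`length`, `mass`, `habitat`), the 20% removal rate, and the median-cutoff rule used to make missingness depend on a trait are all assumptions made for illustration. Only mean/mode imputation, the simplest of the candidate methods named above, is shown; k-nearest neighbour, random forests, MICE, and phylogenetic information are not implemented here.

```python
import random
from statistics import mean, median, mode


def induce_missing(rows, trait, mechanism, rate, driver=None, seed=0):
    """Return (masked copy of rows, sorted indices whose `trait` was removed).

    Simplified, assumed rules for choosing which rows may lose a value:
      MCAR: any row.
      MAR:  rows whose observed numeric `driver` trait is at or above its median.
      MNAR: rows whose own numeric `trait` value is at or above its median.
    """
    rng = random.Random(seed)
    n_remove = round(rate * len(rows))
    if mechanism == "MCAR":
        candidates = list(range(len(rows)))
    elif mechanism == "MAR":
        cut = median(r[driver] for r in rows)
        candidates = [i for i, r in enumerate(rows) if r[driver] >= cut]
    elif mechanism == "MNAR":
        cut = median(r[trait] for r in rows)
        candidates = [i for i, r in enumerate(rows) if r[trait] >= cut]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    removed = set(rng.sample(candidates, min(n_remove, len(candidates))))
    masked = [dict(r, **{trait: None}) if i in removed else dict(r)
              for i, r in enumerate(rows)]
    return masked, sorted(removed)


def impute_mean_mode(masked, trait, categorical=False):
    """Fill missing values with the observed mode (categorical) or mean (numeric)."""
    observed = [r[trait] for r in masked if r[trait] is not None]
    fill = mode(observed) if categorical else mean(observed)
    return [dict(r, **{trait: fill}) if r[trait] is None else dict(r)
            for r in masked]


def mse(truth, imputed, trait, removed):
    """Mean squared error over the removed cells of a numerical trait."""
    return sum((truth[i][trait] - imputed[i][trait]) ** 2 for i in removed) / len(removed)


def pfc(truth, imputed, trait, removed):
    """Proportion falsely classified over the removed cells of a categorical trait."""
    return sum(truth[i][trait] != imputed[i][trait] for i in removed) / len(removed)


# Hypothetical complete-case data: made-up values, not taken from the squamate dataset.
rows = [{"length": float(10 + i),
         "mass": float(2 * i + 1),
         "habitat": "ground" if i % 3 else "tree"}
        for i in range(20)]

# Numerical trait: compare error under each missingness mechanism.
for mech in ("MCAR", "MAR", "MNAR"):
    masked, removed = induce_missing(rows, "mass", mech, rate=0.2, driver="length")
    filled = impute_mean_mode(masked, "mass")
    print(mech, "MSE for mass:", round(mse(rows, filled, "mass", removed), 2))

# Categorical trait: proportion falsely classified under MCAR.
masked_h, removed_h = induce_missing(rows, "habitat", "MCAR", rate=0.2)
filled_h = impute_mean_mode(masked_h, "habitat", categorical=True)
print("MCAR PFC for habitat:", pfc(rows, filled_h, "habitat", removed_h))
```

Because the true values of the removed cells are known, each candidate method can be scored the same way and the one with the lowest error for the traits of interest selected. In practice the simulation would be repeated over many random seeds and missingness rates rather than run once as here.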

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1 | PeerJ | 261 papers in training set | Top 0.1% | 19.2%
2 | Methods in Ecology and Evolution | 160 papers in training set | Top 0.3% | 12.7%
3 | PLOS ONE | 4510 papers in training set | Top 18% | 10.3%
4 | Peer Community Journal | 254 papers in training set | Top 0.6% | 4.4%
5 | Ecological Informatics | 29 papers in training set | Top 0.2% | 3.8%
50% of probability mass above
6 | Scientific Reports | 3102 papers in training set | Top 33% | 3.8%
7 | Molecular Ecology Resources | 161 papers in training set | Top 0.3% | 3.7%
8 | BMC Bioinformatics | 383 papers in training set | Top 3% | 3.7%
9 | Frontiers in Genetics | 197 papers in training set | Top 3% | 2.1%
10 | Ecology and Evolution | 232 papers in training set | Top 2% | 2.1%
11 | Biology Methods and Protocols | 53 papers in training set | Top 0.6% | 2.1%
12 | Royal Society Open Science | 193 papers in training set | Top 2% | 1.9%
13 | Gigabyte | 60 papers in training set | Top 0.6% | 1.7%
14 | PLOS Computational Biology | 1633 papers in training set | Top 16% | 1.7%
15 | BMC Genomics | 328 papers in training set | Top 3% | 1.4%
16 | Bioinformatics Advances | 184 papers in training set | Top 4% | 1.3%
17 | GigaScience | 172 papers in training set | Top 2% | 1.3%
18 | Systematic Biology | 121 papers in training set | Top 0.3% | 1.0%
19 | BMC Medical Research Methodology | 43 papers in training set | Top 1% | 0.9%
20 | Ecography | 50 papers in training set | Top 1% | 0.8%
21 | Systematic Entomology | 11 papers in training set | Top 0.1% | 0.7%
22 | Bioinformatics | 1061 papers in training set | Top 10% | 0.7%
23 | Journal of Computational Biology | 37 papers in training set | Top 0.7% | 0.7%
24 | Frontiers in Bioinformatics | 45 papers in training set | Top 1% | 0.7%
25 | Genetics Selection Evolution | 33 papers in training set | Top 0.2% | 0.7%
26 | G3: Genes, Genomes, Genetics | 222 papers in training set | Top 1% | 0.5%
27 | Heredity | 53 papers in training set | Top 0.4% | 0.5%
28 | BMC Research Notes | 29 papers in training set | Top 0.9% | 0.5%