Generalization of the minimum covariance determinant algorithm for categorical and mixed data types

Beaton, D.; Sunderland, K. M.; ADNI; Levine, B.; Mandzia, J.; Masellis, M.; Swartz, R. H.; Troyer, A. K.; ONDRI; Binns, M. A.; Abdi, H.; Strother, S. C.

2020-03-31 bioinformatics
10.1101/333005 bioRxiv

The minimum covariance determinant (MCD) algorithm is one of the most common techniques to detect anomalous or outlying observations. The MCD algorithm depends on two features of multivariate data: the determinant of a matrix (i.e., geometric mean of the eigenvalues) and Mahalanobis distances (MD). While the MCD algorithm is commonly used and has many extensions, it is limited to analyses of quantitative data, and more specifically to data assumed to be continuous. One reason why the MCD does not extend to other data types, such as categorical or ordinal data, is that there is no well-defined MD for data types other than continuous data. To address the lack of MCD-like techniques for categorical or mixed data, we present a generalization of the MCD. To do so, we rely on a multivariate technique called correspondence analysis (CA). Through CA we can define MD via singular vectors and also compute the determinant from CA's eigenvalues. Here we define and illustrate a generalized MCD on categorical data and then show how our generalized MCD extends beyond categorical data to accommodate mixed data types (e.g., categorical, ordinal, and continuous). We illustrate this generalized MCD on data from two large-scale projects: the Ontario Neurodegenerative Disease Research Initiative (ONDRI) and the Alzheimer's Disease Neuroimaging Initiative (ADNI), with genetics (categorical), clinical instruments and surveys (categorical or ordinal), and neuroimaging (continuous) data. We also make R code and toy data available in order to illustrate our generalized MCD.
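To make the two ingredients named in the abstract concrete, here is a minimal sketch of the classical MCD on continuous data. It is not the authors' CA-based generalization and not their R code. It assumes two-dimensional toy data invented for illustration, and it swaps in an exhaustive search over all subsets of size h, where practical implementations would use a faster approximate search. The function names are hypothetical. For a covariance matrix the determinant equals the product of the eigenvalues, so minimizing it also minimizes their geometric mean.

```python
from itertools import combinations


def mean_and_cov(points):
    """Mean and sample covariance (var_x, cov_xy, var_y) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def det2(cov):
    """Determinant of a symmetric 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2


def squared_md(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det2(cov)


def brute_force_mcd(points, h):
    """Exhaustive search for the h-subset with the smallest covariance determinant.

    Only feasible for tiny data sets; assumed here purely for illustration.
    """
    best = min(
        combinations(range(len(points)), h),
        key=lambda idx: det2(mean_and_cov([points[i] for i in idx])[1]),
    )
    mean, cov = mean_and_cov([points[i] for i in best])
    return best, mean, cov


# Invented toy data: seven clustered observations and one obvious outlier.
toy = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.3),
       (0.8, 0.7), (1.3, 1.2), (1.0, 0.8), (6.0, -4.0)]
h = 6

subset, center, cov = brute_force_mcd(toy, h)

# Robust distances: every observation is measured against the MCD subset's
# center and covariance, so an outlier cannot mask itself.
distances = [squared_md(p, center, cov) for p in toy]
outlier = max(range(len(toy)), key=lambda i: distances[i])
print("MCD subset:", subset)
print("Most outlying observation:", outlier)
```

The sketch depends on having a covariance matrix and a Mahalanobis distance, which is exactly what the abstract says is missing for categorical and ordinal data; the generalization replaces both with quantities derived from correspondence analysis.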

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set, Top 1%, 22.5%
2. PLOS ONE: 4510 papers in training set, Top 13%, 14.4%
3. PLOS Computational Biology: 1633 papers in training set, Top 3%, 10.1%
4. BMC Bioinformatics: 383 papers in training set, Top 2%, 6.4%

50% of probability mass above

5. Scientific Reports: 3102 papers in training set, Top 24%, 4.9%
6. Biostatistics: 21 papers in training set, Top 0.1%, 4.9%
7. Statistics in Medicine: 34 papers in training set, Top 0.1%, 4.3%
8. Bioinformatics Advances: 184 papers in training set, Top 1%, 3.6%
9. Methods in Ecology and Evolution: 160 papers in training set, Top 1%, 1.7%
10. BioData Mining: 15 papers in training set, Top 0.3%, 1.7%
11. Nature Communications: 4913 papers in training set, Top 55%, 1.3%
12. Frontiers in Genetics: 197 papers in training set, Top 7%, 1.2%
13. The Annals of Applied Statistics: 15 papers in training set, Top 0.1%, 1.2%
14. Biometrics: 22 papers in training set, Top 0.1%, 1.2%
15. Journal of Open Source Software: 22 papers in training set, Top 0.1%, 1.2%
16. Genetic Epidemiology: 46 papers in training set, Top 0.6%, 1.1%
17. PLOS Genetics: 756 papers in training set, Top 12%, 1.0%
18. Journal of the American Medical Informatics Association: 61 papers in training set, Top 2%, 0.9%
19. Journal of Computational Biology: 37 papers in training set, Top 0.5%, 0.8%
20. Communications Biology: 886 papers in training set, Top 24%, 0.7%
21. NeuroImage: 813 papers in training set, Top 6%, 0.7%
22. Imaging Neuroscience: 242 papers in training set, Top 3%, 0.7%
23. NAR Genomics and Bioinformatics: 214 papers in training set, Top 4%, 0.6%
24. Frontiers in Neuroscience: 223 papers in training set, Top 9%, 0.6%
25. BMC Genomics: 328 papers in training set, Top 7%, 0.6%