
Comparing optimal transport and machine learning approaches for databases merging in scenarios involving missing data in covariates. Application to Medical Research

N'kam Suguem, F.; Dejean, S.; Saint-Pierre, P.; Savy, N.

2026-01-26 bioinformatics
10.64898/2026.01.23.701369 bioRxiv

Motivation: One of the challenges encountered when merging heterogeneous observational clinical datasets is the recoding of categorical target variables that may have been measured differently across data sources. Standard machine learning-based approaches, such as Multiple Imputation by Chained Equations and the k-Nearest Neighbours method, are compared with an Optimal Transport-based algorithm (OTrecod) when databases are affected by missing values in covariates or by imbalanced groups. The empirical performance of these methods in such realistic data integration settings remains underexplored.

Results: A comprehensive simulation study was conducted, varying sample size, group imbalance, signal-to-noise ratio, and missing-data mechanisms. The results show that OTrecod consistently achieves higher recoding accuracy than Multiple Imputation by Chained Equations and k-Nearest Neighbours, particularly in large, imbalanced, and weak-signal scenarios. These findings are further illustrated using subsets of the National Child Development Study, where OTrecod and Multiple Imputation by Chained Equations minimised the distributional divergence between recoded social-class scales, while k-Nearest Neighbours produced less stable results.

Availability and Implementation: The source code supporting this study is publicly available at https://github.com/FloAI/CompareOT.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association (61 papers in training set; Top 0.2%; 14.6%)
2. Bioinformatics (1061 papers in training set; Top 2%; 14.6%)
3. BMC Medical Research Methodology (43 papers in training set; Top 0.1%; 10.0%)
4. BMC Bioinformatics (383 papers in training set; Top 1%; 9.1%)
5. Journal of Biomedical Informatics (45 papers in training set; Top 0.3%; 6.3%)

50% of probability mass above

6. PLOS ONE (4510 papers in training set; Top 32%; 4.8%)
7. BioData Mining (15 papers in training set; Top 0.1%; 3.6%)
8. BMC Medical Informatics and Decision Making (39 papers in training set; Top 0.8%; 3.6%)
9. Scientific Reports (3102 papers in training set; Top 38%; 3.6%)
10. GigaScience (172 papers in training set; Top 0.7%; 3.0%)
11. PLOS Computational Biology (1633 papers in training set; Top 12%; 2.6%)
12. Statistics in Medicine (34 papers in training set; Top 0.1%; 2.1%)
13. BMC Research Notes (29 papers in training set; Top 0.1%; 1.7%)
14. Nature Communications (4913 papers in training set; Top 52%; 1.7%)
15. Computer Methods and Programs in Biomedicine (27 papers in training set; Top 0.5%; 1.3%)
16. Journal of Medical Internet Research (85 papers in training set; Top 4%; 0.8%)
17. npj Digital Medicine (97 papers in training set; Top 4%; 0.7%)
18. Bioinformatics Advances (184 papers in training set; Top 5%; 0.7%)
19. JAMIA Open (37 papers in training set; Top 2%; 0.7%)
20. Trials (25 papers in training set; Top 2%; 0.7%)
21. Epidemics (104 papers in training set; Top 2%; 0.6%)