
Can synthetic data overcome the privacy and fidelity bottleneck in Pharmacometrics? A comparative benchmark using a daptomycin population pharmacokinetic model

Destere, A.; Lombardi, R.; Labriffe, M.; Benoist, C.; Marquet, P.; Lavrut, T.; Gerard, A.; Bouveyron, C.; Woillard, J.-B.

2026-06-02 pharmacology and therapeutics
10.64898/2026.05.30.26354512 medRxiv

Abstract

Introduction: The sharing of individual patient data is essential for advancing pharmacometrics but is strictly limited by privacy regulations (e.g., GDPR). While synthetic data generation offers a legally compliant alternative, its structural impact on complex nonlinear mixed-effects (NLME) modelling remains largely unexplored. This study aimed to benchmark five generative artificial intelligence algorithms by evaluating the balance between data privacy and the preservation of structural PK properties and clinical dosing guidance.

Material & methods: A daptomycin two-compartment PopPK model was used to simulate a reference cohort of 500 patients. Five generative algorithms (Modified AVATAR, Gaussian Copula, Synthpop, TVAE, and CTGAN) each produced 100 independent synthetic datasets. A two-stage evaluation framework was applied. First, a statistical indistinguishability test based on logistic regression (AUC ROC) was used as a macroscopic pre-selection criterion to determine algorithm eligibility for NLME modelling and privacy risk assessment. Privacy risk was independently quantified using the Anonymeter framework (Singling Out and Linkability attacks). Eligible algorithms were then evaluated on PK parameter recovery bias and clinical dosing simulations.

Results: Deep learning architectures (TVAE, CTGAN) were excluded at the pre-selection stage due to both biologically implausible covariate generation and high macroscopic detectability (mean AUC ROC = 0.837 and 0.986, respectively). Synthpop, AVATAR, and Gaussian Copula all passed the indistinguishability threshold (AUC ROC = 0.475 ± 0.033, 0.490 ± 0.013, and 0.619 ± 0.031, respectively) and proceeded to NLME evaluation. However, attack-based privacy assessment revealed that Synthpop carried an unacceptable singling-out risk (0.035), disqualifying it from privacy-preserving data sharing. AVATAR and Gaussian Copula demonstrated acceptable privacy profiles (singling-out = 0.004 and 0.001; linkability = 0.010 and 0.003, respectively). At the structural level, Gaussian Copula injected stochastic noise, inflating residual error (+157.0%) and V1 (+25.9%), blunting predicted Cmax and predisposing to empirical dose escalation and risk of toxicity. AVATAR acted as a smoothing filter, deflating V2 (-48.3%) and underestimating CL (-12.9%). Forward clinical simulations confirmed directionally opposed prediction errors: Gaussian Copula consistently underestimated Cmax across standard and renally impaired profiles (-14.5% and -16.0%, respectively), predisposing to empirical dose escalation, whereas AVATAR- and Synthpop-derived models overestimated Cmax and Cmin in the obese infected patient (+14.7% and +8.2%, respectively), compounding the accumulation risk already present in this profile.

Conclusion: While no generative algorithm currently offers a perfect solution, AVATAR and Gaussian Copula represent the most viable candidates, being the only methods to satisfy both macroscopic indistinguishability and attack-based privacy criteria. These findings highlight the necessity of a structured, two-stage validation framework and suggest that, when coupled with therapeutic drug monitoring, synthetic datasets could significantly enhance multicentre collaboration while maintaining strict regulatory compliance.
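The indistinguishability test named in the abstract can be illustrated with a small sketch. A logistic regression is trained to separate real from synthetic rows, and its held-out AUC ROC is read as a detectability score: values near 0.5 mean the classifier cannot tell the two apart, values near 1.0 mean the synthetic data is easy to spot. Everything below is an assumption for illustration rather than the authors' pipeline: the two standardized covariates, the 50/50 train/test split, the gradient-descent settings, and the names `fit_logistic`, `auc_roc`, and `indistinguishability_auc` are all hypothetical.

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clip to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def auc_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def indistinguishability_auc(real, synth, seed=0):
    """Label real rows 0 and synthetic rows 1, train on half, score the other half."""
    rows = [(x, 0) for x in real] + [(x, 1) for x in synth]
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
    scores = [b + sum(wj * xj for wj, xj in zip(w, x)) for x, _ in test]
    return auc_roc(scores, [y for _, y in test])


# Toy data: two standardized covariates per patient (hypothetical, not study data).
rng = random.Random(42)


def draw(n, shift):
    """Draw n rows; `shift` moves the mean of the first covariate."""
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


real = draw(500, 0.0)
faithful = draw(500, 0.0)   # same distribution as the real cohort
distorted = draw(500, 1.5)  # first covariate shifted, so easier to detect

auc_faithful = indistinguishability_auc(real, faithful)
auc_distorted = indistinguishability_auc(real, distorted)
print(f"faithful generator:  AUC = {auc_faithful:.3f}")
print(f"distorted generator: AUC = {auc_distorted:.3f}")
```

In this toy setting the faithful generator should score close to 0.5 and the distorted one well above it, mirroring the contrast the abstract reports between the algorithms that passed pre-selection and the deep learning architectures that did not. The abstract does not state the numerical threshold used, so none is assumed here.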

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Clinical Pharmacology & Therapeutics | 25 | Top 0.1% | 28.2%
2 | Computational and Structural Biotechnology Journal | 216 | Top 0.1% | 10.3%
3 | Frontiers in Pharmacology | 100 | Top 0.2% | 8.6%
4 | Clinical and Translational Science | 21 | Top 0.1% | 8.6%
(50% of probability mass above)
5 | PLOS Computational Biology | 1633 | Top 8% | 4.4%
6 | npj Digital Medicine | 97 | Top 1% | 4.4%
7 | PLOS ONE | 4510 | Top 37% | 3.7%
8 | British Journal of Clinical Pharmacology | 21 | Top 0.2% | 3.1%
9 | Scientific Reports | 3102 | Top 49% | 2.1%
10 | Pharmaceutics | 21 | Top 0.2% | 1.9%
11 | npj Systems Biology and Applications | 99 | Top 0.9% | 1.9%
12 | eLife | 5422 | Top 37% | 1.9%
13 | Journal of Biomedical Informatics | 45 | Top 0.8% | 1.7%
14 | Nature Communications | 4913 | Top 56% | 1.3%
15 | Frontiers in Medicine | 113 | Top 4% | 1.3%
16 | BioData Mining | 15 | Top 0.6% | 1.0%
17 | Computers in Biology and Medicine | 120 | Top 4% | 0.9%
18 | JMIRx Med | 31 | Top 2% | 0.8%
19 | Journal of Medicinal Chemistry | 68 | Top 1% | 0.8%
20 | Bioinformatics Advances | 184 | Top 5% | 0.8%
21 | iScience | 1063 | Top 31% | 0.8%
22 | Bioinformatics | 1061 | Top 9% | 0.8%
23 | IEEE Transactions on Computational Biology and Bioinformatics | 17 | Top 0.7% | 0.7%
24 | Cell Reports Medicine | 140 | Top 8% | 0.7%
25 | Pharmacoepidemiology and Drug Safety | 13 | Top 0.5% | 0.7%
26 | Frontiers in Psychiatry | 83 | Top 4% | 0.5%
27 | Schizophrenia | 19 | Top 0.4% | 0.5%