
Validated Synthetic Data Generation from a Multicenter Spine Surgery Registry: Methodology and Benchmark

Challier, V.; Jacquemin, C.; Diebo, B.; Dehouche, N.; Denisov, A.; Cristini, J.; Campana, M.; Castelain, J.-E.; Lonjon, G.; Lafage, V.; Ghailane, S.; SpineDAO Collaborative Group

2026-04-11 · health informatics
medRxiv · DOI: 10.64898/2026.04.07.26350316

Background: Synthetic data have emerged as a complementary strategy for secondary use of clinical registries, enabling data sharing without patient-level exposure. In spine surgery, multicenter data sharing is constrained by institutional governance and patient privacy regulations. Validated synthetic data generation may enable broader access to surgical outcomes data for artificial intelligence development without compromising patient confidentiality.

Objective: To describe and benchmark a three-domain validated synthetic data pipeline applied to a multicenter, tokenized spine surgery registry (SpineBase), and to establish a reproducible certification framework for synthetic spine surgery datasets.

Methods: We extracted 125 sacroiliac joint fusion cases from the SpineBase registry (SIBONE study; IRB-SOFCOT approval Ref. 14-2025; CNIL MR-004 Ref. 2234503 v 0). A GaussianCopula generative model was trained on 52 structured variables spanning demographics, preoperative assessments, operative details, and longitudinal outcomes at 3, 6, 12, and 24 months. Synthetic datasets of 100, 1,000, and 10,000 patients were generated. Validation followed a three-domain framework: (1) fidelity, assessed by Kolmogorov-Smirnov (KS) tests and Jensen-Shannon divergence; (2) utility, assessed by train-on-synthetic, test-on-real (TSTR) methodology; and (3) privacy, assessed by nearest-neighbor distance ratio (NNDR), membership inference attack, and a k-anonymity proxy.

Results: All three validation gates passed. Fidelity: mean KS p-value 0.52 (threshold >0.05). Privacy: NNDR >1.0 in 98.9% of synthetic records; membership inference AUROC 0.57. Utility: 12-month Oswestry Disability Index prediction yielded Pearson r = 0.29, consistent with expected attenuation at N = 125. A SHA-256 cryptographic hash of each certified dataset was anchored on the Solana blockchain for immutable provenance.
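The fidelity and privacy gates described in the Methods can be sketched with standard tooling. This is a minimal illustration, assuming NumPy, SciPy, and scikit-learn; the toy arrays stand in for the real and synthetic cohorts, the generation step itself (e.g., a GaussianCopula model) is omitted, and the thresholds mirror the abstract rather than the authors' actual code.

```python
# Sketch of per-variable fidelity checks (KS test, Jensen-Shannon
# divergence) and a nearest-neighbor privacy check, on toy data.
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(50, 10, size=(125, 3))    # stand-in for 125 real cases
synth = rng.normal(50, 10, size=(1000, 3))  # stand-in synthetic cohort

# Fidelity gate 1: two-sample KS test per variable (pass if p > 0.05).
ks_p = [ks_2samp(real[:, j], synth[:, j]).pvalue for j in range(real.shape[1])]
fidelity_pass = all(p > 0.05 for p in ks_p)

# Fidelity gate 2: Jensen-Shannon divergence on a shared histogram
# (0 means identical distributions).
hist_r, edges = np.histogram(real[:, 0], bins=20, density=True)
hist_s, _ = np.histogram(synth[:, 0], bins=edges, density=True)
jsd = jensenshannon(hist_r, hist_s)

# Privacy: for each synthetic record, ratio of the distance to its
# nearest real neighbor over the distance to the second-nearest.
# NNDR conventions vary: this ratio lies in [0, 1], whereas the paper
# reports values above 1, so its denominator is presumably defined
# differently; ratios far from 0 indicate no memorized near-copies.
nn = NearestNeighbors(n_neighbors=2).fit(real)
d, _ = nn.kneighbors(synth)
nndr = d[:, 0] / d[:, 1]
```

The TSTR utility gate (train a predictor on synthetic records, evaluate on held-out real records) follows the same pattern with any regression model and is left out for brevity.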
Conclusions: A validated, blockchain-anchored synthetic data pipeline for spine surgery registries is technically feasible and meets current publication-standard criteria for fidelity and privacy. Utility metrics scale with registry size, creating a direct incentive for multicenter data contribution. This framework provides a reproducible methodology for synthetic data certification in spine surgery research and establishes certified synthetic datasets as a privacy-native substrate for expert-annotation pipelines, as demonstrated in the companion Spine Reviews study.
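The provenance step is a standard SHA-256 fingerprint of the certified dataset; the on-chain anchoring on Solana is out of scope here. A minimal sketch, assuming a CSV serialization (the registry's actual export format is not specified in the abstract) and toy records:

```python
# Compute a SHA-256 fingerprint of a serialized dataset. Any party who
# re-serializes the same records the same way recovers the same digest,
# which is what makes an on-chain anchor verifiable.
import csv
import hashlib
import io

rows = [("age", "odi_12m"), (63, 24.5), (58, 31.0)]  # hypothetical records
buf = io.StringIO()
csv.writer(buf).writerows(rows)
digest = hashlib.sha256(buf.getvalue().encode("utf-8")).hexdigest()
print(digest)  # 64 hex characters identifying this exact dataset
```

Because the hash is over the serialized bytes, the serialization (column order, delimiters, encoding) must itself be fixed as part of the certification protocol, or independent parties will compute different digests for identical data.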

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

Rank  Journal                                             Training papers  Percentile  Probability
1     npj Digital Medicine                                97               Top 0.1%    28.7%
2     Nature Communications                               4913             Top 21%     8.7%
3     Scientific Reports                                  3102             Top 15%     6.6%
4     Scientific Data                                     174              Top 0.2%    6.6%
5     Journal of the American Medical Informatics Assoc.  61               Top 0.4%    6.6%
6     JMIR Medical Informatics                            17               Top 0.2%    4.5%
7     Science Advances                                    1098             Top 8%      3.2%
8     PLOS ONE                                            4510             Top 46%     2.5%
9     BMJ Open                                            554              Top 9%      1.8%
10    The Lancet Digital Health                           25               Top 0.4%    1.7%
11    BMJ Health & Care Informatics                       13               Top 0.5%    1.4%
12    Patterns                                            70               Top 1%      1.4%
13    JCO Clinical Cancer Informatics                     18               Top 0.5%    1.4%
14    International Journal of Medical Informatics        25               Top 1%      1.3%
15    PLOS Digital Health                                 91               Top 2%      1.1%
16    BMC Medical Research Methodology                    43               Top 1%      1.0%
17    JAMIA Open                                          37               Top 1%      0.8%
18    Experimental Neurology                              57               Top 1%      0.8%
19    Communications Biology                              886              Top 20%     0.8%
20    European Respiratory Journal                        54               Top 2%      0.8%
21    GigaScience                                         172              Top 3%      0.8%
22    Annals of Internal Medicine                         27               Top 1%      0.7%
23    JMIR Public Health and Surveillance                 45               Top 4%      0.7%
24    BMC Medical Informatics and Decision Making         39               Top 3%      0.7%
25    Clinical and Translational Science                  21               Top 1%      0.7%
26    Nature Medicine                                     117              Top 5%      0.7%
27    Trials                                              25               Top 2%      0.7%
28    Computer Methods and Programs in Biomedicine        27               Top 1%      0.7%
29    Philosophical Transactions of the Royal Society B   51               Top 7%      0.5%
30    iScience                                            1063             Top 39%     0.5%