
Synthesising artificial patient-level data for Open Science - an evaluation of five methods

Allen, M.; Salmon, A.

medRxiv preprint (health informatics), posted 2020-10-13. DOI: 10.1101/2020.10.09.20210138

Background

Open science is a movement seeking to make scientific research accessible to all, including publication of code and data. Publishing patient-level data may, however, compromise the confidentiality of that data if there is any significant risk that the data may later be associated with individuals. Synthetic data offer the potential to release data that may be used to evaluate methods or perform preliminary research without risk to patient confidentiality.

Methods

We tested five synthetic data methods:

1. A technique based on Principal Component Analysis (PCA), which samples data from distributions derived from the transformed data.
2. Synthetic Minority Oversampling Technique (SMOTE), which is based on interpolation between near neighbours.
3. Generative Adversarial Network (GAN), an artificial neural network approach with two competing networks: a discriminator network trained to distinguish between synthetic and real data, and a generator network trained to produce data that can fool the discriminator network.
4. CT-GAN, a refinement of GANs specifically for the production of structured tabular synthetic data.
5. Variational Auto-Encoder (VAE), a method of encoding data in a reduced number of dimensions and sampling from distributions based on the encoded dimensions.

Two data sets were used to evaluate the methods:

1. The Wisconsin Breast Cancer data set, a histology data set in which all features are continuous variables.
2. A stroke thrombolysis pathway data set, describing characteristics of patients for whom a decision is made whether to treat with clot-busting medication. Features are mostly categorical, binary, or integers.

Methods were evaluated in three ways:

1. The ability of synthetic data to train a logistic regression classification model.
2. A comparison of means and standard deviations between original and synthetic data.
3. A comparison of covariance between features in the original and synthetic data.

Results

Using the Wisconsin Breast Cancer data set, the original data gave 98% accuracy in a logistic regression classification model; synthetic data sets gave between 93% and 99% accuracy. Performance (best to worst) was SMOTE > PCA > GAN > CT-GAN = VAE. All methods reproduced the original data means and standard deviations with high accuracy (R-squared > 0.96 for all methods and data classes). CT-GAN and VAE suffered a significant loss of covariance between features in the synthetic data sets.

Using the stroke pathway data set, the original data gave 82% accuracy in a logistic regression classification model; synthetic data sets gave between 66% and 82% accuracy. Performance (best to worst) was SMOTE > PCA > CT-GAN > GAN > VAE. CT-GAN and VAE again suffered a loss of covariance between features in the synthetic data sets, though less pronounced than with the Wisconsin Breast Cancer data set.

Conclusions

This pilot work shows, as proof of concept, that synthetic data of sufficient quality may be produced to publish alongside open methodology, allowing people to better understand and test that methodology. The quality of the synthetic data also gives promise of data sets that may be used for screening of ideas, or for research projects (perhaps especially in an education setting). More work is required to further refine and test methods across a broader range of patient-level data sets.
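The best-performing method, SMOTE, is simple enough to sketch directly: for each real sample, pick one of its k nearest neighbours and place a synthetic point at a random position on the line segment between the two. This is an illustrative reimplementation of the idea only, not the authors' code; the function name and parameters are our own.

```python
import numpy as np

def smote_like(X, k=5, n_synthetic=100, rng=None):
    """Generate synthetic rows by interpolating between near neighbours."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise squared distances (brute force; fine for small data).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                # exclude self-matches
    nn = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbours per row
    base = rng.integers(0, n, size=n_synthetic)           # random base points
    neigh = nn[base, rng.integers(0, k, size=n_synthetic)] # a neighbour of each
    gap = rng.random((n_synthetic, 1))          # interpolation fraction in [0, 1)
    return X[base] + gap * (X[neigh] - X[base])
```

Because every synthetic point lies between two real points, samples stay within the local convex hull of the data, which helps explain SMOTE's strong preservation of means and covariance reported above.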
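The PCA-based approach can be sketched similarly: project the data onto its principal components, sample each (decorrelated) component from a distribution matching that component's observed mean and standard deviation, then map back to feature space. We assume normal distributions per component here; the authors' implementation may differ in the distributions sampled.

```python
import numpy as np

def pca_sample(X, n_synthetic=100, rng=None):
    """Sample synthetic rows from per-component distributions in PCA space."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu
    # Principal axes via SVD of the centred data.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                          # data expressed in PCA space
    # Components are uncorrelated, so sample each independently.
    z = rng.normal(scores.mean(axis=0), scores.std(axis=0),
                   size=(n_synthetic, Vt.shape[0]))
    return z @ Vt + mu                          # back to original feature space
```

Sampling in the decorrelated space preserves the linear covariance structure by construction, consistent with PCA's strong covariance results in both data sets.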
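The second and third evaluations can also be sketched. One plausible reading of the abstract's R-squared metric is the squared Pearson correlation between paired original and synthetic statistics (feature means, feature standard deviations, and off-diagonal covariance entries); this is an assumption, not the authors' stated definition.

```python
import numpy as np

def fidelity_report(X_real, X_syn):
    """Compare means, standard deviations, and covariance of real vs synthetic data."""
    def r_squared(a, b):
        # Squared Pearson correlation between paired statistic vectors.
        return np.corrcoef(a, b)[0, 1] ** 2

    report = {
        "mean_r2": r_squared(X_real.mean(axis=0), X_syn.mean(axis=0)),
        "std_r2": r_squared(X_real.std(axis=0), X_syn.std(axis=0)),
    }
    # Compare only off-diagonal covariance entries (diagonal repeats the variances).
    cov_real = np.cov(X_real, rowvar=False)
    cov_syn = np.cov(X_syn, rowvar=False)
    mask = ~np.eye(cov_real.shape[0], dtype=bool)
    report["cov_r2"] = r_squared(cov_real[mask], cov_syn[mask])
    return report
```

A method can score highly on means and standard deviations while still losing covariance (as CT-GAN and VAE did), which is why the third comparison is reported separately.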

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Match score percentile | Probability |
|---|---|---|---|---|
| 1 | Frontiers in Artificial Intelligence | 18 | Top 0.1% | 9.8% |
| 2 | BMC Medical Informatics and Decision Making | 39 | Top 0.3% | 8.2% |
| 3 | PLOS ONE | 4510 | Top 26% | 6.6% |
| 4 | BMC Medical Research Methodology | 43 | Top 0.1% | 6.2% |
| 5 | Computer Methods and Programs in Biomedicine | 27 | Top 0.1% | 4.7% |
| 6 | Scientific Reports | 3102 | Top 29% | 4.2% |
| 7 | Bioinformatics | 1061 | Top 5% | 4.2% |
| 8 | International Journal of Medical Informatics | 25 | Top 0.4% | 3.5% |
| 9 | JCO Clinical Cancer Informatics | 18 | Top 0.3% | 3.5% |
| 10 | Biology Methods and Protocols | 53 | Top 0.3% | 3.5% |
| 11 | Journal of Medical Internet Research | 85 | Top 2% | 3.5% |
| 12 | JAMIA Open | 37 | Top 0.5% | 3.2% |
| 13 | Journal of the American Medical Informatics Association | 61 | Top 1.0% | 2.5% |
| 14 | BMC Bioinformatics | 383 | Top 4% | 2.0% |
| 15 | Artificial Intelligence in Medicine | 15 | Top 0.2% | 2.0% |
| 16 | PLOS Computational Biology | 1633 | Top 15% | 1.8% |
| 17 | BMJ Health & Care Informatics | 13 | Top 0.4% | 1.7% |
| 18 | BMC Research Notes | 29 | Top 0.2% | 1.4% |
| 19 | GigaScience | 172 | Top 2% | 1.4% |
| 20 | PLOS Digital Health | 91 | Top 2% | 1.4% |
| 21 | JMIR Medical Informatics | 17 | Top 0.9% | 1.4% |
| 22 | Frontiers in Digital Health | 20 | Top 0.8% | 1.4% |
| 23 | Frontiers in Neuroinformatics | 38 | Top 0.5% | 1.3% |
| 24 | BMJ Open | 554 | Top 11% | 1.2% |
| 25 | Cureus | 67 | Top 4% | 1.2% |
| 26 | Informatics in Medicine Unlocked | 21 | Top 0.8% | 0.9% |
| 27 | Wellcome Open Research | 57 | Top 2% | 0.8% |
| 28 | Scientific Data | 174 | Top 2% | 0.8% |
| 29 | Biomedicines | 66 | Top 3% | 0.8% |
| 30 | Life | 27 | Top 0.3% | 0.8% |