
Synthesising artificial patient-level data for Open Science - an evaluation of five methods

Allen, M.; Salmon, A.

medRxiv preprint (health informatics), posted 2020-10-13. DOI: 10.1101/2020.10.09.20210138

Background

Open science is a movement seeking to make scientific research accessible to all, including publication of code and data. Publishing patient-level data may, however, compromise the confidentiality of that data if there is any significant risk that the data may later be associated with individuals. Synthetic data offer the potential to release data that may be used to evaluate methods or perform preliminary research without risk to patient confidentiality.

Methods

We tested five synthetic data methods:

1. A technique based on Principal Component Analysis (PCA), which samples data from distributions derived from the transformed data.
2. Synthetic Minority Oversampling Technique (SMOTE), which is based on interpolation between near neighbours.
3. Generative Adversarial Network (GAN), an artificial neural network approach with two competing networks: a discriminator network trained to distinguish between synthetic and real data, and a generator network trained to produce data that can fool the discriminator network.
4. CT-GAN, a refinement of GANs specifically for the production of structured tabular synthetic data.
5. Variational Auto-Encoder (VAE), a method of encoding data in a reduced number of dimensions and sampling from distributions based on the encoded dimensions.

Two data sets were used to evaluate the methods:

1. The Wisconsin Breast Cancer data set, a histology data set in which all features are continuous variables.
2. A stroke thrombolysis pathway data set, describing characteristics of patients for whom a decision is made whether to treat with clot-busting medication. Features are mostly categorical, binary, or integers.

Methods were evaluated in three ways:

1. The ability of synthetic data to train a logistic regression classification model.
2. A comparison of means and standard deviations between original and synthetic data.
3. A comparison of covariance between features in the original and synthetic data.

Results

Using the Wisconsin Breast Cancer data set, the original data gave 98% accuracy in a logistic regression classification model; synthetic data sets gave between 93% and 99% accuracy. Performance (best to worst) was SMOTE > PCA > GAN > CT-GAN = VAE. All methods reproduced the original data means and standard deviations with high accuracy (R-squared > 0.96 for all methods and data classes). CT-GAN and VAE suffered a significant loss of covariance between features in the synthetic data sets.

Using the stroke pathway data set, the original data gave 82% accuracy in a logistic regression classification model; synthetic data sets gave between 66% and 82% accuracy. Performance (best to worst) was SMOTE > PCA > CT-GAN > GAN > VAE. CT-GAN and VAE again suffered a loss of covariance between features in the synthetic data sets, though less pronounced than with the Wisconsin Breast Cancer data set.

Conclusions

This pilot work shows, as proof of concept, that synthetic data of sufficient quality may be produced to publish alongside open methodology, allowing people to better understand and test that methodology. The quality of the synthetic data also gives promise of data sets that may be used for screening of ideas, or for research projects (perhaps especially in an education setting). More work is required to further refine and test methods across a broader range of patient-level data sets.
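The best-performing method, SMOTE, is simple enough to sketch directly: for each real sample, pick one of its k nearest neighbours and place a synthetic point at a random position on the line segment between the two. This is an illustrative reimplementation of the idea only, not the authors' code; the function name and parameters are our own.

```python
import numpy as np

def smote_like(X, k=5, n_synthetic=100, rng=None):
    """Generate synthetic rows by interpolating between near neighbours."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise squared distances (brute force; fine for small data).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                # exclude self-matches
    nn = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbours per row
    base = rng.integers(0, n, size=n_synthetic)           # random base points
    neigh = nn[base, rng.integers(0, k, size=n_synthetic)] # a neighbour of each
    gap = rng.random((n_synthetic, 1))          # interpolation fraction in [0, 1)
    return X[base] + gap * (X[neigh] - X[base])
```

Because every synthetic point lies between two real points, samples stay within the local convex hull of the data, which helps explain SMOTE's strong preservation of means and covariance reported above.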
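The PCA-based approach can be sketched similarly: project the data onto its principal components, sample each (decorrelated) component from a distribution matching that component's observed mean and standard deviation, then map back to feature space. We assume normal distributions per component here; the authors' implementation may differ in the distributions sampled.

```python
import numpy as np

def pca_sample(X, n_synthetic=100, rng=None):
    """Sample synthetic rows from per-component distributions in PCA space."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu
    # Principal axes via SVD of the centred data.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                          # data expressed in PCA space
    # Components are uncorrelated, so sample each independently.
    z = rng.normal(scores.mean(axis=0), scores.std(axis=0),
                   size=(n_synthetic, Vt.shape[0]))
    return z @ Vt + mu                          # back to original feature space
```

Sampling in the decorrelated space preserves the linear covariance structure by construction, consistent with PCA's strong covariance results in both data sets.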
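The second and third evaluations can also be sketched. One plausible reading of the abstract's R-squared metric is the squared Pearson correlation between paired original and synthetic statistics (feature means, feature standard deviations, and off-diagonal covariance entries); this is an assumption, not the authors' stated definition.

```python
import numpy as np

def fidelity_report(X_real, X_syn):
    """Compare means, standard deviations, and covariance of real vs synthetic data."""
    def r_squared(a, b):
        # Squared Pearson correlation between paired statistic vectors.
        return np.corrcoef(a, b)[0, 1] ** 2

    report = {
        "mean_r2": r_squared(X_real.mean(axis=0), X_syn.mean(axis=0)),
        "std_r2": r_squared(X_real.std(axis=0), X_syn.std(axis=0)),
    }
    # Compare only off-diagonal covariance entries (diagonal repeats the variances).
    cov_real = np.cov(X_real, rowvar=False)
    cov_syn = np.cov(X_syn, rowvar=False)
    mask = ~np.eye(cov_real.shape[0], dtype=bool)
    report["cov_r2"] = r_squared(cov_real[mask], cov_syn[mask])
    return report
```

A method can score highly on means and standard deviations while still losing covariance (as CT-GAN and VAE did), which is why the third comparison is reported separately.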

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Match score percentile | Probability |
|---|---|---|---|---|
| 1 | Frontiers in Artificial Intelligence | 18 | Top 0.1% | 9.8% |
| 2 | BMC Medical Informatics and Decision Making | 39 | Top 0.3% | 8.2% |
| 3 | PLOS ONE | 4510 | Top 26% | 6.6% |
| 4 | BMC Medical Research Methodology | 43 | Top 0.1% | 6.2% |
| 5 | Computer Methods and Programs in Biomedicine | 27 | Top 0.1% | 4.7% |
| 6 | Scientific Reports | 3102 | Top 29% | 4.2% |
| 7 | Bioinformatics | 1061 | Top 5% | 4.2% |
| 8 | International Journal of Medical Informatics | 25 | Top 0.4% | 3.5% |
| 9 | JCO Clinical Cancer Informatics | 18 | Top 0.3% | 3.5% |
| 10 | Biology Methods and Protocols | 53 | Top 0.3% | 3.5% |
| 11 | Journal of Medical Internet Research | 85 | Top 2% | 3.5% |
| 12 | JAMIA Open | 37 | Top 0.5% | 3.2% |
| 13 | Journal of the American Medical Informatics Association | 61 | Top 1.0% | 2.5% |
| 14 | BMC Bioinformatics | 383 | Top 4% | 2.0% |
| 15 | Artificial Intelligence in Medicine | 15 | Top 0.2% | 2.0% |
| 16 | PLOS Computational Biology | 1633 | Top 15% | 1.8% |
| 17 | BMJ Health & Care Informatics | 13 | Top 0.4% | 1.7% |
| 18 | BMC Research Notes | 29 | Top 0.2% | 1.4% |
| 19 | GigaScience | 172 | Top 2% | 1.4% |
| 20 | PLOS Digital Health | 91 | Top 2% | 1.4% |
| 21 | JMIR Medical Informatics | 17 | Top 0.9% | 1.4% |
| 22 | Frontiers in Digital Health | 20 | Top 0.8% | 1.4% |
| 23 | Frontiers in Neuroinformatics | 38 | Top 0.5% | 1.3% |
| 24 | BMJ Open | 554 | Top 11% | 1.2% |
| 25 | Cureus | 67 | Top 4% | 1.2% |
| 26 | Informatics in Medicine Unlocked | 21 | Top 0.8% | 0.9% |
| 27 | Wellcome Open Research | 57 | Top 2% | 0.8% |
| 28 | Scientific Data | 174 | Top 2% | 0.8% |
| 29 | Biomedicines | 66 | Top 3% | 0.8% |
| 30 | Life | 27 | Top 0.3% | 0.8% |