Generation of Synthetic Data in Health Surveys Using Large Language Models

Villarreal-Zegarra, D.; Bellido-Boza, L.

2026-01-30 health informatics
10.64898/2026.01.27.26345015 medRxiv

Background: Generating synthetic data using artificial intelligence, such as large language models (LLMs), is a useful strategy in public health because it can reduce time and costs, expand access to data, and facilitate information sharing without compromising confidentiality.
Objective: To evaluate the consistency and psychometric plausibility of synthetic data generated by an LLM to simulate the responses of survey participants (user personas) in a national health survey in Peru.
Methods: We conducted a cross-sectional study based on the National Health Satisfaction Survey (ENSUSALUD 2016) of ambulatory health service users. We used the GPT-OSS-20B model to generate synthetic responses in Spanish, conditioned on narrative profiles derived from sociodemographic and clinical variables. We evaluated consistency between responses and profile characteristics (sex, age, and comorbidities) using performance metrics (accuracy, precision, recall, F1 score, and AUC). We compared distributions between real and synthetic data using t-tests and chi-square tests. For latent variables, we conducted confirmatory factor analyses of the PHQ-9, PHQ-8, and GAD-7 (WLSMV; polychoric matrices) and estimated internal consistency (α and ω). We examined normality (Jarque-Bera test) and stability through correlations between real measures (PHQ-2 and EQ-5D) and synthetic measures (PHQ-2, PHQ-8, PHQ-9, GAD-2, and GAD-7).
Results: The model showed strong concordance with the profile for sex, age, and chronic disease status, with metrics close to 1 for most variables; overall consistency was high in the vast majority of cases. The synthetic PHQ-9, PHQ-8, and GAD-7 instruments showed optimal factor fit and high internal consistency. Synthetic measures were positively and significantly correlated with the real PHQ-2 and negatively correlated with EQ-5D, with moderate to high correlations, particularly for PHQ-8/PHQ-9 and GAD-7.
Conclusions: An LLM can generate plausible synthetic data for health surveys when its output is conditioned on user personas, preserving high coherence with demographic and clinical characteristics and maintaining adequate psychometric properties in depression and anxiety scales. However, relevant deviations were identified (e.g., overestimation of obesity, unexpected distributions in some variables, and missing values in a sensitive item), which supports the need for rigorous validation and bias control before using these data for inferential purposes or public policy.
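
The internal-consistency estimate named in the Methods can be illustrated with a minimal sketch of Cronbach's alpha computed from raw item scores. This is not the authors' code: the abstract does not name the software used, and the function name `cronbach_alpha` and the toy scores below are hypothetical. The sketch uses the classical raw-score formula rather than an estimate based on polychoric correlations.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha from raw item scores (hypothetical helper).

    rows: one list per respondent, each holding k item scores.
    Formula: alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Transpose respondents-by-items into one tuple of scores per item.
    items = list(zip(*rows))
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Toy data (assumption, not from the study): 5 respondents, 3 items scored 0-3.
toy_scores = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 2, 2],
    [3, 2, 3],
    [3, 3, 3],
]
alpha = cronbach_alpha(toy_scores)
```

Raw-score alpha treats the ordinal 0-3 responses as interval data; an analysis built on polychoric matrices, as the Methods describe for the factor models, may yield different values.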

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. BMC Medical Research Methodology | 43 papers in training set | Top 0.1% | 17.9%
2. JMIR Public Health and Surveillance | 45 papers in training set | Top 0.1% | 8.2%
3. PLOS ONE | 4510 papers in training set | Top 26% | 6.7%
4. International Journal of Medical Informatics | 25 papers in training set | Top 0.2% | 6.2%
5. Journal of Medical Internet Research | 85 papers in training set | Top 1.0% | 4.8%
6. Scientific Reports | 3102 papers in training set | Top 25% | 4.8%
7. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.7% | 3.8%
50% of probability mass above
8. BMJ Open | 554 papers in training set | Top 6% | 3.5%
9. JAMIA Open | 37 papers in training set | Top 0.4% | 3.5%
10. JMIR Medical Informatics | 17 papers in training set | Top 0.3% | 3.5%
11. npj Digital Medicine | 97 papers in training set | Top 1% | 3.5%
12. PLOS Digital Health | 91 papers in training set | Top 1% | 2.0%
13. Frontiers in Digital Health | 20 papers in training set | Top 0.5% | 2.0%
14. Frontiers in Public Health | 140 papers in training set | Top 4% | 2.0%
15. Journal of Biomedical Informatics | 45 papers in training set | Top 0.8% | 1.7%
16. BMC Bioinformatics | 383 papers in training set | Top 5% | 1.3%
17. Bioinformatics | 1061 papers in training set | Top 8% | 1.3%
18. JMIR Research Protocols | 18 papers in training set | Top 0.9% | 1.3%
19. JMIR mHealth and uHealth | 10 papers in training set | Top 0.3% | 1.2%
20. Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 1.1%
21. DIGITAL HEALTH | 12 papers in training set | Top 0.5% | 0.9%
22. BMC Health Services Research | 42 papers in training set | Top 2% | 0.9%
23. Frontiers in Psychiatry | 83 papers in training set | Top 3% | 0.9%
24. PeerJ | 261 papers in training set | Top 14% | 0.8%
25. BMC Public Health | 147 papers in training set | Top 6% | 0.8%
26. JMIR Formative Research | 32 papers in training set | Top 2% | 0.7%
27. Frontiers in Artificial Intelligence | 18 papers in training set | Top 0.9% | 0.7%
28. American Journal of Epidemiology | 57 papers in training set | Top 2% | 0.7%
29. Acta Neuropsychiatrica | 12 papers in training set | Top 1% | 0.6%
30. Journal of Affective Disorders | 81 papers in training set | Top 2% | 0.6%
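
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked by accumulating the listed percentages in rank order. The following is a minimal sketch; the function name `ranks_to_reach` is hypothetical, and the values are the first eight percentages copied from the list above, on the assumption that the final percentage in each entry is the predicted probability.

```python
def ranks_to_reach(probs, threshold):
    """Return (rank, cumulative) at the first rank where the running sum
    of probs meets or exceeds threshold, or None if it never does."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return rank, total
    return None


# Predicted probabilities (%) for ranks 1-8, copied from the list above.
top_probs = [17.9, 8.2, 6.7, 6.2, 4.8, 4.8, 3.8, 3.5]
result = ranks_to_reach(top_probs, 50.0)
```

With these values the running sum first crosses 50 at rank 7, which matches the position of the "50% of probability mass above" marker.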