
Data Resource Profile: EST-Health-30

Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.

2026-04-24 epidemiology
10.64898/2026.04.21.26351087 medRxiv

Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia.

Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks.

Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course.

Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research. Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations.

Key features:
- EST-Health-30 is a population-representative dataset of complete health records for a random 30% sample of the Estonian population (~500,000 individuals) spanning 2012-present, enabling population-level epidemiological analyses with annual updates.
- The dataset is constructed using a random sampling approach based on hashed password-protected personal identifiers, ensuring consistent inclusion over time with unbiased population coverage.
- Individual-level data are linked across multiple nationwide databases, including electronic health records, claims, prescriptions, cancer and cause of death registry data, enabling multimodal analyses of health trajectories.
- All data are standardised to the OMOP Common Data Model (CDM) version 5.4 using international vocabularies (e.g., SNOMED CT, RxNorm, LOINC), supporting reproducibility and participation in federated research networks.
- The dataset is accessible through a secure processing environment compliant with the European Health Data Space (EHDS) framework.
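The abstract says that cohort membership is decided by hashing password-protected personal identifiers, so that the same people are selected at every annual update. The abstract does not give the actual procedure, so the following is only a minimal sketch of one common way to get that property. The use of HMAC-SHA-256, the secret key, the function name in_sample, and the example identifiers are all assumptions made for illustration, not details from the paper.

```python
import hashlib
import hmac

SAMPLE_FRACTION = 0.30  # target share stated in the abstract


def in_sample(personal_id: str, secret_key: bytes,
              fraction: float = SAMPLE_FRACTION) -> bool:
    """Decide deterministically whether an identifier falls in the sample.

    Assumption: a keyed hash (HMAC-SHA-256) stands in for the
    "hashed password-protected personal identifiers" in the abstract.
    """
    digest = hmac.new(secret_key, personal_id.encode("utf-8"),
                      hashlib.sha256).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Keep the identifier if it lands in the lowest `fraction` of the range.
    return value < int(fraction * 2**64)


if __name__ == "__main__":
    key = b"hypothetical-secret"  # illustrative only
    ids = [f"person-{i}" for i in range(20000)]  # synthetic identifiers
    share = sum(in_sample(pid, key) for pid in ids) / len(ids)
    print(f"sampled share: {share:.3f}")
```

Because the decision depends only on the identifier and the key, rerunning the selection on a later data extract keeps every previously selected person and admits new residents at the same rate, without storing a membership list.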

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | International Journal of Epidemiology | 74 | Top 0.1% | 17.3%
2 | BMJ Open | 554 | Top 2% | 8.3%
3 | Nature Communications | 4913 | Top 30% | 6.2%
4 | PLOS ONE | 4510 | Top 32% | 4.8%
5 | npj Digital Medicine | 97 | Top 1.0% | 4.8%
6 | Scientific Data | 174 | Top 0.4% | 4.3%
7 | BMC Medical Research Methodology | 43 | Top 0.3% | 3.5%
8 | Pharmacoepidemiology and Drug Safety | 13 | Top 0.1% | 3.2%
(50% of probability mass above)
9 | The Lancet Digital Health | 25 | Top 0.2% | 2.4%
10 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 2.1%
11 | Scientific Reports | 3102 | Top 50% | 2.1%
12 | Database | 51 | Top 0.3% | 2.1%
13 | Nature Human Behaviour | 85 | Top 2% | 1.9%
14 | International Journal of Medical Informatics | 25 | Top 0.8% | 1.8%
15 | American Journal of Epidemiology | 57 | Top 0.7% | 1.7%
16 | Eurosurveillance | 80 | Top 0.7% | 1.7%
17 | European Journal of Epidemiology | 40 | Top 0.4% | 1.6%
18 | JAMIA Open | 37 | Top 0.9% | 1.5%
19 | Journal of the American Medical Informatics Association | 61 | Top 1% | 1.5%
20 | PLOS Digital Health | 91 | Top 2% | 1.5%
21 | BMC Medicine | 163 | Top 4% | 1.3%
22 | Frontiers in Public Health | 140 | Top 7% | 0.9%
23 | Wellcome Open Research | 57 | Top 2% | 0.9%
24 | Swiss Medical Weekly | 12 | Top 0.2% | 0.9%
25 | Journal of Medical Internet Research | 85 | Top 4% | 0.9%
26 | BMJ | 49 | Top 1% | 0.8%
27 | PLOS Medicine | 98 | Top 5% | 0.7%
28 | Nature Medicine | 117 | Top 6% | 0.7%
29 | Journal of Biomedical Informatics | 45 | Top 2% | 0.7%
30 | Healthcare | 16 | Top 2% | 0.6%
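
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked by accumulating the listed probabilities in rank order. The sketch below does this for the first ten listed values; the function name journals_to_reach is hypothetical and the listed percentages are rounded, so the running total is approximate.

```python
# Predicted probabilities (percent) for ranks 1-10, copied from the list above.
probabilities = [17.3, 8.3, 6.2, 4.8, 4.8, 4.3, 3.5, 3.2, 2.4, 2.1]


def journals_to_reach(probs, target=50.0):
    """Return (k, total): the smallest number of top-ranked journals whose
    cumulative probability reaches `target`, and that cumulative value.
    Returns None if the target is never reached."""
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return k, total
    return None


if __name__ == "__main__":
    k, total = journals_to_reach(probabilities)
    print(f"{k} journals reach {total:.1f}% of the probability mass")
```

With the rounded values, the running total after seven journals is 49.2% and after eight is 52.4%, which agrees with the cut-off marked in the list.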