Data Resource Profile: EST-Health-30
Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.
Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia.
Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks.
Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course.
Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research.
Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations.
Key features
- EST-Health-30 is a population-representative dataset of complete health records for a random 30% sample of the Estonian population (~500,000 individuals) spanning 2012-present, enabling population-level epidemiological analyses with annual updates.
- The dataset is constructed using a random sampling approach based on hashed password-protected personal identifiers, ensuring consistent inclusion over time with unbiased population coverage.
- Individual-level data are linked across multiple nationwide databases, including electronic health records, claims, prescriptions, cancer and cause of death registry data, enabling multimodal analyses of health trajectories.
- All data are standardised to the OMOP Common Data Model (CDM) version 5.4 using international vocabularies (e.g., SNOMED CT, RxNorm, LOINC), supporting reproducibility and participation in federated research networks.
- The dataset is accessible through a secure processing environment compliant with the European Health Data Space (EHDS) framework.
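The hash-based sampling idea described above can be illustrated with a minimal sketch. The abstract does not name the hash function, the key handling, or the threshold rule, so everything here is an assumption for illustration: a keyed hash (HMAC-SHA256) stands in for the "password-protected" hashing, and the function name `in_sample`, the key, and the identifier format are hypothetical. The point is only that a deterministic keyed hash gives each person a fixed pseudo-random value, so the same individuals are selected at every annual update without storing a membership list.

```python
import hashlib
import hmac

SAMPLE_FRACTION = 0.30  # target share stated in the abstract


def in_sample(person_id: str, secret_key: bytes,
              fraction: float = SAMPLE_FRACTION) -> bool:
    """Decide cohort membership from a keyed hash of the identifier.

    Assumption: HMAC-SHA256 is used here purely for illustration; the
    actual EST-Health-30 hashing scheme is not specified in the abstract.
    """
    digest = hmac.new(secret_key, person_id.encode("utf-8"),
                      hashlib.sha256).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Include the person if the value falls in the lowest `fraction` of the range.
    return value < int(fraction * 2**64)


if __name__ == "__main__":
    key = b"hypothetical-secret-key"                  # hypothetical key
    ids = [f"ID{i:06d}" for i in range(100_000)]      # hypothetical identifiers
    share = sum(in_sample(pid, key) for pid in ids) / len(ids)
    print(f"sampled share: {share:.3f}")
```

Because the decision depends only on the identifier and the key, rerunning the selection on a later data extract keeps every previously selected person in the cohort, and newly registered individuals enter with the same probability. A side effect of the threshold rule is that a smaller fraction always yields a subset of a larger one.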