The Power of Open Health Data: Impact, Representation, and Knowledge Diffusion

Gorijavolu, R.; Armengol de la Hoz, M. A.; Bielick, C.; Cajas, S.; Charpignon, M.-L.; El Mir, A.; Gichoya, J. W.; Kwak, H. G.; Madapati, K.; Mattie, H.; McCullum, L.; Mwavu, R.; Nair, V.; Nakayama, L. F.; Nanyonjo, J.; Nazer, L.; Patel, M. S.; Sauer, C. M.; Celi, L. A.

2026-03-24 health informatics
10.64898/2026.03.20.26348933 medRxiv

Background Open health data repositories receive billions in public funding, yet no systematic framework exists to evaluate their downstream scholarly impact, the composition of the research communities they cultivate, or the breadth of disciplines they reach. We introduce a two-degree citation methodology to quantify knowledge diffusion from open data, normalized by funding, and apply it to four major health data repositories.
Methods Using the OpenAlex bibliometric database (January-February 2026), we identified all first-degree citing publications (n = 30,049) and their second-degree citing publications (n = 485,396), defined as papers citing those first-degree publications, for MIMIC (versions I-IV; retrospective EHR data; $14.4M), UK Biobank (prospective cohort with genomics; $525.5M), OpenSAFELY (federated EHR platform; $53.7M), and All of Us (prospective national cohort with biobanking and community engagement; $2,160M). We extracted author demographics (gender via Genderize.io, institutional country income via World Bank 2024 classifications) and research topics. Chi-square tests with odds ratios assessed demographic differences across repositories.
Results Funding-normalized first-degree papers per $1M ranged from 689 (MIMIC) to 1 (All of Us), though these figures reflect total program investment, which included community engagement and biobanking for prospective cohorts in addition to data-curation costs. The citation amplification ratio was consistent across these four repositories (9.3-11.5x). Author demographics differed significantly (p < 0.001): LMIC authorship ranged from 41.8% (MIMIC) to 4.3% (All of Us), while female authorship showed the opposite pattern, lowest for MIMIC (31.8%) and highest for All of Us (43.2%). Female authors were consistently underrepresented in senior (last-author) compared with first-author positions across all repositories. Differences in scope, design, and what funding covers limit direct comparisons.
Conclusions Open health data generates a consistent ~10x indirect citation amplification beyond its direct users, a ratio that held across repositories spanning over two orders of magnitude in funding. The large differences in funding-normalized output partly reflect structural differences between retrospective databases and prospective cohorts. Low-cost access combined with intentional community building attracted globally diverse research communities with LMIC investigators in intellectual leadership positions, while a persistent gender gap in senior authorship across all repositories reflects disciplinary and structural inequities that data access policies alone cannot address. Future evaluations of open data investments should examine who is producing research, from where, in what positions, and whether their participation translates into locally relevant knowledge production.
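
The two-degree citation counts, the funding-normalized output, and the amplification ratio described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the toy citation graph, the function names, and the choice to count a paper as second-degree even when it is also a first-degree citer. It stands in for data that the authors retrieved from OpenAlex and is not their implementation.

```python
from typing import Dict, Set

# Hypothetical toy graph: maps a work to the set of works that cite it.
# "dataset" stands in for a repository's reference publication.
TOY_CITED_BY: Dict[str, Set[str]] = {
    "dataset": {"A", "B"},
    "A": {"B", "C", "D"},
    "B": {"D", "E"},
}


def first_degree(cited_by: Dict[str, Set[str]], dataset_id: str) -> Set[str]:
    """Works that cite the dataset directly."""
    return set(cited_by.get(dataset_id, set()))


def second_degree(cited_by: Dict[str, Set[str]], first: Set[str]) -> Set[str]:
    """Deduplicated works that cite at least one first-degree work.

    Assumption: overlap with the first-degree set is kept, because the
    abstract does not say whether such papers are excluded.
    """
    second: Set[str] = set()
    for paper in first:
        second |= cited_by.get(paper, set())
    return second


def summarize(cited_by: Dict[str, Set[str]], dataset_id: str,
              funding_musd: float) -> Dict[str, float]:
    """Counts, first-degree papers per $1M, and amplification ratio."""
    first = first_degree(cited_by, dataset_id)
    second = second_degree(cited_by, first)
    return {
        "first_degree": len(first),
        "second_degree": len(second),
        "papers_per_million_usd": len(first) / funding_musd,
        "amplification": len(second) / len(first) if first else 0.0,
    }


# Hypothetical funding of $2M for the toy repository.
result = summarize(TOY_CITED_BY, "dataset", funding_musd=2.0)
print(result)
```

In this toy graph two papers cite the dataset and four distinct papers cite those two, so the amplification ratio is 2.0 and the funding-normalized output is 1.0 paper per $1M.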

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association: 61 papers in training set; Top 0.1%; 18.8%
2. PLOS ONE: 4510 papers in training set; Top 27%; 6.4%
3. PLOS Digital Health: 91 papers in training set; Top 0.5%; 4.3%
4. BMJ Health & Care Informatics: 13 papers in training set; Top 0.1%; 4.2%
5. Journal of Clinical and Translational Science: 11 papers in training set; Top 0.1%; 4.0%
6. Journal of Medical Internet Research: 85 papers in training set; Top 1%; 3.6%
7. PLOS Biology: 408 papers in training set; Top 3%; 3.6%
8. BMC Medicine: 163 papers in training set; Top 2%; 2.6%
9. Scientific Reports: 3102 papers in training set; Top 45%; 2.6%
50% of probability mass above
10. JAMIA Open: 37 papers in training set; Top 0.6%; 2.1%
11. Annals of Internal Medicine: 27 papers in training set; Top 0.3%; 2.1%
12. The Lancet Digital Health: 25 papers in training set; Top 0.3%; 1.8%
13. Philosophical Transactions of the Royal Society B: 51 papers in training set; Top 3%; 1.7%
14. eLife: 5422 papers in training set; Top 41%; 1.7%
15. Nature Communications: 4913 papers in training set; Top 51%; 1.7%
16. JAMA: 17 papers in training set; Top 0.1%; 1.7%
17. BMJ Open: 554 papers in training set; Top 9%; 1.7%
18. DIGITAL HEALTH: 12 papers in training set; Top 0.4%; 1.3%
19. Patterns: 70 papers in training set; Top 1%; 1.3%
20. JAMA Network Open: 127 papers in training set; Top 3%; 1.2%
21. Royal Society Open Science: 193 papers in training set; Top 3%; 1.1%
22. European Respiratory Journal: 54 papers in training set; Top 1%; 1.0%
23. JMIR Medical Informatics: 17 papers in training set; Top 1%; 1.0%
24. BMJ: 49 papers in training set; Top 1.0%; 0.9%
25. JMIR Public Health and Surveillance: 45 papers in training set; Top 3%; 0.8%
26. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 42%; 0.8%
27. npj Digital Medicine: 97 papers in training set; Top 3%; 0.8%
28. Journal of Biomedical Informatics: 45 papers in training set; Top 1%; 0.8%
29. PLOS Computational Biology: 1633 papers in training set; Top 24%; 0.8%
30. International Journal of Medical Informatics: 25 papers in training set; Top 2%; 0.7%
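
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked against the percentages listed for the 30 ranked journals. The sketch below is only an illustration: the function name is hypothetical, and the inputs are the rounded percentages shown on the page, so the cumulative sum is approximate.

```python
from typing import List, Optional

# Predicted probabilities (percent) for ranks 1 to 30, as displayed above.
PROBABILITIES: List[float] = [
    18.8, 6.4, 4.3, 4.2, 4.0, 3.6, 3.6, 2.6, 2.6, 2.1,
    2.1, 1.8, 1.7, 1.7, 1.7, 1.7, 1.7, 1.3, 1.3, 1.2,
    1.1, 1.0, 1.0, 0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.7,
]


def journals_to_reach(probabilities: List[float],
                      threshold: float) -> Optional[int]:
    """Smallest number of top-ranked journals whose summed probability
    meets the threshold, or None if the list never reaches it."""
    total = 0.0
    for count, probability in enumerate(probabilities, start=1):
        total += probability
        if total >= threshold:
            return count
    return None


print(journals_to_reach(PROBABILITIES, 50.0))  # prints 9
```

The first eight entries sum to about 47.5% and the ninth brings the total to about 50.1%, which agrees with the cutoff shown in the list.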