
Challenges in the Computational Reproducibility of Linear Regression Analyses: An Empirical Study

Jones, L. V.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 · medRxiv · Health systems and quality improvement
DOI: 10.64898/2026.04.07.26350286

Background: Reproducibility concerns in health research have grown, as many published results cannot be independently reproduced. Achieving computational reproducibility, where others can obtain the same results by applying the same methods to the same data, requires transparent reporting of statistical tests, models, and software. While data-sharing initiatives have improved accessibility, the actual usability of shared data for reproducing research findings remains underexplored. Addressing this gap is crucial for advancing open science and ensuring that shared data meaningfully support reproducibility and enable collaboration, thereby strengthening evidence-based policy and practice.

Methods: A random sample of 95 PLOS ONE health research papers from 2019 that reported linear regression was assessed for data-sharing practices and computational reproducibility. Data were accessible for 43 papers. From this random sample, the first 20 papers with available data were assessed for computational reproducibility; three regression models per paper were reanalysed.

Results: Of the 95 papers, 68 reported that data were available, but 25 of these lacked the data required to reproduce the linear regression models. Only eight of the 20 papers we reanalysed were computationally reproducible. A major barrier was the difficulty of matching the variables described in a paper to those in its shared data. Some papers could not be reproduced because the methods, including variable adjustments and data exclusions, were not adequately described.

Conclusion: More than half (60%) of the analysed studies were not computationally reproducible, raising concerns about the credibility of the reported results and highlighting the need for greater transparency and rigour in research reporting. When data are made available, authors should provide a corresponding data dictionary with variable labels that match those used in the paper. Analysis code, model specifications, and any supporting materials detailing the steps required to reproduce the results should be deposited in a publicly accessible repository or included as supplementary files. To increase the reproducibility of statistical results, we propose a Model Location and Specification Table (MLast), which records where each analysis appears in the paper and how it was specified. In conjunction with a data dictionary, MLast enables analyses to be mapped to the data, greatly aiding computational reproducibility.
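The reanalysis step described in the Methods, refitting a reported linear regression on the shared data and checking the result against the published coefficients, can be illustrated with a short sketch. This is a minimal illustration under assumptions, not the authors' pipeline: the file name, variable names, exclusion rule, and published estimates below are hypothetical, and the half-unit-in-the-last-reported-decimal tolerance is just one reasonable matching criterion.

```python
# Minimal sketch of a computational-reproducibility check for one
# reported linear regression. File name, variable names, exclusion
# rule, and published estimates are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Coefficients as reported in the paper, to their reported precision.
published = {"Intercept": 2.41, "age": 0.03, "bmi": -0.12}

df = pd.read_csv("shared_dataset.csv")

# Apply the exclusions described in the methods section; undocumented
# exclusions were one reason reanalyses failed in this study.
df = df[df["age"] >= 18]

model = smf.ols("sbp ~ age + bmi", data=df).fit()

# Match within half a unit of the last reported decimal place
# (estimates reported to two decimals, so a tolerance of 0.005).
for term, reported in published.items():
    reproduced = model.params[term]
    ok = abs(reproduced - reported) <= 0.005
    print(f"{term}: reported={reported}, refit={reproduced:.3f}, match={ok}")
```

The two failure modes the Results highlight, variables that cannot be matched to the paper and undocumented adjustments or exclusions, correspond to the hard-coded column names and filter above; a data dictionary and an MLast entry would make both explicit.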

Matching journals

The top four journals together account for just over half (53.4%) of the predicted probability mass; a verification sketch follows the table.

Rank | Journal | Papers in training set | Percentile | Probability
---- | ------- | ---------------------- | ---------- | -----------
1 | Journal of Clinical Epidemiology | 28 | Top 0.1% | 28.4%
2 | Research Synthesis Methods | 20 | Top 0.1% | 9.4%
3 | PLOS ONE | 4510 | Top 21% | 8.6%
4 | BMJ Open | 554 | Top 3% | 7.0%
5 | BMJ Global Health | 98 | Top 0.7% | 4.4%
6 | Trials | 25 | Top 0.3% | 4.1%
7 | F1000Research | 79 | Top 0.3% | 4.1%
8 | PLOS Biology | 408 | Top 5% | 3.0%
9 | European Journal of Epidemiology | 40 | Top 0.2% | 2.8%
10 | Royal Society Open Science | 193 | Top 1% | 2.2%
11 | International Journal of Epidemiology | 74 | Top 1% | 1.7%
12 | BMJ Open Quality | 15 | Top 0.6% | 1.3%
13 | The Lancet Global Health | 24 | Top 0.8% | 1.3%
14 | Journal of Biomedical Informatics | 45 | Top 1.0% | 1.3%
15 | BMC Medicine | 163 | Top 5% | 1.0%
16 | British Journal of General Practice | 22 | Top 0.5% | 0.9%
17 | Journal of the American Medical Informatics Association | 61 | Top 2% | 0.8%
18 | JAMA Network Open | 127 | Top 4% | 0.8%
19 | PLOS Digital Health | 91 | Top 3% | 0.7%
20 | Wellcome Open Research | 57 | Top 3% | 0.7%
21 | Neuroscience & Biobehavioral Reviews | 43 | Top 1% | 0.7%
22 | BMC Biology | 248 | Top 6% | 0.7%
23 | Scientific Reports | 3102 | Top 80% | 0.5%
24 | Healthcare | 16 | Top 2% | 0.5%
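As a small check on the summary above the table, the probability mass can be accumulated rank by rank until it crosses 50%. A minimal sketch, using only the top ten percentages from the table:

```python
# Accumulate predicted probability mass by rank until 50% is crossed.
# Percentages are the top ten rows of the table above.
probs = [28.4, 9.4, 8.6, 7.0, 4.4, 4.1, 4.1, 3.0, 2.8, 2.2]

cumulative = 0.0
for rank, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        print(f"50% of the mass is reached at rank {rank} ({cumulative:.1f}%)")
        break
```

This prints rank 4 with a cumulative 53.4%, consistent with the "top four journals" summary.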