
CausalKnowledgeTrace: A Novel Computational Framework for Automated Literature-Based Causal Graph Construction and Evidence-Based Variable Selection in Biomedical Research

Upadhayaya, R.; Pradhan, M. M.; Metzger, V. T.; Malec, S. A.

2026-05-12 bioinformatics
10.64898/2026.05.07.723601 bioRxiv

Background: Variable selection for causal inference from observational biomedical data is challenging, because overlooking confounders or conditioning on colliders leads to biased estimates. Although vast causal knowledge exists in the biomedical literature, manually extracting it for principled variable selection is impractical at scale.
Methods: We developed CausalKnowledgeTrace, a Python-based computational framework with a Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system implements a six-stage analysis pipeline using NetworkX for graph operations: graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection.
Results: Analysis of the hypertension-Alzheimer's relationship across three degree neighborhoods (1-3) demonstrated systematic scaling of causal complexity: 361-866 variables and 429-1,442 relationships, with graph densities of 0.0033 to 0.0019. The analysis revealed complex cyclic structures, with 54-606 baseline cycles across degree levels. Processing times ranged from 0.3 to 1.0 seconds across all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third-degree graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders. The key identified confounders and mediators, including obesity, oxidative stress, ischemia, and vascular diseases, were all found to have strong supporting evidence in established epidemiological and pathophysiological literature.
Conclusions: CausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research.
Statement of Significance
Problem or Issue: Selecting proper confounders and variables for causal inference from observational biomedical datasets is challenging and often biased by limited expertise or manual review.
What is Already Known: Existing approaches rely on domain experts, statistical variable screening, or manual construction of causal graphs, but these often overlook literature-documented confounders and complex biases.
What this Paper Adds: This paper introduces an automated, literature-based framework for synthesizing and validating causal graphs, identifying critical variables and complex bias structures, such as M-bias and butterfly bias, with full evidentiary traceability.
Who would benefit from the new knowledge in this paper? Epidemiologists, biomedical researchers, informaticians, and clinical investigators seeking reliable and transparent causal modeling for observational studies.
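
To make the structural quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It uses plain Python dictionaries instead of NetworkX so that it runs without dependencies. Two assumptions are made: that the reported densities follow the standard directed-graph formula m / (n(n - 1)), and that roles can be approximated by reachability (a confounder is a common ancestor of exposure and outcome, a mediator lies on a directed path from exposure to outcome, and a collider is a common descendant of both). The function names and the toy edge list, including the nodes `mediator_m` and `collider_c`, are hypothetical and are not results from the paper.

```python
def density(n_nodes, n_edges):
    """Directed-graph density, assuming the standard m / (n * (n - 1)) formula."""
    return n_edges / (n_nodes * (n_nodes - 1))


def descendants(graph, start):
    """Return every node reachable from `start` along directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def classify(edges, exposure, outcome):
    """Simplified reachability-based role assignment (an assumption of this sketch)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    down_x = descendants(graph, exposure)
    down_y = descendants(graph, outcome)
    roles = {"confounders": set(), "mediators": set(), "colliders": set()}
    for node in graph:
        if node in (exposure, outcome):
            continue
        down_n = descendants(graph, node)
        if exposure in down_n and outcome in down_n:
            roles["confounders"].add(node)   # common cause of both
        elif node in down_x and outcome in down_n:
            roles["mediators"].add(node)     # on a path exposure -> outcome
        elif node in down_x and node in down_y:
            roles["colliders"].add(node)     # common effect of both
    return roles


# Hypothetical toy graph for illustration only.
toy_edges = [
    ("obesity", "hypertension"),
    ("obesity", "alzheimers"),
    ("hypertension", "mediator_m"),
    ("mediator_m", "alzheimers"),
    ("hypertension", "collider_c"),
    ("alzheimers", "collider_c"),
]
roles = classify(toy_edges, "hypertension", "alzheimers")

# Node and edge counts taken from the abstract's reported ranges.
d_first = density(361, 429)
d_third = density(866, 1442)
```

Under the stated density assumption, the smallest and largest reported node and edge counts round to 0.0033 and 0.0019, matching the abstract. The role assignment here ignores subtleties such as cycles and backdoor paths that pass through the exposure, which a full causal analysis would need to handle.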

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Journal of Biomedical Informatics | 45 papers in training set | Top 0.1% | 22.0%
2. Bioinformatics | 1061 papers in training set | Top 2% | 17.1%
3. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.2% | 14.4%
(50% of probability mass above)
4. BMC Bioinformatics | 383 papers in training set | Top 1% | 8.0%
5. BMC Medical Research Methodology | 43 papers in training set | Top 0.3% | 3.0%
6. PLOS ONE | 4510 papers in training set | Top 43% | 2.8%
7. Database | 51 papers in training set | Top 0.2% | 2.5%
8. BioData Mining | 15 papers in training set | Top 0.2% | 1.8%
9. GigaScience | 172 papers in training set | Top 1% | 1.8%
10. PLOS Computational Biology | 1633 papers in training set | Top 17% | 1.7%
11. JAMIA Open | 37 papers in training set | Top 1% | 1.3%
12. Patterns | 70 papers in training set | Top 1% | 1.3%
13. Frontiers in Genetics | 197 papers in training set | Top 7% | 1.2%
14. European Journal of Epidemiology | 40 papers in training set | Top 0.5% | 1.2%
15. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 2% | 1.2%
16. Scientific Reports | 3102 papers in training set | Top 67% | 1.2%
17. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 7% | 0.9%
18. Bioinformatics Advances | 184 papers in training set | Top 4% | 0.9%
19. BMC Medical Genomics | 36 papers in training set | Top 1% | 0.9%
20. JMIR Medical Informatics | 17 papers in training set | Top 1% | 0.9%
21. PeerJ | 261 papers in training set | Top 14% | 0.8%
22. Briefings in Bioinformatics | 326 papers in training set | Top 7% | 0.7%
23. F1000Research | 79 papers in training set | Top 6% | 0.6%