Causal Language Detection using Text-Document Features: Methodology and Insights from 10 Years of Gut Microbiome Research

Tskhay, A.; Longo, C.; Moldakozhayev, A.; Kang, N.; Greenwood, C. M.; Behruzi, R.; Kubow, S.; Schuster, T.

2026-03-04 scientific communication and education
10.64898/2026.03.02.709039 bioRxiv

Detecting causal language in scientific literature is critical for understanding how research fields frame evidence and inform interventions and policies, yet existing approaches commonly rely on manual annotation. The objective of this study was to evaluate four classifiers for detecting causal language and to apply the best-performing model to assess trends in microbiome research, whose rapidly expanding observational literature makes it a relevant case study. We extracted Term Frequency-Inverse Document Frequency (TF-IDF) features from the last three sentences of available publication abstracts and trained four classifiers (L1- and L2-regularized logistic regression, Random Forest, and eXtreme Gradient Boosting) to detect causal language. A total of 475 sentences were manually labeled as causal or non-causal following established guidelines for the systematic evaluation of causal language in observational health research; this sample size was determined pragmatically, based on annotation feasibility and the observed stabilization of model performance. Of these sentences, 75% were used for training and 25% for testing. L1-regularized logistic regression achieved the highest performance (accuracy 76%, F1 72%, prevalence detection accuracy 95%, sensitivity 72%, and specificity 80%) and was applied to 20,022 human gut microbiome abstracts published between 2015 and 2025, which were grouped into 20 thematic topics using structural topic modeling. Predicted causal language prevalence declined from 52% to 44% between 2015 and 2018, then rose to 51% by 2025, with notable variation across topics (range: 43.1-53.3%). Temporal trends differed across subfields, with increases in Metabolic disorders and Fecal microbiota transplantation, and decreases in Biomarkers and prediction, Antibiotic resistance, and In vitro fermentation. Analysis of influential words confirmed that causal meaning is primarily driven by verbs and modifiers that lexically signal change or intervention.
The proposed approach for identifying causal claims in scientific abstracts enables systematic, automated, and scalable assessment of how evidence is framed. Its application to the microbiome field highlighted heterogeneity in the reporting of causal relationships and can inform the interpretation of microbiome findings for clinical and public health decision-making.
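
The feature-extraction and evaluation steps described in the abstract can be illustrated with a minimal, standard-library Python sketch. This is not the authors' code. The sentence splitter, the tokenizer, the smoothed IDF variant, the function names, and the toy abstract and labels are all assumptions made for illustration, and the classifier itself (for example, L1-regularized logistic regression) is omitted.

```python
import math
import re
from collections import Counter


def last_sentences(abstract, n=3):
    """Return the final n sentences of an abstract.

    Assumption: a naive split on '.', '!' or '?' followed by whitespace.
    Abbreviations such as "e.g." would be split incorrectly.
    """
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p][-n:]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(sentences):
    """Compute one sparse TF-IDF vector (dict) per sentence.

    Uses relative term frequency and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1, chosen here for illustration.
    """
    tokenized = [tokenize(s) for s in sentences]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for labels 1 (causal) / 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy abstract (invented for illustration, not taken from the study corpus).
toy_abstract = (
    "The gut microbiome varies widely. We profiled stool samples. "
    "Fiber intake was associated with higher diversity. "
    "Fiber intake increases butyrate production. Further trials are needed."
)
sentences = last_sentences(toy_abstract)
vectors = tfidf(sentences)

# Toy labels and predictions (invented), 1 = causal, 0 = non-causal.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

In this toy example, a verb that appears in only one sentence ("increases") receives a higher weight than a term shared across sentences ("fiber"), which is the property that lets a sparse linear model pick out distinctive lexical cues.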

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. PLOS Biology: 408 papers in training set; Top 0.1%; 28.9%
2. Nature Biotechnology: 147 papers in training set; Top 0.3%; 15.4%
3. eLife: 5422 papers in training set; Top 5%; 10.5%
(50% of probability mass above this line)
4. PLOS Computational Biology: 1633 papers in training set; Top 6%; 5.1%
5. PLOS ONE: 4510 papers in training set; Top 35%; 4.1%
6. Microbiome: 139 papers in training set; Top 0.9%; 3.8%
7. Nature Communications: 4913 papers in training set; Top 38%; 3.8%
8. Scientific Reports: 3102 papers in training set; Top 52%; 2.0%
9. Gut Microbes: 70 papers in training set; Top 0.5%; 1.7%
10. Philosophical Transactions of the Royal Society B: 51 papers in training set; Top 3%; 1.6%
11. Bioinformatics: 1061 papers in training set; Top 8%; 1.0%
12. Communications Biology: 886 papers in training set; Top 20%; 0.8%
13. Genomics, Proteomics & Bioinformatics: 171 papers in training set; Top 5%; 0.8%
14. mSystems: 361 papers in training set; Top 7%; 0.8%
15. Nature Human Behaviour: 85 papers in training set; Top 4%; 0.8%
16. Nature Microbiology: 133 papers in training set; Top 5%; 0.7%
17. The FEBS Journal: 78 papers in training set; Top 1%; 0.7%
18. eNeuro: 389 papers in training set; Top 10%; 0.7%
19. Nature Neuroscience: 216 papers in training set; Top 7%; 0.7%
20. mSphere: 281 papers in training set; Top 7%; 0.7%
21. F1000Research: 79 papers in training set; Top 5%; 0.7%
22. Genome Biology: 555 papers in training set; Top 8%; 0.7%
23. Nature Genetics: 240 papers in training set; Top 8%; 0.7%
24. Molecular Systems Biology: 142 papers in training set; Top 2%; 0.5%
25. Genome Medicine: 154 papers in training set; Top 10%; 0.5%
26. Nature Methods: 336 papers in training set; Top 7%; 0.5%
27. Cell: 370 papers in training set; Top 19%; 0.5%