Development and validation of a machine learning model for community-based tuberculosis screening among persons aged >= 15 years in South Africa and Zambia

Zimmer, A. J.; Loharja, H.; Fentahun Muchie, K.; Koeppel, L.; Ayles, H.; Castro, M. d. M.; Christodoulou, E.; Fox, G. J.; Gaeddert, M.; Hamada, Y.; Isaacs, C.; Kapata, N.; Chanda-Kapata, P.; Karimi, K.; Kasese, N.; Kerkhoff, A.; Law, I.; Maier-Hein, L.; Marx, F. M.; Maimbolwa, M. M.; Moyo, S.; Mthiyane, T.; Muyoyeta, M.; Rocklöv, J.; Schaap, A.; Yerlikaya, S.; Opata, M.; Denkinger, C. M.

2026-04-04 public and global health
10.64898/2026.03.30.26349632 medRxiv
Introduction: Current tuberculosis (TB) screening tools, such as the WHO four-symptom screen (W4SS), lack sufficient sensitivity and specificity for effective community-based active case finding, contributing to both missed diagnoses and unnecessary diagnostic evaluations. This study aimed to develop and validate a machine learning (ML) model to improve TB risk prediction among persons aged >=15 years in community settings of Zambia and South Africa.

Methods: A large, harmonized dataset was created from four community-based TB prevalence surveys in South Africa and Zambia (N=169,813), restricted to individuals not under TB treatment at the time of the survey. A binary reference outcome was defined based on available microbiological and radiographic data, grouping individuals as either 'Possible TB' or 'Unlikely TB'. An XGBoost model was trained on 80% (N=135,854) of the data using demographic, clinical, and socio-economic variables, and model interpretability was assessed using SHapley Additive exPlanations (SHAP) values. Internal validation was performed on a 20% hold-out test set (N=33,959). Model performance was assessed using discrimination, calibration, and clinical utility measures, compared against the W4SS and against WHO's 2025 Target Product Profile (TPP) for a tool in a two-step screening algorithm.

Results: Overall, 16,413 (9.7%) individuals were labelled as 'Possible TB'. On the test set, the XGBoost model yielded an area under the curve (AUC) of 79.7% (95% CI: 78.7, 80.7), outperforming the W4SS (AUC 57.0%; 95% CI: 56.1, 57.8). The XGBoost model achieved 81.5% sensitivity (95% CI: 77.6, 84.9) at a 60% specificity threshold, exceeding the W4SS, which achieved only 38.2% sensitivity (95% CI: 36.5, 39.9) on the same dataset. SHAP analysis identified age, previous TB treatment, number of times treated for TB, and unemployment as the primary contributors to predicted risk.

Conclusion: The XGBoost ML model shows promise as a screening tool to support community-based active case finding prior to diagnostic testing. However, performance remained below the TPP targets; adding further variables, e.g. on geolocation, could be considered.
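The evaluation described in the abstract — an 80/20 hold-out split, a gradient-boosted classifier, AUC, and sensitivity read off at a fixed 60% specificity — can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the authors' code: scikit-learn's GradientBoostingClassifier stands in for XGBoost, the features are random stand-ins for the survey variables, and the ~10% positive rate mimics the reported 9.7% 'Possible TB' prevalence.

```python
# Illustrative sketch (not the authors' pipeline): train a gradient-boosted
# binary classifier and report AUC plus sensitivity at 60% specificity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the harmonized survey dataset, with ~10% positives
# to mirror the paper's 9.7% 'Possible TB' label prevalence.
X, y = make_classification(n_samples=20_000, n_features=12, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)

# 80/20 train / hold-out split, matching the internal validation design.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# Discrimination: area under the ROC curve on the hold-out set.
auc = roc_auc_score(y_te, scores)

# Sensitivity at a fixed 60% specificity: walk the ROC curve and take the
# highest true-positive rate among thresholds with false-positive rate <= 40%.
fpr, tpr, _ = roc_curve(y_te, scores)
sens_at_60_spec = tpr[fpr <= 0.40].max()

print(f"AUC: {auc:.3f}, sensitivity at 60% specificity: {sens_at_60_spec:.3f}")
```

With XGBoost itself, `model` would be replaced by `xgboost.XGBClassifier`, and the SHAP analysis mentioned in the abstract would be run on the fitted model; the ROC-based threshold selection is unchanged.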

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | PLOS Global Public Health | 293 | Top 0.7% | 12.7%
2 | PLOS Digital Health | 91 | Top 0.1% | 12.5%
3 | PLOS ONE | 4510 | Top 19% | 10.1%
4 | Thorax | 32 | Top 0.1% | 6.8%
5 | BMC Infectious Diseases | 118 | Top 0.7% | 4.4%
6 | Scientific Reports | 3102 | Top 29% | 4.2%
-- 50% of probability mass above this point --
7 | Epidemiology and Infection | 84 | Top 0.5% | 3.6%
8 | Clinical Infectious Diseases | 231 | Top 2% | 3.1%
9 | Frontiers in Public Health | 140 | Top 3% | 2.7%
10 | Frontiers in Medicine | 113 | Top 3% | 1.7%
11 | BMJ Global Health | 98 | Top 2% | 1.7%
12 | Tropical Medicine & International Health | 15 | Top 0.3% | 1.5%
13 | Journal of Clinical Microbiology | 120 | Top 1% | 1.3%
14 | BMC Public Health | 147 | Top 4% | 1.3%
15 | JMIR Public Health and Surveillance | 45 | Top 2% | 1.2%
16 | JAC-Antimicrobial Resistance | 13 | Top 0.3% | 1.2%
17 | Journal of Infection | 71 | Top 2% | 1.2%
18 | The Lancet Global Health | 24 | Top 0.8% | 1.2%
19 | The Lancet Digital Health | 25 | Top 0.6% | 1.2%
20 | EClinicalMedicine | 21 | Top 0.5% | 1.2%
21 | Diagnostics | 48 | Top 1% | 1.2%
22 | The American Journal of Tropical Medicine and Hygiene | 60 | Top 3% | 0.9%
23 | BMC Medical Research Methodology | 43 | Top 1% | 0.9%
24 | BMC Medicine | 163 | Top 6% | 0.8%
25 | The Journal of Infectious Diseases | 182 | Top 5% | 0.8%
26 | PLOS Computational Biology | 1633 | Top 23% | 0.8%
27 | International Journal of Epidemiology | 74 | Top 3% | 0.7%
28 | American Journal of Epidemiology | 57 | Top 1% | 0.7%
29 | International Journal of Infectious Diseases | 126 | Top 4% | 0.7%
30 | Open Forum Infectious Diseases | 134 | Top 3% | 0.6%