Unverified Vendor Claims and Preventable Harms: A Mixed-Methods Longitudinal Independent Audit of Health AI System Performance in Nigeria
Uzochukwu, B. S. C.; Cherima, Y. J.; Enebeli, U. U.; Hassan, B.; Okeke, C. C.; Uzochukwu, A. C.; Omoha, A.; Uzochukwu, K. A.; Kalu, E. I.; Victor, D.; Alih, H. E.; Matinja, L. S.; Rindap, I. T.
Objective: To independently audit vendor-reported performance claims of health AI systems deployed in Nigeria and to assess discrepancies, clinical consequences, equity impacts, and implications for safe AI deployment in low- and middle-income countries.

Methods and analysis: We conducted a mixed-methods longitudinal audit (October 2024-March 2026) of six health AI systems (chest X-ray interpretation, TB screening, symptom triage, maternal health risk prediction, patient history intake, and health chatbots) across 73 diverse health facilities in six Nigerian states, involving 52,000 patients and 45 key informant interviews with stakeholders. All data were sourced from integrated facility-level records, and no database linkage was performed. Vendor claims were abstracted from documentation, white papers, and validation studies. Real-world performance was verified by an independent third party using system logs, patient records, clinical outcomes, and stakeholder interviews. Performance gaps were quantified as absolute percentage-point differences; clinical harms were estimated using patient volume and bootstrap confidence intervals; equity impacts were assessed across vulnerability dimensions (geography, age, income, comorbidities, infrastructure) using interaction terms in mixed-effects models and an Equity Harm Index (EHI).

Results: Vendor-reported accuracy averaged 91.5%, while independently measured real-world accuracy averaged 67.3%, yielding a mean performance gap of 24.2 percentage points (95% CI: 21.5 to 26.9; p<0.001) across systems. Gaps ranged from 17 to 35 percentage points and were statistically significant for all systems. These discrepancies translated into substantial preventable harm, including an estimated 1,247 undetected TB cases (186 preventable deaths) and 342 misclassified high-risk pregnancies annually. Performance gaps were 28-38% larger among vulnerable groups (for example, rural patients had a 38% higher EHI). Gaps were classified as systematic, context-dependent, or population-dependent.

Conclusion: Vendor-reported performance metrics substantially overstated the real-world effectiveness of health AI in Nigeria, leading to preventable patient harm and widening inequities. Mandatory independent post-deployment verification, analogous to pharmaceutical Phase IV surveillance, is essential to ensure safe, equitable AI use in resource-constrained settings. Donors and regulators should prioritize verification over trust-based deployment.
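The gap metric described in the abstract is simple arithmetic, and a short Python sketch may make it concrete. The first function reproduces the headline figure from the two averages reported above (91.5% claimed, 67.3% measured). The second shows one common way to attach a percentile bootstrap interval to a mean of per-system gaps. The abstract does not state how its interval for the gap was computed, nor how the bootstrap was set up for the harm estimates, so the resampling scheme here is an assumption. The function names and the `illustrative_gaps` values are hypothetical placeholders, not data from the study.

```python
import random
import statistics


def percentage_point_gap(claimed_pct, observed_pct):
    """Absolute gap in percentage points between claimed and observed accuracy."""
    return claimed_pct - observed_pct


def bootstrap_mean_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: plain resampling with replacement over the listed values.
    The audit's actual resampling scheme is not given in the abstract.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Headline gap from the two averages reported in the abstract.
headline_gap = percentage_point_gap(91.5, 67.3)
print(f"Mean gap: {headline_gap:.1f} percentage points")

# Hypothetical per-system gaps, for illustration only (not study data).
illustrative_gaps = [10.0, 15.0, 20.0, 25.0, 30.0]
ci_low, ci_high = bootstrap_mean_ci(illustrative_gaps)
print(f"Illustrative 95% CI for the mean gap: {ci_low:.1f} to {ci_high:.1f}")
```

With only a handful of systems, a percentile bootstrap interval is coarse. An audit of this kind would more plausibly resample at the patient or facility level, which this sketch does not attempt.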