Deciphering the links between metabolism and health by building small-scale knowledge graphs: application to endometriosis and persistent pollutants
Mathe, M.; Laisney, G.; Filangi, O.; Giacomoni, F.; Delmas, M.; Cano-Sancho, G.; Jourdan, F.; Frainay, C.
Motivation: Knowledge graphs (KGs) are a robust formalism for structuring biomedical knowledge, but large-scale KGs often require complex queries, are difficult for non-experts to explore, and lack real-world context (such as experimental data, clinical conditions, or patient symptoms). This limits their usability for addressing specific research questions.

Results: We present Kg4j, a computational framework built on FORVM (a large-scale KG containing 82 million associations between compounds and biological concepts) that constructs local, keyword-based sub-graphs tailored to biomedical research questions. The resulting graphs support hypothetical relationships and can integrate experimental datasets, enabling the discovery of plausible but as yet unknown connections. Starting from a conceptual definition of a research field of interest (e.g., disease, symptoms, exposure), the framework extracts relevant associations from FORVM and identifies potential biological mechanisms and chemical compounds. We applied this approach to endometriosis, exploring links between exposure to Persistent Organic Pollutants (POPs) and disease risk. We propose a novel validation strategy that compares the resulting sub-graph (2,706 nodes and 23,243 edges, 0.002% of FORVM) with recent scientific literature, showing consistency with known findings while also revealing new hypothetical associations that require further investigation. We also show that removing duplicated nodes and edges from the KG raises the proportion of validated nodes (from 8.4% to 16%) and more than doubles the precision (from 0.085 to 0.197) while leaving recall essentially unchanged (0.954 to 0.952), illustrating a trade-off between the loss of potentially relevant but redundant information and the reliability of the remaining associations. By combining automated knowledge mining with experimental data integration, this framework supports reproducible, context-based exploration of biomedical knowledge and systematic hypothesis generation. Applied to endometriosis, it highlights potential mechanisms linking exposure to POPs to the aetiology of the disease, offering a scalable strategy for constructing disease-specific KGs.

Availability: The code and data underlying this article are available in the MetExplore repository at https://forge.inrae.fr/metexplore/kg4j.

Contact: clement.frainay@inrae.fr

Key Messages
- Kg4j builds targeted knowledge maps from large biomedical databases using simple keywords.
- Keyword-driven exploration reveals the most relevant disease-exposure relationships without navigating millions of connections.
- Applied to endometriosis, the method recovered known links with persistent organic pollutant exposure.
- Removing redundant information and formatting the knowledge graph as a Labeled Property Graph improves the reliability of extracted knowledge.
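
The effect of deduplication on precision and recall reported above can be illustrated with a toy calculation. This is a minimal sketch, not the Kg4j validation code: the abstract does not state how precision and recall were computed, so the definitions below (precision as the share of graph nodes found in a literature-derived reference set, recall as the share of reference concepts present in the graph), the function name `precision_recall`, and all node labels are assumptions made for illustration only.

```python
def precision_recall(graph_nodes, reference):
    """Return (precision, recall) of graph node labels against a reference set.

    graph_nodes: list of node labels, duplicates allowed (hypothetical input).
    reference:   set of labels assumed to be reported in the literature.
    """
    if not graph_nodes or not reference:
        raise ValueError("both inputs must be non-empty")
    # Precision: fraction of graph nodes that appear in the reference set.
    validated = sum(1 for label in graph_nodes if label in reference)
    precision = validated / len(graph_nodes)
    # Recall: fraction of reference concepts present at least once in the graph.
    recall = len(set(graph_nodes) & reference) / len(reference)
    return precision, recall


# Toy data (invented labels, not taken from the Kg4j sub-graph).
raw_nodes = [
    "dioxin", "PCB-153", "estradiol",
    "inflammation", "inflammation",
    "oxidative stress", "oxidative stress", "oxidative stress",
]
reference_set = {"dioxin", "estradiol", "IL-6"}

# Deduplicate while preserving first-seen order.
dedup_nodes = list(dict.fromkeys(raw_nodes))

p_raw, r_raw = precision_recall(raw_nodes, reference_set)
p_dedup, r_dedup = precision_recall(dedup_nodes, reference_set)

print(f"raw:   precision={p_raw:.3f} recall={r_raw:.3f}")
print(f"dedup: precision={p_dedup:.3f} recall={r_dedup:.3f}")
```

In this toy case the duplicated labels are all unvalidated, so removing them shrinks the denominator of precision without removing any reference concept from the graph, which is why precision rises while recall stays the same. Whether real duplicates behave this way depends on the data.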