
PheBee: A Graph-Aware System for Scalable, Traceable, and Semantic Phenotyping

Gordon, D. M.; Homilius, M.; Antoniou, A. A.; Grannis, C.; Lammi, G. E.; Herman, A. C.; Kubatko, A.; Chaudhari, B. P.; White, P.

2026-05-13 health informatics
10.64898/2026.05.09.26352812 medRxiv

Objectives: Phenotype-driven workflows in clinical and translational research require standardized ontology-based representation, ontology-aware cohort discovery, and provenance inspection for each assertion. Existing approaches optimize for either semantic traversal or scalable batch analytics, but not both. We describe PheBee, a hybrid system that links semantic assertions to scalable evidence storage via a deterministic identifier, preserving provenance while supporting ontology-aware discovery at cohort scale.

Materials and Methods: PheBee represents phenotype assertions in a knowledge graph as ontology-linked nodes with clinical modifier context (e.g., negated, family history), and stores supporting evidence records in a scalable row-oriented evidence table for cohort-scale access. The two layers are connected by a deterministic identifier that enables stable joins across repeated ingestions without duplicating high-volume evidence in the graph. We evaluated PheBee using synthetic datasets designed to exercise end-to-end ingestion and query workflows.

Results: Functional evaluation validated hierarchical term expansion, qualifier-aware retrieval, duplicate-free assertion handling under re-ingestion, and privacy-conscious management of subjects shared across multiple research projects. At scale (10,000 subjects producing 12M evidence records), PheBee completed ingestion in ~30 minutes and responded to interactive queries within 6 seconds under concurrent load.

Discussion: PheBee exposes a unified API for ontology-aware cohort discovery with hierarchical term expansion, subject-centric retrieval of phenotypes and clinical modifiers, and evidence and provenance queries. Its data model aligns with GA4GH Phenopackets, facilitating interoperability with phenotype exchange standards.

Conclusion: By combining ontology-aware semantics with scalable, provenance-bearing evidence storage, PheBee provides a practical open-source foundation for phenotype-driven research workflows that demand both semantic precision and cohort-scale traceability.

Lay Summary: Researchers often use "phenotypes" (observable clinical features) to describe individual subjects and find groups of similar subjects. Those phenotypes come from many sources and need both standard terminology and clear evidence for why a phenotype has been associated with a subject. PheBee is a software system that stores phenotype assertions in a way that supports both "ontology-aware" searching (for example, finding patients with any subtype of a condition) and scalable storage of supporting evidence across large research cohorts. PheBee uses multiple types of data storage so researchers can perform interactive phenotype searches and also store millions of pieces of supporting evidence. A shared identifier connects the two storage layers, so subjects' phenotypes and their supporting evidence remain linked even as new data is added over time. We evaluated PheBee using fully synthetic (non-patient) data to confirm correct query behavior, evidence traceability, and system performance at large scale.
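The role of the deterministic identifier described in the Materials and Methods can be illustrated with a minimal Python sketch. It is not PheBee's implementation: the hashing scheme, the `assertion_id` function, the `HybridStore` class, the `evidence_key` field, and the in-memory dictionaries standing in for the graph and the evidence table are all assumptions made for illustration. The sketch only shows why deriving the key from the assertion's content lets a repeated ingestion land on the same graph node and the same evidence row instead of creating duplicates.

```python
import hashlib
import json


def assertion_id(subject_id, term_id, qualifiers=frozenset()):
    """Hypothetical deterministic key: identical inputs always give the same ID."""
    payload = json.dumps(
        {
            "subject": subject_id,
            "term": term_id,
            # Sort qualifiers so that their input order cannot change the hash.
            "qualifiers": sorted(qualifiers),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class HybridStore:
    """Toy stand-in for the two layers: a graph of assertions and an evidence table."""

    def __init__(self):
        self.graph = {}     # assertion_id -> assertion node (assumed shape)
        self.evidence = {}  # (assertion_id, evidence_key) -> evidence row

    def ingest(self, subject_id, term_id, qualifiers, evidence_key, evidence):
        aid = assertion_id(subject_id, term_id, frozenset(qualifiers))
        # setdefault keeps the existing node, so re-ingestion adds no duplicate.
        self.graph.setdefault(
            aid,
            {"subject": subject_id, "term": term_id, "qualifiers": sorted(qualifiers)},
        )
        # Evidence is keyed by the same ID; a repeated row overwrites itself.
        self.evidence[(aid, evidence_key)] = evidence
        return aid

    def evidence_for(self, aid):
        """Join from a graph assertion to its evidence rows via the shared ID."""
        return [row for (a, _), row in self.evidence.items() if a == aid]


if __name__ == "__main__":
    store = HybridStore()
    # "HP:0001250" is used only as an example ontology term ID.
    for _ in range(2):  # ingest the same record twice
        aid = store.ingest("subj-1", "HP:0001250", ["family_history"], "note-7", {"source": "note-7"})
    print(len(store.graph), len(store.evidence), len(store.evidence_for(aid)))
```

In this sketch the graph holds only the lightweight assertion node, while the evidence rows live in a separate structure keyed by the same identifier, which mirrors the separation the abstract describes. A real system would also need a stable evidence key and a canonical form for qualifiers, both of which are assumed here.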

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.1% | 35.2%
2. JAMIA Open | 37 papers in training set | Top 0.1% | 8.6%
3. Bioinformatics | 1061 papers in training set | Top 3% | 7.4%
(50% of probability mass above)
4. npj Digital Medicine | 97 papers in training set | Top 0.7% | 7.0%
5. Nature Communications | 4913 papers in training set | Top 28% | 6.5%
6. Journal of Biomedical Informatics | 45 papers in training set | Top 0.4% | 3.7%
7. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.3% | 2.1%
8. Patterns | 70 papers in training set | Top 0.7% | 1.8%
9. GigaScience | 172 papers in training set | Top 2% | 1.4%
10. PLOS ONE | 4510 papers in training set | Top 57% | 1.4%
11. European Journal of Epidemiology | 40 papers in training set | Top 0.4% | 1.4%
12. The Lancet Digital Health | 25 papers in training set | Top 0.6% | 1.3%
13. Journal of Medical Internet Research | 85 papers in training set | Top 3% | 1.1%
14. PLOS Digital Health | 91 papers in training set | Top 2% | 1.0%
15. Med | 38 papers in training set | Top 0.5% | 1.0%
16. GENETICS | 189 papers in training set | Top 1% | 0.9%
17. Journal of Clinical and Translational Science | 11 papers in training set | Top 0.3% | 0.9%
18. Scientific Reports | 3102 papers in training set | Top 72% | 0.8%
19. PLOS Computational Biology | 1633 papers in training set | Top 23% | 0.8%
20. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 3% | 0.7%
21. Scientific Data | 174 papers in training set | Top 2% | 0.7%
22. Human Mutation | 29 papers in training set | Top 0.8% | 0.7%
23. BMJ Health & Care Informatics | 13 papers in training set | Top 1% | 0.7%
24. Bioinformatics Advances | 184 papers in training set | Top 5% | 0.7%
25. BMC Bioinformatics | 383 papers in training set | Top 8% | 0.5%
26. BMC Medical Research Methodology | 43 papers in training set | Top 2% | 0.5%