Agentic Authoring of OMOP Concept Sets from Natural Language

Chen, H.; He, X.; Dai, H.; Huang, Y.; Liu, M.; Bian, J.

2026-06-03 health informatics
10.64898/2026.06.02.26354704 medRxiv

Authoring OMOP concept sets from free-text descriptions remains a major bottleneck in scalable computable phenotyping for observational research. Existing tools support parts of this workflow but are designed primarily for interactive expert use rather than for autonomous large language model (LLM) agents. We present an agentic framework that automatically generates OMOP concept sets by combining vocabulary tools, ontology extensions (RxClass, LOINC, and Disease Ontology), and procedural guidance. In ablation studies, the best configuration achieved a Recall@100 of 0.965 and an AP@100 of 0.875 on the development set. Cohort-level validation against OMOP-mapped EHR data yielded a precision of 0.970, a recall of 0.998, and a Jaccard index of 0.968. On an independent silver-standard benchmark of 457 concept-vocabulary pairs from 15 Alzheimer's disease and related dementias (AD/ADRD) target trial emulation studies, Recall@100 reached 0.835 and AP@100 reached 0.786. Task-specific tools outperformed both unrestricted SQL access and PHOEBE 2.0, and progressive guidance was the best-performing guidance strategy.
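The abstract reports ranking metrics (Recall@100, AP@100) and set-overlap metrics (precision, recall, Jaccard index) without defining them. The sketch below shows one common way to compute such metrics, assuming the standard information-retrieval definitions; the paper's exact formulas, in particular the normalisation used for AP@100, may differ. The function names and the toy concept IDs are hypothetical and do not come from the paper.

```python
from typing import Hashable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 100) -> float:
    """Fraction of the reference concepts that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 100) -> float:
    """Sum of precision@i over the ranks i <= k that hold a reference concept.

    Assumption: the sum is normalised by min(len(relevant), k), and the
    ranking contains no duplicate entries. Other normalisations are in use.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for i, concept in enumerate(ranked[:k], start=1):
        if concept in relevant:
            hits += 1
            precision_sum += hits / i  # precision at rank i
    return precision_sum / min(len(relevant), k)


def set_overlap(predicted: Set[Hashable], reference: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall, and Jaccard index between two sets (e.g. patient IDs)."""
    tp = len(predicted & reference)
    union = len(predicted | reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    jaccard = tp / union if union else 0.0
    return precision, recall, jaccard


if __name__ == "__main__":
    # Toy, made-up concept IDs: two of the three reference concepts are retrieved.
    ranked_ids = [101, 102, 103, 104, 105]
    reference_ids = {101, 103, 106}
    print(recall_at_k(ranked_ids, reference_ids, k=5))
    print(average_precision_at_k(ranked_ids, reference_ids, k=5))
    print(set_overlap({1, 2, 3, 4}, {2, 3, 4, 5}))
```

For a single pair of sets, the three overlap metrics are linked by 1/J = 1/P + 1/R - 1. Plugging in the reported precision of 0.970 and recall of 0.998 gives a Jaccard index of about 0.968, which agrees with the reported value. The identity need not hold exactly if the reported figures are averages over several cohorts.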

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. Nature Communications | 4913 papers in training set | Top 9% | 15.0%
2. npj Digital Medicine | 97 papers in training set | Top 0.5% | 10.6%
3. Journal of the American Medical Informatics Association | 61 papers in training set | Top 0.3% | 9.3%
4. The Lancet Digital Health | 25 papers in training set | Top 0.1% | 7.3%
5. Bioinformatics | 1061 papers in training set | Top 4% | 6.9%
6. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.2% | 4.4%

50% of probability mass above

7. Med | 38 papers in training set | Top 0.1% | 3.1%
8. Scientific Reports | 3102 papers in training set | Top 44% | 2.6%
9. PLOS ONE | 4510 papers in training set | Top 49% | 1.9%
10. Genome Medicine | 154 papers in training set | Top 4% | 1.8%
11. JAMIA Open | 37 papers in training set | Top 0.7% | 1.8%
12. Nature Computational Science | 50 papers in training set | Top 0.6% | 1.7%
13. Journal of Biomedical Informatics | 45 papers in training set | Top 0.8% | 1.7%
14. Nature Genetics | 240 papers in training set | Top 4% | 1.7%
15. Science Translational Medicine | 111 papers in training set | Top 3% | 1.5%
16. Patterns | 70 papers in training set | Top 1% | 1.2%
17. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 1.2%
18. European Journal of Epidemiology | 40 papers in training set | Top 0.5% | 1.2%
19. Nature Medicine | 117 papers in training set | Top 3% | 1.2%
20. Scientific Data | 174 papers in training set | Top 2% | 1.2%
21. GENETICS | 189 papers in training set | Top 1.0% | 1.1%
22. Nature | 575 papers in training set | Top 16% | 0.8%
23. Annals of Internal Medicine | 27 papers in training set | Top 1% | 0.7%
24. PLOS Computational Biology | 1633 papers in training set | Top 25% | 0.7%
25. eBioMedicine | 130 papers in training set | Top 5% | 0.7%
26. GigaScience | 172 papers in training set | Top 3% | 0.7%
27. Science | 429 papers in training set | Top 21% | 0.7%
28. Science Advances | 1098 papers in training set | Top 33% | 0.7%
29. eLife | 5422 papers in training set | Top 63% | 0.5%
30. Communications Biology | 886 papers in training set | Top 32% | 0.5%