Agentic Authoring of OMOP Concept Sets from Natural Language

Chen, H.; He, X.; Dai, H.; Huang, Y.; Liu, M.; Bian, J.

2026-06-03 health informatics

10.64898/2026.06.02.26354704 medRxiv


Authoring OMOP concept sets from free-text descriptions remains a major bottleneck in scalable computable phenotyping for observational research. Existing tools support parts of this workflow but are designed primarily for interactive expert use rather than for autonomous large language model (LLM) agents. We present an agentic framework that automatically generates OMOP concept sets by combining vocabulary tools, ontology extensions (RxClass, LOINC, and Disease Ontology), and procedural guidance. In ablation studies, the best configuration achieved a Recall@100 of 0.965 and an AP@100 of 0.875 on the development set. Cohort-level validation against OMOP-mapped EHR data yielded a precision of 0.970, a recall of 0.998, and a Jaccard index of 0.968. On an independent silver-standard benchmark of 457 concept-vocabulary pairs from 15 Alzheimer's disease and related dementias (AD/ADRD) target trial emulation studies, Recall@100 reached 0.835 and AP@100 reached 0.786. Task-specific tools outperformed both unrestricted SQL access and PHOEBE 2.0, and progressive guidance performed best.
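The metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so the ones below are assumptions: Recall@k is the share of reference concepts found in the top k ranked candidates, AP@k is normalized by min(k, number of reference concepts), and precision, recall, and the Jaccard index are computed on plain sets of identifiers. All function names are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 100) -> float:
    """Share of reference concepts that appear among the top-k ranked candidates."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 100) -> float:
    """Average precision over the top-k candidates.

    Assumption: normalized by min(k, |relevant|); other conventions exist.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    seen = set()
    for rank, concept_id in enumerate(ranked[:k], start=1):
        # Count each reference concept once, at its first position in the ranking.
        if concept_id in relevant and concept_id not in seen:
            hits += 1
            total += hits / rank
        seen.add(concept_id)
    return total / min(k, len(relevant))


def set_overlap(predicted: Set[Hashable], reference: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall, and Jaccard index between two sets of identifiers."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    union = len(predicted | reference)
    jaccard = tp / union if union else 0.0
    return precision, recall, jaccard


def jaccard_from_precision_recall(precision: float, recall: float) -> float:
    """Jaccard index implied by precision and recall measured on the same pair of sets."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)


if __name__ == "__main__":
    ranked = [1, 2, 3, 4]      # toy candidate concept IDs, best first
    reference = {1, 3, 5}      # toy reference concept set
    print(recall_at_k(ranked, reference, k=4))
    print(average_precision_at_k(ranked, reference, k=4))
    print(set_overlap({1, 2, 3}, reference))
    print(round(jaccard_from_precision_recall(0.970, 0.998), 3))
```

When precision and recall are measured on a single pair of sets, the Jaccard index follows from them as 1 / (1/P + 1/R - 1). Plugging in the reported precision of 0.970 and recall of 0.998 gives about 0.968, which matches the reported Jaccard index. If the paper averages these metrics across cohorts rather than pooling them, the identity would hold only approximately.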

Matching journals

● Non-profit ◐ University press ○ Commercial

The top 6 journals account for 50% of the predicted probability mass.


Nature Communications

○ 4913 papers in training set

npj Digital Medicine

○ 97 papers in training set

Journal of the American Medical Informatics Association

◐ 61 papers in training set

The Lancet Digital Health

○ 25 papers in training set

◐ 1061 papers in training set

JCO Clinical Cancer Informatics

● 18 papers in training set

50% of probability mass above

○ 38 papers in training set

Scientific Reports

○ 3102 papers in training set

● 4510 papers in training set

Genome Medicine

○ 154 papers in training set

◐ 37 papers in training set

Nature Computational Science

○ 50 papers in training set

Journal of Biomedical Informatics

○ 45 papers in training set

Nature Genetics

○ 240 papers in training set

Science Translational Medicine

● 111 papers in training set

○ 70 papers in training set

NAR Genomics and Bioinformatics

◐ 214 papers in training set

European Journal of Epidemiology

○ 40 papers in training set

Nature Medicine

○ 117 papers in training set

Scientific Data

○ 174 papers in training set

◐ 189 papers in training set

○ 575 papers in training set

Annals of Internal Medicine

● 27 papers in training set

PLOS Computational Biology

● 1633 papers in training set

○ 130 papers in training set

◐ 172 papers in training set

● 429 papers in training set

Science Advances

● 1098 papers in training set

● 5422 papers in training set

Communications Biology

○ 886 papers in training set