CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data

Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.

2026-04-23 health informatics
10.64898/2026.04.22.26351461 medRxiv

Abstract

Objective: To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis.

Materials and Methods: We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation, and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step.

Results: We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked.

Discussion: CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data.

Conclusion: CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.
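
The dimensionality-reduction percentages quoted in the Results can be reproduced from the reported concept counts. The short Python sketch below is only an arithmetic check, not part of CohortContrast; the helper name `reduction_percent` is hypothetical and introduced here for illustration.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage of concepts removed when going from `before` to `after`."""
    return round(100.0 * (before - after) / before, 1)


# Concept counts as reported in the abstract (before, after the workflow).
cohorts = {
    "lung cancer": (5793, 296),
    "prostate cancer": (5759, 170),
}

for name, (before, after) in cohorts.items():
    pct = reduction_percent(before, after)
    print(f"{name}: {before} -> {after} concepts, {pct}% reduction")
```

Both values agree with the abstract once rounded to one decimal place (94.9% and 97.0%).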

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Journal of the American Medical Informatics Association, 61 papers in training set, Top 0.1%, 18.5%
2. npj Digital Medicine, 97 papers in training set, Top 0.3%, 14.6%
3. Journal of Biomedical Informatics, 45 papers in training set, Top 0.1%, 14.3%
4. JCO Clinical Cancer Informatics, 18 papers in training set, Top 0.1%, 6.3%

50% of probability mass above

5. JAMIA Open, 37 papers in training set, Top 0.4%, 3.9%
6. BMC Medical Informatics and Decision Making, 39 papers in training set, Top 0.7%, 3.8%
7. Bioinformatics, 1061 papers in training set, Top 5%, 3.6%
8. Scientific Reports, 3102 papers in training set, Top 44%, 2.7%
9. JMIR Medical Informatics, 17 papers in training set, Top 0.5%, 2.6%
10. International Journal of Medical Informatics, 25 papers in training set, Top 0.5%, 2.6%
11. Nature Communications, 4913 papers in training set, Top 45%, 2.6%
12. The Lancet Digital Health, 25 papers in training set, Top 0.2%, 2.4%
13. Artificial Intelligence in Medicine, 15 papers in training set, Top 0.3%, 1.8%
14. Journal of Medical Internet Research, 85 papers in training set, Top 3%, 1.7%
15. PLOS ONE, 4510 papers in training set, Top 57%, 1.5%
16. BMC Medical Research Methodology, 43 papers in training set, Top 0.7%, 1.5%
17. BMC Bioinformatics, 383 papers in training set, Top 5%, 1.5%
18. GigaScience, 172 papers in training set, Top 2%, 0.9%
19. Frontiers in Digital Health, 20 papers in training set, Top 1%, 0.8%
20. Med, 38 papers in training set, Top 0.7%, 0.8%
21. Scientific Data, 174 papers in training set, Top 2%, 0.7%
22. iScience, 1063 papers in training set, Top 35%, 0.7%
23. BMC Medicine, 163 papers in training set, Top 8%, 0.6%
24. NAR Genomics and Bioinformatics, 214 papers in training set, Top 4%, 0.6%
25. European Respiratory Journal, 54 papers in training set, Top 2%, 0.6%