
Extraction of Crohn's Disease Clinical Phenotypes from Clinical Text Using Natural Language Processing

Schmidt, L.; Ibing, S.; Borchert, F.; Hugo, J.; Marshall, A.; Peraza, J.; Cho, J. H.; Bottinger, E. P.; Ungaro, R. C.

2023-10-16 gastroenterology
10.1101/2023.10.16.23297099 medRxiv

Real-world studies based on electronic health records often require manual chart review to derive patients' clinical phenotypes, a labor-intensive task with limited scalability. Here, we developed and compared computable phenotyping based on rules using the spaCy framework and a Large Language Model (LLM), GPT-4, for disease behavior and age at diagnosis of Crohn's disease patients. We are the first to describe computable phenotyping algorithms using clinical texts for these complex tasks with previously described inter-annotator agreements between 0.54 and 0.98. The data comprised clinical notes and radiology reports from 584 Mount Sinai Health System patients. Overall, we observed similar or better performance using GPT-4 compared to the rules. On a note level, the F1 score was at least 0.90 for disease behavior and 0.82 for age at diagnosis. We could not find statistical evidence for a difference from the performance of human experts on this task. Our findings underline the potential of LLMs for computable phenotyping. [Graphical abstract: Figure 1]
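For readers unfamiliar with the metric, the note-level F1 score mentioned in the abstract is the harmonic mean of precision and recall over per-note predictions. The following is a minimal sketch, assuming a binary label per note (for example, whether a given disease behavior is documented); the function name and the toy labels are purely illustrative and are not taken from the study, which may have used a different averaging scheme.

```python
def f1_score_binary(gold, predicted):
    """Compute precision, recall, and F1 for binary per-note labels (1 = positive)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy labels, invented for illustration only (not study data).
gold_labels = [1, 1, 0, 1, 0, 0]
model_labels = [1, 0, 0, 1, 1, 0]

precision, recall, f1 = f1_score_binary(gold_labels, model_labels)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.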

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

1. Computer Methods and Programs in Biomedicine: 27 papers in training set, Top 0.1%, 13.0%
2. Scientific Reports: 3102 papers in training set, Top 4%, 10.8%
3. PLOS Digital Health: 91 papers in training set, Top 0.3%, 7.4%
4. PLOS ONE: 4510 papers in training set, Top 30%, 5.0%
5. iScience: 1063 papers in training set, Top 4%, 3.7%
6. Journal of Biomedical Informatics: 45 papers in training set, Top 0.5%, 3.2%
7. BMC Medical Informatics and Decision Making: 39 papers in training set, Top 1.0%, 2.8%
8. GigaScience: 172 papers in training set, Top 0.7%, 2.7%
9. Biomedicines: 66 papers in training set, Top 0.4%, 2.5%
50% of probability mass above
10. International Journal of Medical Informatics: 25 papers in training set, Top 0.6%, 2.5%
11. eLife: 5422 papers in training set, Top 34%, 2.2%
12. Heliyon: 146 papers in training set, Top 0.9%, 2.2%
13. Frontiers in Oncology: 95 papers in training set, Top 2%, 2.0%
14. Frontiers in Physiology: 93 papers in training set, Top 2%, 2.0%
15. Journal of the American Medical Informatics Association: 61 papers in training set, Top 1%, 2.0%
16. Metabolites: 50 papers in training set, Top 0.4%, 1.9%
17. Frontiers in Medicine: 113 papers in training set, Top 3%, 1.8%
18. Database: 51 papers in training set, Top 0.4%, 1.5%
19. PeerJ: 261 papers in training set, Top 10%, 1.3%
20. eBioMedicine: 130 papers in training set, Top 3%, 1.0%
21. npj Digital Medicine: 97 papers in training set, Top 3%, 1.0%
22. JMIR Medical Informatics: 17 papers in training set, Top 1%, 0.9%
23. Cureus: 67 papers in training set, Top 4%, 0.9%
24. Nature Communications: 4913 papers in training set, Top 60%, 0.8%
25. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 5%, 0.8%
26. Med: 38 papers in training set, Top 0.6%, 0.8%
27. Journal of Translational Medicine: 46 papers in training set, Top 2%, 0.8%
28. JCO Clinical Cancer Informatics: 18 papers in training set, Top 0.8%, 0.8%
29. PLOS Computational Biology: 1633 papers in training set, Top 24%, 0.8%
30. Journal of Medical Internet Research: 85 papers in training set, Top 4%, 0.8%
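
The 50% cut-off stated for this list can be checked by accumulating the predicted probabilities in rank order. A minimal sketch, assuming the last percentage in each entry is that journal's predicted probability (rounded to one decimal place); the function name is hypothetical and not part of the site.

```python
from itertools import accumulate

# Predicted probabilities (%) of the twelve top-ranked journals, copied from the list.
probabilities = [13.0, 10.8, 7.4, 5.0, 3.7, 3.2, 2.8, 2.7, 2.5, 2.5, 2.2, 2.2]

def journals_to_reach(probs, threshold):
    """Return how many top-ranked entries are needed for the running sum to reach threshold."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None  # threshold is never reached with the given entries

print(journals_to_reach(probabilities, 50.0))
```

With the rounded values listed, the running sum is about 48.6% after eight journals and about 51.1% after nine, so the function returns 9, in line with the statement that the top 9 journals account for 50% of the probability mass.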