General-purpose large language models can achieve physician-level accuracy in complex medical data extraction

Rajeev, M.; Narayan, A.

2026-06-10 gastroenterology
10.64898/2026.06.06.26354838 medRxiv
Background: Unstructured data represent about 80% of total electronic health record (EHR) data. Structuring this free text is essential for advancing clinical research, including cohort selection for trials, retrospective studies, and the development of disease registries. Manual chart review (MCR) remains the gold standard for extracting these clinical data, but the process is inherently slow, resource-intensive, and susceptible to errors from human fatigue. We evaluated the extraction accuracy, safety, and efficiency of the HeLIX (Hepatology Logic-Integrated Extraction) framework, a large language model (LLM) protocol using Google Gemini 3 Pro, compared with gold-standard MCR.
Methods: A prospective validation study was conducted using 50 high-complexity, simulated hepatology discharge summaries designed to replicate the real-world heterogeneity of EHRs. The HeLIX framework employed a zero-shot, structured chain-of-thought (CoT) prompting strategy enforced by a three-layer architecture: Clinical Reasoning Trace, Schema Enforcement, and Evidence Verification. The model extracted 45 distinct clinical variables. Performance was benchmarked against a consensus MCR.
Results: Across 2,250 evaluated data points, the model achieved an overall extraction accuracy of 99.24% (95% CI: 98.8%-99.5%), with perfect concordance in 35/45 (77.8%) variables. For binary diagnostic variables, the model demonstrated an overall F1-score of 0.98, a recall of 0.99, and substantial inter-rater reliability (Cohen's κ = 0.97). Hallucinations were exceptionally rare (2/2250; 0.08%). Critical errors affecting clinical management occurred in only 2 instances (<0.1% of total data), both involving etiological misattribution in complex multifactorial diagnoses. The AI workflow was 13.4-fold faster and 95.1% more cost-effective than manual extraction.
Conclusion: The HeLIX framework demonstrates physician-level accuracy and reliability in extracting complex hepatology data. It offers a scalable, efficient, and economical alternative to manual chart review. Such frameworks could accelerate clinical research, enabling healthcare systems globally to build comprehensive patient registries for a fraction of the traditional cost.
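
The headline accuracy and its confidence interval can be sanity-checked from the counts given in the abstract. The sketch below is illustrative only: the abstract does not say how the interval was computed, so the Wilson score method is an assumption, and the count of 2,233 correct data points is inferred from 99.24% of 2,250 rather than reported directly. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 50 discharge summaries x 45 variables, as stated in the abstract.
total = 50 * 45

# Assumption: number of correct data points inferred from the reported 99.24%.
correct = round(0.9924 * total)

low, high = wilson_interval(correct, total)
print(f"accuracy = {correct / total:.2%}, 95% CI = {low:.1%} to {high:.1%}")
# prints: accuracy = 99.24%, 95% CI = 98.8% to 99.5%
```

Under these assumptions the result agrees with the reported interval of 98.8%-99.5%. Other interval methods (for example, exact Clopper-Pearson) would give slightly different bounds.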

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | npj Digital Medicine | 97 | Top 0.3% | 17.6% |
| 2 | Journal of the American Medical Informatics Association | 61 | Top 0.4% | 8.5% |
| 3 | iScience | 1063 | Top 1% | 6.4% |
| 4 | Scientific Reports | 3102 | Top 18% | 6.4% |
| 5 | Nature Communications | 4913 | Top 35% | 4.3% |
| 6 | Journal of Biomedical Informatics | 45 | Top 0.4% | 4.0% |
| 7 | Med | 38 | Top 0.1% | 4.0% |
| | 50% of probability mass above | | | |
| 8 | Journal of Medical Internet Research | 85 | Top 1% | 3.6% |
| 9 | PLOS ONE | 4510 | Top 42% | 2.9% |
| 10 | Computer Methods and Programs in Biomedicine | 27 | Top 0.2% | 2.6% |
| 11 | Nature Biomedical Engineering | 42 | Top 0.7% | 1.9% |
| 12 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.9% |
| 13 | PLOS Digital Health | 91 | Top 1% | 1.9% |
| 14 | Clinical Pharmacology & Therapeutics | 25 | Top 0.3% | 1.7% |
| 15 | Genomics, Proteomics & Bioinformatics | 171 | Top 3% | 1.7% |
| 16 | Hepatology | 18 | Top 0.2% | 1.5% |
| 17 | BMC Medicine | 163 | Top 4% | 1.3% |
| 18 | Journal of Translational Medicine | 46 | Top 1% | 1.2% |
| 19 | eLife | 5422 | Top 52% | 1.0% |
| 20 | Clinical and Translational Science | 21 | Top 0.9% | 0.8% |
| 21 | Nature Medicine | 117 | Top 4% | 0.8% |
| 22 | Nature Genetics | 240 | Top 7% | 0.8% |
| 23 | Frontiers in Medicine | 113 | Top 7% | 0.8% |
| 24 | Science Translational Medicine | 111 | Top 6% | 0.8% |
| 25 | JMIR Medical Informatics | 17 | Top 1% | 0.8% |
| 26 | Biomedicines | 66 | Top 4% | 0.6% |
| 27 | The Lancet Digital Health | 25 | Top 1% | 0.6% |
| 28 | Genome Medicine | 154 | Top 9% | 0.6% |
| 29 | eBioMedicine | 130 | Top 5% | 0.6% |
| 30 | Frontiers in Physiology | 93 | Top 7% | 0.6% |