Performance Characteristics of Reasoning Large Language Models for Evidence Extraction from Clinical Genomics Literature
Murugan, M.; Yuan, B.; Stephen, J.; Gijavanekar, C.; Xu, S.; Kadirvel, S.; Rivera-Munoz, E. A.; Manita, V.; Delca, F.; Gibbs, R. A.; Venner, E.
BACKGROUND: Genetic variant curation, an important step in the implementation of genomic medicine, requires literature-guided comparison of variant prevalence in affected individuals versus healthy controls. This evidence is categorized under the PS4 evidence code of the ACMG/AMP variant interpretation guidelines, and its manual extraction is a major bottleneck in clinical variant curation. This study aimed to evaluate whether reasoning-capable large language models (LLMs) can support guideline-constrained PS4 evidence extraction from the literature.

METHODS: We benchmarked five LLMs for publication-level variant detection and PS4-eligible proband counting under ACMG/AMP and ClinGen Variant Curation Expert Panel (VCEP) guidance, using an expert-curated truth-set of 281 publication-variant pairs from 275 peer-reviewed publications (58 genes, 128 variants). The five models, spanning frontier-scale, reasoning-optimized, and efficiency-oriented classes (Gemini 2.5 Pro, GPT-5, o3, o4-mini, and Claude Sonnet 4), were evaluated against this truth-set using identical inputs, a unified prompt template, and a schema-constrained JSON output format on two tasks: (1) determining whether a prespecified variant was correctly identified and (2) counting independent PS4-eligible probands under applicable guidance. Primary outcomes were Task 1 accuracy and Task 2 exact-count concordance (model PS4 count equals truth-set count). We also assessed prompt sensitivity, error modes, and output variability across models.

RESULTS: Models detected the presence of a variant in a publication with high accuracy (93.6-97.9%). For PS4 case counting, exact-count concordance was highest for Gemini 2.5 Pro (91.1%) and GPT-5 (90.0%), followed by o3 (86.5%), o4-mini (79.4%), and Claude Sonnet 4 (73.0%). Most counting errors stemmed from a model's failure to correctly apply the guidelines, including in evaluating phenotype and family structure. Prompt refinements improved concordance for most models but reduced performance for Claude Sonnet 4, indicating that model-specific prompting may be warranted.

CONCLUSIONS: Reasoning-capable LLMs can support automation of guideline-based PS4 evidence extraction, achieving high concordance with expert curation, but performance is model- and prompt-dependent, and failures concentrate in guideline application. Our findings support a hybrid workflow for clinical use in which LLM outputs accelerate evidence extraction, with escalation to expert curators where needed.
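To make the two primary outcomes concrete, the sketch below scores hypothetical schema-constrained JSON model outputs against a truth-set. The field names (variant_identified, ps4_proband_count) and the record layout are illustrative assumptions; the abstract specifies only that models returned schema-constrained JSON and were scored on Task 1 accuracy and Task 2 exact-count concordance.

```python
# Minimal scoring sketch. Assumptions: the JSON field names and the
# one-record-per-publication-variant-pair layout below are hypothetical;
# they are not specified in the paper.

# Hypothetical schema-constrained record returned by a model for one
# publication-variant pair.
example_output = {
    "variant_identified": True,   # Task 1: was the prespecified variant found?
    "ps4_proband_count": 3,       # Task 2: independent PS4-eligible probands
}

def task1_accuracy(model_records, truth_records):
    """Fraction of pairs where the model's variant-identification
    flag matches the expert-curated truth-set."""
    hits = sum(
        m["variant_identified"] == t["variant_identified"]
        for m, t in zip(model_records, truth_records)
    )
    return hits / len(truth_records)

def task2_exact_concordance(model_records, truth_records):
    """Fraction of pairs where the model's PS4 proband count exactly
    equals the truth-set count (exact-count concordance)."""
    hits = sum(
        m["ps4_proband_count"] == t["ps4_proband_count"]
        for m, t in zip(model_records, truth_records)
    )
    return hits / len(truth_records)

if __name__ == "__main__":
    truth = [{"variant_identified": True, "ps4_proband_count": 3}]
    model = [example_output]
    print(task1_accuracy(model, truth))        # 1.0
    print(task2_exact_concordance(model, truth))  # 1.0
```

Note that exact-count concordance is deliberately strict: an off-by-one proband count scores zero for that pair, which is why guideline-application errors (e.g., in phenotype or family-structure assessment) dominate the reported failures.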