
From Text to Translation: Using Language Models to Prioritize Variants for Clinical Review

Li, W.; Li, X.; Lavallee, E.; Saparov, A.; Zitnik, M.; Cassa, C. A.

2024-12-31 genetic and genomic medicine
10.1101/2024.12.31.24319792 medRxiv

Background: Despite rapid advances in genomic sequencing, most rare genetic variants remain insufficiently characterized for clinical use, limiting the potential of personalized medicine. When classifying whether a variant is pathogenic, clinical labs adhere to diagnostic guidelines that comprehensively evaluate many forms of evidence, including case data, computational predictions, and functional screening. While a substantial amount of clinical evidence has been developed for many of these variants, the majority cannot be definitively classified as pathogenic or benign, and thus persist as Variants of Uncertain Significance (VUS).

Methods: We processed over 2.4 million plaintext variant summaries from ClinVar, employing sentence-level classification to remove content that does not contain evidence and removing uninformative or highly similar summaries. We then trained ClinVar-BERT to discern clinical evidence within these summaries by fine-tuning a BioBERT-based model with labeled records.

Results: We validated ClinVar-BERT model predictions for variant summaries that are classified as uncertain (VUS) using orthogonal functional screening data. ClinVar-BERT significantly separated estimates of functional impact in clinically actionable genes, including BRCA1 (p = 1.90 × 10⁻²⁰), TP53 (p = 1.14 × 10⁻⁴⁷), and PTEN (p = 3.82 × 10⁻⁷), and achieved an AUROC of 0.927 when classifying whether variants result in loss of function or have uncertain effects.

Conclusion: These findings suggest that ClinVar-BERT is capable of discerning evidence from diagnostic reports and can be useful for prioritizing variants for re-assessment by diagnostic laboratories and expert curation panels.
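
For readers unfamiliar with the AUROC metric reported in the abstract, the following is a minimal sketch of how it can be computed from two groups of classifier scores. It is not the authors' evaluation code. The function name `auroc` and the score lists are hypothetical, illustrative values that are not taken from the study.

```python
def auroc(scores_pos, scores_neg):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example scores higher than a randomly chosen negative one.
    Tied pairs count as half a win."""
    if not scores_pos or not scores_neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical model scores (assumed for illustration only):
# higher means the model sees stronger evidence of loss of function.
lof_scores = [0.91, 0.85, 0.40, 0.77]        # variants with loss-of-function effect
uncertain_scores = [0.10, 0.35, 0.52, 0.20]  # variants with uncertain effect

example_auroc = auroc(lof_scores, uncertain_scores)
print(f"AUROC = {example_auroc:.4f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups. The pairwise loop is quadratic in the number of scores, so a rank-based implementation would be preferable for large datasets.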

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.