Transforming Semi-structured Variant Assessments into Computable Clinical Assertions: A Pilot Study for AI-Assisted Curation

Cannon, M. J.; Bratulin, A.; Kuzma, K.; Puthawala, D.; Corsmeier, D.; Schieffer, K.; Kelly, B.; Cottrell, C.; Wagner, A. H.

2026-05-08 health informatics
10.64898/2026.05.07.26352456 medRxiv
Genomic medicine relies on expert evaluation of genomic variants, but this process is dramatically slowed by a lack of readily accessible genomic knowledge. Although genomic knowledge resources such as ClinVar and CIViC support structured data sharing and provide interfaces for adding structure, much of the variant interpretation data generated upstream is not readily interoperable with these resources, limiting the ability of clinical labs to share data and creating knowledge silos. Here we evaluate a strategy for breaking down these silos in a pilot study that transforms semi-structured variant classification knowledge into computable clinical assertions using the Global Alliance for Genomics and Health (GA4GH) Genomic Knowledge Standards specifications. We programmatically mapped previously captured somatic cancer clinical significance classifications from spreadsheets to the GA4GH Variant Annotation specification. For diagnostic classification data, this approach enabled reuse of standards-aware submission tooling to share 1,499 records with ClinVar. We then studied how AI-assisted curation approaches could overcome gaps in unstructured text and enable scalable curation of the prior classifications recorded there. Using this approach, we accurately classified clinical significance for 71.8% (117/163) of randomly sampled prognostic evidence statements. We conclude with an overview of how this work may be generalized to make computationally inaccessible variant evidence from other clinical laboratories broadly reusable in downstream knowledgebases such as CIViC and ClinVar.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics; 18 papers in training set; Top 0.1%; 19.0%
2. Bioinformatics; 1061 papers in training set; Top 3%; 10.6%
3. Journal of the American Medical Informatics Association; 61 papers in training set; Top 0.4%; 7.3%
4. GENETICS; 189 papers in training set; Top 0.1%; 4.9%
5. PLOS Computational Biology; 1633 papers in training set; Top 9%; 3.7%
6. Scientific Reports; 3102 papers in training set; Top 34%; 3.7%
7. NAR Genomics and Bioinformatics; 214 papers in training set; Top 1%; 2.7%

50% of probability mass above

8. BMC Bioinformatics; 383 papers in training set; Top 4%; 2.1%
9. Journal of Biomedical Informatics; 45 papers in training set; Top 0.7%; 1.8%
10. Nature Communications; 4913 papers in training set; Top 51%; 1.7%
11. Patterns; 70 papers in training set; Top 0.8%; 1.7%
12. GigaScience; 172 papers in training set; Top 1%; 1.7%
13. Genome Medicine; 154 papers in training set; Top 4%; 1.7%
14. npj Digital Medicine; 97 papers in training set; Top 2%; 1.5%
15. PLOS ONE; 4510 papers in training set; Top 56%; 1.5%
16. Cell Systems; 167 papers in training set; Top 8%; 1.5%
17. BMC Medical Genomics; 36 papers in training set; Top 0.6%; 1.4%
18. JAMIA Open; 37 papers in training set; Top 1.0%; 1.4%
19. Cell Genomics; 162 papers in training set; Top 4%; 1.4%
20. Med; 38 papers in training set; Top 0.4%; 1.2%
21. iScience; 1063 papers in training set; Top 24%; 1.0%
22. Frontiers in Digital Health; 20 papers in training set; Top 1%; 0.9%
23. Nature Methods; 336 papers in training set; Top 6%; 0.9%
24. Nature Medicine; 117 papers in training set; Top 4%; 0.9%
25. Nature Biotechnology; 147 papers in training set; Top 7%; 0.9%
26. npj Genomic Medicine; 33 papers in training set; Top 0.7%; 0.9%
27. BMC Medical Informatics and Decision Making; 39 papers in training set; Top 2%; 0.8%
28. BMJ Health & Care Informatics; 13 papers in training set; Top 0.8%; 0.8%
29. Human Mutation; 29 papers in training set; Top 0.7%; 0.8%
30. Frontiers in Bioinformatics; 45 papers in training set; Top 1.0%; 0.7%