MethylCurate: Tool For Dataset Curation and Epigenetic Aging Clock Evaluation

Edwards, T. A.; Shen, L.; Long, Q.

2026-05-14 bioinformatics
10.64898/2026.05.11.723515 bioRxiv

Summary
DNA methylation datasets from public repositories such as NCBI Gene Expression Omnibus are central to the development and evaluation of epigenetic aging clocks, yet existing resources and tools do not fully resolve the bottlenecks of dataset retrieval and metadata harmonization. Current benchmarking frameworks often rely on static curated collections, support only a subset of available Gene Expression Omnibus studies, focus on specific tissues, or require substantial manual intervention when metadata fields and supplementary files are inconsistently structured across studies. We developed MethylCurate, an agentic AI framework that addresses these limitations by automating the retrieval of DNA methylation datasets from the Gene Expression Omnibus, harmonizing heterogeneous metadata, mapping datasets to a unified format, and enabling scalable evaluation of epigenetic aging clocks through an integrated, dialogue-driven workflow.

Availability and Implementation
MethylCurate is implemented in Python and combines deterministic modules for Gene Expression Omnibus dataset retrieval, quality control, and clock evaluation with large language model-assisted agents for metadata extraction, metadata harmonization, and DNA methylation data parsing. Source code, documentation, and example workflows are available at: https://github.com/Travyse/methylcurate

Contact
travyse.edwards@pennmedicine.upenn.edu

Supplementary Information
Supplementary data are available at Bioinformatics online.

Graphical Abstract
MethylCurate is an agentic AI framework that converts user-specified NCBI Gene Expression Omnibus DNA methylation datasets into standardized metadata, beta matrices, artifacts, logs, and aging clock benchmarking outputs through automated retrieval, quality control, metadata extraction, harmonization, and evaluation workflows. Figure generated with BioRender.

Key Messages
- Automated curation of DNA methylation datasets from the Gene Expression Omnibus.
- Standardized preprocessing and metadata harmonization.
- Integrated benchmarking of epigenetic aging clocks.
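
The harmonization and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. This is not MethylCurate's code or API: `FIELD_PATTERNS`, `harmonize`, and `evaluate_clock` are hypothetical names, the fixed regex table merely stands in for the LLM-assisted harmonization the abstract describes, and mean absolute error and Pearson correlation are assumed here as typical clock metrics, since the abstract does not say which metrics the tool reports. The input is assumed to be a list of "key: value" strings, the form in which GEO sample characteristics commonly appear.

```python
import math
import re

# Hypothetical rule table: unified field name -> pattern for raw metadata keys.
FIELD_PATTERNS = {
    "age": re.compile(r"^age(\s*\(?(years|yrs|y)\)?)?$", re.I),
    "sex": re.compile(r"^(sex|gender)$", re.I),
    "tissue": re.compile(r"^(tissue|source|cell type)$", re.I),
}


def harmonize(characteristics):
    """Map raw 'key: value' strings to a record with unified field names."""
    record = {}
    for entry in characteristics:
        key, _, value = entry.partition(":")
        key, value = key.strip(), value.strip()
        for field, pattern in FIELD_PATTERNS.items():
            if pattern.match(key):
                record[field] = value
                break
    if "age" in record:
        # Keep the first number found; None if the value has no number.
        match = re.search(r"\d+(\.\d+)?", record["age"])
        record["age"] = float(match.group()) if match else None
    if "sex" in record:
        # Normalize on the first letter; anything unrecognized becomes None.
        record["sex"] = {"m": "male", "f": "female"}.get(record["sex"][:1].lower())
    return record


def evaluate_clock(predicted, chronological):
    """Return MAE and Pearson r between predicted and chronological ages."""
    n = len(predicted)
    if n != len(chronological) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mae = sum(abs(p - c) for p, c in zip(predicted, chronological)) / n
    mean_p = sum(predicted) / n
    mean_c = sum(chronological) / n
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(predicted, chronological))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    sd_c = math.sqrt(sum((c - mean_c) ** 2 for c in chronological))
    # Correlation is undefined when either sequence is constant.
    r = cov / (sd_p * sd_c) if sd_p > 0 and sd_c > 0 else float("nan")
    return {"mae": mae, "pearson_r": r}
```

For instance, `harmonize(["Age (years): 54", "gender: F", "tissue: whole blood"])` yields a record with `age` as the float 54.0, `sex` as "female", and `tissue` as "whole blood". A fixed rule table like this breaks down quickly on real GEO metadata, which is the kind of inconsistency the abstract cites as motivation for LLM-assisted agents.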

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Bioinformatics: 1061 papers in training set, Top 1%, 22.8%
2. Bioinformatics Advances: 184 papers in training set, Top 0.1%, 14.9%
3. Nucleic Acids Research: 1128 papers in training set, Top 1%, 10.2%
4. GeroScience: 97 papers in training set, Top 0.3%, 6.5%

50% of probability mass above

5. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 1.0%, 4.4%
6. PLOS ONE: 4510 papers in training set, Top 38%, 3.6%
7. BMC Bioinformatics: 383 papers in training set, Top 3%, 2.4%
8. Aging: 69 papers in training set, Top 1%, 1.9%
9. Clinical Epigenetics: 53 papers in training set, Top 0.4%, 1.9%
10. Briefings in Bioinformatics: 326 papers in training set, Top 3%, 1.9%
11. Cell Reports Methods: 141 papers in training set, Top 2%, 1.7%
12. GigaScience: 172 papers in training set, Top 1%, 1.7%
13. Aging Cell: 144 papers in training set, Top 2%, 1.7%
14. Nature Communications: 4913 papers in training set, Top 55%, 1.3%
15. PLOS Computational Biology: 1633 papers in training set, Top 19%, 1.2%
16. NAR Genomics and Bioinformatics: 214 papers in training set, Top 3%, 0.9%
17. Frontiers in Genetics: 197 papers in training set, Top 8%, 0.9%
18. Epigenetics: 43 papers in training set, Top 0.7%, 0.9%
19. npj Aging: 15 papers in training set, Top 0.9%, 0.8%
20. Scientific Data: 174 papers in training set, Top 2%, 0.8%
21. Genome Medicine: 154 papers in training set, Top 8%, 0.8%
22. Nature Aging: 51 papers in training set, Top 2%, 0.8%
23. eNeuro: 389 papers in training set, Top 10%, 0.7%
24. Genomics, Proteomics & Bioinformatics: 171 papers in training set, Top 7%, 0.7%
25. Database: 51 papers in training set, Top 1%, 0.7%
26. iScience: 1063 papers in training set, Top 40%, 0.5%
27. Frontiers in Bioinformatics: 45 papers in training set, Top 1%, 0.5%
28. Genome Research: 409 papers in training set, Top 5%, 0.5%
29. Journal of Proteome Research: 215 papers in training set, Top 3%, 0.5%