Re-annotating the EPICv2 manifest with genes, intragenic features, and regulatory elements

Mallabar-Rimmer, B.; Wells, P.; Franklin, A.; Mill, J.; Webster, A. P.

2025-03-14 bioinformatics
10.1101/2025.03.12.642895 bioRxiv

The Illumina Infinium MethylationEPIC v2.0 BeadChip (EPICv2 array) is a microarray for assessment of the human epigenome. Sites on the EPICv2 array are annotated with an open-source file provided by Illumina, the EPICv2 manifest. Of the 923,452 unique genomic sites targeted by the EPICv2 array, the Illumina manifest identifies just 214,808 as mapping to a gene, excluding many sites located within a gene body. Using the genomic coordinates of probes and publicly available data, we have mapped each site assayed on the EPICv2 array and comprehensively annotated its affiliated genes and regulatory elements. We found that a total of 700,392 EPICv2 array sites are located within a gene body (exon, intron, or UTR) according to the GENCODE Human release 47 (GENCODEv47) database. Of these, 509,940 sites were not annotated as being within a gene in the Illumina EPICv2 manifest, primarily because the Illumina manifest does not annotate introns: 498,407 of the excluded sites (97.74%) are located within the intron of at least one transcript. The Illumina EPICv2 manifest annotates 358,539 sites as being within 1500bp of a transcription start site (TSS). Using a distance-based approach, we have labelled 267,183 sites as being within promoter distance of a gene (<1500bp upstream or <500bp downstream of the TSS), and 140,123 sites as being within enhancer distance (1501-5000bp upstream of the TSS, excluding sites located within a gene body). In summary, the re-annotated manifest uses GENCODEv47 data to label intragenic features and a distance-based approach to label the regulatory genome. It also includes a column indicating whether a site is located in any promoter or enhancer according to the GeneHancer database. The re-annotated manifest additionally labels which sites are required for the Horvath DNA Methylation Age Calculator and MethylDetectR epigenetic clocks, to facilitate data preparation for these tools.
In conclusion, we have re-annotated the EPICv2 manifest, allowing more complete assessment of EPICv2 sites associated with gene bodies and regulatory regions during the interpretation of epigenetic studies. The re-annotated manifest is publicly available; see the Data Availability section of this article.
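
The distance-based labelling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the strand handling, and the treatment of boundary values are assumptions made for illustration. In particular, the abstract's wording ("<1500bp upstream" for promoters, "1501-5000bp upstream" for enhancers) leaves a site exactly 1500bp upstream unassigned, and the sketch follows that wording literally.

```python
from typing import Optional

# Thresholds taken from the abstract; everything else here is illustrative.
PROMOTER_UPSTREAM = 1500    # promoter: <1500bp upstream of the TSS
PROMOTER_DOWNSTREAM = 500   # promoter: <500bp downstream of the TSS
ENHANCER_MIN = 1501         # enhancer: 1501-5000bp upstream of the TSS
ENHANCER_MAX = 5000


def signed_tss_offset(site_pos: int, tss_pos: int, strand: str) -> int:
    """Return the site's offset from the TSS in transcript orientation.

    Negative values are upstream of the TSS, positive values downstream.
    Assumption: both positions use the same coordinate system.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    offset = site_pos - tss_pos
    # On the minus strand, larger genomic coordinates are upstream.
    return offset if strand == "+" else -offset


def classify_site(site_pos: int, tss_pos: int, strand: str,
                  in_gene_body: bool = False) -> Optional[str]:
    """Label a site as 'promoter', 'enhancer', or None relative to one TSS."""
    offset = signed_tss_offset(site_pos, tss_pos, strand)
    if offset < 0:
        upstream = -offset
        if upstream < PROMOTER_UPSTREAM:
            return "promoter"
        # Enhancer-distance sites inside a gene body are excluded.
        if ENHANCER_MIN <= upstream <= ENHANCER_MAX and not in_gene_body:
            return "enhancer"
        return None
    # At or downstream of the TSS.
    return "promoter" if offset < PROMOTER_DOWNSTREAM else None


if __name__ == "__main__":
    # Hypothetical coordinates, chosen only to exercise each branch.
    print(classify_site(1000, 2000, "+"))   # 1000bp upstream on the plus strand
    print(classify_site(5000, 2000, "-"))   # 3000bp upstream on the minus strand
    print(classify_site(2600, 2000, "+"))   # 600bp downstream
```

In a real annotation pipeline a site would be compared against the TSS of every nearby transcript, and the `in_gene_body` flag would come from overlap with GENCODEv47 gene models; both steps are left out of this sketch.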

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Predicted probability |
|---|---|---|---|---|
| 1 | Epigenetics | 43 | Top 0.1% | 27.9% |
| 2 | Clinical Epigenetics | 53 | Top 0.1% | 12.4% |
| 3 | Epigenetics & Chromatin | 42 | Top 0.1% | 10.2% |
| | 50% of probability mass above | | | |
| 4 | Bioinformatics | 1061 | Top 4% | 6.9% |
| 5 | BMC Bioinformatics | 383 | Top 2% | 6.4% |
| 6 | Nucleic Acids Research | 1128 | Top 6% | 3.6% |
| 7 | PLOS ONE | 4510 | Top 39% | 3.6% |
| 8 | Nature Communications | 4913 | Top 42% | 3.1% |
| 9 | Bioinformatics Advances | 184 | Top 2% | 2.1% |
| 10 | Scientific Reports | 3102 | Top 50% | 2.1% |
| 11 | Frontiers in Genetics | 197 | Top 5% | 1.7% |
| 12 | Epigenomics | 10 | Top 0.1% | 1.7% |
| 13 | Computational and Structural Biotechnology Journal | 216 | Top 5% | 1.5% |
| 14 | NAR Genomics and Bioinformatics | 214 | Top 2% | 1.3% |
| 15 | Genome Research | 409 | Top 3% | 1.3% |
| 16 | The American Journal of Human Genetics | 206 | Top 3% | 1.3% |
| 17 | Genome Biology | 555 | Top 5% | 1.2% |
| 18 | BMC Genomics | 328 | Top 6% | 0.8% |
| 19 | Genome Medicine | 154 | Top 8% | 0.7% |
| 20 | European Journal of Human Genetics | 49 | Top 2% | 0.5% |
| 21 | Scientific Data | 174 | Top 3% | 0.5% |
| 22 | International Journal of Molecular Sciences | 453 | Top 19% | 0.5% |