Improving isoform-level eQTL and integrative genetic analyses of breast cancer risk with long-read RNA transcript assemblies

Head, S. T.; Nemani, A.; Chang, Y.-H.; Harrison, T. A.; Bresnahan, S. T.; Rothstein, J. H.; Sieh, W.; Lindstroem, S.; Bhattacharya, A.

2026-03-23 genomics
10.64898/2026.03.22.713514 bioRxiv
Most eQTL and TWAS analyses quantify expression using aggregate, tissue-agnostic transcript annotations and ignore isoform-level regulation, potentially obscuring or misattributing regulatory mechanisms. Here, we developed a framework leveraging publicly available long-read RNA-seq data to perform tissue-informed inference of genetic regulation and prioritize candidate causal isoforms for breast cancer risk. We quantified gene- and isoform-level expression in breast tumor (TCGA), non-cancerous mammary tissue, and cultured fibroblasts (GTEx) using three transcriptome annotations: standard GENCODE, tissue-specific long-read-derived assemblies, and combined annotations incorporating transcript isoforms from both. While GENCODE cataloged over 250,000 pan-tissue isoforms, the tissue-specific long-read assemblies captured reduced sets of 74,717 isoforms in tumor, 48,057 in fibroblasts, and 22,941 in healthy breast. We performed eQTL mapping and fine-mapping, followed by colocalization with overall and subtype-specific breast cancer GWAS and isoform-level TWAS. While most eGenes were concordant across annotations, approximately one-third of lead cis-eQTLs for shared eGenes differed between long-read assemblies and GENCODE. Further, eIsoform discovery was highly annotation-specific. In healthy breast tissue, the gold-standard tissue for building gene expression prediction models for TWAS of breast cancer, 46% of eIsoforms identified by the long-read annotation were unique to that annotation, even though 93.7% of them are present in GENCODE. Although combined annotations expanded the GENCODE catalog by only 0.6-7.6% depending on tissue source, 69% of unique significant isoform-trait associations were specific to a single annotation.
Long-read-informed annotations uncovered regulatory associations entirely missed by GENCODE, including a candidate regulatory isoform at the MARK1 locus captured only in fibroblasts and a previously unannotated splice variant prioritized as the likely effector transcript at NUP107. These findings demonstrate that transcript annotation is not merely a technical consideration but critically defines the biological hypothesis space for regulatory mechanisms and shapes discovery. Incorporating tissue-resolved isoform annotations from long-read RNA-seq improves the specificity of regulatory inference and enhances identification of candidate causal isoforms at GWAS loci.

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

1. Cell Genomics: 162 papers in training set, Top 0.1%, 21.9%
2. Nature Genetics: 240 papers in training set, Top 0.7%, 9.8%
3. Nature Communications: 4913 papers in training set, Top 24%, 8.2%
4. Genome Biology: 555 papers in training set, Top 1.0%, 6.6%
5. Nature Biotechnology: 147 papers in training set, Top 2%, 6.1%
50% of probability mass above
6. Science: 429 papers in training set, Top 7%, 4.7%
7. The American Journal of Human Genetics: 206 papers in training set, Top 1%, 4.0%
8. Genome Medicine: 154 papers in training set, Top 2%, 3.5%
9. Cell Systems: 167 papers in training set, Top 5%, 2.7%
10. Genome Research: 409 papers in training set, Top 1%, 2.5%
11. Nucleic Acids Research: 1128 papers in training set, Top 9%, 2.0%
12. Nature Methods: 336 papers in training set, Top 4%, 2.0%
13. Cell Reports: 1338 papers in training set, Top 21%, 2.0%
14. Nature: 575 papers in training set, Top 10%, 1.8%
15. Molecular Cell: 308 papers in training set, Top 6%, 1.8%
16. Proceedings of the National Academy of Sciences: 2130 papers in training set, Top 34%, 1.6%
17. PLOS Computational Biology: 1633 papers in training set, Top 17%, 1.6%
18. Nature Neuroscience: 216 papers in training set, Top 5%, 1.4%
19. Cell: 370 papers in training set, Top 13%, 1.3%
20. Science Advances: 1098 papers in training set, Top 22%, 1.3%
21. Science Translational Medicine: 111 papers in training set, Top 5%, 0.9%
22. Nature Machine Intelligence: 61 papers in training set, Top 3%, 0.9%
23. eLife: 5422 papers in training set, Top 59%, 0.7%
24. Cell Reports Methods: 141 papers in training set, Top 6%, 0.7%
25. Nature Cell Biology: 99 papers in training set, Top 5%, 0.7%
26. PLOS Genetics: 756 papers in training set, Top 17%, 0.6%
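
The statement that the top 5 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. The sketch below is only an illustration: the probabilities are copied from the five top-ranked journals in the listing, while the function name `journals_to_reach` and the variable `top_probs` are hypothetical and not part of the page or of any tool behind it.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the five top-ranked journals,
# copied from the listing. Variable and function names are illustrative.
top_probs = [
    ("Cell Genomics", 21.9),
    ("Nature Genetics", 9.8),
    ("Nature Communications", 8.2),
    ("Genome Biology", 6.6),
    ("Nature Biotechnology", 6.1),
]


def journals_to_reach(probs, threshold):
    """Return (count, cumulative mass) for the shortest ranked prefix whose
    summed probability reaches `threshold`, or None if it is never reached."""
    running_totals = accumulate(p for _, p in probs)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count, round(total, 1)
    return None


result = journals_to_reach(top_probs, 50.0)
print(result)
```

Under these numbers the first four journals sum to 46.5% and the fifth brings the total to 52.6%, so the 50% marker falls after the fifth entry.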