Codebook: sequence specificity and genomic binding of poorly-characterized human transcription factors

Jolma, A.; Laverty, K. U.; Fathi, A.; Yang, A. W.; Yellan, I.; Vorontsov, I. E.; Inukai, S.; Kribelbauer, J. F.; Gralak, A. J.; Razavi, R.; Albu, M.; Brechalov, A.; Patel, Z. M.; Nozdrin, V.; Meshcheryakov, G.; Buyan, A.; Kozin, I.; Abramov, S.; Boytsov, A.; The Codebook Consortium; Weirauch, M. T.; Fornes, O.; Makeev, V. J.; Grau, J.; Grosse, I.; Bucher, P.; Deplancke, B.; Kulakovskiy, I. V.; Hughes, T. R.

2026-03-12 genomics
10.1101/2024.11.11.622097 bioRxiv
Gene expression is regulated by transcription factors (TFs), which recognize specific DNA sequence motifs. Several hundred putative human TFs, identified mainly by an apparent DNA-binding domain, lack known binding motifs [1], and even for well-characterized TFs, it remains controversial to what degree motifs accurately reflect binding sites in living cells [2,3]. Here, we describe a systematic effort ("Codebook") to determine the sequence specificity of 332 putative and poorly characterized human TFs. Over 4,000 independent experiments, encompassing multiple in vitro and in vivo assays, produced motifs for just over half (177, or 53%), of which most are unique to a single protein, thereby extending the vocabulary of sequence recognition encoded by human TFs by ~100 distinct motifs. Moreover, binding motifs identified in vitro are strongly enriched within cellular binding sites. Collectively, the data reveal tens of thousands of previously unknown, conserved, and direct TF binding sites across the human genome. These sites are concentrated in promoter regions, and are predictive of gene expression, illustrating that this new data atlas provides an important step forward in decoding the human genome.

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.