Artificial intelligence aided design of peptides with custom secondary structure motifs and reduced amino acid alphabets

Brown, S. M.; Cohen, A. B.; Dean, S. N.

2026-05-01 bioinformatics
10.64898/2026.04.29.721096 bioRxiv
Proteins are highly diverse functional polymers in which the specific sequence of amino acids, selected from a standard genetically encoded alphabet of twenty (C20), determines the structure and ultimately the function of the folded protein. The amino acids of this standard alphabet have been found to be non-randomly distributed across physicochemical properties crucial to both structure formation and function, an observation often referred to as coverage theory. While machine learning models have drastically improved protein structure prediction, protein design has yet to see comparable progress. Here we therefore bridge contemporary biological theory with recent advances in artificial intelligence (AI) to develop and evaluate a generative AI protein design model, trained on hundreds of thousands of proteins from the RCSB PDB, that targets custom secondary structure motifs using reduced amino acid alphabets. Results indicate overall success in designing novel proteins with the desired secondary structure motifs across a broad range of amino acid alphabets. Interestingly, this tool often captures the full three-dimensional tertiary structure of a target protein despite being trained only on physicochemical sequence space and DSSP secondary structure. The development of this model advances research across multiple disciplines, from general scientific AI/ML architecture development to protein design for biotechnology, astrobiology, and early-Earth evolutionary biology.

Matching journals

The top 8 journals account for 50% of the predicted probability mass.

1. Protein Science | 221 papers in training set | Top 0.1% | 14.9%
2. Scientific Reports | 3102 papers in training set | Top 17% | 6.4%
3. Computational and Structural Biotechnology Journal | 216 papers in training set | Top 0.5% | 6.4%
4. Chemical Science | 71 papers in training set | Top 0.1% | 6.4%
5. Journal of Chemical Information and Modeling | 207 papers in training set | Top 0.9% | 6.4%
6. Proteins: Structure, Function, and Bioinformatics | 82 papers in training set | Top 0.1% | 4.9%
7. PLOS Computational Biology | 1633 papers in training set | Top 8% | 4.0%
8. Bioinformatics Advances | 184 papers in training set | Top 2% | 2.6%
50% of probability mass above
9. Journal of Chemical Theory and Computation | 126 papers in training set | Top 0.4% | 2.6%
10. Nature Communications | 4913 papers in training set | Top 47% | 2.1%
11. ACS Synthetic Biology | 256 papers in training set | Top 1% | 2.1%
12. PLOS ONE | 4510 papers in training set | Top 47% | 2.1%
13. Cell Systems | 167 papers in training set | Top 7% | 1.7%
14. Structure | 175 papers in training set | Top 2% | 1.7%
15. Biophysical Journal | 545 papers in training set | Top 3% | 1.5%
16. Molecules | 37 papers in training set | Top 1% | 1.3%
17. International Journal of Molecular Sciences | 453 papers in training set | Top 10% | 1.2%
18. Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 39% | 1.1%
19. Artificial Intelligence in the Life Sciences | 11 papers in training set | Top 0.1% | 1.1%
20. Frontiers in Molecular Biosciences | 100 papers in training set | Top 3% | 1.1%
21. Frontiers in Genetics | 197 papers in training set | Top 7% | 1.0%
22. NAR Genomics and Bioinformatics | 214 papers in training set | Top 3% | 0.9%
23. Computational Biology and Chemistry | 23 papers in training set | Top 0.3% | 0.9%
24. Journal of Molecular Biology | 217 papers in training set | Top 3% | 0.9%
25. Journal of Cheminformatics | 25 papers in training set | Top 0.5% | 0.8%
26. Frontiers in Immunology | 586 papers in training set | Top 7% | 0.8%
27. The Journal of Physical Chemistry B | 158 papers in training set | Top 2% | 0.8%
28. iScience | 1063 papers in training set | Top 28% | 0.8%
29. Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.8%
30. mAbs | 28 papers in training set | Top 0.4% | 0.8%
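
The statement that the top 8 journals account for 50% of the predicted probability mass can be checked with a running sum over the ranked probabilities. The sketch below is only an illustration: the function name `journals_to_reach` is hypothetical, and the input values are the rounded percentages shown in the list above, so the totals are approximate.

```python
from itertools import accumulate

# Rounded predicted probabilities (percent) for ranks 1-10, copied from the list above.
probs = [14.9, 6.4, 6.4, 6.4, 6.4, 4.9, 4.0, 2.6, 2.6, 2.1]


def journals_to_reach(probabilities, threshold):
    """Return (k, total): the smallest k whose first k entries sum to >= threshold.

    Hypothetical helper for illustration; returns None if the threshold
    is never reached.
    """
    for k, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return k, total
    return None


k, total = journals_to_reach(probs, 50.0)
print(k, round(total, 1))
```

With the rounded values listed, the running sum is about 49.4% after seven journals and about 52.0% after eight, which is consistent with the cutoff falling after rank 8.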