Recent Updates: November 2024
The Protein Folding Problem: Sequence to 3D Structure
In Chapter 3.4: Analyses of Protein Structure, we discussed how protein structure can be experimentally analyzed and the 3D structure of a protein determined using NMR, X-ray crystallography, and cryo-EM. Now, using sequence and structural databases (like the PDB with over 227,000 structures), we can often predict the 3D structure of a protein just from its linear sequence by comparing the sequence of a protein of unknown tertiary structure to homologous proteins (by sequence) whose 3D structures are known. Earlier and simpler attempts at "homology modeling" have been extended through machine learning and artificial intelligence to allow structural predictions for millions of protein sequences using programs such as RoseTTAFold and AlphaFold. Both RoseTTAFold and AlphaFold produce high-quality structure predictions when trained using the vast sequence information in the Protein Data Bank. Embedded in those linear sequences is a large amount of hidden (to the human eye) evolutionary information that machine learning and AI can harness to predict 3D structures. They work less well when very limited sequence comparisons are available.
In general (for smaller proteins), the protein folding problem, the prediction of structure from sequence, appears to have been "solved". The Nobel Prize in Chemistry in 2024 was awarded to Demis Hassabis and John M. Jumper from Google DeepMind for developing AlphaFold, and David Baker for developing RoseTTAFold and other powerful techniques described below. (Check out the previous section, Chapter 4.13: Predicting Structure and Function of Biomolecules Through Natural Language Processing Tools, if you are interested in a "deep dive" into how these programs work!)
A comparison of protein structures obtained using these programs with known 3D structures obtained through X-ray crystallography or other techniques shows them to be almost identical. Different metrics can be used to compare predicted structures to the actual ones. The root mean squared deviation (RMSD) is a common one. RoseTTAFold uses a TM-score, a metric for assessing the topological similarity of protein structures. Compared to RMSD, the TM-score weights smaller distance errors more heavily than larger ones, which makes it sensitive to the global fold rather than to local structural differences. Expressed as a percentage, TM values range from 0 to 100 (100 is a perfect match); the score is also commonly reported on a 0 to 1 scale. Scores below 17 indicate no topology match, while those above 50 suggest a common fold.
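To make the difference between the two metrics concrete, here is a minimal Python sketch. It assumes the predicted and experimental Cα coordinates are already paired residue by residue and already superimposed (real tools also search for the best alignment and superposition, which is omitted here). The function names and the toy coordinates are illustrative, and the TM-score is returned on the 0 to 1 scale.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation between paired, already superimposed atoms."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(pred, ref))
    return math.sqrt(total / len(pred))

def tm_score(pred, ref):
    """TM-score (0 to 1) for paired, already superimposed C-alpha atoms."""
    n = len(ref)
    if len(pred) != n or n == 0:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    # Length-dependent distance scale d0 from the standard TM-score definition.
    # It is clamped here because the formula is not meant for very short chains.
    d0 = max(1.24 * (n - 15) ** (1.0 / 3.0) - 1.8, 0.5) if n > 15 else 0.5
    # Each residue contributes at most 1; large deviations contribute almost 0.
    return sum(1.0 / (1.0 + (math.dist(p, q) / d0) ** 2)
               for p, q in zip(pred, ref)) / n

# Toy example: a straight 30-residue "chain" with 3.8 A spacing (illustrative only).
reference = [(3.8 * i, 0.0, 0.0) for i in range(30)]
predicted = list(reference)
predicted[0] = (0.0, 10.0, 0.0)  # one badly placed terminal residue

print(rmsd(predicted, reference))      # dominated by the single 10 A error
print(tm_score(predicted, reference))  # stays close to 1 because 29 residues match
```

A single misplaced residue inflates the RMSD noticeably, whereas the TM-score barely moves, which is the sense in which the TM-score tracks the global fold.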
AlphaFold uses a "neural network, meaning it simultaneously considers patterns in protein sequences, how a protein’s amino acids interact with one another, and a protein’s possible three-dimensional structure. In this architecture, one-, two-, and three-dimensional information flows back and forth, allowing the network to collectively reason about the relationship between a protein’s chemical parts and its folded structure". Programs of this type might allow the generation of proteins with new therapeutic or commercial potential just based on sequences. These include vaccines, sensors, specific immune system suppressors or activators, and antivirals. AlphaFold has now been used to predict the structure of 214 million proteins from more than one million species, essentially all known protein-coding sequences. We have included many AlphaFold iCn3D models throughout this book.
- AlphaFold Database-Protein Structure Database
Figure \(\PageIndex{1}\) shows the backbone tube cartoon of the X-ray PDB structure of a small protein (1xww, cyan) and the structure predicted by both the RoseTTAFold and AlphaFold programs (magenta) just from its primary sequence. Sulfate, a competitive inhibitor, is shown (spacefill) bound in the active site. The alignment is quite spectacular, except for the N-terminal 5 amino acids at the bottom of the figure (6 o'clock). This stretch is more disordered even in the X-ray structure, as its amino acids have high B-factors, indicating more conformational flexibility.
Prediction of Protein-Protein Interactions
AlphaFold3 has also been used to predict the structure of protein complexes in which multiples of the same or a different protein subunit combine to form a larger, quaternary structure. Figure \(\PageIndex{2}\) shows an interactive iCn3D model of a recent stunning example of a predicted AlphaFold complex required to bind a human sperm and egg. The complex consists of three transmembrane human sperm proteins and a human egg protein attached to the egg membrane with a posttranslational lipid anchor (not shown).
Figure \(\PageIndex{2}\): Human sperm proteins and egg protein complex predicted by AlphaFold. (Copyright; author via source). The spacefill section of the three sperm proteins attaches the proteins to the sperm membrane. The egg protein is JUNO (cyan) and the three sperm proteins are IZUMO1 (magenta), SPACA6 (brown/gold), and TMEM81 (blue).
Reference for PDB file: Deneke, V. et al. A conserved fertilization complex bridges sperm and egg in vertebrates. Cell. October, 2024. https://doi.org/10.1016/j.cell.2024.09.035. Creative Commons Attribution (CC BY 4.0)
You can download this iCn3D file and load it in iCn3D using these commands to see the structure as rendered in the image above. IMPORTANT: If the file opens as an image in a new browser window, right-click the image and save the file to download it!
- Open iCn3D
- File, Open, iCn3D PNG appendable and browse for the file in your download folder.
Here is the link to the AlphaFold Server 3. AlphaFold3 is based on a machine-learning process called diffusion, which is explained in more detail below.
The following biological species can be modeled in AlphaFold3:
- macromolecules including proteins, DNA and RNA
- common ligands including ATP, ADP, AMP, GTP, GDP, FAD, NADP, NADPH, NDP, heme, heme C, myristic acid, oleic acid, palmitic acid, citric acid, chlorophylls A and B, bacteriochlorophylls A and B
- common ions such as Ca2+, Co2+, Cu2+, Fe3+, K+, Mg2+, Mn2+, Na+, Zn2+, and Cl-
- common post-translational modifications (PTMs) of amino acid residues such as phosphorylation of serine, threonine, tyrosine, and histidine, acetylation of lysine residues, methylation of lysine and arginine, malonylation of cysteine, hydroxylation of proline, lysine, and asparagine, palmitoylation of cysteine, succinylation of asparagine, S-nitrosylation, formylation of tryptophan, crotonylation of lysine, citrullination of lysine and arginine
- glycan chains (including branched chains) composed of some sugars including alpha/beta-D-glucose, alpha/beta-D-mannose, alpha-L-fucose, beta-D-galactose, N-acetyl-beta-D-glucosamine
- common chemical modifications of the DNA (including methylation of cytosine, guanine, and adenine, carboxylation of cytosine, oxidation of guanine, formylation of cytosine) and RNA (including isomerization of uridine into pseudouridine, formylation of cytosine, and methylation of cytosine, guanine, adenine, and uracil)
- structures composed of multiple proteins, nucleic acids, ligands, ions, and chemically modified derivatives.
Note that simple drugs are not on the list since those applications are proprietary (at the present time). AlphaFold 3 can now be downloaded for academic (non-commercial) applications and likely includes drugs. A more limited web version is available here.
It is important to statistically compare the experimentally determined PDB structures of complexes with their AF3-predicted counterparts. One such comparison involves wild-type complexes and their mutated forms, which should differ by just subtle conformational changes. As an example, consider human angiogenin and the placental ribonuclease inhibitor. Figure \(\PageIndex{3}\) shows an interactive iCn3D model of the experimental human angiogenin and placental ribonuclease inhibitor complex (1A4Y).
Figure \(\PageIndex{3}\): Human angiogenin-placental ribonuclease inhibitor complex (1A4Y). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...DaUAJzhh8H6LU8
There are 27 mutant variants of the complex whose experimental structures are known. They are included in the SKEMPI database, which contains thermodynamic and kinetic data for wildtype and mutant complexes with known PDB structures. One widely used thermodynamic parameter we have seen before is the change in thermodynamic stability (i.e. \(\Delta\Delta G^0 = \Delta G^0_{\text{mutant}} - \Delta G^0_{\text{wildtype}}\)), in this case for the complex, when key residues lining the binding pocket are mutated. The entire database contains over 317 protein–protein complexes and 8338 mutations. How well does AF3 predict the structure of mutant complexes?
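As a worked illustration of the ΔΔG0 definition above, the sketch below converts dissociation constants into standard binding free energies using ΔG0 = RT ln(KD) and takes the difference. The function names and the two KD values are invented for illustration and are not taken from SKEMPI.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def binding_dg(kd_molar, temp_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant in M."""
    if kd_molar <= 0:
        raise ValueError("KD must be positive")
    return R_KCAL * temp_k * math.log(kd_molar)

def ddg(kd_mutant, kd_wildtype, temp_k=298.15):
    """ddG0 = dG0(mutant) - dG0(wildtype); positive means weaker binding."""
    return binding_dg(kd_mutant, temp_k) - binding_dg(kd_wildtype, temp_k)

# Hypothetical values: the mutation weakens binding 100-fold.
kd_wt = 1e-12   # M (assumed, for illustration)
kd_mut = 1e-10  # M (assumed, for illustration)
print(round(ddg(kd_mut, kd_wt), 2))  # roughly +2.7 kcal/mol at 25 C
```

Because only the ratio of the two KD values matters, each 10-fold loss of affinity costs about 1.4 kcal/mol at room temperature.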
Three statistical values are used to compare the experimental and AF-predicted complex structures:
- RMSD (Root-Mean-Square Deviation): This measures the average distance between equivalent atoms in the complex subunits. Lower values show greater similarity between the experimental and AlphaFold 3 models.
- ipTM (Interface Predicted Template Model): This measures changes in the interface of the subunits between the experimental and AF3 structures. Higher values indicate a closer match.
- pTM (Predicted Template Model): As with simple AF prediction, this measures the overall accuracy of the predicted structure (based on both backbone and sidechain orientations). Higher pTM values indicate a more accurate prediction.
Figure \(\PageIndex{4}\) shows the structure of the wildtype RNase inhibitor-Angiogenin complex with 27 red dots indicating mutants with experimental structures (Panel A) and a comparison of the wildtype and a mutant structure (Panel B). Statistical results comparing experimental and AF structures for all 317 complexes in the SKEMPI database are shown in Panels C and D.
Figure \(\PageIndex{4}\): Wildtype and AF predicted composite structure of RNase inhibitor-Angiogenin complexes (Panels A and B) and statistical comparison of all structures in the SKEMPI database (Panels C and D). JunJie Wee and Guo-Wei Wei. J. Chem. Inf. Model. 2024, 64, 16, 6676–6683. https://doi.org/10.1021/acs.jcim.4c00976. Published August 8, 2024. CC-BY 4.0.
Panel A: The cartoon representation of ribonuclease inhibitor-angiogenin complex (PDB ID: 1A4Y). The ribonuclease inhibitor is shown in blue and the angiogenin is in green. 27 mutation spots of 1A4Y in the S8338 data set are indicated in red.
Panel B: The structural alignment of 1A4Y with its AF3 predicted complex.
Panels C and D below are based on all the complexes studied, not just the RNase Inhibitor-Angiogenin complex.
Panel C: The boxplot for RMSD, ipTM and pTM distributions of 317 predicted AF3 protein–protein complexes. RMSDs refer to the overall RMSD calculated by structurally aligning an AF3 complex with its original PDB complex.
Panel D: The breakdown of AF3 protein-protein complexes based on their ipTM and pTM scoring criteria.
The average statistical values for the wildtype and mutant structures are 1.61 Å (RMSD), 0.803 (ipTM), and 0.847 (pTM), respectively. The graphs clearly show that most predictions (72%) had high ipTM scores (> 0.8), while 99% had reasonably high pTM scores (> 0.5). However, there were outliers, as shown in Panel C, with RMSD values > 4 Å, which indicates poorer performance with AF3. Most complexes with high prediction values also had low RMSD values. Experiments like these will be used to continually refine programs such as AlphaFold 3.
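The kind of breakdown described above can be sketched in a few lines of Python. The cutoffs (ipTM > 0.8, pTM > 0.5, RMSD > 4 Å) come from the text; the function name, the record layout, and the four example records are invented purely for illustration and are not data from the study.

```python
def summarize(predictions, iptm_cut=0.8, ptm_cut=0.5, rmsd_cut=4.0):
    """Fractions of predictions passing each confidence cutoff, plus RMSD outliers."""
    n = len(predictions)
    if n == 0:
        raise ValueError("no predictions to summarize")
    return {
        "frac_high_iptm": sum(p["iptm"] > iptm_cut for p in predictions) / n,
        "frac_ok_ptm": sum(p["ptm"] > ptm_cut for p in predictions) / n,
        "rmsd_outliers": [p["name"] for p in predictions if p["rmsd"] > rmsd_cut],
    }

# Invented values for illustration only; they are not taken from SKEMPI.
example = [
    {"name": "complex_A", "rmsd": 0.9, "iptm": 0.91, "ptm": 0.93},
    {"name": "complex_B", "rmsd": 1.7, "iptm": 0.84, "ptm": 0.88},
    {"name": "complex_C", "rmsd": 2.4, "iptm": 0.62, "ptm": 0.71},
    {"name": "complex_D", "rmsd": 5.3, "iptm": 0.35, "ptm": 0.48},
]
report = summarize(example)
print(report)
```

In this toy set, the one complex with low ipTM and pTM is also the RMSD outlier, mirroring the trend that low confidence scores tend to accompany poor agreement with experiment.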
Reverse Protein Folding Problem: 3D Structure to Function
So now we have structures of 200 million plus proteins. Pick one, Protein X, of unknown function. What might its function be? Surely its sequence could be compared to the entire database to find homologous proteins that might give a clue to the function of Protein X. But what if the sequence of Protein X is very divergent from potential sequence homologs since they are very distant from each other evolutionarily? Also, what if comparison proteins in the database are underrepresented? For example, our knowledge of the sequences and structures of proteins from pathogens (viruses and bacteria) is very incomplete, especially since we have studied just a small fraction of the viral and bacterial world.
To get around this problem, the actual predicted or determined 3D structure of Protein X (not its sequence) could be compared to the 3D structures from the databases. Now this would seem enormously complicated since it would be a 3D comparison rather than a 1D comparison of linear sequences. A program called FoldSeek allows 3D structural comparisons to be made in computationally easier ways.
Instead of using an alphabet of actual sequences (such as the single-letter code for the 20 naturally occurring amino acids, ACDEFGHIKLMNPQRSTWYV), a new "structural alphabet" based on the conformations of short stretches of 3-5 alpha carbon (Cα) atoms in the protein backbone has been used, but this doesn't explicitly contain tertiary interactions found in proteins. Instead, FoldSeek uses a 3D interaction alphabet (3Di) with 20 states, the same number as the amino acid alphabet, each defined by 10 interaction "features". A conformational state for residue x is defined with respect to its closest spatial residue, y. For a given amino acid, the state description is less dependent on the next amino acid in the linear sequence. The defined state has more information when x is in a conserved and packed protein core than in a nonconserved, more flexible loop. In contrast, there would be less information if just the backbone structural alphabet were used. Figure \(\PageIndex{5}\) below gives a pictorial view of how the 3Di state for a single amino acid, Val, at a specific position in a 3D structure is defined. Note that the state has ten 3D features, many more than just the conformation of a backbone of 3 amino acids or the next amino acid in the linear sequence.
Figure \(\PageIndex{5}\): Learning the 3Di alphabet. van Kempen, M., Kim, S.S., Tumescheit, C. et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 42, 243–246 (2024). https://doi.org/10.1038/s41587-023-01773-0. Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/.
(1) 3Di states describe tertiary interaction between a residue i and its nearest neighbor j. Nearest neighbors have the closest virtual center distance (yellow). Virtual center positions were optimized for maximum search sensitivity. (2) To describe the interaction geometry of residues i and j, we extract seven angles, the Euclidean Cα distance and two sequence distance features from the six Cα coordinates of the two backbone fragments (blue and red). (3) These 10 features are used to define 20 3Di states by training a VQ-VAE modified to learn states that are maximally evolutionary conserved. For structure searches, the encoder predicts the best-matching 3Di state for each residue.
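The idea of turning a 3D structure into a string of discrete states can be sketched in a heavily simplified form. The Python below is not Foldseek's algorithm: it uses raw Cα positions instead of optimized virtual centers, only two features instead of ten, and a hand-written three-entry codebook instead of a trained 20-state VQ-VAE. All names, coordinates, the clipping range, and the codebook values are assumptions chosen for illustration.

```python
import math

def nearest_neighbors(ca):
    """For each residue i, the index j of the spatially closest other residue."""
    return [
        min((k for k in range(len(ca)) if k != i),
            key=lambda k: math.dist(ca[i], ca[k]))
        for i in range(len(ca))
    ]

def features(ca, i, j):
    """Two illustrative features: C-alpha distance and clipped sequence separation."""
    separation = max(-4, min(4, j - i))  # clipping range is an assumption
    return (math.dist(ca[i], ca[j]), float(separation))

def quantize(vec, codebook):
    """Vector quantization: index of the closest codebook entry."""
    return min(range(len(codebook)), key=lambda s: math.dist(vec, codebook[s]))

def encode(ca, codebook):
    """Map every residue to one discrete state, giving a 1D 'structural sequence'."""
    neighbors = nearest_neighbors(ca)
    return [quantize(features(ca, i, j), codebook) for i, j in enumerate(neighbors)]

# Hypothetical C-alpha coordinates (in A) and a toy 3-state codebook.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.0, 2.0, 0.0), (0.5, 4.5, 0.0)]
toy_codebook = [(3.8, 1.0), (3.8, -1.0), (4.5, -3.0)]
states = encode(ca_coords, toy_codebook)
print(states)
```

Once every residue carries a state label, two structures can be compared with fast sequence-alignment machinery applied to the state strings, which is what makes the 3D search computationally tractable.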
Recently, FoldSeek has been used to find structural and functional similarities of over 67,000 newly predicted viral proteins (which are underrepresented in the PDB) with other proteins of known structure. Of these:
- 62% had distinct structures and were not homologous to proteins in the AlphaFold database (as we indicated above).
- of the remaining 38%, many were structurally homologous to nonviral proteins, suggesting that these viral proteins function similarly to their host analogues.
Similar 3D structures imply similar functions, so probable functions could be ascribed to some novel proteins. Some were involved in the viral escape from the host's innate immune system. We'll explore that in Chapter 5.4: Complementary Interactions between Proteins and Ligands - The Immune System.
The FoldSeek server allows multi-database searches, including AlphaFoldDB (version 4: Proteomes and Swiss-Prot), AlphaFoldDB (version 4) and CATH25 clustered at 50% sequence identity, ESM Atlas-HQ and Protein Data Bank (PDB). These Google Colab sites are also available:
In summary, FoldSeek is useful in several circumstances:
- you have a protein sequence and a comparison to other sequences doesn't give you enough information. A structure-based search would then be helpful;
- you want to design a brand new protein (de novo protein synthesis), and you would want to know that its structure is not similar to other proteins;
- you want to design a protein with a particular function, and want to compare its structure to other proteins with similar function.
Reverse Protein Folding Problem: 3D Structure to Linear Sequence - Designing Proteins From Scratch
In yet another use of machine learning and artificial intelligence, programs can now start with a desired 3D shape (protein backbone, for example) and determine the amino acid sequence necessary to get it. Two programs, ProteinMPNN and RoseTTAFold Diffusion (RFDiffusion), developed by David Baker (who also won the Nobel Prize in Chemistry in 2024) and colleagues, have enabled these predictions. They allow protein structure design, not structure prediction.
Yet another dream that seemed so distant not so long ago was to design a protein from scratch with no linear sequence (hence little alignment information) but with a final desired structure or function in mind. Here are some possible "de novo" design examples of novel proteins that...
- are soluble versions of a known membrane protein, which could advance drug design;
- bind with high affinity to a desired small molecule (much like an antibody), enabling the creation of sensors and protective agents;
- bind target molecules and catalyze their chemical conversion to products, allowing the development of new and nontoxic catalysts;
- bind to another target protein and modulate its function by activating or inhibiting it;
- have novel, unrepresented folds that could further elucidate key principles of protein folding and stability while creating new functionalities.
RFDiffusion
This dream has in large measure been accomplished as well. David Baker has been a pioneer in the field of de novo protein structure design and structure prediction. His group has used several programs including RoseTTAFold Diffusion (RFDiffusion), which uses AI to design new proteins of novel structure and function. RFDiffusion is freely available to anyone for use in Google Colaboratory. It uses structure prediction from RoseTTAFold with an AI "Diffusion" model to create the new structures.
To understand the term diffusion in structure prediction, let's first explore AI/machine learning models for generative image creation. Instead of starting with no previous information, start with a clear image, add random (Gaussian) noise to it (noising), and then try to recreate the original image by a "denoising" process. Some previous information and additional programs would help to constrain the denoising process for generative image creation for a requested image. This process is illustrated in Figure \(\PageIndex{6}\) below. Note the arrows are reversible.
Figure \(\PageIndex{6}\): Xiao, H., Wang, X., Wang, J. et al. Single image super-resolution with denoising diffusion GANS. Sci Rep 14, 4272 (2024). https://doi.org/10.1038/s41598-024-52370-3. Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/.
Simplistically, this is similar to solving X-ray structures of proteins. A given protein in a crystal produces an X-ray diffraction pattern specific to the atoms and their arrangement in the crystal lattice. In the reverse process, the X-ray diffraction pattern can be computationally analyzed to produce the arrangement of atoms (from an initial electron density map) in the lattice that would generate the given diffraction pattern.
In a diffusion model for generative protein structure creation using RFDiffusion, randomly disordered small chemical fragments diffuse together to form a more ordered and realistic protein structure. Information from the database of known protein structures is used to constrain the generative processes through a deep learning–based protein sequence design method called ProteinMPNN (Protein Message-Passing Neural Network). ProteinMPNN differs from Rosetta, a physically-based method that maximizes sidechain packing to produce the lowest energy state. It is more computationally challenging to design a sequence that produces the lowest energy state than to find the lowest energy state for a given sequence. Calculating energies of unwanted nonproductive oligomeric and aggregated states makes this approach intractable.
What is more doable is to carry out these two steps in succession:
- first, search for the lowest-energy sequence for a given backbone structure;
- then search the "universe" of possible structures for the sequence created in the first step to determine if it is indeed the lowest energy conformation.
Methods like Rosetta use physical "rules" to minimize undesired results. For example, restrictions are used in placing hydrophobic side chains on the surface of a protein as these might promote unwanted aggregation states. ProteinMPNN overcomes these issues since it uses data from all solved structures to find the most probable amino acid at a given position. It requires less human theoretical knowledge as it extracts an energy-minimized folded state from an immense amount of structural data. It's a bit like deriving Newton's Laws of Motion from data without the underlying theory even though the data was acquired from systems whose motions and positions are well described by Newton's Laws.
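The phrase "most probable amino acid at a given position" can be illustrated with a toy Python sketch. In the real method a neural network computes such probabilities from the backbone geometry; here the per-position probability tables are simply invented, and the function name is hypothetical, so this shows only the final selection step and not ProteinMPNN itself.

```python
def design_sequence(position_probs):
    """Pick the most probable amino acid at each backbone position."""
    return "".join(max(probs, key=probs.get) for probs in position_probs)

# Invented per-position probabilities (one-letter amino acid codes).
# A buried position favors hydrophobic residues; an exposed one favors charged ones.
toy_probs = [
    {"L": 0.6, "V": 0.3, "K": 0.1},   # hypothetical buried core position
    {"K": 0.5, "E": 0.4, "L": 0.1},   # hypothetical surface position
    {"G": 0.7, "A": 0.3},             # hypothetical tight turn
]
designed = design_sequence(toy_probs)
print(designed)  # LKG
```

Note how the "rule" about keeping hydrophobic residues off the surface is never written down explicitly; it would appear only implicitly, through the learned probabilities.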
Figure \(\PageIndex{7}\) below shows the noising (right to left) and denoising (left to right) processes that can generate a protein structure in a diffusion model. It's exactly parallel to the image deconstruction and reconstruction shown in Figure \(\PageIndex{6}\) above.
Figure \(\PageIndex{7}\): Protein design using RFdiffusion. Diffusion models for proteins are trained to recover corrupted (noised) protein structures and to generate new structures by reversing the corruption process through iterative denoising of initially random noise \(X_T\) into a realistic structure \(X_0\) (top panel). Watson, J.L., Juergens, D., Bennett, N.R. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023). https://doi.org/10.1038/s41586-023-06415-8. Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/.
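The noising half of this process is simple enough to sketch. The Python below applies the generic closed-form Gaussian noising step used in many diffusion models, x_t = sqrt(ā_t)·x_0 + sqrt(1 − ā_t)·ε, to a list of 3D coordinates. It is a sketch under stated assumptions and not RFdiffusion's implementation: the linear noise schedule, its parameter values, the function names, and the toy coordinates are all illustrative choices, and the learned denoising network (the hard part) is not shown.

```python
import math
import random

def alpha_bar(t, total_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t noising steps (linear schedule).

    Assumes total_steps >= 2; the schedule values are illustrative defaults.
    """
    remaining = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (total_steps - 1)
        remaining *= 1.0 - beta
    return remaining

def noise_coords(x0, t, total_steps, rng):
    """Sample x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, with eps drawn from N(0, 1)."""
    a = alpha_bar(t, total_steps)
    return [
        tuple(math.sqrt(a) * c + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for c in atom)
        for atom in x0
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy "structure": five points along a line (unitless, for illustration only).
x0 = [(0.5 * i, 0.0, 0.0) for i in range(5)]
slightly_noised = noise_coords(x0, 10, 1000, rng)
fully_noised = noise_coords(x0, 1000, 1000, rng)
print(alpha_bar(10, 1000), alpha_bar(1000, 1000))
```

At small t almost all of the original structure survives, while at the final step essentially none does, so the last sample is indistinguishable from pure random noise. Generation runs this in reverse, starting from noise and denoising step by step.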
RFDiffusion can create a protein structure without introducing conditions to the final structure (unconditional process). Conditions placed on the denoising process can lead to conditioned structures. Conditions such as symmetric noise, a binding target, a functional motif such as pre-positioned amino acids in an active site, and a symmetric motif can lead to synthetic oligomers, a binder protein that interacts with a target protein, an active site with the correct 3D disposition of catalytic residues, and symmetrical scaffolds, respectively. These examples are illustrated in Figure \(\PageIndex{8}\) below.
Figure \(\PageIndex{8}\): RFdiffusion is broadly applicable for protein design. RFdiffusion generates protein structures either without further input (top row) or by conditioning on (top to bottom): symmetry specifications; binding targets; protein functional motifs or symmetric functional motifs. In each case random noise, along with conditioning information, is input to RFdiffusion, which iteratively refines that noise until a final protein structure is designed. Watson, J.L. et al., ibid.
Figure \(\PageIndex{9}\) below shows that the final structure predicted for a 300 amino acid protein sequence by AlphaFold (bottom row) is almost identical to the final structure produced by RFDiffusion (top row).
Figure \(\PageIndex{9}\): An example of an unconditional design trajectory for a 300-residue chain, depicting the input to the model (\(X_t\)) and the corresponding \(\hat{X}_0\) prediction. At early timesteps (high \(t\)), \(\hat{X}_0\) bears little resemblance to a protein but is gradually refined into a realistic protein structure. Watson, J.L. et al., ibid.
Unconditional RFDiffusion models for protein sequences up to 600 amino acids are essentially the same as those produced by AlphaFold.
Figure \(\PageIndex{10}\) below shows how hot spots (key binding residues) in a target protein are used as a condition in an RFDiffusion model in the de novo synthesis of a mini-binder for a target protein.
Figure \(\PageIndex{10}\): RFdiffusion generates protein binders given a target and specification of interface hotspot residues. Watson, J.L. et al., ibid.
The video below from the Baker lab (obtained from YouTube at https://youtu.be/geqlzPsigQo) shows how a protein structure can be created that binds to a predefined structure, in this case, the insulin receptor, using conditional RFDiffusion.
The same video is found at the Baker site at: https://www.bakerlab.org/2023/07/11/...rotein-design/
The structures produced by RFDiffusion and ProteinMPNN for any given sequence can be verified by creating the protein and analyzing its structure using X-ray crystallography, NMR, or cryo-EM. Less precise methods such as CD spectroscopy are also used to get a simpler measure of the overall secondary structure of the synthesized protein.
Examples
The following iCn3D models of crystal structures using these methods illustrate the power of RFDiffusion methods in creating new proteins of defined structure and function.
Examples of de novo protein design | Interactive iCn3D model with links |
GP130 (IL-6 coreceptor) in complex with a de novo designed IL-6 mimetic (8UPA) Cytokine storms (cytokine release syndrome) are often deadly inflammatory responses accompanying bacterial or viral infections (such as Covid-19 in those with severe disease). The storm is associated with the overexpression and release of two proinflammatory protein cytokines, interleukin 1 (IL-1) and interleukin 6 (IL-6), by activated immune cells such as macrophages. IL-1 and IL-6 bind to their receptors (IL-1R and IL-6R) with high affinity. IL-6 also binds to a coreceptor (GP130) needed for cytokine release. Inhibitors used to interfere with the cytokine:receptor complex can persist too long and have deleterious effects since an appropriately amplified immune response is needed against bacterial and viral infections. A small protein antagonist (a minibinder or MB) with high affinity (pM to nM dissociation constant) for the receptor and the IL-6 coreceptor was made through de novo design and proved protective against a cytokine storm in animal models. The structure of the IL-6 mimetic with its coreceptor, GP130, is shown in the iCn3D model to the right. The computationally designed structure was a close match to the X-ray structure. Reference: Huang, B., Coventry, B., Borowska, M.T. et al. De novo design of miniprotein antagonists of cytokine storm inducers. Nat Commun 15, 7064 (2024). https://doi.org/10.1038/s41467-024-50919-4 |
Figure \(\PageIndex{11}\): GP130 (IL-6 coreceptor, gray) in complex with a de novo designed IL-6 mimetic (8UPA, red). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...rAQXeMxWaq74Y6 Here is another link to see a surface representation of the two interacting proteins: https://structure.ncbi.nlm.nih.gov/i...3bbFWhy2eHNGAA |
Designed Influenza HA binder, HA_20, bound to Influenza HA (8SK7) Hemagglutinin (HA) from the influenza virus isa trimericmembrane protein. Each "monomer" is a heterodimer consisting of two different chains, HA1 and HA2. The HA1 domainbinds to a particular sugar, sialic acid, found on many human cells, but most importantly in the respiratory tract. The HA2subunit is transmembrane. We receive a vaccine each year that recognizes the HA protein since the globular head part of HA that interacts with the human cell surface mutates so quickly from year to year. Large shifts in the structure of the influenza HA protein lead to pandemics. Parts of the HA molecule are more conserved and are somewhat sequestered from the human immune system. Targeting them could lead to a more permanent and universal vaccine. A small influenza binder was synthesizedde novo and binds very tightly(nanomolardissociation constant). The de novo-designed protein had essentially the same structure as the AlphaFoldcomputational model.An iCn3D model showing the interaction of the HA binding with one HA1:HA2heterodimer is shown to the right. Reference: Nature 620, 1089–1100 (2023). https://doi.org/10.1038/s41586-023-06415-8. Creative Commons Attribution 4.0. International License. http://creativecommons.org/licenses/by/4.0/. |
Figure \(\PageIndex{12}\): Designed Influenza HA binder, HA_20, bound to Influenza HA (8SK7). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...xaAUkDPBBUeob7 The gray is HA2, the cyan is HA1, and the red/yellow-coded secondary structure is the designed HA minibinder. The biological HA complex contains three copies of the heterodimeric structure shown above. Here is another link to see a surface representation of the three interacting proteins: |
Pentameric helical bundle protein (8U5W) De novo protein synthesis was used to create a single protein chain (i.e., a monomeric protein) with a single C5 rotational symmetry axis. Open the iCn3D model to the right. It contains a single rotational axis (red line). Rotation around the axis by \(360^\circ/5 = 72^\circ\) reproduces the identical structure. The designed protein also displays near-infrared fluorescence when it binds to a synthetic dye, merocyanine. The protein forms a covalent Schiff base with the dye. If the Schiff base is protonated, there is a large red shift in both the excitation and emission wavelengths in the fluorescence spectra. The protein/dye complex can be used for tissue imaging at a greater depth than other visible light fluorophores. Reference: https://www.researchsquare.com/article/rs-4652998/v1 |
Figure \(\PageIndex{13}\): Pentameric helical bundle protein (8U5W). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...Xm7aBQhm2GEKWA |
Symmetric Oligomers In contrast to the previous example of a symmetric monomer, the de novo protein models to the right contain multiple subunits in oligomers that display different types of cyclic symmetry. Figure \(\PageIndex{14}\) to the right displays C2 symmetry, with a rotation of \(360^\circ/2 = 180^\circ\) around the axis resulting in an identical structure. The dimer also displays allostery: it changes its shape globally on the addition of effector molecules. Allostery will be explained in Chapter 5. Figure \(\PageIndex{15}\) to the right is a homo 6-mer of identical subunits with C6 symmetry. Rotation of \(360^\circ/6 = 60^\circ\) around the axis results in an identical structure. Figure \(\PageIndex{16}\) to the right is a homo 8-mer of identical subunits with C8 symmetry. Rotation of \(360^\circ/8 = 45^\circ\) around the axis results in an identical structure. |
Figure \(\PageIndex{14}\): Allosterically Switchable De Novo Protein sr322, In Closed State (8UTM). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...HSZ7Ddt8v9R3J9
Figure \(\PageIndex{15}\): Designed modular protein oligomer C6-79 (8f6r). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...SDwnTmNhovAny9
Figure \(\PageIndex{16}\): Designed modular protein oligomer C8-71 (8f6q). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...ydRkeJAaGGag38 |
A protein with an active site A protein was designed using RFDiffusion that recreates an active site containing three key catalytic residues from a native enzyme, cytotoxic ribonuclease alpha-sarcin (1DE3). The left model in Figure \(\PageIndex{17}\) below and the iCn3D model in Figure \(\PageIndex{18}\) in the adjacent right column show the three active site residues used for conditional de novo protein modeling. The middle two images below show the isolated input catalytic "triad" and the structure created by RFDiffusion. The right image below is a zoomed image of the active site in the designed protein. Figure \(\PageIndex{17}\): Comparison of native ribonuclease sarcin and RFDiffusion designed protein. Supplemental Figure, Watson, J.L. et al., ibid. | This iCn3D model is for cytotoxic ribonuclease alpha-sarcin (1DE3). Three active site amino acids, H50, E96, and H137, were used as the conditioning input to create the de novo protein with the same active site residues (shown to the left).
Figure \(\PageIndex{18}\): Cytotoxic ribonuclease alpha-sarcin (1DE3). (Copyright; author via source). Click the image for a popup or use this link: https://structure.ncbi.nlm.nih.gov/i...WR1oo3fru633d6 |
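The cyclic symmetry described above for the pentameric bundle (C5) and the oligomers (C2, C6, C8) can be checked numerically. The following is a minimal sketch, not taken from the source: it places hypothetical subunit centers evenly on a ring (the radius and the helper names `ring`, `rotate_z`, and `same_set` are illustrative assumptions, not coordinates from the PDB entries) and confirms that rotating by \(360^\circ/n\) about the symmetry axis maps the arrangement onto itself.

```python
import math

def rotation_angle_deg(n):
    """Smallest rotation (in degrees) that maps a C_n-symmetric object onto itself."""
    return 360.0 / n

def rotate_z(point, angle_deg):
    """Rotate a 3D point about the z axis (taken here as the symmetry axis)."""
    a = math.radians(angle_deg)
    x, y, z = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def ring(n, radius=10.0):
    """Hypothetical subunit centers: n points evenly spaced on a circle."""
    return [rotate_z((radius, 0.0, 0.0), k * rotation_angle_deg(n)) for k in range(n)]

def same_set(a, b, tol=1e-6):
    """True if the two point sets have equal size and every point in a has a partner in b."""
    return len(a) == len(b) and all(
        any(math.dist(p, q) < tol for q in b) for p in a
    )

# C2 dimer, C5 bundle, C6 and C8 oligomers from the examples above
for n in (2, 5, 6, 8):
    angle = rotation_angle_deg(n)
    centers = ring(n)
    rotated = [rotate_z(p, angle) for p in centers]
    print(f"C{n}: rotation by {angle:g} degrees reproduces the arrangement: "
          f"{same_set(centers, rotated)}")
```

Rotating by any angle that is not a multiple of \(360^\circ/n\) (for example, \(60^\circ\) applied to the five-point ring) does not reproduce the arrangement, which is what distinguishes C5 from C6 symmetry.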
In Chapter 11.1 we will explore how RFDiffusion can be used to create novel membrane proteins and soluble versions.
One final comment: Structures predicted by these AI programs must be subjected to experimental validation of structure and function. Because creating new structures with designed functions is now so easy, we must be careful not to accept the results blindly without supporting experimental evidence.
AlphaProteo
AlphaProteo from Google DeepMind is also used to design protein binders for target sites on proteins. Download this file for a video of a synthesized protein binder designed for the SARS-CoV-2 spike receptor-binding domain (reference). This program is not yet freely available for use (as of 11/11/24). The machine learning methods used in AlphaProteo were not reported in the preprint reference because of "biosecurity and commercial considerations", so we can't explain the basis of the program as we did above for RFDiffusion. Figure \(\PageIndex{19}\) below shows the general steps involved in developing binders that interact with "hotspot" sites on target proteins.
Figure \(\PageIndex{19}\): Overview and experimental performance of AlphaProteo. Vinicius Zambaldi et al. De novo design of high-affinity protein binders with AlphaProteo. Submitted 9/12/24. https://arxiv.org/abs/2409.08022. https://creativecommons.org/licenses/by-nc-sa/4.0/
Panel (A) Schematic of the design system. The generative model outputs designed structures and sequences of binder candidates, and the filter is a model or procedure that predicts whether a design will bind.
Panel (B) Schematic of target-structure-conditioned binder design as performed by the generative model.
Panel (C) Crystal structures (light yellow) and hotspot residues (dark yellow spheres) of seven target proteins for binder design experiments in this work. VEGF-A and IL-17A are both disulfide-linked homodimers. See Table S1 for PDB IDs and hotspot residue numbers.
Figure \(\PageIndex{20}\) below shows the interactions of the de novo synthesized binders with 7 target proteins.
Figure \(\PageIndex{20}\): Biochemical characterization of representative binders for each target-design model.
The binders all interacted tightly with their target proteins.
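To put "tight binding" in quantitative terms, here is a minimal sketch that assumes simple 1:1 equilibrium binding with the free binder concentration held fixed, so that the fraction of target occupied is \([L]/(K_D + [L])\). The concentrations and \(K_D\) values below are hypothetical round numbers chosen for illustration; they are not values read from Figure \(\PageIndex{20}\) or from the cited papers.

```python
def fraction_bound(free_ligand_molar, kd_molar):
    """Fraction of target occupied at equilibrium for 1:1 binding: [L] / (Kd + [L])."""
    return free_ligand_molar / (kd_molar + free_ligand_molar)

# Hypothetical free binder concentration of 1 nM
binder_conc = 1e-9

# Hypothetical dissociation constants spanning tight (pM) to weak (uM) binding
for label, kd in [("100 pM", 1e-10), ("1 nM", 1e-9), ("1 uM", 1e-6)]:
    occupancy = fraction_bound(binder_conc, kd)
    print(f"Kd = {label}: fraction of target bound = {occupancy:.3f}")
```

When the binder concentration equals \(K_D\), half of the target is occupied. A binder with a picomolar to nanomolar \(K_D\) therefore occupies most of its target at concentrations where a micromolar binder would occupy almost none.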
Challenges that remain
Here are some examples that pose challenges:
- Binders that affect the function of a protein: These include small-molecule binders (i.e., drugs) that target orthosteric or allosteric sites. Essentially, this is the task of the drug and pharmaceutical industry. Designing binders is especially hard for membrane proteins. Binders that mimic small drugs are also difficult to design because the databases are more limited and often proprietary, so the training set is smaller. In addition, the differences between binders that activate and those that inhibit a target protein can be subtle.
- De novo synthesis of protein catalysts: Much of a protein structure is used to bring key groups into a stable configuration for catalysis. Synthetic chemists try to make small transition metal catalysts that mimic the function of proteins with a catalytic site that often contains a metal ion. This suggests that natural proteins might not be the most efficient models to mimic when producing novel protein catalysts. Also, proteins that differ in 3D structure can carry out similar reactions.
- Conformational flexibility in proteins: Unless we look at the dynamic structures of proteins, our minds can be trapped into creating just the most stable, low-energy protein structure. Yet flexibility and conformational changes are key to protein function and regulation. Programming conformational change into the algorithms for de novo synthesis is another complicated task.
- Creating proteins and protein complexes with functions other than catalysis: Many macromolecular assemblies (inflammasomes, proteasomes, regulated membrane pores, mobility proteins, etc.) provide critical cellular functions. Creating new ones could offer novel ways to modulate cell function. One example would be to create nanoparticles that can deliver "cargo" (such as vaccines) into cells or potentially sequester and eliminate deleterious intracellular components (like misfolded and aggregated proteins).