1. Perspectives on frustration
Life is based on molecular information processing. Thousands of nucleic acids and proteins are organized in cells, cooperate and reproduce themselves as a group, and respond to their environment. Most biomolecules individually carry out the complex task of decoding their own one-dimensional sequences in order to find three-dimensional structures (Anfinsen, 1973), which in turn yield four-dimensional dynamical patterns that permit each molecule to carry out its ‘functions’ (Frauenfelder et al. 1991). This decoding task, the first step of which is called folding, is a challenging information processing step that only recently has been successfully carried out for some small proteins by computers (Piana et al. 2012). Even with computer aid, the task severely challenges the most clever human gamers (Cooper et al. 2010). Contemplating why this task is so hard both for computers and for people, but is apparently not hard for the molecules themselves, leads us to consider the concept of frustration in biomolecules, the subject of this review.
In human psychology frustration is an emotion, ‘the experience of non-fulfillment of some wish or need’ (Anonymous, 2013). The emotion of frustration, according to psychologists, often arises because a person has conflicting goals that interfere with one another (see Fig. 1). It is this type of internal frustration that can plague information processing rather generally even outside the human context. This sort of frustration is important in formal logic and in computer science and is closely related to the notion of frustration as it is used to describe physical systems, such as disordered magnets (Néel, 1970; Nordblad, 2013; Schiffer & Ramirez, 1996; Vannimenus & Toulouse, 1977; Wannier, 1950a) or biomolecules (Bryngelson & Wolynes, 1987). For these inanimate systems, the ‘goals’ are set by evolution and are reflected in the systems’ own laws of development and motion given the genetic information. These laws are usually encoded in an energy function or ‘Hamiltonian’, which in turn determines the kinetics and thermodynamics at the macroscopic level.
Frustration slows the performance of computer algorithms and indeed can make some tasks impossible for a computer to carry out in a timely fashion. For physical systems, frustration leads to complex dynamics that often span many orders of magnitude in time scale. Some disordered magnets, called spin glasses, never seem to settle into any equilibrium state, no matter how long one waits. This behavior resembles the way some liquids do not crystallize into their lowest energy structure when cooled below their freezing point but instead become trapped in a glassy state. Proteins have some of the characteristics of glasses at low temperatures (Bryngelson & Wolynes, 1987; Frauenfelder et al. 1991) but at physiological temperatures the functionally relevant motions are generally fast enough to reliably take place in a cell (Frauenfelder, 1987). The glassy dynamics of a condensed matter system or of a biomolecule resembles a computer code still in the process of crunching the problem: generally, when a system is glassy, its actual state depends on its history and kinetics. While kinetic control of pathways is paramount at the cellular and network level in biology, the physics of individual biomolecules is largely controlled thermodynamically and is independent of the history of the macromolecular system. We shall see that the reason for this is that Nature has evolved biomolecules in which frustration is minimal. Nevertheless some frustration is present in evolved biomolecules and is crucial for many aspects of their function.
The purpose of this paper is to explain the basic ideas about frustration in biomolecules and how these ideas let us understand many aspects of the architecture of biomolecules and how that architecture connects with their motions. These ideas are simultaneously both seductively simple and perilously subtle to understand completely. To do so without a mathematical framework is impossible. Such a framework is provided by the energy landscape theory of protein folding, which gives criteria for quantifying frustration globally in large systems (Bryngelson & Wolynes, 1987; Goldstein et al. 1992b; Wolynes, 2005). Formally such criteria would be rigorous only for infinite-sized proteins, so they can be applied to finite systems only by using approximations. The global criterion for quantifying frustration can be implemented at several levels of mathematical sophistication and has led to concrete and useful protein structure prediction algorithms (Davtyan et al. 2012). A complete mathematical description is quite complicated and this review will only sketch some of the basics (Bryngelson et al. 1995; Plotkin & Onuchic, 2002). But this summary will allow us to concentrate on reviewing how a local notion of frustration can also be quantified (Ferreiro et al. 2007) and how localizing frustration can be used to look at protein structures and motion, both in a survey sense and to understand specific examples. The local notion of frustration is not rigorous and the approximate form we use could be severely criticized by statistical mechanics purists.
Nevertheless we have found in several investigations, that a very simple estimate of local frustration can give surprising insights into how proteins fold and misfold, how proteins move once they are folded and how they may have evolved to carry out their functions.
The structure of this review is as follows. First, we develop the notion of frustration, seeing how it enters the area of abstract logic and how it is used for the simplest condensed matter systems, which are magnets. We then discuss how the frustration concept applies specifically to heteropolymers and discuss the role of frustration in the energy landscape theory of protein folding. We further review tests of folding landscape theory in the stylized computer simulation world provided by lattice models of proteins. We then briefly review how understanding frustration globally for real proteins provides a ‘license to do bioinformatics’, enabling us to extract physical information about molecular forces from structural databases. Studying frustration thus provides practical tools for protein structure prediction and leads to an appreciation of the complexity of the protein-folding code. Some illustrations of how frustration occurs in evolved natural proteins and in artificially designed proteins are then discussed. Following this we review how residual frustration in some biomolecules affects their folding kinetics, focusing on some anecdotal examples. En passant we discuss broader uses of the frustration concept in folding, describing topological frustration, chemical frustration as it occurs in metalloproteins and the role of symmetry, which surprisingly can frustrate the folding process. The larger part of the review then goes beyond the study of folding to see how the notion of frustration helps us understand biomolecular function, by examining the role of frustration in metastable protein states, binding processes, enzymatic catalysis and allosteric transitions. We also discuss frustration in protein aggregation and in protein–DNA assemblies.
1.1. Frustration, logic and magnets
To get a feeling for frustration it is fun to start with some old, seemingly simple, logic puzzles. Medieval philosophers developed abstract logic using such puzzles, anticipating many modern ideas in metamathematics and computational complexity. They are known as ‘sophismata’ (Heytesbury, 1494; Kretzmann, 1982). Consider these two sentences:
1. All men are donkeys or men and donkeys are donkeys.
2. The sentence ‘Socrates says something false’ in the case where Socrates says nothing other than ‘Socrates says something false’.
Are these statements true or false? How do you feel when you try to figure this out? Frustrated? Figuring these puzzles out in the original Latin in which they were written is even more challenging than it is in English translation. Modern students of logic can do better than most students did centuries ago by systematically using the idea that one must first disambiguate the sentences (with appropriate punctuation in English, such as parentheses and commas, not used in classical Latin) and then judge the truth or falsity of the elemental parts; e.g. we are all pretty sure that the proposition ‘All men are donkeys’ is false. After generating a ‘truth table’ of possibilities, a ‘simple’ search of all the entries tells us the answer. Using more cumbersome argumentation, Albert of Saxony concluded that two different renditions of the truth or falsity of sentence 1 are possible. Sentence 2 is one of the forms of the ‘liar paradox’ with which even 20th century philosophers such as Whitehead and Russell continued to grapple. In modern computer science, much longer strings of logical statements are needed to answer practical problems such as how to detect objects in a scene that is digitally encoded (Field, 1999). Are there methods that allow a computer to decide the truth or falsity of very complex, long statements? Can a computer find consistent solutions of any set of abstract assertions? Such questions are known as constraint satisfaction problems (Mézard & Mora, 2009; Monasson et al. 1999). In general, a huge number of possibilities must be tested to see if the logical constraints in the sophism can be satisfied in some way, and the computations can become quite lengthy.
That solving a particular problem must lead to frustration is easiest to see when the choices for the subparts of the problem are few – the binary choice of whether an individual proposition is true or false being the simplest. Binary choice is the ‘atom’ of logic and computer science. The consequences of conflicts for problem solving are most profound, however, when many such multiple choices must be made to specify the solution. This is because the search space through the truth table becomes exponentially large with the number of choices needed to specify the answer. We quickly see the liar paradox has no solution but for people, the task of deciding whether the longer strings of logical statements in typical computer programs actually lead to solutions or whether they do not is much, much harder. As a practical matter we usually just run the programs and check whether they stop or whether they yield a unique solution after trying out all possible inputs.
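The exhaustive truth-table search described above can be sketched in a few lines. This is only an illustrative brute-force check, not a method taken from the works cited: the function name `satisfying_assignments`, the signed-integer clause encoding and the two toy problems are assumptions made for the example, and the single-variable contradiction is merely a crude stand-in for the self-referential liar paradox.

```python
from itertools import product

def satisfying_assignments(clauses, n_vars):
    """Return every True/False assignment that satisfies all clauses.

    Hypothetical encoding: each clause is a list of signed integers, where
    +k asserts variable k and -k negates it (variables numbered from 1).
    A clause holds when at least one of its literals holds.
    """
    solutions = []
    # The truth table has 2**n_vars rows, and every row is visited.
    for values in product([False, True], repeat=n_vars):
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            solutions.append(values)
    return solutions

# "x is true" AND "x is false": no row of the truth table can satisfy both.
contradiction = [[1], [-1]]
# (a OR b) AND (NOT a OR c) AND (NOT b OR NOT c): satisfiable in two ways.
toy = [[1, 2], [-1, 3], [-2, -3]]

print(satisfying_assignments(contradiction, 1))  # []
print(satisfying_assignments(toy, 3))
```

Because the loop visits all `2**n_vars` rows, each additional proposition doubles the work, which is the exponential growth of the search space noted in the text.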
In physics, frustration entered the lexicon in the study of magnets (Anderson, 1950; Vannimenus & Toulouse, 1977). Like the logical puzzles we have just discussed, the elementary units in magnets, the spins, have a limited number of choices – classically they can ‘choose’ to be up or they can ‘choose’ to be down. Actually in some magnets the quantum mechanical aspect of the spin can give the spins the opportunity not to choose at all but to remain in the limbo of a quantum superposition of up and down, like a resonance hybrid in organic chemistry, but this complication corresponding to non-Aristotelian logic can often be ignored. The analogs of the logical-grammatical constructions for the spins in a magnet are the energies of interaction between the spins. The simplest energy describing the interaction between two spins, one on the ith atom, $S_i$, and one on another, usually neighboring, jth atom, $S_j$, has the form $E_{ij} = -J_{ij} S_i S_j$, where the two choices for $S_i$ and $S_j$ are ±1, i.e. up or down. If $J_{ij}$ is positive, the low-energy state aligns the two spins in the same direction; if $J_{ij}$ is negative the spins prefer to point in opposite directions.
If we have a large magnet the energy can be taken as the sum of the individual (pair) interactions, $E = -\sum_{i \neq j} J_{ij} S_i S_j$. The potentially frustrating problem to solve then is: at low temperature, what is the lowest energy arrangement of the up and down spins that can be reached dynamically? (See Fig. 2.) This problem is ‘solved’ by the system when it is cooled and random thermal motions allow it to try out various configurations or arrangements of the spins, according to their energy.
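The pair-sum energy above is easy to evaluate directly. The following minimal sketch assumes that each interacting pair is listed once in a dictionary of couplings; the function name `ising_energy` and the four-spin ring are illustrative choices, not part of the original discussion.

```python
def ising_energy(spins, couplings):
    """Energy E = -sum of J_ij * S_i * S_j over the listed pairs.

    spins: sequence of +1 / -1 values.
    couplings: dict mapping a pair (i, j) to its coupling J_ij; each pair
    is assumed to appear once, so no bond is double counted.
    """
    return -sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())

# Four spins on a ring, every bond ferromagnetic (J = +1).
ring = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
all_up = [1, 1, 1, 1]
one_flipped = [1, 1, 1, -1]

print(ising_energy(all_up, ring))       # -4: every bond satisfied
print(ising_energy(one_flipped, ring))  # 0: two bonds satisfied, two broken
```

Flipping one spin breaks the two bonds that touch it, raising the energy from -4 to 0.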
If all the $J_{ij}$ are positive, it is easy even for people to see the answer: the lowest energy state has all the spins pointing in the same direction – they are either all up or all down. What if all (or even some) of the $J_{ij}$ are negative? Now things seem to be tricky, but sometimes people can figure out the answer pretty quickly for certain situations. Consider first what turns out to be an easy problem: suppose the atoms bearing the spins are situated on a simple square lattice and all the $J_{ij}$ are negative. In this case, an interlaced checkerboard of up and down spins has the lowest energy and simultaneously minimizes every local interaction $E_{ij}$. The possible conflict between different local interactions can be resolved (see Fig. 2). The resulting pattern is called an antiferromagnet. This solution, while it can be quickly checked by anyone, was not really that easy to see a priori, and indeed (along with other insights) figuring out this pattern netted its discoverer, Louis Néel, a Nobel Prize in Physics (Néel, 1970). Although it takes at least a few moments for humans to see the answer for the square lattice antiferromagnet, the system itself has a much easier time figuring out what to do. When an antiferromagnet is cooled below a certain temperature it orders spontaneously and fairly quickly.
But antiferromagnets on other lattices can be more complex than the uniform square lattice antiferromagnet case. Imagine the spins reside on atoms situated on a triangular lattice with all antiferromagnetic interactions (Fig. 2). Already a single triangle of the lattice shows the problem. There is no way to arrange the spins, so that every component interaction can simultaneously be minimized; always at least one interaction must be still in a high-energy state. We would say the individual triangle is ‘frustrated’.
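The contrast between the square and the triangular plaquette can be checked by brute force. In this sketch (the helper name `frustration_summary` and the bond dictionaries are assumptions for illustration) a bond counts as satisfied when $J_{ij} S_i S_j > 0$, and all $2^n$ spin states are enumerated.

```python
from itertools import product

def frustration_summary(n_spins, bonds):
    """Return (fewest unsatisfied bonds, number of states achieving that).

    bonds maps a pair (i, j) to its coupling J_ij. A bond is unsatisfied
    when J_ij * S_i * S_j < 0, i.e. the pair sits in its high-energy state.
    """
    counts = []
    for spins in product([-1, 1], repeat=n_spins):
        counts.append(sum(1 for (i, j), J in bonds.items()
                          if J * spins[i] * spins[j] < 0))
    best = min(counts)
    return best, counts.count(best)

# All couplings antiferromagnetic (J = -1).
square = {(0, 1): -1, (1, 2): -1, (2, 3): -1, (3, 0): -1}
triangle = {(0, 1): -1, (1, 2): -1, (2, 0): -1}

print(frustration_summary(4, square))    # (0, 2): checkerboard and its mirror
print(frustration_summary(3, triangle))  # (1, 6): one bond always unsatisfied
```

The square has exactly two states that satisfy every bond, while the triangle always leaves one bond unsatisfied and has six equally good compromises, a small preview of the degeneracy that frustration brings.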
What happens for a big lattice of frustrated triangles? In the strictly two-dimensional system, one finds a huge degeneracy of possibilities. There is no unique ground state. This would lead to an apparent contradiction of the Third Law of Thermodynamics, which requires entropy to vanish at absolute zero, implying a unique ground state. Wannier showed that T = 0 was a critical point for the triangular antiferromagnet and thus there would be large length scale fluctuations even at low temperature (Wannier, 1950a). In real three-dimensional systems, weaker interactions between planes of spins restore the Third Law and lead to ordering to a unique lowest energy state, but the energies required to excite the system from the ground state now depend only on the weak interactions and so are much smaller than you would otherwise expect. The kind of ordering found in such three-dimensional systems is also very complex and generally hard to predict with precision (Collins & Petrenko, 1997).
You might think the triangular antiferromagnet case is special because of the exact symmetry of the frustration and indeed it is. Without symmetry, frustration and degeneracy still go together but manifest themselves in a very different way: very slow dynamics emerges that gives the system difficulty reaching equilibrium at low temperatures. Although there is a ground state it can be hard to find. Such a system is called a ‘spin glass’. The models that are simplest to describe place spins on a simple lattice and at random assign some bonds to be ferromagnetic and others to be antiferromagnetic (Fig. 3). A particular assignment of bonds is analogous to a sophism having a particular combination of logical connectors (AND, NOT, OR…). Although there are numerous approaches for finding the low-energy states of such a magnet, in the general case all known algorithms take a very long time. This time scales up exponentially with the size of the magnet. The reason for this slowness is there are many choices of spin assignments that have similar energies that are relatively stable but that are globally quite dissimilar. It is difficult for the system to carry out such rearrangements between degenerate states. How does frustration lead to barriers? In fact, the locality of moves makes this hard to understand but you can get some intuition from a weakly frustrated spin system with only a few frustrated interactions like the example shown in Fig. 3. Since most of the bonds in Fig. 3a are ferromagnetic, at low temperature most of the spins will be parallel, let us say oriented up. However, some interactions must be unsatisfied locally. In the example shown there are two ways of arranging these ‘misaligned’ spins. To rearrange from one to the other a very complicated set of spin changes must be made. This rearrangement is entropically disfavored at high temperature and energetically disfavored at low temperature. 
Many large-scale moves must be made for the system to find out which arrangement or set of arrangements is most stable.
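To make the cost of the search concrete, the sketch below finds ground states of a tiny lattice magnet by exhaustive enumeration, comparing a pure ferromagnet with a random mixture of ferromagnetic and antiferromagnetic bonds. Everything here is an illustrative assumption: the 3 × 3 open lattice, the ±1 couplings, the fixed random seed and the function names. Exhaustive search is used only because it is transparent; it visits $2^N$ states and so becomes hopeless for large magnets.

```python
import random
from itertools import product

def lattice_bonds(L, coupling_for_bond):
    """Nearest-neighbour bonds of an L x L square lattice with open edges."""
    bonds = {}
    for r in range(L):
        for c in range(L):
            site = r * L + c
            if c + 1 < L:                      # bond to the right neighbour
                bonds[(site, site + 1)] = coupling_for_bond()
            if r + 1 < L:                      # bond to the neighbour below
                bonds[(site, site + L)] = coupling_for_bond()
    return bonds

def ground_states(n_spins, bonds):
    """Exhaustively find the minimum energy and how many states reach it."""
    best_energy, degeneracy = None, 0
    for spins in product([-1, 1], repeat=n_spins):
        energy = -sum(J * spins[i] * spins[j] for (i, j), J in bonds.items())
        if best_energy is None or energy < best_energy:
            best_energy, degeneracy = energy, 1
        elif energy == best_energy:
            degeneracy += 1
    return best_energy, degeneracy

L = 3
rng = random.Random(0)                         # fixed seed, arbitrary choice
ferro = lattice_bonds(L, lambda: 1)
glass = lattice_bonds(L, lambda: rng.choice([-1, 1]))

print(ground_states(L * L, ferro))   # (-12, 2): all up or all down
print(ground_states(L * L, glass))   # depends on the random bonds drawn
```

For the ferromagnet all 12 bonds are satisfied in exactly two states. For a random assignment the ground-state energy can never be lower than -12, and whenever some plaquette is frustrated it is strictly higher.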
When frustration is widespread the system resembles a glass – the exact arrangement of spins found at any one time depends on the history of the system, just as a liquid cooled into a glass has properties that depend on its annealing schedule. This is the feature of frustration most noted in the condensed matter physics of magnets. The global effect of widespread frustration is that it leads to such completely new phases of matter. The illustrative example, however, focuses attention on a point of special relevance to biochemistry: even a little frustration only locally may lead to new degeneracies that could have dynamical consequences.
Nearly degenerate states will interconvert slowly not only because the frustrated parts of the system have to be moved about, but possibly also because other parts of the system that are not frustrated have to move in order to allow the rearrangement to occur. This entails breaking up strong interactions, which is energetically costly and thus difficult to achieve by random thermal motions.
Although completely random magnets and weakly, locally frustrated magnets are not very familiar real-world systems, the binary choices of spins in magnets have made magnetism a convenient playground to understand degeneracies in many other non-magnetic systems. While spin glasses have huge numbers of near-degenerate states, ferromagnets or the simple antiferromagnets have only a few low-energy patterns. We can imagine magnets that exhibit intermediate levels of degeneracy. The most conceptually important of such mildly complex landscapes are models that encode a modest number of spin patterns. Such magnets would reliably order to a specific input pattern if only a fraction of the spins were initially organized to resemble the template. Such magnetic models thus represent a very simple model of how one could construct an associative memory using very simple components. Little (1974) and later Hopfield (1982) began the study of such associative memory models. The energy function of such a magnet has the form $E = -\sum_{ij} \sum_{\mu=1}^{M} \xi_i^{\mu} \xi_j^{\mu} S_i S_j$, where there are M patterns $\xi_i^{\mu}$ of spins $S_i$ and the sum goes over all pairs. If there were only one such pattern, $\xi_i$, the magnet would not be at all frustrated but would really be just a disguised ferromagnet. Its ground state corresponds to each spin $S_i$ taking on the value $\xi_i$. If only a few patterns are encoded you can see that patterns close to any one of them remain as fairly low-energy states (see Fig. 4), and the existence and encoding of the other patterns just adds some random energy contributions that probably do not completely destabilize the state, at least if there are not too many interfering patterns. If the magnet starts in a configuration resembling one of the individual patterns at low temperature, as the spins reconfigure they will rearrange to resemble the complete encoded ‘memorized’ pattern.
Such an easily reachable pattern of spins would resemble the native structure of a protein that can be reached quickly and is fairly robust to environmental changes (Fig. 4). Like a human brain which can retrieve memories after losing many cells, if many bonds of the lattice (but not too many!) are modified the pattern still remains retrievable. The ‘memory’ pattern is stored globally not just locally. This is quite different from a typical digital computer today where memories are put into specific addresses, locally, and recall can be far from robust.
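A minimal sketch of such an associative memory magnet follows. It assumes zero-temperature dynamics in which each spin is flipped whenever that lowers the energy, counts each pair once in the energy, and uses two hand-picked orthogonal 16-spin patterns; the function names and the example patterns are illustrative assumptions rather than anything specified in the models cited.

```python
def hopfield_weights(patterns):
    """Couplings W_ij = sum over patterns of xi_i * xi_j, with W_ii = 0."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += xi[i] * xi[j]
    return W

def energy(W, spins):
    """E = -sum over pairs i < j of W_ij * S_i * S_j (each pair once)."""
    n = len(spins)
    return -sum(W[i][j] * spins[i] * spins[j]
                for i in range(n) for j in range(i + 1, n))

def recall(W, spins, max_sweeps=20):
    """Flip any spin that opposes its local field until nothing changes."""
    spins = list(spins)
    n = len(spins)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(W[i][j] * spins[j] for j in range(n))
            if spins[i] * field < 0:           # flipping lowers the energy
                spins[i] = -spins[i]
                changed = True
        if not changed:
            break
    return spins

pattern_a = [1] * 8 + [-1] * 8                 # two illustrative patterns
pattern_b = [1, -1] * 8
W = hopfield_weights([pattern_a, pattern_b])

corrupted = list(pattern_a)
for k in (0, 5, 10):                           # damage three of the 16 spins
    corrupted[k] = -corrupted[k]

restored = recall(W, corrupted)
print(restored == pattern_a)                   # True
print(energy(W, corrupted), energy(W, restored))
```

Starting from a copy of the first pattern with three spins reversed, the downhill dynamics restores the stored pattern, and the energy drops as it does so.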
Associative memory magnets can be studied very thoroughly using modern statistical mechanics. A somewhat surprising property of associative memory models is that they have a finite ‘capacity’ to store information: the maximum number of storable patterns scales only linearly with system size. If there are too many patterns encoded, robust recollection becomes impossible. When there are too many patterns, the magnet instead behaves like a random spin glass and becomes ‘confused’ in trying to recall the encoded patterns. The situation is like that recounted in the joke about the old zoology professor who, when meeting a new faculty member at the Fall Faculty Party, says, ‘No, no. Please don't tell me your name. Every time I learn a new name I forget a fish!’
Frustration enters into protein biochemistry in ways very analogous to what happens for magnets and for these neural network models of associative memory. Frustration leads to degeneracy. Finite degeneracy allows interesting and controlled emergent behaviors. Too much frustration and degeneracy lead to qualitatively different dynamics in which searching through the degeneracies cannot be carried out in a meaningfully short time. A little frustration can be good, a lot can be disastrous.
1.2. Frustration in heteropolymers
Ferromagnetic interactions or antiferromagnetic; spins up or down, on a fixed lattice – the choices are stark and it is easy to see the origin of frustration in trying to find the lowest energy state for magnets. Similarly we may ask, what is the root of the difficulty in predicting the three-dimensional folds of proteins? How do we specify the choices of structures and how do we judge their quality? The choice space is a bit harder to visualize for proteins than it is for magnets and there are several ways to represent choice space, none perfect. Pairs of peptide bonds are restricted in their relative orientations and side-chain orientations are restricted too, so local dihedral angles seem like a sensible choice. The global fold is however a very non-local function of the individual backbone dihedral conformational selections – wild swings of portions of the chain are induced even by changing a single backbone dihedral by itself, so interactions (contacts) between the residues generally become completely reorganized by local moves.
Many such simple individual local reconfigurations of the backbone would be impeded by steric clashes and of course the chain connectivity must be preserved at all costs. The allowed configurations of a chain with excluded volume, possibly entangled, therefore require a much more complex encoding than a simple set of binary spins on a lattice. Connectivity and excluded volume by themselves already make computer searching through the allowed configuration space non-trivial (just think about the myriad of patterns in books on knitting or crocheting, or about modern day problems in mathematical knot theory). However, our understanding and study of dense polymer melts suggest the entanglement effect should actually be small for chains of the length of typical protein domains. Knots first begin to appear in random homopolymer chains typically only after hundreds of persistence lengths. Indeed knots appear rather rarely in the natural protein database, perhaps occurring at the 1% level (Sułkowska et al. 2012). They are more important for DNA folding.
Frustration due to topology alone is thus only a small part of the story of the difficulty of finding the global free energy minimum for proteins (but as we shall see ‘topological frustration’ does play a role in affecting in vitro protein-folding kinetics). The problems of moving the chain around that are encountered in making a single dihedral angle change can often be overcome, if the chain is not too compact, by a local combination of several such moves, a so-called ‘bicycle pedal move’. Thus an alternative encoding of the protein configuration in terms of the contact map, which specifies which residues are near each other in three-dimensional space (and therefore contribute to the energy), can be more useful than the backbone encoding. Such an encoding emphasizes the idea that van der Waals or solvent-mediated interactions between residues that may be far apart in sequence determine the energy (or scoring function) for proteins. While more complex than the interactions in magnets, these types of interactions are still local in space. There are 20 genetically encoded amino acids, so there are at least 210 possible pairwise interaction types (20 × 21/2 residue pairs) compared to the ± choice for magnets. Still, amino acid side chains can be classified into groups whose members share similar chemical properties – big, small, acidic, basic, hydrophobic, hydrophilic, etc. Since Kauzmann, the hydrophobicity of the side chains has been known to be the dominant effect in determining protein architecture (Kauzmann, 1959). The other physicochemical aspects of the amino acids, as we shall see, are also relevant but this caricature of the interaction as being simply hydrophobic versus hydrophilic already makes it clear how frustration might arise in heteropolymers.
Frustration arises from the combination of the constraints of fixed chain connectivity for a molecule with a defined sequence along with the heterogeneous energies for contact interactions that specify the greater attraction that hydrophobic residues have for each other than they have for hydrophilic residues. A problem of frustrating choice thus can arise when, for example, a single hydrophilic residue is surrounded in sequence by hydrophobes. The hydrophobes can lower their free energy by becoming buried, but then their nearby hydrophile must become buried too. If this lone hydrophile does not then find another hydrophilic mate in the interior (to form a salt bridge, for example), its burial will be costly in free-energy terms. The energy of the hydrophile could alternatively be minimized by allowing the hydrophile to rise to the surface, but then, if the backbone is too rigid, this choice will also bring the hydrophobes nearby in sequence up to the surface as well. We see this reconfiguration will be costly for the surrounding hydrophobes. Not everyone can be satisfied at the same time by an arrangement. In general, then, the low-energy states of a random heteropolymer with no specific sequential pattern of hydrophobes and hydrophiles will be the result of compromises. In addition, excluded volume and chain connectivity make it likely that the other different compromised structures that have low energy will actually be quite different in arrangement or topology. To avoid this frustration, special patterns of hydrophobicity are needed, like those seen in the helical wheel representation of helices (Schiffer & Edmundson, 1967).
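The contact-map encoding and the count of residue-pair types mentioned above can be illustrated with a short sketch. The distance cutoff, the minimum sequence separation, the six-bead hairpin coordinates and the function name `contact_map` are all assumptions chosen for the example; real analyses use their own definitions of a contact.

```python
from itertools import combinations_with_replacement

def contact_map(coords, cutoff, min_separation=2):
    """Pairs (i, j) closer than cutoff, skipping sequence neighbours.

    coords: one coordinate tuple per residue, in chain order.
    min_separation: smallest |i - j| that is allowed to count as a contact
    (2 excludes covalently bonded neighbours along the chain).
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d2 <= cutoff ** 2:
                contacts.add((i, j))
    return contacts

# A six-bead hairpin drawn on a unit grid (hypothetical coordinates).
hairpin = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
print(sorted(contact_map(hairpin, cutoff=1.0)))   # [(0, 5), (1, 4)]

# Unordered pairs drawn from 20 residue types, self-pairs included.
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
pair_types = list(combinations_with_replacement(amino_acids, 2))
print(len(pair_types))                            # 210
```

The two contacts of the hairpin pair residues that are three and five positions apart along the chain, even though they are nearest neighbours in space.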
Energetic frustration of the type just discussed is easiest to see and to quantify in the world of ‘lattice polymers’. Lattice polymers have been much studied by theorists because the simplicity in visualization and ease of computation of such models bring intellectual benefits that overwhelm our desire for realism. A concrete example from the lattice polymer world illustrates the power of simplicity, at least in pedagogy. In Fig. 5, some low-energy configurations of a polymer chain are laid out on a two-dimensional checkerboard lattice (Wolynes, 1997a). The left panel shows configurations of a homopolymer. If every pair interaction has the same value, each of these structures is nearly degenerate in energy. Only surface costs differentiate between the structures energetically. (Of course in a very small system there is always a big surface to volume ratio!) Artificial homopolymers in bulk indeed choose among these degenerate states in a kinetically controlled way, making the materials science of solid polymers a challenging topic. We may say, with modest exaggeration, that homopolymers slip and slide on a nearly featureless energy surface like that shown schematically below the homopolymer structure. This situation for the homopolymer is a little like what we found for the triangular antiferromagnet, where we have an extremely degenerate landscape even near absolute zero temperature. The middle panel of Fig. 5 shows configurations of a binarily encoded heteropolymer. The energies of the various structures now vary a lot as they reconfigure; even their bulk energy contributions are pinned by the heterogeneity of contact energy. The landscape is rugged and two topologically very different configurations can compete at being possible ground states. The barriers between any two low-energy configurations are high since now many energetically favorable contacts must be broken up in order to move from one low-energy state to the other low-energy state.
The frustration of finding the lowest ground state comes from the conflict of many reasonably similar, but nevertheless highly compromised, choices. The energy landscape of a random heteropolymer is much like that for the magnetic spin glass (Bryngelson & Wolynes, 1987; Bryngelson et al. 1995).
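The degeneracy of lattice polymer landscapes can be explored by enumerating every self-avoiding conformation of a short chain. The sketch below assumes a simple two-letter contact model in which only non-bonded H–H neighbours on the square lattice contribute an energy of -1; this scoring rule, the function names and the example sequences are illustrative assumptions and are not the energy functions used for Fig. 5.

```python
def walks(n_sites):
    """Yield self-avoiding walks on the square lattice (n_sites >= 2).

    The first step is fixed along +x to remove the four lattice rotations;
    mirror images are still generated separately.
    """
    def extend(path, occupied):
        if len(path) == n_sites:
            yield list(path)
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)
    start = [(0, 0), (1, 0)]
    yield from extend(start, set(start))

def contact_energy(sequence, path):
    """-1 for each pair of H residues that touch without being bonded."""
    position = {site: k for k, site in enumerate(path)}
    total = 0
    for k, (x, y) in enumerate(path):
        if sequence[k] != "H":
            continue
        # Looking only right and up counts every lattice edge exactly once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            m = position.get(neighbour)
            if m is not None and abs(m - k) > 1 and sequence[m] == "H":
                total -= 1
    return total

def ground_state_summary(sequence):
    """Return (lowest energy, its degeneracy, total number of walks)."""
    energies = [contact_energy(sequence, p) for p in walks(len(sequence))]
    lowest = min(energies)
    return lowest, energies.count(lowest), len(energies)

print(ground_state_summary("PPPP"))   # (0, 9, 9): completely flat landscape
print(ground_state_summary("HPPH"))   # (-1, 2, 9): a mirror-image pair
for seq in ("HHHHHHHHHH", "HPHPPHHPHH"):
    print(seq, ground_state_summary(seq))
```

For a chain with no attractive residues all nine four-bead conformations are degenerate, whereas placing H residues at the two ends singles out the two mirror-image conformations that bring them together. Comparing the ten-residue homopolymer with a mixed sequence shows how sequence heterogeneity reshapes the number of competing lowest-energy structures.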
Even when there are more than two interaction types, frustration, degeneracy, and slow dynamics remain the norm for complex heteropolymers. But having more choices in the interactions allows molecular evolution to pre-select landscapes to fold better. A canny ‘choice’ (read successful but rare evolutionary outcome!) makes it possible for a single structure to clearly emerge as ground state winner and no longer have significant competitors. An example of such pre-selection is the hydrophobic patterning of α-helices and β-strands that we mentioned earlier. In the lattice world such a pre-selected example is shown in the third panel (Fig. 5). The energy landscape for the ‘evolved’ sequence plotted below this structure now has a larger overall energy scale than for the random heteropolymer but the landscape is locally a bit smoother and at a low-enough temperature only a single structure will dominate. This is much like the disguised ferromagnetic order found at low temperatures in the spin models of associative memory.
Different degrees of frustration at the global level lead to different kinetic and thermodynamic behaviors for the heteropolymer. These different behaviors show up in direct simulations of heteropolymers in the lattice world (Onuchic et al. Reference Onuchic, Luthey-Schulten and Wolynes1997). In Fig. 6, we show the results of Onuchic et al. for the folding times of two minimally frustrated lattice polymers (one with a two-letter code, the other a three-letter code) and one for a rather frustrated but typical two-letter code polymer. These are small systems and folding times of random heteropolymers scale exponentially with size so the contrasts in folding time would be bigger for longer sequences, but you can see the systems with less frustration are able to achieve faster folding under ideal conditions. The contrast is made even clearer when we simultaneously examine the lattice polymers’ thermodynamics: even at the temperature of its fastest folding the typical 27-mer is not thermodynamically stable in the folded state but instead takes up a myriad of unfolded configurations. The minimally frustrated sequences however still fold fast even at temperatures where their folded state wins out thermodynamically over the panoply of other possibilities. When there is a single ground state (as is true when there is little frustration) the global minimum of the lattice polymer can be accessed fairly quickly at low temperature, perhaps with a modest nucleation barrier largely entropy controlled still remaining that can slow folding. In this case, a single exponential rate characterizes the folding process. Folding occurs as a Poisson-distributed random event in what would appear to be a single step in a classic mechanistic investigation. For the typical sequence as shown in Fig. 6, there are several near-degenerate low-energy states. This degeneracy leads to kinetic trapping and the time course of folding therefore exhibits several distinct exponential phases. 
These phases signal the distributed escape from traps. Non-exponentiality is always a sign of a multiplicity of states and can also occur for minimally frustrated polymers. But for fast, minimally frustrated folders the non-exponentiality coming from intermediate states is not the result of frustration but reflects high energy states involved in direct diffusion to the ground state (down-hill folding) which typically occur on the microsecond time scale. For slower, but minimally frustrated folders with nucleation, longer-lived intermediate ensembles involve species where only part of the chain (a set of ‘foldons’) has become organized and the remainder remains disorganized (Panchenko et al. Reference Panchenko, Luthey-Schulten and Wolynes1996). When there is partial frustration these intermediates that represent entropically stabilized traps can be further stabilized by allowing the formation of non-native but nevertheless stabilizing contacts. Not only does frustration influence the net stability, but also it changes the character of the thermodynamic transition. From the thermodynamic viewpoint when there is little frustration, the folding process appears very cooperative – a single very rapid transition occurs sharply with changing the temperature or solvent conditions as seen for the two or three letter minimally frustrated chains (Wolynes, Reference Wolynes1997a). With growing frustration, the formation of populated intermediates corresponding to the near-degenerate ground states with non-native contacts causes the transition to be spread out over a range of temperatures so the change appears less cooperative. This is manifested in significant shifting baselines in the observable ensemble averages sometimes seen in equilibrium denaturation experiments. 
Unfrustrated heteropolymers, in other words, exhibit first-order transitions, resembling the freezing of a liquid into a crystal while highly frustrated heteropolymers, in contrast, exhibit thermodynamically continuous transitions that resemble the transition from a viscous liquid to a glass.
Energy landscape theory allows us to quantify the global effects of frustration. In a completely random energy landscape, a simple argument (made first in a more abstract setting by Derrida, 1981) allows us to estimate the energy of the lowest energy states by combining knowledge of the size of the configuration space to be searched with knowledge of the statistical variation of energy between substantially different configurations (the variance of the energy). Let us call this typical ground-state energy $E_{\rm g}$. If the energy of the completely folded state, $E_{\rm f}$ (for the particular sequence being studied), is substantially below the $E_{\rm g}$ estimate, we can predict that frustration effects will be minimal for this sequence. On the other hand, if $E_{\rm f} \approx E_{\rm g}$ there will likely be at least a few competitors for the ground state and folding will be slow: accounting for the time needed to escape from traps becomes necessary, and the ensemble will change more continuously with thermodynamic conditions, so folding will not seem very cooperative in equilibrium experiments. Comparing $E_{\rm f}$ and $E_{\rm g}$, and requiring them to be well separated (Fig. 7), then provides one way of quantitatively stating the principle of minimal frustration (Bryngelson & Wolynes, 1987). This principle generalizes an earlier idea due to Gō, the ‘consistency principle’, which suggested that secondary structure preference needed to be consistent with tertiary packing forces (Go, 1983). The minimal frustration principle also requires consistency between the different tertiary forces themselves. In addition, as we shall see, the quantitative formulation of the minimal frustration principle using landscape theory leads to practically useful algorithms.
It is worth reprising how one makes the random energy model estimate of the globally frustrated ground-state energy here, because later we will see how this estimate can be used to give a meaning to the notion of having specific locations for frustration in proteins. The random energy landscape argument is simplest for a large system. For a large frustrated heteropolymer we expect there to be, for any given configuration, both a large number of stabilizing interactions and a large number of destabilizing interactions. We can thus invoke the central limit theorem of probability to conclude that the energy of a given configuration will be chosen from a Gaussian distribution having some mean $\overline E $ and a standard deviation ΔE. Thus if we choose a configuration at random, the energy of that chosen specific state is a random variable having a probability distribution
$$P(E) = \frac{1}{\sqrt{2\pi \Delta E^{2}}}\exp\left[-\frac{(E - \overline E)^{2}}{2\Delta E^{2}}\right]$$
The mean $\overline E $ and standard deviation ΔE may depend on the properties that characterize globally the sampled ensemble of structures such as their degree of collapse and average secondary structure content. Both these statistical quantities can be estimated from an incomplete sampling of the states so long as it is a fair sampling. This makes the statistical characterization of the landscape feasible even while the exact enumeration of all possibilities would be impractical or even impossible owing to the exponentially large configurational space. We can also understand how these statistical properties vary in different parts of the energy landscape. For example, if contact pair interactions are the dominant energy contribution, the extended configurations (as measured by radius of gyration) will have very few contributing terms, while compact configurations will have many specific pairs in contact to be summed over. In this case, the mean energy $\overline E $ depends on the expected number of pairs in that configuration, which in turn depends on the degree of collapse as does ΔE 2.
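One simple way to write this dependence, assuming the contact energies behave as independent random variables with per-contact mean $\overline \varepsilon $ and per-contact variance $\Delta \varepsilon^{2}$, is sketched below. The symbol $N_{\rm c}$ for the number of contacts in the sampled ensemble is introduced here only for illustration.

```latex
\overline{E} \simeq N_{\rm c}\,\overline{\varepsilon},
\qquad
\Delta E^{2} \simeq N_{\rm c}\,\Delta\varepsilon^{2}
```

Under this assumption both the mean and the variance grow in proportion to the number of contacts, so extended configurations with few contacts have both a mean energy close to zero and a narrow energy distribution.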
If compact configurations are generally stable, the mean energy per contact, $\overline \varepsilon $, is negative and depends mostly on the average hydrophobicity. In fact, compactness can be guessed quite well from knowledge of sequence composition alone. Intrinsically disordered proteins can often be recognized as non-compact merely on the basis of their composition, which is generally insufficiently hydrophobic. Like $\overline \varepsilon $, the variance per contact, $\Delta \varepsilon^{2}$, is also a function of protein composition, because it depends on how many energetically distinct pairs of interactions there are. For a homopolymer, for example, $\Delta \varepsilon^{2} = 0$. Generally $\Delta \varepsilon^{2}$ grows with the number of different kinds of amino acids used to encode the protein.
The Gaussian distribution for the energies of states at first looks like it would allow the energy of a state to be arbitrarily low, but because there is a finite, albeit exponentially large, number of protein conformations, an infinitely low energy cannot be realized: there must always be some particular lowest energy state to satisfy the Third Law of Thermodynamics. What is the lowest energy we would expect then? The actual number of states that have energy E is a product of the total number of states regardless of energy (called Ω) times the probability of a specific energy E, P(E). Thus,
$$W(E) = \Omega\, P(E)$$
Ω depends exponentially on the chain entropy, $\Omega = {\rm e}^{S_0/k_{\rm B}} $. Since entropy is an extensive quantity, $S_0 = sN$, Ω is exponential in protein size. This exponential scaling is the essence of the Levinthal calculation of the difficulty of folding on a flat energy landscape (Levinthal, 1968). Again we can account for how the number of chain configurations varies with several measures that stratify the energy landscape of the heteropolymer, such as its overall compactness, average amount of secondary structure, etc. A lot of interesting polymer physics is involved in such calculations (Koretke et al. 1996; Luthey-Schulten et al. 1995; Plotkin et al. 1996). Since both factors in $W(E) = \Omega P(E)$ vary exponentially with protein size, for large proteins we see that, depending on the energy E, W(E) is either exponentially large in N, suggesting that every finite part of the chain still has some finite number of energetically equivalent configurations, or else exponentially small in N, in which case we would not expect a single viable configuration for any segment of the chain with this energy. The expected ground-state energy $E_{\rm g}$ then satisfies the equation $W(E_{\rm g}) \cong 1$, or explicitly:
$$E_{\rm g} = \overline E - \Delta E\sqrt{2 S_0/k_{\rm B}} \qquad (5)$$
From this formula we see the expected ground state of random heteropolymers becomes deeper in energy, the higher the specific entropy of the chain and the larger the variety that is allowed in the choice of the interactions.
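The step from the counting condition to the ground-state estimate can be spelled out as follows. This is a sketch that keeps only the terms extensive in N and drops the logarithm of the Gaussian prefactor; W(E) denotes the expected number of states at energy E, that is, Ω times P(E).

```latex
\begin{aligned}
\ln W(E) &\simeq \frac{S_0}{k_{\rm B}} - \frac{(E - \overline{E})^{2}}{2\,\Delta E^{2}}, \\
\ln W(E_{\rm g}) = 0 \;&\Rightarrow\; (E_{\rm g} - \overline{E})^{2} = 2\,\Delta E^{2}\,\frac{S_0}{k_{\rm B}}, \\
E_{\rm g} &= \overline{E} - \Delta E\,\sqrt{2 S_0 / k_{\rm B}} .
\end{aligned}
```

The negative root is taken because the ground state lies on the low-energy side of the distribution. The depth of the ground state below the mean therefore grows with both the width of the energy distribution and the square root of the configurational entropy.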
At the value of the energy $E_{\rm g}$ given by Eq. (5), the number of states is not, in fact, precisely one. This is because we have only kept the dominant terms in the exponentials. The number of traps at precisely $E_{\rm g}$ is known to scale polynomially in polymer length N, i.e. $N(E_{\rm g}) \approx N^{\alpha}$. A large protein whose energy is just $E_{\rm g}$ will thus still have quite a few kinetic traps of nearly the same stability. Therefore, in order for a protein to fold robustly, its ground-state energy must be substantially below $E_{\rm g}$ on a per-residue basis; otherwise trapping will become a problem. Without this gap in energies the actual ground state will be degenerate, but perhaps only in short (intrinsically disordered?) segments. The condition that $E_{\rm f} - \overline E$ is below $E_{\rm g} - \overline E$ is sometimes known as the gap condition for folding.
The ratio $(E_{\rm f} - E_{\rm g})/\sqrt{\Delta E^{2}}$ thus indicates how confident we can be, based on the statistics of the random energy landscape, that there are no competing traps for the ground state. Since $E_{\rm g} - \overline E$ is proportional to $\sqrt{\Delta E^{2}}$, this ratio is also related to the Z-score used in statistics, $Z = \displaystyle{{(E_{\rm f} - \overline E)} \over {\sqrt{\Delta E^{2}}}}$. Quantitatively we can say the effects of frustration are minimal if the Z-score is very large in magnitude, that is, if $E_{\rm f}$ lies many standard deviations below the mean. See Fig. 7 for a picture of the distribution of energy levels for a minimally frustrated protein. The number of standard deviations below the mean that are needed in order to get a gap depends on the specific entropy $S_0$ and chain length. $S_0$ can be estimated for compact polymers with secondary structure (Koretke et al. 1996; Luthey-Schulten et al. 1995). Note that if there are further a priori constraints lowering $S_0$, one can get by with a lower Z-score.
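These quantities are simple to evaluate once the mean and width of the decoy energy distribution are known. The sketch below is illustrative only: the function names are hypothetical, the entropy is passed in units of $k_{\rm B}$, and the ground-state estimate is the random energy model expression obtained from W(E_g) ≅ 1 with only the dominant exponential terms retained.

```python
import math
from statistics import fmean, pstdev


def landscape_statistics(decoy_energies):
    """Mean and standard deviation of a fair sample of decoy energies."""
    return fmean(decoy_energies), pstdev(decoy_energies)


def rem_ground_state_energy(mean_e, sigma_e, entropy_over_kb):
    """Random energy model estimate E_g = mean - sigma * sqrt(2 * S0 / kB)."""
    return mean_e - sigma_e * math.sqrt(2.0 * entropy_over_kb)


def z_score(e_folded, mean_e, sigma_e):
    """Standard deviations between the folded energy and the decoy mean.

    With this sign convention (an assumption of the sketch) a well-designed
    sequence has a large negative Z.
    """
    return (e_folded - mean_e) / sigma_e


def gap_in_sigma(e_folded, mean_e, sigma_e, entropy_over_kb):
    """(E_f - E_g) / sigma; negative when the fold lies below the REM ground state."""
    e_ground = rem_ground_state_energy(mean_e, sigma_e, entropy_over_kb)
    return (e_folded - e_ground) / sigma_e


if __name__ == "__main__":
    # Made-up decoy energies and folded-state energy, for illustration only.
    decoys = [-12.0, -9.5, -10.2, -11.1, -8.7, -10.9, -9.9, -10.4]
    mean_e, sigma_e = landscape_statistics(decoys)
    print("Z =", z_score(-16.0, mean_e, sigma_e))
    print("gap/sigma =", gap_in_sigma(-16.0, mean_e, sigma_e, entropy_over_kb=4.0))
```

A negative value of `gap_in_sigma` corresponds to the gap condition being met; how negative it needs to be depends on chain length through the polynomial number of near-ground-state traps.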
The minimal frustration principle can also be stated in terms of the characteristic temperatures at which states with the energies E f and E g would be accessed at equilibrium. If there is a large stability gap the folded state will be accessed in a highly cooperative first-order transition at T f. At this ‘folding temperature’ the entropy of the disordered chain is balanced by the deep ground-state energy. Clearly then T f is roughly (E f− $\overline E $)/S 0. If the heteropolymer is frustrated, on the other hand, the lowest-energy-trapped states can only be accessed at a temperature T g, which is roughly ΔE/ $\sqrt {S_0} $. In terms of these characteristic temperatures we can say a heteropolymer is minimally frustrated if T f/T g > 1. We see that T f/T g also depends monotonically on the Z-score. One advantage of this way of stating the criterion for minimal frustration is that this criterion does not explicitly involve chain length. Also this criterion in terms of temperatures turns out to be fairly robust to there being correlations in the energy landscape, which are ignored in the rough argument we have just given (Plotkin et al. Reference Plotkin, Wang and Wolynes1996). The correlations between energy levels predicted by the polymer theory certainly exist as they allow states to be grouped together into basins of attraction which can interconvert without significant barriers. The number of such basins is much smaller than the number of individual configurations. The criterion in terms of T f/T g resembles very much the criteria that have long been used in metallurgy to distinguish those systems that easily crystallize from those alloys that form glasses (Chaudhari & Turnbull, Reference Chaudhari and Turnbull1978; Greet & Turnbull, Reference Greet and Turnbull1967; Stevenson & Wolynes, Reference Stevenson and Wolynes2010). 
The success of the idea in the macroscopic context of material science gives us more confidence in its use at the nanoscale of protein folding.
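The relation between the temperature ratio and the Z-score, and the cancellation of chain length, can be made explicit with the rough estimates quoted above. The sketch works in units with $k_{\rm B} = 1$, drops factors of order one, and assumes that the stability gap and the energy variance are both extensive; the per-residue symbols $\delta e$, $b$ and $s$ are introduced here only for illustration.

```latex
\begin{aligned}
\frac{T_{\rm f}}{T_{\rm g}}
  &\approx \frac{|E_{\rm f} - \overline{E}|/S_0}{\Delta E/\sqrt{S_0}}
   = \frac{|E_{\rm f} - \overline{E}|}{\Delta E}\,\frac{1}{\sqrt{S_0}}
   = \frac{|Z|}{\sqrt{S_0}}, \\
|E_{\rm f} - \overline{E}| = N\,\delta e,\quad
\Delta E = \sqrt{N}\,b,\quad
S_0 = N s
  \;&\Rightarrow\;
\frac{T_{\rm f}}{T_{\rm g}} \approx \frac{\delta e}{b\,\sqrt{s}} .
\end{aligned}
```

At fixed entropy the ratio grows linearly with the magnitude of the Z-score, while the factors of N cancel, which is why the criterion in terms of temperatures does not explicitly involve chain length.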
The criteria for achieving non-degenerate ground states and rapid, reliable folding have also been checked many times by carrying out simulations in the lattice polymer world (Sali et al. 1994). The results of a very comprehensive study by Mélin et al. (1999) are shown in Fig. 8. The sequences in this study use a two-letter code with a reasonably well-chosen interaction threshold, so that ground-state structures are collapsed 27-mer cubes. Competing, perfectly collapsed structures can be enumerated completely (there are 103 346 of them) (Shakhnovich & Gutin, 1990), so that the ground state can always be found (and checked to see that there are no competing non-compact structures) and the average $\overline E $ and $\Delta E^{2}$ can be computed explicitly. The ratio $\Delta = (E_{\rm f} - E_{\rm g})/\sqrt{\Delta E^{2}}$ can then also be evaluated. We can see in Fig. 8 that the folding time at constant temperature correlates well with Δ and therefore with $T_{\rm f}/T_{\rm g}$. The contrast in rates would be even larger at temperatures where the low-stability sequences with low Δ would be thermodynamically stable. The dramatic slowing of folding with cooling for sequences with low $T_{\rm f}/T_{\rm g}$ has thus been explicitly confirmed in the lattice world. At the same time, Δ correlates with the cooperativity of the thermodynamic folding transition as measured by its width δ. Clearly, for a lattice model heteropolymer to behave as proteins do in the laboratory, it must obey the minimal frustration principle by having a sufficiently large $T_{\rm f}/T_{\rm g}$ ratio or high Z-score. Seeing that this principle works in the controlled world of lattice proteins with known force laws gives us confidence in applying the minimal frustration principle to describe how proteins behave in the real world, where the force laws are still uncertain.
2. Frustration and the energy landscape of real proteins
2.1. The energy landscape of real proteins
The principle of minimal frustration has some very direct consequences for understanding folding mechanisms and kinetics in the laboratory. The most important consequence is that thermodynamic stability and folding kinetics should be correlated for real proteins of similar structure in much the same fashion as these thermodynamic and rate quantities are correlated for lattice models. This correlation between rates and equilibria while not completely perfect in the real world is an especial surprise since our introductory chemistry classes have drilled into our heads the idea that kinetics and thermodynamics are supposed to be completely distinct aspects of chemical reactions. Indeed in small molecule chemistry the relations between changes in rates and in equilibria are often unpredictable, but this is not true for the folding reactions of proteins because the overall folding reaction is itself a concatenation of many elementary steps each step being mostly near to local equilibrium. The smooth funnel-like landscape of completely minimally frustrated proteins thus implies a rate–stability correlation because the landscapes of minimally frustrated proteins should have few energetically generated traps. If there were many significant kinetic traps on the landscape the changes in the stabilities of the trap states when the protein is mutated would change the rate unpredictably and would thus interfere with finding a simple correlation of the overall rates with the stability of the ground-state configuration alone. In a minimally frustrated protein, the final native structure of the molecule itself is generally directly correlated with the energy landscape throughout the thermally accessed configuration space because the native-like interactions are stronger than random and thus are most common along folding paths. This structure–energy correlation is what we mean by there being a ‘funneled’ energy landscape for a minimally frustrated protein. 
We should remember, though, that the degree of rate/equilibrium correlation will not be uniform across the protein sequence. This non-uniformity of correlation arises even on a perfectly funneled landscape because some parts of the protein make few contacts or, if they make many contacts, a significant fraction of these contacts are distant in sequence, so that forming them becomes entropically costly and thus unfavorable in the transition state for folding. At the same time other parts of the protein may have the bulk of their contacts rather close in sequence, and thus these parts can form with lower entropy loss and will tend to form early. Studying the structural features of the rate/equilibrium correlation is called ϕ-value analysis (Fersht et al. 1992; Matthews & Hurle, 1987). ϕ-Value analysis reveals which parts of the protein are most critical for the rate-limiting steps of the folding process.
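As an illustration of how a ϕ-value is obtained from a rate and an equilibrium measurement, the sketch below uses the conventional definition from the folding literature, which is not spelled out in the text above: the mutational change in the activation free energy of folding divided by the mutational change in native-state stability. The function name, units and sign convention are assumptions of the sketch.

```python
import math

GAS_CONSTANT = 8.314462618e-3  # kJ mol^-1 K^-1


def phi_value(k_fold_wt, k_fold_mut, ddg_stability, temperature=298.15):
    """Conventional phi-value for a single mutation.

    k_fold_wt, k_fold_mut: folding rate constants of wild type and mutant.
    ddg_stability: loss of native-state stability on mutation in kJ/mol,
        taken as positive for a destabilizing mutation (assumed convention).
    """
    # Change in the folding activation free energy inferred from the rates.
    ddg_transition_state = GAS_CONSTANT * temperature * math.log(k_fold_wt / k_fold_mut)
    return ddg_transition_state / ddg_stability


if __name__ == "__main__":
    # Made-up numbers: a mutation that costs 6 kJ/mol of stability
    # and slows folding threefold.
    print(phi_value(k_fold_wt=300.0, k_fold_mut=100.0, ddg_stability=6.0))
```

A value near one indicates that the mutated site is about as structured in the transition state as in the native state, while a value near zero indicates that it is still largely disordered there.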
The success of the ϕ-value analysis confirms the basic tenet of the landscape theory, the minimal frustration principle. This along with the controlled behavior seen in lattice model simulations gives us confidence (at a global level) that we can use the minimal frustration principle to think about natural proteins. These successes of energy landscape theory are partial, however. They do not tell us precisely what the physical interactions actually are that have allowed such an apparently low level of frustration to have been achieved through evolution. Also we would like to know how perfect the funneling of the landscape of real proteins is. The level of agreement between purely structure based landscape models and experiment is so good that we cannot tell directly from the existing kinetic experiments how much energetic residual frustration if any actually remains in most real proteins. The data show the overall frustration level is small for most proteins, but how small precisely? Basically, the glassy effects on the thermodynamics and rates of most folding reactions are insufficient to provide an accurate value of the glass transition temperature T g. A quantitative scale for the energetic ruggedness cannot be easily found using the measured folding kinetics in the laboratory alone. Nevertheless some inferences on ruggedness, quantified by T g and perhaps more important, the energy of the typical competing near-ground states characterized by E g, can be made by using indirect experiments on structure and dynamics of molten globules in concert with the random energy model and polymer physics (Onuchic et al. Reference Onuchic, Wolynes, Luthey-Schulten and Socci1995). The key quantity that we need to know is the configurational entropy to be associated with molten globules. In other words, how numerous are collapsed structures that already have their secondary structure partially formed? 
Such ‘molten globule’ structures should make up the bulk of the decoy structures with which typical final native structures have had to compete energetically. Using the theory of helix-coil transition in collapsed states (Luthey-Schulten et al. Reference Luthey-Schulten, Ramirez and Wolynes1995), the combination of collapse and helix formation has been shown to reduce the configurational entropy to about 0·6 k B per residue from a nominal value of 2–3 k B that would be available for an amino acid chain in an expanded random coil. Thus using the random energy model we can conclude that the ground-state energy of a configuration that might compete with the native structure is approximately E g = $\overline E $−ΔE $\sqrt {0.6} $. If the energy of the folded state E f is lower than this value, the native structure will have few competitors. If, on the other hand, E f is comparable to E g, traps should appear. To quantify the possibility of there being a multiplicity of states, i.e. to measure frustration quantitatively, the ruggedness energy scale itself thus needs to be determined. This scale can be computed once a reliable energy function is given. We will use this approach throughout this review but it is worth noting that as an alternative we can try to estimate the roughness of the energy landscape from experiment by looking at fluctuation rates in molten globules, as was done rather early on. Motions in molten globule proteins can be measured via NMR. Such dynamical studies have been carried out for surprisingly few systems (Bruylants & Redfield, Reference Bruylants and Redfield2009; Dyson & Wright, Reference Jane Dyson and Wright2002; Pervushin et al. Reference Pervushin, Vamvaca, Vögeli and Hilvert2007). Using some measured millisecond fluctuation rates in molten globules in this way Onuchic et al. arrived at an estimate for the T f to T g ratio for typical helical proteins. They obtained a value of T f/T g of about 1·6. 
Thus for a typical protein, glassy effects would only show up strongly at temperatures well below the freezing point of water, explaining why pure funnel models work so well in practice. This estimate of T f/T g must be considered quite crude but it is nevertheless higher than what one sees in the simplest lattice models using a two letter hydrophobic–hydrophilic code. Such lattice simulations therefore are plagued by trapping effects that are actually less common for real proteins. The low level of frustration suggested by this analysis for real proteins can be achieved for lattice proteins however by using a maximally distinct three-letter residue folding code. The pedagogically popular two flavor model (hydrophobic/hydrophilic) of lattice proteins is more frustrated, giving a lower T f/T g ratio but the three-letter code may be qualitatively fine for thinking about real proteins.
2.2. A license to learn via bioinformatics: structure prediction and energy functions
Given the difficulty that direct experiments have in quantifying frustration, a purely theoretical approach to estimating ruggedness is desired. A sufficiently reliable energy function, of course, would allow us to assess computationally how funnel-like a protein landscape is and would allow us to quantify the degree of frustration that real proteins have by determining the free energies of a sampled variety of conformations. One is tempted to develop a computational approach to the ruggedness then by employing the all-atom energy functions that are routinely used to simulate the dynamics of folded proteins (Karplus & McCammon, Reference Karplus and McCammon1983). Such a direct assault however faces several difficulties today. The first difficulty is that simulating completely atomistic models in their aqueous environment still takes a lot of computing time. Only in the last few years have the design of special purpose machines and the development of algorithms made it possible to carry out fully atomistic simulations in solvent covering the millisecond time scale motions. This is the time scale for accessing truly distinct conformational minima as inferred from NMR (Zerbetto et al. Reference Zerbetto, Anderson, Bouguet-Bonnet, Rech, Zhang, Meirovitch, Polimeno and Buck2013). Mere sampling of the structures of molten globules is also insufficient: it is necessary to quantify the solvent averaged free energies (not merely energies!) of the individually sampled configurations – since we know hydrophobic effects arising from solvent averaging are important and these have a large solvent entropy component. Finally even when sampling and solvent averaging can be carried out, this direct approach to quantify ruggedness relies on the accuracy of the input all-atom force field. 
Alas, the accuracy of atomistic force fields while well documented to be good for configurations near to the folded state is, again for reasons of time scale, incompletely tested very far away from the native configurations (until recently!) (Piana et al. Reference Piana, Lindorff-Larsen and Shaw2011, Reference Piana, Lindorff-Larsen and Shaw2012). There are indeed many hints that many of the all-atom molecular force fields used today fail to properly describe the stabilities of incompletely folded states. One hint that there is a problem comes from a meta-analysis of folding simulations made by different research groups: there has been a distinct lack of agreement between the specific results obtained from different simulations of the early stages of folding carried out on the same proteins (Wolynes, Reference Wolynes2004). There is often a tendency to ascribe these differences between results from different groups to the sampling protocols each different simulator uses but this is probably not the main source of error. Instead what we are seeing in most of these disagreements is that the landscapes encoded by the popular all-atom force fields (like the more obviously stylized two-letter code lattice models) are generally all more frustrated than the actual protein energy landscapes really are. Thus each landscape's particular set of traps is quite sensitive to model details much as would be expected for a random heteropolymer. The conclusion we may draw from these observations is that many current atomistic models are still too frustrated to be completely realistic to describe folding in detail. This conclusion is buttressed more directly by a recent study by the Shaw group of the thermodynamics of folding for a variety of all-atom force field models. 
They have shown that some tuning is needed in order to get reliable folding to the correct native structure across the spectrum of small proteins and that the better tuned models were more cooperative in their thermodynamic behavior than are earlier, less well-tuned, models (used however by many other groups) (Piana et al. Reference Piana, Lindorff-Larsen and Shaw2012). Lower cooperativity correlates as we have seen with there being greater frustration and more intermediates. Even the most successful energy function studied by the Shaw group still gives less cooperative folding transitions than are found via laboratory calorimetry (Henry et al. Reference Henry, Best and Eaton2013). Thus we can infer there probably remains still some room for improving all atom force field models. We also conclude that the direction of improvement will involve further reducing the frustration of the landscape. So today's atomistic simulations probably show larger frustration effects on kinetic pathways than really occur in nature. Recently Best et al. have shown that the Shaw simulations of folding do conform to a minimally frustrated landscape (Best et al. Reference Best, Hummer and Eaton2013).
Many of the difficulties of quantifying frustration using all-atom modeling would disappear if a coarse-grained but transferable energy model were available that was also sufficiently accurate. There is a tendency to think coarse-grained models intrinsically must be less accurate than all-atom models because they are clearly less complete in having fewer degrees of freedom explicitly treated. But, when it comes to the issue of having a realistic level of frustration of the landscapes this reasoning may be flawed. As we have seen the funneled nature of the landscape must emerge from evolution, but evolution itself has had only coarse-grained abilities to reduce frustration: mutations can change only a single amino acid at a time. In contrast, mutations are not able to change single atoms in proteins one at a time! If fluoroalanine would be the best fitting residue in a protein structure, natural evolution is out of luck if it has to find a way to try to put it in – there is no atomic level mutation facility in the cell. (Well, we might imagine post-translational modifications like phosphorylation are attempts at doing atomic level mutation.) This evolutionary meta-argument suggests therefore that a sufficiently complicated but still coarse-grained energy function should be able to capture at least the bulk of the evolved low frustration level.
One of the best ways to find a good transferable coarse-grained energy function is to use the minimal frustration principle itself. The correlation between structure and energy landscape envisioned in the minimal frustration principle gives a strategy for learning energetics from protein structural databases. The minimal frustration principle provides a ‘license to do structural bioinformatics’ since it implies there is a non-random part of the sequence–structure correlation. Beyond merely providing such a license, the minimal frustration principle, in mathematical form, motivates a number of specific algorithms for optimally learning about energetics from known structures (Goldstein et al. Reference Goldstein, Luthey-Schulten and Wolynes1992b). These machine learning algorithms try to find energy functions that make the landscape of structurally well-characterized proteins as funnel-like as possible (Eastwood et al. Reference Eastwood, Hardin, Luthey-Schulten and Wolynes2001). These algorithms start by estimating the $T_{\rm f}/T_{\rm g}$ ratios using various levels of statistical mechanical approximation. This must be done for many examples in the database of proteins with known structures, but using a transferable energy function having a common set of adjustable parameters that encode the interactions between the residues, which are assumed universal. Then the typical $T_{\rm f}/T_{\rm g}$ ratios for the various different force fields can be compared. The more funneled potentials should have higher $T_{\rm f}/T_{\rm g}$. Thus the relative quality of two different coarse-grained force field models can be assessed, provided the $T_{\rm f}/T_{\rm g}$ ratio can be estimated computationally with sufficient rapidity. The computer can furthermore carry out an optimization of the parameters used in the coarse-grained potential energy function to find the force fields that are most likely to fold properly. In this way, optimal protein-folding codes have been found (Goldstein et al. Reference Goldstein, Luthey-Schulten and Wolynes1992b).
In the lattice world, this learning strategy has been explicitly tested in a clever way. By first simulating lattice folding with a known energy function, one is able to construct an artificial database of lattice structures that can be folded with the assumed energy function. These structures can then be used as input to the learning optimization algorithm (Yue et al. Reference Yue, Fiebig, Thomas, Chan, Shakhnovich and Dill1995). This test shows that the learning procedure does reproduce, to rather good accuracy, the original energy function that was used to evolve (design) the structures. This kind of study has established that the mathematics of the learning algorithm is sound.
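The comparison of two candidate force fields described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published algorithm: the energy is assumed to be linear in the transferable parameters, $E = \sum_k \gamma_k \phi_k$, the ratio of the native-to-decoy energy gap to the decoy energy spread is used as a stand-in for $T_{\rm f}/T_{\rm g}$, and all function names, feature vectors and parameter values are hypothetical.

```python
from statistics import mean, pstdev


def energy(gamma, features):
    """Energy linear in the transferable parameters: E = sum_k gamma_k * phi_k."""
    return sum(g * f for g, f in zip(gamma, features))


def gap_to_roughness(gamma, native_features, decoy_features):
    """(<E_decoy> - E_native) / std(E_decoy) for one protein.

    Assumption: this ratio is used as a proxy that grows with T_f/T_g.
    """
    e_native = energy(gamma, native_features)
    e_decoys = [energy(gamma, phi) for phi in decoy_features]
    return (mean(e_decoys) - e_native) / pstdev(e_decoys)


def score_force_field(gamma, training_set):
    """Average the proxy over a training set of (native, decoys) entries."""
    return mean(gap_to_roughness(gamma, nat, dec) for nat, dec in training_set)


# Toy data (hypothetical): two feature types, e.g. counts of two contact classes,
# for the native structure and four decoys of each of two proteins.
training_set = [
    ([8, 1], [[4, 4], [5, 3], [3, 5], [4, 5]]),
    ([7, 2], [[4, 4], [3, 6], [5, 4], [4, 5]]),
]
gamma_a = [-1.0, 0.5]   # rewards feature 0, penalises feature 1
gamma_b = [-0.2, -0.2]  # treats both features alike

score_a = score_force_field(gamma_a, training_set)
score_b = score_force_field(gamma_b, training_set)
better = "a" if score_a > score_b else "b"
print(f"score_a = {score_a:.2f}, score_b = {score_b:.2f}, more funneled: {better}")
```

The same parameter set is scored on every protein in the training set, which is what makes the resulting comparison a statement about transferability. An optimization step would then adjust the parameters to raise the average score instead of merely comparing two fixed choices.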
Even before the lattice model tests of the learning procedures based on the minimal frustration principle were made, the idea was used in a practical way to find transferable potentials for natural proteins by Goldstein et al. (Reference Goldstein, Luthey-Schulten and Wolynes1992a, Reference Goldstein, Luthey-Schulten and Wolynesb). The strategy was first applied to associative memory Hamiltonians (Friedrichs & Wolynes, Reference Friedrichs and Wolynes1989). These are energy functions whose form resembles the associative memory magnet models that we discussed in Section II. The associative memory function consists of a sequence-independent backbone interaction term along with a sequence-dependent ‘memory’ term that encodes particular structure–sequence correlations: $E = - \sum\nolimits_{{\rm memories},\mu} {\gamma (S_i S_j ;S_i^{\mu} S_j^{\mu} ){\rm \Theta} (r_i - r_j^{\mu} )} $. The transferable coefficients γ depend on the residues of the target protein to be folded, $S_i$ and $S_j$, and on the corresponding residues of the memory proteins, $S_i^{\mu}$ and $S_j^{\mu}$, which are aligned to the target sequence.
Some of the alignments used in the memory set may be correct, but they may represent assignments in the ‘twilight zone’ where sequence identity by itself is so low that we cannot be sure if a supposed homolog is actually evolutionarily related, since completely structurally unrelated proteins can also be similarly well-aligned based on their sequences alone. Other memories used in the associative memory model are simply misalignments to proteins that are incongruent in structure to the target but which are not recognizable a priori as being wrong simply based on sequence. The pairwise distances in the example proteins or memories ($r_i - r_j^{\mu}$) encode tertiary contacts, of course, but they also encode more distant spatial relationships such as the distances involved in extended helices, strands or turns. The minimal frustration principle then is used first to find the optimal γ's for a training set of proteins. The resulting γ's, which are supposed to be transferable, are then used to construct Hamiltonians for proteins not in the training set. Free-energy profile analysis reveals that the resulting transferable energy landscapes were funneled to near-native structures. Folding simulations with these associative memory Hamiltonians perform well, showing that an algorithm based on learning the γ coefficients could ultimately recognize correct alignments even when the alignment by itself was clearly in the twilight zone.
The optimization learning strategy based on the minimal frustration principle has also been applied to off-lattice models with simple contact Hamiltonians (Koretke et al. Reference Koretke, Luthey-Schulten and Wolynes1996). The interaction parameters learned in this way turn out to be quite adequate to recognize correct structures from candidate homologs in the twilight zone by direct sequence–structure alignment (Goldstein et al. Reference Goldstein, Luthey-Schulten and Wolynes1992a). Alignment recognition is much easier than molecular dynamics based recognition because the search space for alignment (hundreds of base structures) is smaller. In energy landscape terms, the smaller entropy of possible structures means that a bigger gap results between correct alignments (the ‘folded state’) and the best decoy alignment than would result from searching through all possible polymeric configurations. Still, developing alignment Hamiltonians pinpointed the need to estimate the $T_{\rm f}/T_{\rm g}$ ratios better and in a self-consistent manner; the most troublesome decoys are obviously those that are produced as minima by the same Hamiltonian. This notion of self-consistent optimization is illustrated in Fig. 9 (Koretke et al. Reference Koretke, Luthey-Schulten and Wolynes1996). Self-consistency makes the learning process used in the search algorithm iterative and non-linear, but the learning process is still practicable since the development of parameters needs to be done only once. Such optimized but simple contact-based Hamiltonians also work surprisingly well in off-lattice Monte Carlo structure searches. The server developed by Takada's group (Jin et al. Reference Jin, Kambara, Sasakawa, Tamura and Takada2003) predicts structures quite well for small proteins. It is based on such a landscape-optimized contact energy function but uses molecular dynamics and fragment assembly as search algorithms.
Seeking greater reliability and accuracy in structure prediction has shown that things are not so simple however. When studying larger protein complexes Papoian, Ulander and Wolynes discovered that the simplest contact-based energy functions while being adequate for some interfaces, were inadequate for describing the binding landscape of many interfaces between proteins (Papoian et al. Reference Papoian, Ulander and Wolynes2003a). They found it was important to allow more distant water-mediated interactions between solvent separated residue pairs to contribute to the energetics (Fig. 10). The same parameter-learning strategy used for contact models can still be applied to this style of energy function. Structure prediction based on the resulting water-mediated energy function while not perfect, turns out to be reasonably robust and quite good for moderate size proteins (Fig. 11). This coarse-grained energy function used for de-novo structure prediction still appears to be more frustrated than the real folding energy landscape in so far as misfolded structures are still sometimes found via molecular dynamics, but the extra frustration largely comes from the ambiguities of the local-in-sequence structure information, which is used as input (secondary structure). Such information is encoded via associative memory terms. When a correct native secondary structure alone is used as the only memory input, very few intermediates are found in folding simulations and for most proteins folding is nearly as cooperative as folding processes in the laboratory. In this review, we will use this rather sophisticated but still coarse-grained energy function to quantify frustration. It is worth remembering therefore that the coarse-grained model we use to illustrate frustration effects in natural proteins still probably errs on the side of exhibiting greater frustration than is present in reality. 
This being the case, one should take most seriously then indications of very high levels of frustration from this model but probably one should not ascribe too much significance when there are only modest variations or changes in frustration level as measured by this energy function.
2.3. Localizing and visualizing frustration in natural proteins
Like the elementary landscape theory, the traditional phenomenological picture of folding kinetics using single rate coefficients provides only a global description. By global we mean that the protein as a whole is treated as one single entity having no localized substructure. This global phenomenological description works well for small, well-folded proteins but, even phenomenologically, the global picture begins to break down for larger proteins, which are usually made up of identifiable domains that can fold independently as kinetic units. In multidomain proteins, rather often, the individual domains fold in separate steps rather than all at once. The same multistep folding behavior is seen in long repeat proteins (Werbeck et al. Reference Werbeck, Rowling, Chellamuthu and Itzhaki2008). These observations impel us to the recognition that the forces in biomolecular energy landscapes act locally even though they have emergent global consequences for the kinetics. How deep does the locally correlated nature of the dynamics go? The ϕ-value analysis of transition-state ensembles of even small proteins suggests that the more highly folded residues tend to be close to each other in the native state. A simple model of local folding on a minimally frustrated landscape, called the capillarity model (Ferreiro & Wolynes, Reference Ferreiro and Wolynes2008; Galzitskaya & Finkelstein, Reference Galzitskaya and Finkelstein1999; Wolynes, Reference Wolynes1997b), likens folding to crystallization of a small droplet. This (oversimplified) analysis suggests that a compact region encompassing about 4/27 ≈ 15 percent of a single domain protein must come to a near-native configuration before folding can proceed downhill to the native state in a free-energy sense (Wolynes, Reference Wolynes1997b). Typically for small proteins such a compact core contains about 20 residues. Local excursions away from the native state can be monitored by hydrogen–deuterium exchange. 
These unfolding elements are of a similar size to the folding core (Weinkam et al. Reference Weinkam, Zong and Wolynes2005). The size scale of both folding nuclei and unfolding units is then comparable to the protein segments encoded in the exons of eukaryotic genes (Gilbert, Reference Gilbert1987). It is plausible then that regions of this size are ‘foldons’ (Panchenko et al. Reference Panchenko, Luthey-Schulten and Wolynes1996), i.e. minimal folding units. These arguments suggest we can meaningfully discuss the level of frustration of the energy landscape of individual foldons rather than consider only the protein, globally, as a whole. Scanning contiguous regions of sequence using the Z-score criterion for the energy in the native configurations can thus be used to assign levels of frustration locally. By doing this, determining the Z-scores of contiguous segments, Panchenko et al. showed that there often is a rough correspondence between those regions that were minimally frustrated based on their internal interactions and thus would be capable of selecting a unique structure on their own and the regions encoded by genetic exons (Panchenko et al. Reference Panchenko, Luthey-Schulten and Wolynes1996).
It is hard to refrain from going still further down in length scale in characterizing the energy landscape in order to ask whether some particular physicochemical interaction between two specific residues in a protein is actually what gives rise to frustration, just as we did for the magnets, where individual plaquettes could be said to be frustrated or not depending on whether as isolated units these clusters of spins would have simple ground states. One must bear in mind though that such an analysis cannot be completely rigorous in proteins any more than it was for magnets: the global structure is determined by the cooperation of many elementary subunits; the clusters are not isolated completely. At the same time we also know that individual interactions in a protein cannot be tuned by mutation – instead several interactions change at once when a mutation is made at the amino acid level, all the interactions involving a given residue change when that residue is mutated. Still this degree of unavoidable spatial coarse graining suggests frustration can be defined at a pretty fine level. Recognizing the mutational limit on the length scale where minimal frustration can be expected suggests a reasonable, but by no means unique strategy for finding and locating the origins of frustration in particular proteins: we can computationally examine the free-energy changes that would be made upon introducing mutations and compare the energy of altered choices with what occurs in the original protein. Such a procedure actually generates a series of measures of localized frustration (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2007). The approximate localization of frustration via this approach has proved to be helpful in visualizing regions of high frustration. 
The localized frustration analysis, while not being rigorous makes vivid the widespread, delocalized nature of the minimally frustrated interactions that do, in the end, allow proteins to behave globally as minimally frustrated systems with smooth funnels.
When trying to assess frustration locally we are ineluctably breaking up the full energy function of the protein into parts. As we have said, this step has some arbitrariness in it. Nevertheless the breakup of the energy function seems straightforward conceptually when the force field is written in the first place explicitly as a sum over pairs. One can then imagine assigning contacts as being frustrated or not just as we did for the few letter lattice models of heteropolymers that we discussed earlier. The best coarse-grained energy functions are not pairwise additive, however. For non-pairwise additive force fields local frustration indices can be obtained using the in silico mutational strategy postulated above. A local frustration index should quantify how much a residue or residue pair contributes to the energy in a given structure compared to what it would contribute in a typical decoy or molten globule configuration. Making numerous mutations and changing local environments lets us assess the mean and variance of the energies of molten globule configurations (decoys) relative to the native. By normalizing the difference between the native and the average decoy by the contribution of the same residue (or pair of residues) to the variance of the energies of the decoys, we can get an idea whether that contribution to the energy is typical of what would be expected in a minimally frustrated protein or more like that in a random unevolved heteropolymer. In the minimally frustrated protein, it is necessary that globally $E_{\rm Total}$ is less than $E_{\rm g}$. Requiring each part of the interaction energy to satisfy a similar constraint, $E_{ij}^N \lt E_{ij,{\rm G}}^{{\rm random}} $, is of course overly stringent since only the total energy needs to be low, so we would expect (and find) that only a fraction of individual sites or of the interacting pairs would be minimally frustrated by themselves when we do not consider their cooperation with their neighbors.
At the level of a single residue, a frustration index can be assigned to each residue via such a set of mutations as
Here $E^{T,N}$ is the total energy of the protein in the native configuration, taken as $E^{T,N} = \sum\nolimits_{k \ne i}^N {(E_{{\rm contact}}^{i;k} + E_{{\rm water}}^{i;k} ) + E_{{\rm burial}}^i} $, according to the tertiary interaction terms of the associative memory Hamiltonian with water-mediated interactions (AMW) energy function (Papoian et al. Reference Papoian, Ulander and Wolynes2003a). This energy considers all the interactions that residue i makes with residues k, either in a direct contact, $E_{{\rm contact}}^{i;k} $, or in a water-mediated interaction, $E_{{\rm water}}^{i;k} $, and via a single-body burial energy term, $E_{{\rm burial}}^i $. The average energy of the decoys $\langle E_{i{\rm \prime}} ^{T,U} \rangle $ is computed by mutating residue i to every other possible residue. As the 20 genetically coded amino acids are not all equally probable, the decoy energy is calculated with weights according to the amino acid composition of the chain. These mutations are evaluated from the sequence-specific contact and burial terms from the AMW force field with parameters $\lambda_i$, $r_{i,k}$, $\rho_i$ that correspond to the amino acid identity, interaction distance, and density, respectively (Papoian et al. Reference Papoian, Ulander and Wolynes2003a). Similar recipes could be used for other coarse-grained energy functions.
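The single-residue recipe can be illustrated with a short sketch. It assumes a Z-score-like form, $(\langle E_{i{\rm \prime}}^{T,U} \rangle - E^{T,N})/\sigma$, where the mean and standard deviation $\sigma$ of the decoy energies are weighted by amino acid composition. The sign is chosen here so that larger values mean less frustration, which agrees with the 0·78 cutoff quoted later in the text but is an assumption about the convention. The function name and all numbers are hypothetical, and a real calculation would take the energies from a force field such as AMW.

```python
from math import sqrt


def single_residue_frustration(native_energy, decoy_energies, weights):
    """Z-score-like index for one site.

    Positive when the native residue is more favourable (lower in energy)
    than the composition-weighted average over mutated decoys.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    mean_decoy = sum(p * e for p, e in zip(probs, decoy_energies))
    var_decoy = sum(p * (e - mean_decoy) ** 2
                    for p, e in zip(probs, decoy_energies))
    return (mean_decoy - native_energy) / sqrt(var_decoy)


# Hypothetical example: total energies after mutating site i to four residue
# types, each weighted by how often that type occurs in the chain.
decoy_energies = [-1.0, 0.0, 1.0, 2.0]
weights = [1, 1, 1, 1]

f_favourable = single_residue_frustration(-2.0, decoy_energies, weights)
f_typical = single_residue_frustration(0.5, decoy_energies, weights)
print(round(f_favourable, 3), round(f_typical, 3))
```

In this toy case a native energy well below the decoy average gives a clearly positive index, while a native energy equal to the decoy average gives an index of zero.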
In the case of pairs of residues, we ask: how favorable is the actual native pair relative to other possible interactions? To compute the frustration index for interacting pairs of amino acids i,j simultaneous mutations on residues i and j are made. We have proposed two related but complementary ways for localizing frustration at the pairwise contact level. These ways differ in how the set of decoys is constructed. In one choice, the decoy set is made by randomizing only the identities of the interacting amino acids i,j, keeping all other interaction parameters at their native value. This scheme effectively evaluates every possible mutation of the amino acid pair that forms a particular contact in a robustly fixed structure. We call the resulting index the ‘mutational frustration’:
The decoy energy distribution is calculated by randomly selecting amino acid identities from the protein composition and fixing the density $\rho_i$ and the pairwise distances $r_{i,j}$ to those of the native conformation. It is worth noting that the energy change upon pair mutation not only comes directly from the particular contact probed but also changes through interactions of each residue with other residues not in the pair, as those contributions may also vary upon mutation. One advantage of the mutational frustration index is that, in principle, this local measure of frustration also could be experimentally determined in the laboratory by combinatorial protein engineering.
A second way of quantifying pairwise local frustration imagines that the residues are not only changed in identity, but also can be displaced in location: how favorable is the native interaction between two residues in the native structure relative to other interactions these residues could form in globally different compact structures? The energy variance thus reflects contributions from the energies of molten globule conformations of the same polypeptide chain. For this index, especially suitable for examining alternative tertiary structures, the decoy set involves randomizing not just the identities but also the distances $r_{i,j}$ and densities $\rho_i$ of the interacting amino acids. Given a protein sequence and structure, this strategy compares the energy of each native contact with the distribution of a randomly interacting set.
When $E_{i,j}^N = \langle E_{i{\rm \prime}, j{\rm \prime}} ^U \rangle $, the native energy would not be discriminated from the typical energy of a random interaction in the molten globule and $F_{ij}^{c} \approx 0$. This scheme effectively evaluates the native pair with respect to a set of structural decoys that might be encountered in the folding process. We call the frustration index computed in this way the ‘configurational frustration’.
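The difference between the two decoy sets can be made concrete with a toy calculation. The sketch below is an illustration under loud assumptions: the two-well pair energy, its parameters, the two-letter alphabet, the distance thresholds and the list of distances are all hypothetical and are not the AMW force field. The sketch also keeps only the energy of the probed pair and ignores burial terms and the interactions with residues outside the pair, which the actual indices include. The sign is chosen so that positive values mean a favorable native pair.

```python
from itertools import product
from math import sqrt

# Hypothetical toy parameters for direct and mediated wells (NOT AMW values).
GAMMA_DIRECT = {("H", "H"): -1.0, ("H", "P"): 0.3, ("P", "P"): -0.2}
GAMMA_MEDIATED = {("H", "H"): 0.2, ("H", "P"): -0.1, ("P", "P"): -0.6}


def pair_energy(a, b, r):
    """Toy pair energy: one value for close pairs, another for mediated ones."""
    key = tuple(sorted((a, b)))
    if r < 6.5:
        return GAMMA_DIRECT[key]
    if r < 9.5:
        return GAMMA_MEDIATED[key]
    return 0.0


def z_index(native, decoys, weights):
    """(weighted decoy mean - native) / weighted decoy standard deviation."""
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, decoys)) / total
    var = sum(w * (e - mean) ** 2 for w, e in zip(weights, decoys)) / total
    return (mean - native) / sqrt(var)


def mutational_index(a, b, r, composition):
    """Decoys: every identity pair, with the distance fixed at the native r."""
    decoys, weights = [], []
    for (x, px), (y, py) in product(composition.items(), repeat=2):
        decoys.append(pair_energy(x, y, r))
        weights.append(px * py)
    return z_index(pair_energy(a, b, r), decoys, weights)


def configurational_index(a, b, r, composition, distances):
    """Decoys: identities and distances are both varied."""
    decoys, weights = [], []
    for (x, px), (y, py) in product(composition.items(), repeat=2):
        for d in distances:
            decoys.append(pair_energy(x, y, d))
            weights.append(px * py)
    return z_index(pair_energy(a, b, r), decoys, weights)


composition = {"H": 0.5, "P": 0.5}
distances = [5.0, 6.0, 7.5, 8.5, 11.0]  # hypothetical pair distances
f_mut = mutational_index("H", "H", 5.0, composition)
f_conf = configurational_index("H", "H", 5.0, composition, distances)
print(round(f_mut, 2), round(f_conf, 2))
```

Enumerating the decoys with composition weights, rather than sampling them at random, keeps the example deterministic. The two indices differ for the same native pair because the configurational decoy set also includes pairs placed at other distances.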
2.4. Patterns of frustration in globular proteins
In visualizing frustration in proteins, it is useful to simplify the information of the frustration analysis into a discrete form like that used for magnets or lattice models where an interaction can be said to be frustrated or not. Given the continuous nature of the frustration index, physically motivated but ultimately non-unique choices of cutoffs must be made. It is particularly interesting to highlight those interactions that are by themselves minimally frustrated, that is, those whose energy is deeper than we would expect in the ground state of a random heteropolymer. Using the entropy estimate for molten states of 0·6 $k_{\rm B}$ alluded to earlier, minimal frustration corresponds to having an $F_i$ index larger than 0·78 (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2007). Many of the remaining contacts are neutral, that is, they lie near the center of the distribution of possible energies in compact decoy states. While these are important interactions, they are so numerous that they would obscure images of the deeper stabilizing contacts. On average, if they are consistent, they can reinforce the funneling of the landscape, but only weakly. On the other hand, some contacts appear highly frustrated, that is, they are not only somewhat destabilizing but are also, in fact, unusually destabilizing in the native structure. We define highly frustrated contacts to be those having high energies 1 σ above the mean. Such interactions would seem to contribute little to the specificity of the folding to the native structure. So why are they there?
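The discretization just described amounts to a three-way classification of each index. A minimal sketch follows, assuming that the index is signed so that positive values are favorable, in which case ‘energies 1 σ above the mean’ corresponds to an index below −1. That reading of the second cutoff, the function names and the sample values are assumptions made for illustration.

```python
from collections import Counter

MINIMAL_CUTOFF = 0.78  # cutoff quoted in the text
HIGH_CUTOFF = -1.0     # assumption: energy 1 sigma above the decoy mean


def classify(index):
    """Map a continuous frustration index onto three discrete classes."""
    if index > MINIMAL_CUTOFF:
        return "minimally frustrated"
    if index < HIGH_CUTOFF:
        return "highly frustrated"
    return "neutral"


def class_fractions(indices):
    """Fraction of contacts falling in each class."""
    counts = Counter(classify(f) for f in indices)
    return {label: counts[label] / len(indices)
            for label in ("minimally frustrated", "neutral", "highly frustrated")}


# Hypothetical indices for ten contacts.
sample = [1.9, 1.2, 0.9, 0.8, 0.5, 0.1, -0.3, -0.7, -1.2, 0.0]
fractions = class_fractions(sample)
print(fractions)
```

Applied to all contacts of a structure, fractions of this kind are the quantities one would tabulate in a survey of how prevalent each class is.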
To get a feeling for the patterns of frustration, in Fig. 12 we show images of the frustration pattern in the prototypical protein lysozyme. One can see that the protein is networked by a connected set of contacts all of which are minimally frustrated (green) while there are two patches of highly frustrated contacts (red) along with other neutral interactions that are not shown. The frustrated contacts are near to the active site of this protein, which functions as a glycoside hydrolase enzyme.
The bulk of the rest of this review will focus on how frustration patterns of folded proteins can be correlated with folding dynamics and functional aspects of protein dynamics for many specific systems. Before studying particular cases however it is worth making some general observations about the prevalence of frustration when surveying the proteomic database synoptically. First a surprising number of direct contact interactions in native protein structures are individually minimally frustrated. On average about 48% of the direct interactions are individually minimally frustrated and another 46% are neutral. It is easy to see then why proteins as a whole behave as if they are minimally frustrated globally. This also makes clear why the notion used in structure-based folding simulations (that nearly all native contacts are stabilizing) is a good first approximation to an evolved protein's energy landscape. Only about 6% of the direct contacts turn out to be highly frustrated. The indirect water-mediated interactions are seemingly less important in guiding the protein to its native state; only 26% of these are minimally frustrated by themselves while a much larger 61% are neutral, leaving 13% highly frustrated. The contribution of water-mediated interactions to global landscape funneling is far from negligible, however. Minimally frustrated residues are generally more buried than highly frustrated residues and are more likely to be found in α-helices or β-sheets rather than in turns or bends. It is possible that some turns actually relieve their frustration in more complex ways not visible in a frustratogram. Side-chain-backbone contacts which are commonly made in ‘turn signals’ may relieve frustration in a way not captured by the coarse-grained energy function (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2007).
The pair distribution functions of minimally frustrated and highly frustrated residues are different from each other and are distinct from the pair distribution of all pairs irrespective of the frustration level. As can be seen in Fig. 13 the minimally frustrated residues are more strongly correlated at large distance than are residues in general. This long-range correlation allows the minimally frustrated contacts to provide the main rigidifying elements of the structure. In contrast, highly frustrated contacts are strongly clustered and localized, not spread out: whole localities exhibit a high level of frustration: the question of why these frustrated regions exist takes up much of the rest of this review.
3. Frustration in the evolution of natural proteins and in protein design
3.1. Frustration and the complexity of Nature's amino acid code
Theoreticians aptly explain many aspects of the fundamentals of protein folding by studying simplified models. Sometimes seen as surreal ‘spherical cow’ models, these simplifications yield testable and communicable hypotheses such as those uncovered in the lattice worlds discussed above or that one finds using structure-based models (Clementi, Reference Clementi2008). These are thought of as exaggerations because such extreme simplifications are difficult to construct in the laboratory, although some amusing literally macroscopic models of proteins have been built (Reches et al. Reference Reches, Snyder and Whitesides2009). In working with real proteins though, one obvious simplification route would be to somehow reduce the apparent complexity of the amino acid sequences. Twenty amino acid types seems like a lot, at least to first year graduate students. It is still an intriguing fact that natural protein sequences appear to be random (Weiss et al. Reference Weiss, Jimenez-Montano and Herzel2000), while most random amino acid sequences do not appear to be proteins (Keefe & Szostak, Reference Keefe and Szostak2001). By most statistical tests natural protein sequences could encode huge amounts of information, so it has never been obvious which signals are strictly necessary and sufficient for folding and which are simply relics of protein natural history (Zuckerkandl & Pauling, Reference Zuckerkandl and Pauling1965).
Guided by the ‘sequence codes structure’ hypothesis, researchers have attempted to simplify the coding by replacing the apparently redundant amino acids, so that proteins can be made via a simpler palette of possibilities. Information redundancy could be experimentally traced by investigating the structure and function of simplified proteins with some of the apparent redundancy removed (Clarke, Reference Clarke1995). Simplified sequences have been made by simultaneously mutating multiple residues to a single amino acid type. When BPTI protein was mutated by replacing 29 of the total 58 amino acid residues with alanine, it was found that the sequences could still fold (Kuroda & Kim, Reference Kuroda and Kim2000). The effects of a simplifying substitution depend on the position of the mutation in the sequence. Nevertheless, when properly located, simplifying a site only marginally changes overall folding stability of the domain. Most mutants were found to be thermodynamically destabilized and cumulative mutations appear to have additive effects, suggesting that the sites that were chosen to target for mutation do not strongly contribute to the cooperative folding core of this domain (Kato et al. Reference Kato, Yamada, Nakamura, Kidokoro and Kuroda2007). When additional substitutions were tried, the protein became too unstable and usually a molten globule state (i.e. an ensemble of conformations) was populated. This is analogous to the situation described for the lattice model proteins where the overall destabilization of the fold relative to the disordered ensemble occurs when the variety of residue types is reduced. While the native stability often decreases because some favorable native contacts are not formed, the possibility of mis-matches of the homopolymer sections also increases. These effects can be seen as manifestations of frustration in the landscape. 
One key ingredient is the placing of the simplified sites where alanines were already shown to be tolerated. The point mutation A16V in an exposed loop of BPTI severely destabilizes the fold but causes no significant conformational change in the mean native structure when folding does occur. In contrast, an eight alanine substitution in the protease-binding site does not severely affect stability, but does stabilize other conformations (Cierpicki & Otlewski, Reference Cierpicki and Otlewski2002). When the effect of the substitution cannot be attributed to structural deviations, some kind of strain is usually invoked by protein engineers. Strain, tension, conflict, and unsatisfaction are all terms that reflect frustration effects on protein folding. Tinkering with protein stability by multiple alanine substitutions has also revealed strain effects in the classical models of T4 lysozyme (Liu et al. Reference Liu, Baase and Matthews2000), and ARC repressor (Brown & Sauer, Reference Brown and Sauer1999). These studies suggest that the tolerance to mutation is not evenly distributed, but instead is rather polarized in these structures, both in the native stability and in the contributions to the rate-limiting steps (Gassner et al. Reference Gassner, Baase, Lindstrom, Lu, Dahlquist and Matthews1999). A series of multi-alanine substitutions carried out in single contiguous stretches shows that the lysozyme fold is robust to most changes in helices, turns and loop regions but that some substitutions can lead to strong effects, revealing that local signals such as helix capping act to reduce the frustration in the natural protein (Zhang et al. Reference Zhang, Baase and Matthews2002). When the $T_{\rm f}/T_{\rm g}$ ratio is large enough, the protein is robust to substitutions. As the energy gap is lowered, the conflicting effects show up, sometimes catastrophically.
Going beyond making substitutions of a specific and precise character, combinatorial libraries have been exploited to search for foldable sequences using reduced sets of amino acids. These libraries provide a rich source of diversity for the discovery of novel proteins. At the same time these studies begin to answer questions on the minimal set requirements, while of course the practical size of the screening libraries is a concern (Bradley et al. Reference Bradley, Thumfort and Hecht2006). Is the existing genetically encoded set of 20 amino acids the minimal set needed to fold all proteins? As natural protein sequences usually derive from common ancestors and the existing fold classes are believed to be limited (Brenner et al. Reference Brenner, Chothia and Hubbard1997), robust identification of fold class has been used as a constraint to test for whether reductions of coding information is possible. It has been shown that coding alphabets reduced up to 12 letters can still detect 90% of the fold classes. Further reduction rapidly degrades the signal in homology detection (Murphy et al. Reference Murphy, Wallqvist and Levy2000). The success of any reduction approach critically relies on how the residue grouping is done, thus different scoring matrices for alignments are obtained when the criteria changes (May, Reference May1999). It may well be possible that the optimal grouping of the amino acids into fewer types depends on the fold class of the structures, and that the observed natural set of amino acids is the actual minimal one that would satisfy all the functional constraints. There is still no absolute standard against which to compare the groupings of amino acids, and chemical intuition has been the most common inspiration for reduced alphabets.
Energy landscape theory shows that increasing the size of the alphabet allows for the possibility of decreasing the ruggedness, while still maintaining the stability gap. Hence, bigger alphabets allow for more robust encoding of various specific folds. Once a single overall topology is specified, successful design using few-letter codes becomes easier. When the overall frustration of the starting sequence is low, competing structures are automatically destabilized by the landscape topography. Some of the complexity of protein sequences can then be reduced, as shown in several experimental cases. Riddle et al. selected a functional SH3 domain sequence in which 95% of the amino acids were reduced to a five-letter alphabet of isoleucine, lysine, glutamic acid, alanine (IKEA) and glycine (G). Physical measurements on this protein suggest that the simplified protein adopts native-like conformations (Riddle et al. Reference Riddle, Santiago, Bray-Hall, Doshi, Grantcharova, Yi and Baker1997) and conserves the overall folding dynamics (Yi et al. Reference Yi, Rajagopal, Klevit and Baker2003). Notably, this group also reported a valuable negative result: they were unable to find functional SH3 proteins when the alphabet was further reduced to the three-letter code IKE (Riddle et al. Reference Riddle, Santiago, Bray-Hall, Doshi, Grantcharova, Yi and Baker1997). While success has been reported by other groups in reducing α-helical proteins to three-letter codes (Beasley & Hecht, Reference Beasley and Hecht1997) (see below), this apparently is not possible for the β-sheet-rich SH3 topology. Frustration in tertiary interactions may be at play. Figure 14 shows the local frustration patterns of models of this SH3 domain for various simplified sequences. It is clear that, just as for the lattice model protein (Fig. 5), the density of minimally frustrated interactions diminishes as the complexity of the alphabet is reduced (Fig. 14). 
Also coding simplification leads to the appearance of more highly frustrated interactions. Frustration between alternate states limits the ability to reduce coding complexity. A specific structure cannot be encoded in a homopolymer. An extremely simple alphabet nevertheless allows for encoding the simplest, often highly symmetrical folds. Increasing the variety of interactions makes minimally frustrated structures easier to find even when the topology is intricate. Some of the apparent complexity of protein sequences thus arises presumably not just as a confounding result of historical accidents within a highly degenerate folding code, but may actually be needed in order to encode a suitably wide range of structures with sufficient discrimination in their energy landscapes (Wolynes, Reference Wolynes1997a). Within a rich alphabet much more varied structures become ‘designable’, as many sequences can satisfy the constraints and minimize frustration.
3.2. Frustration in artificially designed proteins
Protein design is a promising endeavor that still can frustrate protein engineers. The moving parts of macroscopic machines can be unambiguously cataloged and combined. This is harder to do for proteins, where independent parts may be distinguishable on larger scales such as domains and multimolecular assemblies, but even at large scales cooperative effects often surprise us. At the lower level of engineering single domains, combinatorial libraries of de novo amino acid sequences provide an overly rich source of diversity. Even for a relatively small protein it is impossible to sample all possible sequences: for a chain of 100 residues composed of the 20 amino acids there are 20^100, or about 1·3 × 10^130, possible polymers. A collection containing one molecule of each sequence would fill a volume larger than Avogadro's number of universes (Beasley & Hecht, Reference Beasley and Hecht1997). As randomly generated sequences rarely fold, the effective search space of protein design must be reduced. Even strong simplifications yield libraries still too vast for exhaustive search to be practical. The alphabet reduction approaches described in the previous section therefore become a necessary alternative. A ‘binary code’ for protein design was a pioneering reduction used to construct libraries where folded proteins were likely to be found (Kamtekar et al. Reference Kamtekar, Schiffer, Xiong, Babik and Hecht1993). In these libraries, encoding up to five polar and six non-polar amino acids, families of four-helix bundles were found to be well ordered, and could then be successfully modified to introduce cofactor binding, catalytic activity, and other functional properties, producing some of the first protein-based nanomaterials (Smith & Hecht, Reference Smith and Hecht2011).
Four helix bundles based on only three amino acids have been produced, exemplifying that symmetrical structures are more designable than asymmetric ones (Wolynes, Reference Wolynes1996).
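The size of the search space quoted above can be checked with exact integer arithmetic. The short Python sketch below counts the sequences of a 100-residue chain for the full 20-letter alphabet and, purely as an illustration, for a strict two-letter (polar/non-polar) pattern. The function names and the two-letter simplification are assumptions introduced here; the actual binary-code libraries allow several amino acids at each polar or non-polar position and are therefore larger than the two-letter count.

```python
def sequence_space(alphabet_size: int, length: int) -> int:
    """Number of distinct linear sequences of the given length."""
    return alphabet_size ** length


def as_scientific(n: int) -> str:
    """Format a large positive integer as 'm x 10^e' without float overflow."""
    exponent = len(str(n)) - 1
    mantissa = n / 10 ** exponent  # big-int true division, result fits in a float
    return f"{mantissa:.2f} x 10^{exponent}"


full_alphabet = sequence_space(20, 100)   # all 20 amino acids
two_letter = sequence_space(2, 100)       # hypothetical strict polar/non-polar code

print("20 letters:", as_scientific(full_alphabet))
print(" 2 letters:", as_scientific(two_letter))
```

Even the two-letter pattern space is far beyond what can be screened, which is why library design has to bias the search toward foldable sequences rather than enumerate them.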
The principle of minimal frustration provides a framework for practical engineering questions (Onuchic et al. Reference Onuchic, Luthey-Schulten and Wolynes1997). The essence of landscape engineering is that, in its simplest form, the theory requires quantifying only a few parameters: the stability gap between E s, the energy of the ground (native, folded) state of the protein, and E d, the mean energy of the excited (misfolded) states; and δE, the roughness of the energy landscape. Natural foldable proteins maximize T f/T g, the ratio of the folding temperature to the glass transition temperature, suggesting a physically meaningful criterion in the design of energy functions for protein folding in silico. The essential feature of a good potential function is that the energy of a sequence in the native structure lies below the mean energy of the misfolded states by much more than the width of their distribution. In such a case, the folded conformation will automatically have an energy lower than that of all alternative conformations.
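The gap criterion described above can be illustrated with a few lines of Python. The sketch below takes a native-state energy and a set of decoy (misfolded) energies and returns the gap between the native energy and the decoy mean, divided by the width of the decoy distribution. The function name and the toy energies are hypothetical, chosen only to show the calculation; they do not come from any published energy function.

```python
from statistics import mean, pstdev


def gap_to_roughness(native_energy: float, decoy_energies: list[float]) -> float:
    """Stability gap measured from the decoy mean, in units of the decoy width.

    A large positive value means the native state lies far below the
    distribution of misfolded states, as a funneled landscape requires.
    """
    decoy_mean = mean(decoy_energies)      # mean energy of misfolded states
    roughness = pstdev(decoy_energies)     # width of the decoy distribution
    if roughness == 0:
        raise ValueError("decoy energies must not all be identical")
    return (decoy_mean - native_energy) / roughness


# Toy numbers (arbitrary energy units), assumed for illustration only.
decoys = [-10.0, -12.0, -8.0, -11.0, -9.0]
well_designed = gap_to_roughness(-20.0, decoys)    # native far below the decoys
poorly_designed = gap_to_roughness(-11.0, decoys)  # native buried among the decoys

print(f"well designed:   {well_designed:.2f}")
print(f"poorly designed: {poorly_designed:.2f}")
```

A sequence can have a low native energy and still score poorly on this ratio if its decoys are equally low in energy, which is why optimizing the gap and optimizing the native energy alone are different design targets.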
Inspired by these principles, the Takada laboratory developed an automated procedure to design sequences to fold to a given simple structure (Jin et al. Reference Jin, Kambara, Sasakawa, Tamura and Takada2003). Folding simulations of a natural (target) fold were performed and a set of misfolded structures or ‘decoys’ was selected. Design requires crafting a global funneled landscape by figuring out how to design out a vast number of such misfolded configurations. This can be done with statistical energy landscape theory. Sequences were searched to either simply minimize the energy in the native state or to maximize the energy gap between folded and misfolded forms as landscape theory would endorse. It was found that the sequences optimized taking into account the gap display stable secondary and tertiary interactions. The one whose design was based on simply getting a low-energy native structure did not show a well-dispersed NMR spectrum, suggesting that it did not have a unique fold (Jin et al. Reference Jin, Kambara, Sasakawa, Tamura and Takada2003). Figure 15 shows the local frustration patterns of the designed and the natural proteins. All three actually have a large fraction of minimally frustrated interactions, with the energy-optimized sequence presenting an unusually high number of them. The largest cluster of highly frustrated interactions in the natural protein coincides with the albumin-binding site. This frustrated region (apparently evolved for function) is absent in the designed sequences. To explore the effects of the tertiary energetic frustration in these systems, Truong et al. performed folding simulations and structural predictions using AWSEM (Truong et al. Reference Truong, Kim, Schafer and Wolynes2013). They found that small differences in frustration level do impact the quality of the predictions. 
The fold of the natural sequence is indeed best predicted by simulated annealing, unlike the designed sequences that show lower fidelity in a structure prediction. Unlike the nicely funneled landscape of the natural sequence, landscapes of the designed proteins display complicating features. The main difference between the designed proteins and the natural ones was indeed related to the gap-criterion used for design. The natural sequences turn out to display the largest difference between the native energy and the energies of decoys (Truong et al. Reference Truong, Kim, Schafer and Wolynes2013).
A very nice example showing the possible breadth of protein design was realized by the Baker laboratory in their creation of Top7, a protein with a non-natural fold (Kuhlman et al. Reference Kuhlman, Dantas, Ireton, Varani, Stoddard and Baker2003). The design strategy started with a trace of the topology for which fragments of naturally occurring sequences in other proteins were selected to match the desired secondary structure elements. The sequences were then subjected to further cycles of mutation and selection with the computer scoring the energy of the sequences in the target structure. The selection criteria were based primarily on minimizing the total energy, with a few additional restrictions on the sequences in order to maintain exposed polar residues in an attempt to avoid aggregation. The resulting sequence had no significant similarity to any natural sequence but was shown to fold to a monomeric soluble form whose crystal structure matched the targeted goal (Kuhlman et al. Reference Kuhlman, Dantas, Ireton, Varani, Stoddard and Baker2003). Study of the folding kinetics of designed proteins showed that Top7 folds fast relative to natural proteins of similar size. Nevertheless Top7 displayed complex multiphasic kinetics unlike its natural counterparts (Scalley-Kim & Baker, Reference Scalley-Kim and Baker2004). The effect was judged to come from slow interconversion of collapsed folding intermediates that become populated, a sign of frustration in the landscape. Apparently not all protein sequences that can be crystallized have energy landscapes as smooth as those of natural proteins. Perhaps crystal formation itself helps to separate different conformers. The source of frustration was suggested to be in the extreme regularity of the β-strands, which could favor out-of-register pairings. It is also clear that even the basic topology of Top7 makes for its non-cooperative folding (Truong et al. Reference Truong, Kim, Schafer and Wolynes2013; Zhang & Chan, Reference Zhang and Chan2010). Local frustration analysis indicates Top7 is unusually unfrustrated in an energetic sense. It has 60% of the interactions classified as minimally frustrated and only 6% highly frustrated. This is much less frustrated than are natural proteins of similar size and secondary structure (Fig. 15). Folding simulations of Top7 and a cognate S6 protein with a non-additive structure-based model show that the topological constraints make the natural counterpart fold smoothly in a two-state manner for a fully unfrustrated landscape, while in contrast the Top7 topology allows there to be populated intermediates even on a perfectly funneled surface (Fig. 15). These kinetic topological traps have a polarized structure, with the C-terminal fragments mostly folded with clusters of minimally frustrated interactions, unlike the transition state of the natural protein that is structurally homogeneous (Fig. 15). The high stability of the fold is consistent with the original design, but the low-free-energy structures appear because the sequence was not explicitly designed to avoid them. Folding simulations with the AWSEM potential also show that several non-native structures are energetically competitive, having non-native β-swapping as the main source of degeneracy, as earlier suggested. These and other results (Yadahalli & Gosavi, Reference Yadahalli and Gosavi2014) highlight the idea that the topology of Top7 is in itself a source for conflicts in folding, leading to a kind of ‘topological frustration’.
4. Folding kinetics and frustration
4.1. Kinetics on idealized funnel landscapes
Very often the dominant folding routes as determined via the ϕ-value analysis can be predicted from the native structure alone by assuming a uniform stabilization gain for forming any given contact (Clementi et al. Reference Clementi, Nymeyer and Onuchic2000). In the folding transition-state ensembles, many contacts usually are formed simultaneously. Therefore the law of large numbers allows the differences in energy between different specific native contacts to become averaged out. This averaging allows a uniform purely structure-based model to work. Even when such an averaged ‘vanilla’ funnel landscape model is inadequate, other models (slightly more nuanced) having non-uniform interaction energies work well. These heterogeneous models (Portman et al. Reference Portman, Takada and Wolynes1998, Reference Portman, Takada and Wolynes2001a, Reference Portman, Takada and Wolynesb; Shen et al. Reference Shen, Hofmann, Oliveberg and Wolynes2005; Shoemaker et al. Reference Shoichet, Baase, Kuroki and Matthews1999; Zong et al. Reference Zong, Wilson, Shen, Wolynes and Wittung-Stafshede2006) while nevertheless having completely minimally frustrated landscapes make improved predictions of the variations of rates. At the very least such heterogeneous contact energy models predict why for a particular system there may be an unusual sensitivity of folding route to details. Such sensitivity occurs, for example, when there are two symmetrically related routes to the folded state which compete. The successes of such pure funnel models in predicting experimental results have been reviewed several times before (Oliveberg & Wolynes, Reference Oliveberg and Wolynes2005; Onuchic & Wolynes, Reference Onuchic and Wolynes2004). We also note that the dominance of native contacts in folding routes assumed in idealized funnel models is supported even by recent analyses of state-of-the-art all-atom folding simulations (Best et al. Reference Best, Hummer and Eaton2013; Henry et al. Reference Henry, Best and Eaton2013; Piana et al. Reference Piana, Lindorff-Larsen and Shaw2012).
Folding in idealized funnels does not always progress as smoothly as might be imagined. Real proteins are not geometrically homogeneous systems, so the energy gain for folding some part of the protein may not always completely compensate the chain entropy loss of fixing the residues in the native positions. Thus the free energy changes along the various routes to the native state may differ, giving rise to the appearance of distinct families of macroscopic pathways. For the sake of simplicity, contrast the case in which the native interactions are made mostly between residues close in sequence (such as an α-helical bundle) with another set of routes where the native contacts to be formed are distant in sequence (such as a β-sheet of distant residues). In the helical bundle, fortuitous interactions will be more probable between residues close in sequence, and the entropy loss will be easily compensated by the interaction energy gain. Since the entropy is already reduced, neighboring interactions will be favorably formed, leading to a zipper-like folding mechanism (Zimm & Bragg, Reference Zimm and Bragg1959). In the second case of the β-sheet, the entropy loss of forming sequence-distant interactions is higher, and thus more interactions will be needed to offset the corresponding entropy loss, making it less probable to form a stable enough structure. This argument is at the core of the correlation between folding rate and contact order, which is a rough proxy for loop entropy (Clementi & Plotkin, Reference Clementi and Plotkin2004; Plaxco et al. Reference Plaxco, Simons and Baker1998; Plotkin et al. Reference Plotkin, Wang and Wolynes1997). Even in simple situations, this imperfect cancellation of chain entropy by contact energy can be sufficient to make a sizable free-energy barrier appear, and for the protein to behave as a two-state system, populating mostly folded or unfolded configurations and not much in between.
So even when energetic frustration is removed, and the contact energies between residues are taken to be the same for every pair, the mere topology of the native state gets in the way in defining the locations of the barriers and preferred routes in the folding mechanism. This type of conflict between the need for making native interactions and the necessity of preserving chain connectivity is called ‘topological frustration’.
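The contact-order measure mentioned above is simple to compute from a list of native contacts. The Python sketch below uses the relative contact order, the mean sequence separation of contacting residues divided by the chain length, and compares two idealized 20-residue contact patterns. The contact lists are toy constructions assumed here for illustration, not contacts taken from real structures.

```python
def relative_contact_order(contacts: list[tuple[int, int]], chain_length: int) -> float:
    """Mean sequence separation of the native contacts divided by chain length."""
    if not contacts:
        raise ValueError("at least one contact is required")
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (chain_length * len(contacts))


CHAIN_LENGTH = 20

# Helix-like pattern: every contact joins residues i and i+4 (local contacts).
helix_like = [(i, i + 4) for i in range(CHAIN_LENGTH - 4)]

# Sheet-like pattern: the two chain ends pair up (mostly non-local contacts).
sheet_like = [(i, CHAIN_LENGTH - 1 - i) for i in range(8)]

co_helix = relative_contact_order(helix_like, CHAIN_LENGTH)
co_sheet = relative_contact_order(sheet_like, CHAIN_LENGTH)

print(f"helix-like contact order: {co_helix:.2f}")
print(f"sheet-like contact order: {co_sheet:.2f}")
```

The sheet-like pattern has the higher contact order, reflecting the larger loop-entropy cost of closing contacts between residues far apart in sequence.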
A simple way to investigate the effects of topological frustration is to explore the folding of a protein with structure-based models (Noel et al. Reference Noel, Whitford, Sanbonmatsu and Onuchic2010). These models use native structure as the sole input of the dynamic calculation and assign a favorable energy to every pair of residues that are ‘in contact’. Contact definitions may differ, but usually a distance cutoff is sufficient to determine the most salient characteristics (Noel et al. Reference Noel, Whitford and Onuchic2012). Molecular dynamics is then performed to explore the phase space, and both thermodynamic and kinetic parameters can be determined because here there is no energetic frustration to slow down the sampling. A natural reaction coordinate here is ‘how native-like the protein is’, which is closely related to the number of native contacts that can be easily counted in the snapshots of the simulation (Cho et al. Reference Cho, Levy and Wolynes2006). Although deceptively simple, these schemes based on perfectly funneled landscapes have proven extremely useful over the past decade. It has been shown that the structures of transition-state ensembles (Ejtehadi et al. Reference Ejtehadi, Avall and Plotkin2004), the folding rates (Chavez et al. Reference Chavez, Onuchic and Clementi2004), the existence of folding intermediates (Clementi et al. Reference Clementi, Nymeyer and Onuchic2000), dimerization mechanisms (Levy et al. Reference Levy, Wolynes and Onuchic2004), and domain-swapping events (Yang et al. Reference Yang, Cho, Levy, Cheung, Levine, Wolynes and Onuchic2004) are often very well predicted with these kinds of models.
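The two ingredients of such a model, a native contact list and a native-likeness coordinate, can be sketched in a few lines of Python. The example below builds the contact list from a distance cutoff and then evaluates the fraction of native contacts and a uniform contact energy for any other conformation. The 8 Å cutoff, the minimum sequence separation of three, the single-bead-per-residue representation, and the toy hairpin coordinates are all assumptions made here for illustration; real structure-based models use more careful contact definitions and smooth potentials.

```python
import math
from itertools import combinations

Coord = tuple[float, float, float]


def native_contacts(coords: list[Coord], cutoff: float = 8.0,
                    min_separation: int = 3) -> set[tuple[int, int]]:
    """Residue pairs closer than `cutoff` and at least `min_separation` apart in sequence."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if j - i >= min_separation and math.dist(coords[i], coords[j]) <= cutoff
    }


def formed_contacts(coords: list[Coord], contacts: set[tuple[int, int]],
                    cutoff: float = 8.0) -> int:
    """Count the native contacts that are present in a given conformation."""
    return sum(1 for i, j in contacts if math.dist(coords[i], coords[j]) <= cutoff)


def fraction_native(coords, contacts, cutoff: float = 8.0) -> float:
    """Reaction coordinate Q: fraction of native contacts formed."""
    return formed_contacts(coords, contacts, cutoff) / len(contacts)


def structure_based_energy(coords, contacts, epsilon: float = 1.0,
                           cutoff: float = 8.0) -> float:
    """Uniform contact energy: -epsilon for every native contact that is formed."""
    return -epsilon * formed_contacts(coords, contacts, cutoff)


# Toy eight-bead hairpin (two antiparallel strands 5 A apart) and an extended chain.
native = [(3.8 * i, 0.0, 0.0) for i in range(4)] + \
         [(3.8 * (3 - i), 5.0, 0.0) for i in range(4)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(8)]

contacts = native_contacts(native)
q_native = fraction_native(native, contacts)
q_extended = fraction_native(extended, contacts)
print(len(contacts), q_native, q_extended, structure_based_energy(native, contacts))
```

Because only native pairs are rewarded, the energy decreases monotonically with the number of native contacts formed, which is what makes the landscape of such a model perfectly funneled by construction.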
The study of topological frustration with structure-based models is not only a playground toy to understand the basics of protein folding, but it has also given insights into diverse functional mechanisms. Studies on the inflammatory cytokine interleukin-1β (IL-1β) have yielded significant insights about the modulation of the folding landscape by perturbing functional regions. Structure-based models have identified that a functionally relevant β-bulge (located between β-strands 4 and 5) lies in a region that is topologically frustrated. These simulations suggest that contacts with this region direct folding route selection (Gosavi et al. Reference Gosavi, Whitford, Jennings and Onuchic2008). Deletion of the β-bulge maintains high-affinity receptor binding but abrogates signaling activity, effectively converting the agonist into an antagonist molecule. This insight inspired experimental studies of malleability of the folding process that relate this malleability to functional properties, using proteins where the conflicting region was deleted (Capraro et al. Reference Capraro, Roy, Onuchic, Gosavi and Jennings2012). Both the structure-based modeling and the detailed folding kinetics show that topological frustration causes premature folding of some parts that ideally would have folded later. Undoing the traps becomes a kinetic bottleneck, a phenomenon termed ‘backtracking’ (Gosavi et al. Reference Gosavi, Chavez, Jennings and Onuchic2006). Occupying these traps can be inevitable and in natural environments eventually becomes a functional feature, acting as a signal facilitating export of the slow-unfolding protein, the formation of a functional excited state, etc. Still, topological frustration is not trivial to detect from first principles. We generally still need to run extended simulations to localize the trapping states and characterize the kinetic bottlenecks, all of which are somehow encoded in the native topology.
‘It is nice to know that the computer understands the problem. But I would like to understand it too.’ E. Wigner.
4.2. Chemical frustration
Most of the forces that give three-dimensional shape to one-dimensionally connected protein molecules are individually quite weak, a few k BT being a typical interaction energy between amino acid side chains. Some protein chains however are partly assembled using stronger forces such as covalent disulfide bonds or the coordination bonding so prevalent in metalloproteins to stabilize the three-dimensional structure. Because these ‘chemical’ interactions are stronger than most of the interresidue forces giving rise to the folding funnel, they can lead to an especially large degree of frustration and kinetic trapping in the folding mechanism, especially when folding is studied far away from physiological conditions. While these chemical interactions are very strong they are more specific and relatively fewer in number than the other forces making it possible often to enumerate structural possibilities explicitly.
Disulfide-bonded proteins are especially interesting because of their historical place in studying protein folding. Anfinsen's famous experiment (Anfinsen, Reference Anfinsen1973) did not directly show that the complete three-dimensional fold was spontaneously formed but rather only showed that the cysteines which can pair up in multiple ways when ribonuclease is oxidatively refolded under redox buffered conditions did indeed finally find their way to their proper native pairings – a rather coarse-grained view of the structure! In the language of this review, then, the possible frustration of mispairing disulfides turned out to play a minimal role at least when the disulfide/sulfide exchange is well catalyzed by the glutathione buffer. A historically important next step was the pioneering study of the oxidative folding of bovine pancreatic trypsin inhibitor BPTI carried out by Creighton & Goldenberg (Reference Creighton and Goldenberg1984). Here chemical trapping via alkylation revealed several populated species, many of which, surprisingly, exhibited non-native pairings: a sign of frustration. Indeed this laboratory observation was very much on the mind of one of us (PGW) when he first began to study frustration effects on folding in the 1980s (Bryngelson & Wolynes, Reference Bryngelson and Wolynes1987). Of course the free-energy barrier to rearranging the covalent disulfide pairings depends very much on the oxidation environment so it is not a surprise that chemical frustration can play a role when there are barriers to disulfide interchange. 
Rather soon after landscape ideas were introduced into folding, Kim and Weissman found, ironically, that the non-native intermediates originally found in Creighton's study were actually not well populated when the disulfide rearrangements were better catalyzed with a different buffering system. The ‘chemical frustration’ was thus minimized, leading to a dominantly funneled mechanism for folding via on-pathway intermediates in which primarily native structure was partially formed and where there was very little long-lived non-native structure (Weissman & Kim, Reference Weissman and Kim1991, Reference Weissman and Kim1992). The disulfide exchange processes, of course, are influenced by tertiary structure formation, and premature burial of a disulfide can lead to an intermediate that must ‘back-track’, undoing and remaking disulfides to complete folding. This scenario seems to apply to hirudin folding, as it does to some topologically frustrated proteins without disulfides (Chang, Reference Chang2011).
Metalloproteins provide many other examples of chemical frustration. The paradigm for metalloprotein folding, again of great historical importance to the folding field, is cytochrome c, which has been the subject of many experimental observations (Englander, Reference Englander2000) and more recently several theoretical studies (Weinkam & Wolynes, Reference Weinkam, Wolynes, Wittung-Stafshed and Gomes2010a; Weinkam et al. Reference Weinkam, Zong and Wolynes2005, Reference Weinkam, Zimmermann, Sagle, Matsuda, Dawson, Wolynes and Romesberg2008, Reference Weinkam, Zimmermann, Romesberg and Wolynes2010b). The intense color of the heme made studies of ultrafast folding particularly attractive. While the iron in the heme of cytochrome c is permanently ligated to a histidine, the sixth ligand for the iron, normally a methionine, can transiently be one of many choices as a coordination partner, each of which gives a characteristic optical spectrum for the heme absorption. Under differing solution conditions, different misligated species can therefore be populated and their concentrations monitored. Flooding the system with imidazole or going to low pH, thereby protonating the histidines, prevents non-native histidine–heme misligation. When this source of chemical frustration is minimized simply by changing solution conditions, completion of folding turns out to speed up by several orders of magnitude (Weinkam et al. Reference Weinkam, Romesberg and Wolynes2009). In contrast, chemical frustration can be increased by raising the pH. This allows many histidines and lysines to lose their protons and become able to ligate the iron, stabilizing alternate configurations and eventually leading to an alkali denatured species.
The species populated by alkaline denaturation turn out to have misligated lysines but otherwise are partially structured in a native-like fashion through the remaining minimally frustrated interactions, which of course include a dominant contribution from hydrophobic interactions of the chain with the heme. Models of perfectly funneled landscapes supplemented by just the specific strong chemically frustrated misligation interactions with lysines have been shown to describe the structural properties of the alkali denatured state quite well (Weinkam et al. Reference Weinkam, Zong and Wolynes2005). These models also do a good job of predicting the thermodynamics in terms of bond strengths of the coordinating ligands and the thermal unfolding characteristics of the native proteins (Weinkam et al. Reference Weinkam, Zimmermann, Romesberg and Wolynes2010b).
4.3. Symmetry effects
Symmetry turns out to provide an efficient way for the minimal frustration in biomolecules to have evolved. The basis for this idea is easy to see qualitatively using the notion of the ‘double-or-nothing’ strategy of high-stakes gambling. If you are playing a game where there is a minimum threshold that must be met in order to survive, so that your goal is not just maximizing your expected winnings, then in a fair game you are very much better off increasing the size of your individual bets and making proportionately fewer of them. For folding biomolecules the energetic threshold is set by the floor corresponding to the ground-state energy of random heteropolymers, E g/N, which must be reached for survival.
In a symmetric structure, the same set of interactions is used multiple times – this multiplicity is the meaning of symmetry, after all. So if a strong interaction is found in a symmetric molecule it will appear more often than would otherwise be expected. This corresponds to evolution's making bigger energetic bets, but again of course, fewer independent bets are made for the same size molecule. This is the double-or-nothing strategy in molecular evolution. Because the distribution of energies is thus wider for the symmetric case, low-energy symmetric structures (see Fig. 16) occur much more often than might be naively expected. Symmetric structures are, indeed, prevalent, and often found by chance (symmetric homo-dimers are often found by crystallographers, even when unexpected). In addition, structural symmetries are believed to have evolved from fused gene duplications even when the actual sequences found in present day examples leave no exact repetition or apparent symmetry in the sequence but only in the three-dimensional structure. A rather indirect argument along these lines was made by Wolynes (Reference Wolynes1996) and was more elegantly formalized by Wales (Reference Wales1998). A more direct version is due to André et al. (Reference André, Strauss, Kaplan, Bradley and Baker2008).
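The double-or-nothing argument can be made quantitative under a deliberately simple assumption: that the individual interaction energies are independent Gaussian variables with zero mean and unit variance. The Python sketch below then gives the probability that a structure with a fixed total number of interactions falls below a low-energy threshold when each independent interaction is reused a given number of times by symmetry. The Gaussian model, the function name, and the numbers are assumptions for illustration and are not taken from the works cited above.

```python
import math


def prob_below_threshold(threshold: float, n_interactions: int, copies: int = 1) -> float:
    """P(total energy <= threshold) when each independent energy is used `copies` times.

    The total is copies * (sum of n_interactions / copies unit Gaussians), so its
    standard deviation is copies * sqrt(n_interactions / copies).
    """
    if n_interactions % copies != 0:
        raise ValueError("n_interactions must be a multiple of copies")
    n_independent = n_interactions // copies
    sigma = copies * math.sqrt(n_independent)
    # Gaussian cumulative distribution evaluated at the threshold.
    return 0.5 * math.erfc(-threshold / (sigma * math.sqrt(2.0)))


THRESHOLD = -30.0   # hypothetical survival threshold (arbitrary units)
N = 100             # total number of interactions in the structure

p_asymmetric = prob_below_threshold(THRESHOLD, N, copies=1)
p_twofold = prob_below_threshold(THRESHOLD, N, copies=2)
p_fourfold = prob_below_threshold(THRESHOLD, N, copies=4)

for label, p in [("asymmetric", p_asymmetric), ("two-fold", p_twofold),
                 ("four-fold", p_fourfold)]:
    print(f"{label:>10}: {p:.4f}")
```

In this toy model the energy distribution widens as the square root of the number of symmetric copies, so the chance of reaching the same low-energy threshold rises steeply with symmetry even though the mean energy is unchanged.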
The tendency for symmetric structures to be low in energy means they can show up, seemingly out of the blue, in folding kinetics studies as well as in the natural history of proteins. A cause celebre of this type is the folding of the ROP dimer, a four helix bundle made up of two sets of paired helical coiled coils (Fig. 17). The usual native structure is made by an antiparallel packing of one helical pair on the partner pair of helices. The unusual kinetic anomaly noted by Regan and co-workers was that in some cases making alanine mutations in the individual monomers leads to a dramatic speeding up of the folding rate, while at the same time leaving the equilibrium stability unchanged (Munson et al. Reference Munson, Anderson and Regan1997). This is not consistent with the idea of there being a unique folding funnel and was cited as a problem for the landscape theory. Although a variety of alternative hypotheses were put forward (Lum et al. Reference Lum, Chandler and Weeks1999), Levy et al. conjectured that symmetry was the problem (Levy et al. Reference Levy, Cho, Shen, Onuchic and Wolynes2005). An alternate non-native parallel pairing in the dimer structure was predicted by structure-based simulations to fold much faster than would the usual native antiparallel form, i.e. the parallel form is less topologically frustrated. Both parallel and antiparallel forms use similar contact interactions and are thus energetically comparable and can act as dual funnels. While originally the fast alanine mutants were thought to have the same structure since like the natural protein they bound RNA in vitro, in fact, the mutant was not active in vivo where the RNA concentration was lower. This prediction that there was a structural change based on basic energy landscape theory was confirmed by recent single molecule FRET experiments on ROP (Gambin et al. Reference Gambin, Schug, Lemke, Lavinder, Ferreon, Magliery, Onuchic and Deniz2009). 
By labeling the monomers it was possible to confirm that both parallel and antiparallel packing coexist for the mutant (Fig. 17). The in vitro RNA binding, owing to the high RNA concentration, pulls the structural ensemble over to the natural one by the law of mass action.
4.4. Frustration in repeat proteins
Many natural proteins contain tandem repeats of similar amino acid stretches. These have been broadly classified in groups according to the length of the minimal repeating units. Short repeats up to five residues usually fold into fibrillar structures such as collagen or silk, while repeats longer than about 60 residues usually fold as independent globular domains (Kajava, Reference Kajava2012). There is a class of repeat proteins that lie in between these for which the repeating units couple their folding. For these proteins unique ‘domains’ are not obvious to define (Parra et al. Reference Parra, Espada, Sánchez, Sippl and Ferreiro2013). Repeat proteins are believed to be ancient folds. Their biological activity is usually attributed to mediating specific protein–protein interactions. Their modular structure allows a versatility of recognition parallel to that of antibodies. Successful design of repeat-protein domains with novel functions based on simple sequence statistics (Binz et al. Reference Binz, Stumpp, Forrer, Amstutz and Plückthun2003; Tamaskovic et al. Reference Tamaskovic, Simon, Stefan, Schwill and Plückthun2012) suggests that the folding and the functional signals present in their sequences can be partially segregated. Still, when studied in detail the coupling between folding and binding of natural proteins turns out to be intimately related to their biological function (Ferreiro & Komives, Reference Ferreiro and Komives2010). Typical repeat-proteins are made up of tandem arrays of similar stretches of ~20–40 amino acids that fold up into elongated architectures of stacked repeating structural motifs. Quasi-one-dimensional, these non-globular folds are stabilized only by interactions within each repeat and between adjacent repeats, with no obvious contacts between residues more distant in sequence.
The internal symmetries of repeat-proteins suggest that the overall folding properties of a complete ‘domain’ (the stability and cooperativity of the array) may be derived from a microscopic description of the energy balance within each folding element and its interactions with its neighbors (Aksel & Barrick, Reference Aksel and Barrick2009). Because of the delicate energetic balance in each subunit, subtle variations in the interactions in and between the repeats can give the impression of major changes in the folding landscape while in fact it is well funneled (Ferreiro & Komives, Reference Ferreiro and Komives2007). Such variations may ‘decouple’ the folding of the elements, and partially folded species become populated in kinetic studies. Sufficient information about the population of these states can yield rather quantitative models of the energetic distribution along the protein (Ferreiro et al. Reference Ferreiro, Walczak, Komives and Wolynes2008b; Schafer et al. Reference Schafer, Hoffman, Burger, Craig, Komives and Wolynes2012). Behind the apparent simplicity of repeating amino acid sequences, an extremely rich behavior emerges.
Energy landscape theory argues that the three-dimensionally connected globular proteins must fold along a funneled landscape. One-dimensionality weakens this necessity (Luthey-Schulten et al. Reference Luthey-Schulten, Ramirez and Wolynes1995). Folding simulations based on perfectly funneled model landscapes nevertheless have revealed how finer details of the energetics shape repeat-protein folding and are able to recapitulate the experimental results (Barrick et al. Reference Barrick, Ferreiro and Komives2008). Mello & Barrick (Reference Mello and Barrick2004) have derived a funneled energy landscape from equilibrium unfolding data on the Notch seven-ankyrin-repeat domain and a simple energy function based on the folding of each repeat and the interaction with its neighbors. Werbeck & Itzhaki (Reference Werbeck, Rowling, Chellamuthu and Itzhaki2007) have shown that the energy landscape of a large ankyrin-repeat protein is funneled, yet the kinetics can be described as that for two approximately independent six-repeat subdomains. The different stabilities of these subdomains hint at unevenness in the energy landscape that results in a broad ensemble of species of similar free energies that cannot be distinguished from one another. Single-point mutations cause further landscape unevenness (Werbeck et al. Reference Werbeck, Rowling, Chellamuthu and Itzhaki2008) and give rise to multistate folding kinetics (Itzhaki & Lowe, Reference Itzhaki and Lowe2012). It is important to note that this ‘unevenness’ is of a different sort than the ‘roughness’ discussed in landscape theory, by which is meant the energetic frustration that leads to non-native interactions.
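A nearest-neighbor energy function of the kind described above, with one term for folding each repeat and one for each interface between adjacent folded repeats, can be explored by brute-force enumeration for a short array. The Python sketch below is a generic one-dimensional Ising-like model written in that spirit; it is not the fitted model of any of the cited studies, and the free-energy values, the uniform parameters for all repeats, and the function names are assumptions chosen for illustration.

```python
import math
from itertools import product

RT = 0.593  # kcal/mol near 298 K


def state_free_energy(state: tuple[int, ...], g_intrinsic: float,
                      g_interface: float) -> float:
    """Free energy of one configuration (1 = folded repeat, 0 = unfolded repeat)."""
    n_folded = sum(state)
    n_interfaces = sum(1 for a, b in zip(state, state[1:]) if a and b)
    return n_folded * g_intrinsic + n_interfaces * g_interface


def populations(n_repeats: int, g_intrinsic: float,
                g_interface: float) -> dict[tuple[int, ...], float]:
    """Boltzmann population of every one of the 2**n_repeats configurations."""
    weights = {
        state: math.exp(-state_free_energy(state, g_intrinsic, g_interface) / RT)
        for state in product((0, 1), repeat=n_repeats)
    }
    partition_function = sum(weights.values())
    return {state: w / partition_function for state, w in weights.items()}


def summarize(n_repeats: int, g_intrinsic: float, g_interface: float):
    """Return populations of (fully folded, fully unfolded, partially folded) species."""
    pops = populations(n_repeats, g_intrinsic, g_interface)
    folded = pops[(1,) * n_repeats]
    unfolded = pops[(0,) * n_repeats]
    return folded, unfolded, 1.0 - folded - unfolded


# Hypothetical parameters: each repeat is unstable alone (+3.0 kcal/mol)
# but stabilized by each interface with a folded neighbor (-4.5 kcal/mol).
folded, unfolded, partial = summarize(7, g_intrinsic=3.0, g_interface=-4.5)
print(f"folded={folded:.3f} unfolded={unfolded:.5f} partial={partial:.3f}")
```

Making one repeat or one interface weaker than the others in such a model shifts population into partially folded species, which is one simple way to picture the ‘unevenness’ discussed above without invoking any non-native interactions.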
Hagai et al. have recently compared the folding kinetics of two repeat-protein systems that, despite having similar contact order, display very different folding rates in both experiments and simulations. They used energetically unfrustrated models and showed that modulating the energetic strength of the interface between repeats (relative to that of the intra-repeat contacts) can drastically change the stability, folding rate, and cooperativity of the repeating array (Hagai et al. Reference Hagai, Azia, Trizac and Levy2012). Their study suggested that a conflict appears between forming interfaces to the left or to the right of an already folded internal repeat-unit. They showed that the formation of one interface is attenuated when the other is already formed. They argue that this topological frustration will often arise as a consequence of the quasi-monodimensional geometry of repeat proteins. According to their analysis, these systems can be designed to fold faster by decreasing the strength of the interface or parts of it, resulting in less frustration in the formation of the two interfaces, and consequently in a higher probability for forming the intra-repeat contacts. This reduces the conflict involved in forming the two interfaces of a given repeat unit and in turn gives rise to a less cooperative behavior (Hagai et al. Reference Hagai, Azia, Trizac and Levy2012).
Unlike globular domains, solenoid repeat architectures can be severely truncated and nevertheless still retain cooperative folding properties. Javadi & Main (Reference Javadi and Main2009) have described the folding kinetics of a series of designed TPR-repeat proteins assembled with identical repeats, from two to ten repeats, and have described how the energy landscape changes with the addition of repeat units. Consistent with theoretical predictions (Ferreiro et al. Reference Ferreiro, Walczak, Komives and Wolynes2008b), they observed that the longer proteins display more complex multistate relaxations. Moreover, although the initial folding event of all these proteins involves a nucleus with similar solvent accessibility, the subsequent folding of the other regions depends directly on the repeat number, with the longest protein populating an apparent off-pathway intermediate (Javadi et al. Reference Javadi and Main2009). Other repeat-protein topologies show congruent folding characteristics. Long arrays of ankyrin and HEAT repeats display complex denaturant-dependent relaxations, indicative of sequential transition states and parallel pathways (Itzhaki & Lowe, Reference Itzhaki and Lowe2012). Kinetic studies on site-directed mutants suggest that the sequential transition states arise from uneven unzipping of the repeat array with local trapping, and that the parallel pathways result from the inherent symmetry of repeat-protein structures, which allows folding to initiate from the N-terminal side, the C-terminal side, or both, depending on the local stability of the initially nucleating repeats. Moreover, the addition of terminal stabilizing repeats can shift the transition-state ensemble toward those regions, rerouting the folding of the whole repeat array (Tripp & Barrick, Reference Tripp and Barrick2008). Local energetics are thus crucial in determining the routes actually followed and the traps populated in these quasi-one-dimensional proteins.
Experimentally tracing the kinetics of repeat-protein folding further details the richness of mechanisms that can be gained by molding the traps. As mentioned, these traps may arise purely topologically or as the effects of local energetic frustration.
Energetic frustration seems to be at work in determining the traps in the folding of the ankyrin repeat protein D34. This protein unfolds via a fast process that yields an intermediate that can also be detected at equilibrium (Werbeck & Itzhaki, Reference Werbeck and Itzhaki2007). The barrier separating the native from the intermediate states appears ‘broad,’ consistent with the picture of neighboring repeats unfolding sequentially (Fig. 18). A slower unfolding phase corresponds to unravelling the C-terminal subdomain, where mutations affect both the folding rate and its urea dependence in a manner consistent with unfolding by two parallel routes. How does this change in folding regime appear? Local frustration analysis points out that there are three regions enriched in frustrated interactions in an otherwise strongly cross-linked web of minimally frustrated contacts (Fig. 18). One of the highly frustrated regions colocates with the central region where the experiments suggest the intermediate's folding boundary is found (Ferreiro & Wolynes, Reference Ferreiro and Wolynes2008). The highly frustrated interactions are apparently responsible for the roughness that gives rise to a long-lived folding intermediate. Modification of this interface so that it is minimally frustrated should destabilize the trap, making the D34 landscape appear much smoother. Experiments redesigning repeat-proteins so as to conform to minimally frustrated ‘consensus’ sequences show that indeed traps are destabilized and kinetic barriers are considerably reduced (Kramer et al. Reference Kramer, Wetzel, Plückthun, Mittl and Grütter2010; Street & Barrick, Reference Street and Barrick2009). The overall picture emerging from repeat-protein-folding studies is that the differences in folding rates are related to escape from local traps. These traps appear as manifestations of frustration, either topological, energetic or both. 
There is a fine balance between these signals that can be at play at the secondary structure and/or the tertiary contact levels. A local effect is felt globally because the near-neighbor interactions are extraordinarily important in stabilizing the repeats, and weak biases can tip the balance to complete folding. Single substitutions can affect local biases (such as helix propensity), and thus exert profound effects on the overall folding of these domains. The larger repeating arrays are more likely to tolerate ‘cracks’, and the folding at the repeat ends may become anticorrelated (Ferreiro et al. Reference Ferreiro, Sánchez and de Prat Gay2008a). Such sensitivity allows for specific encoding of traps by means of making a few sequence or environmental modifications. This may be used in intracellular targeting and sorting. Owing to the symmetry of the repeating array, long arrays can be fine tuned to populate partially folded states, and these ensembles can be coopted in functional mechanisms. Changing the stability of a single repeating element (by post-translational modifications or binding of other macromolecules) may affect the behavior at a distant site, providing a coupling mechanism that can transmit allosteric signals over long distances within a single repeating geometry.
4.5. Anomalous ϕ values
In the laboratory, ϕ-values are determined by measuring the changes in both overall stability and folding speed between a protein and single-point mutant(s) of it (Fersht et al. Reference Fersht, Matouschek and Serrano1992). The stability change is ${\rm \Delta \Delta} G_{{\rm mut - wt}}^{{\rm eq}} = {\rm \Delta} G_{{\rm mut}}^{U - F_{{\rm mut}}} - {\rm \Delta} G_{{\rm wt}}^{U - F_{{\rm wt}}} $; and the change in the folding rate usually can be ascribed to the effect of the mutation on the activation free energy, ${\rm \Delta \Delta} G_{{\rm mut - wt}}^{{\rm TSE}} = - RT\,{\rm ln}(k_{{\rm mut}} /k_{{\rm wt}} )$. This analysis assumes that the transmission coefficient, related to configurational diffusion and internal friction, remains constant upon mutation, which may not always be the case. The ϕ-value is defined as $\phi = {\rm \Delta \Delta} G_{{\rm mut - wt}}^{{\rm TSE}} /{\rm \Delta \Delta} G_{{\rm mut - wt}}^{{\rm eq}}$. For minimally frustrated proteins, realistic pictures of the rate-limiting steps in folding can be constructed by measuring the effect of mutations not just at the folding midpoint but over a range of thermodynamic conditions. In contrast to the situation for a minimally frustrated protein, on a very rugged landscape the ϕ-values become very difficult if not impossible to interpret in terms of the native folded protein structure alone. This is because the kinetics of re-organization is determined by the stability of unknown non-native contacts in the transition-state ensemble. Since these need not be present in the final native structure for highly frustrated systems, their energetic change would be unrelated to stability changes of the protein as a whole. The globally unfrustrated character of the energy landscape of natural proteins indicates that there will be relatively few thermally accessible states that can be associated with such folding ‘mistakes’ at the folding temperature ($T_{\rm f}$).
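The definition above translates directly into a short calculation. The sketch below is illustrative only: the function names, the `min_abs_ddg_eq` guard, and the example numbers are our assumptions, and the sign convention assumed is that a destabilizing mutation gives a positive stability change `ddg_eq`.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)

def phi_value(k_wt, k_mut, ddg_eq, temperature=298.0, min_abs_ddg_eq=0.0):
    """Return phi = ddG_TSE / ddG_eq from folding rates and a stability change.

    k_wt, k_mut : folding rate constants of wild type and mutant (same units).
    ddg_eq      : equilibrium stability change in kcal/mol (assumed positive
                  when the mutation destabilizes the folded state).
    """
    if abs(ddg_eq) <= min_abs_ddg_eq:
        # The ratio is ill-defined when the stability change is too small.
        raise ValueError("stability change too small for a meaningful phi")
    ddg_tse = -R * temperature * math.log(k_mut / k_wt)
    return ddg_tse / ddg_eq

def classify_phi(phi):
    """Label a phi-value as inside or outside the classical 0-1 range."""
    return "classical" if 0.0 <= phi <= 1.0 else "anomalous"

# Hypothetical mutant: folds twice as slowly and is destabilized by 1 kcal/mol.
phi = phi_value(k_wt=100.0, k_mut=50.0, ddg_eq=1.0)
print(round(phi, 2), classify_phi(phi))
```

Note that this calculation inherits the assumption stated above: the transmission coefficient is taken to be unchanged by the mutation, so the whole rate change is assigned to the activation free energy.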
When $T_{\rm f}/T_{\rm g}$ is large, the structures will have mostly native interactions formed and only a few weak non-native interactions are anticipated. Partially folded forms of the protein may become populated, but highly misfolded alternate structures are not expected to occur in general. This simplifying feature of a minimally frustrated heteropolymer’s landscape is what allows one to infer the existence of a relation between the energetic contributions a residue pair makes in the native state with the energetic contribution to the populated intermediate or transition-state ensembles in the folding routes. The thermodynamic contribution of an interaction will be simply scaled by the fraction of the time that the pair contact is actually formed in the ensemble, and the folding rates will largely follow native-state stability giving rise to ‘rate equilibrium free-energy relations’ (Leffler, Reference Leffler1953). This regularity is predicted by a funneled landscape and provides the ultimate theoretical basis of the analysis of protein-folding kinetics using ϕ-values (Matouschek et al. Reference Matouschek, Kellis, Serrano and Fersht1989). Strong deviations from this simple behavior then are likely manifestations of localized frustration in the energy landscape.
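The scaling argument above can be written out as a first-order perturbative sketch. The symbols $\delta\epsilon_{ij}$ (the change in the energy of contact $ij$ caused by mutating residue $i$) and $P_{ij}^{X}$ (the probability that contact $ij$ is formed in ensemble $X$, with $X$ being the unfolded state $U$, the transition-state ensemble, or the folded state $F$) are notational assumptions introduced here for illustration, and only native contacts are assumed to be perturbed.

```latex
\Delta\Delta G^{\rm TSE} \approx \sum_{j} \left( P_{ij}^{\rm TSE} - P_{ij}^{U} \right) \delta\epsilon_{ij},
\qquad
\Delta\Delta G^{\rm eq} \approx \sum_{j} \left( P_{ij}^{F} - P_{ij}^{U} \right) \delta\epsilon_{ij},
\qquad
\phi_i = \frac{\Delta\Delta G^{\rm TSE}}{\Delta\Delta G^{\rm eq}}
\approx \frac{P_{ij}^{\rm TSE} - P_{ij}^{U}}{P_{ij}^{F} - P_{ij}^{U}}
\quad \text{(if a single contact } ij \text{ dominates)}
```

Under these assumptions ϕ lies between 0 and 1 whenever $P_{ij}^{U} \le P_{ij}^{\rm TSE} \le P_{ij}^{F}$, that is, whenever native contacts are formed progressively on the way to the folded state. Contacts that are populated in the transition-state ensemble but absent from the native structure fall outside this sketch, which is one way to see why strong deviations point to frustration.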
Many natural proteins with simple kinetic behavior have been subjected to the ϕ-value determination and analysis. Almost 1000 mutations have been independently evaluated and the general behavior is in accordance with the expectations outlined above (Naganathan & Muñoz, Reference Naganathan and Muñoz2010). Most of the ϕ-values indeed lie in the range 0<ϕ<1, i.e. they are consistent with an ensemble averaged fractional occupancy of local native structure (reviewed elsewhere (Onuchic et al. Reference Onuchic, Luthey-Schulten and Wolynes1997)). In general, the ϕ-value increases linearly with the stability perturbation and, in conditions of isostability, where the unfolded and folded forms are equally populated, most of the ϕ-values cluster about ϕ = 0·36 ± 0·11 (Naganathan et al. Reference Naganathan and Muñoz2010). This average value is anticipated from both mean field and capillarity theories on perfectly funneled landscapes if the activation free-energy barrier originates from an imperfect cancellation of the entropy loss by the energy gain upon folding, as we have discussed previously. ϕ-values are not completely uniform, however. Site-specific variations of ϕ-values can be roughly predicted by knowing the topology of native structure under the assumption of a minimally frustrated energy landscape (Clementi et al. Reference Clementi, Nymeyer and Onuchic2000; Muñoz & Eaton, Reference Muñoz and Eaton1999). These predictions are pretty good when the conformations belonging to the transition state ensemble (TSE) have a large connected web of native contacts established. Sometimes, especially for α-helical proteins, inhomogeneities of the energies do make the vanilla predictions less accurate (Cho et al. Reference Cho, Levy and Wolynes2009). 
If the interactions around the probed site are sparse, mutations can change the location of the TSE dramatically and the perturbations can no longer be treated as linear free-energy perturbations (Sánchez & Kiefhaber, Reference Sánchez and Kiefhaber2003b). Very large deviations from structure-based predictions or ϕ-values outside the expected 0–1 range are signatures that energetic frustration could be at play. Sometimes, the deviations can be directly attributed to highly frustrated interactions that one can identify in a native structure. Figure 19 shows the local frustration pattern of four example proteins where specific mutations perturb the folding rate more than would be expected by the native stability change (ϕ > 1, Fig. 19a) or where the mutations have opposite effects on the TSE and the native state (ϕ < 0 Fig. 19b). In both CI2 and FKBP12, the sites indicated by these anomalous ϕ-values map to regions that are enriched in highly frustrated interactions in the native state (Fig. 19). However, in the cases of the fynSH3 and srcSH3 domains, the mutations for which anomalous ϕ-values were measured do not coincide with frustrated interactions for tertiary contacts (Fig. 19c, d). For these sites the anomaly must be attributed to other sources of frustration. One major component of the stabilization energy that is not visible in the frustratograms of Fig. 19 is the secondary structure energy term. Strongly stabilizing tertiary interactions may locally conflict with the secondary structure propensity of a backbone that will otherwise occupy other conformations if unrestrained. Conflict between secondary structure formation and tertiary interactions has been commonly observed in the SH3-fold class (Klimov & Thirumalai, Reference Klimov and Thirumalai2002). Note that in these cases the regions of anomalous ϕ-values are enriched in minimally frustrated tertiary interactions, perhaps compensating for the secondary/tertiary conflict (Fig. 19c, d).
4.6. Non-native intermediates
Immunity Protein 7 (Im7) is an 86-residue-long protein that in vitro folds through an on-pathway intermediate state. Kinetic studies suggested that this intermediate contains some non-native interactions of three of the four helices around a specific hydrophobic core (Capaldi et al. Reference Capaldi, Kleanthous and Radford2002). This exception to the behavior expected for minimally frustrated heteropolymers hints that a functional constraint may be conflicting with rapid folding. Does frustration change the way the Im7 intermediate is formed? Sutto et al. explored this question using simulations that include native bias between residues close in sequence but that allowed for non-native (and possibly frustrated) interactions between residues farther apart (Sutto et al. Reference Sutto, Lätzer, Hegler, Ferreiro and Wolynes2007). These simulations predict an intermediate state stabilized by favorable non-native interactions formed by residues that, in the native form, cluster on the surface to form a locally frustrated region (Fig. 20). Further mapping of the structural transitions using protein engineering combined with kinetic folding experiments and restrained molecular dynamics characterized more completely the details of the basins and bottlenecks, and confirmed that indeed the energy landscape of Im7 is unusually rugged (Friel et al. Reference Friel, Smith, Vendruscolo, Gsponer and Radford2009). Rather than folding by progressively increasing the number of native contacts, the sequence of Im7 contains other signals that constrain efficient folding. Many of the residues identified as forming non-native contacts during the early stages of folding of Im7 lie in regions that have a functional role: they are involved in the recognition and inactivation of colicin toxins forming the interaction interface with them. By coupling protease susceptibility and antibiotic resistance traits, Foit et al. developed a novel assay to screen for protein stability in vivo (Foit et al. Reference Foit, Morgan, Kern, Steimer, von Hacht, Titchmarsh, Warriner, Radford and Bardwell2009). When random mutagenesis and selection for antibiotic resistance were conducted on the Im7 protein, most of the mutations that were found to enhance thermodynamic stability map again to the predicted binding surface as would be expected from the local frustration analysis (Fig. 20). It appears that the functional constraint for specific binding restricts the sequences and gives rise to energetic conflicts with the rest of the polymer. Smooth folding of this small domain is compromised by frustration and in the case of Im7, traps with non-native structure give rise to an intermediate, which is observed both in experiment and simulations (Figueiredo et al. Reference Figueiredo, Whittaker, Knowling, Radford and Moore2013; Whittaker et al. Reference Whittaker, Clayden and Moore2011). The observation of clusters of frustrated interactions in the native state points the way to a general mutational strategy to reduce the ruggedness of folding landscapes. The most effective approach to destabilize the intermediate of Im7 should be to stabilize the native state by alleviating the energetic conflicts in this region (Sutto et al. Reference Sutto, Lätzer, Hegler, Ferreiro and Wolynes2007). When redesigned mutants were simulated, the intermediate was no longer populated once residual native-state frustration was reduced. The redesign strategy based on minimizing native-state frustration is more effective than the specific negative design of the intermediate. A mutant specifically redesigned to destabilize the wild-type intermediate did not fold without an intermediate but rather retained an intermediate having a different structural ensemble, but nevertheless involving some non-native contacts (Sutto et al. Reference Sutto, Lätzer, Hegler, Ferreiro and Wolynes2007).
4.7. Friction and roughness effects on rates
The Im7 example shows that one of the effects of frustration is to introduce kinetic traps or non-native intermediates into the folding mechanism, a possibility first pointed out by Bryngelson & Wolynes (Reference Bryngelson and Wolynes1987). If the system is small, the number of trapping states will be small and can be described as arising from the specific intermediates. The structural uniqueness of these intermediates is however in each case an open issue. As we saw in the Im7 example, destabilizing one particular misfolded structure generally just gives rise to an alternate trap if frustration still remains in the native structure. What is the effect of these ensemble traps on the folding kinetics? In general, searching through the traps slows folding down. Bryngelson & Wolynes (Reference Bryngelson and Wolynes1987) showed how this slowing effect, especially where there are many traps, could be thought of as a source of ‘friction’ on the diffusive progress of the protein as it is organized by the remaining guiding forces of the funnel. A very nice experimental system that seems to illustrate this idea is the extensive study of the folding of spectrin domains by Clarke and co-workers (Wensley et al. Reference Wensley, Batey, Bone, Chan, Tumelty, Steward, Kwa, Borgia and Clarke2010). The experiments of Clarke and co-workers have shown that structurally highly homologous members of the spectrin domain family fold with very different rates. The frustratograms of the three natural systems that have been studied are shown in Fig. 21. R16 and R17, which have a central cluster of frustrated residues, fold three orders of magnitude slower than does R15, which is more uniformly minimally frustrated. This central cluster of frustrated residues has been shown to be involved in the main step crossing from an initial transition state for folding to a later exit transition state for folding (Borgia et al. Reference Borgia, Wensley, Soranno, Nettels, Borgia, Hoffmann, Pfeil, Lipman, Clarke and Schuler2012; Wensley et al. Reference Wensley, Kwa, Shammas, Rogers, Browning, Yang and Clarke2012). As in other quasi-one-dimensional proteins such as the repeat-proteins, this extended transition region entails what may be described either as a broad malleable transition-state region or a high-energy intermediate (Sánchez & Kiefhaber, Reference Sánchez and Kiefhaber2003a).
One of the effects of having localized frustration in the part of the protein that is being studied at this key kinetic bottleneck is thermodynamic – the transition-state region is destabilized by the frustrated contacts and, by itself, this effect diminishes the folding rate. This thermodynamic effect might be said, with more precision, to be an effect of energetic heterogeneity rather than frustration per se since no alternate structure is involved. Yet a residual search or friction effect contributes to the slowing as well. This frictional effect of frustration is revealed by the viscosity dependence of the folding reactions, which can be studied by varying the solvent, while keeping the stability unchanged. For the unfrustrated R15, the rate scales inversely to the solvent viscosity while for the frustrated systems R16 and R17 the folding rates are independent of solvent viscosity (Wensley et al. Reference Wensley, Batey, Bone, Chan, Tumelty, Steward, Kwa, Borgia and Clarke2010).
Rather general arguments suggest that solvent viscosity effects depend on how much chain motion occurs as the transition state is traversed (Frauenfelder & Wolynes, Reference Frauenfelder and Wolynes1985; Portman et al. Reference Portman, Takada and Wolynes2001b). On the smooth R15 landscape, large amounts of chain motion occur but when frustrated traps are present, as in R16 and R17, many individual smaller steps must be made searching through the misfolds. These smaller reconfiguration events displace little solvent. The idea that several misfolded configurations are involved in the frustrated system is supported by atomistic simulations of the R16 system that show explicitly helix sliding events coming from mismatches in the registry of the helices in this frustrated region (Best, Reference Best2012).
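The contrast in viscosity dependence described above can be quantified by fitting a power law to rates measured at several solvent viscosities. The sketch below assumes the empirical form k ∝ η^(−α), so that α ≈ 1 corresponds to a rate that scales inversely with solvent viscosity and α ≈ 0 to a rate that is independent of it. The function name, the exponent α, and the data are hypothetical: the numbers are synthetic and are not the measurements of Wensley et al.

```python
import math

def viscosity_exponent(viscosities, rates):
    """Least-squares estimate of alpha in k ~ eta**(-alpha) from a log-log fit."""
    xs = [math.log(eta) for eta in viscosities]
    ys = [math.log(k) for k in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # minus the slope of ln k versus ln eta

# Synthetic data (relative viscosity, arbitrary rate units), for illustration only.
etas = [1.0, 2.0, 4.0]
smooth_like = [10.0 / eta for eta in etas]   # rate inversely proportional to eta
rough_like = [0.01 for _ in etas]            # rate independent of eta

print(viscosity_exponent(etas, smooth_like))
print(viscosity_exponent(etas, rough_like))
```

As noted in the text, such a comparison is only meaningful if the stability is kept unchanged while the solvent is varied; otherwise the fitted exponent mixes frictional and thermodynamic effects.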
5. Functional consequences of frustration
5.1. Binding sites are often frustrated
Landscape theory approaches have already proven useful for describing protein–protein interactions (Zheng et al. Reference Zheng, Schafer, Davtyan, Papoian and Wolynes2012). From the point of view of the localized frustration index, a natural question is whether and how often the residues that can conflict with folding of a single domain are there specifically to allow protein–protein association. Ferreiro et al. analyzed the distribution of the frustration index in protein assemblies on a curated set of 85 non-redundant hetero-dimeric protein complexes (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2007). Figure 22 shows the frustration indices of the contacts and residues in some examples from the database. When the calculations of local frustration are performed in the unbound monomers, indeed one sees that near the binding sites the proteins are enriched in highly frustrated interactions (Fig. 22). The statistics over the whole database showed that the highly frustrated contacts both cluster and tend to be closer to the binding residues than they are to other surface residues. Minimally frustrated interactions are also particularly excluded from the vicinity of binding residues and neutral interactions follow the native topological distribution (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2007). Interestingly, when the non-binding surface is examined, some clusters of highly frustrated interactions can still be detected (Fig. 22). Whereas many of the highly frustrated interactions are close to the binding residues as defined in the co-crystal structures, these additional frustrated patches on the protein surface suggest there may also be other functional reasons to retain conflicting residues. These reasons may be binding to other unknown molecular partners or, perhaps, these frustrated regions are sites relevant for allostery (vide infra). 
A corollary to this observation is that if the binding surfaces are too big with respect to the area of the domain, they are expected to enrich the total amount of frustrating interactions, compromising the folding of the unbound domain. It is interesting to note that coupling between folding and binding is more prevalent when the interaction surface is large (Papoian & Wolynes, Reference Papoian and Wolynes2003b).
Changes in frustration upon binding may help guide specific association. A survey of the change in local frustration upon binding was performed by calculating the frustration indices for proteins in the complexes and comparing these to the frustration indices of the unbound monomers making up those complexes (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2007). When calculated over a non-redundant set of hetero-oligomer complexes, the newly formed interactions at the binding interface have on average a local frustration distribution similar to that for contacts in single domains. This implies that the specific association can as well be accounted for by the principle of minimal frustration, as suggested earlier (Papoian & Wolynes, Reference Papoian and Wolynes2003b). Most of the contacts strictly within the monomers do not change their frustration index upon binding, but approximately 25% of them do become less frustrated upon association. A minor fraction, about 7%, changes in the opposite direction, becoming more frustrated. The decrement in frustration index is nearly constant and mostly comes from the change in the solvent accessibility upon binding, captured by the burial term and the contact type in the AMW energy function. In summary, the frustration index of the regions close to the binding site changes upon association, usually becoming less frustrated in the complex, and the newly formed interfaces have a similar distribution of local frustration to the interiors of unbound domains.
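A survey of this kind can be sketched in a few lines. The code below makes several assumptions that are not stated in this section: the frustration index is taken to be a Z-score of the native contact energy against a set of decoy energies, the cutoffs of 0.78 (minimally frustrated) and −1 (highly frustrated) are the conventional values as we recall them and are exposed as adjustable parameters, and the `tolerance` used to call an index unchanged is purely hypothetical. All function names and numbers are illustrative.

```python
import statistics

def frustration_index(native_energy, decoy_energies):
    """Z-score of the native energy against decoys; larger means less frustrated."""
    mean = statistics.fmean(decoy_energies)
    spread = statistics.pstdev(decoy_energies)
    return (mean - native_energy) / spread

def classify(index, minimal_cutoff=0.78, high_cutoff=-1.0):
    """Assign a contact to one of three frustration classes (cutoffs are assumptions)."""
    if index > minimal_cutoff:
        return "minimal"
    if index < high_cutoff:
        return "high"
    return "neutral"

def change_fractions(unbound, bound, tolerance=0.1):
    """Fractions of shared contacts that become less or more frustrated on binding."""
    counts = {"less_frustrated": 0, "more_frustrated": 0, "unchanged": 0}
    for before, after in zip(unbound, bound):
        delta = after - before
        if delta > tolerance:
            counts["less_frustrated"] += 1   # index increased
        elif delta < -tolerance:
            counts["more_frustrated"] += 1   # index decreased
        else:
            counts["unchanged"] += 1
    total = len(unbound)
    return {name: count / total for name, count in counts.items()}

# Made-up indices for four intramonomer contacts, before and after binding.
print(change_fractions([0.0, -1.5, 0.5, 0.2], [0.05, -0.5, 0.5, -0.3]))
```

Applied to real complexes, the same bookkeeping would be run over the contacts common to the unbound monomer and the complex, with the newly formed interface contacts classified separately.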
5.2. Frustration in allostery and local conformational changes
If the minimal frustration principle were to be satisfied everywhere, and the temperature sufficiently low, the protein molecule would become a beautiful sculpture that would tend to move as a rigid body. For real proteins this rigor mortis cannot be the whole story, as the basic functions of proteins usually involve responding to changes in their environment. The detection and logical control capabilities of proteins rely on their ability to amplify the effect of a relatively small environmental change so as to produce a large discrete transition in a different chemical realm. Monod emphasized how the apparent gratuity of such ‘transductions’ provides the essential computational capabilities of protein molecules (Monod, Reference Monod1973). We can thus expect that the minimal frustration principle will be at least locally violated by natural functional proteins. Local violations open possibilities for more complex energy landscapes, such as those needed for allostery and large-scale conformational changes.
A statistical survey of the local frustration patterns of allosteric protein domains indeed does show that the regions that reconfigure are often enriched in patches of highly frustrated interactions (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2011). In contrast, the more rigid parts of the proteins, which are locally structurally superimposable in both forms, are connected by a dense web of minimally frustrated interactions (Fig. 23). Regions that are highly frustrated typically are where the action is and are seen to reconfigure locally between the two different forms. In some cases, frustrated regions display rather extensive reconfigurations of compact regions, whereas in others, the frustrated clusters localize around apparent pivot points joining the more rigid elements. In general, clusters of high local frustration colocate with residues whose local environment shifts between the two structures (see Fig. 23). It is apparent that the major structural ‘core’ is conserved between the large-scale taxonomic conformational substates, and differences between them tend to localize on the protein surface. The changing frustrated regions often include the binding site of an allosteric effector. The conformational changes preserve about 80% of the pairwise interactions, while 10% of the interactions are found exclusively in one conformational substate or the other. The distribution of local frustration for contacts that are common to both structures is similar to the distribution for contacts exclusively found in only one substate. The interactions that are found in both substates do not dramatically change, but instead new contacts are formed in one substate that do not exist in the other. Thus the conformational change is brought about by a balanced exchange of local frustration upon switching substates.
In surveying several allosteric domains, there are examples in which only one structure of a pair has a relevant frustrated cluster but where the frustration is relieved in the other conformation (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2011). In other cases, both structures display highly frustrated interactions in the same region, and these residues interchange the interactions among each other. There are no examples of structure pairs of monomeric proteins where the locally displaced regions are completely minimally frustrated in both configurations.
The term allostery was in fact first used only to describe the regulation of protein activity through changes in quaternary structure of multisubunit complexes (Changeux, Reference Changeux2013). Landscape theory underscores how preserving symmetry is an easy path for achieving low-energy near-degenerate conformations, an idea taken as axiomatic by Monod, Wyman, and Changeux. Using the same counting argument that makes symmetric structures individually more easily designable, pairs of symmetric structures provide a statistically favored way of achieving a switch-like behavior. The local frustration distribution in the multimeric forms of the allosteric proteins shows that, on average, interfacial (quaternary) interactions are less frustrated than the monomeric (tertiary) interactions (Ferreiro et al. Reference Ferreiro, Hegler, Komives and Wolynes2011). The interfaces in multimers thus appear to be depleted of highly frustrated interactions, suggesting that the symmetry is reflected in the local frustration patterns when alternate configurations are used as decoys. The primal example of allosterism, hemoglobin, is shown in Fig. 24. Here the interface between the subunits is not frustrated in either form; rather a rigid-body rotation allows symmetrically related interactions to occur. These distant structural forms involve a different knob-in-hole packing of the helices of one monomer on the others, and both packings appear to be minimally frustrated (Fig. 24). Multiple funnels to structurally distinct low-free-energy states can be achieved with nearly rigid subunits packing in a number of symmetrically equivalent ways. In this sense, the natural ‘engineering’ of frustration may not be essential for allostery. The classic Wyman–Monod view of symmetry of multimeric assemblies leading to near degeneracy does not depend on frustration at all. Because this usually involves only a few $k_{\rm B}T$ of free energy, weakly assembled complexes may accidentally exhibit allostery without any strong frustration, as recently argued by Kuriyan & Eisenberg (Reference Kuriyan and Eisenberg2007).
The precise mechanism by which localized frustration allows allostery clearly deserves further investigation. Local frustration may simply allow a small discrete set of configurations interconverting by local motions. The allosteric dynamical transition may then involve specifically defined paths of transformation much like a small molecule or a macroscopic machine with hinges. Alternatively, the local frustration may destabilize a part of the protein in favor of an ensemble of rather high entropy. In other words, a local region can locally unfold or ‘crack’. This possibility has been entertained previously (Miyashita et al. Reference Miyashita, Onuchic and Wolynes2003) and elegantly explains the observations of the way denaturants can catalyze conformational changes (Zhang et al. Reference Zhang, Sheng, Pan and Zhou1997). Resolving experimentally the issue of hinges versus cracks requires an analysis of the effect that local mutations exert on the conformational kinetics (Whitford et al. Reference Whitford, Onuchic and Wolynes2008).
Adenylate kinase is a celebrated example of large-scale conformational change related to a functional transition. The gross opening and closing of this protein requires at least two reaction coordinates to be taken into account. Steered molecular dynamics calculations have identified several locations where ‘hinge’ motions are believed to occur (Henzler-Wildman et al. Reference Henzler-Wildman, Thai, Lei, Ott, Wolf-Watz, Fenn, Pozharski, Wilson, Petsko, Karplus, Hübner and Kern2007). The frustratograms of both substates are shown in Fig. 24. Nearly rigid-body motions of the lid and the core are involved in the closure. An extensive minimally frustrated network of contacts rigidifies the molecule in the closed form, and the places identified as hinges indeed co-locate with highly frustrated regions. Miyashita et al. have examined the motion of adenylate kinase along the low-frequency normal modes contributing to closure and concluded that there were regions where high stress accumulates at the transition state for conformational change (Miyashita et al. Reference Miyashita, Onuchic and Wolynes2003). This high-stress region can, however, crack or locally unfold, releasing the strain and catalyzing the motion. This high-strain region is highly frustrated in both forms. The presence of these interactions that conflict with folding the enzyme may be a general theme in the design of effective catalysts.
Protein phosphorylation is a paramount example of allosteric control of enzymatic activity. The activation of protein kinases usually occurs in a sharp switch-like fashion upon phosphorylation, making the ‘kinome’ a rich source of computational cellular complexity where recurring networks can be constructed. Particular attention has recently come to some members of the multitude of kinases, because mutations of kinases have been found to be associated with pathological cell growth. In some cancers the activating oncogene quite often involves substitutions in the ABL and EGFR kinases. Oncogenic mutations perturb the minimally frustrated core of these proteins, while the locally frustrated clusters appear related to highly dynamic regions that are involved in allosteric activation (Dixit & Verkhivker, Reference Dixit and Verkhivker2011). In particular, the cancer-related mutations were found both to amplify the dynamics of the inactive state and to relieve local frustration in the active state (Dixit & Verkhivker, Reference Dixit and Verkhivker2011). Thus, the redistribution of local frustration may facilitate conformational transitions and at the same time change the relative stability of the substates.
5.3. Frustration in metastable and multistable proteins
The universe of conformational substates of any protein molecule is huge. Therefore the complete energy landscape of proteins is complex, yet many experiments at physiological conditions can be interpreted using a simpler free-energy spectrum of states, which can be considered as ensembles of low-temperature substates (Hegler et al. Reference Hegler, Weinkam and Wolynes2008). The low-free-energy part of the spectrum of excitations corresponds with partially folded ensembles that get naturally populated as a consequence of the hierarchical nature of the mostly funneled landscape. Often the properties of these excited states can be predicted by native topology-based models (see above). As we have seen, frustration, the conflict between inconsistent stabilizing interactions evolved for other purposes than optimizing folding, offers another mechanism for forming low-free-energy excitations. If these states have large structural differences, then the interactions that stabilize one form are expected to destabilize the other, making natural proteins metastable in certain conditions. Given that minimal frustration corresponds to folding stability, does high local frustration correspond to metastability?
Hemagglutinin (HA) is a fascinating example of metastability. The process of viral infection starts with the interaction of multiple HA molecules on the infecting virion with multiple sialic acids on the cell to be infected (Luo, Reference Luo2012). This interaction triggers endocytosis, and the virion is transported to the endosome where the pH is 1·5–2 units lower than outside the cell. At this lower pH, the HA molecule undergoes an irreversible and dramatic conformational change that was first mapped out by Wiley and co-workers in the early 1990s (Bullough et al. Reference Bullough, Hughson, Skehel and Wiley1994). The molecule rearranges so that parts of it move by as much as 100 Å from one structure to the other. The local frustration patterns of the HA molecule crystallized at pH 7 show several regions of high frustration (Fig. 25). The frustration of these regions changes upon rearrangement and is diminished in the structure crystallized at pH 5·5. Thus, the regions that are involved in the conformational change are already destabilized by local interactions, and are more likely to ‘crack’, as in the cases of allostery described above. Another interesting feature of these frustratograms is the high local frustration of the ‘fusion peptide’ in the pH 7 structure (Fig. 25). This peptide is poised to interact with the membrane of the endosome upon triggering, and to promote the fusion of the virion membrane with that of the infected cell's endosome. It is evident that when the peptide is occluded in the pre-transition structure, it conflicts with the core of the HA trimer. Presumably this structure can be held in place by the minimally frustrated network of interactions that the other residues of the core contribute.
Serpins are another large group of metastable proteins that undergo a dramatic conformational change when cleaved by target proteases (Huntington et al. Reference Huntington, Read and Carrell2000). The protease is itself strongly inhibited by the cleaved serpin. The cleavage occurs on an exposed amino acid sequence in the serpin molecule (called the reactive-center loop, RCL), and the resulting cleaved loop becomes an additional β-strand in the central β-sheet (colored orange in Fig. 26). The uncleaved serpin is thought to be less stable than the cleaved form (Im et al. Reference Im, Ahn and Yu2000). Part of the energy difference comes from the cleavage of a peptide bond in the RCL, but other energies are also implicated. The ‘strain’ of the metastable form of the serpin is evident in the frustratograms. The RCL itself is enriched in highly frustrated interactions. High mutational frustration is observed both before and after cleavage in the region of the structure where the conformational change occurs, again highlighting the role of local destabilization in promoting conformational change. The structural ‘openness’ required to accommodate a newly inserted β-strand apparently requires sub-optimal interactions among residues in that region of the protein. There is also high local frustration in the α-helix that stacks over the face of the central β-sheet (Fig. 26). These regions have unusually rapid hydrogen/deuterium exchange dynamics (Tsutsui et al. Reference Tsutsui, Kuri, Sengupta and Wintrode2008) and may remain metastable even in the cleaved form. It has also been suggested that the frustrated helix may be easy to detach from the face of the β-sheet so as to initiate serpin aggregation.
Certain serpins (for example, plasminogen activator inhibitor-1 and antithrombin III (AT-III)) can spontaneously undergo a conformational rearrangement without RCL cleavage to form an inactive latent protein. Sometimes, ‘beads-on-a-string-type’ structures have been observed in electron micrographs, suggesting that this latent form is an ordered oligomer. In addition, mutations in some serpins promote the formation of highly stable, inactive polymers in the endoplasmic reticulum of liver cells. It was originally thought that the mutation might subvert the conformational flexibility required for the inhibitory activity of serpins, and that polymerization occurred through the insertion of the RCL of one serpin molecule into the β-sheet of another, and so on. However, Yamasaki et al. recently solved the structure of an antithrombin domain-swapped dimer, the structure of which revealed how any serpin adopting a similar configuration could form long-chain polymers. The domain swap involves a third of the β-sheet, which forms an equivalent region in another molecule (Fig. 26). In Yamasaki and colleagues’ dimer structure, the two molecules simply exchange part of the β-sheet with one another (a classic domain swap). The strand that swaps is the same strand that inserts during serpin maturation after cleavage (colored orange in Fig. 26). Examination of the frustration in the AT-III monomer both before and after thrombin cleavage reveals that as was seen in the α1-antitrypsin, the base of the β-strand that swaps is frustrated (Fig. 26). In addition, the helix that is thought to have to dislodge from the face of the β-sheet in order to allow strand insertion and/or swapping is highly frustrated. Comparison of the ATIII monomer before and after cleavage reveals that the observed frustration is present both before and after strand insertion hinting that this part of the protein can readily unfold even after strand insertion. 
A few highly frustrated contacts are observed in the center of the swapped strand, but it is not obvious why the strand swap is favored. While the ubiquity of domain swapping is a consequence of minimal frustration (Yang et al. Reference Yang, Cho, Levy, Cheung, Levine, Wolynes and Onuchic2004), a frustrated region of a protein can nevertheless unfold more readily, encouraging partial domain swapping if the protein contains frustrated elements. To form a polymer capable of incorporating more than two monomers, extensive domain swapping among multiple serpin monomers is required, but it is readily apparent how this can occur from the structure (Yamasaki et al. Reference Yamasaki, Li, Johnson and Huntington2008). The fact that the overlaying helix remains frustrated in all of the serpin forms suggests that domain swapping may occur even after serpin activation.
5.4. Frustration, dynamics, and catalysis
We have reviewed here how natural sequences are shaped via constraints imposed by the interaction of the protein with its immediate environment. These constraints create conflicts in the folding of a polypeptide that can compromise its foldability, render it multistable or, in the extreme case, make it appear unfoldable, ‘intrinsically disordered’. On the other hand, these frustrating interactions facilitate excursions out of the thermodynamic minima so that the polypeptide can explore low-lying excited states that can become fundamental for function. For enzymes, an obvious requirement for activity is, in addition to foldability, catalytic power, which in the cellular context must be jointly met with the requirement for regulation. Usually the regulation comes about through an allosteric site, allowing otherwise disconnected activities to become coupled, gratuitously, in functional space (Monod, Reference Monod1973).
Catalysis requires detailed chemistry. Reaction rates strongly depend on the precise arrangement of reacting groups at sub-Å resolution. Because of this, catalytic sites are generally stringently conserved, bringing together in space residues that would otherwise adopt different interactions. Thus, in principle, we would expect natural enzymes to have locally frustrated catalytic sites. Some examples of frustration patterns of enzymes are shown in Fig. 27. In most cases the catalytic residues are involved in highly frustrated interactions either between them or with the rest of the molecule. Additionally, however, these sites are often surrounded by a dense minimally frustrated network of interactions that could, in principle, be related to the requirement that second-shell interactions be favorable enough to compensate for the frustrating catalytic groups and keep them in place. It is apparent that large enzymes also contain other surface patches of highly frustrated interactions that mark other functional sites, such as the binding site of an allosteric effector (Fig. 27). These general aspects of frustration in enzymes appear to hold irrespective of enzyme class or catalytic mechanism. This observation is consistent with the fact that mutations at catalytic sites often make the protein fold more stable (Meiering et al. Reference Meiering, Serrano and Fersht1992; Shoichet et al. Reference Shoichet, Baase, Kuroki and Matthews1995). Conversely, mutations that decrease the stability of folds are often associated with increased catalytic power, usually referred to as stability-function trade-offs (Tokuriki et al. Reference Tokuriki, Stricher, Serrano and Tawfik2008). Many studies have noted that the evolution of new enzymatic activities is accompanied by a loss of protein stability. Also, mutations that modulate activity are as destabilizing as any other random mutation (Tokuriki & Tawfik, Reference Tokuriki and Tawfik2009).
On the other hand, substitutions in the substrate-binding pockets compromise stability to a larger extent than surface mutations that underlie neutral, non-adaptive evolutionary changes. It is suspected that new enzymatic activities arise with the help of compensatory stabilization from otherwise silent mutations (Tokuriki et al. Reference Tokuriki, Jackson, Afriat-Jurnou, Wyganowski, Tang and Tawfik2012). These compensatory changes can be located in regions of the protein that were neutral to the stability of the original enzymes. The core of the protein would therefore act as a buffer for the accumulation of mutations in other regions. Enzymes are polypeptidic foldable systems for which many functional requirements simultaneously need to be met: bind substrate/s, catalyze, release product/s, bind effectors, switch on/off, and evolve. Each of these activities may require compromise with robust folding, resulting in sites of higher local frustration. In the cases of the enzymes shown in Fig. 27, it is still not possible to deconvolute the contribution of each of these effects to the observed frustration patterns. Many of the activities in the catalytic mechanism overlap: the site where the substrate is bound is necessarily close to the catalytic residues. Thus we cannot distinguish whether the local frustration comes about just for thermodynamic binding, for kinetic catalysis or for both. Experimentally these differing causes could be distinguished by mapping perturbations that differentially affect catalysis and substrate recognition. Perhaps these effects can be computationally distinguished by analyzing the changes in energy distributions upon binding, transition-state formation and conversion. Because of the sensitivity of chemistry to structural details, such a study requires fine-grained energy functions along with extensive sampling of conformational substates.
A common denominator of any catalytic mechanism is the requirement for motion. The details of how enzyme dynamics affect the activity have aroused much controversy (Karplus & McCammon, Reference Karplus and McCammon1983; Olsson et al. Reference Olsson, Parson and Warshel2006). The regulation of enzymatic activity is perhaps less controversial than the catalytic acts themselves. An interesting example is thrombin. Research on the coagulation protease thrombin has suggested that its activity is allosterically regulated by changes in its dynamics. The core serine protease domain structure typified by trypsin and chymotrypsin is highly conserved in thrombin. It consists of two interacting β-barrel structures. Compared to trypsin and chymotrypsin, however, thrombin contains insertions at several of the loops connecting the β-strands. These loop insertions create a bipartite substrate (fibrinogen) binding site where fibrinogen is bound at the catalytic site as well as at a second site on the backside of the protease domain termed anion-binding exosite-1. This exosite is also the binding site of the allosteric regulator protein, thrombomodulin, and thrombomodulin binding alters the catalytic activity of thrombin toward a different substrate, protein C. Binding studies have demonstrated thermodynamic linkage between these two sites, but no differences are seen in the various bound structures determined by X-ray crystallography. Evidently the large change in enzymatic activity comes about from changes in the dynamics of the enzyme, which arise from a rather malleable ensemble of conformations, a situation that has been termed ‘entropic allostery’. Recently, a combination of NMR and MD studies on thrombin has revealed the large extent of changes in motions on binding, particularly in the active site loops, spanning time scales from picoseconds to milliseconds (Fuglestad et al. Reference Fuglestad, Gasper, McCammon, Markwick and Komives2013).
To capture the longer time scale motions, accelerated molecular dynamics protocols were implemented, from which the average residual dipolar coupling constants for the backbone NH groups of the Boltzmann-reweighted ensembles could be back calculated and compared with NMR observables. By ascertaining which acceleration level produced the ensemble that best matched the NMR data, it was possible to explore excited states that were otherwise inaccessible but in fact occur in solution. Accelerated-MD of thrombin has identified that the changes upon active site occupation occur in groups of residues linked through networks of correlated motions and physical contacts (Fuglestad et al. Reference Fuglestad, Gasper, McCammon, Markwick and Komives2013). Upon active site ligation, loop motions are quenched, but new ones appear connecting the active site with distal sites where allosteric regulators bind. The analysis of local frustration revealed that all of these loops are highly frustrated in the mean native structure of thrombin (Fig. 28).
We can use the computed structural models of the excited states of thrombin to look for the changes in frustration along the ensembles. Both the inhibitor-bound and apo thrombin have regions of high frustration in the surface loops (Fig. 28). The regions of high frustration map to dynamic regions, as seen with NMR dynamic measurements obtained at different time scales. On the one hand, the order parameters derived from conventional MD simulations agree with order parameters derived from NMR relaxation experiments probing motions in the nanosecond time regime. On the other hand, order parameters derived from the NMR-calibrated accelerated MD identify disorder resulting from motions on longer time scales, typically microseconds to milliseconds. The fast dynamics mapped in this system do not correspond well to the regions of high local frustration, while the correspondence with the dynamics on longer time scales is remarkable (Fig. 28). These results highlight the notion that regions of high local frustration are evolved features of the proteins that allow control of the motions on the slower time scales, which can occur with relative energetic ease. Energetically unfavorable interactions within the surface loops facilitate large amplitude motions, smoothing conversion between the ensembles of states needed for catalysis.
5.5. Frustration and the initiation of aggregation
Most of our biochemical illustrations of the frustration concept have focused on frustration within a single biomolecule or functional association, but what about frustration between biomolecules in the proteome which are not supposed to interact? Biomolecules are never alone during their life in the cell. Proteins are not even always born singly, but in prokaryotes especially are born in a litter – they are synthesized in ‘bursts’ when several ribosomes gather to translate a single messenger RNA. During this burst the local protein concentration can be quite high. After being born and becoming first folded, they will occasionally unfold and have to re-fold, again in the presence of neighbors, some of whom might be unfolded themselves. How do these unfolded proteins then avoid binding to the wrong partners? The principle of minimal frustration helps explain why even within the cell re-folding can be robust. (By the way it is obviously not always the case under all thermodynamic conditions – we can scramble eggs and do not expect them to unscramble when they cool!) The lack of frustration in finding partners is just what would be expected if achieving minimal frustration is a result of evolution. While the native contacts within a protein are quite stable, random contacts between two different proteins will not be as stable, so it will be rare for proteins with distinct sequences to be able to interact strongly with each other. Transient aggregates can unfold and eventually settle in their proper natively folded states. Clearly the dangerous partners that a protein must avoid are other copies of itself: native-like interactions between distinct partners can lead to association. This is called ‘domain swapping’ and if it goes on sequentially indeed an aggregate can form, as we have seen in the Serpin example (Fig. 26).
Domain swapping indeed does occur and seems to play a role in the initiation of some aggregation processes. The possibility of misfolding via domain swapping is a symmetry effect that cannot be easily eliminated by evolution. Simply symmetrizing the interactions of a structure-based pure funnel model for a monomer allows one to predict possible domain-swapped dimers (Yang et al. Reference Yang, Cho, Levy, Cheung, Levine, Wolynes and Onuchic2004). Different dimers differ in stability primarily through topological effects that determine the fraction of contacts that are prevented from being made in the more crowded dimer situation. Transient domain swapping seems to be a universal phenomenon when a protein is present at high concentrations, but transient domain swapping resolves itself, suggesting the monomeric native form still has at least an entropic advantage at the concentrations found in most cells. A problem can occur if the local concentration is high. One of the worst cases of high concentration is in multidomain tandem repeat proteins, where the covalent connection means two identical domains can never get far away from each other. Evolution does seem to take notice. Han et al. have shown by surveying multidomain sequences in several genomes that in such multidomain repeat proteins the nearest-neighboring domains, even if structurally similar, are very distinct in sequence, while domains further apart in sequence can often still have fairly high sequence identity (Han et al. Reference Han, Batey, Nickson, Teichmann and Clarke2007).
Domain swapping is likely to be a major reason aggregates formed in disease states usually involve a single protein. Zheng et al. computationally studied the misfolding of dimers of domains in titin (Zheng et al. Reference Zheng, Schafer and Wolynes2013). This system had been experimentally studied in the laboratory by Clarke and provided support for the ideas about domain swapping found in the bioinformatic survey (Wright et al. Reference Wright, Teichmann, Clarke and Dobson2005). This simulation study showed that there was another source of misfolding: the ‘self-recognition’ of short sequences in two copies of the same protein. Figure 29 shows both the domain-swapped dimer and a misfolded dimer with the self-recognizing segments paired up. This self-recognition resembles the first steps in the formation of an amyloid. Indeed the self-recognizing fragment is identified by software trained to recognize amyloidogenic fragments bioinformatically. In Fig. 29, we show histograms of the energies per residue of hexapeptide fragments in the I27 fused dimer system along with the native energy. The amyloid fragment comes perilously close to competing with the native structure locally. But it does not compete globally. Histograms of the global energies of native structures, self-recognizing structures, and domain-swapped structures are shown in Fig. 29. Here the native structure clearly wins. At low temperature folding should be fine, but the amyloid-like structures can act as kinetic traps at higher temperature because their disordered parts give them entropic stability (Zheng et al. Reference Zheng, Schafer and Wolynes2013). Globally the energy landscape is a funnel, but misfolding can still give a kinetic trap, which can act as the initial state of further aggregation. We see that frustration can manifest itself more strongly in partially disordered ensembles even when it may not always be apparent in the final crystal structure.
This example shows that looking at pair-level frustration as in typical frustratograms does not reveal all situations where energetic frustration can show up.
Pair-level frustration in the monomer can however encourage aggregation. An example is insulin. The metastability of insulin is revealed in its natural tendency to self-assemble into oligomers, a feature that was first appreciated when the structure was determined in the 1960s (Adams et al. Reference Adams, Blundell, Dodson, Dodson, Vijayan, Baker, Harding, Hodgkin, Rimmer and Sheat1969; Baker et al. Reference Baker, Blundell, Cutfield, Cutfield, Dodson, Dodson, Hodgkin, Hubbard, Isaacs and Reynolds1988). The insulin monomer contains two polypeptide chains, an A chain of 21 amino acids and a B chain of 30 amino acids, which contain helices between residues A1 and A8, A13 and A19, and B9 and B19. The two chains are covalently linked by disulfide bridges between A7 and B7 and A20 and B19. Identical hydrophobic surfaces of two monomers come together forming an antiparallel beta-strand between residues B24 and B26 of each monomer. In the crystal, insulin is a hexamer consisting of three dimers aggregated around two zinc ions forming a soluble, globular protein structure (Baker et al. Reference Baker, Blundell, Cutfield, Cutfield, Dodson, Dodson, Hodgkin, Hubbard, Isaacs and Reynolds1988). The hexamer has been seen to adopt one of three conformational states known as T6, T3R3, and R6, depending on the arrangement of the first eight residues of the B chain. In the T state (Fig. 30, PDB code 3P2X; residues B9–B12 are colored orange and residues B16–B19 are colored yellow), residues B9–B12 adopt an elongated conformation, while in the R state (PDB code: 1MPJ; same coloring scheme), they form part of the B-chain helix. Residues B9–B12 do not appear to be making many contacts with the rest of the insulin molecule, as has been suggested by NMR studies on the T-state (Hua et al. Reference Hua, Nakagawa, Jia, Huang, Phillips, Hu and Weiss2008).
The A-chain is highly frustrated in all structures, and one might surmise that it is mainly held together by the disulfide bonds it makes with the B-chain. Researchers have identified a single residue, a proline at position B28, which is largely responsible for making favorable contacts with residues B20–B23 of the adjacent monomer. Experiments with B-chain C-terminus substitutions and truncations have shown that residues B26–B30 are essential for dimer formation, and single substitutions at residue B28 markedly increase the tendency to form monomeric units upon dilution (Brems et al. Reference Brems, Alter, Beckage, Chance, DiMarchi, Green, Long, Pekar, Shields and Frank1992). We explored the residual local frustration in the B28 Pro → Asp mutant, which crystallized in an R6-like conformation (PDB code: 1ZEH). Plots of the fraction of contacts that are either minimally or maximally frustrated show that the region near the site of mutation is highly frustrated (Fig. 30b). Close examination of this region reveals that the Asp makes inter-chain electrostatic contacts that form an extended backbone structure; however, these contacts are frustrated. The increased local frustration is likely the cause of the increased conformational flexibility in the B-chain C-terminus observed in this mutant (Whittingham et al. Reference Whittingham, Edwards, Antson, Clarkson and Dodson1998). It is interesting to speculate that the frustration might partly be ameliorated by the binding of small aromatic molecules such as phenol or cresol, which bind in this region of the protein, and may provide stabilization of the R6 conformation.
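The kind of summary plotted as 'fraction of contacts that are either minimally or maximally frustrated' can be sketched as follows. This is only an illustration under stated assumptions: it takes per-contact frustration indices as already computed, uses the cutoffs commonly quoted for frustratometer-style analyses (above 0.78 for minimally frustrated, below −1 for highly frustrated), and the function name and example values are hypothetical rather than data from the insulin structures.

```python
def frustration_fractions(indices, minimal_cutoff=0.78, high_cutoff=-1.0):
    """Classify contacts by frustration index and return the fraction in
    each class. The cutoffs are assumed defaults, not values from this text."""
    n = len(indices)
    if n == 0:
        raise ValueError("at least one contact is required")
    minimal = sum(1 for f in indices if f > minimal_cutoff)
    high = sum(1 for f in indices if f < high_cutoff)
    neutral = n - minimal - high        # everything between the two cutoffs
    return {"minimal": minimal / n, "neutral": neutral / n, "high": high / n}

# Made-up indices for the contacts around one residue (illustration only).
example = [1.2, 0.9, 0.1, -0.4, -1.3, -2.0, 0.8, 0.5]
print(frustration_fractions(example))
```

Applying such a count to the contacts within a fixed distance of each residue, one residue at a time, would give a per-residue profile of the sort described for the region around B28.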
Insulin can, on occasion, cause an amyloid disease. This phenomenon was investigated recently by Weiss and colleagues, who managed to crystallize the fibrillar form of insulin (Weiss, Reference Weiss2013). Interestingly, the molecular interactions observed in the crystal structure showed the participation of the A-chains, the most frustrated part of the insulin structure. No obvious alterations in the amount of minimally frustrated contacts were observed in the fibrillar form, however (Fig. 30c).
5.6. Frustration in nucleic acids and protein complexes
We have reviewed here why proteins are such a special kind of polymer: minimally frustrated globule-like structures can be encoded within an amino acid sequence. The richness of fold types is constrained at every level by the variety of the alphabet and the environmental conditions where the correct decoding occurs. Energy landscape theory shows that in the case of polypeptides the present 20-letter alphabet is more than sufficient to allow robust folding of a vast collection of topological families. However, proteins are not the only polymers found in living organisms. How does the concept of frustration apply to other biological polymers?
In terms of dry mass, polysaccharides abound in cells. Some of them are simple linear concatenations of the same monosaccharide units (cellulose), others display branched covalent structures (glycogen, starch), and even more complex varieties are still being characterized. Admittedly more needs to be done to understand them. Yet it seems that most polysaccharides do not spontaneously fold to specific structures, but rather form disordered colloids. Although the variety of saccharide units is very high due to local modifications, the chemical ligation properties of the individual units are sufficiently uniform along the chain that the interactions between any two parts are nearly as (de)stabilizing as any other. In energy landscape terms, the energetic variance of alternative structures ΔE² is small but the gap to the ground state appears to be nearly negligible, and a glass transition is expected to be predominant (honey is a good example!). Here frustration is the rule that plagues the landscape with local minima. Not surprisingly, specific synthesis and handling of even the simplest biological polysaccharides is challenging, and much could be gained if this were not the case (imagine making paper out of plain sugar!).
However, there is a special sort of polysaccharide that does spontaneously fold. Nucleic acids are essentially linear phosphopentoses in which each unit is modified with a nitrogen-rich base, typically one out of four possibilities. In these polymers, the unit–unit interaction modes are restricted, with electrostatics dominating repulsion and hydrophobic base–base interactions favoring association. The most stable structures are periodic packings of bases between two or more chains forming helices. Ever since the proposal of the antiparallel double-helix (Watson & Crick, Reference Watson and Crick1953), it has been evident that this structure is to a certain extent nearly independent of the sequence of the monomers, as long as specific base-pairing is maintained. This makes nucleic acids the prime molecular species where faithful templated polymerization can occur. Spontaneous formation of the double-helical structure is at the core of genetic engineering technologies, yet our present understanding of the folding landscape of DNA is limited. The melting temperature of short chains can be well approximated with simple empirical rules that take into account the unit composition and the saltiness of the solution (Schildkraut & Lifson, Reference Schildkraut and Lifson1965). This corresponds with a landscape in which the ‘native’ structure can be reached with little competition from alternative states, with a large gap dominating the energetics. However, longer chains display more complicated behaviors, with marked hysteresis, history and concentration-dependent kinetics, all hints of glassiness of the landscape. It should be kept in mind that the difference in stabilization energy between specific base-pairing interactions is low compared to the mere stacking of the bases, and thus the discrimination between correct and incorrect base-pairing is expected to become more difficult as the chain grows longer.
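As an illustration of the kind of empirical rule mentioned above, the sketch below estimates a duplex melting temperature from GC content, chain length and monovalent salt concentration. The functional form and coefficients are one commonly quoted salt-adjusted approximation and are assumed here for illustration; they are not taken from this review, and the function name and default salt concentration are hypothetical.

```python
import math

def melting_temperature(seq, na_molar=0.05):
    """Rough duplex melting temperature in degrees Celsius (assumed formula):
    Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 675/N."""
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    if na_molar <= 0:
        raise ValueError("salt concentration must be positive")
    n = len(seq)
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / n
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 675.0 / n

# Two 20-mers at 1 M salt: only the composition differs.
print(melting_temperature("GC" * 10, na_molar=1.0))
print(melting_temperature("AT" * 10, na_molar=1.0))
```

Only composition, length and salt enter such a rule, not the detailed sequence, which fits the picture of a short-chain landscape dominated by a single large gap. Rules of this type are not expected to hold for the longer chains whose glassier behavior is described above.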
In contrast to proteins, nucleic acid folding is dominated by the formation of continuous secondary structure stretches, with small relative contribution of long-range tertiary interactions, which under ordinary conditions are discouraged by electrostatics. The apparent simplicity of the short oligonucleotide folding allows for engineering large assemblies by adjusting local rules (Rothemund, Reference Rothemund2006). Indeed statistical mechanical theories of the nucleic acid structure formation suggest that due to its quasi-one-dimensional character the landscape is naturally hierarchical (Fernández & Shakhnovich, Reference Fernández and Shakhnovich1990).
The most common source of frustration in nucleic acids is the competition between base-pairing of continuous stretches within a single chain and between similar chains. The formation of alternative, metastable stem-loop structures was suggested early on by Crick as a plausible mechanism for gene regulation (Crick, Reference Crick1971). In recent years, single molecule spectroscopy has allowed for detailed characterization of the overall time scales on which such transitions occur, providing direct evidence of the existence of many collapsed intermediate structures (Ma et al. Reference Ma, Wan, Wu and Zewail2007). The free-energy landscapes drawn out from these and other experiments portray overall smooth surfaces with discrete metastable traps (Woodside et al. Reference Woodside, Anthony, Behnke-Parks, Larizadeh, Herschlag and Block2006). The location of the traps and the energy barriers separating them can be readily manipulated by adjusting the hairpin sequence, such that the population of folding intermediates can be partially controlled by introducing mismatches and changing loop length. Frustration between these structures provides the grounds for engineering nucleic acid polymers that can act as switches, with ingenious applications envisioned (Viasnoff et al. Reference Viasnoff, Meller and Isambert2006).
The natural occurrence of functional aspects of frustration in nucleic acids is best exemplified by RNA, not DNA. Small stereochemical restrictions in the dihedral angles of the backbone make the small alphabet of ribonucleotide sequences generally insufficient to specify a unique set of secondary and tertiary interactions. Studies on the folding of catalytic RNAs (ribozymes) have highlighted the peculiar propensity of RNA to adopt alternate conformations in solution. This conformational heterogeneity is attributed to misfolded structures or long-lived folding intermediates and it is frequently observed in natural isolates. The high probability that even random RNA sequences have fairly stable stretches of base-pairs makes the number of alternate structures significant. Still, the complexity of RNA folding is reduced by its more hierarchical nature as compared to proteins. Typically secondary structures form fast and last for minutes to hours at physiological temperatures, whereas tertiary interactions are comparatively weak, interconverting on the time scale of milliseconds to seconds (Solomatin et al. Reference Solomatin, Greenfeld, Chu and Herschlag2010). This marked difference in stability and kinetics constrains the accessible tertiary structures. Conversely, isolated secondary structural elements of larger RNAs often fold properly in the absence of tertiary interactions. This phenomenon underlies the successful prediction of RNA structure by two-dimensional mixing and matching of fragments, and also gives rise to deep folding intermediates separated by large barriers. These long-lived alternatives can be co-opted to perform specific functions and should be treated as alternative ‘native’ states. Such functional alternatives have been described in the autocatalytic Tetrahymena Group I intron, the sarcin–ricin loop, the ribosome, and in vitro selected aptamers (Marek et al. Reference Marek, Johnson-Buck and Walter2011).
Furthermore, the stability of these frustrated states can depend on the differential binding to other molecules, as exemplified in riboswitches (Serganov & Nudler, Reference Serganov and Nudler2013). In these examples, binding of a small ligand to a segment of mRNA is sufficient to effect a structural transition in the mRNA that in turn regulates the translation rate, making them effective modulators of protein abundance.
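The remark that even random RNA sequences can form stretches of base-pairs can be explored with a toy calculation. The sketch below uses Nussinov-style base-pair maximization, a classic dynamic-programming scheme that is swapped in here purely for illustration: it counts pairs instead of evaluating free energies, ignores stacking and tertiary contacts, and is not the method used in the studies cited above. The pairing rules, the minimum loop length and the function names are assumptions.

```python
import random

# Watson-Crick pairs plus the G-U wobble pair (an assumed, simplified rule set).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in a single RNA chain.

    min_loop is the smallest number of unpaired bases allowed inside a hairpin.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the maximum pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is reproducible
    for _ in range(3):
        rna = "".join(rng.choice("ACGU") for _ in range(40))
        print(rna, max_base_pairs(rna))
```

A pair count of this sort says nothing about stability or kinetics, but running it on a few random sequences gives a feel for how readily alternative pairings become available to an arbitrary chain.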
Frustration has an important functional role when proteins meet up with DNA. Some proteins display the remarkable ability to differentially recognize a specific stretch of DNA out of a vast excess of structurally similar sites. Transcription factors locate their cognate site in a genome with rates that appear to break the diffusion speed limit. Delbrück first envisioned a mechanism to explain this puzzle, proposing that proteins may bind to any site on a linear chain of DNA and then slide along it, reducing the three-dimensional search problem to one dimension. Since then the existence of such ‘facilitated diffusion’ mechanisms has been confirmed in a variety of systems (von Hippel & Berg, Reference von Hippel and Berg1989). This mechanism requires the protein to have a relatively small variance of the sequence-dependent energy landscape of interaction, so that fast sliding can occur. Gerland et al. have analyzed this with the same mathematics employed by Bryngelson and Wolynes for folding proteins (Gerland et al. Reference Gerland, Moroz and Hwa2002). At the same time, however, a DNA-binding protein must also display a relatively large thermodynamic stability at its target site. The requirement of both fast search and stability of the cognate protein–DNA complex imposes conflicting constraints on the landscape: whereas rapid search requires a fairly smooth landscape, the stability of the protein–DNA complex requires low energy at the target site. Slutsky & Mirny (Reference Slutsky and Mirny2004) showed that this apparent paradox can be ameliorated if the protein occupies (at least) two distinct conformational states that differ in the variance of the binding energy distribution. In one state, the protein binds DNA sites with a small variance; it thus cannot strongly discriminate between different sequences but can slide rapidly. The other state is characterized by a large variance with greater differences in affinity for the various sequences.
The differences in free energy and the switching rate between the protein states define an energy gap. In this model, the search time to stably bind a cognate site is optimized when the protein spends most of the time in the state that favors sliding but can undergo fast conformational transitions to the other state. In this scenario, frustration in protein–DNA binding effectively creates a landscape with only a little ruggedness and a few alternate sites with low energies (Slutsky & Mirny, Reference Slutsky and Mirny2004).
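The conflict between fast sliding and stable binding described above can be made concrete with a toy calculation. This is a minimal sketch under stated assumptions, not the model of the cited papers: site energies are drawn from a Gaussian of standard deviation sigma (in units of kT), the slowdown of sliding is estimated with the classic result for diffusion over Gaussian roughness, D_eff/D_0 = exp(-(sigma/kT)^2), which is swapped in here as a stand-in for the one-dimensional expressions used in the literature, and stability is measured as the equilibrium occupancy of the lowest-energy site. All function names and parameter values are hypothetical.

```python
import math
import random


def sliding_slowdown(sigma_kt: float) -> float:
    """D_eff / D_0 for diffusion over Gaussian roughness (assumed estimate)."""
    return math.exp(-sigma_kt ** 2)


def make_landscape(n_sites: int, seed: int = 0) -> list[float]:
    """Standard-normal scores z_i; site energies are E_i = sigma * z_i (in kT)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_sites)]


def best_site_occupancy(sigma_kt: float, z_scores: list[float]) -> float:
    """Equilibrium probability of finding the protein on the lowest-energy site."""
    energies = [sigma_kt * z for z in z_scores]
    e_min = min(energies)
    # Shift by e_min so that the largest Boltzmann weight is exactly 1.
    partition = sum(math.exp(-(e - e_min)) for e in energies)
    return 1.0 / partition


if __name__ == "__main__":
    z = make_landscape(10_000)
    print("sigma/kT  D_eff/D_0  occupancy of best site")
    for sigma in (0.5, 1.0, 2.0, 4.0, 6.0):
        print(f"{sigma:8.1f}  {sliding_slowdown(sigma):9.2e}  "
              f"{best_site_occupancy(sigma, z):.3f}")
```

In this sketch, increasing sigma can only raise the occupancy of the best site while it lowers the sliding rate, so no single value serves both goals. Switching between a low-variance search state and a high-variance recognition state, as in the two-state picture, is one way around the conflict.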
Biochemical studies have highlighted the fact that DNA-binding proteins bind non-specifically to any DNA stretch (or any polyanionic surface for that matter), with electrostatic interactions dominating the energetics. The surfaces of DNA-binding proteins usually display positively charged patches that complement the negatively charged DNA backbone, providing a rationale for the lack of sequence specificity. On the other hand, the structures of several specific protein–DNA complexes show that these are stabilized by intimate contacts between residues and bases that could discriminate the subtle structural differences between different DNA sequences, mostly present at the bottom of the major groove (Norambuena & Melo, Reference Norambuena and Melo2010). The differential ‘binding modes’ may thus be related to the formation of distinct interaction regions. A recent survey of the structures of DNA-binding proteins has found that some proteins utilize one interface to specifically recognize a cognate sequence while another region is employed to establish non-specific associations, resembling two distinct active sites. In contrast, sometimes these two patches overlap (Marcovitz & Levy, Reference Marcovitz and Levy2011). The system now has to choose between these alternatives, frustrating the protein–DNA interaction. The free-energy barrier between the binding modes can in principle be related to the overlap of these binding surfaces. MD simulations of the sliding/binding process show that as the local frustration increases, the rate of switching between sliding and binding decreases (Marcovitz & Levy, Reference Marcovitz and Levy2013). On the one hand, high frustration within the protein smooths the landscape for sliding, but it makes the switching to the specific binding mode slower, which slows down the overall target location rate. On the other hand, if frustration is lowered too much switching becomes rare and the protein often misses the cognate site.
The situation encountered is reminiscent of the effect frustration has in folding a polypeptide chain: a little is helpful, lots can be disastrous.
The formation of a specific protein–DNA complex may be coupled to the folding of the protein. While protein flexibility is involved in induced-fit recognition and to a larger extent in intrinsically unstructured DNA-binding proteins, the electrostatic field of the DNA can induce flexibility and a partial unfolding of the protein (Levy et al. Reference Levy, Onuchic and Wolynes2007). Protein disorder facilitates the three-dimensional diffusive search through the ‘fly-casting’ mechanism where the protein binds DNA non-specifically in a partially folded state and completes its folding when it binds the specific site (Shoemaker et al. Reference Shoemaker, Portman and Wolynes2000). Indeed, several DNA-binding proteins are known to be partially unstructured in the unbound state and to fold only upon binding to the target (Spolar & Record, Reference Spolar and Record1994). The interplay between fly-casting and electrostatics is observed even for weak electrostatic forces and is expected to vary with the electrostatic screening due to salt and the intrinsic folding barrier, both of which can be modulated. Increasing the strength of the electrostatic forces destabilizes the isolated protein due to frustration between general electrostatic interactions and the specific native interactions. In the presence of DNA, a larger destabilization effect is seen as a result of competition between folding and binding (Levy et al. Reference Levy, Onuchic and Wolynes2007). This latter frustration reflects a ‘tidal force’ from electrostatics tending to pull the molecule apart. Under strong electrostatic forces, the negatively charged DNA pulls the protein apart, and the protein–DNA association can therefore advance protein folding.
Papillomavirus E2c protein is a remarkable model in which the effects of frustration in protein–DNA association have been experimentally studied. When confronted with a small double-stranded oligonucleotide the E2c protein finds its specific site using two parallel routes (Sánchez et al. Reference Sánchez, Ferreiro and de Prat Gay2011). In one of them the process is well described as a two-state reaction with a small activation barrier of entropic origin. The effect on the kinetics and thermodynamics of single-point mutations in the DNA-contacting surface indicated that the formation of direct contacts between residues and bases constitutes the bottleneck of complex formation (Ferreiro et al. Reference Ferreiro, Sánchez and de Prat Gay2008a). Few non-specific ionic interactions mainly stabilize the final consolidated complex. The binding landscape along this route is smooth, consistent with a hydrated protein–DNA interface. In contrast, binding along the other route involves several steps. Attaining an initial, diffuse, transition-state ensemble of structures with some non-native contacts is followed by formation of extensive non-native interactions that drive the system into a kinetic trap (Sánchez et al. Reference Sánchez, Ferreiro, Dellarole and de Prat-Gay2010). These non-native contacts then slowly rearrange into native-like interactions. Dissimilar protein–DNA interfaces are formed in the intermediate and the final complex, as anticipated from the computational studies described above. Frustration between the binding modes creates a degeneracy of interactions that drives the complex into long-lived non-native species (Sánchez et al. Reference Sánchez, Ferreiro, Dellarole and de Prat-Gay2010). It has been speculated that this slowing down of the binding reaction allows for extra functionality, as this metastable species exposes different surfaces that could be linked with other binding activities, as usually seen in eukaryotic protein–nucleic acid complexes.
6. Frustration is a fundamental aspect of biochemistry
Some biologists take for granted that proteins just ‘do-their-job’ as if they were robots made by some blind adaptive designer. Physicists are often amazed by the intricacies of how this peculiar kind of chemical scum can organize itself at the temperatures the biosphere experiences. We have explored the foundations of the concept of Frustration and how it bridges these fields in the study of biomolecules. The impossibility of simultaneously satisfying constraints leads to degeneracy of states. Finite degeneracy allows for captivating emergent behaviors where apparently unrelated parts of a system respond to subtle changes in local energetics. Too much frustration and degeneracy leads to qualitatively different dynamics where searching through the many states cannot be carried out in a meaningful time and the system freezes out in a glass. On the other hand, systems that lack any frustration at all are well behaved, display sharp phase transitions, get ordered spontaneously in the right conditions and, much like crystals, are beautiful but rather dull. As in many other realms of natural processes, interesting things happen within a balance between orderly and chaotic regimes (Frauenfelder et al. Reference Frauenfelder, McMahon and Fenimore2003; Kauffman, Reference Kauffman2013).
The principle of minimal frustration has led to an understanding of the general aspects of protein folding. It has fruitfully led to testable hypotheses, successful interpretations of a diverse range of experiments and to the development of useful computational tools. As with any good theory, the deviations from its expectations mark situations in which discoveries await. Frustration opens up possibilities for degeneracy, recognizing proteins as metastable systems that can compute and modify their structures according to environmental inputs. In turn, the fact that these computations are robust to small perturbations makes proteins evolvable, such that the overall behavior can be finely tuned, and trained, over time (Bray, Reference Bray1995). Present-day protein molecules contain information both about historical chances and physical necessities imprinted in their structures. It is still challenging for humans to distinguish the relevant information from the inevitable noise that overwhelms our current intellectual capabilities.
Studying the aspects of frustration averaged over many proteins provides ways to infer energy functions for structure prediction and has allowed us to appreciate the apparent complexity of the protein-folding code. Frustration affects laboratory folding kinetics and deviations from perfectly funneled landscapes reflect previously unrecognized functional constraints. We have reviewed here how a large part of the functions of proteins is related to subtle local frustration effects and how frustration influences the appearance of metastable states, the nature of binding processes, catalysis, and allosteric transitions. ‘Doing something’ is the prime characteristic of protein molecules. The teleonomic, apparent purposefulness of biological function often involves several ‘activities’ coming together in a coordinated way (Monod, Reference Monod1973). At any scale it is challenging to disentangle what are the meaningful parts of living things and what are the relevant interactions between them (Simondon, Reference Simondon2005). The apparent individuality of the polypeptides we can purify from the present-day organisms is restricted by our skewed view and becomes highlighted when a theory based on statistics is available. The study of life beyond the scale of single proteins has to treat basic aspects of the stability and dynamics of finite states, and as such we envision that frustration is a fundamental concept that will help (often frustrated) researchers for many years.
The huge number of proteins in a cell requires that proteins interact with a relatively high degree of specificity; otherwise most of the cellular matter would be tied up in a mess of inappropriate interactions, and instead of cells we would have scrambled egg cells. This functional specificity has been achieved by the coevolution of sequences that also lead to little scrambling even within each individual protein molecule. Low-free-energy excited state ensembles for a biomolecule can come about in several ways: harmonic motions of large length scale or special harmonic motions utilizing weakly connected links, local folding/unfolding on a funneled landscape, frustrated local interactions giving energetically degenerate structures and sequence symmetry that gives rise to alternate structures. Various combinations of such motions with low-free-energy costs occurring at the same time will also give low-free-energy excited states and kinetically dominant reaction paths. An example of such a combination mode is ‘cracking’, in which the system makes excursions along a low-frequency normal mode (not costly in energy); during this excursion the molecule locally unfolds (not too costly in free energy) and then refolds. If the region that unfolds is frustrated, so much the better: its unfolding will be still faster and cheaper in free-energy terms. Cracking seems to be a likely mechanism for many physiologically relevant biochemical processes.
The problem of self-organization for large nanoscale objects is not stability but the kinetics of getting to a stable structure. Kinetic barriers can be avoided if sufficient entropy remains to fluidize the motion. Up to a point, robust self-assembly in a cell therefore often relies on folding upon binding – a sort of a just-in-time policy for macromolecular economy. Nevertheless, kinetic barriers for folding (even for folding upon binding) tend to scale with the size of the assembled system. The larger barriers occur for frustrated systems but also may grow with size even when the minimal frustration principle is respected, as in the capillarity picture. At some size scale the assembly rates generally become too slow and far-from-equilibrium assembly takes over. This crossover already seems to occur when the largest proteins are folded in vivo. Rubisco, a very large protein, requires chaperones in order to fold. It is likely the chaperoned catalysis of proper folding is actually a kind of kinetic proofreading, in which misfolded protein molecules are ripped apart by unfolding machines and given a chance to try again (Gulukota & Wolynes, Reference Gulukota and Wolynes1994; Todd et al. Reference Todd, Lorimer and Thirumalai1996). The role of the ATP-induced motions of a chaperone can be thought of as increasing the effective temperature for a limited fraction of the degrees of freedom of the protein.
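The ‘try again’ logic of chaperone-assisted folding described above can be illustrated with a back-of-the-envelope calculation. This is a minimal sketch under strong assumptions that are not taken from the cited work: each round of unfolding and release is treated as an independent trial, and every trial reaches the native state with the same probability. The function names and the example numbers are hypothetical.

```python
import math


def yield_after_rounds(p_fold: float, rounds: int) -> float:
    """Fraction of molecules folded after a given number of try-again rounds.

    Assumes independent rounds, each succeeding with probability p_fold.
    """
    if not 0.0 <= p_fold <= 1.0:
        raise ValueError("p_fold must lie between 0 and 1")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return 1.0 - (1.0 - p_fold) ** rounds


def rounds_needed(p_fold: float, target_yield: float) -> int:
    """Smallest number of rounds that reaches the target folded fraction."""
    if not 0.0 < p_fold <= 1.0:
        raise ValueError("p_fold must be in (0, 1]")
    if not 0.0 <= target_yield < 1.0:
        raise ValueError("target_yield must be in [0, 1)")
    if target_yield == 0.0:
        return 0
    if p_fold == 1.0:
        return 1
    return math.ceil(math.log(1.0 - target_yield) / math.log(1.0 - p_fold))


if __name__ == "__main__":
    p = 0.05  # hypothetical chance of folding correctly in one attempt
    for n in (1, 10, 50, 100):
        print(f"{n:4d} rounds -> folded fraction {yield_after_rounds(p, n):.3f}")
    print("rounds needed for 95% yield:", rounds_needed(p, 0.95))
```

Even a small success probability per attempt compounds into a high yield when misfolded molecules are repeatedly given another chance, at the price of the chemical energy spent on every round.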
Cells are not formless bags of molecules. Organization persists even at the cellular scale and changes its nature at this supramolecular or ultrastructural level. We are only beginning to probe the atomically weak but colloidally strong forces that act on this larger length scale. It is also clear that in living things, motion on this size scale is not just equilibrium Brownian motion. Instead, chemical energy is flowing through the system: various molecular motors carry parts around in a hurly-burly reminiscent of a large city or ant colony (Wang & Wolynes, Reference Wang and Wolynes2011). This motorization probably allows the problem of large barriers and high glass transition temperatures to be overcome much as in kinetic proofreading by chaperones where ATP is used. By comparing responses to fluctuations on the colloidal scale (Lau et al. Reference Lau, Hoffman, Davies, Crocker and Lubensky2003), an effective temperature of the cytoskeleton has been measured to be 10 000 K. Obviously this high temperature applies to only a minute fraction of the degrees of freedom. Studying the organization of living matter on this longer length scale will require new tools, both experimental and theoretical. We can anticipate, however, that the strong foundation of energy landscape concepts learned in the study of molecular folding and function will be of great help in finding the required conceptual framework for this far-from-equilibrium regime.
7. Acknowledgements
We thank Weihua Zheng, Nick Schafer, Bobby Kim, Ha Truong, and Rocío Espada for stimulating discussions and for providing help with figures. Supported by Grants R01 GM44557 and P01 GM071862 from the National Institute of General Medical Sciences. Additional support was also provided by the D. R. Bullard-Welch Chair at Rice University, the Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT) and the Consejo Nacional de Investigaciones Científicas y Técnicas de Argentina (CONICET).