Introduction
I have been fascinated with transition states for more than 60 years – a passion for understanding structure and mechanism which has directed my research at the borderlines of chemistry, physics and biology. Transition states of simple covalent reactions are traditionally studied by structure-activity relationships whereby perturbations of the energetics of kinetics and equilibria of reactions on small changes in the structure of reagents are correlated to give clues about the structure of the transition state. Much of biological chemistry is dominated by weak noncovalent interactions, especially those of proteins. The advent of protein engineering enabled structure-activity relationships to be applied to the noncovalent transition states of those biological processes. This invited review outlines the history of key steps by my research group and by others in translating those structure-activity methods of classical physical and organic chemistry to analyse noncovalent transition states. It begins with their introduction via protein engineering to the quantitative study of noncovalent interactions in enzyme catalysis and specificity and then their extension to protein folding to give Φ-value analysis. I discuss in particular how the combination of those methods and computer simulation has been used in solving problems of protein folding pathways.
It is a particularly appropriate time for this topic as it is the centenary of the publication of the landmark paper in the history of physical-organic chemistry that led to structure-activity studies, the discovery of general-base catalysis and its dependence on the strength of the base by Brønsted and Pedersen (1924). That discovery and the ensuing Brønsted β-value have inspired much of my research and the contents of this review. It is also the half-centenary of my paper that sent me down the slippery slope of analysing noncovalent interactions in transition states (Fersht, 1974). Pertinently, it is also the centenary of the chess grandmaster S. G. Tartakower’s ‘Die Hypermoderne Schachpartie’ in which he wrote ‘Die Fehler sind dazu da, um gemacht zu werden’ (Tartakower, 1924, p. 90). The usual translation ‘The mistakes are all there, waiting to be made’ should be the watchword of every experimentalist and theoretician as well as chess player, especially in areas as complex and as full of pitfalls as protein folding.
Transition states in covalent chemistry
Transition states are the transient structures at the peaks of plots of free energy as a reaction progresses, as opposed to intermediates, which lie in basins (Figure 1). Simple transition state theory relates the rate constant for a reaction to the energy difference between the transition and ground states, $ \Delta {G}^{\ddagger } $, as if the two states were in equilibrium: the rate constant for the reaction going through the transition state, k, is given by:
$ k=\kappa \left({k}_{\mathrm{B}}T/h\right)\exp \left(-\Delta {G}^{\ddagger }/ RT\right) $ (1)
where $ {k}_{\mathrm{B}} $ is the Boltzmann constant, h is the Planck constant, R is the gas constant, T is the temperature, and κ is a transmission coefficient (Pelzer and Wigner, 1932; Evans and Polanyi, 1935; Eyring, 1935). Examination of the transition state structure relative to the ground states gives important clues as to what drives a reaction and how its rate or even its products may change by altering the structure of the reagents, the reaction conditions or employing catalysts. For example, the rate of attack of a negatively charged nucleophile on a reagent can be increased by introducing electron-withdrawing substituents. Transition states are essential structures in defining reaction pathways. To solve a reaction pathway, we must characterise all the ground states and the transition states linking them. Ground states and intermediates are best studied by direct observation. The only state between ground states that can be characterised experimentally is the elusive transition state, and the only current experimental means of doing so is indirect evidence from structure-reactivity relationships.
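The relation between a rate constant and its activation free energy can be sketched numerically. The snippet below is a minimal illustration, assuming a temperature of 298.15 K, a transmission coefficient of 1 and energies in kcal/mol; the function names `eyring_rate` and `ddg_act` are hypothetical and chosen only for this example.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
H = 6.62607015e-34       # Planck constant, J s
R = 1.98720425864e-3     # gas constant, kcal/(mol K)


def eyring_rate(dg_act, temp=298.15, kappa=1.0):
    """Rate constant (s^-1) from an activation free energy in kcal/mol."""
    return kappa * (K_B * temp / H) * math.exp(-dg_act / (R * temp))


def ddg_act(k_ref, k_new, temp=298.15):
    """Change in activation free energy (kcal/mol) implied by k_ref -> k_new.

    The pre-exponential term cancels in the ratio of rate constants.
    """
    return -R * temp * math.log(k_new / k_ref)


if __name__ == "__main__":
    # Assumed, illustrative barrier heights.
    for dg in (10.0, 15.0, 20.0):
        print(f"dG_act = {dg:4.1f} kcal/mol -> k = {eyring_rate(dg):.3g} s^-1")
    print(f"10-fold faster rate -> ddG_act = {ddg_act(1.0, 10.0):.2f} kcal/mol")
```

Because only ratios of rate constants enter `ddg_act`, the value of κ does not matter for comparisons between closely related reagents, provided it is the same for both.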
Linear-free-energy relationships: LFER and REFERs – β- and α-values
The classical physical-organic chemist’s approach to analysing the structure of a transition state of a reaction is to use quantitative measurements of the changes in reactivity and equilibria on small changes in the structure of reagents. For example, Brønsted and Pedersen began the analysis of the effects of strengths of bases and acids on their powers of catalysis of simple organic reactions in solution (Brønsted and Pedersen, 1924). They found, for example, that there is often a simple equation relating the second-order rate constant ($ {k}_2 $) for catalysis of a reaction by a general base to the $ \mathrm{p}{K}_{\mathrm{a}} $ of the conjugate acid (2).
$ \log {k}_2=A+\beta\, \mathrm{p}{K}_{\mathrm{a}} $ (2)
This is an example of a linear-free-energy relationship (LFER) since it is equivalent to:
$ \Delta {G}^{\ddagger }=A\prime +\beta \Delta {G}^0 $ (3)
where $ \Delta {G}^{\ddagger } $ is the free energy of activation and $ \Delta {G}^0 $ the equilibrium free energy change of a process. β is for base but we usually call it the Brønsted β. The equation can be formulated for a wider range of reactions, $ \Delta {G}^{\ddagger }=A+\alpha \Delta {G}^0 $, as described by Leffler (1953), and the description rate-equilibrium free-energy relationship (REFER) is alternatively used.
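In practice a Brønsted β is the slope of a plot of the logarithm of the rate constant against the pK a of the catalyst. The following sketch fits such a slope by ordinary least squares. The data points are invented purely for illustration and are not taken from any of the cited papers, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical series of general-base catalysts (made-up values).
pka = [3.8, 4.8, 6.0, 7.2, 8.1]
log_k2 = [-3.10, -2.62, -2.03, -1.45, -0.98]

beta, intercept = fit_line(pka, log_k2)

if __name__ == "__main__":
    print(f"Bronsted beta = {beta:.2f}, intercept A = {intercept:.2f}")
```

A slope near 0.5 for such made-up data would be read, qualitatively, as a proton roughly half transferred in the transition state, subject to the caveats about interpreting fractional values discussed in the text.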
L. P. Hammett translated these LFERs to chemical reactions involving aromatic compounds by measuring the effects of chemical substituents in the meta and para positions of benzoic acid on its $ \mathrm{p}{K}_{\mathrm{a}} $ to assign a σ-value for each substituent (corresponding to the change it makes in the $ \mathrm{p}{K}_{\mathrm{a}} $) and relating the sensitivity of the logarithms of rate constants for chemical reactions to σ by a parameter ρ, equivalent to the Brønsted β (Hammett, 1937). The meta and para positions are chosen to minimise direct steric interactions with the seat of reaction (Hammett, 1940).
The simple reasoning behind the magnitude of the β and ρ values in many chemical reactions is that they often result from electrostatic effects. For example, in the transition state of the attack of H2O on an ester with general-base catalysis by acetate ion (Figure 2), an H+ is in the process of being transferred from the H2O to the $ -{\mathrm{CO}}_2^{-} $ catalyst, partly neutralising its negative charge. If a substituent that has an electron-withdrawing or donating propensity is put into the −CH3 of acetic acid, it will perturb its $ \mathrm{p}{K}_{\mathrm{a}} $, changing the free energy of ionisation by $ \Delta \Delta {G}^0 $, because of the electrostatic interactions with the negatively charged carboxylate relative to the neutral state. The electrostatic interaction of the substituent with the partly neutralised negative charge on the $ -{\mathrm{CO}}_2^{-} $ in the transition state, $ \Delta \Delta {G}^{\ddagger } $, will be less than $ \Delta \Delta {G}^0 $ because of the H+ being transferred so that:
$ \Delta \Delta {G}^{\ddagger }=\beta \Delta \Delta {G}^0 $ (4)
where β approximates to the extent of bond formation with the H+ in the example of Eq. (4) or, in other cases, of a covalent bond in the transition state. $ \beta =0 $ means there is no transfer of the proton to the base and $ \beta =1 $ means complete transfer, and fractional values are something in between. One possible generic basis of LFERs is explained in Figure 3 where the reagents are in two energy wells that intersect at the transition state. Applying a simplified version of the treatment by Marcus (1968) of outer sphere electron transfer reactions, I assume the energy functions are simple harmonic wells. For the starting material S, $ \Delta {G}_{\mathrm{S}}={\lambda}_1{r}^2 $ and for products $ \Delta {G}_{\mathrm{P}}={\lambda}_2{\left(1-r\right)}^2-\Delta {G}^0 $, which gives for $ \alpha =\Delta \Delta {G}^{\ddagger }/\Delta \Delta {G}^0 $ (Fersht, 2004b):
For the special case of $ {\lambda}_1={\lambda}_2 $, $ \alpha ={r}_{\ddagger } $. But, apart from the extreme values of the position of the transition state, $ {r}_{\ddagger }=0 $ or 1, $ {r}_{\ddagger } $ does not generally equal α (or β) (Fersht, 2004b). The situation is, of course, even more complicated than the above for fractional values. The reaction coordinate diagram is not two-dimensional and there can be movement in other dimensions with much complexity (Jencks, 1985).
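The behaviour of intersecting harmonic wells can be checked numerically. The sketch below locates the crossing point of two parabolic wells and estimates α as the numerical slope of the barrier height against the driving force. It rests on stated assumptions: the product well is written here as λ2(1 − r)² + ΔG0, with ΔG0 taken as the free energy of products relative to reagents (negative when favourable), a sign convention adopted for this example so that α comes out positive; the λ values and function names are hypothetical.

```python
def transition_state_position(lam1, lam2, dg0, tol=1e-12):
    """Bisection for r in [0, 1] where lam1*r**2 == lam2*(1 - r)**2 + dg0."""
    def gap(r):
        # Difference between reagent and product wells; increases with r.
        return lam1 * r ** 2 - lam2 * (1.0 - r) ** 2 - dg0

    lo, hi = 0.0, 1.0
    if gap(lo) > 0.0 or gap(hi) < 0.0:
        raise ValueError("wells do not cross between r = 0 and r = 1")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def activation_energy(lam1, lam2, dg0):
    """Height of the crossing point above the bottom of the reagent well."""
    r_ts = transition_state_position(lam1, lam2, dg0)
    return lam1 * r_ts ** 2


def alpha(lam1, lam2, dg0, step=1e-4):
    """Central-difference estimate of d(dG_act)/d(dG0)."""
    upper = activation_energy(lam1, lam2, dg0 + step)
    lower = activation_energy(lam1, lam2, dg0 - step)
    return (upper - lower) / (2.0 * step)


if __name__ == "__main__":
    for lam1, lam2, dg0 in [(10.0, 10.0, -2.0), (20.0, 10.0, 0.0)]:
        r_ts = transition_state_position(lam1, lam2, dg0)
        print(f"lam1={lam1}, lam2={lam2}, dG0={dg0}: "
              f"r_ts={r_ts:.3f}, alpha={alpha(lam1, lam2, dg0):.3f}")
```

With equal curvatures the slope and the position of the crossing coincide, whereas with unequal curvatures they differ, which is the point made in the text about not reading α directly as a position on the reaction coordinate.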
LFERs have been found in many types of physical-chemical processes, and the interpretation is usually simply phenomenological with the value of α or β interpreted only qualitatively for mechanism and semi-quantitatively for predictive purposes. In the qualitative analysis of the effects of changes of structure on reactivity, it is just the changes in $ \Delta {G}^{\ddagger } $ in the $ \kappa \left({k}_{\mathrm{B}}T/h\right)\exp \left(-\Delta {G}^{\ddagger }/ RT\right) $ term of the transition state theory that are examined, with the pre-exponential component cancelling out in the comparison of rate constants. $ \Delta \Delta {G}^{\ddagger } $ and $ \Delta \Delta {G}^0 $ are the key quantities. My first published paper as a graduate student centred on using LFERs to analyse transition states in chemical mechanisms, with a series of substituted aspirins (Fersht and Kirby, 1967), and LFERs figure prominently in my textbook on enzymes (Fersht, 1977, 1985).
Transition states in noncovalent chemistry: biological catalysis and specificity
Classical chemistry is dominated by covalent bonds and strong ionic interactions. Much of chemistry in biology, on the other hand, is dominated by weak noncovalent interactions, such as van der Waals interactions, hydrogen bonds, salt bridges, and the hydrophobic effect. Utilisation of these weak interactions is the hallmark of biological specificity in general and modulation of catalysis by enzymes.
Enzyme catalysis and binding of the transition state
The rates of enzyme-catalysed reactions are many orders of magnitude greater than those of simple reactions catalysed in solution by acids and bases or nucleophiles. To answer why, Haldane proposed that enzymes might catalyse reactions by straining the structures of the substrates towards that of the products (Haldane, 1930). Pauling refined that concept by stating that an enzyme could have a structure complementary to that of the activated complex or transition state of the substrate, and hence stabilise it (Pauling, 1948). Classical studies varying the structures of substrates of α-chymotrypsin, for example, showed that binding energy could be distributed between tighter binding of substrate and higher rate constants (Jencks, 1975). Analogues mimicking the structure of transition states of substrates may also bind more tightly than the substrates themselves (Schramm, 1998). So, free energies of activation of the covalent chemical reaction, $ \Delta {G}_{\mathrm{cov}}^{\ddagger } $, can be modulated by changes in binding energies, $ \Delta {G}_{\mathrm{noncov}}^{\ddagger } $.
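The Michaelis–Menten equation (6) relates the reaction rate v of a substrate S to the total concentration of enzyme, $ {\left[\mathrm{E}\right]}_0 $, an apparent first-order rate constant $ {k}_{\mathrm{cat}} $, and an apparent dissociation constant $ {K}_{\mathrm{M}} $.
$ v=\frac{{k}_{\mathrm{cat}}{\left[\mathrm{E}\right]}_0\left[\mathrm{S}\right]}{{K}_{\mathrm{M}}+\left[\mathrm{S}\right]} $ (6)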
In the simplest case, $ {K}_{\mathrm{M}} $ is the dissociation constant for the E.S complex, $ {K}_{\mathrm{S}} $, and $ {k}_{\mathrm{cat}} $ is the rate constant for its giving products. But these apparent rate and equilibrium constants can hide a complexity of additional terms, from additional chemical steps to non-productive binding. Crucially, however, the ratio $ {k}_{\mathrm{cat}}/{K}_{\mathrm{M}} $ is an apparent second-order rate constant for the process of free enzyme, [E], and free substrate, [S], proceeding to the highest transition state on the reaction pathway to give products, and complicating factors are usually cancelled in the ratio $ {k}_{\mathrm{cat}}/{K}_{\mathrm{M}} $, Eq. (7).
$ v=\left({k}_{\mathrm{cat}}/{K}_{\mathrm{M}}\right)\left[\mathrm{E}\right]\left[\mathrm{S}\right] $ (7)
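A short numerical sketch can show how the Michaelis–Menten rate reduces to a second-order rate law governed by k cat/K M at low substrate concentration. The kinetic constants below are arbitrary illustrative values, the function names are hypothetical, and the low-concentration form assumes that almost all enzyme is free, so that [E] is close to [E]0.

```python
def mm_rate(s, e0, kcat, km):
    """Michaelis-Menten rate for substrate concentration s and total enzyme e0."""
    return kcat * e0 * s / (km + s)


def second_order_rate(s, e_free, kcat, km):
    """Rate written as (kcat/KM)[E][S] with free enzyme concentration e_free."""
    return (kcat / km) * e_free * s


if __name__ == "__main__":
    # Assumed values: kcat in s^-1, concentrations in mM.
    kcat, km, e0 = 50.0, 2.0, 1e-3
    for s in (0.002, 0.2, 2.0, 200.0):
        v = mm_rate(s, e0, kcat, km)
        v_low = second_order_rate(s, e0, kcat, km)
        print(f"[S] = {s:7.3f}  v = {v:.3e}  (kcat/KM)[E]0[S] = {v_low:.3e}")
```

The two expressions agree when [S] is far below K M and diverge as the enzyme saturates, at which point the rate approaches k cat[E]0.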
Applying simple transition state theory suggests two notional processes in the evolution of maximal rate (Fersht, 1974). The enzyme evolves to have a structure that is complementary to that of the transition state of the reaction, which maximises the value of $ {k}_{\mathrm{cat}}/{K}_{\mathrm{M}} $. And, if rate is the prime concern, the enzyme will also evolve to increase $ {K}_{\mathrm{M}} $ at constant $ {k}_{\mathrm{cat}}/{K}_{\mathrm{M}} $ until the $ {K}_{\mathrm{M}} $ is higher than the physiological substrate concentration. This is because low-energy intermediates can be thermodynamic pits where there is a higher $ \Delta {G}^{\ddagger } $ going from them to the transition state than there is from the initial state. The strain theories of Haldane and Pauling propose strong binding of the transition state and concomitant weak binding of the substrate, and the highest catalysis occurs when the binding energy in the E.S complex is sufficiently weak that the complex is largely dissociated and intermediates do not accumulate on reaction pathways (Fersht, 1974).
Specificity depends on the relative binding of transition states
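When two substrates A and B are competing for the active site of an enzyme, their relative rate of reaction at all concentrations of free [A] and [B] is given by (Fersht, 1974):
$ \frac{{v}_{\mathrm{A}}}{{v}_{\mathrm{B}}}=\frac{{\left({k}_{\mathrm{cat}}/{K}_{\mathrm{M}}\right)}_{\mathrm{A}}\left[\mathrm{A}\right]}{{\left({k}_{\mathrm{cat}}/{K}_{\mathrm{M}}\right)}_{\mathrm{B}}\left[\mathrm{B}\right]} $ (8)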
As $ {k}_{\mathrm{cat}}/{K}_{\mathrm{M}} $ is for the process of unbound enzyme and unbound substrate proceeding to the transition state ES‡, the specificity is independent of the interactions in the enzyme-substrate complex and depends only on the relative binding of transition states. Accordingly, both the magnitude and specificity of enzyme catalysis depend upon the binding of transition states.
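That the relative rate of two competing substrates depends only on their k cat/K M values, whatever the concentrations, can be verified with a small calculation. The sketch below assumes the simplest competitive scheme, in which each substrate binds reversibly to the same site and reacts from its own complex; all constants are invented for illustration and the function names are hypothetical.

```python
def competing_rates(a, b, e0, kcat_a, km_a, kcat_b, km_b):
    """Rates for two substrates sharing one active site (simple competition)."""
    # The same denominator appears in both rates because A and B compete
    # for the same pool of free enzyme.
    denom = 1.0 + a / km_a + b / km_b
    v_a = kcat_a * e0 * (a / km_a) / denom
    v_b = kcat_b * e0 * (b / km_b) / denom
    return v_a, v_b


def specificity_ratio(a, b, kcat_a, km_a, kcat_b, km_b):
    """Relative rate predicted from the kcat/KM values alone."""
    return (kcat_a / km_a) * a / ((kcat_b / km_b) * b)


if __name__ == "__main__":
    # Assumed constants for a cognate (A) and a non-cognate (B) substrate.
    params = dict(kcat_a=10.0, km_a=0.005, kcat_b=2.0, km_b=0.8)
    for a, b in [(0.001, 0.001), (0.1, 5.0), (50.0, 0.01)]:
        v_a, v_b = competing_rates(a, b, 1e-3, **params)
        print(f"[A]={a}, [B]={b}: vA/vB = {v_a / v_b:.4g}, "
              f"predicted = {specificity_ratio(a, b, **params):.4g}")
```

The shared denominator cancels in the ratio, which is why saturation of the enzyme by either substrate does not change the discrimination between them.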
Equation (8) is very useful for measuring the apparent contributions to binding energy of parts of substrates by comparing modified versions of them. For example, a substrate containing a particular radical can be compared with the substrate modified to have, say, an -H replacing that radical to give an empirical measure of the energetics of binding of that radical. The aminoacyl-tRNA synthetases have evolved to maximise the specificity between competing amino acids, for example, the isoleucyl-tRNA synthetase with isoleucine versus valine. We measured ratios of $ {k}_{\mathrm{cat}}/{K}_{\mathrm{M}} $ for cognate versus non-cognate amino acids with different aminoacyl-tRNA synthetases to explore the upper limits of binding energies under evolutionary pressure (Fersht, 1981).
Noncovalent interactions in enzyme transition states: LFER analysis
We would like to know how the structures of proteins change in the transition states of biological processes and how those changes contribute to them. The way experimentally to characterise those details by analogy with covalent chemistry is by using similar systematic structure-reactivity relationships, which is something I had been wanting to do since starting in enzymology. The introduction of site-directed mutagenesis at the end of the 1970s to revert mutants of bacteriophage φX174 (Hutchison et al., 1978) made this possible and laid open the new field of protein engineering, which was left largely unploughed for 4 or 5 years.
The initial paradigm: protein engineering the tyrosyl-tRNA synthetase
Gregory Winter and I began a collaboration and published the first paper on protein engineering studies on a protein of known structure (Winter et al., 1982). It may seem surprising that the practical application of the mutagenesis technology of the 1978 paper (Hutchison et al., 1978) took so long. Site-directed mutagenesis was then very difficult to do on the genes of recombinant proteins; the necessary oligonucleotides were not commercially available; only a few protein chemists were using recombinant DNA technology; and some did not believe that site-directed mutagenesis was anything more than a new form of chemical modification (reported by Bryan, 2000). I spent a sabbatical in 1978–1979 in Arthur Kornberg’s laboratory to learn recombinant DNA technology and worked on reverting mutants of φX174 to study the fidelity of DNA replication (Fersht, 1979). Gregory Winter had sequenced the genes of aminoacyl-tRNA synthetases, and we chose to do protein engineering of the tyrosyl-tRNA synthetase from Bacillus stearothermophilus. His goal was to use it as an entry into making novel proteins, paralleling synthetic organic chemistry, and he subsequently pioneered antibody engineering. My goal was to use it for structure-activity studies to understand the chemistry of noncovalent interactions in biology, paralleling physical-organic chemistry. This thermophilic enzyme is an exceptional paradigm for this latter purpose: it may be expressed in Escherichia coli, and any activity of contaminating mesophilic enzyme that could obscure steady-state kinetics can be removed by heating; it is amenable to study by pre-steady-state kinetics so intermediates can be directly observed; and, as a bonus, it is an enzyme whose chemical pathway was known but for which nothing was known about the groups involved in catalysis.
The first step in the aminoacylation of tRNA is the nucleophilic attack of the carboxylate of the amino acid on the α-phosphate of ATP to generate an enzyme-bound aminoacyl-adenylate, which subsequently transfers the tyrosine to its cognate tRNA (9).
Tyrosyl-adenylate is highly reactive in solution but is sequestered and stable in the complex with the enzyme in the absence of tRNA. The crystal structure of the complex reveals a large number of protein side chains binding the intermediate, principally by making hydrogen bonds.
The strategy for structure-activity studies of transition states of proteins
The fundamental strategy for structure-activity studies is simple and taken straight from classical chemistry: make small rational changes in structure and measure the changes in the equilibrium free energies and activation free energies of the chemical steps. Here, the steps are: (1) truncate the side chains that are hydrogen bond donors or acceptors with the substrate to give quantitative information on the effective strengths and to provide the $ \Delta \Delta {G}^0 $ terms for the application of LFERs; and (2) do kinetics on mutants to measure the corresponding $ \Delta \Delta {G}^{\ddagger } $ values. Step 1 is useful in general per se as it provides empirical quantitative data on biological interactions. The same strategy is applied analogously to other processes such as protein folding.
The first experiments measured the strengths of hydrogen bonds using Eq. (8) and the ratios of $ {k}_{\mathrm{cat}}/{K}_{\mathrm{M}} $ from steady-state kinetics for wild-type and mutant enzymes. The apparent energies spanned 0.5–1.5 kcal/mol (Fersht et al., 1985). I usually refer to these as apparent binding energies because they measure the relative binding energies that are found in practice but not absolute energies – all binding reactions in water represent an exchange reaction with H2O of solvation (Fersht et al., 1985). In general, energies from mutagenesis experiments have complex components, a point that I have emphasised from the start but that is sometimes overlooked (Fersht, 1987, 1988).
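The conversion between ratios of k cat/K M and apparent binding energies is a one-line calculation, sketched below. The temperature of 298.15 K is an assumption made for illustration, as are the function names; the sign is chosen so that a mutation that lowers k cat/K M gives a positive apparent binding energy for the deleted group.

```python
import math

R = 1.98720425864e-3  # gas constant, kcal/(mol K)


def apparent_binding_energy(kcat_km_wt, kcat_km_mut, temp=298.15):
    """Apparent contribution (kcal/mol) of a deleted group to transition state binding."""
    return R * temp * math.log(kcat_km_wt / kcat_km_mut)


def fold_change(ddg, temp=298.15):
    """Ratio of kcat/KM values that corresponds to an energy ddg in kcal/mol."""
    return math.exp(ddg / (R * temp))


if __name__ == "__main__":
    for ddg in (0.5, 1.0, 1.5):
        print(f"{ddg:.1f} kcal/mol corresponds to a "
              f"{fold_change(ddg):.1f}-fold change in kcat/KM")
```

Under these assumptions, energies of 0.5 to 1.5 kcal/mol correspond to changes in k cat/K M of only a few-fold to roughly an order of magnitude, so accurate kinetics are needed to resolve them.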
LFER analysis uncovers a novel enzyme mechanism just involving binding energy
The second step of the strategy was to determine $ \Delta \Delta {G}^{\ddagger } $ and $ \Delta \Delta {G}^0 $ for individual steps in Eq. (9) using rapid reaction pre-steady-state kinetics (Wells and Fersht, 1985). There is a progressive increase in the apparent binding energy of the hydrogen bonds, as illustrated in Figure 4 where Cys-35 and His-48 are truncated to Gly, and the energies of the mutants relative to wild-type are plotted. These progressive curves were described in terms of difference energies (Wells and Fersht, 1986). Subsequently, the ratio of $ \Delta \Delta {G}^{\ddagger }/\Delta \Delta {G}^0 $ was used and called a β-value, in homage to Brønsted (Fersht et al., 1987). This is effectively a series of two-point LFERs around the substrate for each interaction from a side chain. As seen in Figure 4, mutation of side chains that bind the sugar ring of ATP hardly weakens the binding of ATP in the E.Tyr.ATP complex; their binding energy instead develops in the E.[Tyr-ATP]‡ transition state (Leatherbarrow and Fersht, 1987). And there is a further twist on this. The tyrosyl-adenylate is a high-energy compound, as well as being highly reactive, and the equilibrium constant for its formation from enzyme-bound tyrosine and ATP would normally be very low. But the side chains bind the adenylate tightest of all, and so displace the equilibrium to stabilise its formation as well as sequester it from solution (Wells and Fersht, 1989).
Interestingly, the individual values of $ \Delta \Delta {G}^{\ddagger } $ and $ \Delta \Delta {G}^0 $ for the different mutations that bind the ribose of ATP could be combined to give sets of multi-point LFERs with β-value slopes, Figure 5 (Fersht et al., 1986, 1987). These linear plots are not generally found in mutagenesis experiments as conformational changes are usually inhomogeneous, and so comparison of two-point plots and local clustering is the mainstay of the approach. The finding of subsets of LFERs in the sets of two-point measurements is a bonus here and in folding (Fersht and Sato, 2004). The presence of a multipoint localised LFER for the residues that bind the sugar ring shows the enzyme generates a local pressure on the substrate to form the transition state, which validates Haldane’s statement: ‘Using Fischer’s lock and key simile, the key does not fit the lock perfectly, but exercises a certain strain on it’ (Haldane, 1930). The most dramatic mutational site, located by model building, has residues that barely affect the binding of the substrate or tyrosyl-adenylate product but just greatly stabilise charges developed on the α-phosphate in the transition state with $ \beta \gg 1 $, Figure 6 (Leatherbarrow et al., 1985; Fersht, 1987), more consistent with Pauling’s general idea of transition state stabilisation (Pauling, 1948).
There are no chemical groups on the enzyme directly involved in catalysis. The carboxylate of the substrate tyrosine is a competent nucleophile and it appears that the mechanism of catalysis is the utilisation of binding energy to stabilise the transition state and displace an unfavourable equilibrium. By good fortune, the first application of protein engineering to study noncovalent interactions in enzyme catalysis discovered the first example of a natural enzymatic reaction being catalysed purely by transition state stabilisation without any of the classical mechanisms of chemical catalysis.
Basis for Φ-analysis for folding studies
Our 1987 paper provided the template for the analysis and choice of mutations for the analysis of folding pathways (Fersht et al., 1987). In it, we introduced two-point βs for individual mutations from ratios of $ \Delta \Delta {G}^{\ddagger }/\Delta \Delta {G}^0 $ in the difference energy plots, and elaborated on the possible groupings of them together to give true multipoint LFERs. We classified the mutations into six categories for choosing them: Nondisruptive Deletion, ‘a side chain is replaced by another that lacks a group involved in a specific interaction’; Disruptive Deletion, ‘replacement of a side chain may lead to a perturbation elsewhere in the structure’; Conservative Substitution, ‘a side chain is replaced by one that can substitute in the same interactions’; Semiconservative Substitution, ‘some of the function is conserved on replacement’; Disruptive Substitution, ‘substitution of a large side chain for a small one in a buried close packed region of a protein’; and Nondisruptive Addition, ‘bulky groups may be added to the surface of proteins without necessarily causing perturbation of structure’. We documented the caveats about the effects of reorganisation of structure and effects of changes in solvation obscuring the analysis, which I discussed in more detail (Fersht, 1988). The protein-engineering β methodology that was developed for studying binding and catalysis was directly transferable to the problem of protein folding.
Naming the ratio $ \Delta \Delta {G}^{\ddagger }/\Delta \Delta {G}^0 $ as β, though well-intentioned, was misleading as the interpretation of protein engineering values differs in crucial ways from the Brønsted β of covalent chemistry because of the effects of mutation on denatured states among other details. β was renamed Φ in its first application to protein folding (Matouschek et al., 1989) as Φ is not strictly a linear free energy quantity but approximates to one in certain circumstances. To avoid confusion, β is now reserved for the classical β of covalent catalysis and Φ for its counterpart in protein engineering (sections ‘From β- to Φ-value analysis’ and ‘Differences between β- and Φ-value analysis’ below).
Noncovalent transition states in protein folding: Φ-value analysis
The protein folding problem
The ‘protein folding problem’ consists of three closely related puzzles: (a) What is the folding code? (b) What is the folding mechanism? (c) Can we predict the native structure of a protein from its amino acid sequence? (Dill et al., 2008). Part (c), prediction of the three-dimensional structure of a protein from its linear amino acid sequence, goes back to Anfinsen (1973); and (b), the determination of the pathway to the folded structure from the unfolded, to Levinthal (1968). The ‘code’ is how the information to fold is distributed along the structure. There is now a huge database of experimentally determined three-dimensional structures that has been the basis of very successful machine learning procedures for structure prediction, as embodied in AlphaFold (Jumper et al., 2021). However, it is a black box that does not reveal the code or the pathway (Ooka and Arai, 2023). Experimental determination of the pathway of folding of a protein is extremely difficult because a polypeptide chain progresses through a multitude of transient states as noncovalent interactions are formed and rearranged, and they are not amenable to direct experimental study.
The ‘Levinthal Paradox’ was that proteins could not fold in finite time in a random search. (See an interesting aside from Baldwin, who was present at its initial presentation (Baldwin, 2017).) To resolve this paradox, Wetlaufer proposed a nucleation-growth mechanism for the kinetics of folding, in which a small local element of secondary structure slowly formed a nucleus and the structure rapidly grew around it (Wetlaufer, 1973). Ptitsyn proposed a framework (Ptitsyn, 1973) or diffusion-collision mechanism (Karplus and Weaver, 1976), whereby a framework of elements of secondary structure formed an intermediate rapidly in which they diffused and collided to dock on each other. Another proposal was hydrophobic collapse where non-specific tertiary interactions are rapidly made to form a molten-globule, which rearranges to give the final folded structure (Ptitsyn, 1991; Figure 7). Simple theoretical models, usually based on simulations on lattices, showed that the paradox arose because the original assumption was for an unbiased search for the folded state on a flat energy surface. In contrast, mechanisms in which the acquisition of native interactions, gradual or otherwise, funnels folding to the desired state obviated the paradox (Sali et al., 1994a; Bryngelson et al., 1995; Dill et al., 1995; Onuchic et al., 1995; Karplus, 2011; Takada, 2019; Finkelstein et al., 2022).
There was, however, an apparent conflict between the ‘classical view’ of protein folding proceeding along defined pathways with intermediates and a supposed ‘new’ view of folding on an energy landscape (Baldwin, 1995). From these theoretical studies, we now envisage proteins folding on multi-dimensional energy landscapes with a large number of conformations in the denatured state ensemble with high entropy converging on progressively smaller ensembles in transition states and intermediates to the final structure, with the gain in enthalpy from native interactions compensating the loss of entropy. We can represent these ensembles as states along a two-dimensional energy diagram, Figure 8 (Eaton et al., 1996). It must be emphasised that what the experimentalist sees as the denatured state, D, under conditions that favour folding, Dphys, is not usually a random coil, U, but a more structured state varying from having flickering interactions (Figure 8a) to a fairly structured on- or off-pathway intermediate (Figure 8b). The basics of protein folding studies are discussed in more detail in Fersht (1999, 2017, 2018, Chaps. 17–19).
Nucleation mechanisms went out of favour because the early experimental examples of protein folding were found to proceed via intermediates on the pathway (Ptitsyn, 1987; Kim and Baldwin, 1990), and nucleation is characterised by not having intermediates that would accumulate.
From β- to Φ-value analysis
Studies on the effects of point mutations on folding kinetics had begun in the late 1980s with Matthews analysing natural mutants of the α-subunit of tryptophan synthase (Matthews, 1987). Goldenberg and colleagues engineered mutants of bovine pancreatic trypsin inhibitor (Goldenberg et al., 1989). We began applying the technology and Φ-strategy developed on the tyrosyl-tRNA synthetase to the folding of a small RNase, Barnase (Kellis et al., 1988; Sali et al., 1988; section ‘Barnase: the test bed’). The two-point LFER approach used for mapping the progress of noncovalent interactions in enzyme catalysis is directly applicable to studying transition states and transient intermediates in folding. But there are crucial refinements, which were laid out in the initial LFER paper (Matouschek et al., 1989) and subsequently expanded in more depth (Fersht et al., 1992; Fersht and Sato, 2004), relying on the thermodynamic cycles in Figure 9, which are essential to the analysis (the use of such alchemical cycles was perhaps not obvious and was queried at the time (Buchner and Kiefhaber, 1990)).
Accordingly, we used the same strategy as before: (1) make chemically sensible mutations in a suitable protein by truncating side chains to remove stabilising interactions (the nondisruptive deletions, especially of hydrophobic side chains, which avoid stereochemical clashes or unstable charges within the protein); (2) measure the change in the free energy of folding of the protein on mutation, $ \Delta \Delta {G}_{\mathrm{N}-\mathrm{D}}=\Delta {G}_{\mathrm{N}\prime -\mathrm{D}\prime }-\Delta {G}_{\mathrm{N}-\mathrm{D}} $, where N = native state, D = denatured state, and N′ and D′ refer to mutants; and (3) measure the rate constants of folding, $ {k}_{\mathrm{f}} $, of the wild-type and mutant proteins to determine the changes in the free energies of activation, $ \Delta \Delta {G}_{\ddagger -\mathrm{D}}=\Delta {G}_{\ddagger \prime -\mathrm{D}\prime }-\Delta {G}_{\ddagger -\mathrm{D}}=- RT\ln \left({k}_{\mathrm{f}}^{\prime }/{k}_{\mathrm{f}}\right) $, and the rate constants for unfolding, $ {k}_{\mathrm{u}} $, to give $ \Delta \Delta {G}_{\ddagger -\mathrm{N}}=\Delta {G}_{\ddagger \prime -\mathrm{N}\prime }-\Delta {G}_{\ddagger -\mathrm{N}}=- RT\ln \left({k}_{\mathrm{u}}^{\prime }/{k}_{\mathrm{u}}\right) $.
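We then defined a parameter Φ for folding. In the direction of folding: $ {\varPhi}_{\mathrm{F}}=\Delta \Delta {G}_{\ddagger -\mathrm{D}}/\Delta \Delta {G}_{\mathrm{N}-\mathrm{D}} $.
And for unfolding: $ {\varPhi}_{\mathrm{U}}=\Delta \Delta {G}_{\ddagger -\mathrm{N}}/\Delta \Delta {G}_{\mathrm{D}-\mathrm{N}} $.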
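We can derive from the thermodynamic cycles in Figure 9 that $ \Delta \Delta {G}_{\mathrm{D}-\mathrm{N}}=\Delta {G}_{\mathrm{D}\prime -\mathrm{D}}-\Delta {G}_{\mathrm{N}\prime -\mathrm{N}} $; $ \Delta \Delta {G}_{\ddagger -\mathrm{N}}=\Delta {G}_{\ddagger \prime -\ddagger }-\Delta {G}_{\mathrm{N}\prime -\mathrm{N}} $; and $ \Delta \Delta {G}_{\ddagger -\mathrm{D}}=\Delta {G}_{\ddagger \prime -\ddagger }-\Delta {G}_{\mathrm{D}\prime -\mathrm{D}} $. Accordingly, $ {\varPhi}_{\mathrm{F}}=\left(\Delta {G}_{\ddagger \prime -\ddagger }-\Delta {G}_{\mathrm{D}\prime -\mathrm{D}}\right)/\left(\Delta {G}_{\mathrm{N}\prime -\mathrm{N}}-\Delta {G}_{\mathrm{D}\prime -\mathrm{D}}\right) $ and $ {\varPhi}_{\mathrm{U}}=\left(\Delta {G}_{\ddagger \prime -\ddagger }-\Delta {G}_{\mathrm{N}\prime -\mathrm{N}}\right)/\left(\Delta {G}_{\mathrm{D}\prime -\mathrm{D}}-\Delta {G}_{\mathrm{N}\prime -\mathrm{N}}\right) $.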
Ignoring the changes in covalent energies on mutation as they cancel out in subsequent calculations, the term $ \Delta {G}_{\mathrm{N}\prime -\mathrm{N}}=\Delta {G}_{\left(\mathrm{N}\prime -\mathrm{N}\right)\mathrm{noncovalent}}+\Delta {G}_{\left(\mathrm{N}\prime -\mathrm{N}\right)\mathrm{reorg}} $, where $ \Delta {G}_{\left(\mathrm{N}\prime -\mathrm{N}\right)\mathrm{noncovalent}} $ is the change in noncovalent interactions from the mutation and $ \Delta {G}_{\left(\mathrm{N}\prime -\mathrm{N}\right)\mathrm{reorg}} $ is any energetics of reorganisation of the structure of the folded protein. There are similar equations involving $ \Delta {G}_{\mathrm{reorg}} $ for the change in energetics of the denatured and transition states including changes in solvation, $ \Delta {G}_{\mathrm{solv}} $. For denatured states that are highly unfolded, $ \Delta {G}_{\mathrm{solv}} $ is the major term in $ \Delta {G}_{\mathrm{reorg}} $ but often for the interior in folded proteins $ \Delta {G}_{\mathrm{solv}}=0 $.
Building on our classification of mutations (Fersht et al., Reference Fersht, Leatherbarrow and Wells1987) and thermodynamic analysis (Fersht, Reference Fersht1988), the first paper spelled out clearly what type of mutations to make in the light of the incursion of $ \Delta {G}_{\mathrm{solv}} $, and how the choice affects the observed values of Φ (Matouschek et al., Reference Matouschek, Kellis, Serrano and Fersht1989). Assuming that the effects of mutation on the noncovalent interactions are localised to the site of the side chain, the two extreme situations are readily interpretable (Figure 10). If the side chain is as unstructured in the transition state as in the denatured state, $ \Delta {G}_{\ddagger \prime -\ddagger }=\Delta {G}_{\mathrm{D}\prime -\mathrm{D}} $ and so $ {\varPhi}_{\mathrm{F}}=0 $ and $ {\varPhi}_{\mathrm{U}}=1 $. Conversely, if the side chain is as structured in the transition state as in the native state, $ \Delta {G}_{\ddagger \prime -\ddagger }=\Delta {G}_{\mathrm{N}\prime -\mathrm{N}} $, and so $ {\varPhi}_{\mathrm{F}}=1 $ and $ {\varPhi}_{\mathrm{U}}=0 $. This is the same as the extreme cases of the Brønsted β. For mutations of larger to smaller aliphatic side chains, which, as we cannot emphasise enough, are the most suitable, $ \Delta {G}_{\mathrm{D}\prime -\mathrm{D}} $ (i.e. $ \Delta {G}_{\mathrm{reorg}} $) should be small. For example, the mutations Ile→Ala and Ile→Val have $ \Delta {G}_{\mathrm{solv}} $ = −0.21 and −0.16 kcal/mol, respectively. The deletion of a −CH2− group will lead to minimal $ \Delta {G}_{\mathrm{reorg}} $. Accordingly, $ {\varPhi}_{\mathrm{F}} $ is related to the extent of local structure formation in the native and transition states (Matouschek et al., Reference Matouschek, Kellis, Serrano and Fersht1989; Fersht et al., Reference Fersht, Matouschek and Serrano1992; Fersht and Sato, Reference Fersht and Sato2004). This is especially so for Ala→Gly scanning in helices (section ‘Ala→Gly scanning of secondary structure’).
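As an illustration only, the arithmetic of steps (2) and (3) can be sketched in a few lines of Python. The sketch assumes strict two-state kinetics, so that $ \Delta \Delta {G}_{\mathrm{N}-\mathrm{D}}=\Delta \Delta {G}_{\ddagger -\mathrm{D}}-\Delta \Delta {G}_{\ddagger -\mathrm{N}} $, and takes Φ as the ratio of the activation term to the equilibrium term. The function names, the temperature and the rate constants are hypothetical and are not taken from the original work.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def ddg_activation(k_mutant, k_wild_type, temperature=298.15):
    """Change in activation free energy on mutation: -RT ln(k'/k), kcal/mol."""
    return -R_KCAL * temperature * math.log(k_mutant / k_wild_type)


def phi_values(kf_wt, ku_wt, kf_mut, ku_mut, temperature=298.15):
    """Return (phi_F, phi_U, ddG_N-D) from folding and unfolding rate constants.

    Assumption: two-state kinetics, all rate constants measured under the
    same conditions (same denaturant concentration).
    """
    ddg_ts_d = ddg_activation(kf_mut, kf_wt, temperature)  # ddG(ts - D)
    ddg_ts_n = ddg_activation(ku_mut, ku_wt, temperature)  # ddG(ts - N)
    ddg_n_d = ddg_ts_d - ddg_ts_n                          # ddG(N - D)
    if abs(ddg_n_d) < 1e-12:
        raise ValueError("mutation does not change stability; phi is undefined")
    phi_f = ddg_ts_d / ddg_n_d
    phi_u = ddg_ts_n / (-ddg_n_d)  # ddG(D - N) = -ddG(N - D)
    return phi_f, phi_u, ddg_n_d


# Hypothetical mutant that folds 10 times more slowly but unfolds at the
# wild-type rate: the deleted interaction is as formed in the transition
# state as in the native state.
print(phi_values(kf_wt=10.0, ku_wt=1e-3, kf_mut=1.0, ku_mut=1e-3))
```

With these definitions the two values always sum to 1, so in practice one of them is a check on the other only when the equilibrium term is measured independently.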
Differences between β- and Φ-value analysis
In many ways, the interpretation of Φ-values is analogous to that of β, but there are important differences that must be minimised for the successful application of Φ. In the classical chemical LFERs, the structural changes made in the reagents are at positions separated from the reacting bonds and the effects of the substituents transmitted through the molecule. $ \Delta {G}_{\mathrm{reorg}} $ terms for β in covalent chemistry are ignored because they are relatively small or non-existent. Basically, β (or α) = $ \mathrm{\partial \Delta }{G}_{\ddagger -\mathrm{S}}/\mathrm{\partial \Delta }{G}_{\mathrm{P}-\mathrm{S}} $ in Figure 3. In the protein engineering LFERs, the very groups making the bonds are changed and there can be a significant $ \Delta {G}_{\mathrm{reorg}} $ in the native state and possibly in a structured denatured state. There can also be $ \Delta {G}_{\mathrm{solv}} $ terms for both states. To acknowledge these differences, as mentioned previously, β was renamed Φ, and Φ-analysis experiments designed to minimise or accommodate those ∆G terms (Matouschek et al., Reference Matouschek, Kellis, Serrano and Fersht1989). When this is done, Φ is very similar to β.
(Water molecules surrounding the reactants and catalyst in classical chemical LFER experiments may rearrange on changing a substituent and cause significant changes of $ \Delta {H}_{\mathrm{reorg}}^{\ddagger } $ and -$ T\Delta {S}_{\mathrm{reorg}}^{\ddagger } $ but those changes tend to compensate and cancel out in $ \Delta \Delta G $, although they do complicate attempts to measure the $ \Delta H $ and $ \Delta \mathrm{S} $ components of the actual chemical steps.)
REFERs: β Tanford (βT), Leffler/Brønsted plots and Φ
Protein folding has other important differences such as the difficulty in choosing a suitable reaction coordinate. A global average may be defined for overall folding but the formation of structure is not homogeneous and the local reaction coordinates for substructures are what define the formation of transition states and intermediates. The interpretation of Φ-values is more complicated than that of β and extra procedures may be involved. A simple overall reaction coordinate was introduced by Tanford (Reference Tanford1968, Reference Tanford1970). All parts of a protein are stabilised by denaturant, Den, and its free energy decreases linearly with [Den] and the solvent accessible surface area (SASA). There is a decrease in SASA on going from $ \mathrm{D}\to \ddagger \to \mathrm{N} $, so $ \Delta {G}_{\ddagger -\mathrm{D}}=\Delta {G^0}_{\ddagger -\mathrm{D}}+{m}_{\ddagger -\mathrm{D}}\left[\mathrm{Den}\right] $; $ \Delta {G}_{\ddagger -\mathrm{N}}=\Delta {G^0}_{\ddagger -\mathrm{N}}-{m}_{\ddagger -\mathrm{N}}\left[\mathrm{Den}\right] $; and $ \Delta {G}_{\mathrm{D}-\mathrm{N}}=\Delta {G^0}_{\mathrm{D}-\mathrm{N}}-{m}_{\mathrm{D}-\mathrm{N}}\left[\mathrm{Den}\right] $; where for 2-state kinetics $ {m}_{\mathrm{D}-\mathrm{N}}={m}_{\ddagger -\mathrm{D}}+{m}_{\ddagger -\mathrm{N}} $ (all m-values +ve). The relative change in surface area in the transition state, which I renamed $ {\beta}_{\mathrm{T}} $ in homage to Tanford (Matouschek et al., Reference Matouschek, Otzen, Itzhaki, Jackson and Fersht1995), is given by: $ {\beta}_{\mathrm{T}}={m}_{\ddagger -\mathrm{D}}/{m}_{\mathrm{D}-\mathrm{N}} $. The Tanford plot is a true REFER.
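As a minimal sketch of the last relation, $ {\beta}_{\mathrm{T}} $ can be computed from the two kinetic m-values alone, taking the equilibrium m-value as their sum (the two-state assumption stated above). The function name and the numbers are hypothetical.

```python
def beta_tanford(m_ts_d, m_ts_n):
    """Tanford beta from kinetic m-values (both entered as positive numbers).

    m_ts_d: denaturant dependence of the folding barrier
    m_ts_n: denaturant dependence of the unfolding barrier
    Assumption: two-state kinetics, so the equilibrium m-value is their sum.
    """
    if m_ts_d < 0 or m_ts_n < 0:
        raise ValueError("m-values are taken as positive magnitudes")
    m_equilibrium = m_ts_d + m_ts_n
    if m_equilibrium == 0:
        raise ValueError("no denaturant dependence; beta_T is undefined")
    return m_ts_d / m_equilibrium


# Hypothetical m-values in kcal/(mol M): a transition state that has buried
# about 60% of the surface area buried in the native state.
print(beta_tanford(1.2, 0.8))
```

A value near 0 would indicate a transition state as exposed to solvent as the denatured state, and a value near 1 one as compact as the native state.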
Leffler plots, which are also called Brønsted plots, of $ \Delta {G}_{\ddagger -\mathrm{N}} $ versus $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ or $ \Delta {G}_{\ddagger -\mathrm{D}} $ versus $ \Delta {G}_{\mathrm{N}-\mathrm{D}} $ also give an indication of the overall change in energetics. However, they can exhibit scatter depending on the inhomogeneity of structure formation in the transition state (see later in the discussion of Figure 15, section ‘Chymotrypsin inhibitor 2: computer simulations’). Just as the finding of multipoint LFERs/REFERs for the tyrosyl-tRNA synthetase is a bonus, resulting from concerted movement of parts of the binding site relative to the substrate, the same can sometimes be found for Φ-analysis. Part of a helix in barnase, for example, is uniformly present in the transition state and its formation can be benignly probed by truncating surface-exposed side chains to Ala and then Gly to give a series of overlapping 3-point Leffler/Brønsted plots (Matthews and Fersht, Reference Matthews and Fersht1995; Fersht and Sato, Reference Fersht and Sato2004). Accordingly, Φ is a true REFER for those mutations.
ψ-value analysis
Disulphide crosslinks tie together residues in both the transition states and denatured states as well as native states, with predictable effects on kinetics that can detect when the linked elements of structure are formed during the folding pathway of wild-type protein (Clarke and Fersht, Reference Clarke and Fersht1993). This is a highly specific procedure and very limited in applicability. Sosnick has pioneered a more general mutational procedure for this crosslinking approach for surface residues, ψ-value analysis (Krantz and Sosnick, Reference Krantz and Sosnick2001; Baxa and Sosnick, Reference Baxa and Sosnick2022). Pairs of histidine residues as metal-binding sites are introduced on the surface typically close to each other in the folded state, for example, at positions i, i+4 in an α-helix or at neighbouring strands in a β-sheet (‘nondisruptive additions’). A metal ion can then crosslink the pair. This contrasts with Φ-value analysis in that ψ-analysis adds new interactions to the protein and analyses their effects on the mutants whereas Φ-analysis uses non-disruptive deletions that probe the extent of formation of interactions present in the wild-type structure. ψ-value analysis is not a REFER but the values of 1 or 0 should be interpretable (Fersht, Reference Fersht2004a; Bodenreider and Kiefhaber, Reference Bodenreider and Kiefhaber2005). Indeed, simulation of the transition state for the folding of ubiquitin is consistent with ψ-values of 1 or 0 but not the fractional ones. It is a useful tool for those values (Varnai et al., Reference Varnai, Dobson and Vendruscolo2008).
Interpretation of Φ-values
Weak, medium, and strong categorisation of Φ
The values of $ \varPhi =0 $ or 1 may be interpreted with confidence. Mutations such as Ile→Val, Ala→Gly, and Thr→Ser are particularly suitable and Ile→Ala can be good (see section ‘Experimental approach to Φ-value analysis’). In general, the Φ-values should be interpreted only semi-quantitatively and with caution: $ 0<{\varPhi}_{\mathrm{F}}<0.2 $, ‘low’ or ‘weak’, little or no structure in transition state; $ 0.3<{\varPhi}_{\mathrm{F}}<0.6 $, ‘medium’, significant to strong; and $ 0.7<{\varPhi}_{\mathrm{F}}<1 $, ‘high’ or ‘strong’, very significant structure (with flexibility as to the boundaries). This is like the weak, medium and strong NOEs used as distance constraints in molecular dynamics (MD) calculations in structure determination by NMR (Fersht and Sato, Reference Fersht and Sato2004; Garcia-Mira et al., Reference Garcia-Mira, Boehringer and Schmid2004), and such classification has been applied with success in computer simulations of the structure of transition states (Geierhaas et al., Reference Geierhaas, Salvatella, Clarke and Vendruscolo2008). As discussed later, Φ-values may be powerfully combined with computer simulations of unfolding and folding trajectories to give true atomic-level descriptions of protein folding pathways. It is important to make many mutations and over-sample to find consistent results that then give reliable information. Φ-values by themselves can give gross and near atomic resolution details on the structures of transition states. Some areas are more problematic; I next describe them and how they may be resolved.
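The three bands quoted above translate directly into a small lookup, sketched here for illustration. How to treat the gaps between the bands and values outside 0 to 1 is not specified in the text, so the ‘borderline’ and ‘non-classical’ labels and the inclusive treatment of the band edges are assumptions of this sketch.

```python
def classify_phi(phi_f):
    """Assign a semi-quantitative label to a folding phi-value.

    Bands follow the text: up to 0.2 weak, 0.3 to 0.6 medium, 0.7 to 1 strong.
    Assumptions of this sketch: band edges are inclusive, values that fall
    between bands are 'borderline', values outside 0..1 are 'non-classical'.
    """
    if phi_f < 0.0 or phi_f > 1.0:
        return "non-classical"
    if phi_f <= 0.2:
        return "weak"
    if 0.3 <= phi_f <= 0.6:
        return "medium"
    if phi_f >= 0.7:
        return "strong"
    return "borderline"


# Hypothetical values for a handful of mutants.
for value in (0.05, 0.25, 0.45, 0.9, 1.3):
    print(value, classify_phi(value))
```

Because the boundaries are flexible, a label of this kind is best used to group many mutations in one region of the protein rather than to interpret a single value.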
Φ and non-native interactions: $ \varPhi <0 $ or $ \varPhi >1 $
Φ, like β, is predicated on a single bond or set of bonds being formed, with limits of 0 for no formation and 1 for complete. It parallels in some ways the Gō model in simulation that assumes that only native contacts are involved in the folding process and they consolidate (Taketomi et al., Reference Taketomi, Ueda and Go1975; Takada, Reference Takada2019). If there are non-native interactions in transition states or intermediates, then unnatural values of Φ of <0 or >1 may be observed, and they are a useful signal for that. Residual structure in the denatured state can give rise to non-classical values (Cho and Raleigh, Reference Cho and Raleigh2006). Small two-state single-domain proteins are the most likely not to involve non-native interactions (Best and Hummer, Reference Best and Hummer2016), and Gō model simulations can fit well with Φ measurements (Clementi et al., Reference Clementi, Nymeyer and Onuchic2000; Wu et al., Reference Wu, Zhang, Qin, Liu and Wang2008; Naganathan and Orozco, Reference Naganathan and Orozco2011).
Double-mutant cycles to identify native partners in interactions
Φ-value analysis relates changes in energy to changes in structure and assumes that the native interactions are involved, and there can be complications from non-native interactions. Strong evidence about which residues interact can be found by the procedure of double-mutant cycles (Figure 11), first introduced for the tyrosyl-tRNA synthetase (Carter et al., Reference Carter, Winter, Wilkinson and Fersht1984). Two residues that interact in the native state of the protein are mutated individually, and then pairwise. An interaction energy between just those two residues, $ \Delta \Delta {G}_{\mathrm{int}} $, is measured without complications from an unfolded denatured state (Fersht et al., Reference Fersht, Matouschek and Serrano1992). The same is true for the interaction in the transition state, $ \Delta \Delta {G^{\ddagger}}_{\mathrm{int}} $. Values of $ {\varPhi}_{\mathrm{int}}=\Delta \Delta {G^{\ddagger}}_{\mathrm{int}}/\Delta \Delta {G}_{\mathrm{int}} $ show with high certainty whether or not and by how much those interactions are formed in the transition state (Horovitz and Fersht, Reference Horovitz and Fersht1990, Reference Horovitz and Fersht1992; Horovitz et al., Reference Horovitz, Serrano and Fersht1991; Fersht et al., Reference Fersht, Matouschek and Serrano1992; Pagano et al., Reference Pagano, Toto, Malagrino, Visconti, Jemth and Gianni2021). They can be used to provide constraints for computer simulations of transition state structure (Salvatella et al., Reference Salvatella, Dobson, Fersht and Vendruscolo2005). Multi-mutant cycles can also be performed (Horovitz and Fersht, Reference Horovitz and Fersht1990, Reference Horovitz and Fersht1992).
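The bookkeeping of a double-mutant cycle can be sketched as follows. The sign convention chosen here (sum of the two single-mutant effects minus the double-mutant effect) is an assumption of this sketch; conventions differ between papers, but the ratio $ {\varPhi}_{\mathrm{int}} $ is unaffected provided the same convention is used for the transition state and for the native state. All numbers are hypothetical.

```python
def interaction_energy(ddg_single_a, ddg_single_b, ddg_double):
    """Coupling energy from a double-mutant cycle, in the units supplied.

    Assumed convention: effect of mutating A alone, plus effect of mutating
    B alone, minus effect of mutating both. Zero means the two side chains
    behave independently.
    """
    return ddg_single_a + ddg_single_b - ddg_double


def phi_int(transition_state_cycle, native_cycle):
    """Ratio of transition-state to native-state interaction energies.

    Each argument is a tuple (ddg_single_a, ddg_single_b, ddg_double),
    all measured relative to the same reference state.
    """
    ddg_int_native = interaction_energy(*native_cycle)
    if ddg_int_native == 0:
        raise ValueError("residues do not interact in the native state")
    return interaction_energy(*transition_state_cycle) / ddg_int_native


# Hypothetical free energy changes in kcal/mol.
native = (1.5, 1.2, 2.0)            # coupling of 0.7 in the native state
transition_state = (0.9, 0.7, 1.25)  # coupling of 0.35 in the transition state
print(phi_int(transition_state, native))
```

In this made-up example the pairwise interaction would be about half formed in the transition state.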
Parallel pathways and fractional Φ-values
A fractional Φ-value is usually interpreted as arising from a single transition state ensemble that has weakened interactions. But there could be parallel pathways, as in Figure 12, with some having full structure at the point of mutation and others disordered, and these could give an apparent fractional value (Baldwin, Reference Baldwin1994; Sali et al., Reference Sali, Shakhnovich and Karplus1994b). This can be tested, however, by making a series of additional mutants that would have different and predictable effects on the disordered and structured pathway states; the fractional values of Φ for the protein CI2 (below) are consistent with a single pathway through the transition state (Fersht et al., Reference Fersht, Itzhaki, elMasry, Matthews and Otzen1994).
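One way to see why a series of mutants can discriminate between the two explanations is a toy model, sketched below. It is an illustration under stated assumptions and not a reproduction of the published analysis: two parallel routes are assumed, one fully structured at the mutated site (its barrier rises by the full destabilisation of the native state) and one fully disordered there (unaffected). The function name, rate constants and temperature are hypothetical.

```python
import math

RT = 1.987e-3 * 298.15  # kcal/mol at 25 degrees C


def apparent_phi_parallel(ddg_n_d, k_structured=1.0, k_disordered=1.0):
    """Apparent folding phi for two parallel pathways (toy model).

    ddg_n_d: destabilisation of the native state by the mutation (kcal/mol).
    The structured pathway is slowed by exp(-ddg/RT); the disordered pathway
    is unchanged. Observed folding rate is the sum of the two.
    """
    if ddg_n_d == 0:
        raise ValueError("phi is undefined for a mutation with no effect")
    kf_wild_type = k_structured + k_disordered
    kf_mutant = k_structured * math.exp(-ddg_n_d / RT) + k_disordered
    ddg_ts_d = -RT * math.log(kf_mutant / kf_wild_type)
    return ddg_ts_d / ddg_n_d


# The apparent phi drifts downwards as the mutation becomes more
# destabilising, because flux switches to the unaffected pathway.
for ddg in (0.1, 0.5, 1.0, 2.0, 3.0):
    print(ddg, round(apparent_phi_parallel(ddg), 3))
```

In this toy model the apparent value depends on the size of the perturbation, whereas a single transition state with partly formed interactions would be expected to give a roughly constant fractional value across mutants of different magnitude at the same site.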
Residual structure in denatured states
Denatured states can have residual structure even at high concentrations of denaturants (Dill and Shortle, Reference Dill and Shortle1991; Cho and Raleigh, Reference Cho and Raleigh2006) and especially at low concentrations where the most stable denatured state may be a folding intermediate or an off-pathway state with non-native interactions. Residual structure is melted out more gradually by denaturants and temperature as there are smaller changes in surface area. These states can severely affect folding kinetics of all types. But unfolding kinetics and $ \Delta {G}_{\ddagger -\mathrm{N}} $ from the folded state are unaffected as the denatured states are after the rate-determining transition state. $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ is measured at higher concentrations of denaturant but there could be significant $ \Delta {G}_{\left(\mathrm{D}-\mathrm{N}\right)\mathrm{reorg}} $ terms with mutations affecting structure in the denatured state. Values of $ {\varPhi}_{\mathrm{U}} $ close to 0 will be relatively unaffected but values closer to 1 may have artefacts. For these reasons, we gave up the terminology ‘U’ = unfolded for the denatured state and call it D or Dphys under physiological conditions.
Experimental approach to Φ-value analysis
$ \Delta {G}_{\mathrm{reorg}} $ and choice of mutation
The presence of $ \Delta {G}_{\left(\mathrm{N}\prime -\mathrm{N}\right)\mathrm{reorg}} $ and similar terms dictates the choice of mutation. To recapitulate earlier points, a mutation of a buried side chain to a larger one will likely cause a significant $ \Delta {G}_{\left(\mathrm{N}\prime -\mathrm{N}\right)\mathrm{reorg}} $ as will changes in buried charges. Accordingly, mutations that preferably delete interactions, non-disruptive deletions, or are isosteric are most suitable. The changes in energetics must be sufficiently large to be able to be measured accurately but not too large, otherwise the position of the transition state may be perturbed or there will be a local rearrangement of structure on making a too-large deletion.
Our preferred strategy is: (1) mutate the buried hydrophobic moieties Ile→Val→Ala→Gly; Leu→Ala→Gly; Thr→Ser; and Phe→Ala→Gly. Deletion of a −CH2− has minimal effects on the solvation energies of the denatured state and low $ \Delta {G}_{\mathrm{reorg}} $ in all states; (2) make a wider range of surface mutations; (3) mutate Ala→Gly positions in secondary structural regions (‘Ala→Gly scanning’, see section ‘Ala→Gly scanning of secondary structure’), especially in α-helices, because they provide an exquisite probe of secondary structure in the helix since mutation perturbs mainly intra-helical interactions; and (4) use, sparingly, double-mutant cycles in which changes in solvation and reorganisation energies tend to cancel out. Mutation of a long aliphatic side chain in the hydrophobic core, such as that of isoleucine, can give information on the degree of consolidation of the core on mutation to Ala, and then the structure of the helix during that process on subsequent mutation to Gly. Successive deletion of different parts of larger side chains may give multiple probes of structure (Serrano et al., Reference Serrano, Neira, Sancho and Fersht1992b). These types of mutation tend to give values of $ \Delta \Delta {G}_{\mathrm{D}-\mathrm{N}} $ in the range of 0.6–2 kcal/mol, which can be measured with adequate precision and are typical of the interactions that report on secondary structure as well as local interactions in hydrophobic cores (Friel et al., Reference Friel, Capaldi and Radford2003; Fersht and Sato, Reference Fersht and Sato2004; Garcia-Mira et al., Reference Garcia-Mira, Boehringer and Schmid2004; Sato et al., Reference Sato, Religa and Fersht2006). Larger changes can lead to a movement of the transition state on the energy landscape (Fersht and Sato, Reference Fersht and Sato2004).
Ala→Gly scanning of secondary structure
Mutation of Ala→Gly in helices is a particularly clean tool (Matthews and Fersht, Reference Matthews and Fersht1995). The CH3− side chain of Ala stabilises an α-helix relative to the H− of Gly mainly by burial of the hydrophobic surface area, by 0.4 to 2 kcal/mol, and mutation has minimal structural perturbation (Serrano et al., Reference Serrano, Matouschek and Fersht1992a, Reference Serrano, Sancho, Hirshberg and Fershtc). Further, unfolded alanine- and glycine-containing peptides are approximately isoenergetic in noncovalent interactions (Scott et al., Reference Scott, Alonso, Sato, Fersht and Daggett2007) and so mutation of Ala→Gly has minimal $ \Delta {G}_{\mathrm{reorg}} $ terms in both states. Accordingly, $ {\varPhi}_{\mathrm{Ala}\to \mathrm{Gly}} $ is the most reliable measure of structure formation of all Φ-values.
Experimental determination of $ \Delta G\mathrm{s} $
The changes in $ \Delta {G}^{\ddagger } $ and $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ are mostly measured from variation of the rate constants of folding and unfolding and the equilibrium constant with concentration of a denaturant such as urea or guanidinium chloride. Usually, logarithms of the rate and equilibrium constants for unfolding increase linearly with concentrations of denaturant under the accessible experimental conditions, but sometimes with small deviations at very low concentrations (Tanford, Reference Tanford1968, Reference Tanford1970). For two-state kinetics, the logarithms of the rate constants for folding decrease linearly with denaturant concentration (Tanford et al., Reference Tanford, Aune and Ikai1973) and plots of the combinations of log $ {k}_{\mathrm{u}} $ and log $ {k}_{\mathrm{f}} $ give so-called chevron plots as in Figure 13 (Jackson and Fersht, Reference Jackson and Fersht1991a). For multi-state systems, the refolding limb is usually characterised by ‘rollover’ where the folding rate constant tends to plateau at low denaturant concentration as there are changes in rate-determining steps, Figure 13, inset (Matouschek et al., Reference Matouschek, Kellis, Serrano and Fersht1989). The proteins in Figure 13 refold on the tens of ms time scale, the kinetics being measured by rapid-mixing stopped-flow methods. Smaller single-domain proteins can fold even faster, on the μs time scale, as for the 37-residue Formin-Binding Protein, FBP28, a canonical three-stranded β-sheet WW domain, Figure 14 (Petrovich et al., Reference Petrovich, Jonsson, Ferguson, Daggett and Fersht2006). Its kinetics of folding and unfolding are too fast for rapid mixing but are readily and accurately measured using temperature-jump apparatus. The unfolding of such small proteins exposes only a relatively small amount of buried surface area and so the transition is spread out over a wide range of concentration of denaturant.
The FBP28 domain has a very polarised transition state as readily seen directly from the chevron plots. Some plots have the folding limbs nearly superposed, showing $ \Delta \Delta {G}_{\ddagger -\mathrm{D}}\sim 0 $ and so $ {\varPhi}_{\mathrm{F}}\sim 0/\Delta \Delta {G}_{\mathrm{N}-\mathrm{D}} $, that is ~0 for non-zero values of $ \Delta \Delta {G}_{\mathrm{N}-\mathrm{D}} $. Conversely, other plots have the unfolding limbs nearly superposed, showing $ \Delta \Delta {G}_{\ddagger -\mathrm{N}}\sim 0 $ and so $ {\varPhi}_{\mathrm{U}}\sim 0/\Delta \Delta {G}_{\mathrm{D}-\mathrm{N}} $, that is ~0. As $ {\varPhi}_{\mathrm{U}}+{\varPhi}_{\mathrm{F}}=1 $ for two-state kinetics, these chevrons of $ {\varPhi}_{\mathrm{U}}\sim 0 $ have $ {\varPhi}_{\mathrm{F}}\sim 1 $. These values of $ {\varPhi}_{\mathrm{F}}\sim 0 $ or 1 are also determined with the highest confidence as the errors around $ \Delta \Delta {G}_{\ddagger -\mathrm{D}} $ and $ \Delta \Delta {G}_{\ddagger -\mathrm{N}}\sim 0 $ are small. An error of, say, ±0.1 for a mean of $ {\varPhi}_{\mathrm{U}}=0.05 $ is a very high percentage error in the absolute value of $ {\varPhi}_{\mathrm{U}} $ but in the context of where $ {\varPhi}_{\mathrm{U}} $ is on the scale of 0 to 1 is sufficiently accurate for the purposes of interpretation. Accordingly, the most readily interpretable values of Φ, 0 and 1, are the ones most amenable to confident measurement.
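The shape of a two-state chevron plot can be reproduced with a short numerical sketch. It assumes that the logarithms of the folding and unfolding rate constants are strictly linear in denaturant concentration and that the observed relaxation rate constant is their sum; rollover is not modelled. The parameter names and values are hypothetical and are not data from Figure 13 or 14.

```python
import math


def chevron_ln_kobs(den, kf_water, ku_water, slope_f, slope_u):
    """ln(k_obs) for two-state folding at denaturant concentration den (M).

    kf_water, ku_water: rate constants in the absence of denaturant (s^-1)
    slope_f, slope_u:   positive slopes of ln k versus [Den] (M^-1)
    """
    k_fold = kf_water * math.exp(-slope_f * den)
    k_unfold = ku_water * math.exp(slope_u * den)
    return math.log(k_fold + k_unfold)


def midpoint(kf_water, ku_water, slope_f, slope_u):
    """Denaturant concentration at which k_fold equals k_unfold."""
    return math.log(kf_water / ku_water) / (slope_f + slope_u)


# Hypothetical wild-type parameters.
KF, KU, SF, SU = 50.0, 1e-4, 2.0, 1.0
for den in (0.0, 2.0, 4.0, 6.0, 8.0):
    print(den, round(chevron_ln_kobs(den, KF, KU, SF, SU), 2))
print("midpoint (M):", round(midpoint(KF, KU, SF, SU), 2))
```

Overlaying a mutant curve in which only `ku_water` is changed leaves the folding limb superposed on the wild type, which is the visual signature of a Φ-value for folding close to 0 described above.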
To optimise precision, I advocate measuring differences in $ \Delta {G}^{\ddagger } $ and $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ directly under the same reaction conditions (same concentration of denaturant, [Den]) and not extrapolating to the absence of denaturant. In our laboratory, we can measure $ \Delta \Delta {G}_{\mathrm{D}-\mathrm{N}} $ with adequate precision down to ~0.6 kcal/mol from the differences in the midpoints of equilibrium denaturation curves of wild-type and mutants (Clarke and Fersht, Reference Clarke and Fersht1993) or from the unfolding and folding rate constants (Fersht and Sato, Reference Fersht and Sato2004), as do others (Friel et al., Reference Friel, Capaldi and Radford2003; Garcia-Mira et al., Reference Garcia-Mira, Boehringer and Schmid2004). First-order rate constants for unfolding and refolding can be determined with high precision. Attention to detail is important. We make up stock solutions of denaturant for each concentration, using volumetric flasks, rather than diluting one concentrated stock solution into buffer. I avoid using phosphate buffer with guanidinium chloride because the denaturant lowers the pKa greatly with increasing [Den]: being ionic, it displaces the ionisation equilibrium H2PO4− = HPO42− + H+ since, according to the Debye–Hückel equation, the activity coefficient of an ion depends on the square of its charge (Debye and Huckel, Reference Debye and Huckel1923). (The application to kinetics was implemented in the Brønsted–Bjerrum equation.) Instead, I prefer an amine buffer at neutrality or acetate at lower pH because their ionisations parallel more closely the principal protein ionisations at those pHs: histidine/α-amino groups, and aspartate/glutamate, respectively (Fersht and Petrovich, Reference Fersht and Petrovich2013). Urea does not have this problem. To minimise problems from changes of pH with temperature and denaturant concentrations and so forth, measurements are best made at pHs where free energies and kinetics are pH independent.
Combining Φ-values with and benchmarking computer simulation
The complete description of folding pathways of proteins can be achieved only by computer simulation. This is possible de novo only when the energy potentials are sufficiently reliable, or when black-box machine learning is applicable. The role of the experimentalist has been to provide the structures of all the states along the pathway as a starting basis for simulation and to benchmark simulation within the limitations of current energy functions. Φ-values since their initial introduction have provided the crucial benchmark for interactions in the transition state for the folding of the small domains, the most easily studied computationally because of the limitations on computing power. They are being used for testing more complex folding of large proteins (Ooka and Arai, Reference Ooka and Arai2023). There are methods for calculating Φ-values directly (Best and Hummer, Reference Best and Hummer2016).
Barnase: the test bed
Φ-value analysis was pioneered on the 110-residue RNase, Barnase, from Bacillus amyloliquefaciens. It is a most suitable small protein for structure-activity studies using protein engineering, readily expressed from E. coli and does not have complications from disulphide bridges or cis-prolines in the folded state. The strategy for studying it has two steps as for the tyrosyl-tRNA synthetase studies; (1) mutate the protein sensibly and extensively to build up a library of the common interactions that stabilise proteins; and (2) select suitable mutants for kinetic analysis.
Step 1: library of interaction energies that stabilise proteins
The magnitudes of the hydrophobic effect and other interactions were usually measured from simple free energies of transfer from organic solvents to water (Fersht, Reference Fersht1999, Reference Fersht2017, Reference Fersht2018, Ch. 11) or, more appropriately for α-helices, the stabilities of synthetic peptides in water (Padmanabhan et al., Reference Padmanabhan, Marqusee, Ridgeway, Laue and Baldwin1990). We made the first systematic measurements of the common interactions that stabilise proteins directly in a protein from the values of $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ of wild-type barnase versus mutants whose side chains had been truncated by non-disruptive deletions. The deletion of a −CH2− group from a residue in the hydrophobic core lowers stability by up to 1.6 kcal/mol compared with 0.68 kcal/mol in the simple chemical models (Kellis et al., Reference Kellis, Nyberg, Sali and Fersht1988, Reference Kellis, Nyberg and Fersht1989). The mutation of Ala→Gly in the exposed surface of helices lowers stability by 0.4–2 kcal/mol, depending on the amount of surface area of the CH3− group of Ala buried (Serrano et al., Reference Serrano, Matouschek and Fersht1992a, Reference Serrano, Neira, Sancho and Fersht1992b). Mutants from these studies with suitable values of $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ were chosen for the kinetic studies.
Step 2: kinetics
The initial study was on the unfolding of the protein as it starts from the best-characterised state on the pathway and the folding direction can be beset by problems of residual structure in the denatured state or even intermediates (Matouschek et al., Reference Matouschek, Kellis, Serrano and Fersht1989). Unfolding kinetics provides in general the most reliable data and is very relevant to biology because many diseases are initiated by protein unfolding. The folded state is the best-characterised starting point also for computer simulation. The transition state for unfolding is generally the highest energy state on the folding pathway.
Barnase is a multimodular protein, having regions that make more interactions within themselves than with the rest of the protein, with three hydrophobic cores and a mixed $ \alpha +\beta $ architecture. Some of the regions have Φ-values near 1, others have values of 0, and some regions are intermediate. The centre of the sheet and the C-terminal portion of helix 1 have Φ-values of approximately 1. There are fractional Φ-values for the edges of the sheet and for the packing of the N-terminal α-helix on the β-sheet, which constitutes the major hydrophobic core. The second domain, containing helix 2, and the loops have Φ-values ~0. The multimodular barnase has a polarised major transition state, which occurs late on the reaction pathway with much of the secondary structure being formed and the hydrophobic core between the major α-helix and β-sheet in the process of being consolidated (Matouschek et al., Reference Matouschek, Kellis, Serrano and Fersht1989; Serrano et al., Reference Serrano, Matouschek and Fersht1992a).
Folding intermediate or structured Dphys?
The downward curvature in the refolding limb of the log $ {k}_{\mathrm{obs}} $ versus [Urea] plot (Figure 13) was the initial evidence that there is either a folding intermediate or structured denatured state, Dphys, whose concentration or properties change with concentration of denaturant (Matouschek et al., Reference Matouschek, Kellis, Serrano, Bycroft and Fersht1990). A structured Dphys that progressively unfolds in a non-cooperative transition could give rise to a variable two-state process. Φ-values probe the structure of this state (Matouschek et al., Reference Matouschek, Serrano and Fersht1992), which has been extensively studied by a variety of methods (Khan et al., Reference Khan, Chuang, Gianni and Fersht2003) and simulation (Caflisch and Karplus, Reference Caflisch and Karplus1995; Li and Daggett, Reference Li and Daggett1998; Wong et al., Reference Wong, Clarke, Bond, Neira, Freund, Fersht and Daggett2000; Galano-Frutos and Sancho, Reference Galano-Frutos and Sancho2019). The biophysics is consistent with a cooperative unfolding of the state (Dalby et al., Reference Dalby, Clarke, Johnson and Fersht1998a, Reference Dalby, Oliveberg and Fersht1998b). There are probably two intermediates on the pathway (Khan et al., Reference Khan, Chuang, Gianni and Fersht2003; Sanchez and Kiefhaber, Reference Sanchez and Kiefhaber2003). $ {\varPhi}_{\mathrm{F}} $-values measured from ill-defined folding intermediates must be interpreted with caution because there may be non-native interactions involved. Time-resolved small-angle X-ray scattering indicates an expanded state (Konuma et al., Reference Konuma, Kimura, Matsumoto, Goto, Fujisawa, Fersht and Takahashi2011).
The evidence is consistent with some fraction of the denatured ensemble containing residual, non-random structure, especially in helix 1 and the turn (β3–β4) in the centre of the sheet, in agreement with MD simulation of the denatured state (Bond et al., Reference Bond, Wong, Clarke, Fersht and Daggett1997; Wong et al., Reference Wong, Clarke, Bond, Neira, Freund, Fersht and Daggett2000). The folding pathway is simulated atomistically by running the unfolding pathway in reverse, Figure 15 (Fersht and Daggett, Reference Fersht and Daggett2002; Daggett and Fersht, Reference Daggett and Fersht2003).
Chymotrypsin inhibitor 2: two-state kinetics and nucleation-condensation
Our second protein studied, chymotrypsin inhibitor 2 (CI2), is a 64-residue single-domain protein, unlike most of the proteins studied until then, which were multi-domain. It is a single-module protein with a single α-helix docked onto a β-sheet. In contrast to those other proteins then studied (Ptitsyn, Reference Ptitsyn1987; Kim and Baldwin, Reference Kim and Baldwin1990), CI2 was found to fold by two-state kinetics without an intermediate and, for that time, relatively fast, on the 10 ms time scale (Jackson and Fersht, Reference Jackson and Fersht1991a, Reference Jackson and Fersht1991b). Intermediates do not detectably accumulate in its folding and the ratio of rate constants for folding and unfolding gives the correct equilibrium constant for denaturation, again unlike for the previously studied proteins. The chevron plot has perfectly linear arms, Figure 13. Its single rate-determining transition state for folding can be studied in both directions to show that unfolding and folding are the reverse pathways of each other and so microscopic reversibility is obeyed. More examples of two-state folding were quickly found (Jackson, Reference Jackson1998) and 89 proteins are now reported with two-state folding kinetics (Manavalan et al., Reference Manavalan, Kuwajima and Lee2019). The small single-domain proteins are very suitable for gaining insights into the early stages of folding before their assembly into more complex tertiary structures in larger multi-domain proteins. They often fold and unfold sufficiently fast that their denatured and native states are in rapid equilibrium in vivo and so the in vitro studies are also directly relevant to biology. There could of course be high-energy intermediates, such as in Figure 1, which are cryptic. Two-state folding without accumulating intermediates resurrected the possibility of nucleation mechanisms.
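The two-state behaviour described above can be illustrated with a minimal sketch of a chevron plot, assuming the usual linear dependence of the logarithms of the folding and unfolding rate constants on denaturant concentration. The function names, the unfolding rate constant in water, and both m-values are hypothetical illustrations and are not taken from the CI2 papers; only the folding rate constant of 56 s−1 is quoted elsewhere in this review.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1


def two_state_k_obs(denaturant, kf_water, ku_water, m_f, m_u, temp=298.0):
    """Observed relaxation rate constant for a two-state D <-> N system.

    Assumes ln(kf) falls and ln(ku) rises linearly with denaturant
    concentration (m_f and m_u in kcal mol^-1 M^-1).
    """
    rt = R_KCAL * temp
    kf = kf_water * math.exp(-m_f * denaturant / rt)
    ku = ku_water * math.exp(m_u * denaturant / rt)
    return kf + ku


def unfolding_free_energy(kf_water, ku_water, temp=298.0):
    """Free energy of denaturation in water from the ratio of rate constants."""
    return R_KCAL * temp * math.log(kf_water / ku_water)


# Hypothetical parameters: only KF (56 s^-1) is a value quoted in the review.
KF, KU, MF, MU = 56.0, 1e-4, 1.1, 0.7

for conc in (0.0, 2.0, 4.0, 6.0, 8.0):
    log_k = math.log10(two_state_k_obs(conc, KF, KU, MF, MU))
    print(f"[denaturant] = {conc:.1f} M  log10(k_obs) = {log_k:.2f}")

print(f"dG(D-N) from kinetics = {unfolding_free_energy(KF, KU):.2f} kcal/mol")
```

For a genuine two-state protein, the free energy obtained this way should agree with the value from equilibrium denaturation, which is the test applied to CI2 in the text; downward curvature in the refolding arm, as for barnase, signals a departure from the model.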
Chymotrypsin inhibitor 2: nucleation-condensation mechanism
We always perform a large number of mutations, but the Φ-value analysis of CI2 was exhaustive: 100 mutations at 45 of the 64 residues and a network of 11 double-mutant cycles (Itzhaki et al., Reference Itzhaki, Otzen and Fersht1995). It revealed not only nucleation but discovered a new mechanism: the nucleation-condensation mechanism (Fersht, Reference Fersht1995; Itzhaki et al., Reference Itzhaki, Otzen and Fersht1995). The single observed transition state for folding and unfolding consists of a structure in which an extended nucleus is formed, built around the single α-helix, which is being formed at the same time as the rest of the structure is condensing around it. Apart from one residue, all the Φ-values are fractional, approaching closer to 0, the further away from the diffuse nucleus. The physical-chemistry reasoning behind this is quite simple. None of the elements of regular secondary structure, such as the α-helix, are stable in the absence of the rest of the protein structure – as is generally found for proteins – and so those regions when separate from the rest of the structure are largely random in solution (Epand and Scheraga, Reference Epand and Scheraga1968). For most proteins, the secondary structure needs to be stabilised by long-range interactions. Protein folding is, accordingly, such a cooperative process that the major transition state for folding of a domain is one in which the structure is largely formed. Nucleation-condensation is now a well-established general mechanism for the folding of single domains (Nolting and Agard, Reference Nolting and Agard2008; Kukic et al., Reference Kukic, Pustovalova, Camilloni, Gianni, Korzhnev and Vendruscolo2017).
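As a hedged illustration of how a Φ-value for folding is obtained from kinetic and equilibrium data, the sketch below assumes the standard definition: the change in activation free energy of folding on mutation divided by the change in stability. The function name and the example numbers for the mutant are hypothetical; only the wild-type rate constant of 56 s−1 is quoted in this review.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1


def phi_folding(kf_wt, kf_mut, ddg_stability, temp=298.0):
    """Phi for folding from folding rate constants and the stability change.

    ddg_stability is the loss of stability on mutation (kcal/mol),
    positive for a destabilising mutation. Assumes transition state
    theory with an unchanged pre-exponential factor.
    """
    if ddg_stability == 0:
        raise ValueError("Phi is undefined when the mutation does not change stability")
    ddg_ts_d = R_KCAL * temp * math.log(kf_wt / kf_mut)
    return ddg_ts_d / ddg_stability


# Hypothetical mutant: folds at 20 s^-1 and is destabilised by 1.5 kcal/mol.
phi = phi_folding(kf_wt=56.0, kf_mut=20.0, ddg_stability=1.5)
print(f"Phi = {phi:.2f}")
```

A value near 0 means the mutated interaction is as unformed in the transition state as in the denatured state, a value near 1 means it is as formed as in the native state, and fractional values, as found for almost all of CI2, indicate partial formation.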
The important features of nucleation-condensation are not just that the nucleus is large and extended but its structure is like a distorted form of the native structure where interactions are not uniform but weaken away from the nucleus. A generally useful pointer to the nucleation-condensation mechanism or a diffuse transition state is a Leffler/Brønsted plot of $ \Delta {G}_{\ddagger -\mathrm{N}} $ versus $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ (Figure 16). As the Φ-values are mainly fractional, the plot is scattered around a linear regression of slope 0.7 with deviations for the higher and lower values of Φ. In contrast, the plot for barnase with its polarised transition state and Φ spread from 0 to 1 has the points scattered between lines of slope 0 and 1 (Itzhaki et al., Reference Itzhaki, Otzen and Fersht1995).
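The slope of a Leffler/Brønsted plot can be estimated by ordinary least squares, as in the following minimal sketch. The data points are invented purely for illustration and are not the CI2 measurements; the function name is likewise an assumption.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical free-energy changes on mutation (kcal/mol), for illustration only.
ddg_d_n = [0.5, 1.0, 1.5, 2.0, 3.0]      # change in stability
ddg_ts_n = [0.3, 0.75, 1.0, 1.45, 2.1]   # change in activation energy of unfolding

slope, intercept = least_squares_line(ddg_d_n, ddg_ts_n)
print(f"Leffler/Bronsted slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Points clustered around a single line of intermediate slope suggest a diffuse transition state, whereas points spread between lines of slope 0 and 1, as for barnase, suggest a polarised one.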
Chymotrypsin inhibitor 2: computer simulations
CI2 is such a well-behaved system, small, and with so much experimental Φ-value data available that it stimulated and became a major test bed for computer simulation. I have had a long collaboration, beginning in 1994 (Fersht et al., Reference Fersht, Itzhaki, elMasry, Matthews and Otzen1994; Li and Daggett, Reference Li and Daggett1994), with Valerie Daggett, who had performed the first all-atom simulation of the unfolding of the bovine pancreatic trypsin inhibitor (Daggett and Levitt, Reference Daggett and Levitt1992). Our collaboration agreement was that all her simulations were done blind without foreknowledge of our experimental data. Li and Daggett simulated the unfolding of CI2 at 498 K, the high simulation temperature being necessary for the unfolding to be on the then accessible timescale of 2.2 ns (Li and Daggett, Reference Li and Daggett1994); the pathway does not change over a range of temperatures (Day et al., Reference Day, Bennion, Ham and Daggett2002). The Φ-values from MD and experiment were very similar in the first study. As more experimental Φ-values became available, the good agreement remained. A simulation (Daggett et al., Reference Daggett, Li, Itzhaki, Otzen and Fersht1996) gave a complete atomic-level description of the transition state and recapitulated all the experimental Φ-values (Itzhaki et al., Reference Itzhaki, Otzen and Fersht1995). These simulations were then combined with further studies on the denatured state, including one of the first atomic views of a ‘random coil’ denatured state (Kazmirski et al., Reference Kazmirski, Wong, Freund, Tan, Fersht and Daggett2001), and transition states (Li and Daggett, Reference Li and Daggett1996; Kazmirski et al., Reference Kazmirski, Wong, Freund, Tan, Fersht and Daggett2001), to give more detailed descriptions, reviewed by Fersht and Daggett (Reference Fersht and Daggett2002) and Daggett and Fersht (Reference Daggett and Fersht2003), Figure 17.
In multiple simulations of unfolding, single trajectories are distributed around an average ‘ensemble’ path (Day and Daggett, Reference Day and Daggett2005). Simulations of folding and unfolding at the melting temperature showed that microscopic reversibility indeed holds (Day and Daggett, Reference Day and Daggett2007). Overall, they found that conformations in the transition state ensemble (TSE) have a probability of 0.5 of refolding to the native state, with approximately 50% of the structures taken from the TSE refolding and the other 50% progressing to the denatured state (Day and Daggett, Reference Day and Daggett2007). Further, simulations pointed to mutations that could speed up folding by relieving strain in the transition state, and one, Arg48→Phe, was found that speeds up folding 40-fold to a $ {t}_{1/2} $ of 400 μs (Ladurner et al., Reference Ladurner, Itzhaki, Daggett and Fersht1998). Thus, the MD-derived TSE consists of true transition states, validating the use of transition state theory underlying all Φ-value analyses, and also showing the power of simulation.
The results of multiple simulations of unfolding reconciled the ‘new view’ of folding on an energy landscape and the classical view of protein folding with a defined pathway – there is a statistically preferred pathway on a funnel-like average energy surface (Lazaridis and Karplus, Reference Lazaridis and Karplus1997). The funnelled nature of the energy landscape arising from Wolynes’ minimal frustration principle (strong native bias) is consistent with unusual Φ-values being infrequent and with the transition state being a distorted version of the native state. Also, because the energy landscape is funnelled, mutations are not prone to change the structure of the native state (Oliveberg and Wolynes, Reference Oliveberg and Wolynes2005). CI2 Φ-values helped the theoreticians to clarify their views (Pande et al., Reference Pande, Grosberg, Tanaka and Rokhsar1998).
CI2 occupies an important position in the development of protein folding studies because it was the first example of a single-domain protein showing two-state kinetics, the Φ-value analysis discovered the nucleation-condensation mechanism, and it stimulated so much theoretical advance.
Movement of TS on the energy landscape: Hammond and anti-Hammond effects
The transition state lies on a saddle point in the energy landscape and can move in a direction along the reaction coordinate (the Hammond effect) or perpendicular to it (the anti-Hammond effect) as the energetics are perturbed, Figures 3 and 18 (Jencks, Reference Jencks1985). We found both Hammond and anti-Hammond effects in folding transition states (Matouschek and Fersht, Reference Matouschek and Fersht1993; Matouschek et al., Reference Matouschek, Otzen, Itzhaki, Jackson and Fersht1995; Matthews and Fersht, Reference Matthews and Fersht1995; Dalby et al., Reference Dalby, Oliveberg and Fersht1998c) by comparing the extent of overall folding using Leffler/Brønsted plots of $ \Delta {G_{\ddagger}}^0 $ versus $ \Delta {G}_{\mathrm{D}-\mathrm{N}} $ or $ {\beta}_{\mathrm{T}} $ with Φ-values for local structure (Matthews and Fersht, Reference Matthews and Fersht1995; Fersht and Sato, Reference Fersht and Sato2004). A Leffler/Brønsted plot of successive mutations in helix 1 of barnase has a slope for unfolding of −0.09 for mutations with $ \Delta {G}_{\mathrm{D}-\mathrm{N}}<2 $ kcal/mol, showing that it is ~90% folded in the transition state, but for $ \Delta {G}_{\mathrm{D}-\mathrm{N}}>3 $ kcal/mol, the slope steepens to −0.6, so that the helix is only ~60% folded. The overall position of the transition state moves closer to that of the native structure as it becomes less stable, measured by $ {\beta}_{\mathrm{T}} $, the Hammond effect, but the helix itself follows anti-Hammond behaviour and moves away from native. The anti-Hammond effect could result from a changing balance in parallel pathways (Matthews and Fersht, Reference Matthews and Fersht1995) or from true movement perpendicular to the reaction coordinate. Simulation supports the latter (Daggett et al., Reference Daggett, Li and Fersht1998). Movement of the transition state on large destabilising mutations signals caution in interpreting changes in Φ for them.
Importantly, it points to how a series of mutations in a family of homologous proteins can lead to changes of mechanism.
Engrailed homeodomain: framework mechanism
The Engrailed homeodomain (EnHD) is a 61-residue 3-helix bundle protein (Mayor et al., Reference Mayor, Johnson, Daggett and Fersht2000; Banachewicz et al., Reference Banachewicz, Johnson and Fersht2011). Using a combination of NMR, X-ray crystallography, X-ray scattering, and various spectroscopic techniques on wild-type and mutant protein, we have been in the rare position of being able to describe the structures of the denatured state, Dphys, and an intermediate at atomic resolution. These structural studies combined with Φ-values and molecular dynamics simulations provide a detailed description of its folding pathway from ns to μs (Mayor et al., Reference Mayor, Johnson, Daggett and Fersht2000, Reference Mayor, Grossmann, Foster, Freund and Fersht2003a, Reference Mayor, Guydosh, Johnson, Grossmann, Sato, Jas, Freund, Alonso, Daggett and Fersht2003b; Stollar et al., Reference Stollar, Mayor, Lovell, Federici, Freund, Fersht and Luisi2003; DeMarco et al., Reference DeMarco, Alonso and Daggett2004; Religa et al., Reference Religa, Markson, Mayor, Freund and Fersht2005; Huang et al., Reference Huang, Settanni and Fersht2008; McCully et al., Reference McCully, Beck and Daggett2008; Neuweiler et al., Reference Neuweiler, Banachewicz and Fersht2010; Banachewicz et al., Reference Banachewicz, Religa, Schaeffer, Daggett and Fersht2011; Nasedkin et al., Reference Nasedkin, Marcellini, Religa, Freund, Menzel, Fersht, Jemth, van der Spoel and Davidsson2015). Simulations of folding and unfolding pathways obey microscopic reversibility (McCully et al., Reference McCully, Beck and Daggett2008).
The protein folds from the intermediate via a framework mechanism. EnHD has a very stable helix 1 which is up to ~40–50% α-helical in the absence of the rest of the protein, and helices 2 and 3 together form a helix-turn-helix motif which is not only structured in that folding intermediate (Mayor et al., Reference Mayor, Grossmann, Foster, Freund and Fersht2003a) but also stable as an independent sequence (Religa et al., Reference Religa, Johnson, Vu, Brewer, Dyer and Fersht2007). This intermediate is the most stable denatured state under conditions that favour folding, the more unfolded form being less stable, and its structure has been determined by NMR (Religa et al., Reference Religa, Markson, Mayor, Freund and Fersht2005). Φ-values show that the final rate-determining transition state is the docking of helix 1 onto the structure formed by helices 2 and 3 to form the hydrophobic core (Figure 19; Mayor et al., Reference Mayor, Grossmann, Foster, Freund and Fersht2003a).
Homeodomain family: pointer to a unifying underlying mechanism
Slide from nucleation-condensation to framework across a family
Members of the same family of proteins having the same overall fold but with different sequences and secondary structural propensities can provide important information, especially from Φ-analysis: Im7 and Im9 (Friel et al., Reference Friel, Capaldi and Radford2003); Ig-like (Geierhaas et al., Reference Geierhaas, Paci, Vendruscolo and Clarke2004; Lappalainen et al., Reference Lappalainen, Hurley and Clarke2008); SH3 domains (Martinez and Serrano, Reference Martinez and Serrano1999; Guerois and Serrano, Reference Guerois and Serrano2000); protein L (Kim et al., Reference Kim, Fisher and Baker2000); and more general discussions (Zarrine-Afsar et al., Reference Zarrine-Afsar, Larson and Davidson2005; Brunori et al., Reference Brunori, Gianni, Giri, Morrone and Travaglini-Allocatelli2012).
Three members of the homeodomain-like protein family that share the same overall topology with EnHD: human TRF1 Myb domain (hTRF1); human RAP1 Myb domain (hRAP1); and c-Myb-transforming protein (c-Myb) have decreasing propensity for α-helix formation in helix 1 (Figure 20) and helixes 2 and 3 do not form independently stable helix-turn-helix motifs. These proteins vary widely in sequence, just having fold homology. There is a spectrum of folding processes that spans the complete transition from framework to nucleation-condensation mechanism as the helical propensity decreases, Figure 21 (Gianni et al., Reference Gianni, Guydosh, Khan, Caldas, Mayor, White, DeMarco, Daggett and Fersht2003). The common factor in their mechanisms is that the transition state for (un)folding is expanded and very native-like, with the proportion and degree of formation of secondary and tertiary interactions varying. It appears that framework and nucleation-condensation are different manifestations of an underlying common mechanism, Figure 21 (Daggett and Fersht, Reference Daggett and Fersht2003; Gianni et al., Reference Gianni, Guydosh, Khan, Caldas, Mayor, White, DeMarco, Daggett and Fersht2003).
Folding close to the speed limit
Pit1, the 63-residue homeodomain from pituitary-specific transcription factor, folds via an intermediate in more widely separated phases than EnHD, with $ {t}_{1/2} $ of 2.3 and 46 μs, allowing Φ-values to be measured for both phases (Banachewicz et al., Reference Banachewicz, Johnson and Fersht2011). Its helix-turn-helix motif does not independently fold but is folded in the intermediate, docked to a misfolded helix 1, which rearranges to fold correctly. Pit1 lies on the slide from the framework mechanism of EnHD folding to nucleation-condensation for Myb, TRF1 and RAP1.
The folding rate constant of $ 3\times {10}^5 $ s−1 for the fast phase decreases with increasing viscosity and is only slightly sensitive to mutation or denaturant concentration. The formation of the intermediate is partly rate-limited by chain diffusion and partly by an energy barrier to give a very diffuse transition state. The process is rather like the association of barnase with its protein inhibitor barstar, which proceeds via an encounter complex that is diffusion-limited and relatively insensitive to mutations, and then precisely docks and makes specific interactions in a slower step (Schreiber and Fersht, Reference Schreiber and Fersht1995, Reference Schreiber and Fersht1996). The folding is approaching the downhill-folding scenario of energy landscape theory (Gelman and Gruebele, Reference Gelman and Gruebele2014).
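The half-lives quoted for the two Pit1 phases and the rate constant of the fast phase are related by simple first-order kinetics. The short sketch below, with a hypothetical function name, makes the conversion explicit under the assumption that each phase is a single-exponential process.

```python
import math


def rate_constant_from_half_life(t_half_seconds):
    """First-order rate constant (s^-1) from a half-life in seconds: k = ln 2 / t_half."""
    if t_half_seconds <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2.0) / t_half_seconds


# Half-lives of the two Pit1 folding phases quoted in the text.
k_fast = rate_constant_from_half_life(2.3e-6)
k_slow = rate_constant_from_half_life(46e-6)
print(f"fast phase: {k_fast:.2e} s^-1, slow phase: {k_slow:.2e} s^-1")
```

The fast phase comes out at about 3 × 10^5 s−1, in line with the rate constant given for it above.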
The free energy barrier that separates the native and denatured state ensembles in the energy landscape model may disappear under extreme conditions that energetically greatly favour the native state (Bryngelson et al., Reference Bryngelson, Onuchic, Socci and Wolynes1995), similar to extreme Hammond behaviour for the movement of transition states in covalent chemistry, Figure 18, where the transition state moves closer in structure to the denatured state as the product becomes more stable (Hammond, Reference Hammond1955). Under these conditions, the protein folds downhill energetically. The transition-state energy barrier reappears as conditions change to stabilise the denatured state ensemble, such as going through the thermal or denaturant unfolding transitions. The finding of very fast folding small domains, ‘miniproteins’ that fold on the μs time scale or faster, led to increased interest as to what happens to pathways on folding close to the speed limit (Kubelka et al., Reference Kubelka, Hofrichter and Eaton2004; Gelman and Gruebele, Reference Gelman and Gruebele2014). Barriers of <3 $ {k}_{\mathrm{B}}T $ (<1.8 kcal mol−1 at 298 K) are suggested to be consistent with this type of downhill folding (Carter et al., Reference Carter, Baker, Best and De Sancho2013; Prigozhin and Gruebele, Reference Prigozhin and Gruebele2013). However, ‘downhill folding on a rough energy landscape versus rapid folding through very shallow intermediates is in the eye of the beholder’ (Gelman and Gruebele, Reference Gelman and Gruebele2014). All the states along the pathway/landscape are ensembles of structures (Figure 8). There is residual native and non-native structure in the denatured state, and this coexists with folding intermediates and the native structure in varying proportions with changing conditions.
The folded state is dynamic, with regions locally unfolding as demonstrated by hydrogen-deuterium exchange (Englander et al., Reference Englander, Mayne, Bai and Sosnick1997; Englander, Reference Englander2023). The energy landscape has many local minima, which can contribute to kinetics when the transition state energy barrier is low. These problems are exacerbated for the small fast-folding domains because their folding equilibrium and activation energies are often low and the structure of domains taken from their parent is sensitive to the choice of domain boundaries.
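As a check on the barrier height quoted above for downhill folding, the conversion from thermal energy units to molar units uses the gas constant, $ R=1.987\times {10}^{-3} $ kcal mol−1 K−1, in place of the Boltzmann constant:

$$ 3{k}_{\mathrm{B}}T\equiv 3 RT=3\times \left(1.987\times {10}^{-3}\ \mathrm{kcal}\ {\mathrm{mol}}^{-1}\ {\mathrm{K}}^{-1}\right)\times 298\ \mathrm{K}\approx 1.8\ \mathrm{kcal}\ {\mathrm{mol}}^{-1} $$

A barrier of this size is only about three times the thermal energy, which is why local minima on the landscape can contribute to the observed kinetics.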
Transition states across PSBD family: nucleation-condensation in very fast folding
The more thermostable two-helix bundle PSBD from B. stearothermophilus (E3BD) folds cooperatively and very rapidly, and its separated constituent α-helical regions have little helical tendency, showing that fast folding does not require the docking of preformed elements (Spector et al., Reference Spector, Kuhlman, Fairman, Wong, Boice and Raleigh1998, Reference Spector, Rosconi and Raleigh1999a, Reference Spector, Young and Raleigh1999b; Spector and Raleigh, Reference Spector and Raleigh1999). Φ-value analysis at 325 K by T-jump relaxation kinetics (Ferguson et al., Reference Ferguson, Day, Johnson, Allen, Daggett and Fersht2005) and at 298 K by rapid mixing and some T-jump (Ferguson et al., Reference Ferguson, Sharpe, Johnson and Fersht2006) shows a nucleation-condensation mechanism, which has a very diffuse transition state but with helix 2 the most structured. There is good consistency with calculated values from MD simulation.
Comparison of Φ-values with two other members of the PSBD family that have significant sequence identity but different helix-forming propensities, POB from Pyrobaculum aerophilum (Sharpe et al., Reference Sharpe, Ferguson, Johnson and Fersht2008) and BBL (Neuweiler et al., Reference Neuweiler, Sharpe, Rutherford, Johnson, Allen, Ferguson and Fersht2009), Figure 22, provides information about conservation of folding mechanism in closely related, very fast folding proteins. They all fold via nucleation-condensation, with Φ-values summarised in Figure 23. There are differences in that folding of E3BD and POB nucleates in helix 2 but interactions in the folding transition state of BBL are more evenly dispersed across the structure, perhaps because of the high helical propensity of its helix 1 (Neuweiler et al., Reference Neuweiler, Sharpe, Rutherford, Johnson, Allen, Ferguson and Fersht2009). The folding rate constants for E3BD, BBL, and POB at 298 K are 27,500 ± 500 s−1, 124,000 ± 5000 s−1, and 210,000 ± 5000 s−1, respectively, and follow the predicted helical propensities of sites in the second helix. An increased helical propensity at the nucleation site appears to stabilise the folding nucleus and results in an increased folding rate constant.
Other examples with Φ-values
Φ-analysis has now been applied by many groups to a large number of proteins to illuminate a range of processes and structures in the folding, assembly, and activity of proteins. Alm et al. (Reference Alm, Morozov, Kortemme and Baker2002) used published data on 19 proteins with Φ-values to devise a simple model for folding. For over half of these, the theory reproduced Φ with correlation coefficients between 0.41 and 0.88. They classified transition-state structures into three categories. (1) Small proteins with polarised transition states, which include Protein L; Protein G; src; spectrin; and Sso7d SH3 domains. (2) Large proteins with compact subdomains, which include barnase; cheY; tenascin; titin; fibronectin (the tenth type III domain repeat of fibronectin); and U1A spliceosomal protein. (3) Proteins with diffuse transition states, which include CI2; FKBP12 (FK506-binding protein); λ repressor; Suc1; muscle acylphosphatase; procarboxypeptidase; ribosomal protein S6; and villin headpiece. The examples with diffuse transition states correspond to the CI2 end of the nucleation-condensation mechanism, which slides to the polarised end for some of the polarised states.
Proteins that have been subjected to Φ-analysis, including some of the above, are given with their sources in this by no means complete list. I have indicated (nc) for some that appear to fold by nucleation-condensation and (fw) for those that appear to fold by a framework mechanism. Monomeric λ-repressor (Burton et al., Reference Burton, Myers and Oas1998), ADA2H (nc) (Villegas et al., Reference Villegas, Martinez, Aviles and Serrano1998; Kukic et al., Reference Kukic, Pustovalova, Camilloni, Gianni, Korzhnev and Vendruscolo2017), acyl-coA binding protein (nc) (Kragelund et al., Reference Kragelund, Poulsen, Andersen, Baldursson, Kroll, Neergård, Jepsen, Roepstorff, Kristiansen, Poulsen and Knudsen1999), SH3 domains (α-spectrin (nc), src) (Grantcharova et al., Reference Grantcharova, Riddle, Santiago and Baker1998; Martinez and Serrano, Reference Martinez and Serrano1999; Guerois and Serrano, Reference Guerois and Serrano2000), SH3 domain from Grb2 (nc) (Troilo et al., Reference Troilo, Bonetti, Camilloni, Toto, Longhi, Brunori and Gianni2018), acylphosphatase (nc) (Chiti et al., Reference Chiti, Taddei, White, Bucciantini, Magherini, Stefani and Dobson1999), Im7 and Im9 (Friel et al., Reference Friel, Capaldi and Radford2003; Paci et al., Reference Paci, Friel, Lindorff-Larsen, Radford, Karplus and Vendruscolo2004; Bartlett and Radford, Reference Bartlett and Radford2010), NTL9 domain of L9 (nc) (Anil et al., Reference Anil, Sato, Cho and Raleigh2005; Sato et al., Reference Sato, Cho, Peran, Soydaner-Azeloglu and Raleigh2017), cheY (1 domain nc) (Lopez-Hernandez and Serrano, Reference Lopez-Hernandez and Serrano1996), Sod1 (Yang et al., Reference Yang, Wang, Logan, Mu, Danielsson and Oliveberg2018), S6 (Otzen and Oliveberg, Reference Otzen and Oliveberg2002; Lindberg et al., Reference Lindberg, Haglund, Hubner, Shakhnovich and Oliveberg2006), U1a (nc) (Ternstrom et al., Reference Ternstrom, Mayor, Akke and Oliveberg1999), azurin (Wilson and Wittung-Stafshede, Reference Wilson and Wittung-Stafshede2005; Zong et al., Reference Zong, Wilson, Shen, Wolynes and Wittung-Stafshede2006), apo-flavodoxin (Campos et al., Reference Campos, Bueno, Lopez-Llano, Jimenez and Sancho2004; Muralidhara et al., Reference Muralidhara, Chen, Ma and Wittung-Stafshede2005; Bueno et al., Reference Bueno, Ayuso-Tejedor and Sancho2006; Lopez-Llano et al., Reference Lopez-Llano, Campos, Bueno and Sancho2006; Homouz et al., Reference Homouz, Stagg, Wittung-Stafshede and Cheung2009; Stagg et al., Reference Stagg, Samiotakis, Homouz, Cheung and Wittung-Stafshede2010; Galano-Frutos et al., Reference Galano-Frutos, Torreblanca, Garcia-Cebollada and Sancho2022), WW domains (polarised) (Jager et al., Reference Jager, Nguyen, Crane, Kelly and Gruebele2001; Petrovich et al., Reference Petrovich, Jonsson, Ferguson, Daggett and Fersht2006; Dave et al., Reference Dave, Jager, Nguyen, Kelly and Gruebele2016), villin headpiece (Cho et al., Reference Cho, O’Connell, Raleigh and Palmer2010), SH2 domains (Visconti et al., Reference Visconti, Malagrino, Gianni and Toto2019; Toto et al., Reference Toto, Malagrino, Nardella, Pennacchietti, Pagano, Santorelli, Diop and Gianni2022), HYPA/FBP11 FF domain (bn) (Jemth et al., Reference Jemth, Day, Gianni, Khan, Allen, Daggett and Fersht2005), PTP-BL, PDZ2 and PSD-95 PDZ3 (3-state) (Calosci et al., Reference Calosci, Chi, Richter, Camilloni, Engstrom, Eklund, Travaglini-Allocatelli, Gianni, Vendruscolo and Jemth2008), ubiquitin (nc) (Went and Jackson, Reference Went and Jackson2005; Varnai et al., Reference Varnai, Dobson and Vendruscolo2008), RNaseA (nc) (Font et al., Reference Font, Benito, Lange, Ribo and Vilanova2006), yACBP and 3 variants of bACBP (Teilum et al., Reference Teilum, Thormann, Caterer, Poulsen, Jensen, Knudsen, Kragelund and Poulsen2005), B domain of Protein A (nc) (Sato et al., Reference Sato, Religa and Fersht2006), α-lactalbumin (Chedad et al., Reference Chedad, Van Dael, Vanhooren and Hanssens2005), R15 (nc) R16 (fw) R17 (fw) (domains of chicken brain α-spectrin) (Wensley et al., Reference Wensley, Gartner, Choo, Batey and Clarke2009), TI I27 (Fowler and Clarke, Reference Fowler and Clarke2001), TNfn3 (Geierhaas et al., Reference Geierhaas, Paci, Vendruscolo and Clarke2004), LysM domain (Nickson et al., Reference Nickson, Stoll and Clarke2008), FKBP12 (nc) (Main et al., Reference Main, Fulton, Daggett and Jackson2001), apocytochrome b562 (Zhou et al., Reference Zhou, Huang and Bai2005), SAP (Dodson and Arbely, Reference Dodson and Arbely2015), knotted proteins (Mallam et al., Reference Mallam, Morris and Jackson2008; Jackson et al., Reference Jackson, Suma and Micheletti2017), raf (polarised) (Campbell-Valois and Michnick, Reference Campbell-Valois and Michnick2007), four-helix HYPA/FBP11 (Jemth et al., Reference Jemth, Day, Gianni, Khan, Allen, Daggett and Fersht2005), tumour suppressor P16 (Tang et al., Reference Tang, Fersht and Itzhaki2003), barstar (nc) (Nolting et al., Reference Nolting, Golbik, Neira, Soler-Gonzalez, Schreiber and Fersht1997), p13Suc1 (nc) (Schymkowitz et al., Reference Schymkowitz, Rousseau, Irvine and Itzhaki2000), arc repressor (Srivastava and Sauer, Reference Srivastava and Sauer2000), BPTI (Bulaj and Goldenberg, Reference Bulaj and Goldenberg2001), TAPLLR (Kelly et al., Reference Kelly, Meisl, Rowling, McLaughlin, Knowles and Itzhaki2014), tumour suppressor p53 (Wang and Fersht, Reference Wang and Fersht2015), and KIX domain (Troilo et al., Reference Troilo, Bonetti, Toto, Visconti, Brunori, Longhi and Gianni2017).
Φ-analysis has been applied successfully to the folding of transmembrane proteins (Otzen, Reference Otzen2011; Booth, Reference Booth2012; Paslawski et al., Reference Paslawski, Lillelund, Kristensen, Schafer, Baker, Urban and Otzen2015) and includes processes with receptors and gating (Cymes et al., Reference Cymes, Grosman and Auerbach2002; Mitra et al., Reference Mitra, Bailey and Auerbach2004; Cadugan and Auerbach, Reference Cadugan and Auerbach2007; Aleksandrov et al., Reference Aleksandrov, Cui and Riordan2009; Edelstein and Changeux, Reference Edelstein and Changeux2010). Φ-analysis is particularly useful for studying the folding of intrinsically disordered proteins on binding to folded partners (Karlsson et al., Reference Karlsson, Chi, Engstrom and Jemth2012; Dogan et al., Reference Dogan, Mu, Engstrom and Jemth2013, Reference Dogan, Gianni and Jemth2014; Rogers et al., Reference Rogers, Oleinikovas, Shammas, Wong, De Sancho, Baker and Clarke2014; Shammas et al., Reference Shammas, Crabtree, Dahal, Wicky and Clarke2016; Karlsson et al., Reference Karlsson, Andersson, Dogan, Gianni, Jemth and Camilloni2019; Toto et al., Reference Toto, Troilo, Visconti, Malagrino, Bignon, Longhi and Gianni2019; Karlsson et al., Reference Karlsson, Paissoni, Erkelens, Tehranizadeh, Sorgenfrei, Andersson, Ye, Camilloni and Jemth2020; Malagrino et al., Reference Malagrino, Visconti, Pagano, Toto, Troilo and Gianni2020; Toto et al., Reference Toto, Malagrino, Visconti, Troilo, Pagano, Brunori, Jemth and Gianni2020; Karlsson and Jemth, Reference Karlsson and Jemth2021) and for proteins where mechanical force mimics their function in vivo (Best et al., Reference Best, Fowler, Toca-Herrera and Clarke2002; Best and Clarke, Reference Best and Clarke2002; Fowler et al., Reference Fowler, Best, Toca Herrera, Rutherford, Steward, Paci, Karplus and Clarke2002). 
It has been extended to RNA folding (Silverman and Cech, Reference Silverman and Cech2001; Young and Silverman, Reference Young and Silverman2002; Kim and Shin, Reference Kim and Shin2010; Pereyaslavets and Galzitskaya, Reference Pereyaslavets and Galzitskaya2015) and DNA aptamers (Lawrence et al., Reference Lawrence, Vallee-Belisle, Pfeil, de Mornay, Lipman and Plaxco2014).
The robustness and validity of Φ-analysis: Φ-Φ plots
The above examples show the wide and successful application of Φ-analysis. There have been criticisms of Φ-analysis, which have been critiqued by Gianni and Jemth (Reference Gianni and Jemth2014). They have a nice argument on how plots of Φ versus Φ for processes in common demonstrate the robustness of Φ-analysis. Such plots on homologous proteins are used to compare folding transition states (Calosci et al., Reference Calosci, Chi, Richter, Camilloni, Engstrom, Eklund, Travaglini-Allocatelli, Gianni, Vendruscolo and Jemth2008; Wensley et al., Reference Wensley, Gartner, Choo, Batey and Clarke2009; Wensley et al., Reference Wensley, Batey, Bone, Chan, Tumelty, Steward, Kwa, Borgia and Clarke2010). Sequences of identical proteins, such as circular permutants and circularised proteins, or of homologous proteins with high sequence identity are aligned, and the values of Φ at each position in one are plotted against those at the same position in the other, as in Figure 24. The probability that the pairs in each are not linearly related, P, is infinitesimal, consistent with their containing structural information. Provided that mutations are chosen as described and analysed in the first Φ-value paper (Matouschek et al., Reference Matouschek, Kellis, Serrano and Fersht1989) and earlier (Fersht et al., Reference Fersht, Leatherbarrow and Wells1987), and too high or too low changes in $ \Delta \Delta {G}_{\mathrm{D}-\mathrm{N}} $ are not used (Fersht and Sato, Reference Fersht and Sato2004), Φ-value analysis is robust. The weak, medium, and strong categorisation provides adequate constraints for simulation.
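The linear relationship tested in a Φ-Φ plot can be quantified with a Pearson correlation coefficient, as in this minimal sketch. The paired Φ-values are invented for illustration and do not come from Figure 24; the function name is an assumption, and the P-value would in practice be obtained from a statistics package.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equally long sequences with at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical Phi-values at aligned positions in two homologous proteins.
phi_protein_a = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
phi_protein_b = [0.15, 0.20, 0.45, 0.50, 0.80, 0.85]

print(f"r = {pearson_r(phi_protein_a, phi_protein_b):.2f}")
```

A coefficient close to 1 across many aligned positions is what would be expected if the Φ-values carry genuine structural information rather than noise.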
Φ-value analysis has stood the test of time over three decades and we have gone from knowing virtually nothing about the fine structure of transition states for folding in the late 1980s to having a wealth of detailed information about many individual proteins. But can we draw generalisations?
The expanded transition state as a unifying mechanism for domain folding
Proteins have evolved for optimal function in vivo and not for the greatest stability or fastest folding. Protein activity often requires flexibility and dynamics for function, a stability that is high enough but not so high as to prevent turnover where necessary, a rate of unfolding for some that is sufficiently slow to inhibit aggregation via unfolding, and a trade-off between overall stability and local instability of binding and active sites. For example, simple mutations can change the rate constants for the folding of CI2 over three orders of magnitude: wild-type folds at 25 °C at 56 s⁻¹, the double mutant A16G/I57A in the folding nucleus at 2.4 s⁻¹, and R48F at 2300 s⁻¹. The active site of barnase is a source of instability (Meiering et al., Reference Meiering, Serrano and Fersht1992) and mutations elsewhere can greatly stabilise it without loss of activity (Serrano et al., Reference Serrano, Day and Fersht1993). Those factors will conspire to complicate the formulation of simple models for folding and its kinetics and cause exceptions to mechanisms.
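The CI2 rate constants quoted above can be converted into changes in the activation free energy of folding with a short Python sketch. It assumes the simple transition-state relation ΔΔG‡ = −RT ln(k_mutant/k_wild-type), with the pre-exponential factor unchanged by mutation. The sign convention (positive means a higher barrier and slower folding) and the function name are choices made for this sketch.

```python
import math

R_KCAL = 1.987e-3   # gas constant in kcal mol^-1 K^-1
T_KELVIN = 298.15   # 25 °C

# Folding rate constants for CI2 (s^-1) as quoted in the text.
K_FOLD = {"wild-type": 56.0, "A16G/I57A": 2.4, "R48F": 2300.0}


def ddg_activation(k_mutant, k_wild_type, temperature=T_KELVIN):
    """Change in activation free energy of folding (kcal/mol).

    ASSUMPTION: the pre-exponential factor is unaffected by mutation,
    so only the ratio of rate constants matters.
    Positive values mean the mutant folds more slowly than wild-type.
    """
    return -R_KCAL * temperature * math.log(k_mutant / k_wild_type)


if __name__ == "__main__":
    k_wt = K_FOLD["wild-type"]
    for name, k in K_FOLD.items():
        print(f"{name:>10}: k = {k:7.1f} s^-1, "
              f"ddG(activation) = {ddg_activation(k, k_wt):+.2f} kcal/mol")
    span = math.log10(max(K_FOLD.values()) / min(K_FOLD.values()))
    print(f"range of rate constants: {span:.2f} orders of magnitude")
```

Under these assumptions, the double mutant raises the folding barrier by about 1.9 kcal/mol and R48F lowers it by about 2.2 kcal/mol relative to wild-type, so a range of roughly 4 kcal/mol in barrier height underlies the spread in rate constants.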
‘In their search for order, chemists invented Brønsted and Hammett correlations and other free energy relationships’, so begins Jencks in his review of the movement of transition states across energy landscapes (Jencks, Reference Jencks1985). So, here is an attempt to bring some order, bearing in mind that there will be many exceptions. The unifying feature across the folding of most domains that comes from Φ-value analysis is that the highest energy transition state is an expanded, distorted form of the native structure, Figure 25 (Fersht, Reference Fersht2000). Folding ranges from the pure nucleation-condensation mechanism at one extreme, with mainly low to mid-range Φ-values, to framework mechanisms at the other extreme, with highly polarised transition states and Φ-values from 0 to 1. The expanded nature of the transition state and its observed malleability, both naturally across protein families and unnaturally on protein engineering, accommodate the slide from pure nucleation-condensation to framework mechanism, Figure 26.
Envoi
My research career has spanned seven decades that have seen ground-breaking innovations, beginning in the 1960s with the first high-resolution structures of proteins from X-ray crystallography, followed by recombinant DNA technology, DNA sequencing, new enabling biological and biophysical technologies, and advances in computational methods from simulation to machine learning today. It has been my privilege and pleasure to have been a participating protein scientist using, directly or indirectly, all these advances as they were introduced (Fersht, Reference Fersht2008, Reference Fersht2021). Over the same period, we have gone from being just observers of the properties of proteins to being able to manipulate their structures and activities. We have progressed from the pathway of protein folding being a mysterious unknown to using those methodologies to solve the folding pathways of small domains at atomic resolution. There is much more experimental work to be done on more complex systems, where Φ-values will continue to provide otherwise inaccessible information. I hope that the Φ-values gathered by us all will be used as benchmarks for computation far into the future. It has been a marvellous time to have been a protein scientist. The best is still to come as we progress to unravelling the folding and mechanisms of complex protein systems and combine our acquired experimental knowledge with improved computation to design novel, functional proteins.