Introduction
Proteins are essential for many cellular functions, including enzymatic activity, structural support, transport, and cell signaling (Alberts, Reference Alberts2017). Structurally, they are large macromolecules composed of long chains of amino acids, which fold into unique three-dimensional shapes specific to each protein (Rodwell et al., Reference Rodwell, Bender and Botham2018). Their functional roles are driven by local intermolecular interactions within specific regions called binding sites. Binding sites play a crucial role in drug discovery. They serve as ‘hot spots’ on pharmacological targets where designed drug-like molecules bind. Identifying novel binding sites expands the ‘druggable genome’, offering new strategies for therapeutic development and drug discovery (Hopkins and Groom, Reference Hopkins and Groom2002). Drug-like molecules typically target either the orthosteric binding site, where proteins interact with their natural ligands, or distinct allosteric binding sites, which have garnered special interest. Allosteric sites show higher sequence variability between protein subtypes, enabling the design of more selective drug-like molecules compared to those targeting orthosteric binding sites (Changeux, Reference Changeux2013; Wagner et al., Reference Wagner2016; Lu et al., Reference Lu2018). Furthermore, the binding sites can be formed by several protein molecules at their interaction interface (Ferré et al., Reference Ferré2014; Wang et al., Reference Wang2018a), opening another opportunity for proximity-induced drug discovery (Békés et al., Reference Békés, Langley and Crews2022; Dewey et al., Reference Dewey2023; Liu and Ciulli, Reference Liu and Ciulli2023; Tan et al., Reference Tan2024). While proteins are the most common pharmacological targets, nucleic acids, particularly RNAs, are gaining increasing interest in structure-based drug design (Chen et al., Reference Chen2024; Tong et al., Reference Tong, Childs-Disney and Disney2024). RNA plays a vital role in gene regulation and information transfer, making it an appealing target for drug development (Warner et al., Reference Warner, Hajdin and Weeks2018). Like proteins, RNA molecules are highly structured and contain binding sites that can be modulated by small molecules (Yu et al., Reference Yu, Choi and Tu2020). Both proteins and nucleic acids are flexible macromolecules, adopting multiple conformations throughout their life cycle. Accordingly, binding sites are dynamic properties, influenced by the conformational changes of macromolecules (Laskowski et al., Reference Laskowski, Gerick and Thornton2009; Changeux and Christopoulos, Reference Changeux and Christopoulos2016). A single structure of a macromolecule represents only a single point of the complete conformational space. Therefore, it is possible to overlook binding sites in the static structures (Di Pietro et al., Reference Di Pietro2017; Sun et al., Reference Sun2020). A remarkable progress has been made in developing experimental methods for identifying binding sites, including fragment screening, site-directed tethering (Hardy and Wells, Reference Hardy and Wells2004; Ludlow et al., Reference Ludlow2015), antibody-based techniques (Lawson, Reference Lawson2012), small molecule microarrays (Doyle et al., Reference Doyle2016), hydrogen-deuterium exchange (Chalmers et al., Reference Chalmers2006), and site-directed mutagenesis (Gelis et al., Reference Gelis2012). However, experimental methods are often resource-intensive and may yield negative results. In contrast, computational approaches enable large-scale identification of binding sites, exploration of macromolecular flexibility, and the ability to assess how well chemical compounds fit into these sites.
While there are several articles describing binding site prediction methods (Laurie and Jackson, Reference Laurie and Jackson2006; Henrich et al., Reference Henrich2010; Leis et al., Reference Leis, Schneider and Zacharias2010; Chen et al., Reference Chen2011; Roche et al., Reference Roche, Brackenridge and McGuffin2015; Simões et al., Reference Simões2017; Zhao et al., Reference Zhao, Cao and Zhang2020; Liao et al., Reference Liao2022; Utgés and Barton, Reference Utgés and Barton2024), we found several gaps persisting in the literature. Specifically, most of the works focus only on protein–small molecule interactions, neglecting other important binding site interaction types, such as protein–peptide, nucleic acid–small molecules, or protein–ion. Furthermore, while deep learning-based approaches have gained popularity, there has been a limited discussion on their limitations, applicability, and interpretability compared to traditional methods. In this study, we provide a comprehensive review of computational methods for the prediction of binding sites. We eliminate existing gaps in the literature with a unified overview of computational techniques across diverse binding site interaction types. The computational methods are classified based on the types of binding sites they predict: (i) protein–small molecule binding sites (Section Protein–small molecule binding sites); (ii) protein–peptide binding sites (Section Protein–peptide binding sites); (iii) nucleic acid–small molecule binding sites (Section Nucleic acid–small molecule binding sites); and (iv) protein–ion binding sites (Section Protein–ion binding site prediction). For each type of binding site, the corresponding methods are further divided into categories based on the macromolecule input representation (sequence or structure) and algorithm type (template-based, geometric, energetic, machine learning-based, and deep learning-based). If available, we also list the benchmarks and the performance metrics of different methods for each category of binding sites. Finally, we conclude the review with a discussion of current challenges and future perspectives in the field.
Protein–small molecule binding sites
In this section, we describe computational methods for the prediction of small molecule binding sites on proteins. Small molecules are usually defined as molecules with a mass
$ \le $
500 Da (Benet et al., Reference Benet2016) designed to interact with biological targets to modulate their functions (Southey and Brunavs, Reference Southey and Brunavs2023). A binding site usually has specific geometry and physicochemical properties, making the corresponding protein region distinguishable from the rest of the protein surface. Thus, methods for predicting binding sites in proteins aim to identify such regions based on the input information (e.g., sequence, structure, or other type of information). This section is organized as follows: first, we divided methods into two large groups based on the representation of input protein as sequence or structure. After that, we further split all structure-based methods into five categories based on the type of algorithm they use: template-based, geometric, energetic, machine learning-based, and deep learning-based methods. Table 1 provides a list of computational method for prediction of small molecule binding sites on proteins along with type of approach they use.
Table 1. List of methods for prediction of protein–small molecule binding sites

Sequence-based
Sequence-based methods utilize sequence or sequence-driven information about proteins. In this problem setup, the model takes the protein sequence as input and outputs a score for each position in this sequence, indicating whether the amino acid residue at this position interacts with the ligand or not. Figure 1 provides a schematic representation of the overall pipeline of sequence-based methods. Some methods search for similar sequences in a template database of sequences with known binders and map binding information from them (Kauffman and Karypis, Reference Kauffman and Karypis2009; López et al., Reference López, Valencia and Tress2007; Yang et al., Reference Yang, Roy and Zhang2013). The idea is supported by the study suggesting, that in most cases the function of the unknown protein can be identified from its sequence or structure by homology (Yao et al., Reference Yao2003). However, for the correct work of template-based methods, there should be proteins with significant sequence identity to a query sequence in a database (Devos and Valencia, Reference Devos and Valencia2000; Wilson et al., Reference Wilson, Kreychman and Gerstein2000), as it was shown that proteins with <35–40% identity may not share the same biochemical function (Todd et al., Reference Todd, Orengo and Thornton2001). The majority of sequence-based methods generate descriptors for each position in the input sequence. The feature vector can be composed of multiple different descriptors: evolutionary information comprised of a position-specific scoring matrix (PSSM) or conservation score; tabular physicochemical properties of amino acid residues on specified position: hydrophobicity, polarity, solvation potential, residue interface propensities, net charge, average accessible surface area (ASA), values from the AAindex (Kawashima et al., Reference Kawashima2007) database, and others. Note, that one can also utilize structural features predicted from sequence using other tools (including ML ones) to generate more sophisticated feature vectors; for example, solvent accessible solvent area (SASA)(Dor and Zhou, Reference Dor and Zhou2007; Garg et al., Reference Garg, Kaur and Raghava2005; Yuan and Huang, Reference Yuan and Huang2004; Ahmad et al., Reference Ahmad, Gromiha and Sarai2003; Adamczak et al., Reference Adamczak, Porollo and Meller2004; Heffernan et al., Reference Heffernan2015), secondary structure information (Faraggi et al., Reference Faraggi2012; Yaseen and Li, Reference Yaseen and Li2014; Lin et al., Reference Lin2005; Bondugula and Xu, Reference Bondugula and Xu2007; Cheng et al., Reference Cheng, Sen, Jernigan and Kloczkowski2007; Pei and Grishin, Reference Pei and Grishin2004), dihedral angles (Wood and Hirst, Reference Wood and Hirst2005; Dor and Zhou, Reference Dor and Zhou2007; Xue et al., Reference Xue2008; Heffernan et al., Reference Heffernan2015; Lyons et al., Reference Lyons2014), etc. The feature vectors are then used as inputs into the machine learning algorithm. In some methods, feature vectors from several consecutive amino acid residues (typically between 7 and 25 residues) are processed together to form a new feature vector (Chen et al., Reference Chen, Huang and Gao2014, Reference Chen2015). One can feed the feature vectors to a classical classification ML model, such as support vector machine (SVM) (Cortes and Vapnik, Reference Cortes and Vapnik1995) or random forest (RF) (Ho, Reference Ho1995), which outputs probability scores for the amino acid residues to interact with the ligand (Kauffman and Karypis, Reference Kauffman and Karypis2009; Chen et al., Reference Chen, Huang and Gao2014, Reference Chen2015; Yu et al., Reference Yu2015; Lu et al., Reference Lu2019). Other methods are based on larger DL models, such as 1D-CNN, GRU, or LSTM, feeding the whole sequence at once (Cui et al., Reference Cui2019; Lee and Nam, Reference Lee and Nam2022) and also predicting a probability score for each position. Recently, large pre-trained language models have advanced many tasks in the field of natural language processing (NLP). Protein sequences can be viewed as a ‘sentence’ with amino acid residues as ‘words’, and approaches similar to ones from NLP can be applied to them. This idea brought the development of several transformer-based (Vaswani et al., Reference Vaswani2017) models, such as ESM (Lin et al., Reference Lin2023), ProtTrans (Elnaggar et al., Reference Elnaggar2021) or ProteinBert (Brandes et al., Reference Brandes2022). Most of these models utilize BERT-like (Devlin et al., Reference Devlin2018) architectures and were trained on huge databases in a self-supervised manner for the prediction of masked tokens in sequence. It was shown that such protein language models (PLM) can capture structural information, such as secondary structure or residue-residue contacts (Rives et al., Reference Rives2021). The sequence or amino acid residue embeddings derived from these models can be used as feature vectors in ML or DL models to predict different types of binding sites (Li et al., Reference Li2023d).

Figure 1. Schematic presentation of the sequence-based methods. The top part demonstrates the pipeline for a template-based approach: the target sequence is aligned against a database of template sequences with known binding residues, and the output binding residues are defined by the consensus score from the alignment. The bottom part demonstrates the pipeline for ML or DL methods. First, the feature vectors (e.g., sequence or physicochemical properties) or the embeddings (e.g., using language models) are calculated. Then, a method uses a moving window across the sequence and feeds feature vectors for each position into an ML or DL model outputting a binding score for each position, or utilizing a larger DL model to get binding scores for each position simultaneously.
Template-based
The template-based methods operate with a database of protein complexes with known binding sites (Figure 2). Then, for the query protein, they search for similar proteins and retrieve information about binding sites from them.
Some methods rely on a global comparison of a query structure against the template structures, then superimpose known ligands or positions of binding residue position from the identified similar templates (Brylinski and Skolnick, Reference Brylinski and Skolnick2008; Wass et al., Reference Wass, Kelley and Sternberg2010; Yang et al., Reference Yang, Roy and Zhang2013; Gao et al., Reference Gao2016). However, methods that rely on global comparison can miss non-conserved binding sites, as some proteins bind the same molecule at sites with different amino acid patterns (Moodie et al., Reference Moodie, Mitchell and Thornton1996; Denessiouk et al., Reference Denessiouk, Rantanen and Johnson2001). Other template-based methods incorporate more complicated local comparisons of substructures or surface patches (Figure 2). For example, one uses geometric hashing to compare two sets of graph vertices representing the query and template protein structures. These vertices can be centers of 3D cells (Rinaldis et al., Reference Rinaldis1998), surface residues (Schmitt et al., Reference Schmitt, Kuhn and Klebe2002), surface vertices (Rosen et al., Reference Rosen1998), surface patches (Shulman-Peleg et al., Reference Shulman-Peleg, Nussinov and Wolfson2004), conserved residues (Roy et al., Reference Roy, Yang and Zhang2012), or all atoms (Barker and Thornton, Reference Barker and Thornton2003; Gold and Jackson, Reference Gold and Jackson2006). Some methods utilize the maximum clique detection method (Bron and Kerbosch, Reference Bron and Kerbosch1973) to compare the two sets of residues (Lee and Im, Reference Lee and Im2013; Viet Hung et al., Reference Viet Hung2015; Konc and Janežič, Reference Konc and Janežič2010) or surfaces (Kinoshita and Nakamura, Reference Kinoshita and Nakamura2005). Other use sub-graph isomorphism (Ullmann, Reference Ullmann1976) for the comparison of two sets of pseudo-atoms representing residue side chains (Spriggs et al., Reference Spriggs, Artymiuk and Willett2003) or calculate root-mean-square-deviation (RMSD) between spatially neighboring sets of residues in the query and the template (Stark et al., Reference Stark, Sunyaev and Russell2003). There are also many approaches that use geometric methods to identify pockets in the query protein structure, and then provide a method to compare two binding sites. The comparison methods can be divided into alignment-based or alignment-free. The alignment-based methods calculate alignment for each pair of binding sites to estimate their similarity and are usually computationally demanding. On the other hand, alignment-free methods calculate translation- and rotation-invariant descriptors, which can be compared relatively fast. These methods are much faster than alignment-based approaches, but their results may be difficult to interpret (we refer the reader to this review on binding site comparison methods (Eguida and Rognan, Reference Eguida and Rognan2022)). Nonetheless, the template-based methods in general have higher interpretability, compared to the ML ones. However, the template-based methods are resource-consuming, as for each query protein one needs to screen the entire database, and the screening time increases as the database grows. They also strongly depend on the database itself – if the database lacks certain type of binding site, the method will not be able to identify such a binding site in a query.

Figure 2. Schematic presentation of the structure template-based methods. In the first stage, the target is screened against a database of template structures with known binding sites. In the second stage, the output prediction is obtained based on the most similar template structures with respect to the target.
Geometric
Geometric methods identify pockets from the protein shape by analyzing occupancy grids, surfaces, or probes, such as spheres placed around the protein. Figure 3 demonstrates a schematic overview of geometric methods for binding site detection.
SurfNet (Laskowski, Reference Laskowski1995) is one of the first geometric algorithms; it generates an occupancy grid for protein atoms and outlines the surface around the occupied voxels to determine the cavities as the binding sites. Many other methods are based on a very similar approach, which generates an occupancy grid and, for each empty grid point, calculates the fraction of directions that are enclosed by protein atoms or surfaces (Levitt and Banaszak, Reference Levitt and Banaszak1992; Hendlich et al., Reference Hendlich, Rippmann and Barnickel1997; Huang and Schroeder, Reference Huang and Schroeder2006; Weisel et al., Reference Weisel, Proschak and Schneider2007; Halgren, Reference Halgren2009; Marchand et al., Reference Marchand2021) (see Figure 3a). POCKET (Levitt and Banaszak, Reference Levitt and Banaszak1992) casts rays along three directions and determines, if a ray goes through protein–empty points–and then again protein; in this case, the point is considered to be in the pocket. Similarly, LIGSITE (Hendlich et al., Reference Hendlich, Rippmann and Barnickel1997) casts rays along seven directions (diagonals added), and if the number of intersections is higher than a threshold, the point is a pocket point. PocketPicker (Weisel et al., Reference Weisel, Proschak and Schneider2007) calculates buriedness for each grid point, scans into 30 directions with rays of length 10Å and width 0.9Å, and counts the number of intersections. SiteMap (Halgren, Reference Halgren2009) calculates the fraction of 110 rays striking the receptor within 8Å distance. SiteMap also calculates multiple descriptors to evaluate the druggability of the detected pocket. Similarly, CAVIAR (Marchand et al., Reference Marchand2021) casts rays in 14 directions, selects relevant grid points surrounded by protein atoms, and clusters the grid points forming a binding site. Another common approach involves generating two representations of the protein, corresponding to the spheres probes with two different radii placed around the protein (Peters et al., Reference Peters, Fauck and Frömmel1996; Kawabata and Go, Reference Kawabata and Go2007). More specifically, APROPOS (Peters et al., Reference Peters, Fauck and Frömmel1996) creates a Delaunay representation of a protein and, then, rolls spheres of two different radii over the structure to remove some of the sides. Shapes removed by small spheres and not removed by large ones are considered pockets (see Figure 3b). Similarly, PHECOM (Kawabata and Go, Reference Kawabata and Go2007) and (Masuya and Doi, Reference Masuya and Doi1995) roll spheres around protein atoms, and GHECOM (Kawabata, Reference Kawabata2010) and POCASA (Yu et al., Reference Yu2010) place spheres on a 3D grid. Small spheres are removed, if they intersect with large ones; after that, clusters of small spheres are considered as pockets. In a similar approach (Kim et al., Reference Kim2008), one generates inner and outer surface meshes through Voronoi diagrams with different probe radii; the pocket is then defined as the cavity between inner and outer meshes. Note, that multiple methods (Delaney, Reference Delaney1992; Kleywegt and Jones, Reference Kleywegt and Jones1994; Brady and Stouten, Reference Brady and Stouten2000) comprise an addition-removal iterative process, where at each step a buffer around the protein is added and some of the points are removed again until the pocket is identified or there is no change after the iteration (see Figure 3c). One can use grid-based approaches, where the flood fill algorithm is performed after each addition until the pocket points become enclosed and cannot be removed (Delaney, Reference Delaney1992; Kleywegt and Jones, Reference Kleywegt and Jones1994). DoGSite (Volkamer et al., Reference Volkamer2010) generates a 3D occupancy grid for the protein, then repeatedly applies Gaussian filters to remove points with values exceeding a specified threshold; the remaining grid points are clustered to form the binding site. Other heuristics can also be applied; for example, LISE (Xie and Hwang, Reference Xie and Hwang2012) generates sets of triangle motifs with assigned scores from protein atoms. Then, the method generates a 3D grid around the protein, and for each empty point, the sum of scores from triangles whose centers lie inside the voxel is assigned. In the next step, for each empty point, the score is recalculated as a sum of scores of all empty points within a sphere of radius 11Å. Finally, top-score points are selected as final pocket centers. Another example is PASS (Brady and Stouten, Reference Brady and Stouten2000), which adds spheres around triplets of protein atoms and filters them until no spheres can be added. There are many other types of geometric approaches that treat protein structures as 3D object and apply geometry-based algorithms to detect binding sites. CAST (Binkowski et al., Reference Binkowski, Naghibzadeh and Liang2003; Liang et al., Reference Liang, Woodward and Edelsbrunner1998) creates a Delaunay representation of a protein and then applies a flow theory to determine pockets. In Del Carpio et al. (Reference Del Carpio, Takahashi and Sasaki1993), the authors utilized an iterative process in which they first calculate a protein center, identify the closest surface atom, and flag all surface atoms in sight from this atom. Further, the next closest unflagged surface atom is selected and the process repeats until all atoms are flagged. In Coleman and Sharp (Reference Coleman and Sharp2006), the method calculates the surface and the convex hull, which is defined as the smallest convex polyhedron containing all the surface points. For surface points, it calculates ‘travel depth’, as the minimal distance from a point to the convex hull, and determines pockets as points with higher ‘travel depth’. In Bock et al. (Reference Bock, Garutti, Guerra, Markstein and Xu2007), the method generates a protein surface, and surface points calculates ‘spin-images’ from classical computer vision algorithms, from which the largest spheres that can be placed on a particular surface point without intersection with other surfaces are defined. Then the method clusters large spheres and outputs them as predicted pockets. MSPocket (Zhu and Pisabarro, Reference Zhu and Pisabarro2011) generates a surface and converts it to a graph, where two surface points are considered adjacent if moving these points along their normals makes them closer to each other. Then this graph is pruned, and surface points in the left subgraphs represent final pockets. CurPocket (Liu et al., Reference Liu2020b) generates a solvent-accessible surface for a protein, calculates curvature at each point, and then clusters points with high curvature. In Xie and Bourne (Reference Xie and Bourne2007), the method constructs a Delaunay tessellation of protein C
$ \alpha $
atoms. Then it removes too long edges and determines edges for protein boundaries. And, finally, it calculates surface directions and geometric potentials, from which the binding site is predicted. Fpocket (Le Guilloux et al., Reference Le Guilloux, Schmidtke and Tuffery2009) is the most widely used geometric method for binding pocket detection. It operates via alpha spheres. An alpha sphere is a sphere that contacts with four atoms and does not contain atoms inside. Intuitively, small spheres should lie inside the protein, large spheres are outside, and cavities should correspond to spheres of intermediate radii. So, the algorithm consists of the following steps: (i) detection of alpha spheres via Voronoi tessellation; (ii) filtering out too small and too large spheres; (iii) clustering alpha spheres; and (iv) calculation of additional descriptors and pocket re-ranking. It is worth mentioning, that there are other geometric methods that rely on the previously mentioned assumption, that residues in protein functional sites are more conserved. These methods map conservation scores of residues onto surface points of respective residues, and cluster the most conserved points in space to get binding sites (Glaser et al., Reference Glaser2006; Pupko et al., Reference Pupko2002; Armon et al., Reference Armon, Graur and Ben-Tal2001; Nimrod et al., Reference Nimrod2008; Panchenko et al., Reference Panchenko, Kondrashov and Bryant2004; Capra et al., Reference Capra2009).

Figure 3. Schematic overview of geometric methods for binding site detection. (a) Generation of occupancy grid and calculation of the fraction of directions enclosed by the target macromolecule for each empty grid point (used, for example, in POCKET (Levitt and Banaszak, Reference Levitt and Banaszak1992), LIGSITE (Hendlich et al., Reference Hendlich, Rippmann and Barnickel1997), PocketPocker (Weisel et al., Reference Weisel, Proschak and Schneider2007), SiteMap (Halgren, Reference Halgren2009), CAVIAR (Marchand et al., Reference Marchand2021)). (b) Rolling of spheres with two different radii around the target macromolecule. The spheres with a larger radius remove the smaller ones. The remaining small spheres are clustered to get final predictions (used, for example, in APROPOS (Peters et al., Reference Peters, Fauck and Frömmel1996), PHECOM (Kawabata and Go, Reference Kawabata and Go2007), (Masuya and Doi, Reference Masuya and Doi1995), GHECOM (Kawabata, Reference Kawabata2010), and POCASA (Yu et al., Reference Yu2010)). (c) The addition-removal algorithm, is used in Delaney (Reference Delaney1992), Kleywegt and Jones (Reference Kleywegt and Jones1994), and Brady and Stouten (Reference Brady and Stouten2000). Each step consists of adding and removing the surface-exposed points until the convergence. The target macromolecule is represented with a lilac surface, and grid points and probe spheres are shown with circles.
Geometry-based methods are usually faster than other methods, but they often have lower accuracy due to the lack of information about the physicochemical and energetic properties of a protein structure.
Energetic
Most energetic methods operate with atom probes placed in a 3D grid around the protein and determine low-energy clusters (see Figure 4). In Goodford (Reference Goodford1985), the authors proposed the first probe-based method, searching for energetically favorable positions on 3D maps for three types of probes: water probe, amino group
$ {\mathrm{NH}}_3^{+} $
, and methyl group
$ {\mathrm{CH}}_3 $
. For this, they used three-term energy functions including Lennard-Jones, electrostatic, and hydrogen-bond potentials. In Ruppert et al. (Reference Ruppert, Welch and Jain1997), the authors used another three types of probes (hydrogen atom for hydrophobic interaction, NH for hydrogen bond donor, and C=O for hydrogen bond acceptor probe) to obtain clusters of the lowest-energy points on the protein surface. DrugSite (An et al., Reference An, Totrov and Abagyan2004), Q-SiteFinder (Laurie and Jackson, Reference Laurie and Jackson2005), and PocketFinder (An et al., Reference An, Totrov and Abagyan2005) identify binding pockets via calculation of potential energy maps with aliphatic carbon probe using Lennard-Jones potential with parameters from ECEPP/3 (Nemethy et al., Reference Nemethy1992) or GRID (Wade et al., Reference Wade, Clark and Goodford1993) force fields. SiteHound (Ghersi and Sanchez, Reference Ghersi and Sanchez2009; Hernandez et al., Reference Hernandez, Ghersi and Sanchez2009) creates maps of potential energies for six probes (methyl, phosphate oxygen, hydroxyl oxygen, peptide nitrogen, water, and carbon), where potential energy is calculated as a sum of van der Waals and electrostatic interactions with parameters from GROMOS (Van Der Spoel et al., Reference Van Der Spoel2005) force field. FTSite (Ngan et al., Reference Ngan2012; Brenke et al., Reference Brenke2009) places 16 small molecular probes (ethanol, isopropanol, isobutanol, acetone, acetaldehyde, dimethyl ether, cyclohexane, ethane, acetonitrile, urea, methylamine, phenol, benzaldehyde, benzene, acetamide, and N,N-dimethylformamide) on a dense grid around the protein, optimizes their positions with extended energy expression using CHARMM force field (Brooks et al., Reference Brooks1983) and obtains low-energy clusters of probes. AutoSite (Ravindranath and Sanner, Reference Ravindranath and Sanner2016) calculates affinity maps for hydrophobic (carbon) and hydrophilic (oxygen, hydrogen) probes using the van der Waals interaction term from AutoDock energy function (Morris et al., Reference Morris2009; Huey et al., Reference Huey2007). SuperStar (Verdonk et al., Reference Verdonk2001) uses a slightly modified approach. This method places four different probes,
$ {\mathrm{NH}}_3^{+} $
nitrogen atom, carbonyl oxygen atom, hydroxyl oxygen atom, and methyl carbon atom, into the grid, and converts these maps according to distributions of densities observed in a database of crystallographic structures. In Tsujikawa et al. (Reference Tsujikawa2016), the atom probe approach in addition takes into account the conservation of amino acid residues. The method places a carbon atom probe, calculates van der Waals energies for them, and, then, weights interaction energy by conservation scores of nearby amino acid residues. Another class of energy-based methods runs MD simulations and retrieves information about binders from trajectory analysis. OMD (Bhinge et al., Reference Bhinge2004) runs short MD simulations of protein in water and then determines binding pockets as volumes, where the RMSD of solvent molecules within the trajectory is low. SILCS (Faller et al., Reference Faller and Klon2015) runs an MD simulation of protein with water solvent and multiple small molecule fragments. It defines what protein regions are more likely to be occupied by which small molecule types. PlayMolecule CrypticScout (Martinez-Rosell et al., Reference Martinez-Rosell2020) runs mixed-solvent MD with benzene molecules and defines binding hotspots as regions with high occupancy of benzene molecules or regions with low RMSD for these molecules within the trajectory. Another approach is utilized in MDPA (Gu et al., Reference Gu, Li and Ming2022). It is based on the assumption, that ligand binding occurs in regions with higher conformational dynamics (Ming and Wall, Reference Ming and Wall2006). To calculate the external interaction of proteins with test points, the method treats proteins as elastic network structures and simulates interactions using connected springs.

Figure 4. Schematic presentation of the energy probe-based methods. (a) Different probes (shown as red, blue, and green circles) are placed on a 3D grid around the target macromolecule (shown as a lilac surface) and their interaction energies with the target’s atoms are calculated. (b) The probes corresponding to the high-energy values are filtered out. (c) The remaining probes are clustered. (d) The filtering procedure is applied to remove non-relevant clusters.
Generally, energetic methods have higher accuracy than geometric methods and high interpretability. However, they are computationally expensive and may miss some interactions not covered by the existing probe types.
Machine learning-based
Most machine learning-based methods can be described in the following way. Firstly, they calculate feature vectors for amino acid residues in the input protein; then, the method feeds the feature vectors into an ML classifier, which outputs the probability of a residue being in the binding site. Then, the method spatially clusters high-scoring residues to get the binding site composition (Figure 5). Feature vectors can contain sequential (conservation), physicochemical (electrostatics, hydrogen bonds, solvation energy, hydrophobicity, atom types), geometrical, or structural (solvent accessibility, secondary structure, local geometry) descriptors. In Gutteridge et al. (Reference Gutteridge, Bartlett and Thornton2003), the method identifies catalytic residues in enzymes. It calculates multiple descriptors for each residue: conservation score, relative solvent accessibility, secondary structure, and closeness to a cleft identified by Surfnet (Laskowski, Reference Laskowski1995), and residue depth. This feature vector is used as input for a single-layer NN. Petrova and Wu (Reference Petrova and Wu2006) calculates sequential and structural properties of residues (conservation, flexibility, solvent accessibility, position on the protein surface, hydrogen bonds, secondary structure) and classifies them using an SVM. Tong et al. (Reference Tong2009) calculate electrostatic features, geometric properties, and sequence-based conservation for each residue and classify them using a maximum-likelihood algorithm. Qiu and Wang (Reference Qiu and Wang2011) calculates eight structural properties (solvent accessible surface area, solvation energy, hydrophobicity, depth index, protrusion index, preference, theoretical b-factor) for residues, and uses a Random Forest classifier. ISMBLab-LIG (Jian et al., Reference Jian2016) first calculates 3D probability density maps that describe interacting atom types around the protein surface using pre-calculated distributions from a database. Then, for each surface atom, the method collects a feature vector of surface local geometry combined with properties retrieved from the described density maps. The method uses a NN model as a classifier. GRaSP (Santana et al., Reference Santana2020) retrieves a set of physicochemical properties (solvent relative accessibility, atom types, interaction level) from an atomic graph, and then uses an extremely randomized tree to classify residues as binding/non-binding.

Figure 5. Schematic presentation of the machine learning-based methods. On the top, the target structure is represented as a surface, and feature vectors are calculated for the surface points. On the bottom, feature vectors are calculated for the target’s residues or atoms. Then, an ML classifier predicts the binding scores for the points, residues, or atoms, based on the input feature vectors. Finally, the output predictions are filtered by a score threshold and clustered.
Another approach is to classify points or patches on the protein surface instead of residues. Bradford and Westhead (Reference Bradford and Westhead2005) classifies surface points, where for each point a feature vector consists of seven properties: shape index, curvedness, conservation, electrostatic potential, hydrophobicity, residue interface propensity, and solvent accessible surface area. P2Rank (Krivák and Hoksza, Reference Krivák and Hoksza2018) generates a protein surface, and projects features calculated for protein atoms to surface points. Afterward, it predicts the ligandability of each point using an RF classifier and clusters points with high scores. Similarly, SiteFerret (Gagliardi and Rocchia, Reference Gagliardi and Rocchia2023) first calculates a set of features for surface points, incorporating information about cavities, and then classifies them using the Isolation Forest method. It is also possible to directly classify cavities detected by a geometric method. SCREEN (Nayal and Honig, Reference Nayal and Honig2006) first identifies all possible cavities on the protein surface. After that, it calculates a large set of cavity descriptors of different types, such as cavity size, electrostatics, hydrogen bonding, hydrophobicity, polarity, amino acid composition, rigidity, secondary structure, and cavity shape. An RF model classifies cavities as drug-binding/non-drug-binding and selects a smaller fraction of relevant descriptors. FEATURE (Bagley and Altman, Reference Bagley and Altman1995, Reference Bagley and Altman1996; Wei and Altman, Reference Wei and Altman1998, Reference Wei and Altman2003) represents a set of tools for the prediction of binding sites of different types, such as calcium binding (Wei and Altman, Reference Wei and Altman1998), ATP binding (Wei and Altman, Reference Wei and Altman2003), serine protease active sites (Bagley and Altman, Reference Bagley and Altman1996), and others (Liang et al., Reference Liang2003). It represents microenvironments around a protein as concentric shells with centers placed on a grid and calculates physicochemical properties within these shells. These properties are compared with a set of features for known binding sites and non-binding sites. Then, the Bayesian classifier (Friedman et al., Reference Friedman, Geiger and Goldszmidt1997) is used to distinguish binding sites from non-binding ones. There are also consensus-based methods that retrieve predictions from multiple other methods, including geometric, energy-based, or template-based approaches and combine them into final predictions using an ML-based re-scoring. For example, MetaPocket2.0 (Zhang et al., Reference Zhang2011) aggregates results from eight different methods:
$ {\mathrm{LIGSITE}}^{csc} $
(Huang and Schroeder, Reference Huang and Schroeder2006), PASS (Brady and Stouten, Reference Brady and Stouten2000), Q-SiteFinder (Laurie and Jackson, Reference Laurie and Jackson2005), Surfnet (Laskowski, Reference Laskowski1995), Fpocket (Le Guilloux et al., Reference Le Guilloux, Schmidtke and Tuffery2009), GHECOM (Kawabata, Reference Kawabata2010), ConCavity (Capra et al., Reference Capra2009), and POCASA (Yu et al., Reference Yu2010). Another example is COACH (Yang et al., Reference Yang, Roy and Zhang2013), which combines prediction results from TM-SITE (Yang et al., Reference Yang, Roy and Zhang2013), S-SITE (Yang et al., Reference Yang, Roy and Zhang2013), COFACTOR (Roy et al., Reference Roy, Yang and Zhang2012), FINDSITE (Brylinski and Skolnick, Reference Brylinski and Skolnick2011), and ConCavity (Capra et al., Reference Capra2009).
ML-based methods strongly rely on the dataset construction and calculated feature vectors, and can produce false positive predictions – that is, identification of ‘undruggable’ regions (Broomhead and Soliman, Reference Broomhead and Soliman2017). Moreover, even extensive feature engineering does not guarantee capturing all the information relevant to the binding site prediction.
Deep learning-based
The accumulation of large amounts of structural data and advancements in deep learning methods in other fields have led to the development of top-performing methods in structural bioinformatics problems, such as protein structure prediction (Jumper et al., Reference Jumper2021; Baek et al., Reference Baek2021) DL-based approaches may not require hand-crafted feature engineering and may be capable of capturing relevant structural context by construction. To begin with, these methods typically operate on protein structures represented as a point cloud, a graph, or a 3D density grid. Although a graph is typically a 2D representation of a structure, the 3D information can be encoded as the feature vectors of graph nodes or edges. Therefore, most of the DL-based methods can be classified based on the structure representation, and further split into two classes: (i) where a DL model performs segmentation using the entire graph or grid; and (ii) where a DL model samples sub-graphs or sub-grids and classifies whether their centers correspond to a binding site or not. Figure 6 demonstrates a schematic representation of this idea, and we provide more details about methods in each class below.

Figure 6. Schematic presentation of the DL-based methods. Most of the methods utilize graph-based or voxel grid representations of the target macromolecular structure. Then, they sample either sub-graphs or sub-grids around the structure and classify their centers as belonging to the binding site or not. Alternatively, they use segmentation models to operate with the full graph or grid.
We start with methods that sample small 3D voxel grids around a protein structure and classify whether the grid center corresponds to a binding site. DeepSite (Jiménez et al., Reference Jiménez2017) was one of the first methods developed for predicting ligand binding sites. It represents a protein structure as a 3D voxel grid with 1Å voxels in size, where each voxel contains eight channels for atoms of different types: hydrophobic, aromatic, hydrogen bond acceptor, hydrogen bond donor, positive ionizable, negative ionizable, metal, and excluded volume. Each channel of a voxel stores the occupancy value of nearby atoms of the respective type, where occupancy is calculated as
$ n(r)=1-\mathit{\exp}\left(-{\left({r}_{vdw}/r\right)}^{12}\right) $
. Then, from the generated 3D voxel grid of a protein, subgrid cubes of size
$ 16\times 16\times 16 $
voxels are sampled through a sliding window. These cubes are provided as input into a 3D CNN, which outputs a probability score for the cube center being closer than 4Å to the geometric center of a binding site. In Jiang et al. (Reference Jiang2019a), the authors utilized a similar approach but parameterized the input voxel grid differently. They calculate four pseudo-energy channels instead of using an occupancy-based grid: (i) a shape channel retrieved as output from the LIGSITE (Hendlich et al., Reference Hendlich, Rippmann and Barnickel1997) method; (ii) a van der Waals potential energy channel of an -CH3 probe; (iii) a hydrogen bond potential channel using an -OH probe; and (iv) an electric potential energy channel. DeepPocket (Aggarwal et al., Reference Aggarwal2021) re-scores pockets predicted by Fpocket (Le Guilloux et al., Reference Le Guilloux, Schmidtke and Tuffery2009) using a similar 3D CNN model. For this, it uses libmolgrid (Sunseri and Koes, Reference Sunseri and Koes2020) to obtain a cubic 3D voxel grid of size 23.5Å with a voxel size of 0.5Å around a pocket and passes it into a 3D CNN model that classifies the pocket as ligandable or non-ligandable. After that, the method also generates a segmented representation of ligandable pockets by passing a larger (32Å) cubic grid into another U-Net-like (Ronneberger et al., Reference Ronneberger, Fischer and Brox2015) 3D CNN. DeepSurf (Mylonas et al., Reference Mylonas, Axenopoulos and Daras2021), CAT-Site (Petrovski et al., Reference Petrovski, Hribar-Lee and Bosnić2022), and SAPocket (Wang et al., Reference Wang, He and Zhu2023c) instead of using a sliding window, sample points on the protein surface, calculate voxelized representations for cubic grids centered on these points and pass these grids as input into a 3D CNN classification model. FRSite (Jiang et al., Reference Jiang2019b) utilizes the faster R-CNN approach (Ren et al., Reference Ren2016): it first passes a voxelized representation of a protein into a 3D Region Proposal Network, and further feeds proposals into a 3D CNN for classification. BiteNet (Kozlovskii and Popov Reference Kozlovskii and Popov2020) utilizes the YOLO approach for real-time object detection in videos (Redmon et al., Reference Redmon2016). It first obtains a voxelized representation of an input protein and applies a 3D CNN model to it. The model splits the input grid into cells, where each cell contains predicted values for the probabilities of a binding site center being within the cell and the center of a binding site with respect to the cell. Kalasanty (Stepniewska-Dziubinska et al., Reference Stepniewska-Dziubinska, Zielenkiewicz and Siedlecki2020) was one of the first methods to perform 3D segmentation of binding sites in a single pass. The method represents a protein as a 3D grid of a constant size of 70Å along each direction with a 2Å voxel size and feeds it into a 3D U-Net (Ronneberger et al., Reference Ronneberger, Fischer and Brox2015). There are multiple follow-up methods of the Kalasanty approach with adjusted 3D CNN model (Kandel et al., Reference Kandel, Tayara and Chong2021; Li et al., Reference Li2022; Li et al., Reference Li2023b; Nazem et al., Reference Nazem2021; Liu et al., Reference Liu2023). PUResNet (Kandel et al., Reference Kandel, Tayara and Chong2021) modified the encoder in the U-Net model to ResNet. RefinePocket (Liu et al., Reference Liu2023) used an attention-enhanced encoder and a mask-guided decoder inside the U-Net. RecurPocket (Li et al., Reference Li2022) and GLPocket (Li et al., Reference Li2023b) used a recurrent LMSER (Least Mean Square Error Reconstruction) network with gated recurrent refinement. DUNet (Wang et al., Reference Wang2022b) added a DenseNet (Huang et al., Reference Huang2017) encoder into the U-Net model. InDeep (Mallet et al., Reference Mallet2022) used a 3D U-Net for the prediction and segmentation of small molecule binding sites occurring on protein–protein interfaces. PointSite (Yan et al., Reference Yan2022) represents a point cloud-based segmentation approach: it constructs a point cloud from all protein atoms, converts it into a 3D sparse grid, and applies a segmentation model with a U-Net architecture based on submanifold sparse convolutions.
A different approach is to consider the 3D structure of a protein as a graph. Some methods sample points on the protein surface or around the protein on a grid, and analyze the graph constructed from these points, that is predicting whether a point corresponds to a binding site or not. MaSIF (Gainza et al., Reference Gainza2020) pre-calculates physicochemical properties for the whole protein surface. Then, from this surface, the method samples patches represented as a graph with surface points as nodes and with surface point feature vectors as node feature vectors, containing geometric (shape index, distance-dependent curvature) and chemical (hydropathy, continuum electrostatics, free electrons/protons) descriptors with additional values for geodesic coordinates. Then, a GNN with geodesic convolutions is applied to the patch graph with multiple orientations to obtain an embedding vector for the input patch. The authors used this approach for different tasks: prediction of protein–ligand binding sites, prediction of protein–protein binding sites, and fast scanning of protein surfaces for identification of protein–protein binder partners. In dMaSIF (Sverrisson et al., Reference Sverrisson2021), the authors further improved this approach and made the method fully differentiable without the need for memory- and computation-demanding pre-calculation of surface descriptors. SiteRadar (Evteev et al., Reference Evteev, Ereshchenko and Ivanenkov2023) selects points on a 3D grid outside of the protein, then generates a graph with protein atoms around the selected points as the nodes, and uses a GNN to analyze the graphs, predicting whether the point center belongs to a binding site or not. PocketAnchor (Li et al., Reference Li2023c) samples a set of ‘anchor’ points around the protein, representing potentially ligandable positions. For each ‘anchor’ point, it further generates a graph with protein atoms and surface points within 6Å from the ‘anchor’ center. Atom and surface point nodes contain a different number of geometric and chemical features. The graphs are processed with MPNNs, outputting binding site scores for each ‘anchor’ point. Of note, GraphSite (Shi et al., Reference Shi2022) uses a graph representation of local protein regions and utilizes GNNs to classify ligand binding sites into 14 classes. A higher-level approach is to provide the whole protein graph as input into a graph neural network model for segmentation. For example, GraphBind (Xia et al., Reference Xia2021) uses residue centroids as node centers and structural and sequential features of residues as node features and passes the graph into a Hierarchical Graph Neural Network, which predicts a score for each residue indicating whether it is on the binding interface with nucleic acid. Similarly, GraphPLBR (Wang et al., Reference Wang2023d) operates on residues as nodes. FABind (Pei et al., Reference Pei2023) uses two separate GNNs, one for working with the protein residue graph and another operating on the ligand atomic graph. Embeddings from the two models are combined to make a ligand-specific prediction of binding residues. GrASP (Smith et al., Reference Smith2023) builds a graph using all protein atoms within 5Å from the surface and uses a GNN with graph attention to classify atoms as binding or non-binding. Similarly, GU-Net (Nazem et al., Reference Nazem2023) uses all protein atoms for the graph and predicts atom scores using a U-Net-like Graph Convolutional Network (Gao and Ji, Reference Gao and Ji2019). EquiPocket (Zhang et al., Reference Zhang2023) builds graphs using both protein atoms and surface points and applies GNNs with E(3)-equivariant convolutions (Satorras et al., Reference Satorras, Hoogeboom and Welling2021) to identify binding atoms. LigBind (Xia et al., Reference Xia, Pan and Shen2023) demonstrated another approach: it first pre-trains GNNs for a ligand-general binding residue predictor and a feature extractor for ligand-residue pair embeddings, and then fine-tunes ligand-specific binding residue predictors for more than 1000 ligand types from the BioLip (Yang et al., Reference Yang, Roy and Zhang2012) database.
Recently, models for protein structure prediction have made significant advances. A breakthrough occurred in the 14th Critical Assessment of Protein Structure Prediction (CASP14) challenge (Kryshtafovych et al., Reference Kryshtafovych2021) when AlphaFold2 (Jumper et al., Reference Jumper2021) achieved almost experimental accuracy in the prediction of full-atom protein structures. AlphaFold2 takes an MSA and structural templates as inputs and utilizes a complicated deep-learning model with the newly introduced Evolutionary Transformer. Some approaches for binding site prediction use these methods to generate protein structure models from sequences alone, and retrieve structural features along with sequential ones for each residue (Littmann et al., Reference Littmann2021; Ho et al., Reference Ho2021; Seo et al., Reference Seo2024). Furthermore, some models extract a structural graph from a generated protein model, where residues correspond to the graph nodes and the their features correspond to the node features. Then, the composed graph is used as the input for a graph neural network (GNN) (Yuan et al., Reference Yuan2022a; Zhang and Xie, Reference Zhang and Xie2023). Previously, it was shown that AlphaFold2 can successfully predict protein–peptide structures (Chang and Perez, Reference Chang and Perez2023; Tsaban et al., Reference Tsaban2022), but it was not clear whether AlphaFold2 could be used for the analysis of interactions between proteins and small molecules, as the latter were absent in the training objective for modeling. However, in some cases, AlphaFold2 predicts rotamers as if they were interacting with small molecules, suggesting that it can be used to train a binding site detection model. Moreover, as AlphaFold2 can predict protein–peptide complexes, it can be reasoned that this model can also be useful for the identification of interactions with small molecules, as they or their fragments can resemble amino-acid side chains (Polizzi and DeGrado, Reference Polizzi and DeGrado2020). AF2BIND (Gazizov et al., Reference Gazizov2023) is constructed as follows: as input to the AlphaFold2 model, it provides a sequence of a target protein, the protein backbone structure as a template, and 20 ‘bait’ amino acids as individual chains, appending them to the sequence with large offsets. The method further uses AlphaFold2 output pairwise representations between target residues and each of the twenty ‘bait’ amino acids as input into a logistic regression model, predicting whether a target residue is ligand binding or not. The authors demonstrated a correlation between the chemical properties of the small molecule ligands and the 20 ‘bait’ amino acids.
Finally, there are methods combining multiple representations of the protein. PocketMiner (Meller et al., Reference Meller2023) uses a geometric vector perceptron GNN (GVP-GNN) (Jing et al., Reference Jing2020) and a 3D CNN for the prediction of putative cryptic pockets. For this, the authors generated 40-ns simulations for 37 proteins and trained the models to predict the positions in each structure where a pocket would open during a short simulation. The authors showed that both GVP-GNN and 3D CNN work equally well.
Benchmarks
Most of the newest ML- and DL-based methods rely on scPDB (Desaphy et al., Reference Desaphy2015), PDBbind (Liu et al., Reference Liu2015), or BioLip (Yang et al., Reference Yang, Roy and Zhang2012) databases for training and validation. scPDB (Desaphy et al., Reference Desaphy2015) is a large database containing
$ \sim $
16,000 complexes, where each entry is annotated with calculated properties for the ligand, cavity, and interactions. PDBbind (Liu et al., Reference Liu2015) is a curated database of protein–ligand complexes (
$ \sim $
23,000), with experimentally determined binding affinity. BioLip (Yang et al., Reference Yang, Roy and Zhang2012; Zhang et al., Reference Zhang2024) contains
$ \sim $
460,000 structures of proteins or nucleic acids, with a total of approximately
$ \sim $
890,000 ligands. BioLip includes a wide range of classes of macromolecules and ligands, allowing researchers to construct various training and validation sets. In addition, it incorporates a comprehensive procedure to select relevant ligands and includes cross-references with many other databases (PDBbind (Liu et al., Reference Liu2015), BindingDB (Gilson et al., Reference Gilson2016), SIFTS (Dana et al., Reference Dana2019), UniProt (Consortium, Reference Consortium2019), and DrugBank (Knox et al., Reference Knox2024), etc.). Note that the provided numbers are for 2024; these are likely to increase in future versions of the databases. Finally, other approaches rely on training datasets compiled from PDB, followed by structure refinement, clustering, and filtering of redundant structures.
There are two benchmark sets, COACH420 (Krivák and Hoksza, Reference Krivák and Hoksza2018) and HOLO4K (Schmidtke et al., Reference Schmidtke2010), which are widely used for the comparison of binding site detection methods. COACH420 (Krivák and Hoksza, Reference Krivák and Hoksza2018) is a dataset of 420 single-chain proteins containing natural compounds and drug-like ligands. It was first created for the evaluation of the P2Rank method (Krivák and Hoksza, Reference Krivák and Hoksza2018) as a subset of a test set from (Roy et al., Reference Roy, Yang and Zhang2012; Yang et al., Reference Yang, Roy and Zhang2013), without proteins from the training set of P2Rank. HOLO4K (Schmidtke et al., Reference Schmidtke2010), in turn, is a large set of protein–ligand complexes. It was initially composed for the validation of the PocketFinder (An et al., Reference An, Totrov and Abagyan2005) method and later used for a comprehensive large-scale comparison of binding site prediction methods (Schmidtke et al., Reference Schmidtke2010). Interestingly, originally, it comprised apo complexes; but after the work of (Krivák and Hoksza, Reference Krivák and Hoksza2018), a subset of holo complexes is mainly used. It is important to note that, although the COACH420 and HOLO4K benchmarks are used by many methods, most of them perform additional filtering (e.g., removing irrelevant ligands or addressing data leakage between the training and test sets), resulting in slightly different subsets of COACH420 and HOLO4K. Therefore, a direct comparison of methods based on these benchmarks may not be as straightforward as it may seem. Nonetheless, one may see the performance metrics of different methods in Supplementary Tables S2–S10.
The other benchmarks include: CHEN11 (Chen et al., Reference Chen2011), B48/U48 (Huang and Schroeder, Reference Huang and Schroeder2006), B210 (Huang and Schroeder, Reference Huang and Schroeder2006), DT198 (Zhang et al., Reference Zhang2011), ASTEX (Hartshorn et al., Reference Hartshorn2007), and CASP (Lopez et al., Reference Lopez, Ezkurdia and Tress2009; Schmidt et al., Reference Schmidt2011; Gallo Cassarino et al., Reference Gallo Cassarino, Bordoli and Schwede2014). CHEN11 (Chen et al., Reference Chen2011) is a non-redundant dataset of 251 proteins, where each structure is the most representative structure of a family, with a ligand superimposed from the closest homolog in cases where a ligand is absent in the original structure. B48/U48 (Huang and Schroeder, Reference Huang and Schroeder2006) is a small dataset of pairs of apo and holo-structures of the same protein. The Astex Diverse set (Hartshorn et al., Reference Hartshorn2007) is a small benchmark for docking methods, which was used as the binding site detection benchmark (Le Guilloux et al., Reference Le Guilloux, Schmidtke and Tuffery2009; Yan et al., Reference Yan2022). Some of the older classical methods used CASP8 (Lopez et al., Reference Lopez, Ezkurdia and Tress2009), CASP9 (Schmidt et al., Reference Schmidt2011), and CASP10 (Gallo Cassarino et al., Reference Gallo Cassarino, Bordoli and Schwede2014) benchmarks to evaluate their performance for the prediction of ligand-binding residues. However, these benchmarks are much smaller compared to the other ones described above. Finally, we would like to note that, while for older methods, the most used metrics correspond to binary classification metrics derived from the residue scores, newer methods include metrics based on the distances between the predicted binding site center and the true binding site, as well as the overlap of the predicted and true binding site cavity in the case of binding site segmentation. Supplementary Section Metrics provides more details on commonly used performance metrics for binding site prediction methods.
Protein–peptide binding sites
Protein–protein interactions (PPIs) regulate numerous essential biological pathways, making them a key class of pharmacological targets (Ruffner et al., Reference Ruffner, Bauer and Bouwmeester2007). There is an increasing need to develop inhibitors of intracellular PPIs to modulate critical biological processes. However, PPIs have long been considered difficult to target (Tsomaia, Reference Tsomaia2015). On the one hand, large biologics, which are effective in targeting extracellular PPIs, cannot penetrate cell membranes to reach intracellular PPIs. On the other hand, traditional small molecule scaffolds can cross membranes but are often unsuitable for the large, shallow surfaces typical for PPI interfaces (Tsomaia, Reference Tsomaia2015). PPI interfaces exhibit distinct characteristics, such as larger contact areas (
$ \sim 1500-3000{\mathring{A}}^2 $
for PPI compared to
$ \sim 300-1000{\mathring{A}}^2 $
for protein–small molecule interactions (Smith and Gestwicki, Reference Smith and Gestwicki2012)) and the absence of deep binding pockets usually found in small molecule interactions (
$ \sim 270{\mathring{A}}^3 $
in volume (Buchwald, Reference Buchwald2010)). Notably, PPI interfaces often contain smaller binding pockets (
$ \sim 100{\mathring{A}}^3 $
(Fuller et al., Reference Fuller, Burgoyne and Jackson2009)) that play a crucial role in binding affinity (Clackson and Wells, Reference Clackson and Wells1995). Peptides and peptide-based molecules occupy a unique position between small molecules (with a molecular weight
$ <0.5 kDa $
) and biologics (
$ >150 kDa $
). They offer a promising therapeutic approach for targeting intracellular PPIs, as they can potentially combine the benefits of biologics, such as low toxicity, high specificity, and strong affinity, with the membrane permeability of small molecules (Tsomaia, Reference Tsomaia2015). The successful design of therapeutic peptides requires detailed knowledge of the binding sites on their protein targets. Identifying new protein–peptide binding sites could broaden the range of druggable targets, opening up new opportunities for drug discovery. Many methods for protein–peptide binding site prediction utilize approaches similar to the ones described in the previous section, but we still cover these methods here to highlight some specific characteristics. See Table 2 for an extensive list of methods for prediction of protein-peptide binding sites.
Table 2. List of methods for prediction of protein–peptide binding sites

Machine learning-based
Multiple sequence-based methods calculate features for each residue (e.g., PSSM, predicted ASA, SS, physicochemical properties, and intrinsic disorder) in an input protein sequence and pass these features as input into a classical ML model (Taherzadeh et al., Reference Taherzadeh2016; Zhao et al., Reference Zhao, Peng and Yang2018; Iqbal and Hoque, Reference Iqbal and Hoque2018; Shafiee et al., Reference Shafiee, Fathi and Taherzadeh2022). More advanced approaches tend to rely on additional information. SPRINT-Str (Taherzadeh et al., Reference Taherzadeh2018) and Multi-VORFFIP (Segura et al., Reference Segura, Jones and Fernandez-Fuentes2012) calculate structural and physicochemical descriptors for each residue in a target protein and use RF for the binary classification of residues as binding/non-binding. PINUP (Liang et al., Reference Liang2006) calculates structural and physicochemical descriptors for interface residues, then selects surface patches by choosing a central surface residue and 19 residues nearest to it, and then classifies the patch based on a set of features for the 20 patch residues. P2Rank-Pept (Krivák et al., Reference Krivák, Jendele and Hoksza2018) calculates geometrical and physicochemical descriptors for protein surface points and classifies these points using RF. PepSite (Trabuco et al., Reference Trabuco2012) uses spatial PSSMs for the identification of peptide-binding hot spots on the protein surface. For this, the method estimates the densities of protein atoms around each amino acid type in the peptide and encodes them into a 3D grid. Then, PepSite screens the target protein with these S-PSSM grids and identifies appropriate hot spots. PepBind (Zhao et al., Reference Zhao, Peng and Yang2018) is a consensus method combining predictions from SVMpep, S-SITE, and TM-SITE.
Deep learning-based
The sequence-based approaches, such as VisualP (Wardah et al., Reference Wardah2020) encode a window around a residue into a 2D image and apply a CNN. MTDSite (Sun et al., Reference Sun2021) uses a BiLSTM to predict binding residues for DNA, RNA, carbohydrates, and peptides. PepBCL (Wang et al., Reference Wang2022a) and PepNN-Seq (Abdin et al., Reference Abdin2022) retrieve protein sequence embeddings from the language model ProtTrans (Elnaggar et al., Reference Elnaggar2021). Similarly to the machine learning-based methods, recent deep learning-based approaches tend to incorporate different types of information into the model. PepCNN (Chandra et al., Reference Chandra2023) represents residues using sequential and structural descriptors, along with embeddings from the ProtT5 (Elnaggar et al., Reference Elnaggar2021) model, and passes them into a 1D CNN model. PepNN-Struct (Abdin et al., Reference Abdin2022) uses a GNN with attention to extract embeddings from a graph of protein residues and uses multi-head attention to encode a peptide sequence for predicting of binding residues. The authors also demonstrated that pre-training on protein–protein complexes significantly increases the model accuracy in predicting peptide-binding residues. GraphPPepIS (Li et al., Reference Li2023a) represents both protein and peptide structures as graphs and passes them into a GCN, extracting binding residues on both the protein and peptide sides. GAPS (Zhu et al., Reference Zhu2023) encodes a protein into a point cloud of atoms and uses a geometric attention-based network to classify atoms as binding or non-binding. BiteNet
$ {}_{Pp} $
(Kozlovskii and Popov, Reference Kozlovskii and Popov2021a) represents peptide binding sites as a set of hotspots and utilizes an approach similar to BiteNet (Kozlovskii and Popov, Reference Kozlovskii and Popov2020): it encodes an input protein into a 3D voxel grid and feeds it into a 3D CNN, which splits the grid into cells containing probabilities of a peptide binding site hotspot being in the cell and hotspot center coordinates. DeepProSite (Fang et al., Reference Fang2023) builds a model using ESMFold (Rives et al., Reference Rives2021), retrieves embeddings using the ProtTrans (Elnaggar et al., Reference Elnaggar2021) model, and feeds the graph into a Graph Transformer network (Ingraham et al., Reference Ingraham2019) afterward to predict protein–protein and protein–peptide binding sites.
Template- and energy-based methods
There are a few template-based and energy-based approaches. For example, SPOT-peptide (Litfin et al., Reference Litfin, Yang and Zhou2019) and InterPep (Johansson-Åkhe et al., Reference Johansson-Åkhe, Mirabello and Wallner2019) screen a query protein against a database of known protein–peptide complexes. Energy-based methods sample small molecule probes around a protein and cluster low-energy conformations to get final predictions. PeptiMap (Lavi et al., Reference Lavi2013) adapts the FTmap (Brenke et al., Reference Brenke2009) method for protein–small molecule binding site prediction with additional post-processing for filtering out irrelevant sites. ACCLUSTER (Yan and Zou, Reference Yan and Zou2014) scans a protein surface with 20 amino acid probes. In Verschueren et al. (Reference Verschueren2013), the method uses polypeptide fragments from the BriX (Vanhee et al., Reference Vanhee2011) database mapped around the target protein and generates ensembles of energetically favorable protein–peptide complexes.
Benchmarks
For protein–peptide binding sites, the most widely used benchmark is TS125, which is a test set from SPRINT-Seq (Taherzadeh et al., Reference Taherzadeh2016), constructed as a non-redundant subset of 1,279 protein–peptide complexes from the BioLip database (Yang et al., Reference Yang, Roy and Zhang2012). Other benchmarks include TS092, TS251, and TS639. TS092 is a test benchmark from PepNN (Abdin et al., Reference Abdin2022), designed as a subset of protein–peptide complexes from the PDB, submitted after a specific date and having a sequence identity lower than
$ 30\% $
with all protein targets in the training set. The TS251 benchmark from InterPep (Johansson-Åkhe et al., Reference Johansson-Åkhe, Mirabello and Wallner2019) was constructed such that the TM-score (Zhang and Skolnick, Reference Zhang and Skolnick2005) of the protein structures is lower than
$ 0.5 $
with all the structures in the template database. Finally, TS639 from PepBind (Zhao et al., Reference Zhao, Peng and Yang2018) is a different subset of T1279, used for training and validation of SPRINT-Seq (Taherzadeh et al., Reference Taherzadeh2016), described above. Table 3 lists performance metrics (AUC and MCC, see also Supplementary Section Metrics) for the protein–peptide binding site prediction methods. As one can see, the top methods are ML- or DL-based, with
$ {\mathrm{BiteNet}}_{Pp} $
(Kozlovskii and Popov, Reference Kozlovskii and Popov2021a) being the top-performing one.
Table 3. Performance of protein–peptide binding site detection methods on test benchmarks retrieved from Kozlovskii and Popov (Reference Kozlovskii and Popov2021a), Abdin et al. (Reference Abdin2022), and Fang et al. (Reference Fang2023)

Nucleic acid–small molecule binding sites
RNA molecules are emerging as a significant class of pharmacological targets (Warner et al., Reference Warner, Hajdin and Weeks2018). Efforts in RNA-targeting drug discovery span various approaches, such as designing stabilizers for DNA G-quadruplexes (Ortiz de Luzuriaga et al., Reference Ortiz de Luzuriaga, Lopez and Gil2021), developing antibiotics that target riboswitches (Panchal and Brenk, Reference Panchal and Brenk2021), using antisense RNA (McClorey and Wood, Reference McClorey and Wood2015), and creating RNA-targeting antivirals. RNA targets that expand the druggable genome, including those associated with ‘undruggable’ proteins or non-coding microRNAs, hold particular promise (Matsui and Corey, Reference Matsui and Corey2017). However, the development of RNA-targeted drugs faces significant challenges, such as limited chemical diversity and the dynamic nature of RNA structures (Falese et al., Reference Falese, Donlic and Hargrove2021). To advance RNA-targeting drug discovery, efficient tools for detecting structure-specific RNA-small molecule binding sites are needed.
There are many approaches targeting binding sites on proteins; however, there is a limited number of methods for nucleic acids. Table 4 provides a list of methods for prediction of nucleic acid-small molecule binding sites.
Table 4. List of methods for prediction of nucleic acid–small molecule binding sites

Knowledge-based
Firstly, there are several knowledge-based methods. Rsite (Zeng et al., Reference Zeng2015) and Rsite2 (Zeng and Cui, Reference Zeng and Cui2016) calculate distances between nucleotides based on tertiary and secondary structures, respectively, and determine nucleotides that are the most distant from others as the binding nucleotides. Similarly, RBind (Wang, Jian, et al., Reference Wang2018b) calculates the degree and closeness of nodes in a nucleotide network and determines binding nucleotides as those with values exceeding a specified threshold. RNetsite (Liu et al., Reference Liu2024) represents an RNA molecule as a graph and calculates local (degree, neighborhood connectivity) and global (betweenness centrality, closeness, and eccentricity) properties for each node of the graph. Then, each node is classified as binding or non-binding based on the property statistics computed from a reference set of RNA molecules.
Energetic
To the best of our knowledge, only two methods use an energy-based approach. SILCS-RNA (Kognole et al., Reference Kognole, Hazel and MacKerell2022) runs simulations of a target macromolecule in a mixed solvent with eight different probes. From these simulations, the method calculates a 3D grid with energy maps, which can be used for binding site identification, docking, and binding affinity evaluation tasks. SHAMAN (Panei et al., Reference Panei, Gkeka and Bonomi2024) is also a probe-based approach, but adds a metadynamics enhanced-sampling technique to explore wider conformational changes of the input RNA molecule.
Machine learning-based
Machine learning-based methods for binding site detection in nucleic acids have emerged very recently. RNAsite (Su et al., Reference Su, Peng and Yang2021) calculates sequential features (e.g., conservation from MSA) and structural features (e.g., topological properties, solvent accessibility, and Laplacian norm) for each nucleotide and passes them into an RF classifier to distinguish between binding and non-binding nucleotides. Similarly, DrugPred_RNA (Rekand and Brenk, Reference Rekand and Brenk2021) calculates a set of simple structural descriptors such as size, shape, and polarity for a pocket and uses an XGBoost model (Chen and Guestrin, Reference Chen and Guestrin2016) to classify it as druggable or non-druggable. As descriptors are constructed in a macromolecule type-agnostic way, the model is first pre-trained on a protein dataset and then fine-tuned for binding sites in RNAs.
Deep learning-based
RLBind (Wang, Zhou, et al., Reference Wang2023b) calculates local and global sequential features (e.g., nucleotide types and evolutionary conservation) and structural features (e.g., network topological properties, biochemical properties, and ASAs), retrieves a window of 11 nucleotides for each position, and feeds it into a 1D CNN that classifies the position as binding/non-binding. RNet (Möller et al., Reference Möller2022) utilizes an approach similar to DeepSite for predicting binding sites in proteins (Jiménez et al., Reference Jiménez2017). It represents a macromolecule structure as an
$ 80\times 80\times 80{\mathring{A}}^3 $
3D voxel grid with eight channels representing different atom types: carbon, nitrogen, oxygen, phosphorus, sulfur, fluorine, bromine, and iodine. The method passes this grid as input into a 3D CNN model, predicting ligandability scores for voxels of size
$ 4\times 4\times 4{\mathring{A}}^3 $
. Binding sites are retrieved by clustering predicted ligandable voxels. The authors pre-trained the model on protein binding sites and fine-tuned it to RNAs. MultiModRLBP (Wang et al., Reference Wang2024) uses a relational GCN to obtain features from a nucleotide structure graph and a pre-trained language model (RNABert (Kalicki and Haritaoglu, Reference Kalicki and Haritaoglu2022)) to get embeddings from an RNA sequence. The model concatenates these structural and sequential features and feeds the resulting vector into a small neural network of fully connected layers to obtain a prediction for each nucleotide. BiteNet
N
(Kozlovskii and Popov, Reference Kozlovskii and Popov2021b) predicts binding site centers on both RNA and DNA macromolecules. To train the model, the authors composed the largest dataset of
$ \sim $
2000 nucleic acid–small molecule structures. First, the method converts an input nucleic acid macromolecule structure into a voxel-based representation. Then, a 3D CNN model takes this grid as input and produces a set of binding site centers and coordinates, along with a binding score for each nucleotide.
Benchmarks
One of the most widely used benchmarks for RNA-ligand binding site detection methods is the TE18 test set from RNAsite (Su et al., Reference Su, Peng and Yang2021). Another benchmark is RB19 from RBind (Wang, Jian, et al., Reference Wang2018b). Note that, typically, methods use only a subset of these test sets to avoid sharing similar complexes with the training sets. Most recently, the authors of SHAMAN (Panei et al., Reference Panei, Gkeka and Bonomi2024) created a test set based on seven RNA complexes: riboswitches (FMN, THF, TPP, and dG) and viral RNAs (HIV-1 TAR, HCV-IRES-IIa, and IAV). They also introduced different strategies to evaluate the methods’ performance based on the holo or apo structures of these complexes. Supplementary Tables S11–S16 list the performance of RNA-small molecule binding site detection methods.
Protein–ion binding site prediction
Ions are crucial for various physiological processes, such as enzymatic function, signal transduction, and muscle contraction, through their interactions with proteins (Al-Fartusie and Mohssan, Reference Al-Fartusie and Mohssan2017). Ions can bind to protein-active sites (Andreini et al., Reference Andreini2008), stabilize or trigger conformational changes in protein structures (Dudev and Lim, Reference Dudev and Lim2014; Jernigan et al., Reference Jernigan, Raghunathan and Bahar1994), regulate the activity of DNA/RNA polymerases (De Baaij et al., Reference De Baaij, Hoenderop and Bindels2015), or affect the concentration-dependent aggregation rate of proteins (Poulson et al., Reference Poulson2020). In addition, ions can act as allosteric modulators. For instance, sodium ions modulate G protein-coupled receptors (Katritch et al., Reference Katritch2014), while in calcium-sensing receptors (CaSR),
$ {\mathrm{Ca}}^{2+} $
, and
$ {\mathrm{Mg}}^{2+} $
serve as activators,
$ {\mathrm{Cl}}^{-} $
acts as a positive allosteric modulator, and
$ {\mathrm{SO}}_4^{2-} $
/
$ {\mathrm{PO}}_4^{3-} $
act as negative modulators (Liu et al., Reference Liu2020a). Chloride ions (
$ {\mathrm{Cl}}^{-} $
) also modulate mGluRs (metabotropic glutamate receptors) (Tora et al., Reference Tora2015), and calcium ions (
$ {\mathrm{Ca}}^{2+} $
) influence nAChRs (nicotinic acetylcholine receptors) (Changeux, Reference Changeux2018). Therefore, understanding protein–ion interactions, particularly ion binding sites, is critical to deciphering protein function. Ion binding sites differ from protein–ligand and protein–peptide binding sites in several ways. First, the size of ion binding sites is generally smaller, as small molecules or peptides typically interact with more residues on the protein surface. Furthermore, ion-binding sites are often more adaptable than those of ligands (Chakrabarti, Reference Chakrabarti1993). Another distinction is that many ions require specific coordination geometries with protein atoms. For example,
$ {\mathrm{Zn}}^{2+} $
binding sites are typically formed by residues such as Cys, His, Asp, or Glu, and are coordinated by four or five atoms, adopting a distorted-tetrahedral or trigonal-bipyramidal geometry (Auld, Reference Auld and Maret2001). Various computational approaches have been proposed to identify ion binding sites, as summarized in Table 5.
Table 5. List of methods for prediction of protein–ion binding sites

Sequence-based
Similarly to sequence-based methods that predict protein–small molecule binding sites, almost all of the sequence-based methods for the identification of binding sites for ions utilize a machine learning-based approach (Chen et al., Reference Chen2013; Shu et al., Reference Shu, Zhou and Hovmöller2008; Lippi et al., Reference Lippi2008; Passerini et al., Reference Passerini, Lippi and Frasconi2011; Ferrè and Clote, Reference Ferrè and Clote2006; Passerini et al., Reference Passerini2007; Haberal and Oğul, Reference Haberal and Oğul2017, Reference Haberal and Oğul2019; Qiao and Xie, Reference Qiao and Xie2019; Yu et al., Reference Yu2013, Reference Yu2015; Li et al., Reference Li, Pi and Chen2019a; Li et al., Reference Li2019b; Yan et al., Reference Yan2019; Jiang et al., Reference Jiang2016; Ding et al., Reference Ding, Tang and Guo2017; Srivastava and Kumar, Reference Srivastava and Kumar2018; Zhao et al., Reference Zhao, Xu and Zhao2019; Essien et al., Reference Essien, Wang and Xu2019; Sun et al., Reference Sun2022). First, they move a sliding window along the input sequence and calculate sequential features for each position. Features can include: evolutionary information such as position-specific scoring matrix (PSSM) or conservation score, predicted secondary structure, and predicted solvent accessibility of residues. Then, these features are fed into an SVM, RF, AdaBoost, or a simple NN. ZincExplorer (Chen et al., Reference Chen2013) combines a machine learning approach with a templates-based search of known binders to identify Zn-binding sites. IBayes_Zinc (Li et al., Reference Li, Pi and Chen2019a) uses previously described sequence descriptors and predictions from other methods (ZincExplorer (Chen et al., Reference Chen2013), ZincFinder (Passerini et al., Reference Passerini2007), and ZincPred (Shu et al., Reference Shu, Zhou and Hovmöller2008)) as input into a Bayesian algorithm to predict Zn sites. MetalPredator (Valasatava et al., Reference Valasatava2016) searches through a database of Pfam domains for Fe-S clustering binding and metal binding fragments from MetalPDB (Andreini et al., Reference Andreini2012). ZINCCLUSTER (Ajitha et al., Reference Ajitha2018) first creates a database of all monopeptides, dipeptides, and tripeptides and assigns a Z-score for each of them to be Zn-binding based on a dataset. Then, it screens an input sequence with pentapeptides and retrieves a Z-score from the database for two central dipeptides and three tripeptides. The method considers this fragment to be Zn-binding if the average Z-score of dipeptides and tripeptides is higher than zero. With advancements in deep learning, transformer-based models have been developed for ion binding site prediction. IonPred (Essien et al., Reference Essien2023) employs a transformer architecture to predict ion binding sites directly from protein sequences. M-Ionic (Shenoy et al., Reference Shenoy2024) leverages residue embeddings generated by the pre-trained protein language model ESM-2 (Lin et al., Reference Lin2023) to identify binding sites for various ions. Similarly, LMetalSite (Yuan et al., Reference Yuan2022b) utilizes residue embeddings from ProtTrans (Elnaggar et al., Reference Elnaggar2021) for the prediction of binding sites specific to
$ {\mathrm{Zn}}^{2+} $
,
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
, and
$ {\mathrm{Mn}}^{2+} $
.
Template-based
Many methods aim to find fragments of an input structure that are present in the template database of known ion-binding sites. MIB (Lin et al., Reference Lin2016) and (Lu et al., Reference Lu2012) use a fragment transformation method to search for parts of an input protein that are present in a database of binding residue templates for multiple ion types. For this, they split residues in the input structure and the template into residue triplets, measured triplet pair similarity and performed clustering of triplets similar to binding ones to get the final predictions. FindSite-metal (Brylinski and Skolnick, Reference Brylinski and Skolnick2011) utilizes TM-align (Zhang and Skolnick, Reference Zhang and Skolnick2005; Pandit and Skolnick, Reference Pandit and Skolnick2008) to align template fragments onto the input structure, clusters the obtained alignments, and outputs residue binding scores as the fraction of templates including corresponding positions. TEMSP (Zhao et al., Reference Zhao2011) creates a database of Zn-binding templates from all pairs of residues interacting with this ion. Then, for an input protein, it screens all residue pairs and detects the ones present in the template library. After that, matched pairs are combined into ‘pairs-of-pairs’, which are further filtered using predefined geometrical thresholds to get the final predictions. In Garg and Pal (Reference Garg and Pal2021), the authors used a geometric hashing technique to match query structures with templates of binding sites for different ion types. In Schymkowitz et al. (Reference Schymkowitz2005b), the method creates a database of canonical positions of water molecules or ions with respect to protein atom triads. Then, it screens the surface of the protein with these triads, clusters favorable points, and performs optimization of positions using the empirical force field. GASS-Metal (Paiva et al., Reference Paiva2022) uses a genetic algorithm for the effective search of structural patterns similar to ion binding sites from a curated database of templates.
Machine learning-based
Apart from the sequence-based machine learning approaches, most of the structure-based machine learning methods (Sodhi et al., Reference Sodhi2004; Bordner, Reference Bordner2008; Zheng et al., Reference Zheng2012; Ireland and Martin, Reference Ireland and Martin2021; Song and Jiang, Reference Song and Jiang2023) calculate sequential and structural features for each residue and feed them into an SVM, RF, or neural network classifier. For example, FEATURE (Ebert and Altman, Reference Ebert and Altman2008) constructs concentric radial shells for the atomic environments, calculates physicochemical features inside each of them, and uses Bayesian learning to differentiate whether an environment corresponds to a Zn-binding site or not. IonCom (Hu et al., Reference Hu2016) combines predictions from the sequence-based approach IonSeq and other tools for binding site prediction: COFACTOR (Roy et al., Reference Roy, Yang and Zhang2012), TM-SITE, S-SITE, and COACH (Yang et al., Reference Yang, Roy and Zhang2013), and trains a classifier on top of them. PinMyMetal (Zheng et al., Reference Zheng2024) uses geometrical features, including residue properties, interatomic distances, bond angles, and atomic types as input into the ensemble ML model predicting
$ {\mathrm{Zn}}^{2+} $
binding sites.
Deep learning-based
GraphBind (Xia et al., Reference Xia2021) is a GNN-based model that predicts binding sites for
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mn}}^{2+} $
, and
$ {\mathrm{Mg}}^{2+} $
. DeepProSite (Fang et al., Reference Fang2023), as mentioned in Section Deep learning-based, uses ESMFold (Rives et al., Reference Rives2021), ProtTrans (Elnaggar et al., Reference Elnaggar2021), and a GNN for the prediction of different types of binding sites, including those for
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mn}}^{2+} $
, and
$ {\mathrm{Mg}}^{2+} $
. In Gamouh et al. (Reference Gamouh, Hoksza and Novotny2023), the authors used embeddings from ProtTrans (Elnaggar et al., Reference Elnaggar2021) as features for the graph nodes and used GNN to predict binding sites for nucleotides and ions:
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ {\mathrm{Mn}}^{2+} $
,
$ {\mathrm{Fe}}^{3+} $
, and
$ {\mathrm{Zn}}^{2+} $
. DELIA (Xia et al., Reference Xia, Pan and Shen2020) first constructs a feature vector as the combination of outputs from two other models: (i) BiLSTM which takes as the input the sequence-based features; and (ii) 2D CNN ResNet model, which takes as the input the distance matrix. The constructed feature vectors are used as the input to the next fully connected layer, which outputs the probability of each residue being binding or non-binding. Metal3D (Dürr et al., Reference Dürr, Levy and Rothlisberger2023) employs a 3D CNN to predict the probability density of
$ {\mathrm{Zn}}^{2+} $
binding across the protein structure. MoM (Laveglia et al., Reference Laveglia2023) utilizes a GNN to classify local protein environments composed of Cys, His, Asp, and Glu residues, determining whether these environments are likely to bind
$ {\mathrm{Zn}}^{2+} $
. BindWeb (Xia et al., Reference Xia2022) is a consensus method combining predictions from GraphBind (Xia et al., Reference Xia2021) and DELIA (Xia et al., Reference Xia, Pan and Shen2020) models.
Other
There are several geometric methods that search positions with surrounding atoms whose geometry resembles the ion coordination shell. GRE4Zn (Liu et al., Reference Liu2014) utilizes the fact that most known Zn-binding sites comprise sets of four or three residues with distinctly specific geometries. GaudiMM Metals (Sciortino et al., Reference Sciortino2019) retrieves information about acceptable coordination shell geometries for a set of ions and implements them as an additional objective for optimization with ion presence in the GaudiMM platform (Rodrı́guez-Guerra Pedregal et al., Reference Rodrı́guez-Guerra Pedregal2017). BioMetAll (Sánchez-Aparicio et al., Reference Sánchez-Aparicio2020) constructs a grid of metal probes around a protein and checks each grid position to see if the amino acid environment matches geometric constraints determined from statistics in a dataset of protein structures. The method obtains final predictions from the clustering of relevant points. Also, there are methods that utilize an energetic approach. For example, BION (Shashikala et al., Reference Shashikala2021) calculates electrostatic potential maps with a gaussian-smooth dielectric function term to predict the positions of non-specifically surface-bound ions.
It is important to note that many of the ion binding site identification methods consider only Cys, His, Glu, and Asp residues (Chen et al., Reference Chen2013; Shu et al., Reference Shu, Zhou and Hovmöller2008; Passerini et al., Reference Passerini2007), as these four amino acids are involved in the coordination shell of a bound ion in many cases. Moreover, MetalDetector (Lippi et al., Reference Lippi2008; Passerini et al., Reference Passerini, Lippi and Frasconi2011) and DeepMBS (Haberal and Oğul, Reference Haberal and Oğul2017, Reference Haberal and Oğul2019) operate only with His and Cys, and DiANNA (Ferrè and Clote, Reference Ferrè and Clote2006) work solely with Cys residues. On the other hand, many methods have been developed to predict binding regions for specific ions. For example, multiple approaches aim to predict Zn-binding regions (Chen et al., Reference Chen2013; Shu et al., Reference Shu, Zhou and Hovmöller2008; Passerini et al., Reference Passerini2007; Haberal and Oğul, Reference Haberal and Oğul2017, Reference Haberal and Oğul2019; Li et al., Reference Li, Pi and Chen2019a; Li et al., Reference Li2019b; Yan et al., Reference Yan2019; Ajitha et al., Reference Ajitha2018).
It is worth noting that for some ions (e.g.,
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ {\mathrm{Na}}^{+} $
,
$ \mathrm{and}\hskip0.5em {\mathrm{K}}^{+} $
), the performance metrics are much lower compared to others, as can be seen in Supplementary Table S19. As pointed out in Lu et al. (Reference Lu2012) and Qiao and Xie (Reference Qiao and Xie2019), this can be caused by the higher variability of these binding sites in terms of amino acid composition and structure. Indeed, in Qiao and Xie (Reference Qiao and Xie2019), the authors calculate the frequency difference index, defined as the average difference in the ratio of binding and non-binding residues of each type among the 20 amino acid types, and observed that the index values are much lower for
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ {\mathrm{Na}}^{+} $
, and
$ {\mathrm{K}}^{+} $
compared to other ions.
Benchmarks
We observed that different methods use various benchmarks for evaluation; here, we list benchmarks that were used by several methods. The Passerini dataset (Passerini et al., Reference Passerini2006) is a dataset containing 2,727 sequences with 687 protein chains bound to a metal atom. There are four methods that used it as a training or validation set for the prediction of Zn-binding sites (ZincFinder (Passerini et al., Reference Passerini2007), ZincPred (Shu et al., Reference Shu, Zhou and Hovmöller2008), ZincExplorer (Chen et al., Reference Chen2013), DeepMBS (Haberal and Oğul, Reference Haberal and Oğul2017)). However, note that these methods calculated different metrics on different sets of residues (e.g., Cys and His or Cys, His, Glu, and Asp). The Zhao dataset (Zhao et al., Reference Zhao2011) is a dataset used for training and validation of a template-based method TEMSP, consisting of
$ \sim $
600 protein targets with bound Zn ions. Although many methods use this dataset as an independent test set, some methods retrieved only a subset from it. SSWPNN (Li et al., Reference Li2019b) provides the most complete comparison of methods on this dataset (see Supplementary Table S21). Furthermore, for the validation of SSWPNN, the authors also collected a second independent test set from PDB consisting of 213 protein chains with 1,017 Zn-binding sites, and compared SSWPNN with five other approaches for the prediction of Zn binding sites on the Zhao and SSWPNN datasets (see Supplementary Tables S21 and S22). ZincBindDB (Ireland and Martin, Reference Ireland and Martin2019) is the largest database of Zn binding sites (about 35,000 binding sites from about 16,000 structures), that was used for training and validation of the ZincBindPredict method (Ireland and Martin, Reference Ireland and Martin2021) (however, it was not used by the other methods). Note that a newer version of the database contains more samples (
$ \sim $
40,000 binding sites from
$ \sim $
16,000 structures); so one may expect improved performance for newer methods. The BION dataset (Petukh et al., Reference Petukh, Kimmet and Alexov2013) contains binding sites for
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Zn}}^{2+} $
,
$ {\mathrm{Cl}}^{-} $
, and
$ {\mathrm{Mg}}^{2+} $
ions from 446 protein structures. In Shashikala et al. (Reference Shashikala2021), the authors used this dataset to compare the performances of BION (Petukh et al., Reference Petukh, Kimmet and Alexov2013) and BION-2 (Shashikala et al., Reference Shashikala2021) methods with forcefield-based tools from VMD (Humphrey et al., Reference Humphrey, Dalke and Schulten1996) and Fold-X (Schymkowitz et al., Reference Schymkowitz2005a). In Hu et al. (Reference Hu2016), the authors created a large dataset of 2,100 protein structures in complex with 3,075 ions (
$ {\mathrm{Zn}}^{2+} $
,
$ {\mathrm{Cu}}^{2+} $
,
$ {\mathrm{Fe}}^{2+} $
,
$ {\mathrm{Fe}}^{3+} $
,
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ {\mathrm{Mn}}^{2+} $
,
$ {\mathrm{Na}}^{+} $
,
$ {\mathrm{K}}^{+} $
,
$ {\mathrm{CO}}_3^{2-} $
,
$ {\mathrm{NO}}_2^{-} $
,
$ {\mathrm{SO}}_4^{2-} $
,
$ \mathrm{and}\hskip0.5em {\mathrm{PO}}_4^{3-} $
) retrieved from the BioLip database (Yang et al., Reference Yang, Roy and Zhang2012). The authors used it for 5-fold cross-validation of IonSeq and IonCom methods, and there are available scores for several methods on this benchmark (see Supplementary Table S18). In MIonSite (Qiao and Xie, Reference Qiao and Xie2019), the authors created a large dataset of 7,676 sequences for training and 274 sequences for an independent test set. These sets include ions of multiple types:
$ {\mathrm{Zn}}^{2+} $
,
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ {\mathrm{Mn}}^{2+} $
,
$ {\mathrm{Fe}}^{3+} $
,
$ {\mathrm{Cu}}^{2+} $
,
$ {\mathrm{Fe}}^{2+} $
,
$ {\mathrm{Co}}^{2+} $
,
$ {\mathrm{Na}}^{+} $
,
$ {\mathrm{K}}^{+} $
,
$ {\mathrm{Cd}}^{2+} $
,
$ \mathrm{and}\hskip0.5em {\mathrm{Ni}}^{2+} $
. MIonSite was compared with other methods on their test set (see Supplementary Table S19). The authors also created a small dataset (BTD) of 10 proteins with metal ion-binding sites and 10 proteins without metal ion-binding sites for additional comparison with other methods (see Supplementary Table S20). TargetS (Yu et al., Reference Yu2013) used the BioLip database (Yang et al., Reference Yang, Roy and Zhang2012) to assemble training and validation sets with metal ion binding sites (
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Zn}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ {\mathrm{Mn}}^{2+} $
,
$ \mathrm{and}\hskip0.5em {\mathrm{Fe}}^{3+} $
) and nucleotides with 3,779 and 642 ion-bound protein sequences, respectively. The authors used an independent test set to compare TargetS with ligand-specific predictors and an alignment-based predictor (see Supplementary Table S23). Garg and Pal (Garg and Pal, Reference Garg and Pal2021) assembled datasets for five metal ions (
$ {\mathrm{Cu}}^{2+} $
,
$ {\mathrm{Fe}}^{3+} $
,
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ \mathrm{and}\hskip0.5em {\mathrm{Zn}}^{2+} $
) and split them into training and testing sets with 1,079 and 268 structures in total, respectively. The authors compared their method with IonCom (Hu et al., Reference Hu2016) and MIB (Lin et al., Reference Lin2016) (see Supplementary Table S24). In Yan et al. (Reference Yan2019), the authors prepared the zn1436 dataset of proteins with bound zinc ions for a comparison of the ZnMachine method with ZincExplorer (Chen et al., Reference Chen2013) (see Supplementary Table S25). In Yuan et al. (Reference Yuan2022b), the authors compiled a test set from the BioLip database (Yang, Reference Yang, Roy and Zhang2012), consisting of 211, 183, 235, and 57 protein chains bound to
$ {\mathrm{Zn}}^{2+} $
,
$ {\mathrm{Ca}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
, and
$ {\mathrm{Mn}}^{2+} $
ions, respectively (see Supplementary Table S27). Similarly, the authors of M-Ionic (Shenoy et al., Reference Shenoy2024) constructed an independent test set from BioLip, but reported performance metrics only for LMetalSite (Yuan et al., Reference Yuan2022b) and M-Ionic (Shenoy et al., Reference Shenoy2024) on this dataset (see Supplementary Table S28). Among the benchmarks used by a single method, one may notice the BioMetAll test set (Sánchez-Aparicio et al., Reference Sánchez-Aparicio2020), which consists of 53 crystallographic structures containing the two-histidine one-carboxylate motif (FTM). This is an interesting benchmark since its structures vary in size, and this motif may bind multiple types of metal ions (
$ {\mathrm{Cd}}^{2+} $
,
$ {\mathrm{Co}}^{2+} $
,
$ {\mathrm{Cu}}^{2+} $
,
$ {\mathrm{Fe}}^{3+} $
,
$ {\mathrm{Hg}}^{2+} $
,
$ {\mathrm{Mg}}^{2+} $
,
$ {\mathrm{Mn}}^{2+} $
,
$ {\mathrm{Ni}}^{2+} $
,
$ {\mathrm{Ru}}^{3+} $
,
$ \mathrm{and}\hskip0.5em {\mathrm{Zn}}^{2+} $
). Note that almost all methods operate with residue-based scores as their performance metric, and only GaudiMM Metals (Sciortino et al., Reference Sciortino2019) and BION (Shashikala et al., Reference Shashikala2021) used distance-based scores, which are likely more representative of the ion binding site prediction problem.
Despite the large variety of methods and benchmarks (see Table 5), one can see from Supplementary Tables S18 and S19 that MIonSite (Qiao and Xie, Reference Qiao and Xie2019) and IonCom (Hu et al., Reference Hu2016) demonstrate better performance for different ions. Interestingly, MIonSite is a sequence-based method, and IonCom is a structure-based method; therefore, it would be interesting to see if a combined approach shows even better results. As for the Zn-specific predictors, Supplementary Table S22 shows that SSWPNN (Li et al., Reference Li2019b) outperforms other methods.
Challenges
The composition of high-quality and diverse labeled datasets and benchmarks is one of the biggest challenges in the binding site detection problem. As one can see from the Benchmarks subsections, typically, there are no unified training-validation sets and test benchmarks to perform a rigorous comparison of the developed methods. Moreover, the existing training and test splits often contain data leakage (let alone structural artifacts), resulting in likely overly optimistic performance metrics, that should not be compared between the different methods. Unfortunately, despite the exponential growth of available experimental structures in PDB, in some cases, the test benchmarks are too small to make statistically solid conclusions. For example, the commonly used RNA-small molecule binding site benchmark, TE18, contains just 18 structures (see Section Nucleic acid–small molecule binding sites Benchmarks). Related to this, another challenge to consider, especially with respect to the ML and DL methods, is overfitting. Overfitting is a common problem in machine learning, where models perform well on training data but fail to generalize to the unseen cases. Deficient training sets or incorrect training-validation-test splitting can also result in models showing artificially high score values (Kapoor and Narayanan, Reference Kapoor and Narayanan2023). Thus, to overcome these challenges, not only high-quality datasets are required but also unified training-validation-test splits. This will also help to make the comparison of different methods more rigorous. However, other hidden biases may remain. For example, using pre-trained language models (LMs) may lead to data leakage, as the LM model itself might have seen data similar to the training set.
One rather technical but important thing that also prevents a rigorous comparison of binding site detection methods is the use of different performance metrics for their evaluation. First of all, many papers present accuracy, precision, recall, or ROC AUC metrics, which may be misleading. Indeed, precision or recall metrics are high when the model outputs either the lowest or highest number of positive predictions, respectively, and accuracy or ROC AUC tends to be 1 when there is a high imbalance in binary labels. MCC is a much more suitable metric for binding site detection; however, it still depends on the choice of the threshold for determining binary labels. On the other hand, AP or AURPC, which is the area under the precision-recall curve, is much more convenient for the classification of binding residues, as it takes into account the ranking of predictions and does not depend on class imbalance. Also, instead of calculating precision on the top-N predictions for each structure (see DCC and DCA metrics in Supplementary Section Metrics), one can use AP for assessing model performance using a distance-based criterion as well (Kozlovskii and Popov, Reference Kozlovskii and Popov2020), following this idea from object detection in computer vision (Everingham et al., Reference Everingham2010). More reliable performance metrics can be roughly divided into three categories: distance-, volume-, and residue-based. Distance-based metrics define a prediction as successful if the distance from the predicted binding site center to the true center or any atom of the binding site or ligand is lower than a threshold value, which is usually set to 4Å. Volume-based metrics calculate the overlap between the predicted and true binding site cavities. Finally, residue-based metrics rely on binary classification metrics calculated from residue scores. There is no one-size-fits-all solution, however. For example, residue-based scores may be misleading for protein–ion binding sites. Indeed, the number of interacting residues is small; thus, the impact of a single residue on the metric is high. Furthermore, the definition of interacting residues varies too, resulting in different metric values for the same predictions but with a slightly different set of true labels. Similarly, a residue-based metric may be misleading for nucleic acid–small molecule binding site prediction, though with the opposite reasoning. In this case, the number of nucleotides in the structure is typically small; thus, the binding site covers a larger portion of nucleotide residues. As a result, residue-based metrics may become insensitive to very different predictions. In contrast, distance- and volume-based metrics have been shown to be informative enough for protein–small molecule binding site predictions. Therefore, distance-based metrics would be more robust for protein–ion and nucleic acid–small molecule binding site detection problems. However, as one can see from Sections Nucleic acid–small molecule binding sites Benchmarks and Protein–ion binding site prediction Benchmarks, most of the methods rely on residue-based metrics. On the other hand, residue-based metrics can be more suitable for protein–peptide binding site detection methods compared to distance- and volume-based metrics, because of the large size of protein–peptide binding site interfaces.
Another challenge is the interpretability of ML and DL-based binding site detection models. While these methods could achieve superior accuracy compared to classical approaches, their predictions often lack clear mechanistic explanations, making it difficult to extract meaningful insights about the underlying molecular interactions (Murdoch et al., Reference Murdoch2019; Vecchietti et al., Reference Vecchietti2024). This issue can be particularly relevant when understanding why a model identifies a particular region as a potential binding site is essential for hypothesis-driven drug design. Unlike physics-based or first-principle methods, which rely on well-understood physical and chemical principles, DL models operate as complex, non-linear transformations of input data, obscuring the contributions of individual molecular features. This ambiguity also hampers debugging models, detecting biases in datasets, and ensuring reliable generalization across diverse molecular structures. Recent advances in explainable AI (XAI) (Jiménez-Luna et al., Reference Jiménez-Luna, Grisoni and Schneider2020; Bhatt et al., Reference Bhatt, Koes and Durrant2024), such as feature attribution techniques (e.g., SHAP (Lundberg and Lee, Reference Lundberg, Lee, Guyon, Von Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017), LRP (Montavon et al., Reference Montavon2019), Grad-CAM (Selvaraju et al., Reference Selvaraju2017)) and attention mechanisms in transformer-based models (Wiegreffe and Pinter, Reference Wiegreffe and Pinter2019), have been proposed to increase interpretability, but their application to binding site prediction remains limited. Incorporating interpretable ML techniques into binding site detection could improve trust in DL-based predictions and enhance their practical usability in drug discovery pipelines.
In drug discovery, one of the challenges is to assess the ‘druggability’ or ‘ligandability’ of the detected binding sites. Currently, there are no strict criteria for the ‘druggable’ binding sites. Similarly to the characterization of drug-like molecules using the rule of 5, Ghose filter, or other heuristics, one can compose such knowledge-based criteria for the binding sites based on their properties. While an exhaustive review of binding site characterization methods is beyond the scope of this article, it is noteworthy that several computational tools have been developed to analyze specific properties of binding sites. These tools assess attributes such as volume, surface area, and flexibility, and often identify sub-pockets within larger pockets (Durrant et al., Reference Durrant2014; Guerra et al., Reference Guerra2021), typically, employing approaches similar to those discussed in Section Protein–small molecule binding sites Geometric. Moreover, certain geometric binding site prediction methods inherently provide volume estimations of binding sites (Kawabata and Go, Reference Kawabata and Go2007; Capra et al., Reference Capra2009; Le Guilloux et al., Reference Le Guilloux, Schmidtke and Tuffery2009; Zhu and Pisabarro, Reference Zhu and Pisabarro2011), and some methods analyze pockets throughout molecular dynamics (MD) trajectories, offering dynamic insights into binding site properties (Craig et al., Reference Craig2011; Schmidtke et al., Reference Schmidtke2011; Paramo et al., Reference Paramo2014; Laurent et al., Reference Laurent2015; Wagner et al., Reference Wagner2017; Chen et al., Reference Chen2019; Lv and Cao, Reference Lv and Cao2024). Other tools, like MOLE (Pravda et al., Reference Pravda2018), CAVER (Stourac et al., Reference Stourac2019), and others (Yaffe et al., Reference Yaffe2008; Lee and Helms, Reference Lee and Helms2012) aim at the characterization of protein tunnels, channels, and pores.
Last but not least challenge is the prospective validation of the developed methods. Given the aforementioned challenges that can lead to over-optimistic performance on the retrospective benchmarks, the real-world application is of crucial importance. However, such case studies are quite rare (Popov et al., Reference Popov2024; Naz et al., Reference Naz2015) and absent for most of the developed methods. In this regard, community-driven challenges, such as CASP (https://predictioncenter.org) and CACHE (https://cache-challenge.org ), may comprise targets with previously unpublished binding sites and, thus, provide an opportunity to demonstrate the predictive power of the developed methods.
Trends and future directions
It is no wonder that machine learning-based approaches are gradually displacing the first-principle methods, and while older research focuses more on searching for the most powerful features, newer research is more focused on exploring various neural network architectures. Moreover, with the advances in large language models, it has become common to utilize embeddings produced by, for example, protein language models, as the feature vectors for the downstream task of the binding site detection. While this idea seems promising, extensive exploration of this research direction is computationally expensive, requiring significant hardware resources and time, which can limit accessibility for some research groups.
When applying or testing binding site detection methods, an important question to address is the flexibility of the target. Naturally, one expects that a method should detect a binding site in the holo conformation of the target. But in practice, one needs to discover novel binding sites given the unbound conformation. At what point in the imaginary trajectory between the unbound and bound conformations should a method detect the binding site? The answer to this question likely depends on the considered set of ligands for a particular target, such that the method should detect a binding site in the structure, if the corresponding conformation is within a certain vicinity of the bound conformation for at least one of the ligands. The vicinity can be simply defined as all conformations within a given RMSD threshold relative to the bound conformations. Constructing such a benchmark of conformational ensembles would be a valuable step forward for the development of robust binding site detection approaches. Currently, one typically addresses the flexibility issue by generating multiple conformations of the target using molecular dynamics or another method and applying binding site detection to them (Kozlovskii and Popov, Reference Kozlovskii and Popov2020; Martinez-Rosell et al., Reference Martinez-Rosell2020; Meller et al., Reference Meller2023; Panei et al., Reference Panei, Gkeka and Bonomi2024). The development of spatiotemporal methods to analyze target binding sites and their dynamics is a valuable direction for future research.
There are other types of binding sites besides those covered in this review. For example, specific models have been developed for protein-nucleotide binding sites (Chauhan et al., Reference Chauhan, Mishra and Raghava2009; Chen et al., Reference Chen, Mizianty and Kurgan2012; Kusuma et al., Reference Kusuma2019), carbohydrate binding sites (Canner et al., Reference Canner, Shanker and Gray2023), vitamin binding sites (Panwar et al., Reference Panwar, Gupta and Raghava2013), catalytic sites (Dou et al., Reference Dou2012), as well as water positions (Zaucha et al., Reference Zaucha2020; Park and Seok, Reference Park and Seok2022). As for RNA targets, there are methods to predict RNA-ion binding sites, including MetalionRNA (Philips et al., Reference Philips2012), MgNet (Zhou and Chen, Reference Zhou and Chen2022), Metal3DRNA (Zhao et al., Reference Zhao2023), as well as machine learning methods to predict RNA-protein binding sites using only sequence data (Choi and Han, Reference Choi and Han2013; Panwar and Raghava, Reference Panwar and Raghava2015; Tuvshinjargal et al., Reference Tuvshinjargal2016; Choi et al., Reference Choi2017; Zhan et al., Reference Zhan2018; Pan et al., Reference Pan2020; Tahir et al., Reference Tahir2021; Zhao et al., Reference Zhao, Zhang and Du2022), sequential and secondary structure (Uhl et al., Reference Uhl2019), or sequential and tertiary data (Luo et al., Reference Luo2017). Some of the sequence-based methods rely on large databases with experimental data on RNA-protein binding, such as RNAcompete (Ray et al., Reference Ray2009), CLIP-seq, and RIP-seq (Ray et al., Reference Ray2013). Notably, there are approaches to predict DNA binding sites (e.g., DeepBind (Alipanahi et al., Reference Alipanahi2015) and DeepSTF (Ding et al., Reference Ding2023)), trained on datasets from protein binding microarrays (PBMs) (Mukherjee et al., Reference Mukherjee2004), ChIP-seq (Kharchenko et al., Reference Kharchenko, Tolstorukov and Park2008), or HT-SELEX (Jolma et al., Reference Jolma2010). We would like to separately note that protein-covalent ligand binding sites constitute a special case of protein–small molecule binding sites. Covalent ligands may be useful as a therapeutic modality in various diseases; hence, the prediction of this type of binding site is relevant in covalent drug discovery (Boike et al., Reference Boike, Henning and Nomura2022). A ligand can form a covalent bond with target residues (commonly Cys, Ser, and Lys, but there are other cases as well) upon binding, which imposes strict constraints on the binding site detection problem. There are several databases of covalent agents, including CovalentInDB (Du et al., Reference Du2021) and CovPDB (Gao et al., Reference Gao2022), and there are methods for predicting the ability of cysteines to form a covalent bond with ligands ( Zhang et al., Reference Zhang2016, Reference Zhang, Pei and Lai2017; Zhao et al., Reference Zhao2017; Du et al., Reference Du2022; Gao and Günther, Reference Gao and Günther2023). Other examples include methods to predict macromolecular binding sites, such as protein-nucleic acids ( Hendrix et al., Reference Hendrix2021; Wei et al., Reference Wei2022; Yuan et al., Reference Yuan2022a; Liu and Tian, Reference Liu and Tian2023; Roche et al., Reference Roche2023; Song et al., Reference Song2023; Zhu and Yu, Reference Zhu and Yu2023) or protein–protein (Fout et al., Reference Fout2017; Gainza et al., Reference Gainza2020; Dai and Bailey-Kellogg, Reference Dai and Bailey-Kellogg2021; Renaud et al., Reference Renaud2021; Sverrisson et al., Reference Sverrisson2021; Tubiana et al., Reference Tubiana, Schneidman-Duhovny and Wolfson2022; Gao et al., Reference Gao2023; Jha et al., Reference Jha, Karmakar and Saha2023; Krapp et al., Reference Krapp2023). Most of these methods are based on approaches similar to the ones described here. Given the large variety of binding site types on one hand and the advances in multi-modal and multi-task machine learning approaches on the other hand, we expect that the next-generation binding site prediction methods will operate across different types of macromolecular structures as well as their binding counterparts. Although there is currently no strong evidence that this will improve model accuracy, one may expect that a single model trained on comprehensive datasets could have stronger generalization ability and robustness. We observed that some methods implicitly explore this idea already; for example, RNet (Liu et al., Reference Liu2024) and
$ {\mathrm{BiteNet}}_{Pp} $
(Kozlovskii and Popov, Reference Kozlovskii and Popov2021a) started with protein–small molecule binding site detection models and fine-tuned them for RNA-small molecule and protein–peptide binding site models, respectively.
Finally, the discovery of novel binding sites may come from an orthogonal direction. For example, global molecular docking approaches search for the optimal configuration of two binding partners without prior knowledge of the corresponding binding site. Molecular docking methods have been developed for different types of macromolecules and ligands, including those described here in the context of the binding site detection problem. Needless to say, the molecular docking field and virtual ligand screening, in general, are also affected by the machine-learning era (Fadahunsi et al., Reference Fadahunsi2024). Moreover, breakthroughs in protein structure prediction by DeepMind (https://deepmind.google/technologies/alphafold/) have also opened new opportunities to solve binding site detection problems. Particularly, one promising direction is the development of co-folding approaches that aim to predict the 3D structure of the molecular complex starting from a 1D (sequence) or 2D (graph) representation of its subunits. The most recent examples of such approaches include AlphaFold3 (Abramson et al., Reference Abramson2024), RoseTTaFold All-Atom (Krishna et al., Reference Krishna2024), or NeuralPLexer (Qiao et al., Reference Qiao2024). Although their predictive power has yet to be assessed on independent test benchmarks to date, one may expect the rise of end-to-end approaches to solving structure prediction, binding site detection, and molecular docking problems simultaneously in the future.
List of abbreviations
- ML
-
machine learning
- DL
-
deep learning
- RF
-
random forest
- SVM
-
support vector machine
- MSA
-
multiple sequence alignment
- NN
-
neural network
- CNN
-
convolutional neural network
- GNN
-
graph neural network
- GCN
-
graph convolutional network
- MPNN
-
message passing neural network
- GRU
-
gated recurrent unit
- LSTM
-
long short-term memory
- RMSD
-
root-mean-square deviation
- MD
-
molecular dynamics
- SS
-
secondary structure
- ASA
-
accessible surface area
- SASA
-
solvent accessible surface area
- RSASA
-
relative solvent accessible surface area
- PSSM
-
position specific scoring matrix
- AF2
-
AlphaFold2
- ATP
-
adenosine triphosphate
- RNA
-
ribonucleic acid
- DNA
-
deoxyribonucleic acid
- PDB
-
Protein Data Bank
- NMR
-
nuclear magnetic resonance
- pLM
-
protein language model
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S003358352500006X.
Acknowledgements
This study was done within the PROXIDRUGS consortium. PROXIDRUGS is as part of the initiative ‘Clusters4Future’ funded by the Federal Ministry of Education and Research BMBF (03ZU2109JE; 03ZU2109ID).
Competing interests
The authors declare no competing interests.