This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Md Masud Rana, Department of Mathematics, University of Kentucky;
(2) Duc Duy Nguyen, Department of Mathematics, University of Kentucky & [email protected].
Protein-protein interactions (PPIs) are critical for various biological processes, and understanding their dynamics is essential for decoding molecular mechanisms and advancing fields such as cancer research and drug discovery.
Mutations in PPIs can disrupt protein binding affinity and lead to functional changes and disease. Predicting the impact of mutations on binding affinity is valuable but experimentally challenging. Computational methods, including physics-based and machine learning-based approaches, have been developed to address this challenge.
Machine learning-based methods, fueled by extensive PPI datasets such as AB-Bind, PINT, SKEMPI, and others, have shown promise in predicting binding affinity changes.
However, accurate prediction and generalization of these models across different datasets remain challenging. Geometric graph learning, which combines graph theory and machine learning, has emerged as a powerful approach for capturing structural features of biomolecules.
We present GGL-PPI, a novel method that integrates geometric graph learning and machine learning to predict mutation-induced binding free energy changes. GGL-PPI leverages atom-level graph coloring and multi-scale weighted colored geometric subgraphs to extract informative features, demonstrating superior performance on three validation datasets: AB-Bind, SKEMPI 1.0, and SKEMPI 2.0. Evaluation on a blind test set highlights the unbiased predictions of GGL-PPI for both direct and reverse mutations.
The findings underscore the potential of GGL-PPI in accurately predicting binding free energy changes, contributing to our understanding of PPIs and aiding drug design efforts.
Keywords— geometric graph, machine learning, protein-protein interactions, mutation, binding free energy changes
Protein-protein interactions (PPIs) play a fundamental role in numerous biological processes, including cell signaling, metabolic pathways, and immune responses [1;2;3]. Understanding PPIs and their dynamics is crucial for unraveling the intricate mechanisms underlying these processes and holds significant implications for various fields, such as cancer research, drug discovery, and personalized medicine [3;4].
The effects of mutations on PPIs have drawn substantial attention due to their potential impact on protein function and cellular behavior [5;6;7;8]. Missense mutations, which involve single amino acid substitutions, can disrupt the binding affinity between proteins and their partners [9;10]. Such alterations can lead to malfunctioning PPI networks, resulting in diseases, drug resistance, or other molecular disorders [11;12;13;14;15;16].
Therefore, accurate prediction of the impact of mutations on binding affinity holds significant importance in understanding disease mechanisms, facilitating therapeutic interventions, and enabling the design of innovative biopharmaceuticals.
One of the key parameters used to assess the impact of mutations on PPIs is the binding free energy change (∆∆G). This thermodynamic parameter quantifies the difference in binding affinity between the wild-type and mutant protein complexes. Experimental determination of ∆∆G values, while accurate, can be tedious and costly.
Consequently, there has been a surge in the development of computational methods to predict these energy changes. Broadly, these computational approaches fall into two main categories: physics-based and machine learning-based methods.
Physics-based methods, rooted in biophysical principles, examine protein conformations and offer a rigorous approach [17;18;19]. However, they often demand significant computational resources and are not always scalable.
On the other hand, machine learning-based methods have gained popularity due to their scalability and rapid prediction capabilities. Leveraging the wealth of data from PPI datasets such as ASEdb [20], PINT [21], ProTherm [22], SKEMPI [23;24], and others [25;26;27], machine learning models like mCSM [28], BindProf [6], iSEE [29], MutaBind [7], and several others [30;31;32;33] have been developed. These models have shown significant potential in predicting ∆∆Gs.
However, challenges such as imbalanced training datasets, generalization across different PPI datasets, and the intricacy of capturing complex sequence-structure-function relationships remain obstacles [34;35;36;37]. This underscores the need for further research to enhance machine learning methodologies, ensuring accurate and efficient ∆∆G predictions.
In recent years, geometric graph learning has emerged as a promising approach for analyzing complex biomolecular systems [38;39;40]. By representing proteins and their interactions as graphs, this methodology leverages the power of graph theory and machine learning to capture essential structural and spatial features of the biomolecular complexes.
Specifically, the use of geometric subgraphs, which encode local interactions between atoms and residues, offers a rich representation. This not only sheds light on intricate molecular details but also provides insights into their impact on binding affinity [39].
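As a minimal sketch of the idea, the Python below turns colored geometric subgraphs into numeric features. Everything specific in it is an assumption made for illustration rather than something stated in this section: the element-type coloring, the exponential edge weight with its parameters `eta` and `kappa`, the 12 Å cutoff, the toy coordinates, and the helper names `exponential_kernel` and `subgraph_features`.

```python
import math
from itertools import product

# Hypothetical toy atoms: (color, x, y, z), with coordinates in angstroms.
# The "color" is assumed here to be the element type of the atom.
SIDE_A = [("C", 0.0, 0.0, 0.0), ("N", 1.5, 0.0, 0.0)]
SIDE_B = [("O", 0.0, 3.0, 0.0), ("C", 4.0, 0.0, 0.0)]


def exponential_kernel(distance, eta, kappa=2.0):
    """Edge weight that decays with distance; eta sets the length scale."""
    return math.exp(-((distance / eta) ** kappa))


def subgraph_features(side_a, side_b, colors, etas, cutoff=12.0):
    """Sum the edge weights of each colored subgraph at each scale.

    A subgraph is selected by a color pair (color_a, color_b): its edges join
    atoms of color_a in side_a to atoms of color_b in side_b that lie within
    the cutoff. Repeating this for several eta values gives a multi-scale
    description of the same interface.
    """
    features = {}
    for color_a, color_b, eta in product(colors, colors, etas):
        total = 0.0
        for atom_a, atom_b in product(side_a, side_b):
            if atom_a[0] != color_a or atom_b[0] != color_b:
                continue
            distance = math.dist(atom_a[1:], atom_b[1:])
            if distance <= cutoff:
                total += exponential_kernel(distance, eta)
        features[(color_a, color_b, eta)] = total
    return features


features = subgraph_features(SIDE_A, SIDE_B, ["C", "N", "O"], [3.0, 6.0])
```

Each key of `features` names one subgraph at one scale, so three colors and two scales give 18 values. A fixed-length vector of this kind is the sort of input a tree-based regressor can consume directly.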
This work presents a novel method, called GGL-PPI (Geometric Graph Learning for Protein-Protein Interactions), which combines the principles of geometric graph learning and machine learning to predict mutation-induced binding free energy changes.
The workflow of GGL-PPI is depicted in Figure 1. Central to its methodology, GGL-PPI utilizes atom-level graph coloring and multi-scale weighted colored geometric subgraphs, enabling the extraction of informative features from protein structures and their interactions.
These features serve as inputs to a gradient-boosting tree model, which facilitates precise and consistent predictions of binding free energy changes upon mutation. When compared with existing models, GGL-PPI consistently outperforms state-of-the-art approaches on the AB-Bind, SKEMPI 1.0, and SKEMPI 2.0 validation datasets. To assess its generalizability, GGL-PPI was also evaluated on a blind test set, the Ssym dataset [36].
This evaluation was conducted using a homology-reduced balanced training set to avert data leakage, showcasing GGL-PPI’s robust performance and ability to produce unbiased predictions for both direct and reverse mutations.
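The notion of an unbiased prediction for direct and reverse mutations can be made concrete with a short sketch. If ∆∆G is taken as the binding free energy of the mutant minus that of the wild type (this sign convention is an assumption here), then the reverse mutation, from mutant back to wild type, has exactly the opposite ∆∆G. A model is unbiased in this sense when its direct and reverse predictions sum to roughly zero. The record layout and the helper names `augment_with_reverse` and `antisymmetry_bias` are hypothetical, and the augmentation step only illustrates one way a balanced set could be built; it is not a description of how the training set in this paper was constructed.

```python
def augment_with_reverse(direct_records):
    """Return the direct records plus one reverse record for each.

    Each record is (wild_type, mutant, ddg). The reverse record swaps the
    two residues and negates ddg, so the combined set is balanced around 0.
    """
    reverse_records = [(mut, wt, -ddg) for wt, mut, ddg in direct_records]
    return list(direct_records) + reverse_records


def antisymmetry_bias(direct_preds, reverse_preds):
    """Mean of (direct + reverse) predictions over paired mutations.

    The value is 0 for a perfectly antisymmetric predictor; a value far
    from 0 indicates a bias toward one direction.
    """
    if len(direct_preds) != len(reverse_preds):
        raise ValueError("direct and reverse predictions must be paired")
    if not direct_preds:
        raise ValueError("at least one pair of predictions is required")
    paired_sums = [d + r for d, r in zip(direct_preds, reverse_preds)]
    return sum(paired_sums) / len(paired_sums)


# Toy values, not experimental data.
balanced = augment_with_reverse([("A", "G", 1.0), ("L", "V", -0.5)])
bias = antisymmetry_bias([1.0, -0.5, 2.0], [-1.0, 0.5, -2.0])
```

With the toy predictions above, every reverse value is the exact negative of its direct counterpart, so `bias` is zero.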