Our Datasets and Results From Our Study: GGL-PPI Models

This paper is available on arXiv under CC 4.0 license.

Authors:

(1) Md Masud Rana, Department of Mathematics, University of Kentucky;

(2) Duc Duy Nguyen, Department of Mathematics, University of Kentucky.

Table of Links

Abstract & Introduction

Datasets and Results

Methods

Conclusion, Data and Software Availability, Competing interests, Acknowledgments & References

2 Datasets and Results

In this section, we perform validation and evaluation of our proposed models on several benchmark datasets. We develop two types of GGL-PPI models: GGL-PPI1 and GGL-PPI2. The first model, GGL-PPI1, is built solely on geometric graph features discussed in Section 3.

On the other hand, GGL-PPI2 incorporates both geometric graph features and auxiliary features, as detailed by Wang et al. [41]. The electrostatic potential calculations for the auxiliary components are conducted using the MIBPB software [42].

2.1 Validation

To validate our models, we primarily consider the AB-Bind dataset [25], the SKEMPI 1.0 dataset [23], and the SKEMPI 2.0 dataset [24]. We evaluate each dataset with 10 repetitions of 10-fold cross-validation (CV). The mean Pearson correlation coefficient (Rp) and root-mean-square error (RMSE) serve as our evaluation metrics.
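The protocol described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline. It assumes that Rp and RMSE are computed on the pooled out-of-fold predictions of each repetition and then averaged over repetitions, which the text does not specify. A hypothetical one-feature least-squares line (`fit_line`, `predict_line`) stands in for the actual GGL-PPI regressor, and all function names are illustrative.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_line(xs, ys):
    """Stand-in model (assumption): ordinary least squares on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


def repeated_kfold_cv(X, y, fit, predict, k=10, repeats=10, seed=0):
    """Return (mean Rp, mean RMSE) over `repeats` independent k-fold splits."""
    rng = random.Random(seed)
    n = len(y)
    rp_scores, rmse_scores = [], []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random partition per repetition
        folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
        y_pred = [0.0] * n
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in test:
                y_pred[i] = predict(model, X[i])  # out-of-fold prediction
        rp_scores.append(pearson(y, y_pred))
        rmse_scores.append(rmse(y, y_pred))
    return sum(rp_scores) / repeats, sum(rmse_scores) / repeats
```

Averaging over ten different partitions reduces the dependence of the reported Rp and RMSE on any single random split.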

In comparing the CV performance of our proposed models with other existing methods, we specifically assess TopNetTree [41], Hom-ML-V2 [43], and Hom-ML-V1 [43]. Both TopNetTree and Hom-ML-V2 incorporate auxiliary features in conjunction with their topology-based and Hom-complex-based features, respectively. On the other hand, Hom-ML-V1 solely relies on Hom-complex-based features without utilizing any auxiliary features.

Validation on AB-Bind S645 Data Set. The AB-Bind dataset contains 1,101 mutational data points for 32 antibody-antigen complexes, providing experimentally determined binding affinity changes upon mutations. Pires et al. curated a subset known as AB-Bind S645 [44], consisting of 645 single-point mutations observed in 29 antibody-antigen complexes. The dataset comprises a mix of stabilizing (20%) and destabilizing (80%) mutations.

Additionally, the dataset includes 27 non-binders that do not show any binding within the assay’s sensitivity range. For these non-binders, the binding free energy changes have been uniformly set to a value of 8 kcal/mol. It is crucial to consider these non-binders as outliers during model development and evaluation to ensure model accuracy and robustness.

Our GGL-PPI2 achieved an Rp of 0.58 on the AB-Bind S645 dataset, as shown in Figure 2a. The comparison results in Table 1 indicate that our model tied for second place with Hom-ML-V2 [43], while TopNetTree [41] claimed the top position.

However, when we exclude the 27 nonbinders from the dataset, our model outperforms all other existing models. Specifically, the Rp value increases to 0.74 from 0.58 after removing the nonbinders (Figure 2b).
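The sensitivity of Rp to the non-binders can be illustrated with a short sketch. The helper below is hypothetical and assumes that each data point carries a boolean flag marking it as a non-binder. It simply computes the Pearson correlation once on all points and once on the binders only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rp_with_and_without_nonbinders(y_true, y_pred, is_nonbinder):
    """Return (Rp on all points, Rp on binders only).

    `is_nonbinder` is an assumed per-sample flag; in AB-Bind S645 the
    non-binders are the entries whose label was set to 8 kcal/mol.
    """
    rp_all = pearson(y_true, y_pred)
    kept = [(t, p) for t, p, nb in zip(y_true, y_pred, is_nonbinder) if not nb]
    rp_binders = pearson([t for t, _ in kept], [p for _, p in kept])
    return rp_all, rp_binders
```

Because the non-binder labels are a fixed placeholder rather than a measured value, a handful of them can dominate the variance of the labels and pull the correlation down even when the remaining predictions are accurate.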

Furthermore, GGL-PPI1, our model built purely on geometric graph features, demonstrated competitive performance with an Rp of 0.57 on the AB-Bind S645 dataset. Intriguingly, when excluding the nonbinders, GGL-PPI1 surpassed all other models with an improved Rp of 0.73.

These performances reveal that our multiscale weighted colored geometric graphs can effectively characterize the wide range of interactions in biomolecular complexes.

Validation on SKEMPI 1.0 S1131 Data Set. The SKEMPI 1.0 dataset consists of a collection of 3,047 mutations of 158 complexes obtained from literature sources, where the complexes have experimentally determined structures [23]. The dataset includes both single-point mutations and multi-point mutations.

Specifically, there are 2,317 entries in the dataset that represent single-point mutations, which are collectively known as the SKEMPI S2317 set. Additionally, a subset of 1,131 non-redundant interface single-point mutations has been selected from the SKEMPI S2317 set and labeled as the SKEMPI S1131 set [45]. This subset focuses on studying the impact of single-point mutations on protein-protein interactions.

Table 1: Performance comparison of different methods in terms of Pearson correlation coefficients (Rp) for the AB-Bind (S645) dataset.

Figure 2c shows that our model GGL-PPI2 achieves an Rp of 0.873 and an RMSE of 1.21 kcal/mol in 10-fold CV on the S1131 dataset. Table 2 presents the performance comparison of various methods on the S1131 dataset, including our proposed models, GGL-PPI1 and GGL-PPI2.

Among them, our model, GGL-PPI2, achieved the highest performance, underscoring its superiority in predicting binding affinity changes due to mutation.

Notably, even without auxiliary features, our GGL-PPI1 outperformed both TopNetTree and Hom-ML-V2 methods that do leverage auxiliary features. This again highlights the efficacy of our geometric graph-based molecular representation.

Table 2: Performance comparison of different methods in terms of Pearson correlation coefficients (Rp) for the single-point mutations in the SKEMPI 1.0 (S1131) dataset.

Validation on SKEMPI 2.0 S4169 and S8338 Data Sets. The SKEMPI 2.0 dataset is an updated and expanded version of the original SKEMPI dataset, incorporating new mutations collected from various sources [24].

Released in 2018, it significantly increased in size, now containing a total of 7,085 entries, including both single-point and multi-point mutations. The data was obtained by merging several databases, including SKEMPI 1.0 [23], AB-Bind [25], PROXiMATE [27], and dbMPIKT [46].

Additionally, new data from the literature were manually curated and added to the dataset. The mutations cover a wide range of protein complexes, such as protease-inhibitor, antibody-antigen, and TCR-pMHC complexes. Among the mutations, approximately 3,000 are single-point alanine mutations, 2,000 are single-point non-alanine mutations, and another 2,000 involve multiple mutations.

Notably, the authors of the mCSM-PPI2 [8] method filtered the single-point mutations, yielding the S4169 set, which comprises 4,169 variants in 139 different complexes. The S8338 set, derived from S4169, adds the hypothetical reverse mutations, whose energy changes are the negatives of the original values. This comprehensive dataset serves as a valuable resource for studying protein interactions and their thermodynamic properties.
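The construction of a reverse-augmented set such as S8338 can be sketched as follows. The record layout (`Mutation` and its field names) is a hypothetical simplification introduced only for this example. The sketch covers the labels alone: each reverse entry swaps the wild-type and mutant residues and negates the energy change. In practice the reverse entry would also be featurized from the mutant structure, which is not modeled here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Mutation:
    # Hypothetical record layout, not the schema of any published dataset.
    complex_id: str
    chain: str
    wild_type: str
    position: int
    mutant: str
    ddg: float            # binding free energy change in kcal/mol
    is_reverse: bool = False


def reverse_of(m):
    """Hypothetical reverse mutation: swap residues and negate the energy change."""
    return replace(m, wild_type=m.mutant, mutant=m.wild_type,
                   ddg=-m.ddg, is_reverse=True)


def augment_with_reverse(mutations):
    """Return a list twice as long, laid out as [direct_0, reverse_0, direct_1, ...]."""
    augmented = []
    for m in mutations:
        augmented.append(m)
        augmented.append(reverse_of(m))
    return augmented
```

Applied to a set of 4,169 direct mutations, this doubling yields 8,338 entries, which matches the naming of the S8338 set.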

Performance-wise, our GGL-PPI2 model posts an Rp of 0.81 with an RMSE of 1.03 kcal/mol for the S4169 dataset, as shown in Figure 2d, outstripping all existing models (Table 3). It is noteworthy that our GGL-PPI1 model, which relies solely on geometric graph-based features, demonstrated comparable performance to GGL-PPI2, outperforming TopNetTree and mCSM-PPI2 with an Rp of 0.80 and an RMSE of 1.06 kcal/mol.

In the case of the S8338 dataset, we applied a stratified cross-validation approach similar to mCSM-PPI2. We ensured that hypothetical reverse mutations were consistently placed either in the training or test sets during the dataset splits, maintaining their relationship to the corresponding original mutations intact throughout the cross-validation process.
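The pairing constraint described above can be sketched as a group-wise split. This illustrates the idea and is not the exact procedure of mCSM-PPI2 or of our experiments. It assumes that every sample carries a group label shared by a direct mutation and its reverse, and the function name `grouped_kfold` is hypothetical.

```python
import random


def grouped_kfold(groups, k=10, seed=0):
    """Yield (train_indices, test_indices) so that no group is split across them.

    `groups[i]` is the group label of sample i; a direct mutation and its
    hypothetical reverse are assumed to share the same label.
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Assign whole groups to folds in round-robin order after shuffling.
    fold_of = {g: i % k for i, g in enumerate(unique)}
    for fold in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test
```

For a dataset laid out as alternating direct and reverse entries, the labels can be as simple as `groups = [i // 2 for i in range(n)]`. Keeping each pair on one side of the split prevents the model from seeing the negated label of a test mutation during training.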

GGL-PPI2 achieved an Rp of 0.85 with an RMSE of 1.07 kcal/mol as depicted in Figure 2e, and GGL-PPI1 closely followed, attaining an Rp of 0.84 with the same RMSE value. As Table 3 attests, our GGL-PPI2 is on par with TopNetTree and outperforms mCSM-PPI2 on the S8338 dataset.

Table 3: Performance comparison of different methods in terms of Pearson correlation coefficients (Rp) for the single-point mutations in the SKEMPI 2.0 (S4169 and S8338) dataset.

2.2 Evaluation

To evaluate our proposed model for predicting binding free energy (BFE) changes of protein-protein interactions, we consider two datasets sourced from the ProTherm database [22].

The first dataset, carefully selected by Pucci et al. [36], is named the S[sym] dataset. It assembles 684 mutations from ProTherm, comprising 342 direct mutations and their corresponding reverse mutations, resulting in a balanced dataset.

The dataset specifically focuses on mutations in fifteen protein chains with solved 3D structures, ensuring high-resolution data with a resolution of 2.5 Å or better.

By providing experimentally measured ∆∆G values and a balanced representation of stabilizing and destabilizing mutations, the S[sym] dataset serves as a valuable resource for evaluating prediction biases in the context of predicting mutation-induced binding affinity changes.

To address the issue of data leakage and enhance the generalization capability of our method, we employed the Q1744 dataset [47]. Quan et al. [48] compiled the Q3421 dataset from ProTherm, consisting of 3,421 single-point mutations across 150 proteins with available PDB structures. However, the presence of homologous proteins in both the training and test sets can lead to interdependent effects of mutations, compromising the model’s performance.

To mitigate this, Li et al. [47] created the Q1744 dataset, derived by excluding overlapping data points and refining protein-level homology between Q3421 and S[sym] datasets, resulting in 1744 distinct mutations.

Furthermore, the Q3488 dataset was created by augmenting the Q1744 set with reverse mutations. We utilized the Q3488 dataset as our training set, thereby enhancing our ∆∆G predictor’s capability to accurately predict BFE changes in PPIs.

We conduct an evaluation of our model on the blind test set S[sym], with a distinct focus on both direct and reverse mutations. To assess the performance, we utilize the Pearson correlation coefficient and root-mean-square error as our primary metrics. Additionally, to discern any prediction bias, we incorporated two statistical measures: Rpdir−rev and δ.

The former calculates the Pearson correlation between predictions for direct and reverse mutations, while the latter represents the sum of predicted ∆∆G values for both types of mutations. The hypothesis is that an unbiased predictor would yield Rpdir−rev = −1 and an average δ (δ̄) of 0 kcal/mol.
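Both bias measures follow directly from the definitions above and can be computed as in this minimal sketch. It assumes that the predictions for direct and reverse mutations are given as two lists aligned pair by pair, and the function name `bias_metrics` is illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bias_metrics(pred_direct, pred_reverse):
    """Return (Rp between direct and reverse predictions, mean delta, per-pair deltas).

    pred_direct[i] and pred_reverse[i] must refer to the same mutation pair.
    delta_i is the sum of the two predicted ddG values for pair i.
    """
    rp_dir_rev = pearson(pred_direct, pred_reverse)
    deltas = [d + r for d, r in zip(pred_direct, pred_reverse)]
    mean_delta = sum(deltas) / len(deltas)
    return rp_dir_rev, mean_delta, deltas
```

A perfectly antisymmetric predictor returns the exact negative for every reverse mutation, so every per-pair delta vanishes and the correlation equals −1.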

Our main focus is to highlight the effectiveness of our model, GGL-PPI2, particularly emphasizing its robust geometric graph-based molecular featurization. GGL-PPI2 has demonstrated consistent prediction accuracy for both direct and reverse mutations. As depicted in Figures 3a and 3b, our model achieves the same Rp of 0.57 and RMSE of 1.28 kcal/mol on both tasks, indicating that it does not overfit to direct mutations.

Additionally, the analysis reveals that a significant proportion of mutations fall within prediction errors of 0.5 kcal/mol and 1.0 kcal/mol: 34.6% and 65.8%, respectively, for direct mutations, and 35.1% and 66.0% for reverse mutations, as depicted in Figures 3d and 3e.
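The proportions quoted above are fractions of mutations whose absolute prediction error stays within a given threshold. A minimal sketch of that calculation follows. It assumes the threshold is inclusive, which the text does not state, and the function name is illustrative.

```python
def fraction_within(y_true, y_pred, threshold):
    """Fraction of samples whose absolute error is at most `threshold` (kcal/mol)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= threshold)
    return hits / len(y_true)
```

Calling the function with thresholds of 0.5 and 1.0 on the direct and on the reverse predictions separately gives the four percentages of the kind reported here.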

Furthermore, Figure 3c demonstrates that GGL-PPI2 effectively addresses prediction bias by achieving a nearly perfect Rpdir−rev value of −0.999 and an extremely low average δ̄ of 0.006 kcal/mol. Finally, the distribution plot in Figure 3f illustrates that 99.4% of mutations exhibit a prediction bias under 0.05 kcal/mol.

In Table 4, we present the prediction results of our models and compare them with other ∆∆G predictors. We observe that our GGL-PPI2 model outperforms ThermoNet [47], which was also trained on the homology-reduced set Q3488, across all evaluation measures. It outperforms ThermoNet by 21.3% for direct mutations and 18.7% for reverse mutations.

Furthermore, the GGL-PPI1 model, which only uses geometric graph-based features, also performs better than ThermoNet in both direct and reverse prediction tasks. This further emphasizes the effectiveness of our geometric-graph approach.

For a broader comparison against other ∆∆G predictors, we introduce the GGL-PPI2∗ model, trained on the Q6428 set constructed before the homology reduction of the set Q3421 [47]. As illustrated in Table 4, GGL-PPI2∗ excels over other methods in reverse mutation predictions.

It is noteworthy that while some methods surpass GGL-PPI2∗ for direct mutations, they frequently exhibit significant bias towards reverse mutations.
