Table of Links

ABSTRACT
I. INTRODUCTION
II. BACKGROUND
III. DESIGN (DEFINITIONS, DESIGN GOALS, FRAMEWORK EXTENSIONS)
IV. MODELING (CLASSIFIERS, FEATURES)
V. DATA COLLECTION
VI. CHARACTERIZATION (VULNERABILITY FIXING LATENCY, ANALYSIS OF VULNERABILITY FIXING CHANGES, ANALYSIS OF VULNERABILITY-INDUCING CHANGES)
VII. RESULT (N-FOLD VALIDATION, EVALUATION USING ONLINE DEPLOYMENT MODE)
VIII. DISCUSSION (IMPLICATIONS ON MULTI-PROJECTS, IMPLICATIONS ON ANDROID SECURITY WORKS, THREATS TO VALIDITY, ALTERNATIVE APPROACHES)
IX. RELATED WORK
CONCLUSION AND REFERENCES

VII. RESULT

This section conducts a comprehensive evaluation of the accuracy of our framework across both the training and inference phases, reflecting real-world performance.

A. N-FOLD VALIDATION

We first identify the optimal classifier type and then reduce the feature dataset.

Classifier Selection. To select the most accurate classifier type, all six types of classifiers are evaluated using the complete set of devised feature data. The training dataset incorporates information about all known ViCs. The evaluation employs the Weka v1.8 [38] toolkit with the default parameter configurations for each classifier, ensuring a fair comparison of their inherent performance.

Table IV shows the 12-fold validation result. The Random Forest classifier demonstrates the highest classification accuracy among the six types tested. It achieves ~60% recall for ViCs with 85% precision, while misclassifying only 3.9% of LNCs (calculated as 1 – 0.992 × 0.969). Based on this result, the rest of this study uses Random Forest.

The superior performance of Random Forest over the Decision Tree classifier is expected, as shown by the relative operating characteristic (ROC) curve areas of 0.955 vs. 0.786. The Quinlan C4.5 classifier likewise shows notably lower precision and recall than Random Forest for classifying ViCs.

The logistic regression classifier exhibits the second-best performance in terms of ROC area, mainly thanks to its relatively high precision (0.768) for ViCs. However, its recall for ViCs (0.414) is significantly lower than that of the Decision Tree and Quinlan C4.5 classifiers. Similarly, the naïve Bayes classifier underperforms the logistic regression classifier across all three metrics (recall, precision, and ROC area).

Finally, the SVM classifier demonstrates the highest recall and good precision for LNCs, indicating that the model is over-fitted to the LNC samples. This is confirmed by its poor recall for ViCs. The over-fitting is likely caused by the imbalanced training dataset, where the LNC samples significantly outnumber the ViC samples. SVM performance generally benefits from a balanced ratio of positive and negative examples (e.g., 1:1), which is particularly difficult to obtain in vulnerability classification tasks.
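To make the setup concrete, the following is a minimal sketch of such a 12-fold evaluation using the Weka Java API. The ARFF file name, the position of the class label, and the index of the ViC class value are assumptions for illustration rather than details from the paper.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierSelectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the per-code-change feature data;
        // the class label (ViC vs. LNC) is assumed to be the last attribute.
        Instances data = DataSource.read("code_change_features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Random Forest with Weka's default parameters, mirroring the setup
        // of comparing classifiers under their default configurations.
        RandomForest rf = new RandomForest();

        // 12-fold cross-validation over the labeled code changes.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 12, new Random(1));

        int vic = 0; // assumed index of the ViC class value in the ARFF header
        System.out.printf("ViC recall:    %.3f%n", eval.recall(vic));
        System.out.printf("ViC precision: %.3f%n", eval.precision(vic));
        System.out.printf("ROC area:      %.3f%n", eval.areaUnderROC(vic));
    }
}
```

The same evaluation can then be repeated with Weka's other classifier implementations (e.g., weka.classifiers.trees.J48 for Quinlan C4.5, weka.classifiers.functions.Logistic, weka.classifiers.bayes.NaiveBayes, and weka.classifiers.functions.SMO for SVM) to fill out a comparison like Table IV.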
Feature Reduction. Let us evaluate the performance of the Random Forest classifier using various subsets of the devised feature data types. The goal is to identify a highly effective feature subset that maintains high accuracy while requiring less data collection during inference than the full feature datasets.

Table V presents the evaluation results. As expected, the first row, using all six feature sets (VH, CC, RP, TM, HH, and PT), represents the best case. Removing the HH (Human History), PT (Process Tracking), or TM (Text Mining) feature set individually leads to minor reductions in the recall (0.4–0.9%) and precision (0.6–2.8%) for classifying ViCs. Practically, this translates to ~5 misclassified ViCs out of the 585 ViCs and ~14 misclassified LNCs out of the 7,453 LNCs. The ROC area remains largely consistent across those three variations (0.954–0.957 for the 2nd, 3rd, and 4th rows), compared to the baseline of 0.955 (the 1st row in Table V).

Let us further investigate the accuracy achieved after removing both the HH (Human History) and PT (Process Tracking) feature sets, followed by the removal of all three (HH, PT, and TM). The results show that the VH (Vulnerability History), CC, and RP (Review Pattern) feature sets still provide high accuracy, exhibiting only a 0.3% reduction in LNC recall and a 4.5% reduction in ViC precision compared to when all features are used. The following discusses each of the three remaining feature sets in more detail.

The VH (Vulnerability History) feature set aligns with the known factors used in buggy component prediction (e.g., temporal, spatial, and churn localities). The results in this study demonstrate that those three types of localities remain relevant and effective for predicting vulnerabilities at the code change level. Among the six VH feature data types, VHtemporal_avg is the most impactful, as confirmed by the fact that none of the other five VH feature data types alone could correctly classify a single ViC in isolation during the 12-fold validation experiment.

The CC (Change Complexity) feature set aligns with the established principle that complexity often leads to software defects, a relationship repeatedly observed when analyzing defect ratios of software components (e.g., files or modules). The data in this study further confirms that more complex code changes are indeed more likely to introduce vulnerabilities. Our VP framework thus signals software engineers to pay extra attention by selectively flagging a subset of code changes as higher risk (e.g., using predicted chances of vulnerabilities), helping identify and fix potential coding errors before those code changes are merged into a source code repository.

The data also confirms the importance of the novel RP (Review Pattern) feature set in the VP framework. Complex code changes are likely to contain software faults, placing a burden on code reviewers to detect coding errors and guide authors toward fixes. While the RP feature set alone does not provide the highest accuracy for ViCs (e.g., 59.5% precision), combining it with the CC (Change Complexity) feature set significantly boosts the precision for ViCs (e.g., 88.6%). The pairing helps identify situations such as complex code changes lacking rigorous review before submission, or authors self-approving complex changes without any explicit peer code reviews recorded. However, the RP and CC feature sets do not offer high ViC recall (e.g., 31.8%), as the pair targets specific code change characteristics; many other factors contribute to ViCs slipping through code reviews and other pre-submit testing.
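The feature-set ablations summarized in Table V can be scripted in the same Weka setting. Below is a minimal sketch that drops the HH and PT columns and repeats the 12-fold validation; the ARFF file name and the attribute index range standing in for the HH and PT columns are hypothetical, not taken from the paper.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FeatureAblationSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export; the class label is assumed to be last.
        Instances data = DataSource.read("code_change_features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Drop the HH and PT feature columns before re-training. The 1-based
        // index range is a placeholder for wherever those attributes sit.
        Remove dropHhPt = new Remove();
        dropHhPt.setAttributeIndices("21-30"); // hypothetical HH and PT columns
        dropHhPt.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, dropHhPt);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        // Re-run the same 12-fold validation on the reduced feature set.
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new RandomForest(), reduced, 12, new Random(1));
        System.out.println(eval.toClassDetailsString()); // per-class precision/recall/ROC
    }
}
```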
Further removing RP (Review Pattern) from the VH, CC, and RP sets significantly reduces accuracy (i.e., 80.5% precision for ViCs drops to 59%). Interestingly, even when both RP and CC (Change Complexity) are removed, the VH (Vulnerability History) feature set alone still provides higher accuracy than the VH and CC sets combined (e.g., an ROC area of 91.8% vs. 80.2%). This is partly because VH exploits the N-fold validation setting (i.e., learning from future ViCs to predict past ViCs). The next subsection (VII.B) addresses this using online inference and demonstrates a general counterexample applicable to all feature data types.

Potential as a Global Model. To enable immediate deployment of a VP model across multiple projects, this study also investigates which feature data types are likely to be target-project agnostic. Among the six feature sets, four (CC, RP, TM, and PT) are potentially not project-specific. In contrast, the HH (Human History) and VH (Vulnerability History) feature sets capture vulnerability statistics tied to specific engineers and software modules, respectively. This suggests that models trained using those two feature sets would not be directly transferable to other projects with different engineers and software modules.

We explore the possibility of a global VP model through another 12-fold validation study. Because one may argue that TM (Text Mining) could be programming language-specific, the accuracy of a global model is evaluated both without and with TM to assess its impact. Table VI shows that using only the CC, RP (Review Pattern), and PT (Process Tracking) feature sets yields relatively low ViC recall (32%) but notably high ViC precision (~90%). The precision increases further when TM is used together with CC, RP, and PT (~95%). The result is promising, as those feature sets could potentially be used across multiple projects due to their ability to minimize false positives.

Let us further reduce the feature sets by considering individual features. Using only the five features listed in the last row of Table VI (CCadd, CCrevision, CCrelative_revision, RPtime, and RPweekday), the VP model achieves an ROC area of 0.786, while maintaining a high ViC precision of 73.4%. This comes at the cost of a notable reduction in recall (i.e., to ~32% from ~60% when all feature sets are used). However, we argue that the penalty is minimal, as evidenced by the still-high LNC precision of 94.9%. Importantly, this approach remains significantly better than not using the VP framework at all, since it retains a ViC recall of 32.1%.
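As a sketch of how such a reduced, potentially project-agnostic model could be built and shared, the following keeps only the five features named above (plus the class label) and serializes the trained model for reuse elsewhere. The ARFF file name, the attribute positions, and the model file name are hypothetical placeholders.

```java
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class GlobalModelSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export from one (source) project.
        Instances data = DataSource.read("code_change_features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Keep only the five potentially project-agnostic features (CCadd,
        // CCrevision, CCrelative_revision, RPtime, RPweekday) plus the class
        // label; the 1-based positions below are hypothetical placeholders.
        Remove keepFive = new Remove();
        keepFive.setAttributeIndices("7,9,10,15,16,last");
        keepFive.setInvertSelection(true); // keep, rather than drop, these columns
        keepFive.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, keepFive);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        // Train on the source project and persist the model so it could be
        // loaded as-is by presubmit checks of other projects.
        RandomForest rf = new RandomForest();
        rf.buildClassifier(reduced);
        SerializationHelper.write("global_vp_model.model", rf);
    }
}
```

Whether such a serialized model actually transfers across projects would still require the kind of cross-project evaluation discussed in Section VIII.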
In our N-fold cross-validation, the recall for ViCs is not notably high. Yet it confirms the extra security coverage that the VP framework can provide without having to conduct extensive security testing. The relatively low recall is likely due to the validation process not fully capturing the inherent temporal relationships, dependencies, and patterns within the feature data. For instance, N-fold validation can reorder a ViC and its corresponding VfC such that the VfC precedes the ViC, violating the natural order. Consequently, we evaluate the VP framework using its online deployment mode to better reflect real-world scenarios.

B. EVALUATION USING ONLINE DEPLOYMENT MODE

This subsection evaluates the VP framework under its production deployment settings (namely, the online deployment mode) using about six years of AOSP vulnerability data. To achieve maximum accuracy, this experiment employs the Random Forest classifier and leverages all devised feature data types. The evaluation data originates from the AOSP frameworks/av project. Each month, the VP framework assesses all code changes submitted in that month using the latest model, trained on data available before that month begins (a minimal sketch of this monthly loop is given at the end of this subsection).

For this evaluation, it is assumed that a ViC is known if and only if it is merged. However, a more realistic scenario considers a ViC known only once its corresponding VfC is merged. The assumption therefore highlights the need for thorough security testing (e.g., fuzzing) to identify ViCs within an average of half a month after they are merged. Thus, existing security testing techniques are crucial to fully realize the potential of the VP framework.

Figure 7 presents the evaluation results of the online deployment mode. For ViCs, the framework demonstrates an average recall of 79.7% and an average precision of 98.2%. For LNCs, it achieves an average recall of 99.8% and an average precision of 98.5%. These results indicate that the online VP framework can identify ~80% of ViCs with ~98% precision, while misdiagnosing only ~1.7% of LNCs (assuming no hidden vulnerabilities within LNCs) at presubmit time, before code changes are merged. The actual misdiagnosis rate is likely lower than 1.7% because vulnerabilities may later be discovered within LNCs. Similarly, the exact ViC accuracy metric values can change depending on the classification of newly discovered ViCs in the future; the direction and magnitude of such changes depend on how those new ViCs were previously classified. Overall, these promising results warrant further investigation for industry and open source community deployments.

Significant variation exists in the ViC recall values (i.e., a standard deviation of 0.249). While one might assume low recall in months with few ViCs (e.g., <5), a sample correlation coefficient analysis shows no significant link (-0.068) between the ViC recall and the ViC count (captured in Figure 8). In contrast, the LNC recall and precision values show less variation. Of the two, precision exhibits slightly wider variation than recall (i.e., a standard deviation of 0.017). This likely stems from the abundance of LNCs each month and the high LNC precision of the VP framework (as shown in Subsection VI.A).

The online mode demonstrates notably higher accuracy than the 12-fold validation using the same feature data and classifier. This is likely because the online mode gives greater weight to recent history within its learning model, effectively leveraging the strong temporal correlations found in certain feature values. For example, a file touched by a ViC is likely to contain another ViC in the near future if the same software engineers continue working on the file (e.g., as author and reviewers) and are performing similar tasks (e.g., as part of a workstream to develop a new feature). The 12-fold validation, with its shuffled training and test data, does not fully capture such temporal causality. Consequently, the online mode results provide a more realistic assessment of the VP framework's accuracy than the 12-fold validation ones.
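Returning to the monthly procedure described at the start of this subsection, the following is a minimal sketch of the retrain-and-predict loop, assuming each month's code-change features are exported to their own ARFF file with a shared header. The file names, the decision threshold, and the ViC class index are assumptions for illustration.

```java
import weka.classifiers.trees.RandomForest;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OnlineDeploymentSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical per-month ARFF exports sharing one header, in time order.
        String[] months = {"2016-01.arff", "2016-02.arff", "2016-03.arff" /* ... */};

        Instances history = null; // labeled code changes merged before the current month
        for (String month : months) {
            Instances monthly = DataSource.read(month);
            monthly.setClassIndex(monthly.numAttributes() - 1);

            if (history != null) {
                // Train only on data available before the month begins,
                // then score every code change submitted in that month.
                RandomForest rf = new RandomForest();
                rf.buildClassifier(history);
                for (Instance change : monthly) {
                    double pVic = rf.distributionForInstance(change)[0]; // assumed ViC index
                    if (pVic > 0.5) {
                        System.out.println("Flag for extra security review: " + change);
                    }
                }
            }

            // Per the evaluation's assumption, a ViC's label is known once it is
            // merged, so this month's changes join the training history afterwards.
            if (history == null) {
                history = new Instances(monthly);
            } else {
                history.addAll(monthly);
            }
        }
    }
}
```

Note that this plain accumulation of history is only an approximation; as discussed above, the production online mode likely weights recent history more heavily.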
Figure 8 reveals that an average of 7.4% of reviewed and merged code changes are classified as ViCs. The framework flags an average of 6.875 LNCs per month for additional security review. This manageable volume (<2 code changes per week) represents an acceptable review cost, especially considering the large number of full-time software engineers working on the target project.

Author: Keun Soo Yim

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.