How Bias, Context, and Data Gaps Shape What We Know About Code Security

Written by codereview | Published 2026/01/19
Tech Story Tags: code-reviews | code-review-security | software-security | cwe-699 | openssl-security | shift-left-security | vulnerability-management | php-security

TL;DR: This section outlines the methodological limits of using code review data to study software security, addressing annotation bias, taxonomy overlap, dataset quality, and generalizability.

Abstract

1 Introduction

2 Background and Related Work

  • Software Security
  • Coding Weaknesses
  • Security Shift-Left
  • Modern Code Review
  • Code Review for Software Security
  • Security Concern Handling Process in Code Review

3 Motivating Examples

4 Case Study Design

  • Research Questions
  • Studied Projects
  • Data Collection
  • Coding Weakness Taxonomy
  • Study Overview
  • Security Concern Identification Approach (RQ1)
  • Alignment Analysis of Known Vulnerabilities (RQ2)
  • Handling Process Identification (RQ3)

5 Preliminary Analysis

  • PA1: Prevalence of Coding Weakness Comments
  • PA2: Preliminary Evaluation of our Security Concern Identification Approach

6 Case Study Results

7 Discussion

8 Threats to Validity

  • Internal Validity
  • Construct Validity
  • External Validity

8 Threats to Validity

We discuss potential threats to the validity of our study.

8.1 Internal Validity

During the manual annotation to identify security concerns, code review comments can be ambiguous or require additional contextual information to understand. In such cases, we preserved the precision of the manual annotation by treating ambiguous or context-poor comments as irrelevant to coding weaknesses. However, as the annotation process was conducted category by category, it may be susceptible to annotator bias. To mitigate this, the comments were independently validated by the third author (Section 4.6.2). Additionally, comments relevant to multiple categories (i.e., those receiving high similarity scores in several categories) were annotated and validated multiple times. During the validation of handling scenarios in RQ3 (see Section 4.8), we encountered a few instances of disagreement, which we attribute to the limitations inherent in code review data and to a potential lack of project-specific expertise.

We were aware that some weaknesses in the CWE-699 taxonomy are not considered harmful from a security perspective. We therefore regularly consulted the extended descriptions in CWE-699 to confirm that the security concerns in question can lead to vulnerabilities. We were also aware that several CWE-699 categories may share similar weaknesses. For example, weaknesses in the Random Number Issues (CWE-1213) category are also listed in the Cryptographic Issues (CWE-310) category. Nevertheless, we identified only three security concerns whose weaknesses appeared in both categories.
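To illustrate the kind of overlap check described above, the sketch below flags a weakness that appears in more than one CWE-699 category. The category membership shown is a hypothetical toy excerpt for illustration, not the official CWE data.

```python
# Hypothetical toy excerpt of CWE-699 category membership; a real check
# would load the official CWE-699 view from MITRE's data feed.
CWE_699_CATEGORIES = {
    "Cryptographic Issues (CWE-310)": {"CWE-327", "CWE-330", "CWE-338"},
    "Random Number Issues (CWE-1213)": {"CWE-330", "CWE-335", "CWE-338"},
}

def categories_of(cwe_id: str) -> list[str]:
    """Return every CWE-699 category that lists the given weakness."""
    return [cat for cat, members in CWE_699_CATEGORIES.items()
            if cwe_id in members]

# A concern mapped to CWE-330 falls into both categories, so it would be
# annotated and validated once per category, as described above.
print(categories_of("CWE-330"))
# ['Cryptographic Issues (CWE-310)', 'Random Number Issues (CWE-1213)']
```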

8.2 Construct Validity

We used an automated text-based approach to facilitate our manual annotation process. The performance of this approach can be suboptimal due to the limited vocabulary of the documents; we mitigated this concern by including CWE's alternate terms that developers might use. It should also be noted that the choice of word-embedding technique can affect the likelihood of finding relevant code review comments. To reduce this risk, we carefully selected a word-embedding model pre-trained on software engineering text. Finally, in the manual annotation process, we read only the comments with high similarity scores, continuing the manual analysis until reaching a saturation point. It is possible that some of the unread comments also contain coding weaknesses.
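As a concrete illustration of this kind of text-based ranking, here is a minimal sketch, assuming averaged word vectors compared by cosine similarity; this is one plausible realization, not the exact pipeline used in the study. The model file name, sample comments, and CWE description are placeholders.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format model pre-trained on software
# engineering text (e.g., a Stack Overflow corpus) could be loaded here.
vectors = KeyedVectors.load_word2vec_format("se_word2vec.bin", binary=True)

def embed(text: str) -> np.ndarray:
    """Average the embeddings of in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in tokens], axis=0)

def similarity(comment: str, cwe_description: str) -> float:
    """Cosine similarity between a comment and a CWE category description."""
    a, b = embed(comment), embed(cwe_description)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Rank review comments against one CWE description; annotators then read
# from the top of the ranking until reaching the saturation point.
cwe_text = "improper neutralization of input leading to injection"
comments = ["please sanitize this user input", "rename this local variable"]
ranked = sorted(comments, key=lambda c: similarity(c, cwe_text), reverse=True)
print(ranked)
```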

For RQ2, we analyzed the alignment between known vulnerabilities and security concerns by observing the distribution of related weaknesses. It is worth noting that CWE assignments for CVEs are based on security experts' judgment and can therefore be subjective. Additionally, CVE records can be updated, so our analysis is limited to the records as they stood at the time of data collection. For RQ3, we found two PHP pull requests with long discussion threads (100–300 comments). Although we were able to locate the identified comments, it was difficult to observe the handling scenarios, i.e., whether the issue was eventually addressed by developers. To avoid misinterpreting the handling process based on these code review activities, we dropped these two pull requests from the RQ3 results. We tried to minimize this problem by manually checking the final code change and the developers' reactions; however, there is no effective way to mitigate this issue completely. For transparency, we released the dataset used in this study in our supplementary materials.
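The exclusion rule for RQ3 amounts to a simple filter over thread length. The record structure below is a hypothetical illustration of how such a filter might look; the study's actual dataset schema may differ.

```python
# Hypothetical pull-request records; the study's actual schema may differ.
pull_requests = [
    {"id": 101, "project": "php", "n_comments": 212},
    {"id": 102, "project": "php", "n_comments": 14},
    {"id": 103, "project": "openssl", "n_comments": 37},
]

MAX_THREAD_LENGTH = 100  # threads above this are too long to trace reliably

# Keep only pull requests whose discussion is short enough for the
# handling scenario (RQ3) to be observed; report what was dropped.
analyzable = [pr for pr in pull_requests if pr["n_comments"] <= MAX_THREAD_LENGTH]
dropped = [pr["id"] for pr in pull_requests if pr["n_comments"] > MAX_THREAD_LENGTH]
print(f"analyzable: {len(analyzable)}, dropped PRs: {dropped}")
```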

Finally, the quality of the studied datasets can affect the validity of the results. Although the studied projects primarily conduct code reviews on GitHub, we cannot guarantee that our datasets include every code review in each project because some code reviews may not be documented.

8.3 External Validity

While increasing the number of studied projects may strengthen the generalizability of the findings, expanding the set of studied subjects is not a trivial task. Only a limited number of projects fit our selection criteria, e.g., project size (small projects may not have sufficient security discussion (Di Biase et al., 2016)), a history of past vulnerabilities (needed for the alignment analysis), the availability of code review data, and a mandatory code review policy. Furthermore, Nagappan et al. (2013) suggested that indiscriminately increasing the sample size in software engineering studies does not necessarily improve generalizability. During the annotation process, we observed that both studied projects have several traits specific to their application domains, so the findings may include aspects that do not apply to other software projects. Thus, the analysis of the studied dataset does not allow us to draw conclusions for all open-source projects. Nevertheless, we carefully selected two distinct projects that differ in nature and in their potential security issues: PHP is a general-purpose scripting language that may face a wide range of security threats depending on how it is used, while OpenSSL is a library with a primary focus on security. Hence, we believe that the security issues present in these two projects are also relevant to other software projects within similar application domains.

Further studies are required to confirm this hypothesis. As our findings are based on a snapshot of the code review datasets up to June 2022, the recency of the data can be a concern. To mitigate this issue, we analyzed the coding weaknesses in newly collected code review datasets from both projects, covering June 2023 to February 2024 and comprising 6,365 code review comments and 1,427 pull requests in total. We found no major difference in the prevalence of coding weakness discussion between the two datasets: nine categories remain in the top 10 for OpenSSL, and six categories remain in the top 10 for PHP. However, we cannot guarantee that these results will hold in future code reviews.
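The stability check described here boils down to comparing the top-10 category rankings of the two snapshots. Below is a minimal sketch with made-up counts; the real frequencies come from the annotated review comments.

```python
from collections import Counter

# Made-up category frequencies for two snapshots of one project.
snapshot_2022 = Counter({"Memory Buffer Errors": 40, "Data Validation": 31,
                         "Error Conditions": 22, "Pointer Issues": 18})
snapshot_2024 = Counter({"Memory Buffer Errors": 12, "Error Conditions": 9,
                         "Data Validation": 8, "Concurrency Issues": 5})

def top_categories(counts: Counter, k: int = 10) -> set[str]:
    """Return the names of the k most frequent categories."""
    return {category for category, _ in counts.most_common(k)}

# Overlap of the top-k categories between the two collection periods.
shared = top_categories(snapshot_2022) & top_categories(snapshot_2024)
print(f"{len(shared)} categories appear in both top-10 lists: {sorted(shared)}")
```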

This paper is available on arXiv under a CC BY 4.0 (Attribution 4.0 International) license.

