A Case Study of Security Concerns in Real-World Code Reviews

Table of links

Abstract

1 Introduction

2 Background and Related Work

Software Security
Coding Weaknesses
Security Shift-Left
Modern Code Review
Code Review for Software Security
Security Concern Handling Process in Code Review

3 Motivating Examples

4 Case Study Design

Research Questions
Studied Projects
Data Collection
Coding Weakness Taxonomy
Study Overview
Security Concern Identification Approach (RQ1)
Alignment Analysis of Known Vulnerabilities (RQ2)
Handling Process Identification (RQ3)

5 Preliminary Analysis

PA1: Prevalence of Coding Weakness Comments
PA2: Preliminary Evaluation of our Security Concern Identification Approach

6 Case Study Results

7 Discussion

8 Threats to Validity

Internal Validity
Construct Validity
External Validity

4 Case Study Design

In this section, we outline our study by describing the research questions, the selection processes of studied subjects and coding weakness taxonomy, the data collection method, and our analysis approach to answer our research questions. As comments related to security issues are sparse in code reviews (Di Biase et al., 2016), it can be challenging to synthesize an insightful result from diverse data sources. To overcome this problem, we followed the case study method (Perry et al., 2004) to explore the real-world security concerns in selected software projects that are prone to security issues. We chose OpenSSL and PHP for our case study based on their history of known vulnerabilities and their code review activities.

4.1 Research Questions

To understand the potential benefits of considering coding weaknesses during code reviews for early prevention of software security issues, we formulate the following research questions.

RQ1: What kinds of security concerns related to coding weaknesses are often raised in code review? Motivation: Code review is an approach that many projects use to identify and eliminate defects early, before integrating the new code into the codebase. However, Braz and Bacchelli (2022) raised a concern that identifying security issues during the code review process can be challenging for developers because of a lack of security knowledge and awareness. As Bojanova and Galhardo (2023) showed that a chain of coding weaknesses can be the root cause of a security issue, we hypothesize that code reviews could identify such coding weaknesses, which have simpler coding patterns and require less security knowledge. Prior studies have yet to investigate the security concerns that reviewers could raise from the coding weakness perspective. Since coding weaknesses are more visible to developers than security issues, an understanding of these security concerns may provide guidance to improve security issue identification in code reviews.

RQ2: How aligned are the raised security concerns and known vulnerabilities? Motivation: While RQ1 helps us to better understand what kind of coding weaknesses can be raised during the code review process, it is still unclear whether current code review practices have focused on coding weaknesses related to the real vulnerabilities that were known in the past. Thus, we set out RQ2 to examine the alignment of vulnerabilities that the systems had and the raised coding weakness comments. The findings will highlight the types of coding weaknesses that may not be sufficiently discussed in the code reviews. This understanding could increase the reviewer’s awareness of the less frequently identified coding weaknesses.

RQ3: How are security concerns handled in code review? Motivation: Developers can respond to the raised security concerns (i.e., coding weakness raised during code reviews) in various ways in order to address the reviewer’s comments and get the code accepted. Kononenko et al. (2016) found that developers consider security concerns in reviewers’ comments before modifying the code. In contrast, Lenarduzzi et al. (2021) reported that security defects do not influence the acceptance decision of the proposed code changes. However, little is known when it comes to the security concerns from coding weakness comments. An extended understanding of how developers respond to these security concerns could shed more light on the remaining challenges of current secure code review practice. Hence, we set out to explore how the developers handle security concerns from coding weaknesses.

4.2 Studied Projects

We aimed to conduct a case study of the code review process of software systems that are prone to security issues. Therefore, to select suitable projects, we considered the following criteria:

Use C or C++ as the major programming languages: Unlike other programming languages, users of the C and C++ programming languages can access and manipulate lower-level environments (Turner, 2014) that are susceptible to security issues.
Actively performing code reviews: Quality assurance practices such as peer code reviews can improve the quality of code and reduce defects in the code base (Bacchelli and Bird, 2013; Beller et al., 2014).
An accessible code review history: To enable the data extraction, the complete code review data must be publicly accessible.

To find subjects prone to security issues, we obtained a list of C and C++ software systems that have records of publicly reported vulnerabilities from the works of Hazimeh et al. (2020) and Lipp et al. (2022). We obtained nine software projects: Binutils, FFmpeg, Libpng, Libtiff, Libxml2, OpenSSL, PHP, Poppler, and SQLite3.

We first checked whether the code review history of the project is publicly available, as it is essential for our analysis. We also carefully checked whether the project regularly performs code reviews for every new code change by examining the project's code repository (e.g., GitHub and GitLab), documents on the project's websites, and code review history in other sources (e.g., mailing list). We found that OpenSSL and PHP established a public contribution policy that requires the developers to create GitHub pull requests and address any reviewer's comments for all public code submissions; Libpng, Libtiff, and Libxml2 have a relatively small code review history (fewer than 350 proposed code changes that received code review comments); SQLite3 employs a private code review process; Binutils and FFmpeg perform code review on mailing lists, which complicates the process of extracting code review information; and Poppler has a large amount of code integration without reviewers' comments.

Therefore, OpenSSL and PHP were the remaining projects that met our criteria. OpenSSL is a popular encryption library for secure communication over the Internet, and PHP is a widely used web scripting language. In terms of open-source project characteristics, OpenSSL has received more than 13k pull requests from nearly 900 active developers and has over 23.8k stars on GitHub, while PHP has received over 10k pull requests from over 900 active developers and has approximately 37k stars on the same platform. In addition to a remarkable number of active developers, both projects are important to the software development communities because numerous software systems rely on or implement them. As of July 2022, the number of reported CVE vulnerabilities was 215 for OpenSSL and 662 for PHP. Both projects regularly perform code reviews on GitHub, where newly proposed code changes are submitted as pull requests.

4.3 Data Collection

As we aimed to identify the types of coding weaknesses raised by reviewers and the handling process of these concerns, we needed to analyze the code review history, especially the review discussions. Therefore, we first downloaded the code review histories of the studied projects using the GitHub REST API. We downloaded the pull requests and all comments on the pull requests. We accessed and retrieved all the code review historical data of both projects at the end of June 2022. The earliest pull requests that we downloaded from OpenSSL and PHP were created in September 2013 and July 2011, respectively. Then, we selected pull requests that are (1) closed, (2) proposed to upstream, i.e., the main branch, (3) comprised of at least one C or C++ file, and (4) have received at least one comment from a human reviewer. It should be noted that we considered the code review comments from (1) the main discussion of a pull request and (2) the code level because reviewers may discuss coding weaknesses at both levels. Table 2 shows the number of downloaded and selected pull requests, as well as the number of comments on the selected pull requests.
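The four selection criteria can be expressed as a simple filter. The following is a minimal sketch, not the authors' script: the record fields (`state`, `base_branch`, `files`, `comments`, `is_bot`), the extension list, and the default branch name are hypothetical and do not mirror the GitHub REST API schema. It also assumes that "human reviewer" can be approximated by excluding bot accounts.

```python
# Hypothetical file extensions treated as C or C++ sources and headers.
C_CPP_EXTENSIONS = (".c", ".h", ".cc", ".cpp", ".cxx", ".hpp")


def is_selected(pr, main_branch="master"):
    """Return True if a pull request record meets all four selection criteria."""
    is_closed = pr["state"] == "closed"                      # criterion (1)
    targets_main = pr["base_branch"] == main_branch          # criterion (2)
    touches_c_cpp = any(                                     # criterion (3)
        path.lower().endswith(C_CPP_EXTENSIONS) for path in pr["files"]
    )
    has_human_comment = any(                                 # criterion (4)
        not comment["is_bot"] for comment in pr["comments"]
    )
    return is_closed and targets_main and touches_c_cpp and has_human_comment


# Toy records for illustration only; they are not taken from the studied projects.
pull_requests = [
    {"id": 1, "state": "closed", "base_branch": "master",
     "files": ["crypto/rand.c"], "comments": [{"is_bot": False}]},
    {"id": 2, "state": "open", "base_branch": "master",
     "files": ["crypto/rand.c"], "comments": [{"is_bot": False}]},
    {"id": 3, "state": "closed", "base_branch": "master",
     "files": ["README.md"], "comments": [{"is_bot": False}]},
    {"id": 4, "state": "closed", "base_branch": "master",
     "files": ["ext/main.cpp"], "comments": [{"is_bot": True}]},
]

selected_ids = [pr["id"] for pr in pull_requests if is_selected(pr)]
```

Only the first toy record passes all four checks; the others fail on state, file type, and human comments, respectively.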

4.4 Coding Weakness Taxonomy

Our objective was to classify security concerns in code review comments from the perspective of development flaws and not from the perspective of an attacker, which has been done in previous work (Paul et al., 2021b; Bosu et al., 2014; Bosu and Carver, 2013; Edmundson et al., 2013; Di Biase et al., 2016). We aim to adopt the taxonomy that is generally applicable to the code reviews in any software system. The selected taxonomy covers the realistic security concerns from typical reviewers who may have fundamental software development expertise, but limited security knowledge and awareness (Braz and Bacchelli, 2022). Therefore, we selected an existing taxonomy based on the following criteria:

– Covers the diverse coding weaknesses that are not restricted by the types of well-known security issues

– Provides detailed descriptions and examples for the ease of the annotation process and the future applicability for practitioners

– Focuses on the business and application logic that can be addressed in code rather than low-level aspects such as network or hardware

– Unattached to a specific technology, programming language, or platform

We considered four coding weakness taxonomies from industrial guidelines and standards: (1) Common Weakness Enumeration 699 (CWE-699) (CWE, b), (2) Common Weakness Enumeration 1000 (CWE-1000) (CWE, a), (3) OWASP Code Review Guide (Owa), and (4) OWASP Application Security Verification Standard (OWASP ASVS) (OWA). Taking into account these four taxonomies based on our criteria, we opted to use CWE-699 in our study for the following reasons. CWE-699 covers a large number of coding weaknesses that can be linked to significant security issues. In particular, CWE-699 has 40 categories containing more than 400 weaknesses that may not expose the security implications (hence, do not require deep security knowledge to identify) but still lead to vulnerabilities and can be introduced during software implementation. On the contrary, the other taxonomies have a relatively smaller number of weaknesses (e.g., 6 categories (Croft et al., 2022)). Although CWE-1000 has a higher number of weaknesses, it includes weaknesses in other layers, e.g., hardware or platform, that can be outside of the code review context. CWE-699 provides an extended description of each of its weaknesses, while the OWASP Code Review Guide, which provides a checklist of nine critical vulnerabilities, does not provide a clear description of the suggested vulnerabilities that reviewers can utilize. As CWE-699 focuses on software implementation, its weaknesses are to some extent generalizable to any technology. On the other hand, OWASP ASVS, which provides 14 categories of security requirements, is more specific to web applications. The simplified definitions of security weaknesses are shown in Table 3.

4.5 Study Overview

Figure 5 shows an overview of our study. To explore the coding weaknesses raised by the reviewers (RQ1), we use a semi-automated approach to identify comments that raised coding weaknesses in a pull request. In particular, we apply text analysis techniques to sort comments that have high semantic similarity with the descriptions of coding weaknesses in the CWE-699 taxonomy. Then, we manually validate and annotate the types of security weaknesses based on the CWE-699 taxonomy. The security concerns found in RQ1 will be further qualitatively analyzed for RQ2 and RQ3, along with their related pull request information and known vulnerabilities. For the alignment assessment with known vulnerabilities (RQ2), we analyze the vulnerabilities of the studied systems that were reported in the past against the security concerns raised during the code reviews. For RQ3, we qualitatively analyze the related code review comments and corresponding code changes within the pull requests to investigate the handling scenario of the raised security concerns.

The semi-automated approach (RQ1) provided dual benefits. Firstly, it addressed the challenges in identifying relevant code review comments, overcoming limitations faced by previous studies (Bosu et al., 2014; Paul et al., 2021b; Alfadel et al., 2023) whose keyword-based approach can identify only a limited set of issues in code review comments. Thus, it enabled the discovery of more comments related to coding weaknesses. Secondly, it mitigated the manual effort required for validating comments associated with each type of coding weakness. However, manual validation and annotation remain essential to eliminate false positives beyond the capacity of the automated process.

We explain our study approaches in the following sections: Section 4.6 describes the security concern identification approach for RQ1, Section 4.7 describes the known vulnerabilities alignment analysis approach for RQ2, and Section 4.8 describes the handling scenarios identification approach for RQ3.

4.6 Security Concern Identification Approach (RQ1)

Due to the large number of code review comments (i.e., 135K comments in our dataset; see Table 2), it is not feasible for us to manually identify security comments related to coding weaknesses (so-called security concerns). We opted to use an automated approach to support our security concern identification process. While prior works (Bosu et al., 2014; Paul et al., 2021b; Alfadel et al., 2023) used the keyword-based approach to identify the comments that mentioned vulnerabilities, their pre-defined keyword lists are limited to specific types of security issues, which do not cover all categories in CWE-699. Indeed, the keyword-based approach does not find the code review comments that contain other coding weaknesses that are not included in the keyword lists. Therefore, we resorted to employing a semi-automated approach that combines a textual analysis technique, which uses a pre-trained word-embedding model to rank comments that are semantically related to coding weaknesses, with human evaluation. Our approach comprises two steps: 1) calculating semantic similarity with the descriptions of coding weaknesses and 2) manually validating and annotating the types of security weaknesses based on the CWE-699 taxonomy. We now provide the details of each step below.

4.6.1 Semantic Similarity Calculation

We measure the semantic similarity between a code review comment and the description of a coding weakness using cosine similarity:

\[ \mathrm{similarity}(C, W) = \frac{C \cdot W}{\lVert C \rVert \, \lVert W \rVert} \]

where C and W are the vectors of the code review comment and the description of a coding weakness, respectively. Prior to creating the vectors, we prepare the comments and the description of each of the 40 categories in CWE-699 by removing hyperlinks, stopwords, numbers, and non-alphanumeric characters. We also apply SnowballStemmer (Sno) to obtain the common form of each word. Then, we generate the vector of each code review comment and 40 vectors of the descriptions of coding weaknesses by using the Gensim library (Rehurek and Sojka, 2011) and calculate the similarity scores. Therefore, each code review comment will receive 40 similarity scores. A high similarity score for a particular category indicates that the comment is more likely related to that category.

The results of the cosine similarity calculation strongly depend on the vector representation of the documents. In order to select the vector representation that yields the best results, we explore two vector representation techniques.

– Term Frequency - Inverse Document Frequency (TF-IDF): We measure the cosine similarity between each code review comment against the full description of each of the CWE-699 weakness categories. We used Term Frequency - Inverse Document Frequency (TF-IDF) vectors (Tata and Patel, 2007) to represent the code review comment and the description of the weakness. A TF-IDF vector represents the significance of every word in a document (i.e., code review comment and the weakness description). The Term Frequency (TF) is the number of times that a word appears in a document over the total number of words in the document, and the Inverse Document Frequency (IDF) is the logarithm of the ratio between the number of all documents and the number of documents that contain the word. The TF-IDF score for each word is the multiplication of the TF value and IDF value. The TF-IDF vector contains the TF-IDF scores of all words in a document.

– Word Embedding: Developers may use interchangeable terms during code review, which may differ from the wording of the weakness description. For example, login may refer to authenticate in some contexts. Conventional vector representations such as TF-IDF treat such terms as unrelated and therefore miss the broader meaning of a word. A soft cosine similarity technique that incorporates word embedding models with cosine similarity (Ye et al., 2016; Sidorov et al., 2014) can mitigate this limitation. Instead of using TF-IDF vectors, we use word embedding models to generate vectors that represent the code review comment and the full description of weaknesses. We explore a pre-trained word embedding model in the software engineering domain, namely SO Word2Vec (Efstathiou et al., 2018). It should also be noted that we check whether each word is included in the word-embedding model before the stemming process. Words that are not in the word-embedding model are stemmed and retained for the similarity score calculation, which addresses the out-of-vocabulary (OOV) problem.
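The two representations can be compared on a toy example. The sketch below is a minimal, standard-library illustration and not the authors' Gensim-based pipeline: `tf_idf_vectors` follows the TF and IDF definitions given above, `cosine` is plain cosine similarity over sparse vectors, and `soft_cosine` weights cross-term products by a term-similarity function. The function names, the toy tokens, and the hand-set `toy_term_similarity` (standing in for similarities that would be derived from a word embedding model such as SO Word2Vec) are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse {term: weight} TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # TF = occurrences / document length; IDF = log(N / document frequency)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(c, w):
    """Plain cosine similarity between two sparse vectors."""
    dot = sum(weight * w.get(term, 0.0) for term, weight in c.items())
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    norm_w = math.sqrt(sum(v * v for v in w.values()))
    return dot / (norm_c * norm_w) if norm_c and norm_w else 0.0


def soft_cosine(c, w, term_sim):
    """Soft cosine: every pair of terms contributes, scaled by term_sim."""
    def soft_dot(a, b):
        return sum(wa * wb * term_sim(ta, tb)
                   for ta, wa in a.items() for tb, wb in b.items())
    denom = math.sqrt(soft_dot(c, c)) * math.sqrt(soft_dot(w, w))
    return soft_dot(c, w) / denom if denom else 0.0


def toy_term_similarity(a, b):
    """Hypothetical stand-in for embedding-based term similarity."""
    if a == b:
        return 1.0
    return 0.8 if {a, b} == {"login", "authenticate"} else 0.0


# Toy, already preprocessed token lists (illustrative only).
comment = ["login", "fail", "password"]
weakness = ["authenticate", "user", "password"]
unrelated = ["memory", "buffer", "free"]

tfidf = tf_idf_vectors([comment, weakness, unrelated])
tfidf_related = cosine(tfidf[0], tfidf[1])
tfidf_unrelated = cosine(tfidf[0], tfidf[2])

plain_score = cosine(Counter(comment), Counter(weakness))
soft_score = soft_cosine(Counter(comment), Counter(weakness), toy_term_similarity)
```

On this toy input, plain cosine credits only the shared token, whereas soft cosine also credits the login/authenticate pair and therefore scores the comment higher against the weakness description.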

4.6.2 Manual Validation & Annotation of Security Concerns

Once we calculated semantic similarity scores for each code review comment against each of the coding weakness descriptions in the CWE-699 taxonomy, we manually validated and annotated the code review comments into the CWE-699 taxonomy. In particular, we manually analyzed the comments from those with the highest similarity scores to those with lower scores. Specifically, for each category, the comments were sorted by similarity score in descending order before manually annotating the entire body of each comment to determine whether a security concern that is relevant to the coding weaknesses was raised. Figure 6 illustrates an example of our manual validation and annotation approach.

We performed manual validation and annotation in two rounds. First Round–Screening The first round aimed to preliminarily remove comments that are generic or irrelevant to coding weaknesses (e.g., related to bookkeeping and code styling). For each CWE-699 category, we carefully read the comments and determined whether they were related to that category based on its description (see Table 3). Due to the large number of comments, it is not feasible to validate all the comments across the 40 categories (i.e., 40 × 135K scores in total). Thus, for each category, we validated the comments until reaching the saturation point, i.e., 50 consecutive comments were identified as generic or irrelevant comments. In total, we validated 9,704 scores of 6,146 code review comments from both projects. Note that one comment can have high similarity scores in multiple categories. For example, Comment#2 has high similarity scores in categories 1 and 2 (see Figure 6). Hence, such a comment will be read and validated more than once. This screening process in the first round was done by the first author (30 categories) and a third-year PhD student who is experienced in manual analysis (10 categories).
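The saturation rule for one category can be sketched as a loop over the ranked comments. This is an illustrative sketch only: `screen_until_saturation` and its parameters are hypothetical names, and the `is_relevant` callback stands in for the human judgment made by the annotators, which no automated check replaces.

```python
def screen_until_saturation(scored_comments, is_relevant, patience=50):
    """Screen one category's comments in descending score order.

    scored_comments: iterable of (comment, similarity_score) pairs.
    is_relevant: callable standing in for the annotator's manual decision.
    patience: consecutive irrelevant comments that define the saturation point.
    """
    ranked = sorted(scored_comments, key=lambda item: item[1], reverse=True)
    kept = []
    consecutive_irrelevant = 0
    for comment, _score in ranked:
        if is_relevant(comment):
            kept.append(comment)
            consecutive_irrelevant = 0   # any relevant comment resets the count
        else:
            consecutive_irrelevant += 1
            if consecutive_irrelevant >= patience:
                break                    # saturation point reached
    return kept


# Toy data with a small patience value so the stopping rule is visible.
toy_scores = [
    ("relevant A", 0.9), ("generic", 0.8), ("relevant B", 0.7),
    ("generic", 0.6), ("generic", 0.5), ("relevant C", 0.4),
]
screened = screen_until_saturation(
    toy_scores, is_relevant=lambda text: text.startswith("relevant"), patience=2
)
```

In the toy run, "relevant C" is never read because two consecutive generic comments precede it, which is the kind of miss that the false negative check on unseen comments is meant to estimate.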

Second Round–Validation The second round aimed to carefully identify security concerns from the comments that passed the first round. In particular, we determined whether the comments raised legitimate security concerns that are relevant to the coding weaknesses. Specifically, a comment was determined to contain a legitimate security concern if it was relevant to one of the CWE-699 categories and met at least one of the following two conditions:

– Reviewer purposefully remarked on security consequence(s). For example, the reviewer commented "Are these implementations safe against charset based attacks?", which can be considered to indicate a problem with the neutralization of the input data.

– Reviewer expressed a concern that can potentially point to a security issue, but did not explicitly mention security consequence(s). For example, the reviewer commented "[...] login which always asks for name/password but returns 'no such user' or 'wrong password' Nowadays they always return 'Bad login'", which may indicate an improper authentication process.

When identifying concerns, it is possible that multiple comments in the same pull request indicated the same security concern. For example, two reviewers suggested that the developer verify a concurrency issue in the code. In such a case, we consider these comments as a single security concern for that pull request. Similarly, a comment can also contain multiple legitimate concerns. For example, a reviewer raised concerns about the predictability of the random number generator algorithm, which can be interpreted as both Random Number Issues and Cryptographic Issues concerns based on the CWE-699 categories. In this case, we considered that this pull request has two concerns.

The second round of annotation was done independently by the first author and the third author, who has ten years of experience in security testing. To ensure a consistent classification in the second round, the first author and the second author performed co-classification on a small set of comments to establish a common understanding of the annotation task. Both authors independently classified the remaining comments, and the inter-rater agreement was measured using Cohen's Kappa (Cohen, 1960). It should be noted that the authors also assessed the code changes and the code review conversation to better understand the context when necessary. For comments with disagreement between the two authors, all authors met and discussed how to resolve the conflicts. The code review comments that were annotated from this process are called security concerns in the later steps.
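For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a minimal standard-library version; the function name and the toy labels are assumptions and are not the study's annotation data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations (illustrative only).
annotator_1 = ["concern", "concern", "none", "none"]
annotator_2 = ["concern", "none", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (observed agreement 0.75) against a chance agreement of 0.5, which gives a Kappa of 0.5.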

To ensure that the chance of missing the relevant comments is minimized, we also assessed the false negative rate on 400 randomly sampled unseen comments, which were left out after the saturation point in each category was reached, from each project. We found very few unseen comments (two in OpenSSL and three in PHP) that can be considered security and weakness-related. In comparison to the ratio of positively identified comments from annotated data, we considered the false negative rate insignificant.

4.7 Alignment Analysis of Known Vulnerabilities (RQ2)

To answer our RQ2, we further analyzed the security concerns raised during the code reviews obtained from RQ1 against the known vulnerabilities of the studied systems that were reported in the past. To study the past vulnerabilities in our studied projects, we used Common Vulnerabilities and Exposures (CVE), which is the collection of publicly reported vulnerabilities in software systems. We downloaded the CVE entries of OpenSSL and PHP from the CVE mirror database. We collected the CVE entries that were reported within the same timeframe as the analyzed code reviews (2013-2022 in OpenSSL; 2011-2022 in PHP). We downloaded a total of 101 CVEs for OpenSSL and 277 CVEs for PHP. We then excluded the CVEs that were not assigned any CWEs since we would use the assigned CWE numbers to map into the CWE-699 taxonomy. In addition, we excluded the CVEs that have deprecated CWEs. Finally, we studied 81 and 236 CVEs for OpenSSL and PHP, respectively. Table 4 shows the number of CVEs used in this study.

To examine the alignment between the security concerns raised during the code reviews and the vulnerabilities of the studied systems that were reported in the past, we quantitatively analyzed the frequency of the concerns and the CVE entries across the 40 categories of the CWE-699 taxonomy. While each CVE entry has an assigned CWE number, it may not be directly associated with the CWE-699 categories. A clear mapping between CVE entries and CWE-699 categories is not available because a CVE entry can be assigned any CWE number, which may not be under the CWE-699 categories. Therefore, we used the CWE hierarchical tree to systematically map the CVE entry and its assigned CWE number into the relevant CWE-699 categories. In particular, we considered the CVE entry to be relevant to category A in the CWE-699 categories when:

– The assigned CWE has the same CWE as the category A; or

– The assigned CWE has a relationship (e.g., PeerOf, CanFollow, and CanAlsoBe) with CWE of category A; or

– The assigned CWE is the child of CWE of category A.

To illustrate the CWE mapping process, we provide an example for each condition as follows. For the first condition, CVE-2014-8275 has been assigned CWE-310, which is the category Cryptographic Issues. For the second condition, CVE-2014-3670 is relevant to category Numeric Errors (CWE-189) because it has been assigned CWE-119, which has the CanFollow relationship with CWE-128, which belongs to category CWE-189. For the third condition, CVE-2017-12932 is considered relevant to category Pointer Issues (CWE-465) because it has been assigned CWE-416, which is the child of CWE-825, which belongs to category CWE-465. Note that an assigned CWE can be relevant to multiple CWE-699 categories. In that case, we classify that CVE entry as relevant to every related CWE-699 category. If the assigned CWEs of a CVE cannot be mapped based on the heuristic above, we manually map them into the relevant CWE-699 categories based on the descriptions. In particular, 29 (out of 81 for OpenSSL) and 81 (out of 236 for PHP) CVEs require a manual mapping of CWEs based on the descriptions. Table 5 shows a list of CWEs that we manually mapped into the CWE-699 categories.
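The three mapping conditions can be sketched as a lookup over small relationship tables. This is a minimal illustration under stated assumptions: the dictionaries contain only the CWE relationships named in the examples above rather than the full CWE hierarchy, the names `CATEGORY_MEMBERS`, `PARENTS`, `RELATED`, and `map_to_categories` are hypothetical, and only direct parents are checked because the text speaks of a child relationship.

```python
# CWE-699 category id -> member CWEs, limited to those named in the examples.
CATEGORY_MEMBERS = {
    "CWE-310": set(),          # Cryptographic Issues
    "CWE-189": {"CWE-128"},    # Numeric Errors
    "CWE-465": {"CWE-825"},    # Pointer Issues
}

# Direct parent links (child -> parents), limited to the third example.
PARENTS = {"CWE-416": {"CWE-825"}}

# Non-hierarchical links such as PeerOf, CanFollow, CanAlsoBe (second example).
RELATED = {"CWE-119": {"CWE-128"}}


def map_to_categories(assigned_cwe):
    """Return every category the assigned CWE maps to under the three conditions."""
    matches = set()
    for category, members in CATEGORY_MEMBERS.items():
        same_as_category = assigned_cwe == category                       # condition 1
        related_to_member = bool(RELATED.get(assigned_cwe, set()) & members)  # condition 2
        child_of_member = bool(PARENTS.get(assigned_cwe, set()) & members)    # condition 3
        if same_as_category or related_to_member or child_of_member:
            matches.add(category)
    return matches


crypto = map_to_categories("CWE-310")
numeric = map_to_categories("CWE-119")
pointer = map_to_categories("CWE-416")
```

An empty result corresponds to the case in which the heuristic finds no category and the CWE has to be mapped manually from its description.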

4.8 Handling Process Identification (RQ3)

To understand how developers handle security concerns (i.e., code review comments that mention coding weaknesses that could lead to security issues) (RQ3), we employed a lightweight coding method, similar to the approach used by Gousios et al. (2014), to analyze what happens after security concerns are raised. We analyzed the security concerns that were manually identified in RQ1. Our aim was to identify the handling scenario based on the code review activities that occurred after security concerns were raised. To do so, for each security concern, we read the discussion in the pull request, including the developer responses and the subsequent comments by the reviewer or other reviewers. In addition, similar to prior work (Rahman et al., 2017), if the security concern points to particular lines of code, we checked whether the developer subsequently modified the associated lines of code to address the concern. Then, we summarized the observation for each security concern. Finally, the summarized observations were sorted into groups on the basis of their thematic similarities and a handling scenario theme was defined for each group.

The manual analysis in RQ3 was done by the first author and the second author. We analyzed the handling scenarios in three steps. In the first step, the first author summarized the discussion in a pull request into a brief note describing the handling of each security concern. In the second step, the first author reviewed the notes and categorized them into distinct groups based on thematic similarities. In the last step, the second author reviewed the groups before discussing the disagreed groups with the first author. Then, the first author refined the groups according to the mutual agreement. Following refinement, the first author revisited the notes and ensured that they fit with the refined groups. We repeated the second and last steps in multiple iterations until no further changes were needed (i.e., no new groups emerged, and all notes remained in the original groups). Finally, the second author manually validated 10% of the security concerns to confirm the correctness of the annotated scenarios.


This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.