In Code Reviews, Security Risks Hide Behind Technical Language

Written by codereview | Published 2026/01/19
Tech Story Tags: code-reviews | code-review-security | software-security | cwe-699 | openssl-security | shift-left-security | vulnerability-management | php-security

TL;DR: Developers seldom label vulnerabilities outright in code reviews, but they frequently highlight underlying coding weaknesses. By using semantic similarity, especially word embeddings, security-related review comments can be identified more effectively than with keyword searches, enabling scalable and more accurate security analysis.

Abstract

1 Introduction

2 Background and Related Work

  • Software Security
  • Coding Weaknesses
  • Security Shift-Left
  • Modern Code Review
  • Code Review for Software Security
  • Security Concern Handling Process in Code Review

3 Motivating Examples

4 Case Study Design

  • Research Questions
  • Studied Projects
  • Data Collection
  • Coding Weakness Taxonomy
  • Study Overview
  • Security Concern Identification Approach (RQ1)
  • Alignment Analysis of Known Vulnerabilities (RQ2)
  • Handling Process Identification (RQ3)

5 Preliminary Analysis

  • PA1: Prevalence of Coding Weakness Comments
  • PA2: Preliminary Evaluation of our Security Concern Identification Approach

6 Case Study Results

7 Discussion

8 Threats to Validity

  • Internal Validity
  • Construct Validity
  • External Validity

5 Preliminary Analysis

In this section, we present two preliminary analyses that provide the empirical grounding for our main case study. The goal of the first preliminary analysis (PA1) is to examine whether reviewers tend to raise coding weaknesses related to security issues more frequently than they explicitly discuss vulnerabilities. The second analysis (PA2) is a preliminary evaluation of the effectiveness of our semi-automated approach (see Section 4.6.1) to calculate semantic similarity scores for the code review comments that contain coding weaknesses.

Dataset: We conducted the two preliminary analyses on a sample dataset. We randomly sampled 400 code review comments from each of the studied projects (i.e., OpenSSL and PHP). This sample size should allow us to generalize conclusions with a confidence level of 95% and a confidence interval of 5% (Triola, 2009).
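For readers who want to check the arithmetic, the sketch below computes the required sample size using Cochran's formula with a finite-population correction, the standard calculation behind such confidence statements; the population figure used here is illustrative, not taken from Table 2.

```python
import math

def required_sample_size(population: int, z: float = 1.96,
                         margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size formula with finite-population correction.

    z=1.96 corresponds to a 95% confidence level; p=0.5 is the most
    conservative proportion estimate; margin is the confidence interval.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)  # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Illustrative population of ~68,000 review comments (hypothetical figure):
print(required_sample_size(68_000))  # -> 382; sampling 400 comfortably exceeds this
```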

5.1 PA1: Prevalence of Coding Weakness Comments

The motivating examples in Section 3 show that coding weaknesses can lead to security issues. Since code review focuses on identifying and mitigating issues in source code (Mäntylä and Lassenius, 2009; Bacchelli and Bird, 2013), code review may also be able to identify such coding weaknesses. To confirm this, we assess the degree to which coding weaknesses are discussed in code reviews. In particular, we analyze whether reviewers discussed coding weaknesses more frequently than vulnerabilities.

Approach

From the sampled dataset, we manually classified code review comments into three groups: 1) comments that mentioned a coding weakness, 2) comments that explicitly mentioned a vulnerability, and 3) other comments related to neither coding weaknesses nor vulnerabilities. We consider that a code review comment mentions a coding weakness when it is related to one of the coding weaknesses listed in CWE-699. A code review comment is considered to mention a vulnerability when it is related to the types of exploitable vulnerabilities identified in prior studies (Di Biase et al., 2016; Paul et al., 2021b), i.e., Race Condition, Buffer and Integer Overflow, Improper Access, Cross-Site Scripting (XSS) and Cross-Site Request Forgery (CSRF), Denial of Service (DoS) and Crash, Information Leakage, Command and SQL Injection, Format String, Encryption, and common vulnerability keywords such as attack, bypass, back-door, breach, trojan, spyware, virus, ransom, malware, worm, and sniffer. Note that one code review comment can be classified into multiple categories. For example, the comment "[..] we ensure that when the `while` loop ends, there are always at least 2 more slots available in the output buffer without overrunning it [..]" is related to a vulnerability (i.e., buffer overflow) as well as a coding weakness (Incorrect Calculation of Buffer Size, CWE-131). Hence, this comment is classified as mentioning both a vulnerability and a coding weakness.
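To make the vulnerability criterion concrete, a comment can be screened with a simple keyword check like the sketch below. The word-boundary matching and lowercasing are our assumptions for illustration; in the study, the comments were classified manually.

```python
import re

# Vulnerability keywords compiled from the lists cited above
# (Di Biase et al., 2016; Paul et al., 2021b).
VULN_KEYWORDS = [
    "race condition", "buffer overflow", "integer overflow", "improper access",
    "xss", "csrf", "denial of service", "information leakage", "sql injection",
    "command injection", "format string", "attack", "bypass", "back-door",
    "breach", "trojan", "spyware", "virus", "ransom", "malware", "worm", "sniffer",
]

def mentions_vulnerability(comment: str) -> bool:
    """Return True if the review comment contains any vulnerability keyword."""
    text = comment.lower()
    return any(re.search(r"\b" + re.escape(kw) + r"\b", text)
               for kw in VULN_KEYWORDS)

print(mentions_vulnerability("this loop may overrun the buffer"))  # False: no keyword
print(mentions_vulnerability("possible buffer overflow here"))     # True
```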

Results

Our preliminary results show that coding weaknesses were raised more often than vulnerabilities during code review. Table 6 shows the number of code review comments that mentioned a coding weakness, a vulnerability, or neither. From the 400 sampled code review comments for each studied project, we identified 67 comments related to coding weaknesses and 2 comments related to vulnerabilities in PHP, and 84 comments related to coding weaknesses and 4 comments related to vulnerabilities in OpenSSL. The number of code review comments that mentioned vulnerabilities aligns with the findings of Di Biase et al. (2016), who found that 1% of code review comments identified vulnerabilities. Table 6 shows that the number of comments that mentioned a coding weakness is 21 to 33.5 times higher than the number of comments that mentioned a vulnerability. In addition, we observed that reviewers sometimes point out a potential

5.2 PA2: Preliminary Evaluation of our Security Concern Identification Approach

Since we cannot manually inspect the entire code review comment dataset (i.e., 135K comments; see Table 2) for coding weaknesses, we opt for a semi-automated approach to identify comments, as explained in Section 4.6. In particular, we measure the cosine similarity score between each code review comment and the descriptions of the coding weakness categories, and we manually validate the comments with high cosine similarity scores until reaching a saturation point, i.e., until 50 consecutive comments are identified as generic or irrelevant. In this work, we explore two well-known vector representation techniques (i.e., TF-IDF and word embedding) when measuring cosine similarity. We did not use a keyword search as in prior work (Bosu et al., 2014; Paul et al., 2021a,b) because their pre-defined keyword lists are limited and may not cover all coding weaknesses. Hence, we conducted this preliminary analysis to evaluate the effectiveness of our approach against the keyword search and to examine which vector representation produces similarity scores that better distinguish code review comments containing coding weaknesses from irrelevant comments.
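As an illustration of this scoring step, the sketch below computes the cosine similarity between review comments and a combined weakness description under both representations. It assumes scikit-learn for TF-IDF and a pre-trained GloVe model loaded via gensim; the actual pre-processing and embedding model follow Section 4.6.1 and may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import gensim.downloader

# Placeholder inputs: pre-processed review comments and the combined
# CWE-699 category descriptions (both are toy stand-ins here).
comments = ["check the buffer length before the memcpy",
            "nit: please rename this variable"]
cwe_text = ("buffer copy without checking size of input "
            "incorrect calculation of buffer size")

# --- TF-IDF representation ---
matrix = TfidfVectorizer().fit_transform(comments + [cwe_text])
tfidf_scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()

# --- Word-embedding representation: mean of pre-trained word vectors ---
# (GloVe via gensim is an assumption; the paper does not prescribe this model.)
wv = gensim.downloader.load("glove-wiki-gigaword-100")

def embed(text: str) -> np.ndarray:
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

emb_scores = [float(cosine_similarity([embed(c)], [embed(cwe_text)])[0, 0])
              for c in comments]

# Comments are then ranked by score and validated manually from the top down,
# stopping once 50 consecutive comments turn out to be generic or irrelevant.
print(tfidf_scores, emb_scores)
```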

Approach

We conducted our preliminary evaluation based on the sampled dataset and our manual classification in PA1. We considered the comments that mentioned a coding weakness as the coding weakness comments group, and the other comments as the non-coding weakness comments group. We pre-processed the code review comments in the sampled dataset and the combined descriptions of coding weaknesses in all CWE-699 categories with the method described in Section 4.6.1. Then, we generated TF-IDF and word embedding vectors of the code review comments and the combined descriptions. Finally, we calculated the similarity score between the vectors.

To measure the effectiveness of our approach, we adopted the effort-aware evaluation concept (Kamei et al., 2013; Verma et al., 2016). We measured top-k precision, recall, and F1-score, where k is the number of comments with the highest similarity scores. While the value of k approximates the effort required for our manual validation, the top-k precision shows the proportion of coding weakness comments among the top k; the top-k recall shows the percentage of coding weakness comments that can be identified in the top k; and the top-k F1-score is a single score that represents both top-k precision and top-k recall. For the keyword search, we measured the precision, recall, and F1-score of the code review comments that were identified by a set of vulnerability keywords from previous secure code review studies (Bosu et al., 2014; Paul et al., 2021a,b). To evaluate the two vector representation techniques, we examine which technique produces higher similarity scores for coding weakness comments than for non-coding weakness comments. Thus, we used the one-sided Mann-Whitney-Wilcoxon test to examine the statistical difference in the similarity scores between the two groups of code review comments. We also used Cliff's |δ| effect size to estimate the magnitude of the difference in scores between the two groups.
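To make the effort-aware evaluation concrete, here is a minimal sketch of the top-k precision, recall, and F1-score computation; the scores and labels are toy placeholders, not data from the study.

```python
def top_k_metrics(scores, labels, k):
    """Effort-aware top-k precision/recall/F1.

    scores: similarity score per comment (higher = more likely a weakness);
    labels: 1 if the comment was manually classified as a coding weakness
    comment in PA1, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])   # weakness comments in top k
    total = sum(labels)                            # all weakness comments
    precision = hits / k
    recall = hits / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example (placeholders); the study evaluates k in {20, 40, 60, 80, 100}.
scores = [0.91, 0.85, 0.40, 0.33, 0.10]
labels = [1, 0, 1, 0, 0]
for k in (2, 4):
    print(k, top_k_metrics(scores, labels, k))
```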

Results

As shown in Table 7, we found that our approach with word embedding vectors achieved the highest top-k F1-score in OpenSSL and PHP for all k ∈ {20, 40, 60, 80, 100}, with top-k F1-scores of 0.16 - 0.58, while our approach with TF-IDF achieved top-k F1-scores of 0.14 - 0.47. Table 7 also shows that our approach achieves a higher F1-score than the keyword search. The keyword search retrieved 16 and 13 comments that contain one of the vulnerability keywords, achieving an F1-score of 0.28 for OpenSSL and 0.25 for PHP. Moreover, we observe that the keyword search did not identify some types of coding weaknesses that can introduce a vulnerability, such as Pointer Issues (CWE-465). For example, the keyword approach could not identify the comment "The object can't be referenced after free obj, only dtor obj", which is related to the NULL Pointer Dereference weakness (CWE-476).

This result shows that our approach using cosine similarity can identify more coding weakness comments than the keyword search. Regarding the similarity score calculation, Table 8 shows the results of the one-sided Mann-Whitney-Wilcoxon test and Cliff's |δ| effect size between the similarity scores of the coding weakness comments and the non-coding weakness comments. We found that the similarity scores of coding weakness comments are significantly higher than those of non-coding weakness comments (p-value < 0.05) when using either TF-IDF or word embedding vectors. In addition, we found that the difference in the similarity scores from the word embedding vectors has a large effect size (|δ| ≥ 0.474 (Romano et al., 2006)) for both OpenSSL and PHP, while the difference in the similarity scores from the TF-IDF vectors has a large effect size for OpenSSL and a medium effect size for PHP. This suggests that similarity scores based on word embedding vectors can better differentiate coding weakness comments from their counterparts than similarity scores based on TF-IDF vectors. This finding is consistent with the top-k precision, recall, and F1-scores shown in Table 7, i.e., at the same k value, using word embedding vectors achieves a higher score than using TF-IDF vectors.
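For reference, the statistical comparison can be reproduced along these lines: SciPy provides the one-sided Mann-Whitney-Wilcoxon test, and Cliff's |δ| can be computed directly from its definition; the score lists below are placeholders.

```python
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

# Placeholder similarity scores for the two comment groups.
weakness     = [0.91, 0.85, 0.78, 0.70]
non_weakness = [0.40, 0.35, 0.33, 0.22, 0.10]

# One-sided test: are weakness-comment scores stochastically greater?
stat, p = mannwhitneyu(weakness, non_weakness, alternative="greater")
delta = abs(cliffs_delta(weakness, non_weakness))
print(p < 0.05, delta)  # |delta| >= 0.474 is conventionally "large" (Romano et al., 2006)
```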

Our preliminary evaluation shows that our approach with the word embedding technique 1) achieves a higher recall than the TF-IDF technique and the keyword search and 2) can better distinguish the coding weakness comments. Therefore, in this study, we used the word embedding technique to calculate the similarity scores to help us manually identify the coding weakness comments in the remaining dataset.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

