Authors:
(1) Vahid Majdinasab, Department of Computer and Software Engineering Polytechnique Montreal, Canada;
(2) Michael Joshua Bishop, School of Mathematical and Computational Sciences Massey University, New Zealand;
(3) Shawn Rasheed, Information & Communication Technology Group UCOL - Te Pukenga, New Zealand;
(4) Arghavan Moradidakhel, Department of Computer and Software Engineering Polytechnique Montreal, Canada;
(5) Amjed Tahir, School of Mathematical and Computational Sciences Massey University, New Zealand;
(6) Foutse Khomh, Department of Computer and Software Engineering Polytechnique Montreal, Canada.
AI-powered code generation models have been developing rapidly, allowing developers to expedite code generation and thus improve their productivity. These models are trained on large corpora of code (primarily sourced from public repositories), which may contain bugs and vulnerabilities. Several concerns have been raised about the security of the code generated by these models. Recent studies have investigated security issues in AI-powered code generation tools such as GitHub Copilot and Amazon CodeWhisperer, revealing several security weaknesses in the code generated by these tools. As these tools evolve, it is expected that they will improve their security protocols to prevent the suggestion of insecure code to developers. This paper replicates the study of Pearce et al., which investigated security weaknesses in Copilot and uncovered several weaknesses in the code suggested by Copilot across diverse scenarios and languages (Python, C and Verilog). Our replication examines Copilot’s security weaknesses using newer versions of Copilot and CodeQL (the security analysis framework). The replication focuses on the presence of security vulnerabilities in Python code. Our results indicate that, with the improvements in newer versions of Copilot, the percentage of vulnerable code suggestions has dropped from 36.54% to 27.25%. Nonetheless, the model still suggests insecure code.
Index Terms—Security weaknesses, code generation, security analysis, Copilot
Code generation tools aim to increase productivity by generating code segments for developers, either by autocompleting existing code segments or by converting prompts (written in natural language) into code. Code generation tools have been around for some time. New code generation tools based on AI models, in particular, have gained popularity over the past few years with the availability of commercial products such as GitHub Copilot[1] and Amazon CodeWhisperer[2]. These include tools built on large language models (LLMs) that translate natural language into code. Such tools are touted as AI pair programmers, trained on billions of lines of existing code, that help developers write code (in multiple languages) faster and with less work by turning natural language prompts into coding suggestions.
Copilot is based on models built using OpenAI’s Codex [1], which interprets comments in natural language and executes them on the user’s behalf. The Copilot model is trained on publicly available code from projects hosted on GitHub. As of June 2022, Copilot has been used by more than a million developers and generated over three billion accepted lines of code [2].
Previous research has studied code generation tools, with more focus on the correctness of the results [3], [4], [5], [6]. There is also increased attention given to the security of the generated code [7], [8], including studies on new tools such as Copilot, CodeWhisperer, and ChatGPT [9], [10].
Code generated by a model trained on publicly available code may inherit not just the intended functionality or behavior but also bugs and security issues. Previous studies have shown that publicly available code, such as code snippets hosted on Stack Overflow, can be vulnerable [11]. A model trained on such code may therefore reproduce not only ‘functional’ code but also security bugs and vulnerabilities. In the context of Copilot, the tool may produce insecure code because Codex, its underlying model, is trained on code hosted on GitHub [12], which is known to contain buggy programs and untrusted data [13].
In 2022, Pearce et al. [14] studied security weaknesses of Copilot-generated code for several programming languages (i.e., Python, C and Verilog) and reported that 40% of the generated programs contained vulnerabilities. The scenarios were built around MITRE’s Common Weakness Enumerations (CWEs), taken from the “2021 CWE Top 25 Most Dangerous Software Weaknesses”. A recent study on the security weaknesses in Copilot-generated code found in publicly available GitHub projects (using multiple languages) shows that over 35% of Copilot-generated code snippets contain CWEs. It also reported that the security weaknesses are diverse in nature and related to 42 different CWEs from MITRE’s list [15] (including CWEs that appear in MITRE’s 2022 list). These findings confirm that such weaknesses can also make their way into real-world projects if generated code is not appropriately checked. Copilot security concerns go beyond vulnerable code suggestions: Copilot was found to reveal hard-coded secrets that were part of its GitHub training data. A recent study [16] found over 2,000 hard-coded credentials in Copilot-generated code, raising major privacy concerns about the potential leakage of hard-coded credentials. This is mainly because the GitHub training data also contains millions of hard-coded secrets [17].
With the continued improvement in Copilot, it is expected that security measures will be put into place to filter out vulnerable code (that may introduce CWEs). We aim to test this by replicating the study of Pearce et al. using a variety of CWEs and prompts (as in the original study). The goal of this study is to conduct a targeted replication study of Copilot Python code using MITRE’s top 25 CWEs.
This replication study addresses two main questions: does Copilot provide insecure code suggestions, and what is the prevalence of insecure generated code? We used Copilot to generate code suggestions using prompts based on 12 CWEs from MITRE’s CWE Top 25 Most Dangerous Software Weaknesses.
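To make the notion of an insecure suggestion concrete, the following minimal sketch contrasts a vulnerable and a safer Python completion for the same task. It uses CWE-89 (SQL injection), which appears in MITRE's 2021 Top 25 list; this section does not state which 12 CWEs were examined, so the choice is an assumption made for illustration. The function names, table, and data are hypothetical and are not taken from the study's prompts or from actual Copilot output.

```python
import sqlite3


def make_db():
    """Create a small in-memory database (hypothetical example data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "alice@example.com"), ("bob", "bob@example.com")],
    )
    return conn


def find_user_insecure(conn, name):
    # Vulnerable pattern (CWE-89): user input is spliced into the SQL text,
    # so a crafted value can change the structure of the query.
    query = f"SELECT email FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer pattern: a parameterized query keeps the input as data only.
    return conn.execute(
        "SELECT email FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_db()
payload = "x' OR '1'='1"
leaked = find_user_insecure(conn, payload)  # matches every row
blocked = find_user_safe(conn, payload)     # matches no row
```

A static analyzer flags the first function because untrusted input flows into the query string; the second function differs only in how the value is passed to the database driver.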
Our results show that, despite improvements in Copilot’s newer versions in filtering out vulnerable suggestions, many Copilot suggestions still contain CWEs. This is the case across a diversity of weaknesses and scenarios. For the investigated Python suggestions, we found evidence of improvement: the share of vulnerable Copilot suggestions dropped from 36.54% to 27.25%. For some scenarios and CWEs, the weaknesses disappeared entirely (that is, the replication shows no weaknesses in the code suggestions). Interestingly, we also note some cases where suggestions from the new version of Copilot contain more weaknesses than were reported in the original study. This shows that while Copilot has improved in filtering out vulnerable code suggestions, it does not completely eliminate them. Developers should therefore be cautious with the code suggestions generated by Copilot and should apply automatic and manual security analysis before integrating them into their code.
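As a small worked example, the drop reported in the abstract (36.54% to 27.25%) can be read either as a difference in percentage points or as a relative reduction. The sketch below computes both; the helper name `reduction` is hypothetical and only illustrates the arithmetic.

```python
def reduction(before_pct, after_pct):
    """Return (percentage-point drop, relative reduction in %)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return round(absolute, 2), round(relative, 2)


# Figures taken from the abstract of this paper.
abs_points, rel_percent = reduction(36.54, 27.25)
```

Under these figures the drop is about 9.3 percentage points, which corresponds to roughly a quarter fewer vulnerable suggestions relative to the original study.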
The paper is organized as follows: we explain the replication scope and present our methodology in Section III. We present our results in Section IV, followed by a discussion of these results in Section V. We present related work in Section VI. Finally, Section VII presents the study's conclusion and future research directions.
This paper is available on arXiv (arXiv:2311.11177v1 [cs.SE], 18 Nov 2023) under a CC 4.0 license.
[1] https://github.com/features/copilot
[2] https://aws.amazon.com/codewhisperer