Copilot vs. Humans: Comparing Vulnerability Rates in Code Generation


Too Long; Didn't Read

Studies delve into the security challenges of Copilot-generated code, reveal weaknesses catalogued as CWEs, explore adversarial testing for LLMs, compare vulnerability rates between AI and human developers, and assess Copilot's productivity and correctness, shedding light on the complexities of AI code assistants.

Authors:

(1) Vahid Majdinasab, Department of Computer and Software Engineering, Polytechnique Montreal, Canada;

(2) Michael Joshua Bishop, School of Mathematical and Computational Sciences, Massey University, New Zealand;

(3) Shawn Rasheed, Information & Communication Technology Group, UCOL - Te Pukenga, New Zealand;

(4) Arghavan Moradi Dakhel, Department of Computer and Software Engineering, Polytechnique Montreal, Canada;

(5) Amjed Tahir, School of Mathematical and Computational Sciences, Massey University, New Zealand;

(6) Foutse Khomh, Department of Computer and Software Engineering, Polytechnique Montreal, Canada.

Abstract and Introduction

Original Study

Replication Scope and Methodology

Results

Discussion

Related Work

Conclusion, Acknowledgments, and References

There have been several studies evaluating the security of code generated by LLMs, and by Copilot in particular.


A recent study on the security of Copilot-generated code in GitHub projects by Fu et al. [15] reported that around 36% of the Copilot-generated code contains CWEs. The studied snippets revealed weaknesses related to 42 different CWEs, including eleven that appear in MITRE's 2022 Top-25 list. The most common weaknesses relate to, among others, OS Command Injection (CWE-78), Use of Insufficiently Random Values (CWE-330), and Improper Check or Handling of Exceptional Conditions (CWE-703).
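To make two of these categories concrete, the minimal Python sketches below contrast a weakness-prone pattern with a safer alternative. These are illustrative examples only, not snippets taken from the study or from Copilot's output.

```python
import random
import secrets
import subprocess

# CWE-78 (OS Command Injection): untrusted input is interpolated into a
# shell command string, so a value such as "logs.txt; rm -rf ~" would be
# executed by the shell.
def archive_file_unsafe(filename: str) -> None:
    subprocess.run(f"tar -czf backup.tar.gz {filename}", shell=True)

# Safer variant: pass the arguments as a list and avoid invoking a shell.
def archive_file_safe(filename: str) -> None:
    subprocess.run(["tar", "-czf", "backup.tar.gz", filename], check=True)

# CWE-330 (Use of Insufficiently Random Values): the `random` module is a
# predictable PRNG, so tokens built from it can be guessed by an attacker.
def reset_token_unsafe() -> str:
    return str(random.randint(0, 10**8))

# Safer variant: use a cryptographically secure source of randomness.
def reset_token_safe() -> str:
    return secrets.token_hex(16)
```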


Previous studies, including the work of Khoury et al. [20], inspected code generated by ChatGPT for common vulnerabilities as well as its response to prompts asking it to improve vulnerable code. The study found that while the model is conceptually "aware" of the vulnerabilities present in the code, it nevertheless continues to produce code containing them. Hajipour et al. [21] investigated vulnerabilities introduced by specially engineered prompts: by inverting the target models, the study extracted prompts that induce vulnerabilities in the generated code.


A study by He et al. [8] used adversarial testing to implement security hardening on pre-trained LLMs. This process showed a significant improvement in the security of the output code without having to retrain the models; the study used CodeQL to validate the generated code samples. Asare et al. [22] compared the rate at which humans and Copilot introduce vulnerabilities. Of the code samples tested, 33% were found to reintroduce the same vulnerabilities as the original code, while 25% matched the fixed code. The remaining 42% was code dissimilar to both the vulnerable and the fixed code.


Several studies have also examined the security of code generated by ChatGPT models. Shi et al. [23] proposed a backdoor attack that may be used to exploit the security vulnerabilities of ChatGPT; initial experiments show that attackers may manipulate generated text with this approach. Derner et al. [24] explored various attack vectors for ChatGPT and performed a qualitative analysis of their security implications. Given the large attack surface, the study concluded that more research into each of the vectors is required to inform professionals and policymakers going forward.


Go et al. [25] demonstrated the use of GitHub's code search to find "Simple Stupid Insecure Practices" (SSIPs) in open-source software projects across the site. The study shows that SSIPs are common, exploitable vulnerabilities that can easily be found using GitHub. Perry et al. [26] compared how users complete programming tasks with and without AI code assistants. The study found that users who had access to one of OpenAI's code generation models wrote significantly less secure code than those without access. Huang et al. [27] surveyed the safety and trustworthiness of LLMs and the viability of various verification and validation techniques. The paper is intended as an organized collection of literature to facilitate a quick understanding of LLM safety and trustworthiness from the perspective of verification and validation.
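As a rough illustration of the kind of search Go et al. describe, GitHub's code-search API can be queried for a known insecure pattern. The sketch below is an assumption-laden example (the query term, the `GITHUB_TOKEN` environment variable, and the target pattern are illustrative choices, not the study's actual queries or tooling); it searches public Python code for subprocess calls that enable the shell, a pattern commonly associated with CWE-78.

```python
import os
import requests

# Hypothetical query: find Python files mentioning subprocess with shell=True.
# GitHub's code-search endpoint requires an authenticated request.
token = os.environ["GITHUB_TOKEN"]
response = requests.get(
    "https://api.github.com/search/code",
    params={"q": 'subprocess "shell=True" language:python', "per_page": 5},
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    },
    timeout=30,
)
response.raise_for_status()
results = response.json()

print("total matches:", results["total_count"])
for item in results["items"]:
    print(item["repository"]["full_name"], item["path"])
```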


A growing number of studies have investigated different aspects of GitHub Copilot's code quality, with several focusing on productivity. Dakhel et al. [5] analyzed the viability of Copilot as a pair-programming tool by comparing the correctness of its solutions with those written by programmers. They reported that the programmers' solutions have a higher correctness ratio than Copilot's; however, Copilot's buggy solutions were found to require less effort to repair than the programmers' ones.


Evaluating the practical quality of Copilot suggestions, Nguyen et al. [28] used LeetCode questions to create queries for Copilot in four programming languages. Java was found to have the highest rate of correct suggestions at 57%, while JavaScript had the lowest at 27%. Shortcomings of Copilot include generating incomplete code that relies on undefined helper functions, as well as overcomplicated and circuitous code.
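A simplified sketch of the kind of correctness check such an evaluation implies is shown below. The task, test cases, and candidate function are hypothetical and stand in for a Copilot suggestion; this is not Nguyen et al.'s actual harness, which judged suggestions against LeetCode's own tests.

```python
from typing import Callable, List, Tuple

# Hypothetical generated solution for a "two sum" style task.
def candidate_two_sum(nums: List[int], target: int) -> List[int]:
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []

# Example test cases: (arguments, expected output).
TESTS: List[Tuple[tuple, List[int]]] = [
    (([2, 7, 11, 15], 9), [0, 1]),
    (([3, 2, 4], 6), [1, 2]),
    (([3, 3], 6), [0, 1]),
]

def correctness_ratio(solution: Callable[..., List[int]]) -> float:
    """Fraction of test cases the candidate solution answers correctly."""
    passed = sum(1 for args, expected in TESTS if solution(*args) == expected)
    return passed / len(TESTS)

print(f"correct suggestions: {correctness_ratio(candidate_two_sum):.0%}")
```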


This paper is available on arXiv under a CC 4.0 license.