Table of Links
IV. Systematic Security Vulnerability Discovery of Code Generation Models
VII. Conclusion, Acknowledgments, and References
Appendix
A. Details of Code Language Models
B. Finding Security Vulnerabilities in GitHub Copilot
C. Other Baselines Using ChatGPT
D. Effect of Different Number of Few-shot Examples
E. Effectiveness in Generating Specific Vulnerabilities for C Code
F. Security Vulnerability Results after Fuzzy Code Deduplication
G. Detailed Results of Transferability of the Generated Nonsecure Prompts
H. Details of Generating the Non-secure Prompts Dataset
I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset
J. Effect of Sampling Temperature
K. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable Codes
L. Qualitative Examples Generated by CodeGen and ChatGPT
M. Qualitative Examples Generated by GitHub Copilot
III. TECHNICAL BACKGROUND
Detecting software bugs before deployment can prevent potential harm and unforeseeable costs. However, automatically finding security-critical bugs in code is a challenging task in practice. This also applies to model-generated code, especially given the black-box nature and complexity of such models. In the following, we elaborate on recent analysis methods and classification schemes for code vulnerabilities.
A. Evaluating Security Issues
Various security testing methods can be used to find software vulnerabilities and thus avoid bugs during the run time of a deployed system [38], [39], [40]. To achieve this goal, these methods attempt to detect different kinds of programming errors, poor coding style, deprecated functionality, or potential memory safety violations (e.g., unauthorized access to unsafe memory that can be exploited after deployment, or obsolete cryptographic schemes that are insecure [41], [42], [26]). Broadly speaking, current methods for the security evaluation of software fall into two categories: static analysis [38], [43] and dynamic analysis [44], [45]. While static analysis inspects a program's code to find potential vulnerabilities without running it, dynamic analysis executes the code. For example, fuzz testing (fuzzing) runs the program on randomly generated inputs to trigger bugs.
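To make the dynamic side of this distinction concrete, the following minimal Python sketch illustrates the fuzzing idea: generate random inputs, run the target, and collect crashing inputs for triage. The target function `parse_record` and its planted bug are hypothetical stand-ins, not part of any real test suite.

```python
# Minimal fuzzing sketch: feed a target with random inputs and record
# any input that triggers a crash. `parse_record` is a hypothetical
# stand-in for the code under test; real fuzzers run whole programs.
import os

def parse_record(data: bytes) -> int:
    # Hypothetical parser with a planted bug: it assumes a 4-byte
    # header and crashes on shorter inputs (a missing bounds check).
    return data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24)

def fuzz(iterations: int = 10_000) -> list[bytes]:
    crashes = []
    for _ in range(iterations):
        # Random length (1-256 bytes) and random content.
        data = os.urandom(1 + os.urandom(1)[0])
        try:
            parse_record(data)
        except Exception:
            crashes.append(data)  # keep the crashing input for triage
    return crashes

if __name__ == "__main__":
    print(f"{len(fuzz())} crashing inputs found")
```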
For the purpose of our work, we choose static analysis to evaluate the generated code, as it enables us to classify the type of each detected vulnerability. Specifically, we use CodeQL, one of the best-performing free static analysis engines, released by GitHub [46]. To analyze the model-generated code, we query it via CodeQL to find security vulnerabilities. We use CodeQL's CWE classification output to categorize each vulnerability found during our evaluation and to define the set of vulnerabilities that we investigate further throughout this work.
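As a rough illustration of this step, the sketch below drives the CodeQL CLI from Python: it builds a database from a directory of generated code and runs the bundled security queries, emitting SARIF. The directory paths and the query-suite name are assumptions that may differ across CodeQL versions; this is a sketch, not our exact pipeline.

```python
# Sketch of the static-analysis step: build a CodeQL database from the
# generated code and run security queries over it. Paths and the query
# suite name are illustrative and may differ across CodeQL versions.
import subprocess

def analyze_with_codeql(source_dir: str, db_dir: str, sarif_out: str) -> None:
    # Create a CodeQL database for the Python code under `source_dir`.
    subprocess.run(
        ["codeql", "database", "create", db_dir,
         "--language=python", f"--source-root={source_dir}"],
        check=True,
    )
    # Analyze the database with a security query suite and write SARIF
    # output, which carries CWE tags for each finding.
    subprocess.run(
        ["codeql", "database", "analyze", db_dir,
         "codeql/python-queries:codeql-suites/python-security-and-quality.qls",
         "--format=sarif-latest", f"--output={sarif_out}"],
        check=True,
    )

analyze_with_codeql("generated_code", "codeql-db", "results.sarif")
```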
B. Classification of Security Weaknesses
The Common Weakness Enumeration (CWE) is a list of typical flaws in software and hardware maintained by MITRE [29], often accompanied by specific vulnerability examples. In total, more than 400 different CWE types are defined and categorized into classes and variants, e.g., memory corruption errors. Listing 1 shows an example of CWE-502 (Deserialization of Untrusted Data) in Python. In this example from [29], the pickle library is used to deserialize data: the code parses data and tries to authenticate a user by validating a token, but without verifying the incoming data first. A potential attacker can construct a malicious pickle that spawns new processes; since pickle allows objects to define how they should be unpickled, the attacker can direct the unpickling process to call the subprocess module and execute /bin/sh.
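Since Listing 1 itself is not reproduced in this excerpt, the following minimal sketch shows the same pattern; the function and class names are illustrative, not the code from [29].

```python
# Illustrative sketch of the CWE-502 pattern described above (not the
# paper's Listing 1 verbatim): bytes from an untrusted source are passed
# straight to pickle, which lets the sender execute arbitrary code.
import pickle
import subprocess

def authenticate(serialized_token: bytes) -> bool:
    # VULNERABLE: unpickling attacker-controlled data runs whatever the
    # embedded object's __reduce__ specifies, before any validation.
    token = pickle.loads(serialized_token)
    return token.get("user") == "admin"

class Exploit:
    # Pickle asks __reduce__ how to reconstruct the object; returning
    # (callable, args) makes unpickling invoke that callable.
    def __reduce__(self):
        return (subprocess.call, (["/bin/sh"],))

# An attacker sends pickle.dumps(Exploit()); unpickling it on the
# server spawns a shell instead of yielding a token dictionary.
```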
For our work, we focus on thirteen representative CWEs that can be detected via static analysis tools, to show that we can systematically generate vulnerable code and the corresponding input prompts. We decided against fuzzing for vulnerability detection due to its potentially high computational cost and the manual effort imposed by root cause analysis. Some CWEs represent mere code smells or require considering the development and deployment process, and are hence out of scope for this work. The thirteen analyzed CWEs are listed in Table I, each with a brief description as defined by MITRE [29]. Of these thirteen, eleven appear on the CWE Top 25 list of the most dangerous software weaknesses.
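As one possible way to implement this categorization, the sketch below tallies per-CWE counts from CodeQL's SARIF output. It relies on CodeQL's convention of tagging rules with markers such as `external/cwe/cwe-502` in their metadata; the exact SARIF layout can vary across versions, and the file name is illustrative.

```python
# Sketch: tally per-CWE finding counts from a CodeQL SARIF report.
# Assumes rules are tagged with "external/cwe/cwe-NNN" markers, as is
# CodeQL's convention; the exact layout may vary across versions.
import json
from collections import Counter

def count_cwes(sarif_path: str) -> Counter:
    with open(sarif_path) as f:
        sarif = json.load(f)
    counts: Counter = Counter()
    for run in sarif.get("runs", []):
        # Map each rule id to the CWE tags in its metadata.
        driver = run.get("tool", {}).get("driver", {})
        tags_by_rule = {
            rule["id"]: rule.get("properties", {}).get("tags", [])
            for rule in driver.get("rules", [])
        }
        for result in run.get("results", []):
            for tag in tags_by_rule.get(result.get("ruleId"), []):
                if tag.startswith("external/cwe/"):
                    counts[tag.removeprefix("external/cwe/").upper()] += 1
    return counts

print(count_cwes("results.sarif"))  # e.g. Counter({'CWE-502': 3, ...})
```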
Authors:
(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security ([email protected]);
(2) Keno Hassler, CISPA Helmholtz Center for Information Security ([email protected]);
(3) Thorsten Holz, CISPA Helmholtz Center for Information Security ([email protected]);
(4) Lea Schönherr, CISPA Helmholtz Center for Information Security ([email protected]);
(5) Mario Fritz, CISPA Helmholtz Center for Information Security ([email protected]).
This paper is available on arxiv under CC 4.0 license.