Table of Links
IV. Systematic Security Vulnerability Discovery of Code Generation Models
VII. Conclusion, Acknowledgments, and References
Appendix
A. Details of Code Language Models
B. Finding Security Vulnerabilities in GitHub Copilot
C. Other Baselines Using ChatGPT
D. Effect of Different Number of Few-shot Examples
E. Effectiveness in Generating Specific Vulnerabilities for C Codes
F. Security Vulnerability Results after Fuzzy Code Deduplication
G. Detailed Results of Transferability of the Generated Nonsecure Prompts
H. Details of Generating the Non-secure Prompts Dataset
I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset
J. Effect of Sampling Temperature
K. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable Codes
L. Qualitative Examples Generated by CodeGen and ChatGPT
M. Qualitative Examples Generated by GitHub Copilot
B. Finding Security Vulnerabilities in GitHub Copilot
Here, we evaluate the capability of our FS-Code approach in finding security vulnerabilities in the black-box commercial model GitHub Copilot. GitHub Copilot employs models of the Codex family [15] via OpenAI APIs. This AI programming assistant uses a particular prompt structure to complete the given code, which includes the suffix and prefix of the user's code together with information about other written functions [63]. The exact structure of this prompt is not publicly documented. We evaluate our FS-Code approach by providing five few-shot prompts for different CWEs (following the settings of our previous experiments). As we have access to neither the GitHub Copilot model nor its API, we manually query GitHub Copilot via the available Visual Studio Code extension [9] to generate non-secure prompts and codes. Since this manual querying is labor-intensive, we provide results for the first four of the thirteen representative CWEs: CWE-020, CWE-022, CWE-078, and CWE-079 (see Table I for a description of these CWEs).
In the process of generating non-secure prompts and codes, we query GitHub Copilot to complete the given code sequence. In each query, GitHub Copilot returns up to 10 outputs; since it does not return duplicate outputs, fewer than 10 outputs may be returned in some cases. To generate non-secure prompts, we use the same few-shot prompts constructed for our FS-Code approach. After generating a set of non-secure prompts for each CWE, we query GitHub Copilot to complete the provided non-secure prompts and then use CodeQL to analyze the generated codes.
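To make the kind of vulnerability CodeQL flags here concrete, the sketch below illustrates CWE-022 (path traversal), one of the four CWEs evaluated. The function name, directory, and code are hypothetical, not taken from the paper's prompt dataset: the comment shows a vulnerable completion a model might produce, followed by a safer variant.

```python
import os

BASE_DIR = "/var/data/uploads"  # hypothetical directory of served files

def read_user_file(filename: str) -> str:
    """Return the contents of a file stored under BASE_DIR."""
    # A vulnerable completion (CWE-022) would join the raw name directly,
    # letting an input like "../../../etc/passwd" escape BASE_DIR:
    #     return open(os.path.join(BASE_DIR, filename)).read()
    # Safer variant: resolve the path and verify it stays inside BASE_DIR.
    path = os.path.realpath(os.path.join(BASE_DIR, filename))
    if not path.startswith(os.path.realpath(BASE_DIR) + os.sep):
        raise ValueError("path traversal attempt")
    with open(path) as f:
        return f.read()
```

CodeQL's `py/path-injection` query family detects patterns like the commented-out completion, where untrusted input reaches a file-system call without containment checks.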
Table VII provides the results for vulnerable codes generated by GitHub Copilot using our FS-Code approach. The results are the number of codes with at least one vulnerability. In total, we generate 783 codes using 109 prompts across the four CWEs. In Table VII, columns 2 to 5 provide the results for the individual CWEs, column 6 provides the number of codes with other CWEs that CodeQL detects, and the last column provides the number of codes with at least one security vulnerability. From Table VII, we observe that our approach is also capable of testing a black-box commercial model's potential to generate vulnerable codes. We provide examples of vulnerable codes generated by GitHub Copilot in Appendix B.
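The counting scheme described above (each code counted once per CWE column and once in the overall total, regardless of how many alerts it triggers) can be sketched as the following aggregation. This is a hypothetical helper for illustration, not the paper's analysis code:

```python
from collections import defaultdict

def summarize_findings(findings, target_cwes):
    """Aggregate (sample_id, cwe) findings into Table VII-style counts.

    A code sample contributes at most once to each CWE column and at
    most once to the overall total, no matter how many alerts CodeQL
    raises for it; CWEs outside target_cwes fall into an "other" column.
    """
    per_cwe = defaultdict(set)  # CWE (or "other") -> vulnerable sample ids
    vulnerable = set()          # samples with at least one finding
    for sample_id, cwe in findings:
        vulnerable.add(sample_id)
        per_cwe[cwe if cwe in target_cwes else "other"].add(sample_id)
    counts = {cwe: len(ids) for cwe, ids in per_cwe.items()}
    counts["total"] = len(vulnerable)
    return counts
```

Deduplicating by sample id is what makes the last column the number of codes with at least one vulnerability rather than the raw alert count.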
Authors:
(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security ([email protected]);
(2) Keno Hassler, CISPA Helmholtz Center for Information Security ([email protected]);
(3) Thorsten Holz, CISPA Helmholtz Center for Information Security ([email protected]);
(4) Lea Schonherr, CISPA Helmholtz Center for Information Security ([email protected]);
(5) Mario Fritz, CISPA Helmholtz Center for Information Security ([email protected]).