Table of Links
IV. Systematic Security Vulnerability Discovery of Code Generation Models
VII. Conclusion, Acknowledgments, and References
Appendix
A. Details of Code Language Models
B. Finding Security Vulnerabilities in GitHub Copilot
C. Other Baselines Using ChatGPT
D. Effect of Different Number of Few-shot Examples
E. Effectiveness in Generating Specific Vulnerabilities for C Codes
F. Security Vulnerability Results after Fuzzy Code Deduplication
G. Detailed Results of Transferability of the Generated Nonsecure Prompts
H. Details of Generating Non-secure Prompts Dataset
I. Detailed Results of Evaluating CodeLMs Using Non-secure Dataset
J. Effect of Sampling Temperature
K. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable Codes
L. Qualitative Examples Generated by CodeGen and ChatGPT
M. Qualitative Examples Generated by GitHub Copilot
I. Detailed Results of Evaluating CodeLMs Using Non-secure Dataset
In Table X, we provide the detailed results of evaluating various code language models on our proposed dataset of non-secure prompts. The table reports the number of vulnerable Python and C code samples generated by the CodeGen-6B [6], StarCoder-7B [24], Code Llama-13B [12], WizardCoder-15B [56], and ChatGPT [4] models. These per-CWE results can offer valuable insights for specific use cases. For instance, as shown in Table X, Code Llama-13B generates fewer Python code samples containing the CWE-089 (SQL injection) vulnerability than the other evaluated models; consequently, it stands out as a strong choice for generating SQL-related Python code.
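To make the CWE-089 category concrete, the sketch below contrasts a vulnerable pattern with a parameterized alternative. This is a minimal, hypothetical illustration of what a CWE-089 finding in generated Python code typically looks like (the `users` table and function names are our own), not output from any of the evaluated models.

```python
import sqlite3

def find_user_vulnerable(db: sqlite3.Connection, username: str):
    # CWE-089: attacker-controlled input is interpolated directly into
    # the SQL string, e.g. username = "' OR '1'='1" returns every row.
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return db.execute(query).fetchall()

def find_user_safe(db: sqlite3.Connection, username: str):
    # Parameterized query: the driver binds the value separately,
    # so the input cannot alter the structure of the SQL statement.
    return db.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()
```

Static analyzers such as the CodeQL checks used in this kind of evaluation flag the first pattern, while the parameterized form passes.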
Authors:
(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security ([email protected]);
(2) Keno Hassler, CISPA Helmholtz Center for Information Security ([email protected]);
(3) Thorsten Holz, CISPA Helmholtz Center for Information Security ([email protected]);
(4) Lea Schönherr, CISPA Helmholtz Center for Information Security ([email protected]);
(5) Mario Fritz, CISPA Helmholtz Center for Information Security ([email protected]).
This paper is