
Detailed Results of Evaluating CodeLMs using Non-secure Dataset


Table of Links

Abstract and I. Introduction

II. Related Work

III. Technical Background

IV. Systematic Security Vulnerability Discovery of Code Generation Models

V. Experiments

VI. Discussion

VII. Conclusion, Acknowledgments, and References


Appendix

A. Details of Code Language Models

B. Finding Security Vulnerabilities in GitHub Copilot

C. Other Baselines Using ChatGPT

D. Effect of Different Number of Few-shot Examples

E. Effectiveness in Generating Specific Vulnerabilities for C Codes

F. Security Vulnerability Results after Fuzzy Code Deduplication

G. Detailed Results of Transferability of the Generated Nonsecure Prompts

H. Details of Generating non-secure prompts Dataset

I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset

J. Effect of Sampling Temperature

K. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable Codes

L. Qualitative Examples Generated by CodeGen and ChatGPT

M. Qualitative Examples Generated by GitHub Copilot

I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset

In Table X, we provide the detailed results of evaluating various code language models using our proposed non-secure prompts dataset. Table X reports the number of vulnerable Python and C codes generated by the CodeGen-6B [6], StarCoder-7B [24], Code Llama-13B [12], WizardCoder-15B [56], and ChatGPT [4] models. Detailed results for each CWE can offer valuable insights for specific use cases. For instance, as shown in Table X, Code Llama-13B generates fewer Python codes with the CWE-089 (SQL injection) vulnerability than the other models. Consequently, it stands out as a strong choice among the evaluated models for generating SQL-related Python code.

Fig. 7: The number of discovered vulnerable codes versus the number of sampled codes generated by (a), (c) CodeGen, and (b), (d) ChatGPT. The non-secure prompts and codes are generated using our FS-Code method. While Figure 4 already removes exact matches, here we use fuzzy matching for further code deduplication.

TABLE VIII: The number of discovered vulnerable codes generated by the CodeGen and ChatGPT models using the promising non-secure prompts generated by CodeGen. We employ our FS-Code method to generate non-secure prompts and codes. Columns two to thirteen provide results for Python codes. Columns fourteen to nineteen give the results for C codes. Columns fourteen and nineteen provide the number of vulnerable codes found with the other CWEs that CodeQL queries. For each programming language, the last column provides the sum of all codes with at least one security vulnerability.
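The per-CWE comparison discussed in this appendix can be sketched in a few lines of Python. This is only an illustrative sketch: the model names (`model_a`, `model_b`, `model_c`), the counts, and the helper functions `rank_models_for_cwe` and `total_per_model` are hypothetical placeholders, not the values or tooling behind Table X.

```python
# Hypothetical placeholder counts, NOT the values reported in Table X.
# Layout: model name -> {CWE identifier -> number of vulnerable generated codes}.
vulnerable_counts = {
    "model_a": {"CWE-089": 12, "CWE-079": 30},
    "model_b": {"CWE-089": 3, "CWE-079": 41},
    "model_c": {"CWE-089": 9, "CWE-079": 25},
}


def rank_models_for_cwe(counts, cwe):
    """Return (model, count) pairs sorted from fewest to most vulnerable codes.

    Models with no entry for the given CWE are skipped rather than
    treated as having zero vulnerable codes.
    """
    per_model = [
        (model, by_cwe[cwe]) for model, by_cwe in counts.items() if cwe in by_cwe
    ]
    # Sort by count first, then by name so that ties are ordered deterministically.
    return sorted(per_model, key=lambda pair: (pair[1], pair[0]))


def total_per_model(counts):
    """Sum the per-CWE counts of each model.

    This adds up findings per CWE; a code sample that triggers several
    CWEs would be counted once per CWE in this simplified sketch.
    """
    return {model: sum(by_cwe.values()) for model, by_cwe in counts.items()}


ranking = rank_models_for_cwe(vulnerable_counts, "CWE-089")
print("Ranking for CWE-089:", ranking)
print("Totals per model:", total_per_model(vulnerable_counts))
```

Ranking by a single CWE rather than by the overall total reflects the point made above: a model with a higher total may still be the better choice for one specific use case, such as generating SQL-related code.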


Authors:

(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security;

(2) Keno Hassler, CISPA Helmholtz Center for Information Security;

(3) Thorsten Holz, CISPA Helmholtz Center for Information Security;

(4) Lea Schönherr, CISPA Helmholtz Center for Information Security;

(5) Mario Fritz, CISPA Helmholtz Center for Information Security.


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

