This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Mohammed Latif Siddiq, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame;
(2) Joanna C. S. Santos, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame.
In this section, we discuss works that empirically investigate the capabilities of LLMs and works related to benchmarking LLMs.
Automated code generation techniques initially focused on deducing the user's intent from a high-level specification or from input-output examples [20, 21, 41]. These approaches transform task specifications into constraints, and a program is extracted after demonstrating that it satisfies those constraints [21].
With the rise of the attention-based transformer model [68], the code generation task is treated as a sequence-to-sequence problem in which the user's intent is expressed in natural language. Many LLMs have been produced to generate code, such as CodeBERT [16], Codex [10], and CodeT5 [69]. Code generation models are also heavily used to produce code for competitive programming challenges, for example, AlphaCode [37]. GitHub Copilot [25], a closed-source code generation tool, uses an upgraded version of Codex [10] to provide an improved auto-completion mechanism. Currently, code generation models are often part of multi-task models (i.e., models that perform different tasks). For example, GPT-4 [49] can perform image and text analysis and is also capable of code generation.
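To make the sequence-to-sequence framing concrete, the sketch below shows how a natural-language intent can be turned into code with an off-the-shelf code LLM via the Hugging Face transformers library. The model identifier and generation settings are illustrative assumptions, not the configuration of any system cited above.

```python
# Minimal sketch: natural-language intent -> code completion.
# Assumes the `transformers` library and an illustrative open code model.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Salesforce/codegen-350M-mono"  # assumption: any causal code LLM could be used

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# The user's intent is expressed in natural language inside a code comment.
prompt = "# Write a function that returns the n-th Fibonacci number\ndef fibonacci(n):"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```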
Though the performance of code generation is improving daily and end-user tools like GitHub Copilot are being adopted by users [60], they are not free of security risk. Pearce et al. [51] studied the output of an early release of GitHub Copilot and found that 40% of the generated programs were vulnerable. Siddiq et al. [62] examined code generation models and their datasets with respect to standard coding practices and security issues. Sandoval et al. [58] measured whether AI assistants generate more vulnerable code than users do. Siddiq et al. [61] proposed a static-analyzer-based ranking system to produce more secure code in the output. Hajipour et al. [22] investigated finding vulnerabilities in black-box code generation models.
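The general idea behind analyzer-based ranking can be sketched as follows: generate several candidate completions, scan each with a static analyzer, and prefer the candidates with the fewest reported security issues. The sketch below illustrates this concept with Bandit as the analyzer; it is not the exact procedure of Siddiq et al. [61], and it assumes the `bandit` command-line tool is installed.

```python
# Sketch: rank candidate completions by the number of static-analysis warnings.
import json
import subprocess
import tempfile


def count_bandit_issues(code: str) -> int:
    """Return the number of issues Bandit reports for a code snippet."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # `bandit -f json` emits machine-readable results; `-q` suppresses the banner.
    proc = subprocess.run(
        ["bandit", "-f", "json", "-q", path],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout or "{}")
    return len(report.get("results", []))


def rank_candidates(candidates: list[str]) -> list[str]:
    """Order candidate completions from fewest to most analyzer warnings."""
    return sorted(candidates, key=count_bandit_issues)
```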
While there is a growing body of literature that investigates the capabilities of code generation beyond functional correctness, including security [44, 45, 51, 52, 58, 65], these existing studies only pinpoint the observed issues without proposing new metrics or a way to systematically benchmark LLMs with respect to the security of the generated code. Unlike these previous studies, in this paper we release a dataset and an evaluation environment that can automatically benchmark code LLMs with respect to security.
Traditionally, deep learning models use a training set for learning and a test set for evaluation. For example, CodeXGLUE [40] includes the CONCODE dataset [28] for Java code generation, which contains a test set of 2,000 samples. The Automated Programming Progress Standard (APPS) dataset has been used to measure the performance of code generation models on coding-challenge problems and contains 130,000 test cases. However, since large language models are now used for code generation, they need to be evaluated on their ability to understand prompts that mimic real-life developers and assessed using execution-based systems.
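The core of an execution-based system can be sketched in a few lines: a completion counts as correct only if the generated code runs and satisfies the task's test cases. Real harnesses additionally sandbox execution and enforce timeouts; those safeguards are omitted in this illustrative sketch, and the example solution and tests are invented.

```python
# Minimal sketch of execution-based evaluation.
def passes_tests(generated_code: str, test_code: str) -> bool:
    """Execute a generated solution with its tests; True if nothing raises."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate solution
        exec(test_code, namespace)        # run the accompanying assertions
        return True
    except Exception:
        return False


# Hypothetical example: one generated solution and its test snippet.
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(solution, tests))  # True
```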
The authors of the Codex [10] model developed HumanEval for this purpose. HumanEval contains 164 simple programming problems with canonical solutions and test cases. The Mostly Basic Python Problems (MBPP) dataset contains around 1,000 samples for a similar purpose [48]. These datasets were later extended to other programming languages [3, 76]. The CoderEval dataset [74] uses samples from real-world software. However, these datasets focus on functionality. Pearce et al. [51] provided a set of scenarios for testing the security of the generated code, and SecurityEval [63] formalized prompts for testing security across many CWEs. Though these datasets focus on measuring security, they do not enable the automated and systematic approach to benchmarking LLMs that our framework provides.
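For readers unfamiliar with these benchmarks, the sketch below shows the shape of a HumanEval-style task record: a natural-language prompt, a canonical solution, and tests keyed to an entry point. The field names follow the publicly released HumanEval format, but the concrete task shown here is invented for illustration.

```python
# Illustrative HumanEval-style task record (the task itself is hypothetical).
task = {
    "task_id": "Example/0",
    "prompt": 'def is_even(n: int) -> bool:\n    """Return True if n is even."""\n',
    "canonical_solution": "    return n % 2 == 0\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2) is True\n"
        "    assert candidate(3) is False\n"
    ),
    "entry_point": "is_even",
}
```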