This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Mohammed Latif Siddiq, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame;
(2) Joanna C. S. Santos, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame.
Fig. 2 shows an overview of our framework and how it was created. Our framework consists of the following major components: a dataset of prompts, an evaluation environment to execute the code, configurable assessment techniques, and novel evaluation metrics. Each of these components is further described in the following subsections.
To create an effective security benchmarking framework, we first needed a high-quality dataset of prompts. Although two datasets are available (LLMSecEval and SecurityEval) [63, 67], they have several limitations. First, one of them (LLMSecEval [67]) is a dataset of natural language prompts, a format that not all code LLMs support. Second, SecurityEval [63] contains several prompts that do not execute and lacks test cases to verify both the functional correctness of the generated code and the presence of vulnerabilities in it. Therefore, we aimed to create a manually curated and high-quality dataset of prompts to fulfill our needs.
The creation of the framework’s dataset of prompts involved two steps. We first retrieved code snippets and texts from different sources. Then, we manually crafted prompts from the retrieved code snippets. In the following subsections, we present our approach to collecting and crafting the prompts for our framework.
Our goal was to create a prompt dataset that reflects the real-life security-centric needs of software developers. To build this dataset, we mined code snippets from the following sources:
- StackOverflow [1] is a popular question-answering website among developers. Users describe their problems, and others try to solve them via discussion. We retrieved the 500 most popular Python-tagged questions whose accepted answer contains the word “unsafe” or “vulnerable” (a sketch of one way to retrieve such candidate questions is shown below). To these 500 questions, we then applied a set of inclusion and exclusion criteria. The inclusion criteria were: the question has to (1) explicitly ask “how to do X” in Python; (2) include code in its body; and (3) have an accepted answer that includes code. We excluded questions that were (1) open-ended, asking for best practices/guidelines for a specific problem in Python; (2) related to finding a specific API/module for a given task; (3) related to errors caused by environment configuration (e.g., a missing dependency library); (4) related to configuring libraries/APIs; or (5) syntax-specific. Applying these criteria to the 500 questions yielded a total of 13 code snippets.
- The Common Weakness Enumeration (CWE) [43] is a community effort to create a list of vulnerability types (weaknesses). Each weakness may also include demonstrative examples, i.e., code snippets written in different programming languages (e.g., C, PHP, Java, Python) containing a vulnerability that an attacker can exploit. We retrieved the list of all CWEs and extracted every demonstrative example written in Python, which yielded only one code snippet. Since not all CWEs have examples in Python, we also created examples ourselves based on the CWE descriptions, resulting in a total of 35 additional code snippets.
- CodeQL [26] is a static analysis tool that detects vulnerabilities by making queries over a source code graph representation. This tool’s documentation includes vulnerable examples in different programming languages. Thus, we retrieved a total of 35 vulnerable Python samples from CodeQL’s documentation.
- Sonar Rules [57] is a set of pre-defined patterns used by the SonarQube tool to analyze and assess the quality of code. These rules cover a wide range of coding standards, best practices, and vulnerabilities. We retrieved a total of 9 Python examples provided in the documentation for the Python-related vulnerability rules.
For each sample collected from these sources, we extracted its title, content (i.e., the raw text/code collected from the source), and source URL.
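As a concrete illustration of the StackOverflow mining step above, the sketch below shows one way such candidate questions could be pulled via the public Stack Exchange API. The endpoint and its parameters are real, but the overall workflow, the page count, and the fact that the keyword is matched against the question text rather than the accepted answer are our assumptions; the inclusion/exclusion criteria are still applied manually.

```python
import requests

SEARCH_API = "https://api.stackexchange.com/2.3/search/advanced"

def fetch_candidate_questions(keyword: str, pages: int = 5) -> list:
    """Fetch popular, answered Python questions mentioning `keyword`.
    This only approximates the paper's query; filtering by the paper's
    inclusion/exclusion criteria happens manually afterwards."""
    questions = []
    for page in range(1, pages + 1):
        resp = requests.get(SEARCH_API, params={
            "q": keyword,          # free-text keyword ("unsafe" or "vulnerable")
            "tagged": "python",    # Python-related questions only
            "accepted": "True",    # must have an accepted answer
            "sort": "votes",       # most popular first
            "order": "desc",
            "site": "stackoverflow",
            "pagesize": 100,
            "page": page,
        })
        resp.raise_for_status()
        payload = resp.json()
        questions.extend(payload.get("items", []))
        if not payload.get("has_more"):
            break
    return questions

# 5 pages of 100 questions approximate the "500 most popular" candidate pool
candidates = fetch_candidate_questions("unsafe") + fetch_candidate_questions("vulnerable")
```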
After collecting the samples, we manually went through them and created well-structured prompts from them. Each prompt is a function/method signature that describes a security-relevant coding task, i.e., a problem for which there are one or more possible solutions that are functionally correct but insecure. The prompt also includes the relevant module imports required. For each prompt, we assigned a unique identifier and manually classified it with a CWE-ID.
For each prompt, we also created an example of an insecure solution, i.e., a solution that is functionally correct but contains a vulnerability. This way, our dataset is not only a collection of prompts but also includes executable vulnerable programs.
Listing 1 shows an example of a prompt in our dataset. This prompt instructs the model to use the GitHub REST API to obtain the profile information for a given username. The first 15 lines (highlighted) include the necessary context and a docstring describing the task to complete. The rest of the code is a possible insecure solution for this prompt. As observed, this task carries the risk of a model generating code prone to server-side request forgery attacks (CWE-918).
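Since Listing 1 itself is not reproduced in this excerpt, the sketch below illustrates the general shape of such a prompt together with one possible insecure completion. The route, function name, docstring wording, and use of the requests library are illustrative assumptions rather than the dataset's exact content.

```python
import requests
from flask import Flask, request

app = Flask(__name__)

# --- prompt portion: imports, context, and a docstring describing the task ---
@app.route('/github_user')
def get_github_profile():
    """
    Return the GitHub profile information for the username passed in the
    'username' query parameter, using the GitHub REST API.
    """
    username = request.args.get('username', '')
    # --- one possible functionally correct but insecure completion ---
    # `username` is spliced unsanitized into the outgoing request URL, so a
    # crafted value can steer the server's request to an attacker-chosen
    # target: server-side request forgery (CWE-918).
    return requests.get("https://api.github.com/users/" + username).text
```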
Our framework provides the prompts in its dataset as input to an LLM. For each prompt, the framework requests the LLM to generate k solutions (where k can be specified). Each generated code snippet is saved in its own Python script file.
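As a minimal sketch of this step (the driver function and its `model.generate` call below are hypothetical placeholders, not the framework's actual API):

```python
from pathlib import Path

def generate_solutions(model, prompt: str, cwe_id: str, prompt_id: str,
                       k: int = 10, out_dir: str = "generated") -> None:
    """Request k completions of one prompt and save each one as a Python file;
    the saved files are later repaired/filtered and then assessed."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i in range(k):
        completion = model.generate(prompt)  # placeholder LLM call
        # naming loosely follows the A_cweID_promptID.py convention used later
        # when the test cases import generated code; the sample index suffix
        # is our own addition
        (out / f"A_{cwe_id}_{prompt_id}_{i}.py").write_text(completion)
```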
As prior studies have shown, LLMs can generate code with simple compilation errors (e.g., missing the closing curly bracket of a code block) [15, 61, 64]. Hence, our framework includes a static filtering phase responsible for (a) automatically fixing syntax errors through three rules and (b) removing generated code snippets that are not executable even after attempting to fix them.
The rules used to automatically repair compilation errors work as follows (a combined sketch of these heuristics is shown after the list):
• H1 (Code Block Extraction): Conversation-style models, such as ChatGPT, can include explanations (i.e., natural language text) before and/or after the generated code, enclosing the code within backticks (i.e., ```code```). This heuristic removes the natural language text and keeps only the code in the first block delimited by three backticks.
• H2 (Prompt Addition): The code generated by an LLM may omit the initial prompt. This omission results in syntax errors because the required function/class signature and imported libraries are missing. This heuristic therefore prepends the original prompt to the generated code.
• H3 (Extra Code Removal): This heuristic removes any extra code starting at (and including) the first occurrence of one of the following patterns: '\ndef', '\nif', '\n@app', "\n'''", '\nclass'. These patterns usually indicate that the model has finished generating the solution.
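A combined sketch of the three heuristics, assuming plain string manipulation (the function names and the exact regular expression are ours, not the framework's implementation):

```python
import re

STOP_PATTERNS = ["\ndef", "\nif", "\n@app", "\n'''", "\nclass"]

def h1_extract_code_block(completion: str) -> str:
    # H1: keep only the first block delimited by triple backticks, if present
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    return match.group(1) if match else completion

def h2_prepend_prompt(prompt: str, completion: str) -> str:
    # H2: re-attach the prompt (imports + signature) if the model omitted it
    return completion if completion.startswith(prompt) else prompt + completion

def h3_trim_extra_code(body: str) -> str:
    # H3: truncate at the first pattern that usually marks the end of the solution
    cut = min((body.find(p) for p in STOP_PATTERNS if p in body), default=len(body))
    return body[:cut]

def repair(prompt: str, raw_completion: str) -> str:
    code = h2_prepend_prompt(prompt, h1_extract_code_block(raw_completion))
    # apply H3 only to the part after the prompt so that the prompt's own
    # 'def'/'class' lines are not mistaken for extra code
    return prompt + h3_trim_extra_code(code[len(prompt):])
```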
To systematically evaluate the security of the code produced by a model, the infrastructure has two major components: a set of assessment techniques and a security checker.
Our framework has an evaluation environment with the runtime configuration needed to execute the generated code and verify its security. This environment is composed of Docker images for the prompts [24]. Each prompt is released in a standalone Docker image with all the dependencies required to run the code.
During this evaluation process, the generated code is placed into the evaluation environment and executed in a sandbox to prevent unsafe behavior. Next, our framework evaluates the security of the code generated by LLMs using two assessment techniques: a dynamic-based assessment and a static-based assessment.
Dynamic-Based Assessment Since each prompt in our dataset has a CWE-ID and an example of an insecure solution, the expected functional and insecure output of a function for a given input is known. If a model generates insecure code, its behavior will differ from the expected one. Deviations from the expected (secure) behavior can therefore be used to check whether the source code is susceptible to vulnerabilities. Thus, this assessment involves developing test cases with assertions for the expected functional and security properties. Specifically, we write a unit test for each prompt in our dataset using the unittest module. Each unit test class has two test methods: one verifies the functional behavior of the generated code, whereas the other checks its security behavior.
To clarify, Listing 2 shows the test case created for the prompt shown in Listing 1. This test class has two methods (test_functionality and test_security). The first checks whether the solution is functionally correct by making two HTTP GET requests to the Flask application: one request passes a username that exists (octocat), and the other passes a non-existent username (jn2Elxop0). The test then asserts whether the generated code successfully retrieves the metadata for these users. The second test method checks whether the generated code is prone to server-side request forgery attacks. It is important to highlight that when we generate the code, we save it in a file named A_cweID_promptID.py so that the test case can import the generated code (as shown in line 5).
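Listing 2 is also not reproduced in this excerpt; the sketch below shows the general structure of such a test class, assuming the generated solution was saved as A_918_0.py (a hypothetical instance of the A_cweID_promptID.py naming) and exposes a Flask app like the one sketched for Listing 1. The concrete payloads and assertions are illustrative.

```python
import unittest

from A_918_0 import app  # hypothetical generated file following A_cweID_promptID.py


class TestCWE918(unittest.TestCase):
    def setUp(self):
        self.client = app.test_client()

    def test_functionality(self):
        # an existing GitHub user should yield profile metadata...
        ok = self.client.get('/github_user?username=octocat')
        self.assertEqual(ok.status_code, 200)
        self.assertIn(b'octocat', ok.data)
        # ...while a non-existent user should not
        missing = self.client.get('/github_user?username=jn2Elxop0')
        self.assertNotIn(b'"login"', missing.data)

    def test_security(self):
        # a crafted "username" that tries to steer the server's outgoing request
        # to a different endpoint should be rejected by a secure solution
        # (the payload and the expected behavior here are illustrative)
        resp = self.client.get('/github_user?username=../../orgs/github')
        self.assertNotEqual(resp.status_code, 200)


if __name__ == '__main__':
    unittest.main()
```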
Static-Based Assessment Programs may use built-in or external libraries/modules/functions (henceforth, simply “APIs”) that are inherently unsafe. Since these unsafe APIs are used in the wild, they are likely to be part of the training data used by LLMs. Thus, there is a risk that the model may use these unsafe APIs in the generated code.
For example, the source code shown in Listing 3 uses the md5 hash function. This weak hash function may allow an adversary to recover the original input through pre-image attacks. Although it is weak and vulnerable to such attacks, it is still available for backward compatibility. This is an example of source code containing CWE-328: Use of Weak Hash [12]. Such API usage patterns can be detected through static analysis of the source code.
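Listing 3 is not included in this excerpt; the following minimal sketch illustrates the kind of API usage that triggers CWE-328 (the function names and the SHA-256 alternative are illustrative):

```python
import hashlib

def hash_certificate_weak(cert: bytes) -> str:
    # CWE-328: MD5 is a weak hash; practical collision and pre-image weaknesses
    # make it unsuitable for security-sensitive fingerprints or passwords
    return hashlib.md5(cert).hexdigest()

def hash_certificate(cert: bytes) -> str:
    # a stronger choice from the SHA-2 family (illustrative alternative)
    return hashlib.sha256(cert).hexdigest()
```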
Our framework uses CodeQL [26] for unsafe API matching. CodeQL is a static code analysis engine designed to automatically check for vulnerabilities in a project by executing QL queries against a database generated from the source code. CodeQL can be used to match the name of a called function. For example, the QL query shown in Listing 4, taken from the CodeQL repository, matches a method name and checks whether that method is called.
In addition, several vulnerability types (i.e., injection vulnerabilities) are caused by untrusted data flows [39, 72]. These weaknesses are traditionally detected through taint analysis, a technique that tracks flows from sources of potentially untrusted (tainted) data (e.g., parameters in HTTP requests) to sensitive program areas (sinks) [59]. Taint analysis can be performed at compile time (static) or at runtime (dynamic).
For example, the code in Listing 5 contains an OS command injection (CWE-78) [13]. The function passes its input to os.system without checking it; since this input may come from an untrusted source, it can lead to OS command injection.
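Listing 5 is likewise not reproduced here; a minimal sketch of the pattern it describes, assuming the untrusted input arrives through a Flask request parameter, is:

```python
import os
from flask import Flask, request

app = Flask(__name__)

@app.route('/ping')
def ping():
    host = request.args.get('host', '')
    # CWE-78: `host` flows from an HTTP parameter (taint source) into
    # os.system (sink) without any validation, so a value such as
    # "8.8.8.8; cat /etc/passwd" executes attacker-controlled commands
    os.system('ping -c 1 ' + host)
    return 'done'
```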
In these cases, our framework uses CodeQL to perform static taint analysis, tracking tainted variables and checking whether they reach a sink method (e.g., os.system).
To illustrate, Listing 6 shows a taint-tracking query in which user input received from a network call is written, as untrusted data, to a file. We use this taint-tracking support from CodeQL to determine whether the generated code is vulnerable.
Code generation models produce multiple potential solutions (i.e., code snippets) for a given prompt. Models are commonly evaluated using the pass@k metric [10, 32]. This metric estimates the probability that at least one out of k generated samples is functionally correct. To evaluate pass@k, we generate n samples per prompt (n ≥ k), count the number of samples c that are functionally correct (c ≤ n), and calculate the unbiased estimator from Kulal et al. [32]:

$$\text{pass@}k = \mathbb{E}_{\text{Problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
The secure@k metric measures the probability that all code snippets out of k samples are secure, where s is the number of secure samples generated:

$$\text{secure@}k = \mathbb{E}_{\text{Problems}}\left[\frac{\binom{s}{k}}{\binom{n}{k}}\right]$$

That is, a prompt is considered secure if all of the generated code in the top k passes our assessment techniques. To clarify, suppose we have 10 prompts and a model generates 10 outputs for each of them. If, for 6 of the 10 prompts, all of the generated outputs pass our assessment techniques, then the secure@k score is 60%.
The vulnerable@k metric measures the probability that at least one code snippet out of k samples is vulnerable, where v is the number of vulnerable generated samples:

$$\text{vulnerable@}k = \mathbb{E}_{\text{Problems}}\left[1 - \frac{\binom{n-v}{k}}{\binom{n}{k}}\right]$$

We consider a prompt to be vulnerable if any of the top-k generated samples has a vulnerability detected by our assessment techniques. For this metric, a lower vulnerable@k score indicates a better model.
▶ Estimating pass@k, secure@k, and vulnerable@k: Since calculating Kulal et al.'s [32] estimator directly results in very large numbers and numerical instability [35], we compute the metrics using the numerically stable implementation from Chen et al. [10], which simplifies the expression and evaluates the product term by term.
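For reference, a minimal sketch of such a numerically stable computation is shown below; it follows the product form popularized by Chen et al. [10], and the function names are ours rather than the framework's.

```python
import numpy as np

def at_least_one_at_k(n: int, x: int, k: int) -> float:
    """Estimate 1 - C(n-x, k)/C(n, k), i.e., the probability that at least one
    of k drawn samples is 'special' (x = c for pass@k, x = v for vulnerable@k),
    evaluating the product term by term for numerical stability."""
    if n - x < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - x + 1, n + 1)))

def all_secure_at_k(n: int, s: int, k: int) -> float:
    """Estimate secure@k for one prompt, i.e., C(s, k)/C(n, k), the probability
    that all k drawn samples are secure given s secure samples out of n."""
    if s < k:
        return 0.0
    return float(np.prod((s - np.arange(k)) / (n - np.arange(k))))

# example with n = 10 samples for one prompt: 7 correct, 2 vulnerable, 8 secure
print(at_least_one_at_k(10, 7, 5))  # pass@5 contribution of this prompt
print(at_least_one_at_k(10, 2, 5))  # vulnerable@5 contribution
print(all_secure_at_k(10, 8, 5))    # secure@5 contribution
```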