
Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code: Results


Too Long; Didn't Read

Although LLMs can help developers to be more productive, prior empirical studies have shown that LLMs can generate insecure code.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Mohammed Latif Siddiq, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame;

(2) Joanna C. S. Santos, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame.


5 Results

The next subsections describe the results and provide an answer to each of our RQs.

5.1 RQ1 Results

Table 1 contrasts each dataset, including our framework’s dataset (denoted as SALLM in the table).



CWE Coverage

As shown in this table, our dataset covers 2.5 times more CWEs (45 CWEs) than LLMSecEval [67], which covers only 18 CWEs (a subset of the CWE Top 25 [42]). In contrast, SecurityEval [63] covers 69 CWEs, whereas SALLM’s dataset covers slightly fewer.



Upon closer inspection, we noticed that this difference is due to how the authors of the SecurityEval dataset chose to assign CWE IDs to their prompts. The CWE list includes hierarchical relationships (e.g., CWE-89: SQL Injection is a child of CWE-943: Improper Neutralization of Special Elements in Data Query Logic). In our dataset, we deliberately mapped prompts to CWE IDs at the lowest level of the CWE hierarchy (as specialized as possible), unlike SecurityEval, which tagged prompts with a higher-level, more abstract CWE even when a more specific one was available.
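For illustration, the sketch below shows this “most specific CWE” mapping policy. The parent map is a hypothetical fragment of the CWE hierarchy used only as an example, not data taken from either dataset.

```python
# Hypothetical fragment of the CWE parent/child hierarchy, for illustration only.
CWE_PARENT = {
    "CWE-89": "CWE-943",  # SQL Injection is a child of Improper Neutralization
                          # of Special Elements in Data Query Logic
}

def most_specific(cwe_ids):
    """Drop any CWE that is an ancestor of another CWE in the same set."""
    ancestors = set()
    for cwe in cwe_ids:
        parent = CWE_PARENT.get(cwe)
        while parent is not None:
            ancestors.add(parent)
            parent = CWE_PARENT.get(parent)
    return [c for c in cwe_ids if c not in ancestors]

print(most_specific(["CWE-943", "CWE-89"]))  # ['CWE-89']
```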


Dataset Size

As shown in this table, LLMSecEval has prompts instructing an LLM to generate C code and Python code. Out of its 150 prompts, only 83 are for Python. Unlike our dataset, its prompts are natural language prompts in the form of “Generate [language] code for the following: [coding problem description]”. Thus, they can only be used with LLMs fine-tuned to follow natural language instructions, which is not the case for all LLMs. For example, StarCoder [35] is an LLM that was not trained on natural language prompts and, as a result, is unable to understand a prompt such as “Write Python code that parses a CSV file.”
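To make the contrast concrete, the snippet below juxtaposes the two prompt styles. Both prompts are illustrative examples we wrote for this comparison, not entries taken from either dataset.

```python
# LLMSecEval-style natural language prompt: usable only with models that are
# instruction-tuned (e.g., chat models).
nl_prompt = "Generate Python code for the following: parse a CSV file and return its rows."

# Code-style prompt: a code context that a completion model such as StarCoder
# can continue directly. The function name and docstring are hypothetical.
code_prompt = '''import csv

def parse_csv(path):
    """Parse the CSV file at `path` and return its rows as lists of strings."""
'''
```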


It is also important to highlight that although SecurityEval has more prompts than SALLM’s dataset, its total size in tokens is smaller than ours: SALLM’s prompts have an average of 265 tokens, whereas SecurityEval’s have 157 tokens on average. Moreover, we found several SecurityEval prompts that were not compilable because they required external libraries or were single scripts that are part of a larger codebase (e.g., a Django application).
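A minimal sketch of how the average prompt length could be measured is shown below. The tokenizer choice is an assumption on our part; the paper does not state which tokenizer produced the 265- and 157-token averages.

```python
import tiktoken  # OpenAI's tokenizer library, assumed here for illustration

def average_tokens(prompts):
    """Return the mean token count over a list of prompt strings."""
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 encoding
    return sum(len(enc.encode(p)) for p in prompts) / len(prompts)

# Placeholder prompt list; in practice this would be the dataset's prompts.
prompts = ['def parse_csv(path):\n    """Parse a CSV file."""\n']
print(average_tokens(prompts))
```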


5.2 RQ2 Results

In this section, we report the results of running our assessment techniques on the code generated by the studied LLMs.


Table 2 presents the vulnerable@k and secure@k computed based on the outcomes from SALLM’s assessment technique. The numbers in dark green are those that had the best performance for a given metric; the numbers in dark red are those in which the model had the worst performance. Recall that for the vulnerable@k metric, a lower value is better.


As shown in this table, vulnerable@k varied from 16% to 59%. For temperature 0, all models had the same value for their vulnerable@1, vulnerable@3, and vulnerable@5, as well as their secure@1, secure@3, and secure@5. The explanation for this observation is that temperature 0 makes the output nearly deterministic, i.e., the generated samples have little to no variance.
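For readers unfamiliar with these metrics, the sketch below assumes they follow the same unbiased estimator as the pass@k metric of Chen et al., computed per prompt from the number of vulnerable (or secure) generations among n samples; the exact formulation used by SALLM is defined earlier in the paper and may differ in detail.

```python
from math import comb

def at_k(n, hits, k):
    """Probability that at least one of k samples drawn (without replacement)
    from n generations is a 'hit' (vulnerable or secure, depending on use)."""
    if n - hits < k:
        return 1.0
    return 1.0 - comb(n - hits, k) / comb(n, k)

# Example: 10 generations for one prompt, 4 flagged vulnerable, 6 deemed secure.
print(at_k(10, 4, 5))  # per-prompt vulnerable@5
print(at_k(10, 6, 5))  # per-prompt secure@5
```

Under this formulation, the temperature-0 observation falls out naturally: when every sample for a prompt is identical, the hit count is either 0 or n, so the metric takes the same value for k = 1, 3, and 5.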


From these results, we also found that, on the one hand, StarCoder was the best-performing LLM with respect to secure code generation: it had the lowest vulnerable@k across all temperatures. On the other hand, CodeGen-2B and CodeGen2.5-7B performed worse, on average, than the other LLMs. Among the GPT-style models, GPT-4 performed better than GPT-3.5-Turbo.


5.3 RQ3 Results

We collected 423 compilable Python samples from the ChatGPT-generated code obtained with developers’ conversation-style prompts. We ran CodeQL on the generated code to check for vulnerable API usage and to perform taint analysis. Table 3 presents the CWEs CodeQL found and the number of vulnerabilities for each CWE. CodeQL found 10 types of CWEs across 12 Python samples. CWE-312 (Cleartext Storage of Sensitive Information) is the most common occurrence in the generated Python code. Of these 10 CWE types, four are in the 2023 CWE Top 25 ranking. Notably, there are no injection-based CWEs, i.e., OS command, path, or SQL injection.


Upon further inspection of the CodeQL output, we found that ChatGPT uses a pseudo-random generator to produce security-sensitive values. Such a generator is predictable: it has a limited search space and can produce duplicate values, which attackers can exploit.
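The snippet below illustrates this pattern and a safer alternative; it is a minimal example of the issue, not ChatGPT’s actual output.

```python
import random
import secrets

# Risky: the `random` module uses a Mersenne Twister, which is deterministic
# and whose internal state can be recovered from its output, so a token like
# this one is predictable.
weak_token = "".join(random.choices("0123456789abcdef", k=32))

# Safer: `secrets` draws from the operating system's CSPRNG and is intended
# for passwords, session tokens, and other security-sensitive values.
strong_token = secrets.token_hex(16)
```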


Another common issue we found was Flask applications running in debug mode. Although debug mode is helpful during pre-production, it can leak sensitive information (e.g., stack traces), and ChatGPT generates code in which debugging is enabled for the Flask application.
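A minimal sketch of the pattern and a safer variant is shown below; the toy application and the FLASK_DEBUG environment variable are illustrative assumptions, not part of the generated code we studied.

```python
import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "ok"

if __name__ == "__main__":
    # Risky pattern seen in generated code: the interactive debugger can
    # expose stack traces and allow code execution if reachable in production.
    # app.run(debug=True)

    # Safer default: debug stays off unless explicitly enabled at deploy time.
    app.run(debug=os.environ.get("FLASK_DEBUG") == "1")
```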


We also found that ChatGPT generates logging code in which sensitive information is neither encrypted nor hashed. Such information can be used to exploit an application. ChatGPT also places hard-coded credentials in the generated code; users should replace them before using the code in their applications.
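The sketch below contrasts the two patterns with safer alternatives. It is illustrative only: the digest-based logging and the DB_PASSWORD environment variable are our own assumptions, not recommendations from the paper.

```python
import hashlib
import logging
import os

logging.basicConfig(level=logging.INFO)

# Risky pattern: logging.info("login for %s with password %s", user, password)
def log_login_attempt(user, password):
    # Log at most a truncated digest so the cleartext secret never reaches
    # the log file (omitting the value entirely is even better).
    digest = hashlib.sha256(password.encode()).hexdigest()[:12]
    logging.info("login attempt for %s (password digest %s...)", user, digest)

# Risky pattern: DB_PASSWORD = "s3cr3t"  (hard-coded credential)
# Safer: read credentials from the environment or a secrets manager.
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")
```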