This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Domenico Cotroneo, University of Naples Federico II, Naples, Italy;
(2) Alessio Foggia, University of Naples Federico II, Naples, Italy;
(3) Cristina Improta, University of Naples Federico II, Naples, Italy;
(4) Pietro Liguori, University of Naples Federico II, Naples, Italy;
(5) Roberto Natella, University of Naples Federico II, Naples, Italy.
Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves as a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code and compare the results of the evaluation with different baseline solutions, including output similarity metrics, widely used in the field, and the well-known ChatGPT, the AI-powered language model developed by OpenAI.
Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similarly to human-based evaluation, which is considered the ground truth for the assessment in the field. Moreover, ACCA has a very strong correlation with the human evaluation (Pearson’s correlation coefficient r = 0.84 on average). Finally, since it is a fully automated solution that does not require any human intervention, the proposed method performs the assessment of every code snippet in ∼ 0.17s on average, which is far lower than the average time required by human analysts to manually inspect the code, based on our experience.
Keywords: Code correctness, AI code generators, Assembly, Offensive
Artificial Intelligence (AI) code generators use Neural Machine Translation (NMT) models to turn natural language (NL) descriptions into programming code. They represent a powerful asset in the arsenal of both cybersecurity professionals and malicious programmers. Indeed, AI (offensive) code generators are becoming an attractive solution for creating proof-of-concept exploits to assess the exploitability and severity of software vulnerabilities [1, 2, 3], as they help developers generate low-level (i.e., assembly) and complex code with reduced effort and improved effectiveness.
Despite the dramatic increase in the adoption of AI code generators, they still have limitations and potential drawbacks. For example, they may not always generate correct code, i.e., code that performs what is required by the NL description, as they may struggle with more complex programming tasks that require human creativity and problem-solving skills, or may incorrectly interpret developers’ descriptions. Furthermore, AI code generators can introduce security vulnerabilities if not properly tested and validated [4, 5, 6]. For these reasons, assessing the correctness of AI-generated code becomes a crucial challenge.
From the existing literature, it emerges that one of the most effective ways to assess the correctness of generated code is to perform a manual code review (i.e., human evaluation) [7, 8]. This involves having a human expert review the code and identify any errors or inconsistencies with the NL description. However, human evaluation has several limitations. First, manual analysis can be a time-consuming process. Indeed, reviewers must carefully examine each line of code and thoroughly test the software to ensure that it meets the intended requirements and NL specifications. This process also requires reviewers to be highly knowledgeable about the programming language, development environment, and intended functionality of the code to provide accurate assessments. Moreover, the analysis can be subjective, as different reviewers may have different interpretations of the code and its intended functionality, depending on their expertise and experience. This can lead to inconsistent assessments of code correctness. Last but not least, manual analysis is prone to human error, as reviewers may miss subtle errors or inconsistencies in the code, or may introduce errors and biases into their assessments due to factors such as fatigue, distractions, or subjective opinions. From the above considerations, it is clear that much of what is gained from the help of AI is lost again to manual review.
Unfortunately, there is currently no fully automated solution that can replace the human evaluation of AI-generated code. In fact, although existing automated testing and code analysis tools can effectively identify errors or inconsistencies in code, they do not provide any insight into whether the code does what developers actually require [9, 10, 11, 12]. Moreover, these solutions often require as input entire, compilable programs (e.g., complete functions) rather than the single code snippets that AI code generators typically produce.
Besides the lack of automated solutions, there is a more fundamental issue, i.e., how to evaluate the correctness of AI-generated code in the first place. Indeed, previous studies proposed a large number of output similarity metrics, i.e., metrics computed by comparing the textual similarity of the generated code with a ground-truth reference implementation [13, 14, 15]. The major advantage of these metrics is that they are reproducible, easily tuned, and time-saving. However, in the context of programming code generation, existing metrics are not able to fully reflect the correctness of the code.
As illustrated in the next section, generated code can be different from the reference but still be correct (e.g., the assembly conditional jumps jz and je are different instructions that can be used to perform the same operation); or, there can be subtle differences between the generated and the reference code, which can be similar yet produce different outputs (e.g., the assembly conditional jumps je and jne are syntactically similar instructions, but they perform the opposite operation). Hence, it is crucial to develop novel, more accurate methods for automatically evaluating the correctness of AI-generated code.
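To make this concrete, consider a minimal sketch of the problem. The snippets below are toy examples, and the character-level similarity ratio is used only as a simple stand-in for the output similarity metrics discussed above:

```python
# Toy illustration: purely textual similarity can rank a semantically
# wrong candidate above a semantically equivalent one.
# The snippets and the character-level ratio are illustrative stand-ins,
# not the paper's dataset or its actual similarity metrics.
from difflib import SequenceMatcher

reference       = "cmp eax, ebx\nje _match"   # reference implementation
equivalent_cand = "cmp eax, ebx\njz _match"   # different mnemonic, same jump condition
opposite_cand   = "cmp eax, ebx\njne _match"  # similar spelling, opposite jump condition

def textual_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

print(textual_similarity(reference, equivalent_cand))  # lower score, yet correct behavior
print(textual_similarity(reference, opposite_cand))    # higher score, yet wrong behavior
```

In this toy case, the candidate with the opposite behavior obtains the higher textual score, which is precisely the kind of mismatch between similarity and correctness that motivates a semantics-aware assessment.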
This paper proposes a method, named ACCA (Assembly Code Correctness Assessment), to automatically assess the correctness of assembly AI-generated code without any human effort. More precisely, our solution leverages symbolic execution to assess whether the generated code behaves as a reference implementation, despite syntactic differences between the reference and the generated code.
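To convey the intuition behind this behavioral check, the following sketch encodes simplified, hand-written semantics of the conditional jumps from the example above and uses an off-the-shelf SMT solver (Z3) to search for a symbolic state on which two candidates diverge. This is only an illustration of the underlying principle under simplified assumptions, not ACCA’s actual implementation:

```python
# Sketch of symbolic equivalence checking on a single conditional jump.
# The flag semantics are written by hand and deliberately simplified;
# this example only illustrates the underlying idea using the Z3 solver
# (pip install z3-solver), not ACCA's actual symbolic execution pipeline.
from z3 import BitVec, Solver, sat

def branch_taken(mnemonic, zf):
    """Whether the jump is taken, given a (symbolic) zero flag ZF."""
    if mnemonic in ("je", "jz"):   # both jump when ZF == 1
        return zf == 1
    if mnemonic == "jne":          # jumps when ZF == 0
        return zf == 0
    raise ValueError(f"unsupported mnemonic: {mnemonic}")

def behaviorally_equivalent(mnem_a, mnem_b):
    zf = BitVec("zf", 1)           # symbolic zero flag
    solver = Solver()
    # Ask the solver for a flag value on which the two jumps disagree.
    solver.add(branch_taken(mnem_a, zf) != branch_taken(mnem_b, zf))
    return solver.check() != sat   # no diverging state => same behavior

print(behaviorally_equivalent("je", "jz"))   # True:  syntactically different, same behavior
print(behaviorally_equivalent("je", "jne"))  # False: syntactically similar, opposite behavior
```

The check reports equivalence only when the solver proves that no state makes the two candidates behave differently, which mirrors, at a much smaller scale, how symbolic execution allows ACCA to accept syntactically different but semantically equivalent code while rejecting look-alike code with different behavior.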
We apply ACCA to assess four state-of-the-art NMT models in the generation of security-oriented code in assembly language starting from NL descriptions in English, and compare the results of ACCA with the human evaluation and several baseline assessment solutions, including a wide range of output similarity metrics and the well-known ChatGPT by OpenAI. We show that the proposed method provides an almost perfect assessment of code correctness and has a very strong correlation with the human evaluation, outperforming all the baseline assessment solutions.
In the following, Section 2 introduces a motivating example; Section 3 describes ACCA; Section 4 presents the experimental setup; Section 5 shows the experimental results; Section 6 presents the related work; Section 7 concludes the paper.