This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Domenico Cotroneo, University of Naples Federico II, Naples, Italy;
(2) Alessio Foggia, University of Naples Federico II, Naples, Italy;
(3) Cristina Improta, University of Naples Federico II, Naples, Italy;
(4) Pietro Liguori, University of Naples Federico II, Naples, Italy;
(5) Roberto Natella, University of Naples Federico II, Naples, Italy.
Automatic Program Evaluation. Traditionally, the problem of automatic program assessment has been largely addressed for educational purposes, aiming to assist educators in evaluating student work. Insa and Silva [49, 50] presented a tool to assess Java programs by automatically validating different properties, such as the use of interfaces and class hierarchy. Romli et al. [51] developed FaSt-Gen, a test-data generation framework that covers both functional and structural testing of programs for automatic assessment. Li et al. [52] leveraged random testing and dynamic symbolic execution (DSE), i.e., a software testing technique that simulates the execution of a program by providing symbolic inputs instead of concrete values. They generated test inputs and ran programs on these test inputs to compute values of behavioral similarity. Arifi et al. [53] proposed a method to automatically grade C programs in an educational context. They measured the similarity between programs by comparing the outputs of their symbolic execution. CASM-VERIFY [54] is a tool to automatically check the equivalence of optimized assembly implementations of cryptographic algorithms. The tool decomposes the equivalence checking problem into several small sub-problems using a combination of concrete and symbolic evaluation. The use of symbolic execution to evaluate code similarity has also been explored for security applications. Luo et al. [55] introduced a binary code similarity comparison method for code theft detection. Gao et al. [56] presented BinHunt, a method to identify the semantic differences between an executable and its patched version, revealing the vulnerability that the patch eliminates. Scalabrino et al. focused on automatically assessing the understandability of code snippets by combining 121 existing and new metrics, including code-related, documentation-related, and developer-related metrics. They concluded, however, that these metrics are not suited to capturing the complexity of code in practical applications. Ullah and Oh [57] proposed a neural network-based solution to perform binary diffing on x86 architecture binaries, i.e., the process of discovering the differences and similarities in functionality between two binary programs. Symbolic execution has also been leveraged to check semantic equivalence in the area of compiler validation, since compilers should preserve program semantics. For example, Bera et al. [58] applied symbolic execution to the bytecode produced by compilation with and without optimizations in order to detect compiler bugs. Hawblitzel et al. [59] detected compiler bugs by comparing assembly language outputs through symbolic execution. These solutions, however, require entire programs as input and do not work on portions of code (i.e., code snippets), which is often what AI-based code generation produces, since NMT is still far from generating entire complex functions, particularly in the context of offensive security.
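To make the idea of equivalence checking via symbolic evaluation concrete, the minimal sketch below uses the Z3 SMT solver's Python bindings to compare the symbolic outputs of two hypothetical snippets that are both supposed to double a 32-bit value; the snippet expressions and variable names are illustrative assumptions on our part, not code taken from any of the cited tools.

```python
# Minimal sketch (assumed example): checking semantic equivalence of two
# snippets by comparing their symbolic outputs with the Z3 SMT solver.
from z3 import BitVec, Solver, Not, sat

x = BitVec("x", 32)            # symbolic 32-bit input (e.g., a register value)

out_reference = x * 2          # symbolic output of the reference snippet
out_generated = x << 1         # symbolic output of the generated snippet

solver = Solver()
solver.add(Not(out_reference == out_generated))  # look for a distinguishing input
if solver.check() == sat:
    print("Not equivalent; counterexample:", solver.model())
else:
    print("Semantically equivalent for all 32-bit inputs")
```

If the solver finds no satisfying assignment, no input distinguishes the two snippets, which is the core check behind symbolic-execution-based equivalence and similarity assessment.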
Programming Language Code-oriented Metrics. In addition to state-of-the-art textual similarity metrics used as a baseline for the evaluation (see § 4.3), recent work proposed a set of novel code-oriented metrics, i.e., metrics created ad hoc for specific programming languages, to automatically assess the correctness of the generated code. Examples of code-oriented metrics are CodeBLEU [60] and RUBY [61], which were introduced to evaluate programs written in Java and C#. However, these solutions rely on deeper program analysis, including syntactic and data-flow matching, and require compilable code to function, which prevents them from being language-agnostic. Indeed, none of the available code-oriented metrics is designed for low-level programming languages such as assembly. Previous work on code generation also resorted to functional correctness to evaluate the quality of the generated programs, where a code sample is considered correct if it passes a set of unit tests. Kulal et al. [62] used an evaluation metric based on functional correctness to address the problem of producing correct code starting from pseudocode. They generated k code samples per problem and assessed the ratio of problems in which any of the k samples passed the set of unit tests. Chen et al. [63] proposed pass@k, an unbiased and numerically stable implementation of this metric. They generated n ≥ k samples per task (n = 200 and k ≤ 100), counted the number of correct samples c ≤ n that pass the unit tests, and calculated an unbiased estimator to benchmark their models in the generation of Python programs from docstrings. To estimate the functional correctness of a program, however, a set of unit tests needs to be manually constructed. This requires a significant effort that is often infeasible for large amounts of generated code.
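For illustration, the sketch below follows the unbiased, numerically stable per-task pass@k estimator described by Chen et al. [63], i.e., 1 − C(n−c, k)/C(n, k) computed as a running product; the function name and the toy numbers in the usage comment are our own assumptions.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimator of pass@k.

    n: total number of generated samples for the task
    c: number of samples that pass all unit tests
    k: number of samples the metric is allowed to draw
    """
    if n - c < k:
        return 1.0  # any k-subset necessarily contains a correct sample
    # 1 - C(n-c, k) / C(n, k), evaluated as a product for numerical stability
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Assumed example: 200 samples per task, 17 passing, estimate pass@10
# print(pass_at_k(n=200, c=17, k=10))
```

The benchmark-level score is then obtained by averaging this per-task estimate over all tasks.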
Generative AI for Security. The automatic exploit generation (AEG) research challenge consists of automatically generating working exploits [64]. This task requires technical skills and expertise in low-level languages to gain full control of the memory layout and CPU registers and to attack low-level mechanisms (e.g., heap metadata and stack return addresses). Given their recent advances, AI code generators have become a new and attractive solution to help developers and security testers in this challenging task. Liguori et al. [33] released a dataset containing NL descriptions and assembly code extracted from software exploits. The authors performed an empirical analysis showing that NMT models can correctly generate assembly code snippets from NL and that, in many cases, they can generate entire exploits with no errors. The authors extended the analysis to the generation of security-oriented Python code used to obfuscate software exploits and evade systems’ protection mechanisms [2]. Yang et al. [38] proposed a data-driven approach that treats software exploit generation and summarization as a dual learning problem. The approach exploits the symmetric structure between the two tasks via dual learning and uses a shallow Transformer model to learn them simultaneously. Yang et al. [1] proposed a novel template-augmented exploit code generation approach. The approach uses a rule-based template parser to generate augmented NL descriptions and a semantic attention layer to extract and calculate each layer’s representational information. The authors show that the proposed approach outperforms the state-of-the-art baselines from previous studies on automatic code generation. Ruan et al. [3] designed an approach for software exploit generation based on prompt tuning. The solution aids the generation process by inserting trainable prompt tokens into the original input to simulate the pre-training stage of the model and take advantage of its prior knowledge distribution. Xu et al. [65] introduced an artifact-assisted AEG solution that automatically summarizes exploit patterns from the artifacts of known exploits and uses them to guide the generation of new exploits. The authors implemented AutoPwn, an AEG system that automates the generation of heap exploits for Capture-The-Flag pwn competitions. Recent work also explored the role of GPT-based models, including ChatGPT and Auto-GPT, in the offensive security domain. Botacin [66] found that, by using these models, attackers can both create and deobfuscate malware by splitting the implementation of malicious behaviors into smaller building blocks. Pa et al. [67] and Gupta et al. [68] demonstrated the feasibility of generating malware and attack tools through the use of reverse psychology and jailbreak prompts, i.e., maliciously crafted prompts able to bypass the ethical and privacy safeguards that AI code generators like ChatGPT employ for abuse prevention. Gupta et al. [68] also examined the use of AI code generators to improve security measures, including cyber defense automation, reporting, threat intelligence, secure code generation and detection, attack identification, and malware detection. All previous work uses state-of-the-art output similarity metrics or performs manual analysis to assess the correctness of AI-generated code/programs.
Our work is complementary to these previous efforts. Indeed, this work proposes a method that leverages symbolic execution to automatically assess the correctness of low-level code snippets used in security contexts. Since the method does not necessarily require full programs as input, it is suitable for assessing AI-generated code, which often consists of incomplete or non-compilable programs. Moreover, the proposed method does not require any human intervention, yet, unlike traditional text similarity metrics, which are commonly used to assess AI-generated code, its accuracy is comparable to that of human evaluation.