Automating the Correctness Assessment of AI-generated Code for Security Contexts: Proposed Methodby@escholar
116 reads

Automating the Correctness Assessment of AI-generated Code for Security Contexts: Proposed Method

tldt arrow

Too Long; Didn't Read

A paper on the automated method assessing AI-generated assembly code for security, outperforming human evaluation and baseline metrics.
featured image - Automating the Correctness Assessment of AI-generated Code for Security Contexts: Proposed Method
EScholar: Electronic Academic Papers for Scholars HackerNoon profile picture

This paper is available on arxiv under CC 4.0 license.


(1) Domenico Cotroneo, University of Naples Federico II, Naples, Italy;

(2) Alessio Foggia, University of Naples Federico II, Naples, Italy;

(3) Cristina Improta, University of Naples Federico II, Naples, Italy;

(4) Pietro Liguori, University of Naples Federico II, Naples, Italy;

(5) Roberto Natella, University of Naples Federico II, Naples, Italy.

Abstract & Introduction

Motivating Example

Proposed Method

Experimental Setup

Experimental Results

Related Work

Conclusion & References

3. Proposed Method

To overcome the limitations in the assessment of AI-generated assembly code described in the previous section, we propose ACCA, a method to automatically assemble and symbolically execute both the reference (i.e., the ground truth) and predicted snippets (i.e., the code generated by the models). Through symbolic execution, the method simulates the execution of both programs and determines whether, starting from the same state, they terminate producing equivalent results and effects.

First, ACCA compares, at the string level, the prediction with the reference to assess its correctness, because, if the prediction is equal to the reference, then we assume the code prediction is correct. If this is not the case, the method assesses whether the code is syntactically correct. Indeed, if a code prediction is not structured according to the rules of the target programming languages, it is classified as incorrect.

If this is not the case, i.e., if the prediction differs from the ground truth reference and the prediction is syntactically correct, ACCA generates two source code files, one for the reference and one for the predicted snippet. It then assembles and links them to produce the executable files needed for the symbolic execution. At this point, the method symbolically executes both files resulting from the assembling process to assess whether they are equivalent.

Finally, ACCA returns the pair of syntactic correctness value (SYN) and the semantic correctness value (SEM) of the code predicted by the model, which is equal to 1 when the correctness is verified, 0 otherwise. Figure 1 shows a detailed flowchart of the syntactic and semantics correctness evaluation process.

In the rest of this section, we detail the phases of the workflow. For the sake of simplicity, we showcase examples of assembly code for the Linux OS running on IA-32 architecture to describe the evaluation process, although the proposed method is not restricted to this specific context, and it can be applied to different instruction set architectures and operating systems.

Figure 1: Detailed flowchart of ACCA.

3.1. String Comparison

ACCA first checks whether the predicted code snippet perfectly matches the ground truth by performing a string comparison. If they match (i.e., the code generated by the model is equal to ground truth), the prediction is considered both syntactically and semantically correct (i.e., SYN=1, SEM=1) and the evaluation ends. Otherwise, i.e., if they differ, the method proceeds with the evaluation of the syntactic and semantic correctness. The preliminary string comparison is done to speed up the evaluation process by skipping the symbolic execution process when not needed, i.e., when the prediction perfectly matches the reference snippet and is, therefore, correct.

3.2. Assembling Process

The purpose of the assembling process is to assess whether each code snippet generated by the model adheres to the syntax rules of the programming language it is written in, i.e., to check whether it is compilable. Since NMT is still far from producing entire complex programs, the output predicted by the models is a portion of an entire program (i.e., a single-line or multi-line statement). Thus, ACCA constructs a complete program by adding the necessary entry point and exit point code. For instance, consider the following code snippets for IA-32:

cmp EAX, EBX

je loop

pop ECX

This code compares the contents of two registers and, based on the result, either performs a jump operation or reads from the stack and saves the value into another register. The snippet is syntactically correct according to the assembly language, yet it is incomplete for the execution since code snippets need to contain a minimal set of information to be properly executed.

In the case of Linux OS, these instructions are kept in the text section, which must begin with the declaration global start to define the initial program entry point. Moreover, to make sure the program terminates correctly, the proposed method also includes a fictitious label that represents the exit address the code jumps to at the end of its execution. This label is declared in the data section, which contains initialized read-write data.

Therefore, to have a complete program, the code is modified as follows:

section .data

exit_addr db 0x56

section .text

global _start


cmp EAX, EBX

je loop

pop ECX

jmp exit_addr

Once the whole program is created, ACCA generates a source file and leverages an assembler to assess its syntactic correctness. Indeed, if the programs compile, then all the instructions of the programs are syntactically correct, therefore also the code generated by the model respects the structure of assembly programming language. There are three possible output scenarios for the compilation:

• No errors, in which the assembling process is completely correct;

• Warnings, in which the assembler reports some type of warning (e.g., word data exceeds bounds, invalid register size specification ignored, etc.), but the compilation still terminates without errors;

• Errors, in which the assembling process results in an error that prevents the code from being assembled.

In the first two cases, the output produced by the model is considered syntactically correct (i.e., SYN=1). Warnings are considered acceptable since they indicate issues involving bad practice, but are not severe enough to prevent the code compilation. When the compilation produces an error (i.e., the third case), we investigate the nature of the error. More precisely, we focus on a specific category of the error raised by the compiler, the SymbolNot-Defined (SDN) errors, which occur when the code contains a symbol (e.g., a label or variable) that has not been previously defined or initialized. We handle this category of errors appropriately. Indeed, since the predicted snippets contain only one or a few instructions, they might reference a label or variable defined in a different portion of the program, which leads to an assembling error even when the program is syntactically correct.

Consider, again, the previous code snippet: the first instruction compares the contents of the EAX and EBX registers and, if they are equal, the execution jumps to the loop label. However, this symbol has not been defined yet. To handle these cases, we analyze the assembler output to determine the missing symbol’s name and include it in the source code file as a fictitious label. This label simply points to a jump operation to the previously defined exit address (i.e., myExitAddr). Indeed, the destination of the jump is not significant for the evaluation since we are only interested in checking the correctness of the instructions generated by the model. Therefore, after a SDN error, ACCA further modifies the program as follows:

section .data

myExitAddr db 0x56

section .text

global _start


cmp EAX, EBX

je loop

pop ECX

jmp myExitAddr


jmp myExitAddr

Once the source code file is modified accordingly, ACCA repeats the assembling process as before to check if the compilation ends with no errors or warnings. In this case, ACCA assigns the SYN score equal to 1 and continues the evaluation to check the semantic correctness.

When the compilation ends with errors different from the SDN, such as an invalid combination of opcode and operands, expression syntax error, etc., then ACCA labels the model’s prediction as syntactically incorrect (SYN=0). Since a snippet syntactically incorrect is also semantically incorrect, then the evaluation process terminates, assigning also the SEM score equal to zero. A source code file is generated and assembled for both the ground truth and the predicted code snippet.

3.3. Symbolic Execution

To evaluate the semantic correctness, ACCA leverages the symbolic execution. To this aim, the method needs the program executable. If the assembling process ends correctly, the assembler outputs an object file, which is then fed to the linker to complete the process.

Since the same operation may be correctly implemented in different ways, a simple textual comparison with the reference is not enough to assess the semantic correctness of a program. We still consider the ground truth as the reference for the correct implementation of the intended code. However, we do not limit the comparison to a textual similarity, but we examine the system state at the end of the execution of both the reference and the generated code. Indeed, two programs that implement the same functionality using different operations can be considered semantically equivalent if they result in the same final system state. Since the final execution state depends on the inputs and initial state of the program, we need to compare the state produced by both programs for every possible combination of inputs and initial state.

Symbolic execution is a state-of-the-art solution for program analysis based on abstract execution. It consists in simulating the execution of a program providing symbolic values as its input instead of concrete ones. The result of this execution is a set of output values expressed as functions of the input symbols. ACCA uses symbolic execution to determine all the existing execution paths and all possible corresponding output system states. It then compares the set of output system states of the generated program with the set of output system states of the ground truth program: if they match, then the programs are semantically equivalent (SEM=1), otherwise, the method classifies the model’s prediction and the ground-truth as not semantically equivalent (SEM=0).

To symbolically execute the programs, we use a binary analysis platform (BAP) that loads each executable and provides a complete abstract representation of the target architecture, CPU registers, memory address space, and stack. The program is conceived as a sequence of basic blocks, i.e., a straight-line code sequence with no branches, and the interconnections between the blocks represent the jump operations. An example is shown in Fig. 2: the program compares the contents of two registers and, if they are equal, the execution jumps to a specific address, otherwise, it jumps to the next instruction. Each possible branch is the entry point of a different basic block, which contains a sequence of operations that can be executed in order and one last instruction that causes the execution to move to another basic block.

Figure 2: Example of a code snippet represented as a sequence of basic blocks.

ACCA begins the execution by initializing the abstract registers and flags to the same symbolic values; the method also sets the value of the stack pointer and initializes the memory to a blank state. Then, it executes the program by simulating each instruction step by step and keeping track of its state at each given step. During the execution of the programs, performing operations, such as arithmetic operations, comparisons, assignments, etc., that involve a variable (e.g., if X > 10) will yield an execution tree, i.e., a tree of operations formed by all the possible paths of the program, which encode all branch decisions taken up to a specific point. Execution trees are then translated into constraints, i.e., formulas that express a set of assumptions on the symbolic outputs of the execution. These constraints are finally solved by a satisfiability modulo theories (SMT) solver, which is typically employed in program verification to verify whether constraints can be satisfied by some assignment of concrete values to the program’s symbolic arguments [17]. As an example, consider the following simple constraints:

x > y

y > 2

10 > x

The SMT solver treats them as assertions that must be satisfied with the valid values of symbolic variables. It, therefore, outputs a value that is consistent with all the constraints (e.g., x=4, y=3).

ACCA symbolically executes both the reference code and the predicted snippet: assuming that the program state at the beginning of the execution is identical, if the programs are semantically equivalent, then their state is also identical at the end of the execution. Therefore, to assess the semantic correctness of the generated code compared to the ground truth, the proposed method checks whether the states of the architecture are equal at the end of both executions. The program state includes:

• state of the registers, i.e., the contents of the abstract CPU registers;

• state of the flags, i.e., the abstract status register that keeps track of the CPU state, including the result of arithmetic operations (e.g., carry flag, overflow flag, etc.) and interruptions;

• values on stack, i.e., the contents of the memory area used to store temporary values;

• path constraints, i.e., the condition of input values, defined over the previous items, that leads to the corresponding final state.

To this end, ACCA compares the state of each leaf node, i.e., the final states at the end of each path in the execution tree representing the program, of both executables. To compare the leaf nodes, the method constructs a set of lists for every final basic block of the two programs. Each set contains a list of register values, flag values, boolean constraints, and stack values. For example, a reference program whose execution tree ends with two basic blocks (leaf nodes) has two sets of lists. Each set contains all the values that represent the system state for that particular execution path. If the execution tree of the generated program has the same number of leaf nodes, then each list of the two sets is compared with each list of two sets of the reference program. If there is a correspondence between each leaf of the first program and each leaf of the second program, then they are semantically equivalent (SEM=1) and the evaluation process ends. Contrarily, if the leaf nodes of the two program execution trees do not match, then we conclude that the predicted code is not semantically equivalent to the reference snippet (SEM=0) and the process ends.

Since the total number of states can grow exponentially in the number of branches, one of the main challenges of symbolic execution is the path explosion problem [17]. Indeed, keeping track of a large number of pending branches to be explored impacts both the running time and the space requirements of symbolic execution. The primary causes of path explosion are loops and function calls. One common method to deal with this problem is to bind the path exploration to a limited number of iterations. To handle programs whose symbolic execution does not terminate, we set a maximum number of execution steps. Since AI-generated code is typically concise and consists of a few assembly instructions, a correct program concludes its execution in a few execution steps. If it runs for more than max steps, then the symbolic execution stops, and the generated program is classified as incorrect (SEM=0).

3.4. Implementation Details

To assemble the programs and generate the executable files, we rely on the wide set of available open-source software. For the previous examples (on the IA-32), we used the Netwide Assembler (NASM) [18], an 80x86 and x86-64 assembler that supports a wide range of executable formats, including Linux, BSD, and Windows operating system formats. As a binary analysis platform, we use ANGR [19]. ANGR provides support for a variety of CPU architectures, including ARM, MIPS, PPC, and x86 processors. It comprises a series of sub-components that implement the different steps necessary for the symbolic execution: to disassemble the executables and lift the binary code to an intermediate representation; to simulate the program state and execution, including registers and memory; and to solve the generated constraints, using the z3 [20] SMT solver as a backend. We set the maximum number of execution steps max steps to 100 to avoid infinite loops. Our implementation runs on both Linux and Windows OS. We will publicly share the implementation of ACCA on GitHub.