Automating the Correctness Assessment of AI-generated Code for Security: Experimental Results

Written by escholar | Published 2024/02/14
Tech Story Tags: ai-code | ai-code-security | code-correctness | chatgpt | cybersecurity | software-vulnerabilities | ai-code-generators | ai-programming


This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Domenico Cotroneo, University of Naples Federico II, Naples, Italy;

(2) Alessio Foggia, University of Naples Federico II, Naples, Italy;

(3) Cristina Improta, University of Naples Federico II, Naples, Italy;

(4) Pietro Liguori, University of Naples Federico II, Naples, Italy;

(5) Roberto Natella, University of Naples Federico II, Naples, Italy.

Table of Links

Abstract & Introduction

Motivating Example

Proposed Method

Experimental Setup

Experimental Results

Related Work

Conclusion & References

5. Experimental Results

To perform the experiments, we split the dataset into training, validation, and test sets using a common 80%/10%/10% ratio [45, 46]. We ran the experiments on a machine with a Debian-based distribution, 8 vCPUs, 16 GB of RAM, and one NVIDIA T4 GPU.
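As an illustration of the 80%/10%/10% split, here is a minimal Python sketch using scikit-learn's train_test_split; the dataset size (5,900 pairs, chosen only to be consistent with the 590-snippet test set mentioned below) and the tooling are assumptions, not details from the paper.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: in practice, `samples` holds the (NL intent, code) pairs.
samples = [(f"intent {i}", f"snippet {i}") for i in range(5900)]

# 80% training, then split the remaining 20% evenly into 10% validation / 10% test.
train, rest = train_test_split(samples, test_size=0.20, random_state=42)
valid, test = train_test_split(rest, test_size=0.50, random_state=42)

print(len(train), len(valid), len(test))  # 4720 590 590
```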

5.1. Quantitative Analysis

First, for all the models, we compared the average code correctness computed by ACCA over all the examples of the test set against the average semantic correctness assessed through the human evaluation. Table 3 shows the results.

The table highlights that the results provided by ACCA are very close to the human evaluation. Indeed, the method classifies, on average, 64% of the code snippets generated by all the models as correct. On the other hand, according to our manual code review, on average 71% of the generated code snippets are semantically correct. Hence, although the two assessments are very close, ACCA slightly underestimates code correctness when compared to the human evaluation.

The quantitative analysis is, however, of limited use if we do not consider the percentage of code snippets that are classified equivalently by both the human evaluation and the proposed method. For instance, both ACCA and the human evaluation could assess 50% of the code snippets as correct and yet disagree on every single case (i.e., the sets of code snippets considered correct by ACCA and by the human evaluation could be disjoint). Therefore, the table also shows a matching rate, which expresses the percentage of code snippets that are considered correct or incorrect by both the human evaluation and ACCA. We found that our method and the human evaluation provide the same classification in ∼ 92% of the predictions (min 90%, max 94%). These results suggest that the proposed approach aligns well with human evaluation.
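The matching rate can be read as a simple per-snippet agreement score between the two binary labelings. A minimal sketch, assuming the verdicts are stored as 0/1 arrays (the names and values below are illustrative, not from Table 3):

```python
import numpy as np

# Illustrative 0/1 verdicts (1 = correct) for the same set of snippets.
acca_labels  = np.array([1, 0, 1, 1, 0, 1])
human_labels = np.array([1, 1, 1, 1, 0, 1])

correctness_acca  = acca_labels.mean()                     # fraction judged correct by ACCA
correctness_human = human_labels.mean()                    # fraction judged correct by humans
matching_rate     = (acca_labels == human_labels).mean()   # per-snippet agreement

print(correctness_acca, correctness_human, matching_rate)
```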

To better appreciate the evaluation provided by ACCA, we compared the results of the human evaluation with those provided by the baseline solutions (described in § 4.3). To this aim, we computed an offset value, i.e., the difference between the optimal value represented by the human evaluation and the value provided by each assessment solution. The lower the offset, the closer the result is to the human evaluation. Table 4 shows the results.

The average offset of the output similarity metrics ranges between a minimum (best) value of 0.09 (for SacreBLEU and edit distance) and a maximum (worst) value of 0.24 (for BLEU-4). ChatGPT provided results similar to the best-performing output similarity metrics, with an average offset of 0.09 over all the models when the correctness of the models’ predictions is computed with respect to the NL intent (ChatGPT-NL), and of 0.10 when the predictions are compared to the ground truth (ChatGPT-GT). ACCA provided the lowest offset for 3 out of 4 models and an average offset of 0.07, the lowest value overall, i.e., the code correctness computed by the proposed approach is, on average, the closest to the human evaluation.
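For clarity, a minimal sketch of how such an offset can be computed, assuming each assessment solution yields an average correctness score and the human evaluation is the reference (the numbers for the baselines below are illustrative, not taken from Table 4):

```python
# Illustrative averages; the human score is the optimal reference value.
human_avg  = 0.71
method_avg = {"ACCA": 0.64, "BLEU-4": 0.47, "ChatGPT-NL": 0.62}

# Offset = distance from the human evaluation; lower means closer to the reference.
offsets = {name: abs(human_avg - score) for name, score in method_avg.items()}
print(offsets)
```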

5.2. Qualitative Analysis

We performed a manual inspection of the discrepancy cases to examine when the method provides results that differ from the human evaluation. We have a discrepancy case when the method assesses the code as correct but the human evaluation does not, or when the method assesses the code as incorrect although it is semantically correct according to the human evaluation.

As shown in Table 3, the method underestimates the performance of the models. In fact, an in-depth inspection of the results revealed that ∼ 99% of the discrepancy cases were due to examples classified as correct by the human evaluation (value 1) but incorrect by ACCA (value 0). To better discuss these discrepancy cases, Table 5 illustrates four representative examples of mismatch between ACCA and the human evaluation.

The first two rows of the table showcase two model predictions that are correctly labeled by the human evaluation but considered incorrect by our method. These misclassifications were due to the ambiguity of the code snippets, since the same NL description can be expressed by semantically different code snippets. For instance, to zero out the stack (row #1), a programmer can reset any register and then push the contents of that register (i.e., 0) onto the stack. Also, to move the contents of a register into a different one (row #2), a programmer can use the mov instruction to transfer a value from a source to a destination or, equivalently, the xchg instruction to swap the contents of the registers. Both code snippets generated by the model accomplish what is required by the NL intent, but at the end of the symbolic execution the state of the registers is different from the one obtained with the code in the ground truth (EAX is reset instead of EDX in row #1, while, in row #2, EAX contains the value of ESI instead of its original value). Therefore, ACCA assigns a SEM score equal to zero, even though the snippets are semantically correct.
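To make the row #2 case concrete, the following conceptual sketch (with hypothetical register values, not ACCA's actual implementation) shows why mov and xchg can satisfy the same intent yet end in different register states:

```python
# NL intent (row #2): move the contents of one register into another, e.g. EAX into ESI.
initial = {"eax": 0x11, "esi": 0x22}

# Ground truth: mov esi, eax  -> ESI receives EAX's value; EAX keeps its original value.
gt_final = dict(initial, esi=initial["eax"])

# Prediction: xchg eax, esi   -> registers are swapped; ESI still receives EAX's value,
# but EAX now holds ESI's original value.
pred_final = dict(initial, esi=initial["eax"], eax=initial["esi"])

# ACCA-style comparison of final states: any mismatch yields SEM = 0,
# even though both snippets fulfil the NL intent.
sem = int(gt_final == pred_final)
print(gt_final, pred_final, sem)  # ... 0
```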

The last two rows of the table show examples of incorrect predictions that are wrongly classified as correct by the tool. As already remarked, these cases are very limited in number and represent situations in which, although the symbolic execution of the prediction and of the ground-truth reference leads to the same state of the registers at the end of the execution, the model’s prediction is not what is required by the NL description. For instance, in row #3, the prediction contains what is described in the NL intent except for the label L1. The label does not affect the state of the registers during the execution of the code, but it will impact the behavior of the whole program (unless the label is never used by other instructions). Row #4, instead, showcases a more complex example in which the correct instruction loop, which decrements the value of the counter register ECX and jumps to the target label (i.e., l1) if the counter is not zero, is replaced, in the model’s prediction, by a decrement of the counter (dec ecx) and an unconditional jump (jmp). In this case, although the instructions led to the same state of the registers because the counter was not zero after the decrement, the prediction is incorrect since the unconditional jump does not take into account the condition on the ECX register specified in the NL intent.

5.3. Correlation Analysis

Additionally, we performed a statistical analysis by computing the correlation of ACCA with the human evaluation over all the code snippets of the test set (i.e., we considered the values of the single predictions).

To this aim, we computed the Pearson correlation coefficient r, which measures the strength of association (i.e., the linear relationship) between two variables and is defined as the covariance of the two variables divided by the product of their respective standard deviations [47]. The correlation coefficient is a unit-free value between −1 and 1, which represent perfect negative and perfect positive correlation, respectively. Positive values indicate a positive correlation, i.e., the values of both variables tend to increase together, while negative values indicate a negative correlation, i.e., the values of one variable tend to increase when the values of the other decrease. A high absolute value of the coefficient indicates a strong correlation with the human evaluation; conversely, a small value indicates a weak correlation. To provide context for the evaluation, we also computed the correlation coefficients between the baseline solutions and the human evaluation. Table 6 shows the results.
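As a reference for how such a coefficient can be computed in practice, here is a minimal Python sketch using scipy.stats.pearsonr on illustrative per-snippet scores (the actual values come from the test-set predictions, not from this example):

```python
from scipy.stats import pearsonr

# Illustrative per-snippet values: human verdicts (0/1) and an automated assessment (0/1).
human_scores  = [1, 0, 1, 1, 0, 1, 0, 1]
method_scores = [1, 0, 1, 0, 0, 1, 0, 1]

# Pearson's r: covariance of the two variables divided by the product of their std deviations.
r, p_value = pearsonr(human_scores, method_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```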

Confirming previous work [15], the analysis shows that Edit Distance and Exact Match are the output similarity metrics most correlated with semantic correctness for security-oriented code, with Pearson’s r coefficients equal to 0.70 and 0.61, respectively. The output similarity metric least correlated with the human evaluation is compilation accuracy (r = 0.48), showing that the syntactic correctness of the code is not highly correlated with its semantic correctness.

An important takeaway from our experiments is that, despite ChatGPT-based assessments providing results close to the human evaluation in the quantitative analysis (see Table 4), they have a correlation coefficient lower than the best-performing output similarity metric. Indeed, ChatGPT-GT has a correlation coefficient equal to 0.67, while ChatGPT-NL has a very poor correlation with the human evaluation, resulting in the lowest value among all the baseline solutions (r = 0.41). This is a consequence of the high number of discrepancy cases between these solutions and the human evaluation, which is even more pronounced for the ChatGPT-NL solution.

Finally, ACCA provides the highest correlation coefficient across all four models, with an average value of r = 0.84, hence being the only solution with a very strong correlation with the human evaluation [48].

5.4. Computational Cost

We assessed the computational cost of ACCA in evaluating code correctness. Since the method skips the symbolic execution process for all generated snippets that are identical to the ground truth or that are not syntactically correct, we performed a thorough analysis considering three different cases: the evaluation of all the predictions in the test set (i.e., 590 code snippets), the evaluation of the subset of generated snippets that do not match the ground truth (i.e., “PR ≠ GT”), and the evaluation of the subset of generated snippets that do not match the ground truth and are also syntactically correct (i.e., “PR ≠ GT & SYN=1”).
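The three cases follow directly from ACCA's early exits. Below is a minimal sketch of this flow, with placeholder checks standing in for the real assembler and symbolic-execution steps (both helpers are assumptions about structure, not the actual implementation):

```python
import time

def assembles(snippet: str) -> bool:
    """Placeholder for the syntactic check performed by assembling the snippet (see Sec. 3.2)."""
    return bool(snippet.strip())

def same_final_state(prediction: str, ground_truth: str) -> bool:
    """Placeholder for the symbolic execution and comparison of final register states."""
    return prediction.strip() == ground_truth.strip()

def assess(prediction: str, ground_truth: str) -> tuple[int, float]:
    """Return (correctness, elapsed seconds) with ACCA-style early exits."""
    start = time.perf_counter()
    if prediction.strip() == ground_truth.strip():          # PR == GT: skip everything
        return 1, time.perf_counter() - start
    if not assembles(prediction):                           # SYN = 0: skip symbolic execution
        return 0, time.perf_counter() - start
    sem = int(same_final_state(prediction, ground_truth))   # full pipeline (most expensive case)
    return sem, time.perf_counter() - start
```

Timing only the snippets that reach the last branch corresponds to the “PR ≠ GT & SYN=1” case in Table 7.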

Table 7 presents a detailed analysis of the computational cost of ACCA. The table shows the average and standard deviation of our method’s cost, in terms of time (seconds) to assess a single code snippet, for each model.

Regarding the evaluation of the entire test set, the time required to assess a snippet is, as expected, lower than in the other cases, with an average of ∼ 0.17s. This is because predictions matching the ground truth are included in the analysis; for these predictions, both the syntactic assessment and the symbolic execution for the semantic assessment are skipped.

When we consider the subset of samples in which PR ≠ GT, i.e., when we exclude from our analysis the predictions that perfectly match the ground truth, the mean time per snippet increases for all four models. Indeed, in this case, ACCA requires on average ∼ 0.28s to assess the code correctness.

In the last scenario, PR ≠ GT & SYN=1, the value increases again because snippets that were labeled as syntactically incorrect during the assembling process (see § 3.2) are also excluded from the analysis. Therefore, this evaluation concerns only the predictions that went through the symbolic execution, i.e., all the evaluation steps of the proposed method. Overall, regardless of the model, ACCA needs on average ∼ 0.34s to symbolically execute the code and perform the evaluation.

Another aspect that influences the total computational cost of the analysis is the type of operation performed by the code snippet. For instance, while logical operations (e.g., and, xor, not) and instructions that handle register contents (e.g., inc, dec, mov) are evaluated quickly (∼ 0.25s), instructions used to iterate over a variable, to compare two registers, or to perform conditional jumps (e.g., cmp, loop, jns) are less time-efficient. This is because arithmetic and logical operations are simpler to handle, as they involve basic bit-level manipulation. Conversely, comparisons usually involve values from different registers or memory locations, and conditional jumps depend on the result of these comparisons. This complexity can lead to longer execution times compared to simple logical operations. Table 8 presents two examples of outliers in the computational cost analysis, ACCA’s result, and the time needed for their evaluation.

Both ground-truth snippets perform similar operations: a comparison between two registers or with a numerical value, a conditional jump based on the result, and an unconditional jump to a specific label. While in row #1 the code generated by the model has the same complexity, in the second row the prediction exhibits lower complexity since the last jump is missing (i.e., jmp while). Both predictions are classified as incorrect by ACCA and take ∼ 24s and ∼ 12s to evaluate, respectively.

Finally, to provide context for the evaluation, we compared the computational cost of ACCA with those of the output similarity metrics, which are automatic and time-saving solutions, and of the ChatGPT-based assessment solutions. Unsurprisingly, we found that the output similarity metrics provide an estimate of similarity in a very limited amount of time (∼ 0.004 seconds on average per snippet), ranging from 0.001 seconds for the exact match to ∼ 0.01 seconds for the SacreBLEU metric. ChatGPT is also time-efficient, needing only ∼ 0.003 seconds to evaluate the correctness of the generated code with respect to the code description (ChatGPT-NL) and ∼ 0.001 seconds for the comparison between predicted and ground-truth snippets (ChatGPT-GT). As a result, the computational cost of ACCA is higher than that of the baseline solutions, since it depends on the non-negligible time needed by the binary execution.

However, it is important to stress again that the output similarity metrics provide only an estimate of code similarity rather than an evaluation of the code’s correctness. Moreover, although ChatGPT requires limited computational time for the assessment, it is not an automated solution, as it requires non-trivial manual effort, including detailed instructions and several iterations with the human operator. On the contrary, our method is fully automated, as it does not require any human intervention for the assessment.

Finally, it is worth noting that the computational times of ACCA are considerably lower than the average time required by human analysts to manually inspect the code, based on our experience. Indeed, since the human analyst needs to check both the NL description and the code snippet predicted by the models, in our experiments the assessment of semantic correctness required ∼ 20 seconds on average per code snippet.

