Standard GI Mutations vs. LLM Edits in Random Sampling and Local Search

Written by mutation | Published 2024/02/27

TL;DR: This section presents the results of experiments comparing standard Genetic Improvement (GI) mutations with Large Language Model (LLM) edits in both random sampling and local search contexts, covering patch validity, compilation rates, unit-test pass rates, diversity, and runtime improvements.

Authors:

(1) Alexander E.I. Brownlee, University of Stirling, UK;

(2) James Callan, University College London, UK;

(3) Karine Even-Mendoza, King’s College London, UK;

(4) Alina Geiger, Johannes Gutenberg University Mainz, Germany;

(5) Justyna Petke, University College London, UK;

(6) Federica Sarro, University College London, UK;

(7) Carol Hanna, University College London, UK;

(8) Dominik Sobania, Johannes Gutenberg University Mainz, Germany.

Table of Links

Abstract & Introduction

Experimental Setup

Results

Conclusions and Future Work

Acknowledgements & References

3 Results

The first experiment compares standard GI mutations, namely the Insert and Statement edits, with LLM edits generated from prompts of varying detail (Simple, Medium, and Detailed), using Random Sampling. Table 1 shows results for all patches as well as for unique patches only. We report how many patches were successfully parsed by JavaParser (labelled Valid), how many compiled, and how many passed all unit tests (labelled Passed). We excluded patches syntactically equivalent to the original software. Best results are in bold.
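As a rough illustration of the three validation stages reported in Table 1, the sketch below checks whether a candidate patch parses, compiles, and (in principle) passes the unit tests. This is a minimal sketch, not the authors' actual GI harness; the test-running stage depends on the target project and is left as a placeholder.

```java
import com.github.javaparser.ParseProblemException;
import com.github.javaparser.StaticJavaParser;

import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.File;

// Minimal sketch of the Valid / compiled / Passed pipeline from Table 1.
public class PatchValidator {

    /** Stage 1 ("Valid"): does the patched source still parse with JavaParser? */
    static boolean isValid(String patchedSource) {
        try {
            StaticJavaParser.parse(patchedSource);
            return true;
        } catch (ParseProblemException e) {
            return false;
        }
    }

    /** Stage 2: does the patched file compile? (Requires a JDK, not just a JRE.) */
    static boolean compiles(File patchedFile) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        // run() returns 0 on successful compilation.
        return compiler != null && compiler.run(null, null, null, patchedFile.getPath()) == 0;
    }

    /** Stage 3 ("Passed"): do all unit tests still pass?
     *  In practice this would invoke the target project's JUnit suite;
     *  the details are project-specific and omitted here. */
    static boolean passesTests(File patchedFile) {
        throw new UnsupportedOperationException("project-specific test runner");
    }
}
```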

We see that although substantially more valid patches were found with the standard Insert and Statement edits, more passing patches could be found by using the LLM-generated edits. In particular, for the Medium and Detailed prompts, 292 and 230 patches passed the unit tests, respectively, compared with only 166 for Insert and 91 for Statement. Anecdotally, the hot methods with the lowest and highest patch pass rates differed for each operator; understanding this variation will be interesting for future investigation.

It is also notable that LLM patches are less diverse: over 50% more unique patches were found by the standard mutation operators than by the LLM using the Medium and Detailed prompts. With the Simple prompt, however, not a single patch passed the unit tests, since the suggested edits often could not be parsed. Thus, detailed prompts are necessary to force the LLM to generate usable outputs.

Table 1. Results of our Random Sampling experiment. We exclude patches syntactically equivalent to the original software in this table. For all and unique patches we report: how many patches passed JavaParser, compiled, and passed all unit tests.

Table 2. Local Search results. We exclude all empty patches. We report how many patches compiled, passed all unit tests, and how many led to improvements in runtime. We report the best improvement found and the median improvement among improving patches.

We further investigated the differences between the Medium and Detailed prompts to understand the reduction in performance with Detailed (in the unique patch sets), as Medium produced a higher number of compiling and passing patches. At both prompt levels, the generated response was the same in 42 cases (out of the total unique valid cases). However, Detailed tended to generate longer responses, averaging 363 characters, whereas Medium averaged 304 characters. We manually examined several Detailed prompt responses and identified some that included variables from other files, potentially offering a significant expansion of the set of code variants GI can explore.

The second experiment expands our analysis, comparing the performance of the standard and LLM edits with Local Search. Table 2 shows the results of the Local Search experiment. We report the number of compiling and passing patches, as well as the number of patches for which runtime improvements were found. Furthermore, we report the median and best improvement in milliseconds (ms). In the table, we excluded all empty patches. As before, best results are in bold.
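For context, a runtime improvement in ms can be obtained by timing the same benchmark on the original program and on a patched variant and taking the difference. The sketch below is only an illustration under that assumption, not the authors' measurement harness; the two workloads are hypothetical placeholders.

```java
// Illustrative timing of an original vs. patched program variant.
public class RuntimeComparison {

    // Time a workload and return elapsed milliseconds.
    static long timeMs(Runnable workload) {
        long start = System.nanoTime();
        workload.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long original = timeMs(() -> { /* run the original hot method / benchmark (placeholder) */ });
        long patched  = timeMs(() -> { /* run the patched variant of the same benchmark (placeholder) */ });
        long improvementMs = original - patched; // positive => the patch is faster
        System.out.println("Improvement: " + improvementMs + " ms");
    }
}
```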

Again, we see that more patches passing the unit tests could be found with the LLM using the Medium and Detailed prompts. In addition, more improvements could be found by using the LLM with these prompts. Specifically, with Medium and Detailed we found 164 and 196 improvements, respectively, while we only found 136 with Insert and 71 with Statement. The best improvement, of 508 ms, was found with the Statement edit; the best improvement found using the LLM (with the Medium prompt) improved the runtime by only 395 ms.

We also examined a series of edits in the Local Search results to gain insight into the distinctions between the Medium and Detailed prompts, given the low compilation rate of the Detailed prompt's responses. In one example, a sequence of edits aimed to inline a call to the function clip. The Detailed prompt tried to incorporate the call almost immediately, within a few edits, likely leading to invalid code. The Medium prompt, on the other hand, made less radical changes, gradually refining the code: it began by replacing the ternary operator expression with an if-then-else statement and system function calls before eventually attempting to inline the clip function call (a simplified illustration follows).
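The snippet below is illustrative only (it is not the benchmark code from the paper): it sketches the kind of incremental rewrite described for the Medium prompt, where a ternary expression is first turned into an explicit if-then-else before any later attempt to inline the clip(...) call itself.

```java
// Hypothetical example of the gradual, compilable edit style described above.
public class ClipExample {

    static int clip(int x, int min, int max) {
        return Math.max(min, Math.min(max, x));
    }

    // Original form: ternary expression wrapping the call to clip.
    static int beforeEdit(int x, int min, int max) {
        return (x > max) ? clip(x, min, max) : x;
    }

    // After the first Medium-prompt style edit: a semantically equivalent
    // if-then-else, a smaller step that keeps the code valid and compilable.
    static int afterEdit(int x, int min, int max) {
        int result;
        if (x > max) {
            result = clip(x, min, max);
        } else {
            result = x;
        }
        return result;
    }
}
```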

This paper is available on arXiv under a CC 4.0 license.

