This paper is available on arXiv under the CC 4.0 license.
Authors:
(1) Alexander E.I. Brownlee, University of Stirling, UK;
(2) James Callan, University College London, UK;
(3) Karine Even-Mendoza, King’s College London, UK;
(4) Alina Geiger, Johannes Gutenberg University Mainz, Germany;
(5) Justyna Petke, University College London, UK;
(6) Federica Sarro, University College London, UK;
(7) Carol Hanna, University College London, UK;
(8) Dominik Sobania, Johannes Gutenberg University Mainz, Germany.
Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI’s API to generate edits for the JCodec tool. We randomly sample the space of edits using 5 different edit types. We find that the number of patches passing unit tests is up to 75% higher with LLM-based edits than with standard Insert edits. Further, we observe that the patches found with LLMs are generally less diverse compared to standard edits. We ran GI with local search to find runtime improvements. Although many improving patches are found by LLM-enhanced GI, the best improving patch was found by standard GI.
Keywords: Large language models · Genetic Improvement
As software systems grow larger and more complex, significant manual effort is required to maintain them [2]. To reduce developer effort in software maintenance and optimization tasks, automated paradigms are essential. Genetic Improvement (GI) [15] applies search-based techniques to improve non-functional properties of existing software, such as execution time, as well as functional properties, such as repairing bugs. Although GI has had success in industry [12,13], it remains limited by the set of mutation operators it employs in the search [14].
Large language models (LLMs) are able to process textual queries without additional training for the particular task at hand. They have been pre-trained on millions of code repositories spanning many different programming languages [5]. Their use for software engineering tasks has had great success [9,6], showing promise also for program repair [17,19].
Kang and Yoo [10] have suggested that there is untapped potential in using LLMs to enhance GI. GI uses the same mutation operators for different optimization tasks. These operators are hand-crafted before the search starts and thus yield a limited search space. We hypothesize that adding LLM patch suggestions as a further mutation operator will enrich the search space and produce more successful variants.
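To make the idea concrete, the following is a minimal sketch of how an LLM-backed mutation operator could slot into a GI edit loop: build a prompt around the statement selected for mutation, query a model, and treat the reply as the candidate replacement. All names here (`LlmEdit`, `buildPrompt`, `queryModel`) are illustrative, not Gin's or OpenAI's actual API; a real implementation would send the prompt to the OpenAI API, as our extension of Gin does.

```java
// Hypothetical sketch of an LLM-based mutation operator for GI.
// Class and method names are illustrative only.
public class LlmEdit {

    // Build a prompt asking the model for a drop-in replacement for one
    // Java statement (the unit of code that Gin edits operate on).
    static String buildPrompt(String statement) {
        return "Rewrite the following Java statement so that the program "
             + "still compiles and passes its tests, but may run faster. "
             + "Reply with code only.\n\n" + statement;
    }

    // Placeholder for the model call: a real operator would send the
    // prompt to an LLM API here and parse the code from the response.
    static String queryModel(String prompt) {
        return prompt; // stub so the sketch is self-contained
    }

    public static void main(String[] args) {
        String target = "for (int i = 0; i < n; i++) sum += a[i];";
        String candidate = queryModel(buildPrompt(target));
        System.out.println(candidate);
        // GI would then splice the candidate back into the program and
        // keep it only if it compiles and passes the unit tests.
    }
}
```

The key design point is that the operator's output, unlike a hand-crafted Insert or Statement edit, is not drawn from a fixed, predefined set of transformations, which is what we expect to enrich the search space.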
In this paper, we conduct several experiments to explore whether using LLMs as a mutation operator in GI can improve the efficiency and efficacy of the search. Our results show that LLM-generated patches have compilation rates of 51.32% and 53.54% for random search and local search, respectively (with the Medium prompt category); LLMs used as-is have previously been shown to produce code that compiles roughly 40% of the time [16,18]. We find that randomly sampled LLM-based edits compiled and passed unit tests more often than standard GI edits: the number of patches passing unit tests is up to 75% higher for LLM-based edits than for GI Insert edits. However, the patches found with LLMs are less diverse. For local search, the best improvement is achieved with standard GI Statement edits, followed by LLM-based edits. These findings demonstrate the potential of LLMs as mutation operators and highlight the need for further research in this area.