Authors:
(1) Alexander E.I. Brownlee, University of Stirling, UK;
(2) James Callan, University College London, UK;
(3) Karine Even-Mendoza, King’s College London, UK;
(4) Alina Geiger, Johannes Gutenberg University Mainz, Germany;
(5) Justyna Petke, University College London, UK;
(6) Federica Sarro, University College London, UK;
(7) Carol Hanna, University College London, UK;
(8) Dominik Sobania, Johannes Gutenberg University Mainz, Germany.
Genetic improvement of software is highly dependent on the mutation operators it utilizes in the search process. To diversify the operators and enrich the search space further, we incorporated a Large Language Model (LLM) as an operator.
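For illustration, a minimal sketch of what such an operator might look like, assuming the OpenAI chat-completions endpoint is called directly over HTTP; the class name, method names, model identifier, and prompt wording here are placeholders rather than Gin's actual implementation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch only: asks a chat-completion endpoint to rewrite a Java
// statement and returns the raw JSON reply. Names and prompt text are placeholders.
public class LlmStatementEdit {

    private static final String ENDPOINT = "https://api.openai.com/v1/chat/completions";
    private final HttpClient client = HttpClient.newHttpClient();
    private final String apiKey;

    public LlmStatementEdit(String apiKey) {
        this.apiKey = apiKey;
    }

    // Sends the statement with a short, "medium"-style prompt.
    public String requestVariant(String statement) throws Exception {
        String prompt = "Give me a different implementation of this Java statement: " + statement;
        String body = "{\"model\":\"gpt-3.5-turbo\",\"messages\":"
                + "[{\"role\":\"user\",\"content\":\"" + escape(prompt) + "\"}]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ENDPOINT))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Minimal JSON escaping so the statement can be embedded in the request body.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}

A caller would then splice the suggested code back into the target method and evaluate the resulting patch, e.g. against the project's unit tests, as in the experiments reported below.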
Limitations. To generalise, future work should consider projects beyond our single target, jCodec. Our experiments used an API that gave us no control over the responses generated by the LLM, nor any way of modifying or optimising them. Though we did not observe changes in behaviour during our experiments, OpenAI may change the model at any time, so future work should consider local models. We experimented with only three prompt types for LLM requests, and even within this limited set we observed variation in the results. Finally, our implementation for parsing the LLM responses was relatively simplistic. However, this only means that our reported results are pessimistic: an even larger improvement might be achieved with the LLM-based operator.
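A minimal sketch of the kind of simplistic parsing we mean, assuming the model wraps its suggestion in a markdown code fence; the class name and regular expression are illustrative, not the exact implementation used in our experiments.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: extract the first fenced code block from an LLM reply,
// falling back to the whole reply if no fence is present.
public final class ResponseParser {

    private static final Pattern FENCED =
            Pattern.compile("```(?:java)?\\s*(.*?)```", Pattern.DOTALL);

    public static String extractCode(String llmReply) {
        Matcher m = FENCED.matcher(llmReply);
        return m.find() ? m.group(1).trim() : llmReply.trim();
    }
}

Such a parser discards any suggestion that deviates from the expected format, which is why a more robust parser could only increase the number of usable edits.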
Summary. We found that, although more valid and diverse patches were found with standard edits using Random Sampling, more patches passing the unit tests were found with LLM-based edits. For example, with the LLM edit using the Medium prompt, we found over 75% more patches passing the unit tests than with the classic Insert edit. In our Local Search experiment, we found the best improvement with the Statement edit (508 ms); the best LLM-based improvement was found with the Medium prompt (395 ms). Thus, there is potential in exploring approaches that combine both LLM and ‘classic’ GI edits.
Our experiments revealed that the prompts used for LLM requests greatly affect the results. Thus, in future work, we hope to experiment more with prompt engineering. It might also be helpful to mix prompts: e.g., starting with the medium prompt, then switching to the detailed one to make larger edits that break out of local minima (see the sketch below). Further, the possibility of combining LLM edits with others, such as standard copy/delete/replace/swap edits or PAR templates [11], could be interesting. Finally, we hope to conduct more extensive experimentation on additional test programs.
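As a rough illustration of the prompt-mixing idea, the sketch below (a hypothetical class with placeholder prompt labels, not part of our current implementation) switches from the medium to the detailed prompt once the search has gone a given number of steps without improvement.

// Illustrative sketch: prefer the medium prompt, but switch to the detailed prompt
// after a fixed number of search steps without improvement.
public class PromptSchedule {

    private final int stagnationLimit;
    private int stepsWithoutImprovement = 0;

    public PromptSchedule(int stagnationLimit) {
        this.stagnationLimit = stagnationLimit;
    }

    // Call once per search step, reporting whether the last edit improved fitness.
    public void update(boolean improved) {
        stepsWithoutImprovement = improved ? 0 : stepsWithoutImprovement + 1;
    }

    // Which prompt template to use for the next LLM request.
    public String nextPrompt() {
        return stepsWithoutImprovement >= stagnationLimit ? "DETAILED" : "MEDIUM";
    }
}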
Data Availability. The code, LLM prompts, experimental infrastructure, data from the evaluation, and results are available as open source at [1]. The code is also under the ‘llm’ branch of github.com/gintool/gin (commit 9fe9bdf; branched from master commit 2359f57, pending full integration with Gin).
This paper is available on arXiv under a CC 4.0 license.