    Standard GI Mutations vs. LLM Edits in Random Sampling and Local Search

    by The Mutation Publication, February 27th, 2024

    Too Long; Didn't Read

    This section presents the results of experiments comparing standard Genetic Improvement (GI) mutations with Large Language Model (LLM) edits under both random sampling and local search. The findings cover patch validity, compilation rates, unit test pass rates, patch diversity, and runtime improvements, and shed light on how effective LLMs are as a source of edits for software evolution.

    Authors:

    (1) Alexander E.I. Brownlee, University of Stirling, UK;

    (2) James Callan, University College London, UK;

    (3) Karine Even-Mendoza, King’s College London, UK;

    (4) Alina Geiger, Johannes Gutenberg University Mainz, Germany;

    (5) Justyna Petke, University College London, UK;

    (6) Federica Sarro, University College London, UK;

    (7) Carol Hanna, University College London, UK;

    (8) Dominik Sobania, Johannes Gutenberg University Mainz, Germany.

    Abstract & Introduction

    Experimental Setup

    Results

    Conclusions and Future Work

    Acknowledgements & References

    3 Results

    The first experiment uses Random Sampling to compare the standard GI mutations, namely Insert and Statement edits, with LLM edits generated from prompts at three levels of detail (Simple, Medium, and Detailed). Table 1 shows results for all patches as well as for unique patches only. We report how many patches were successfully parsed by JavaParser (Valid), how many compiled (Compiled), and how many passed all unit tests (Passed). We excluded patches syntactically equivalent to the original software. Best results are in bold.
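The bookkeeping behind these counts can be sketched as follows. This is a minimal illustration, not the authors' tooling: the `PatchResult` record, the `tally` function, and the use of a plain string to identify a patch are all assumptions made for the example, and "equivalent to the original" is approximated by an exact string match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchResult:
    # Hypothetical record for one sampled patch (not from the paper).
    patch: str      # textual identity of the patched code
    valid: bool     # parsed successfully
    compiled: bool  # compiled successfully
    passed: bool    # passed all unit tests


def tally(results, original):
    """Count Valid / Compiled / Passed over all patches and over unique patches."""
    # Exclude patches identical to the original software.
    kept = [r for r in results if r.patch != original]

    def count(rs):
        return {
            "patches": len(rs),
            "valid": sum(r.valid for r in rs),
            "compiled": sum(r.compiled for r in rs),
            "passed": sum(r.passed for r in rs),
        }

    # Keep one representative per distinct patch text.
    unique = list({r.patch: r for r in kept}.values())
    return {"all": count(kept), "unique": count(unique)}
```

Counting both views matters because an operator that keeps proposing the same passing edit scores well on the "all" row while adding little to the "unique" row.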


    We see that although substantially more valid patches were found with the standard Insert and Statement edits, more passing patches were found using the LLM-generated edits. In particular, with the Medium and Detailed prompts, 292 and 230 patches passed the unit tests, respectively. With the Insert and Statement edits, only 166 and 91 passed the unit tests, respectively. Anecdotally, the hot methods with the lowest and highest patch pass rates differed for each operator; understanding this variation will be interesting for future investigation.


    It is also notable that LLM patches are less diverse: the standard mutation operators found over 50% more unique patches than the LLM with the Medium and Detailed prompts. With the Simple prompt, however, not a single patch passed the unit tests, since the suggested edits often could not be parsed. Thus, detailed prompts are necessary to force the LLM to generate usable outputs.


    Table 1. Results of our Random Sampling experiment. We exclude patches syntactically equivalent to the original software in this table. For all and unique patches we report how many patches passed JavaParser, compiled, and passed all unit tests.


    Table 2. Local Search results. We exclude all empty patches. We report how many patches compiled, passed all unit tests, and how many led to improvements in runtime. We report the best improvement found and the median improvement among improving patches.


    We further investigated the differences between the Medium and Detailed prompts to understand why Detailed performed worse on the unique patch sets, where Medium produced more compiled and passing patches. The two prompt levels generated the same response in 42 cases (out of the total unique valid cases). However, Detailed tended to generate longer responses, averaging 363 characters against 304 for Medium. We manually examined several responses to the Detailed prompt and found that some included variables from other files, potentially offering a significant expansion of the set of code variants GI can explore.
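A comparison of this kind can be sketched in a few lines. The function name `compare_responses` is hypothetical, and the sketch assumes that "the same response" means an exact string match between distinct responses; the source does not say how the overlap was actually measured.

```python
from statistics import mean


def compare_responses(medium, detailed):
    """Compare two non-empty collections of LLM response strings."""
    # Responses that appear verbatim under both prompt levels.
    shared = set(medium) & set(detailed)
    return {
        "identical": len(shared),
        "mean_len_medium": mean(len(r) for r in medium),
        "mean_len_detailed": mean(len(r) for r in detailed),
    }
```

For example, `compare_responses(["x = 1;", "return a;"], ["x = 1;", "return a + offset;"])` reports one identical response and a longer mean length for the second collection.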


    The second experiment expands our analysis by comparing the performance of the standard and LLM edits with Local Search. Table 2 shows the results of the Local Search experiment. We report the number of compiling and passing patches as well as the number of patches where runtime improvements were found. Furthermore, we report the median and best improvement in milliseconds (ms). In the table, we excluded all empty patches. As before, best results are in bold.
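The summary statistics described here (number of improving patches, best improvement, and median improvement among improving patches only) can be sketched as below. The function `summarise_improvements` and its inputs are illustrative assumptions: a single baseline runtime and one measured runtime per passing patch, both in milliseconds.

```python
from statistics import median


def summarise_improvements(baseline_ms, patch_runtimes_ms):
    """Summarise runtime improvements over a baseline, in milliseconds."""
    # Only patches that run strictly faster than the baseline count as improving.
    improvements = [baseline_ms - t for t in patch_runtimes_ms if t < baseline_ms]
    if not improvements:
        return {"improving": 0, "best_ms": None, "median_ms": None}
    return {
        "improving": len(improvements),
        "best_ms": max(improvements),
        "median_ms": median(improvements),
    }
```

Taking the median over improving patches only, rather than over all patches, keeps the figure from being dragged down by patches that pass the tests but run slower.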


    Again, we see that more patches passing the unit tests were found with the LLM using the Medium and Detailed prompts. In addition, more improvements were found by using the LLM with these prompts. Specifically, with Medium and Detailed, we found 164 and 196 improvements, respectively, while we found only 136 with Insert and 71 with Statement. The best improvement overall, 508 ms, was found with the Statement edit. The best improvement found using LLMs (with the Medium prompt) reduced the runtime by only 395 ms.


    We also examined a series of edits from the Local Search results to understand how the Medium and Detailed prompts differ, given the low compilation rate of the Detailed prompt's responses. In the example, a sequence of edits aimed to inline a call to the function clip. The Detailed prompt tried to incorporate the call almost immediately, within a few edits, likely leading to invalid code. The Medium prompt, on the other hand, made less radical changes, gradually refining the code. It began by replacing the ternary operator expression with an if-then-else statement and system function calls before eventually attempting to inline the clip function call.
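The gradual sequence described for the Medium prompt can be illustrated with a Python analogue. The code under study is Java and the source does not show it, so everything here is hypothetical: the body of `clip`, the surrounding function, the 0 to 255 range, and the use of the built-ins `min` and `max` to stand in for "system function calls".

```python
def clip(v, lo=0, hi=255):
    """Hypothetical helper: clamp v to the range [lo, hi]."""
    return lo if v < lo else hi if v > hi else v


def original(v, invert):
    # Starting point: a conditional expression plus a call to clip.
    return clip(255 - v if invert else v)


def step_if_else(v, invert):
    # Small edit: the conditional expression becomes an if statement.
    if invert:
        v = 255 - v
    return clip(v)


def step_inlined(v, invert):
    # Later edit: the clip call is inlined using built-in min/max.
    if invert:
        v = 255 - v
    return max(0, min(255, v))
```

Each intermediate version behaves identically to the original, so it would keep passing the unit tests at every step. Attempting the final form in one jump leaves more room for a single mistake to produce code that does not compile.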


    This paper is available on arXiv under a CC 4.0 license.

