
An Exploratory Study on Using Large Language Models for Mutation Testing: Experiment Settings


Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Table of Links

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

3.5 Experiment Settings

3.5.1 Settings for Mutation Generation. One major goal of our study is to investigate the similarity between LLM-generated mutations and real bugs. Therefore, we generate mutations within the context of real bugs to enable such a comparison. To determine the number of mutations to generate per prompt, we adopt the setting of PIT [12]: we run PIT with its full set of operators and find that it generates 1.15 mutations per line of code. Therefore, based on the given code context, we generate one mutation per line. We explore the influence of context length on mutation generation in Section 5.
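
To make the per-line setting concrete, the following is a minimal sketch, assuming an OpenAI-style chat client and a hypothetical build_prompt() helper that stands in for the template of Figure 1; it is not the authors' implementation.

```python
# Minimal sketch of the per-line generation setting in Section 3.5.1:
# one requested mutation per non-blank line of the target method, mirroring
# PIT's observed rate of ~1.15 mutations per line. The OpenAI-style client
# and build_prompt() are illustrative assumptions, not the authors' code
# or their exact prompt template (Figure 1).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(method_source: str, target_line: str) -> str:
    """Hypothetical stand-in for the default template of Figure 1."""
    return (
        "You are a mutation testing assistant.\n"
        f"Java method under test:\n{method_source}\n"
        f"Generate exactly one mutant of this line:\n{target_line}\n"
    )


def mutate_method(method_source: str, model: str = "gpt-3.5-turbo") -> list[str]:
    """Request one mutation for each non-blank line of the method."""
    mutants = []
    for line in method_source.splitlines():
        if not line.strip():
            continue  # skip blank lines
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": build_prompt(method_source, line)}],
        )
        mutants.append(response.choices[0].message.content)
    return mutants
```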


3.5.2 Settings for RQ1-RQ2. In the first part of our study, we qualitatively and comparatively evaluate the capability of LLMs in generating mutations. We adopt the default settings, which cover two models (i.e., GPT-3.5-Turbo and CodeLlama) and the default prompt template, and generate mutations on the full set of 405 bugs described in Section 3.2.


To understand the performance of the LLMs, we adopt LEAM [70], 𝜇Bert [15], PIT [12], and Major [39] as baselines. Among these approaches, LEAM is the state-of-the-art mutation generation approach, which employs a model trained on 13 million real Java bugs. 𝜇Bert is based on BERT [20], a masked pre-trained language model that cannot respond conversationally to human-like text and code. PIT [12] and Major [39] are popular traditional mutation testing tools with human-predefined, simple mutation operators. Note that for PIT and Major, we employ all mutation operators.
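
For the traditional baselines, enabling the full operator set is a matter of configuration. The sketch below shows one way to drive PIT from a script, assuming a Maven project whose pom.xml already registers the pitest-maven plugin with the ALL mutator group; the authors' exact harness for PIT and Major is not described here.

```python
# Illustrative sketch only: run the PIT baseline on a Maven project whose
# pom.xml already enables the full mutator set, e.g.
#   <plugin>
#     <groupId>org.pitest</groupId>
#     <artifactId>pitest-maven</artifactId>
#     <configuration><mutators><mutator>ALL</mutator></mutators></configuration>
#   </plugin>
import subprocess


def run_pit(project_dir: str) -> None:
    """Invoke the standard pitest-maven mutation coverage goal."""
    subprocess.run(
        ["mvn", "org.pitest:pitest-maven:mutationCoverage"],
        cwd=project_dir,
        check=True,
    )
```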


3.5.3 Settings for RQ3-RQ4. In the second part of our study, we explore how different prompts and different models impact the performance of mutation generation. Due to budget limitations, we sample a subset of 105 bugs from our dataset, consisting of 10 bugs sampled from each Defects4J project and all ConDefects bugs.
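
The bug sampling can be summarized by the small sketch below; the input data structures and the fixed seed are illustrative assumptions rather than the authors' script.

```python
# Sketch of the RQ3/RQ4 subset: 10 bugs sampled from each Defects4J project
# plus every ConDefects bug (105 bugs in total). Inputs and seed are assumed.
import random


def sample_bugs(defects4j: dict[str, list[str]], condefects: list[str],
                per_project: int = 10, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    subset: list[str] = []
    for project, bugs in defects4j.items():
        subset.extend(rng.sample(bugs, min(per_project, len(bugs))))
    subset.extend(condefects)  # all ConDefects bugs are kept
    return subset
```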


In RQ3, to explore prompts, we use the GPT-3.5-Turbo model and switch between different prompt templates. More specifically, we modify the default prompt template by adding and removing different sources of information in the Context, obtaining four prompt variants, listed below (a sketch of how they can be assembled follows the list):


(1) Prompt 1 (P1): The Default Prompt. P1 is the default prompt, shown in Figure 1.


(2) Prompt 2 (P2): P1 without Few-Shot Examples. P2 is derived from P1 by removing all few-shot examples.


(3) Prompt 3 (P3): P2 without the Whole Java Method. P3 is derived from P2 by removing the surrounding Java method code, keeping only the target code element to be mutated.


(4) Prompt 4 (P4): P1 with Unit Tests. Based on the default prompt P1, P4 further adds the source code of the target method's corresponding unit tests to the Context.
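
As referenced above, the four variants differ only in which pieces of the Context are included; the sketch below shows one way such variants could be assembled. The field names and placeholder handling are assumptions, not the actual template of Figure 1.

```python
# Illustrative assembly of the four prompt variants (P1-P4) by toggling
# pieces of the Context block; placeholders stand in for Figure 1's template.
from dataclasses import dataclass


@dataclass
class PromptConfig:
    few_shot: bool      # few-shot examples (dropped in P2 and P3)
    whole_method: bool  # surrounding Java method (dropped in P3)
    unit_tests: bool    # unit tests of the target method (added in P4)


VARIANTS = {
    "P1": PromptConfig(few_shot=True,  whole_method=True,  unit_tests=False),
    "P2": PromptConfig(few_shot=False, whole_method=True,  unit_tests=False),
    "P3": PromptConfig(few_shot=False, whole_method=False, unit_tests=False),
    "P4": PromptConfig(few_shot=True,  whole_method=True,  unit_tests=True),
}


def build_context(cfg: PromptConfig, target_line: str, method_src: str,
                  examples: str, tests_src: str) -> str:
    """Concatenate the enabled Context pieces for one prompt variant."""
    parts = [examples] if cfg.few_shot else []
    parts.append(method_src if cfg.whole_method else target_line)
    if cfg.unit_tests:
        parts.append(tests_src)
    return "\n\n".join(parts)
```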


In RQ4, to explore different LLMs, we use the default prompt and investigate the performance of GPT-3.5-Turbo, GPT-4-Turbo, CodeLlama-13b-Instruct, and StarChat-𝛽-16b.


3.5.4 Settings for RQ5. Generating non-compilable mutations incurs additional, meaningless cost, so in the third part of our study we investigate the root causes behind LLMs generating non-compilable mutations. Specifically, we sample 384 non-compilable mutations from each mutation generation approach and analyze both the reasons they are rejected by the Java compiler and the features of their surrounding code context.
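
The compiler-rejection analysis can be sketched as below, assuming each sampled mutant is written back into its source file and compiled with javac; the paths, classpath handling, and error grouping are illustrative, not the authors' tooling.

```python
# Sketch of the RQ5 step: compile a sampled mutant with javac and record the
# diagnostic explaining why it is rejected. Classpath handling is simplified.
import subprocess


def compile_mutant(java_file: str, classpath: str) -> str | None:
    """Return the first javac diagnostic for a non-compilable mutant,
    or None if the mutant actually compiles."""
    result = subprocess.run(
        ["javac", "-cp", classpath, java_file],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return None
    # javac writes diagnostics such as "error: ';' expected" to stderr;
    # the first line usually names the error type used for categorization.
    return result.stderr.splitlines()[0] if result.stderr else "unknown error"
```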


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

