
An Exploratory Study on Using Large Language Models for Mutation Testing: Datasets


Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Table of Links

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

3.2 Datasets

We intend to evaluate our approach with real bugs, and thus we need to use bug datasets with the following properties:


Table 1: Real Bugs Used in Our Experiment


• The datasets should comprise Java programs, because existing methods are primarily based on Java and we need to compare against them.


• The bugs in the datasets should be real-world bugs, so that we can compare mutations against real bugs.


• Every bug in the datasets should have a correct fixed version provided by the developers, so that we can mutate the fixed version and compare the mutations against the corresponding real bugs.


• Every bug should be accompanied by at least one bug-triggering test, because we need to measure whether the mutations affect the execution of these bug-triggering tests (see the sketch after this list).
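
The last two criteria work together: a mutation applied to the developer-fixed version can be compared with the corresponding real bug by checking whether the bug-triggering test that exposes the real bug also fails on the mutated version. The sketch below illustrates this check under that reading; TestRunner, Outcome, and the directory parameters are hypothetical names for illustration, not the paper's implementation.

public class CouplingCheck {

    /** Outcome of running one test against one program version. */
    enum Outcome { PASS, FAIL }

    /** Hypothetical runner that compiles a source tree and executes one JUnit test. */
    interface TestRunner {
        Outcome run(String sourceDir, String bugTriggeringTest);
    }

    /**
     * The mutation behaves like the real bug on this test if the developer-fixed
     * version passes the bug-triggering test while the mutated fixed version fails it.
     */
    static boolean affectsBugTriggeringTest(TestRunner runner,
                                            String fixedVersionDir,
                                            String mutatedVersionDir,
                                            String bugTriggeringTest) {
        Outcome onFixed = runner.run(fixedVersionDir, bugTriggeringTest);
        Outcome onMutant = runner.run(mutatedVersionDir, bugTriggeringTest);
        return onFixed == Outcome.PASS && onMutant == Outcome.FAIL;
    }
}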


To this end, we employ Defects4J v1.2.0 [37] and ConDefects [82] to evaluate the mutation generation approaches, as shown in Table 1. In total, we conduct the experiments on 440 bugs.


Defects4J is a widely used benchmark in the field of mutation testing [15, 38, 40, 43, 55, 70]. It contains historical bugs from 6 open-source projects of diverse domains, ensuring a broad representation of real-world bugs; in total, these 6 projects contain 395 real bugs. However, from Table 2 and Table 1, we observe that the Defects4J bugs all predate the LLMs' training time, which may introduce data leakage. We therefore supplement it with another dataset, ConDefects [82], which is designed to address the data-leakage concern. ConDefects consists of tasks from the AtCoder [2] programming contests. To prevent data leakage, we exclusively use bugs reported after the LLMs' release date, specifically those identified on or after August 31, 2023; in total, we collect 45 Java programs from ConDefects.
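
For concreteness, the following sketch shows the kind of date-based filtering described above; the Task record and its identifiedOn field are hypothetical placeholders for illustration, not the actual ConDefects API.

import java.time.LocalDate;
import java.util.List;

public class ConDefectsFilter {

    /** Hypothetical representation of one ConDefects Java task. */
    record Task(String taskId, LocalDate identifiedOn) {}

    /** Cut-off chosen so that kept bugs postdate the studied LLMs' release date. */
    static final LocalDate CUTOFF = LocalDate.of(2023, 8, 31);

    /** Keep only the tasks whose bugs were identified on or after the cut-off. */
    static List<Task> afterCutoff(List<Task> tasks) {
        return tasks.stream()
                .filter(t -> !t.identifiedOn().isBefore(CUTOFF))
                .toList();
    }
}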


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

