Authors:
(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);
(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);
(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);
(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);
(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).
Table of Links
3 Study Design
3.1 Overview and Research Questions
3.3 Mutation Generation via LLMs
4 Evaluation Results
4.1 RQ1: Performance on Cost and Usability
4.3 RQ3: Impacts of Different Prompts
4.4 RQ4: Impacts of Different LLMs
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
5 Discussion
5.1 Sensitivity to Chosen Experiment Settings
5.3 Threats to Validity
The selected LLMs, programming language, datasets, and baseline approaches could threaten the validity of our results. To mitigate this threat, we adopt the most widely studied models (i.e., GPT and CodeLlama), the most popular programming language (i.e., Java), and the most popular dataset, Defects4J. We also employ state-of-the-art mutation testing approaches as baselines, including learning-based (i.e., 𝜇Bert and LEAM) and rule-based (i.e., PIT and Major) approaches.
Another validity threat is potential data leakage, i.e., the data in Defects4J [37] may be included in the training sets of the studied LLMs. To mitigate this threat, we employ another dataset, ConDefects [82], which contains programs and faults created after the release of the LLMs we use and thus carries limited data-leakage risk. Additionally, to increase confidence in our results, we check whether the tools can produce mutations that are exact (syntactic) matches of the studied faults. We hypothesize that if a tool had been tuned on specific fault instances, it would produce at least one mutation that exactly matches the faults we investigate. On Defects4J, GPT, CodeLlama, Major, LEAM, and 𝜇Bert produce 282, 77, 67, 386, and 39 exact matches, respectively, while on ConDefects they produce 7, 9, 13, 8, and 1. These results indicate that, on Defects4J, GPT and LEAM tend to produce significantly more exact matches than the other approaches. Interestingly, Major produces a similar number of exact matches to CodeLlama, and 𝜇Bert produces by far the fewest, indicating minimal or no advantage due to exact matches for all approaches except GPT and LEAM on Defects4J. Perhaps more interestingly, on ConDefects, which none of the tools could have seen, Major has the most exact matches, indicating a minor influence of any data leakage on the reported results. Nevertheless, the LLMs we studied exhibit the same trend on the two datasets, with a Spearman coefficient of 0.943 and a Pearson correlation of 0.944 (both with 𝑝-value less than 0.05), indicating that their performance is consistent across the two datasets.
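To illustrate how such checks might be carried out, the sketch below (not the paper's actual analysis scripts) shows (1) a simple syntactic exact-match test between a generated mutant and a known faulty version, and (2) a Spearman/Pearson consistency check between paired performance measurements on two datasets using SciPy. The helper names and numeric values are illustrative assumptions, not the figures reported above.

```python
# Minimal sketch, assuming a whitespace-insensitive notion of "exact syntactic match"
# and placeholder performance values; function names are hypothetical.
import re
from scipy.stats import spearmanr, pearsonr


def normalize(code: str) -> str:
    """Collapse whitespace so purely formatting differences are ignored."""
    return re.sub(r"\s+", " ", code).strip()


def is_exact_match(mutated_code: str, buggy_code: str) -> bool:
    """Treat a mutant as an exact match if its normalized source equals the real fault's."""
    return normalize(mutated_code) == normalize(buggy_code)


# Hypothetical paired performance measurements of the same LLMs on two datasets
# (placeholder values, NOT the numbers reported in the paper).
defects4j_scores = [0.72, 0.65, 0.58, 0.61]
condefects_scores = [0.70, 0.66, 0.55, 0.60]

rho, rho_p = spearmanr(defects4j_scores, condefects_scores)
r, r_p = pearsonr(defects4j_scores, condefects_scores)
print(f"Spearman: {rho:.3f} (p={rho_p:.3f}), Pearson: {r:.3f} (p={r_p:.3f})")
```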
The different experimental settings may also threaten the validity of our results. To address this threat, we systematically explore the impact of prompts, context length, few-shot examples, and the number of mutations on the performance of LLMs. The results show that the conclusions are highly consistent across different settings.
The subjective nature of human decisions when labeling equivalent mutations and non-compilation errors is another potential threat. To mitigate this threat, we follow a rigorous annotation process in which two co-authors independently annotate each mutation. The final Cohen’s Kappa coefficient indicates a relatively high level of agreement between the two annotators.
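As a rough illustration of the agreement measure mentioned above, the sketch below computes Cohen’s Kappa between two annotators’ labels with scikit-learn; the label vectors are hypothetical placeholders, not our annotation data.

```python
# Minimal sketch of computing inter-annotator agreement with Cohen's Kappa.
# The labels below are hypothetical placeholders, not the paper's annotation data.
from sklearn.metrics import cohen_kappa_score

# 1 = "equivalent mutation", 0 = otherwise; one entry per annotated mutation.
annotator_a = [1, 0, 0, 1, 1, 0, 1, 0]
annotator_b = [1, 0, 1, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.3f}")  # values above ~0.6 are commonly read as substantial agreement
```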