
An Exploratory Study on Using Large Language Models for Mutation Testing: Threats to Validity


Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Table of Links

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

5.3 Threats to Validity

The selected LLMs, programming language, datasets, and baseline approaches could pose a threat to the validity of our results. To mitigate this threat, we adopt the most widely studied models (i.e., GPT and CodeLlama), the most popular programming language (i.e., Java), and the most widely used dataset (i.e., Defects4J). We also employ state-of-the-art mutation testing approaches as baselines, including learning-based approaches (i.e., 𝜇Bert and LEAM) and rule-based approaches (i.e., PIT and Major).


Another threat to validity is data leakage, i.e., the possibility that the data in Defects4J [37] is included in the training sets of the studied LLMs. To mitigate this threat, we employed an additional dataset, ConDefects [82], which contains programs and faults created after the release of the LLMs we use and thus carries a limited risk of data leakage. Additionally, to increase confidence in our results, we checked whether the tools produce mutations that are exact (syntactic) matches of the studied faults. We hypothesize that if a tool had been tuned on specific fault instances, it would introduce at least one mutation that exactly matches the faults we investigate. On Defects4J, GPT, CodeLlama, Major, LEAM, and 𝜇Bert produce 282, 77, 67, 386, and 39 exact matches, respectively; on ConDefects, they produce 7, 9, 13, 8, and 1. These results indicate that on Defects4J, GPT and LEAM tend to produce significantly more exact matches than the other approaches, while Major produces a number of exact matches similar to CodeLlama, and 𝜇Bert yields by far the fewest, indicating minimal or no advantage due to exact matches for any of the approaches except GPT and LEAM (in the case of Defects4J). Perhaps more interestingly, on the ConDefects dataset, which none of the tools has seen, Major produces the most exact matches, indicating a minor influence of any data leakage on the reported results. Nevertheless, the LLMs we studied exhibit the same trend on the two datasets, with a Spearman coefficient of 0.943 and a Pearson correlation of 0.944, both with a 𝑝-value below 0.05, indicating that their performance is similar across the two datasets.
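To make these two checks concrete, the sketch below shows how an exact-match test and the cross-dataset correlations could be computed. This is illustrative only, not the study's analysis code: `is_exact_match` is a hypothetical helper, the score vectors are invented placeholders (not the paper's data), and SciPy is assumed to be available.

```python
# Illustrative sketch only; not the study's analysis code.
from scipy.stats import pearsonr, spearmanr

def is_exact_match(mutant_src: str, faulty_src: str) -> bool:
    """Treat a mutant as a syntactic exact match of a real fault when the
    two sources are identical after normalizing insignificant whitespace."""
    normalize = lambda s: " ".join(s.split())
    return normalize(mutant_src) == normalize(faulty_src)

# Hypothetical per-approach performance scores on each dataset (placeholders).
scores_defects4j = [0.61, 0.58, 0.64, 0.55, 0.60]
scores_condefects = [0.57, 0.55, 0.62, 0.50, 0.58]

# Rank-based and linear agreement between performance on the two datasets.
rho, p_rho = spearmanr(scores_defects4j, scores_condefects)
r, p_r = pearsonr(scores_defects4j, scores_condefects)
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"Pearson r = {r:.3f} (p = {p_r:.3f})")
```

A high coefficient from both tests, as reported above, suggests the tools' relative performance is stable across datasets rather than inflated by memorized training data.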


The choice of experimental settings may also threaten the validity of our results. To address this threat, we systematically explore the impact of prompts, context length, few-shot examples, and the number of mutations on the performance of LLMs. The results show that performance remains highly consistent across these settings.


The subjective nature of human decisions in labeling equivalent mutations and non-compilation errors is another potential threat. To mitigate it, we follow a rigorous annotation process in which two co-authors independently annotated each mutation. The final Cohen's Kappa coefficient indicates a relatively high level of agreement between the two annotators.
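For reference, Cohen's Kappa for two annotators can be computed as in the minimal sketch below. The labels are invented placeholders, not the study's annotations, and scikit-learn is assumed to be available.

```python
# Illustrative sketch only; the annotation labels are invented placeholders.
from sklearn.metrics import cohen_kappa_score

# 1 = equivalent mutation, 0 = non-equivalent (hypothetical labels from
# two independent annotators over the same set of mutations).
annotator_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
annotator_b = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa = {kappa:.3f}")
```

Unlike raw percent agreement, Kappa corrects for the agreement expected by chance, which is why it is the standard choice for this kind of two-annotator labeling check.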


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

