Mutation-based Fault Localization of Deep Neural Networks: Threats to Validity

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Ali Ghanbari, Dept. of Computer Science, Iowa State University;

(2) Deepak-George Thomas, Dept. of Computer Science, Iowa State University;

(3) Muhammad Arbab Arshad, Dept. of Computer Science, Iowa State University;

(4) Hridesh Rajan, Dept. of Computer Science, Iowa State University.

VII. THREATS TO VALIDITY

As with most empirical evaluations, we do not have a working definition of a representative sample of DNN bugs, but we made efforts to ensure that the bugs we used in the evaluation are as representative as possible by making sure that our dataset has diverse examples of bugs from each subcategory of model bugs.

Many of the bugs obtained from StackOverflow did not come with accompanying training datasets. To address this issue, we utilized the dataset generation API provided by scikit-learn [64] to generate synthetic datasets for regression or classification tasks. We ensured that the errors described in each StackOverflow post would manifest when using the synthesized data points and that applying the fix suggested in the accepted response post would eliminate the bug. However, it is possible that this change to the training process may introduce new unknown bugs. To mitigate this risk, we have made our bug benchmark publicly available [36]. Another potential threat to the validity of our results is the possibility of bugs in the construction of deepmufl itself, which could lead to incorrect bug localization. To mitigate this, we make the source code of deepmufl publicly available for other researchers to review and validate the tool.
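As an illustration of the kind of synthetic data generation described here, the following is a minimal sketch using scikit-learn's `make_classification`, `make_regression`, and `train_test_split`. The sample counts, feature counts, split ratio, seed value, and helper names are assumptions chosen for the example; they are not the settings used in the benchmark.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary fixed seed (assumption); any constant makes the data reproducible


def synth_classification(n_samples=1000, n_features=20, n_classes=3, seed=SEED):
    """Generate and split a synthetic classification dataset (hypothetical helper)."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=5,   # enough informative features for 3 classes
        n_redundant=2,
        n_classes=n_classes,
        random_state=seed,
    )
    # The same seed also fixes the train/test split.
    return train_test_split(X, y, test_size=0.2, random_state=seed)


def synth_regression(n_samples=1000, n_features=10, seed=SEED):
    """Generate and split a synthetic regression dataset (hypothetical helper)."""
    X, y = make_regression(
        n_samples=n_samples,
        n_features=n_features,
        noise=0.1,
        random_state=seed,
    )
    return train_test_split(X, y, test_size=0.2, random_state=seed)


X_train, X_test, y_train, y_test = synth_classification()
Xr_train, Xr_test, yr_train, yr_test = synth_regression()
```

Because every call receives the same `random_state`, rerunning the script should reproduce the same data points and the same split, which is what allows a bug and its fix to be checked against identical inputs.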

Another threat to the validity of our results is the potential impact of external factors, such as the stochastic nature of the training process and the synthesized training/testing datasets, as well as system load, on our measurements. To address this, besides using deterministic seeds for dataset generation and splitting, we repeated our experiments with deepmufl three times. Similarly, we also ran the other dynamic tools three times to ensure that their results were not affected by randomness during training. We did not observe any differences in effectiveness between the rounds for either deepmufl or the other studied techniques. Additionally, we repeated the time measurements for each round and reported the average timing, to ensure that our time measurements were not affected by system load. Furthermore, judging whether or not any of the tools detects a bug requires manual analysis of the textual descriptions of the bugs and matching them to the tools' output messages, which might be subject to bias. To mitigate this bias, we have made the output messages of the tools available for other researchers [36].
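The repeat-and-average protocol described above can be sketched as follows. This is only an illustrative harness: `timed_rounds` and `fake_localization` are hypothetical names, and the stand-in function returns a fixed list in place of a real fault localization run.

```python
import time
from statistics import mean


def timed_rounds(run, rounds=3):
    """Call `run` several times; return each round's result and the mean wall-clock time."""
    results, durations = [], []
    for _ in range(rounds):
        start = time.perf_counter()
        results.append(run())
        durations.append(time.perf_counter() - start)
    return results, mean(durations)


def fake_localization():
    # Hypothetical stand-in for one run of a fault localization tool:
    # returns a ranked list of suspicious layers.
    return ["layer_2", "layer_0"]


results, avg_seconds = timed_rounds(fake_localization, rounds=3)

# Effectiveness is considered stable if every round reports the same output.
consistent = all(r == results[0] for r in results)
print(f"consistent across rounds: {consistent}, average time: {avg_seconds:.6f}s")
```

Comparing the per-round outputs checks for effects of randomness, while averaging the durations smooths out transient system load.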

Lastly, deepmufl uses a threshold parameter to compare floating-point values (see §IV-C). In our experiments, we used the default value of 0.001 and verified that smaller threshold values yield the same results.
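To make the sensitivity check concrete, here is a minimal sketch of a threshold-based floating-point comparison and a test that smaller thresholds give the same verdicts. It assumes an absolute-difference comparison, which may differ from the exact rule in §IV-C; the function names and the metric values are hypothetical.

```python
def approx_equal(a: float, b: float, threshold: float = 0.001) -> bool:
    """Treat two values as equal when they differ by less than `threshold`.

    Assumption: an absolute tolerance; the tool's actual comparison may differ.
    """
    return abs(a - b) < threshold


def verdicts(pairs, threshold):
    """Apply the comparison to every (original, mutant) value pair."""
    return [approx_equal(a, b, threshold) for a, b in pairs]


# Hypothetical metric values for an original model and three mutants.
pairs = [(0.912, 0.912), (0.912, 0.500), (0.250, 0.875)]

baseline = verdicts(pairs, 0.001)

# Sensitivity check: do smaller thresholds change any verdict?
stable = all(verdicts(pairs, t) == baseline for t in (1e-4, 1e-5, 1e-6))
```

For these example values the verdicts do not depend on the threshold, because each pair is either identical or far apart; values that differ by an amount close to the threshold would be the cases where the choice matters.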