This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Ali Ghanbari, Dept. of Computer Science, Iowa State University;
(2) Deepak-George Thomas, Dept. of Computer Science, Iowa State University;
(3) Muhammad Arbab Arshad, Dept. of Computer Science, Iowa State University;
(4) Hridesh Rajan, Dept. of Computer Science, Iowa State University.
Deep neural networks (DNNs) are susceptible to bugs, just like other types of software systems. A significant uptick in the use of DNNs, and their application in wide-ranging areas, including safety-critical systems, warrants extensive research on software engineering tools for improving the reliability of DNN-based systems. One such tool that has gained significant attention in recent years is DNN fault localization. This paper revisits mutation-based fault localization in the context of DNN models and proposes a novel technique, named deepmufl, applicable to a wide range of DNN models. We have implemented deepmufl and evaluated its effectiveness using 109 bugs obtained from StackOverflow. Our results show that deepmufl detects 53/109 of the bugs by ranking the buggy layer in the top-1 position, outperforming state-of-the-art static and dynamic DNN fault localization systems that also target the class of bugs supported by deepmufl. Moreover, we observed that the fault localization time for a pre-trained model can be halved using mutation selection, while losing only 7.55% of the bugs localized in the top-1 position.
Index Terms—Deep Neural Network, Mutation, Fault Localization
Software bugs [1] are a common and costly problem in modern software systems, costing the global economy billions of dollars annually [2]. Recently, data-driven solutions have gained significant attention for their ability to efficiently and cost-effectively solve complex problems. With the advent of powerful computing hardware and an abundance of data, the use of deep learning [3], which is based on deep neural networks (DNNs), has become practical. Despite their increasing popularity and success stories, DNN models, like any other software, may contain bugs [4], [5], [6], [7], which can undermine their safety and reliability in various applications. Detecting DNN bugs is not easier than detecting bugs in traditional programs, i.e., programs without any data-driven component in them, as DNNs depend on the properties of the training data and numerous hyperparameters [8]. Mitigating DNN bugs has been the subject of fervent research in recent years, and various techniques have been proposed for testing [9], [10], fault localization [11], [12], and repair [13], [14] of DNN models.
Fault localization in the context of traditional programs has been extensively studied [15], with one well-known approach being mutation-based fault localization (MBFL) [16], [17]. This approach is based on mutation analysis [18], which is mainly used to assess the quality of a test suite by measuring the ratio of artificially introduced bugs that it can detect. MBFL improves upon the more traditional, lightweight spectrum-based fault localization [19], [20], [21], [22], [23], [24] by uniquely capturing the relationship between individual statements in the program and the observed failures. While both spectrum-based fault localization [25], [26] and mutation analysis [27], [28], [29] have been studied in the context of DNNs, to the best of our knowledge, MBFL for DNNs has not been explored by the research community, and the existing MBFL approaches are not directly applicable to DNN models.
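To make the contrast concrete, recall one common formulation of the two MBFL families from the literature (notation ours, for illustration; it may differ from the exact formulas used later in this paper). MUSE [17] scores a program element $s$ by aggregating outcome flips over its mutants $\mathrm{mut}(s)$, while Metallaxis [30] applies a spectrum-based formula, e.g., Ochiai, to per-mutant kill information:

\[
\mathrm{Susp}_{\mathrm{MUSE}}(s) \;=\; \frac{1}{|\mathrm{mut}(s)|} \sum_{m \in \mathrm{mut}(s)} \left( \frac{f2p(m)}{|F|} \;-\; \alpha \cdot \frac{p2f(m)}{|P|} \right),
\qquad
\mathrm{Susp}_{\mathrm{Met}}(s) \;=\; \max_{m \in \mathrm{mut}(s)} \frac{k_F(m)}{\sqrt{|F| \,\bigl(k_F(m) + k_P(m)\bigr)}},
\]

where $F$ and $P$ are the failing and passing tests of the original program, $f2p(m)$ and $p2f(m)$ count tests whose outcome flips under mutant $m$, $k_F(m)$ and $k_P(m)$ count failing and passing tests that kill $m$, and $\alpha$ is a balancing factor.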
This paper revisits the idea of MBFL in the context of DNNs. Specifically, we design, implement, and evaluate a technique, named deepmufl, to conduct MBFL in pre-trained DNN models. The basic idea behind deepmufl is derived from its traditional MBFL counterparts, namely, Metallaxis [30] and MUSE [17], which are based on measuring the impact of mutations on passing and failing test cases (see §II for more details). In summary, given a pre-trained model and a set of data points, deepmufl separates the data points into two sets of “passing” and “failing” data points (test cases), depending on whether the output of the model matches the ground truth. deepmufl then localizes the bug in two phases, namely a mutation generation phase and a mutation testing/execution phase. In the mutation generation phase, it uses 79 mutators, a.k.a. mutation operators, to systematically mutate the model, e.g., by replacing the activation function of a layer, so as to generate a pool of mutants, i.e., model variants with seeded bugs. In the mutation testing phase, deepmufl feeds each of the mutants with the passing and failing data points and compares its output to that of the original model to record the number of passing and failing test cases impacted by the injected bugs. In this paper, we study two types of impacts: type 1 impact, à la MUSE, which tracks only fail-to-pass and pass-to-fail transitions, and type 2 impact, like Metallaxis, which tracks changes in the actual output values. deepmufl uses these numbers to calculate suspiciousness values for each layer according to MUSE, as well as two variants of Metallaxis formulas. The layers are then sorted in descending order of their suspiciousness values for the developer to inspect.
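The two-phase loop described above, with type 1 (MUSE-style) impact counting, can be sketched as follows. The sketch is illustrative only: models and mutants are stood in by plain callables, the layer-indexed `mutants` mapping and the fixed balancing factor are our simplifications, and the function names are hypothetical, not deepmufl's actual API.

```python
def mbfl_rank(model, mutants, data):
    """Rank layers by a MUSE-style suspiciousness score (illustrative sketch).

    `model` and every mutant are callables mapping an input to a prediction;
    `mutants` maps a layer index to the list of mutants seeded in that layer;
    `data` is a list of (input, ground_truth) pairs.
    """
    # Split the data points into "passing" and "failing" test cases,
    # depending on whether the model's output matches the ground truth.
    passing = [(x, y) for x, y in data if model(x) == y]
    failing = [(x, y) for x, y in data if model(x) != y]

    scores = {}
    for layer, layer_mutants in mutants.items():
        f2p = p2f = 0
        for m in layer_mutants:
            # Type 1 impact: count outcome flips under the mutant.
            f2p += sum(1 for x, y in failing if m(x) == y)
            p2f += sum(1 for x, y in passing if m(x) != y)
        # MUSE-style score: reward fail-to-pass flips, penalize
        # pass-to-fail flips (balancing factor fixed to 1 here).
        scores[layer] = (f2p / max(1, len(failing))
                         - p2f / max(1, len(passing))) / len(layer_mutants)

    # Layers sorted in descending order of suspiciousness for inspection.
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, a mutant that turns a failing test case into a passing one raises its layer's score, whereas a mutant that breaks previously passing test cases lowers it, so the layer hosting the fix-inducing mutant rises to the top of the ranking.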
We have implemented deepmufl on top of Keras [31], and it supports three types of DNN models for regression, as well as classification, tasks that must be written using the Sequential API of Keras: fully-connected DNN, convolutional neural network (CNN), and recurrent neural network (RNN). Extending deepmufl to other libraries, e.g., TensorFlow [32] and PyTorch [33], as well as potentially to other model architectures, e.g., the functional model architecture in Keras, is a matter of investing engineering effort in the development of new mutators tailored to such libraries and models. Since the current implementation of deepmufl operates on pre-trained models, its scope is limited to model bugs [7], i.e., bugs related to activation function, layer properties, model properties, and bugs due to missing/redundant/wrong layers (see §V).
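As an illustration of what one such mutator might look like, the sketch below implements a "replace activation function" mutator over a plain-dict stand-in for a Sequential layer configuration. The field names, the activation list, and the generator interface are all our own illustrative assumptions; deepmufl's actual mutators operate on real Keras models.

```python
import copy

# Candidate replacement activations (illustrative subset).
ACTIVATIONS = ["relu", "sigmoid", "tanh", "softmax"]

def replace_activation_mutants(layers):
    """Yield (layer_index, mutant_config) for every possible single-layer
    activation replacement, leaving the original configuration untouched."""
    for i, layer in enumerate(layers):
        original = layer.get("activation")
        if original is None:
            continue  # nothing to mutate in, e.g., a Dropout layer
        for alt in ACTIVATIONS:
            if alt == original:
                continue
            mutant = copy.deepcopy(layers)  # each mutant seeds exactly one bug
            mutant[i]["activation"] = alt
            yield i, mutant
```

Applying all 79 mutators of deepmufl in this one-change-per-mutant fashion yields the pool of model variants with seeded bugs that the testing phase then executes.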
We have evaluated deepmufl using a diverse set of 109 Keras bugs obtained from StackOverflow. These bugs are representative of the above-mentioned model bugs, in that our dataset contains examples of each bug sub-category at different layers of models suited for different tasks. For example, concerning the wrong activation function sub-category of model bugs, we have bugs in regression and classification fully-connected DNN, CNN, and RNN models that have wrong activation functions of different types (e.g., ReLU, softmax, etc.) at different layers. For 53 of the bugs, deepmufl, using its MUSE configuration, pinpoints the buggy layer by ranking it in the top-1 position. We have compared deepmufl’s effectiveness to that of the state-of-the-art static and dynamic DNN fault localization systems Neuralint [12], DeepLocalize [11], DeepDiagnosis [8], and UMLAUT [34], which are also designed to detect model bugs. Our results show that, in our bug dataset, deepmufl, in its MUSE configuration, is 77% more effective than DeepDiagnosis, which detects 30 of the bugs.
Despite this advantage of deepmufl in terms of effectiveness, since it operates on a pre-trained model, it is slower than state-of-the-art DNN fault localization tools from an end-user’s perspective. However, this is mitigated, to some extent, by the fact that, similar to traditional programs, one can perform mutation selection [35] to curtail the mutation testing time: we observed that by randomly selecting 50% of the mutants for testing, we can still find 49 of the bugs in the top-1 position, while halving the fault localization time after training the model.
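Random mutation selection itself is straightforward; a minimal sketch of uniform sampling from the mutant pool (the function name and fixed-seed interface are our illustrative choices, not deepmufl's API) might look like:

```python
import random

def select_mutants(mutants, ratio=0.5, seed=0):
    """Uniformly sample a fraction of the mutant pool for testing.

    With ratio=0.5, half of the mutants are executed, roughly halving
    the mutation testing time at the cost of some localization accuracy.
    """
    rng = random.Random(seed)  # fixed seed for reproducible experiments
    k = max(1, int(len(mutants) * ratio))
    return rng.sample(mutants, k)
```

Because the mutation testing phase dominates deepmufl's end-to-end time, executing only the sampled mutants translates almost directly into the reported speedup.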
In summary, this paper makes the following contributions.
• Technique: We develop MBFL for DNNs and implement it in a novel tool, named deepmufl, that can be uniformly applied to a wide range of DNN model types.
• Study: We compare deepmufl to state-of-the-art static and dynamic fault localization approaches and observe the following:
– In four configurations, deepmufl outperforms the other approaches in terms of the number of bugs that appear in the top-1 position, and it detects 21 bugs that none of the other studied techniques were able to detect.
– We can halve the fault localization time for a pre-trained model by random mutation selection without significant loss of effectiveness.
• Bug Dataset: We have created the largest curated dataset of model bugs, comprising 109 Keras models ranging from regression to classification and from fully-connected DNNs to CNNs and RNNs.
Paper organization. In the next section, we review the concepts of DNNs, mutation analysis, and MBFL. In §III, we present a motivating example and discuss how deepmufl works under the hood. In §IV, we present the technical details of the proposed approach, before discussing the scope of deepmufl in §V. In §VI, we present the results of our experiments with deepmufl and state-of-the-art DNN fault localization tools from different aspects. We discuss threats to validity in §VII and conclude the paper in §IX.
Data availability. The source code of deepmufl and the data associated with our experiments are publicly available [36].