This paper is available on arXiv under the CC 4.0 license.
Authors:
(1) Ali Ghanbari, Dept. of Computer Science, Iowa State University;
(2) Deepak-George Thomas, Dept. of Computer Science, Iowa State University;
(3) Muhammad Arbab Arshad, Dept. of Computer Science, Iowa State University;
(4) Hridesh Rajan, Dept. of Computer Science, Iowa State University.
This paper revisits mutation-based fault localization in the context of deep neural networks (DNNs) and presents a novel DNN fault localization technique, named deepmufl. The technique mutates a pre-trained DNN model and calculates suspiciousness values according to the Metallaxis and MUSE approaches, the Ochiai and SBI formulas, and two types of impact that a mutation can have on the results of test data points. deepmufl is compared to state-of-the-art static and dynamic fault localization systems [11], [8], [34], [12] on a benchmark of 109 model bugs. On this benchmark, although deepmufl is slower than the other tools, it proved almost twice as effective in terms of the total number of bugs detected, and it detected 21 bugs that none of the studied tools were able to detect. We further studied the impact of mutation selection on fault localization time and observed that the time taken by deepmufl can be halved while losing only 7.55% of the previously detected bugs.
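To make the scoring concrete, the following minimal Python sketch (not the authors' implementation; the function names, the mutant records, and the aggregation choice are illustrative assumptions) computes Metallaxis-style suspiciousness for DNN layers from the number of failing and passing test data points whose results each mutant impacts, using the Ochiai and SBI formulas mentioned above.

import math

def ochiai(impacted_fail, impacted_pass, total_fail):
    # Ochiai: correlation between a mutant's impact and the failing tests.
    denom = math.sqrt(total_fail * (impacted_fail + impacted_pass))
    return impacted_fail / denom if denom else 0.0

def sbi(impacted_fail, impacted_pass, _total_fail):
    # SBI: fraction of the impacted tests that were failing.
    denom = impacted_fail + impacted_pass
    return impacted_fail / denom if denom else 0.0

def rank_layers(mutants, total_fail, formula=ochiai):
    # Metallaxis-style aggregation (illustrative): a layer is as suspicious
    # as its most suspicious mutant; layers are ranked in descending order.
    scores = {}
    for m in mutants:
        s = formula(m["impacted_fail"], m["impacted_pass"], total_fail)
        scores[m["layer"]] = max(scores.get(m["layer"], 0.0), s)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical mutant impact data: the layer each mutant targets and how many
# failing/passing test data points changed their result under that mutant.
mutants = [
    {"layer": 2, "impacted_fail": 8, "impacted_pass": 1},
    {"layer": 2, "impacted_fail": 5, "impacted_pass": 0},
    {"layer": 0, "impacted_fail": 1, "impacted_pass": 6},
]
print(rank_layers(mutants, total_fail=10))  # layer 2 ranks above layer 0

In this sketch a mutant impacting mostly failing test data points (layer 2) outranks one impacting mostly passing ones (layer 0), which is the intuition behind mutation-based fault localization.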
The authors thank the anonymous ASE 2023 reviewers for their valuable feedback. We also thank Mohammad Wardat for his instructions on querying StackOverflow. This material is based upon work supported by the National Science Foundation (NSF) under the grant #2127309 to the Computing Research Association for the CIFellows Project. This work is also partially supported by the NSF grants #2223812, #2120448, and #1934884. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
[1] IEEE Standard Classification for Software Anomalies, 2010.
[2] A. McPeak, “What’s the true cost of a software bug?” https://smartbear.com/blog/software-bug-cost/, 2017, accessed 08/10/23.
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug characteristics,” in ESEC/FSE, 2019, pp. 510–520.
[5] M. J. Islam, R. Pan, G. Nguyen, and H. Rajan, “Repairing deep neural networks: Fix patterns and challenges,” in ICSE, 2020, pp. 1135–1146.
[6] Y. Zhang, Y. Chen, S.-C. Cheung, Y. Xiong, and L. Zhang, “An empirical study on tensorflow program bugs,” in ISSTA, 2018, pp. 129–140.
[7] N. Humbatova, G. Jahangirova, G. Bavota, V. Riccio, A. Stocco, and P. Tonella, “Taxonomy of real faults in deep learning systems,” in ICSE, 2020, pp. 1110–1121.
[8] M. Wardat, B. D. Cruz, W. Le, and H. Rajan, “Deepdiagnosis: automatically diagnosing faults and recommending actionable fixes in deep learning programs,” in ICSE, 2022, pp. 561–572.
[9] K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated whitebox testing of deep learning systems,” in SOSP, 2017, pp. 1–18.
[10] J. Kim, R. Feldt, and S. Yoo, “Guiding deep learning system testing using surprise adequacy,” in ICSE, 2019, pp. 1039–1049.
[11] M. Wardat, W. Le, and H. Rajan, “Deeplocalize: fault localization for deep neural networks,” in ICSE, 2021, pp. 251–262.
[12] A. Nikanjam, H. B. Braiek, M. M. Morovati, and F. Khomh, “Automatic fault detection for deep learning programs using graph transformations,” TOSEM, vol. 31, no. 1, pp. 1–27, 2021.
[13] M. Usman, D. Gopinath, Y. Sun, Y. Noller, and C. S. Păsăreanu, “NNrepair: Constraint-based repair of neural network classifiers,” in CAV, 2021, pp. 3–25.
[14] X. Zhang, J. Zhai, S. Ma, and C. Shen, “Autotrainer: An automatic dnn training problem detection and repair system,” in ICSE, 2021, pp. 359–371.
[15] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A survey on software fault localization,” TSE, vol. 42, no. 8, pp. 707–740, 2016.
[16] M. Papadakis and Y. Le Traon, “Using mutants to locate ‘unknown’ faults,” in ICST, 2012, pp. 691–700.
[17] S. Moon, Y. Kim, M. Kim, and S. Yoo, “Ask the mutants: Mutating faulty programs for fault localization,” in ICST, 2014, pp. 153–162.
[18] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, “Hints on test data selection: Help for the practicing programmer,” IEEE Computer, vol. 11, pp. 34–41, 1978.
[19] R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “On the accuracy of spectrum-based fault localization,” in TAICPART-MUTATION, 2007, pp. 89–98.
[20] V. Dallmeier, C. Lindig, and A. Zeller, “Lightweight Bug Localization with AMPLE,” in ISAADD, 2005, pp. 99–104.
[21] J. A. Jones, M. J. Harrold, and J. Stasko, “Visualization of test information to assist fault localization,” in ICSE, 2002, pp. 467–477.
[22] L. Naish, H. J. Lee, and K. Ramamohanarao, “A model for spectra-based software diagnosis,” TOSEM, vol. 20, no. 3, pp. 1–32, 2011.
[23] S. Yoo, “Evolving human competitive spectra-based fault localisation techniques,” in SBSE, 2012, pp. 244–258.
[24] X. Xie, F.-C. Kuo, T. Y. Chen, S. Yoo, and M. Harman, “Provably optimal and human-competitive results in sbse for spectrum based fault localisation,” in SBSE, 2013, pp. 224–238.
[25] S. Ma, Y. Liu, W.-C. Lee, X. Zhang, and A. Grama, “Mode: automated neural network model debugging via state differential analysis and input selection,” in ESEC/FSE, 2018, pp. 175–186.
[26] H. F. Eniser, S. Gerasimou, and A. Sen, “Deepfault: Fault localization for deep neural networks,” in FASE, 2019, pp. 171–191.
[27] N. Humbatova, G. Jahangirova, and P. Tonella, “Deepcrime: mutation testing of deep learning systems based on real faults,” in ISSTA, 2021, pp. 67–78.
[28] Q. Hu, L. Ma, X. Xie, B. Yu, Y. Liu, and J. Zhao, “Deepmutation++: A mutation testing framework for deep learning systems,” in ASE, 2019, pp. 1158–1161.
[29] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao et al., “Deepmutation: Mutation testing of deep learning systems,” in ISSRE, 2018, pp. 100–111.
[30] M. Papadakis and Y. Le Traon, “Metallaxis-fl: mutation-based fault localization,” STVR, vol. 25, no. 5-7, pp. 605–628, 2015.
[31] F. Chollet et al., “Keras,” https://keras.io, 2015.
[32] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in OSDI, 2016, pp. 265–283.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in NIPS, 2019, pp. 8024–8035.
[34] E. Schoop, F. Huang, and B. Hartmann, “Umlaut: Debugging deep learning programs using program structure and model behavior,” in CHI, 2021, pp. 1–16.
[35] W. E. Wong and A. P. Mathur, “Reducing the cost of mutation testing: An empirical study,” JSS, pp. 185–196, 1995.
[36] A. Ghanbari, D.-G. Thomas, M. A. Arshad, and H. Rajan, “Mutation-based fault localization of deep neural networks,” https://github.com/ali-ghanbari/deepmufl-ase-2023, 2023.
[37] J. H. Andrews, L. C. Briand, and Y. Labiche, “Is mutation an appropriate tool for testing experiments?” in ICSE, 2005, pp. 402–411.
[38] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, and M. Harman, “Mutation testing advances: an analysis and survey,” in Advances in Computers, 2019, vol. 112, pp. 275–378.
[39] V. Debroy and W. E. Wong, “Using mutation to automatically suggest fixes for faulty programs,” in ICST, 2010, pp. 65–74.
[40] A. Ghanbari, S. Benton, and L. Zhang, “Practical program repair via bytecode mutation,” in ISSTA, 2019, pp. 19–30.
[41] G. Fraser and A. Arcuri, “Achieving scalable mutation-based generation of whole test suites,” ESE, pp. 783–812, 2015.
[42] F. C. M. Souza, M. Papadakis, Y. Le Traon, and M. E. Delamaro, “Strong mutation-based test data generation using hill climbing,” in IWSBST, 2016, pp. 45–54.
[43] D. Shin, S. Yoo, M. Papadakis, and D.-H. Bae, “Empirical evaluation of mutation-based test case prioritization techniques,” STVR, p. e1695, 2019.
[44] J. P. Galeotti, C. A. Furia, E. May, G. Fraser, and A. Zeller, “Inferring loop invariants by mutation, dynamic analysis, and static checking,” TSE, pp. 1019–1037, 2015.
[45] A. Groce, I. Ahmed, C. Jensen, and P. E. McKenney, “How verified is my code? falsification-driven verification (t),” in ASE, 2015, pp. 737–748.
[46] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, “Scalable statistical bug isolation,” ACM SIGPLAN Notices, vol. 40, no. 6, pp. 15–26, 2005.
[47] R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “An evaluation of similarity coefficients for software fault localization,” in PRDC, 2006, pp. 39–46.
[48] Wikipedia contributors, “Hierarchical data format — Wikipedia, the free encyclopedia,” 2022, accessed 08/10/23.
[49] H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “Pit: a practical mutation testing tool for java,” in ISSTA, 2016, pp. 449–452.
[50] P. Ammann and J. Offutt, Introduction to software testing. Cambridge University Press, 2016.
[51] R. Just, F. Schweiggert, and G. M. Kapfhammer, “Major: An efficient and extensible tool for mutation analysis in a java compiler,” in ASE, 2011, pp. 612–615.
[52] Y.-S. Ma, J. Offutt, and Y. R. Kwon, “Mujava: an automated class mutation system,” STVR, pp. 97–133, 2005.
[53] D. Schuler and A. Zeller, “Javalanche: Efficient mutation testing for java,” in ESEC/FSE, 2009, pp. 297–298.
[54] “Junit,” http://junit.org/, 2019, accessed 08/10/23.
[55] “Testng documentation,” https://testng.org/doc/documentation-main.html, 2017, accessed 08/10/23.
[56] M. M. Morovati, A. Nikanjam, F. Khomh, Z. Ming et al., “Bugs in machine learning-based systems: A faultload benchmark,” arXiv, 2022.
[57] X. Li and L. Zhang, “Transforming programs and tests in tandem for fault localization,” Proceedings of the ACM on Programming Languages, vol. 1, no. OOPSLA, pp. 1–30, 2017.
[58] C. Parnin and A. Orso, “Are automated debugging techniques actually helping programmers?” in ISSTA, 2011, pp. 199–209.
[59] P. S. Kochhar, X. Xia, D. Lo, and S. Li, “Practitioners’ expectations on automated fault localization,” in ISSTA, 2016, pp. 165–176.
[60] D. Shin and D.-H. Bae, “A theoretical framework for understanding mutation-based testing methods,” in ICST, 2016, pp. 299–308.
[61] A. V. Pizzoleto, F. C. Ferrari, J. Offutt, L. Fernandes, and M. Ribeiro, “A systematic literature review of techniques and metrics to reduce the cost of mutation testing,” Journal of Systems and Software, vol. 157, p. 110388, 2019.
[62] J. Zhang, “Scalability studies on selective mutation testing,” in ICSE, vol. 2, 2015, pp. 851–854.
[63] A. J. Offutt, J. Pan, K. Tewary, and T. Zhang, “An experimental evaluation of data flow and mutation testing,” Software: Practice and Experience, vol. 26, no. 2, pp. 165–176, 1996.
[64] scikit-learn Contributors, “scikit-learn: Machine learning in python,” 2020, accessed 08/10/23. [Online]. Available: https://scikit-learn.org/stable/
[65] R. Heckel, “Graph transformation in a nutshell,” ENTCS, vol. 148, no. 1, pp. 187–198, 2006.
[66] J. Cao, M. Li, X. Chen, M. Wen, Y. Tian, B. Wu, and S.-C. Cheung, “Deepfd: Automated fault diagnosis and localization for deep learning programs,” in ICSE, 2022, p. 573–585.
[67] Y. Ishimoto, M. Kondo, N. Ubayashi, and Y. Kamei, “Pafl: Probabilistic automaton-based fault localization for recurrent neural networks,” IST, vol. 155, p. 107117, 2023.
[68] Y. Sun, H. Chockler, X. Huang, and D. Kroening, “Explaining image classifiers using statistical fault localization,” in ECCV, 2020, pp. 391–406.