Why Developers’ Confidence in Testing Techniques Doesn’t Always Match Reality

Written by escholar | Published 2025/12/18
Tech Story Tags: software-testing | code-review-vs-testing | branch-testing | equivalence-partitioning | empirical-software-engineering | software-testing-effectiveness | developer-decision-making | software-testing-techniques

TL;DR: This study shows that developers’ perceptions of how effective their testing techniques are often don’t align with reality. Through replicated experiments, the research finds that confidence is shaped by perceived application success rather than actual defect detection. The results suggest developers should rely less on intuition and more on empirical feedback, tooling, and evidence-based guidance when evaluating software quality.

Abstract

1 Introduction

2 Original Study: Research Questions and Methodology

3 Original Study: Validity Threats

4 Original Study: Results

5 Replicated Study: Research Questions and Methodology

6 Replicated Study: Validity Threats

7 Replicated Study: Results

8 Discussion

9 Related Work

10 Conclusions and References

9 Related Work

In recent years, several experiments on defect detection technique effectiveness (static techniques and/or test-case design techniques) have been run with and without humans. Experiments without humans compare the efficiency and effectiveness of specification-based, code-based, and fault-based techniques, such as those conducted by Bieman & Schultz [8], Hutchins et al. [27], Offutt et al. [43], Offutt & Lee [44], Weyuker [51] and Wong & Mathur [53].

Most of the experiments with humans evaluate static techniques, such as those run by Basili et al. [5], Biffl [9], Dunsmore et al. [18], Maldonado et al. [37], Porter et al. [45] and Thelin et al. [48]. Experiments evaluating test-case design techniques have studied the efficiency and effectiveness of specification-based and control-flow-code-based techniques applied by humans, such as those run by Basili & Selby [4], Briand et al. [10], Kamsties & Lott [29], Myers [40] and Roper et al. [46]. These experiments focus on strictly quantitative issues, leaving aside human factors such as developers’ perceptions and opinions.

There are surveys that study developers’ perceptions and opinions on different testing issues, such as those performed by Deak [13], Dias-Neto et al. [15], Garousi et al. [23], Gonçalves et al. [24], Guaiani & Muccini [25], Khan et al. [31] and Hernández & Marsden [38]. However, their results are not linked to quantitative issues. In this regard, some studies link personality traits to preferences regarding the software tester role, for example Capretz et al. [11], Kanij et al. [30] and Kosti et al. [33]. However, there are no studies looking for a relationship between personality traits and quantitative issues such as testing effectiveness.

There are some approaches for helping developers select the best testing techniques to apply under particular circumstances, such as those proposed by Cotroneo et al. [12], Dias-Neto & Travassos [16] or Vegas et al. [50]. Our study suggests that this type of research needs to be more widely disseminated to improve knowledge about techniques.

Finally, there are several ways in which developers can make decisions in the software development industry. The most basic approach relies on perceptions and/or opinions, as reported by Dybå et al. [19] and Zelkowitz et al. [55]. Other approaches suggest using classical decision-making models [2]. Experiments can also be used for industry decision-making, as described by Jedlitschka et al. [28]. Devanbu et al. [14] have observed the use of past experience (beliefs). More recent approaches advocate automatic decision-making based on mining software repositories [7].

10 Conclusions

The goal of this paper was to discover whether developers’ perceptions of the effectiveness of different code evaluation techniques are accurate in the absence of prior experience. To do this, we conducted an empirical study with students and a replication. The original study revealed that participants’ perceptions are wrong. As a result, we conducted a replication aimed at discovering what was behind participants’ misperceptions, opting to study participants’ opinions on the techniques. The results of the replicated study corroborate the findings of the original study.

They also reveal that participants’ perceptions of technique effectiveness are based on how well they applied the techniques. We also found that participants’ perceptions are not influenced by their opinions about technique complexity or by their preferences for particular techniques. Based on these results, we derived some recommendations for developers: they should not blindly trust their perceptions, and they should be aware that applying a technique correctly does not guarantee that they will find all of a program’s defects.
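To make that last point concrete, here is a minimal illustrative sketch of our own (it is not taken from the original studies, and the order_total function and its values are hypothetical). A test suite that correctly applies branch testing reaches 100% branch coverage and passes, yet never exercises the one input that exposes the defect.

```python
def order_total(quantity, unit_price):
    """Hypothetical function. Intended behaviour: orders of 10 or more
    items get a 10% discount on the order total."""
    total = quantity * unit_price
    if quantity > 10:          # defect: the specification says quantity >= 10
        total *= 0.9
    return total

# A correctly applied branch-testing suite: both branches of the `if`
# are exercised, so branch coverage is 100% and every assertion passes.
assert order_total(20, 1.0) == 18.0   # discount branch
assert order_total(5, 1.0) == 5.0     # no-discount branch

# The defect only surfaces at the boundary value 10, an input that
# branch testing never forces the tester to choose:
print(order_total(10, 1.0))           # prints 10.0; the specification expects 9.0
```

The suite fully satisfies the branch-testing criterion, but the fault sits on an equivalence-partition boundary that the criterion does not force the tester to pick, which is exactly the gap between correct technique application and actual defect detection discussed above.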

Additionally, we identified a number of lines of action that could help to mitigate the misperception problem: developing tools that inform developers about how effective their testing is, conducting more empirical studies to discover the conditions under which techniques are applicable, developing instruments that allow easy access to experimental results, investigating other possible drivers of misperceptions, and investigating what lies behind opinions. Future work includes running new replications of these studies to better understand their results.

References

  1. Altman, D.: Practical Statistics for Medical Research. Chapman and Hall (1991)
  2. Aurum, A., Wohlin, C.: Applying decision-making models in requirements engineering. In: Proceedings of Requirements Engineering for Software Quality (2002)
  3. Banerjee, M.V., Capozzoli, M., McSweeney, L., Sinha, D.: Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics 27, 3–23 (1999)
  4. Basili, V., Selby, R.: Comparing the effectiveness of software testing strategies. IEEE Transactions on Software Engineering 13(12), 1278–1296 (1987)
  5. Basili, V.R., Green, S., Laitenberger, O., Lanubile, F., Shull, F., Sorumgard, S., Zelkowitz, M.V.: The empirical investigation of perspective based reading. Empirical Software Engineering 1(2), 133–164 (1996)
  6. Beizer, B.: Software Testing Techniques, 2nd edn. International Thomson Computer Press (1990)
  7. Bhattacharya, P.: Quantitative decision-making in software engineering. Ph.D. thesis, University of California Riverside (2012)
  8. Bieman, J., Schultz, J.: An empirical evaluation (and specification) of the all-du-paths testing criterion. Software Engineering Journal pp. 43–51 (1992)
  9. Biffl, S.: Analysis of the impact of reading technique and inspector capability on individual inspection performance. In: 7th Asia-Pacific Software Engineering Conference, pp. 136–145 (2000)
  10. Briand, L., Penta, M., Labiche, Y.: Assessing and improving state-based class testing: A series of experiments. IEEE Transactions on Software Engineering 30(11), 770–793 (2004)
  11. Capretz, L., Varona, D., Raza, A.: Influence of personality types in software tasks choices. Computers in Human Behavior 52, 373–378 (2015)
  12. Cotroneo, D., Pietrantuono, R., Russo, S.: Testing techniques selection based on odc fault types and software metrics. Journal of Systems and Software 86(6), 1613–1637 (2013)
  13. Deak, A.: Understanding socio-technical factors influencing testers in software development organizations. In: 36th Annual Computer Software and Applications Conference (COMPSAC’12), pp. 438–441 (2012)
  14. Devanbu, P., Zimmermann, T., Bird, C.: Belief & evidence in empirical software engineering. In: Proceedings of the 38th international conference on software engineering, pp. 108–119 (2016)
  15. Dias-Neto, A., Matalonga, S., Solari, M., Robiolo, G., Travassos, G.: Toward the characterization of software testing practices in South America: looking at Brazil and Uruguay. Software Quality Journal pp. 1–39 (2016)
  16. Dias-Neto, A., Travassos, G.: Supporting the combined selection of model-based testing techniques. IEEE Transactions on Software Engineering 40(10), 1025–1041 (2014)
  17. Dieste, O., Aranda, A., Uyaguari, F., Turhan, B., Tosun, A., Fucci, D., Oivo, M., Juristo, N.: Empirical evaluation of the effects of experience on code quality and programmer productivity: an exploratory study. Empirical Software Engineering (2017). DOI https://doi.org/10.1007/s10664-016-9471-3
  18. Dunsmore, A., Roper, M., Wood, M.: Further investigations into the development and evaluation of reading techniques for object-oriented code inspection. In: 24th International Conference on Software Engineering, p. 47–57 (2002)
  19. Dybå, T., Kitchenham, B., Jorgensen, M.: Evidence-based software engineering for practitioners. IEEE Software 22(1), 58–65 (2005)
  20. Everitt, B.: The analysis of contingency tables. In: Monographs statistics and applied probability, 45. Chapman & Hall/CRC (2000)
  21. Falessi, D., Juristo, N., Wohlin, C., Turhan, B., Münch, J., Jedlitschka, A., Oivo, M.: Empirical software engineering experts on the use of students and professionals in experiments. Empirical Software Engineering (2017). DOI https://doi.org/10.1007/s10664-017-9523-3
  22. Fleiss, J.L., Levin, B., Paik, M.C.: Statistical Methods for Rates and Proportions, 3rd edn. Wiley & Sons (2003)
  23. Garousi, V., Felderer, M., Kuhrmann, M., Herkiloğlu, K.: What industry wants from academia in software testing?: Hearing practitioners’ opinions. In: Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, EASE’17, pp. 65–69 (2017)
  24. Gonçalves, W., de Almeida, C., de Araújo L., L., Ferraz, M., Xandú, R., de Farias, I.: The influence of human factors on the software testing process: The impact of these factors on the software testing process. In: Information Systems and Technologies (CISTI), 2017 12th Iberian Conference on, pp. 1–6 (2017)
  25. Guaiani, F., Muccini, H.: Crowd and laboratory testing, can they co-exist? an exploratory study. In: 2nd International Workshop on CrowdSourcing in Software Engineering (CSI-SE), pp. 32–37 (2015)
  26. Hayes, A., Krippendorff, K.: Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1, 77–89 (2007)
  27. Hutchins, M., Foster, H., Goradia, T., Ostrand, T.: Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. In: Proceedings of the 16th International Conference on Software Engineering, pp. 191–200 (1994)
  28. Jedlitschka, A., Juristo, N., Rombach, D.: Reporting experiments to satisfy professionals’ information needs. Empirical Software Engineering 19(6), 1921–1955 (2014)
  29. Kamsties, E., Lott, C.: An empirical evaluation of three defect-detection techniques. In: Proceedings of the Fifth European Software Engineering Conference, pp. 84–89 (1995)
  30. Kanij, T., Merkel, R., Grundy, J.: An empirical investigation of personality traits of software testers. In: 8th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE’15), pp. 1–7 (2015)
  31. Khan, T., Pezeshki, V., Clear, F., Al-Kaabi, A.: Diverse virtual social networks: implications for remote software testing teams. In: European, Mediterranean & Middle Eastern Conference on Information Systems (2010)
  32. Kocaguneli, E., Tosun, A., Bener, A., Turhan, B., Caglayan, B.: Prest: An intelligent software metrics extraction, analysis and defect prediction tool. pp. 637–642 (2009)
  33. Kosti, M., Feldt, R., Angelis, L.: Personality, emotional intelligence and work preferences in software engineering: An empirical study. Information and Software Technology 56(8), 973–990 (2014)
  34. Kuehl, R.: Design of Experiments: Statistical Principles of Research Design and Analysis, 2nd edn. Duxbury Thomson Learning (2000)
  35. Landis, J., Koch, G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977)
  36. Linger, R.: Structured Programming: Theory and Practice (The Systems programming series). Addison-Wesley (1979)
  37. Maldonado, J., Carver, J., Shull, F., Fabbri, S., Dória, E., Martimiano, L., Mendonça, M., Basili, V.: Perspective-based reading: A replicated experiment focused on individual reviewer effectiveness. Empirical Software Engineering 11(1), 119–142 (2006)
  38. Marsden, N., Pérez Rentería y Hernández, T.: Understanding software testers in the automotive industry: A mixed-method case study. In: 9th International Conference on Software Engineering and Applications (ICSOFT-EA), pp. 305–314 (2014)
  39. Massey, A., Otto, P., Antón, A.: Evaluating legal implementation readiness decision-making. IEEE Transactions on Software Engineering 41(6), 545–564 (2015)
  40. Myers, G.: A controlled experiment in program testing and code walkthroughs/inspections. Communications of the ACM 21(9), 760–768 (1978)
  41. Myers, G., Badgett, T., Sandler, C.: The Art of Software Testing, 2nd edn. Wiley-Interscience (2004)
  42. Octaviano, F., Felizardo, K., Maldonado, J., Fabbri, S.: Semi-automatic selection of primary studies in systematic literature reviews: is it reasonable? Empirical Software Engineering 20(6), 1898–1917 (2015)
  43. Offutt, A., Lee, A., Rothermel, G., Untch, R., Zapf, C.: An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering and Methodology 5(2), 99–118 (1996)
  44. Offutt, A., Lee, S.: An empirical evaluation of weak mutation. IEEE Transactions on Software Engineering 20(5), 337–344 (1994)
  45. Porter, A., Votta, L., Basili, V.: Comparing detection methods for software requirements inspection: A replicated experiment. IEEE Transactions on Software Engineering 21(6), 563–575 (1995)
  46. Roper, M., Wood, M., Miller, J.: An empirical evaluation of defect detection techniques. Information and Software Technology 39, 763–775 (1997)
  47. Shull, F., Carver, J., Vegas, S., Juristo, N.: The role of replications in empirical software engineering. Empirical Software Engineering 13, 211–218 (2008)
  48. Thelin, T., Runeson, P., Wohlin, C., Olsson, T., Andersson, C.: Evaluation of usage-based reading—conclusions after three experiments. Empirical Software Engineering 9, 77–110 (2004)
  49. Vegas, S., Basili, V.: A characterisation schema for software testing techniques. Empirical Software Engineering 10(4), 437–466 (2005)
  50. Vegas, S., Juristo, N., Basili, V.: Maturing software engineering knowledge through classifications: A case study on unit testing techniques. IEEE Transactions on Software Engineering 35(4), 551–565 (2009)
  51. Weyuker, E.: The complexity of data flow criteria for test data selection. Information Processing Letters 19(2), 103–109 (1984)
  52. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in software engineering: an introduction, 2nd edn. Springer (2014)
  53. Wong, E., Mathur, A.: Fault detection effectiveness of mutation and data-flow testing. Software Quality Journal 4, 69–83 (1995)
  54. Zapf, A., Castell, S., Morawietz, L., Karch, A.: Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate? BMC Medical Research Methodology 16(93) (2016)
  55. Zelkowitz, M., Wallace, D., Binkley, D.: Experimental validation of new software technology. Series on Software Engineering and Knowledge Engineering 12, 229–263 (2003)

Authors:

  1. Sira Vegas
  2. Patricia Riofrío
  3. Esperanza Marcos
  4. Natalia Juristo

This paper is available on arXiv under the CC BY-NC-ND 4.0 license.

