This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Zhe Liu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;
(2) Chunyang Chen, Monash University, Melbourne, Australia;
(3) Junjie Wang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China (corresponding author);
(4) Mengzhuo Chen, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;
(5) Boyu Wu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;
(6) Zhilin Tian, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;
(7) Yuekai Huang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;
(8) Jun Hu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;
(9) Qing Wang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China (corresponding author).
Discussion and Threats to Validity
The core idea of InputBlaster is to generate unusual inputs for text widgets using the context information available while the app is running. Although we experiment only with Android mobile apps, other platforms expose similar types of information, so InputBlaster can also be used to test input widgets on those platforms. To check this, we conduct a small-scale experiment on two other popular platforms, covering 10 iOS apps with 15 bugs and 10 Web apps with 18 bugs; details are on our website. The results show that InputBlaster’s bug detection rate is 80% for iOS apps and 78% for Web apps within 30 minutes of testing time. This further supports the generality and usefulness of InputBlaster, and we will conduct more thorough experiments in the future.
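As a quick sanity check on the rates above, the short sketch below back-calculates which whole-number counts of detected bugs are consistent with the rounded percentages. The function name `consistent_counts` is hypothetical, and the resulting counts are inferred from the reported totals and rates; they are not figures stated in the paper.

```python
from typing import List


def consistent_counts(total_bugs: int, reported_percent: int) -> List[int]:
    """Return every detected-bug count whose rate rounds to reported_percent."""
    return [
        detected
        for detected in range(total_bugs + 1)
        if round(100 * detected / total_bugs) == reported_percent
    ]


# Totals and rounded rates are taken from the text; the counts are inferred.
ios_counts = consistent_counts(15, 80)   # 15 iOS bugs, 80% reported
web_counts = consistent_counts(18, 78)   # 18 Web bugs, 78% reported
print(ios_counts)  # [12]
print(web_counts)  # [14]
```

Under this reading, the rates correspond to 12 of 15 iOS bugs and 14 of 18 Web bugs, which also shows how coarse the percentages are at this sample size: one additional or missing detection shifts the rate by roughly 6 to 7 percentage points.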
The first threat concerns the representativeness of the experimental apps. We selected popular and active apps, which partially mitigates this threat.
The second threat relates to the baseline selection. Since there are hardly any existing approaches for unusual input generation for mobile apps, we employ 18 approaches from various aspects for a thorough comparison. There are input generation techniques for Web apps [5, 6, 62, 63]; however, they need to analyze the web code, which differs from mobile apps because of the different rendering mechanism, so they cannot be directly applied to our task. Hence, we do not include them as baselines.
The third threat is that we focus only on crash bugs, since they cause more serious effects and can be observed automatically; existing studies also explore only this type of bug [38, 40, 53].
The fourth threat might lie in the process of manual categorization in Section 2.1.2. The process involves multiple practitioners, and the final decision is double-checked. Also note that the derived categorization is only for illustration, rather than serving as the ground truth for evaluation.
The fifth threat lies in the uncertainty of LLM outputs. The LLM may not generate the output as expected; to mitigate this, we design in-context learning and feedback mechanisms that help ensure the format and content of the LLM output.
Last but not least, InputBlaster gradually builds the example dataset (Section 3.3.1) as testing proceeds. This means that performance can be influenced by the testing order: for example, a crash in an app might go undetected when that app is tested first, yet be revealed when the app is tested after 10 other apps, because the example dataset has accumulated more knowledge by then. In this paper, we use a random order of the experimental apps and will explore this effect further in the future.
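The order dependence described above can be illustrated with a minimal sketch. It is not InputBlaster's implementation: `run_in_order`, `random_orders`, and `toy_test_app` are hypothetical names, and the toy assumption that each tested app contributes exactly one example is made only for illustration. The sketch records how many accumulated examples each app can draw on under a given order, and generates several seeded random orders so that the effect of ordering could be compared across runs.

```python
import random
from typing import Callable, Dict, List, Sequence


def run_in_order(
    apps: Sequence[str],
    test_app: Callable[[str, List[str]], List[str]],
) -> Dict[str, int]:
    """Test apps in the given order while sharing one growing example dataset.

    Returns, for each app, how many examples were available when it started.
    """
    examples: List[str] = []
    available: Dict[str, int] = {}
    for app in apps:
        available[app] = len(examples)
        # test_app returns the new examples learned while testing this app.
        examples.extend(test_app(app, examples))
    return available


def random_orders(apps: Sequence[str], runs: int, seed: int = 0) -> List[List[str]]:
    """Produce `runs` reproducible random permutations of the app list."""
    rng = random.Random(seed)
    orders: List[List[str]] = []
    for _ in range(runs):
        order = list(apps)
        rng.shuffle(order)
        orders.append(order)
    return orders


def toy_test_app(app: str, examples: List[str]) -> List[str]:
    """Hypothetical stand-in: every tested app contributes one new example."""
    return [f"example-from-{app}"]


if __name__ == "__main__":
    for order in random_orders(["app_a", "app_b", "app_c"], runs=3):
        print(order, run_in_order(order, toy_test_app))
```

In this toy setting the first app in any order always starts with an empty dataset, which mirrors the case where a crash is missed when the app is tested first. Repeating a full evaluation over several such orders would be one way to estimate how much the results vary with ordering.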