Study Finds AI Shouldn't Be Made to Think Deeply to Navigate Your Phone by @fewshot



Too Long; Didn't Read

Researchers at Microsoft and the University of California San Diego have developed an AI model capable of navigating your smartphone screen.

Authors:

(1) An Yan, UC San Diego, [email protected];

(2) Zhengyuan Yang, Microsoft Corporation, [email protected] (equal contribution);

(3) Wanrong Zhu, UC Santa Barbara, [email protected];

(4) Kevin Lin, Microsoft Corporation, [email protected];

(5) Linjie Li, Microsoft Corporation, [email protected];

(6) Jianfeng Wang, Microsoft Corporation, [email protected];

(7) Jianwei Yang, Microsoft Corporation, [email protected];

(8) Yiwu Zhong, University of Wisconsin-Madison, [email protected];

(9) Julian McAuley, UC San Diego, [email protected];

(10) Jianfeng Gao, Microsoft Corporation, [email protected];

(11) Zicheng Liu, Microsoft Corporation, [email protected];

(12) Lijuan Wang, Microsoft Corporation, [email protected].

Editor’s note: This is part 10 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.


5.3 Ablation Studies

For the ablation studies, we randomly sampled a total of 50 episodes from 5 categories, a different subset from the one used for the main results.
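The sampling itself is straightforward; the sketch below illustrates one way such a subset could be drawn, assuming episodes are grouped by category. The helper name, the even per-category split, and the example IDs are assumptions for illustration, not taken from the paper.

```python
import random

def sample_ablation_subset(episodes_by_category, total=50, seed=0):
    """Draw `total` episodes spread evenly over the given categories
    (hypothetical helper; the paper's actual sampling code is not shown)."""
    rng = random.Random(seed)
    categories = sorted(episodes_by_category)
    per_category = total // len(categories)  # 50 episodes over 5 categories -> 10 each
    subset = []
    for name in categories:
        subset.extend(rng.sample(episodes_by_category[name], per_category))
    return subset

# Example: `episodes_by_category` maps each category name to its episode IDs,
# e.g. {"CategoryA": ["ep_001", "ep_002", ...], "CategoryB": [...], ...}
```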


Different tagging methods. We first perform an ablation study comparing different methods of adding tags to the screen, as shown in Table 4. We consider three methods: (1) By side, which adds tags as black squares (in the same style as Rawles et al., 2023) to the left side of each detected icon; (2) Red, which uses red circles for each tag; (3) Center, which adds tags as black squares at the center of each detected box. First, adding tags to the left side of boxes can cause problems: for example, some icons may be too close to each other, leading to slightly worse results. Regarding tagging styles, we did not find a significant difference between red circles and black rectangles, though empirically black rectangles (Yang et al., 2023b) perform slightly better.
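To make the three tagging variants concrete, here is a minimal sketch of how numbered tags could be overlaid on a screenshot with Pillow. The function name, tag size, and pixel offsets are assumptions for illustration; this is not the authors' implementation.

```python
from PIL import Image, ImageDraw, ImageFont

def add_tags(screenshot_path, boxes, style="center"):
    """Overlay numeric tags on detected icon boxes.

    `boxes` is a list of (left, top, right, bottom) pixel coordinates.
    The three styles mirror the ablation variants: "side" places a black
    square just left of each box, "red" draws a red circle, and "center"
    places a black square at the center of each box.
    """
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    for idx, (left, top, right, bottom) in enumerate(boxes, start=1):
        if style == "side":
            x, y = left - 20, top  # tag sits just left of the detected box
        else:
            x, y = (left + right) // 2 - 10, (top + bottom) // 2 - 10  # centered tag
        if style == "red":
            draw.ellipse([x, y, x + 20, y + 20], outline="red", width=2)
            draw.text((x + 6, y + 4), str(idx), fill="red", font=font)
        else:
            draw.rectangle([x, y, x + 20, y + 20], fill="black")
            draw.text((x + 6, y + 4), str(idx), fill="white", font=font)
    return img
```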


Different prompts. We then perform a robustness check with different prompting variants: (1) Baseline: simply ask GPT-4V to take actions; (2) Think: prompt GPT-4V to think step by step (Kojima et al., 2022); (3) Detail: provide more context for the task. Overall, we did not observe improvements from “thinking step by step,” but adding more task description helps GPT-4V execute actions better.
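For illustration, the sketch below shows how the three prompting variants might differ in practice. The prompt wording, the template names, and the `build_prompt` helper are hypothetical; the paper's exact prompts are not reproduced here.

```python
# Hypothetical prompt templates for the three ablation variants.
BASELINE = (
    "You are shown a phone screenshot with numbered tags on UI elements.\n"
    "Goal: {goal}\n"
    "Which tag should be acted on next, and with what action (tap, type, scroll)?"
)

# "Think" variant: append a step-by-step instruction (Kojima et al., 2022).
THINK = BASELINE + "\nLet's think step by step before answering."

# "Detail" variant: add more task context and an explicit action space.
DETAIL = (
    "You are an agent operating an Android phone to complete a user's goal. "
    "The screenshot shows the current screen, and each interactive element "
    "is marked with a numbered tag. Choose exactly one action per step: "
    "tap(tag), type(text), scroll(direction), or press a system button.\n"
    "Goal: {goal}\n"
    "Respond with the chosen action."
)

def build_prompt(variant, goal):
    """Return the prompt text for one of the three variants."""
    template = {"baseline": BASELINE, "think": THINK, "detail": DETAIL}[variant]
    return template.format(goal=goal)
```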


This paper is available on arXiv under a CC BY 4.0 DEED license.