Authors:
(1) An Yan, UC San Diego, [email protected];
(2) Zhengyuan Yang, Microsoft Corporation, [email protected] with equal contributions;
(3) Wanrong Zhu, UC Santa Barbara, [email protected];
(4) Kevin Lin, Microsoft Corporation, [email protected];
(5) Linjie Li, Microsoft Corporation, [email protected];
(6) Jianfeng Wang, Microsoft Corporation, [email protected];
(7) Jianwei Yang, Microsoft Corporation, [email protected];
(8) Yiwu Zhong, University of Wisconsin-Madison, [email protected];
(9) Julian McAuley, UC San Diego, [email protected];
(10) Jianfeng Gao, Microsoft Corporation, [email protected];
(11) Zicheng Liu, Microsoft Corporation, [email protected];
(12) Lijuan Wang, Microsoft Corporation, [email protected].
Editor’s note: This is part 3 of 13 of a paper evaluating the use of a generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.
When presented with a user instruction X_instr in natural language, the agent is asked to perform a series of actions on the smartphone to complete this instruction. The entire process of agent-environment interactions, from the initial to the final state, is called an episode. At each time step t of an episode, the agent is given a screenshot I_t and must decide the next action to take toward completing the task.
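To make this formulation concrete, below is a minimal sketch of one such episode loop, assuming hypothetical `env` and `agent` objects whose method names are illustrative only and not taken from the paper.

```python
def run_episode(env, agent, x_instr: str, max_steps: int = 20):
    """Roll out agent-environment interactions for a single user instruction X_instr."""
    trajectory = []
    for t in range(max_steps):
        screenshot = env.get_screenshot()            # I_t: the current screen image
        action = agent.decide(x_instr, screenshot)   # next action, conditioned on X_instr and I_t
        trajectory.append((screenshot, action))
        env.execute(action)                          # apply the chosen action on the device
        if env.is_done():                            # the instruction has been completed
            break
    return trajectory
```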
GPT-4V is a multimodal model that takes visual images and text as inputs and produces text outputs. One challenge is how to communicate with GPT-4V so that it can perform actions on the screen. A possible solution is to ask the model to reason about the coordinates to click on, given a screen. However, in our preliminary exploration, although GPT-4V has a good understanding of the screen and can describe approximately where to click to perform an instruction (e.g., by referring to the corresponding icon or text), it struggles to estimate accurate numerical coordinates.
Therefore, in this paper, we take a different approach and communicate with GPT-4V via Set-of-Mark prompting (Yang et al., 2023b) on the screen. Specifically, given a screen, we detect UI elements with an OCR tool and IconNet (Sunkara et al., 2022). Each element comes with a bounding box and either OCR-detected text or an icon class label (one of the 96 icon types detected by Sunkara et al. (2022)). At each time step t, we add numeric tags to these elements and present GPT-4V with both the original screen I_t and the tagged screen I_t^tags. The output text Y_action of GPT-4V is conditioned on the two images. If GPT-4V decides to click somewhere on the screen, it chooses from the available numeric tags. In practice, we find that this simple method works well, establishing a strong baseline for screen navigation with large multimodal models.
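As an illustration of the tagging step, the sketch below overlays numeric tags on detected UI elements and assembles the two images for the prompt. The `detect_elements` and `query_gpt4v` callables are hypothetical stand-ins for the OCR tool, IconNet, and the GPT-4V API; this is a sketch of the idea, not a definitive implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
from PIL import Image, ImageDraw

@dataclass
class UIElement:
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) bounding box
    label: str                       # OCR-detected text or one of the 96 icon class labels

def add_numeric_tags(screen: Image.Image, elements: List[UIElement]) -> Image.Image:
    """Return a copy of the screen with a numeric tag drawn on each detected element."""
    tagged = screen.copy()
    draw = ImageDraw.Draw(tagged)
    for idx, elem in enumerate(elements, start=1):
        left, top, _, _ = elem.box
        draw.rectangle(elem.box, outline="red", width=2)
        draw.text((left, max(top - 12, 0)), str(idx), fill="red")  # tag GPT-4V can refer to
    return tagged

def navigate_step(
    screen: Image.Image,
    instruction: str,
    detect_elements: Callable[[Image.Image], List[UIElement]],   # OCR tool + IconNet stand-in
    query_gpt4v: Callable[[List[Image.Image], str], str],        # multimodal API stand-in
) -> str:
    """One navigation step: tag the screen, prompt with both images, return the model's text."""
    elements = detect_elements(screen)
    tagged = add_numeric_tags(screen, elements)
    prompt = (
        f"Instruction: {instruction}\n"
        "If a click is needed, answer with the numeric tag of the element to click."
    )
    # The output Y_action is conditioned on both the original and the tagged screen.
    return query_gpt4v([screen, tagged], prompt)
```

Returning a numeric tag rather than raw coordinates sidesteps the coordinate-estimation weakness noted above: the model only has to pick an element it can see and name.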
This paper is available on arxiv under CC BY 4.0 DEED license.