Authors:
(1) An Yan, UC San Diego, [email protected];
(2) Zhengyuan Yang, Microsoft Corporation, [email protected] with equal contributions;
(3) Wanrong Zhu, UC Santa Barbara, [email protected];
(4) Kevin Lin, Microsoft Corporation, [email protected];
(5) Linjie Li, Microsoft Corporation, [email protected];
(6) Jianfeng Wang, Microsoft Corporation, [email protected];
(7) Jianwei Yang, Microsoft Corporation, [email protected];
(8) Yiwu Zhong, University of Wisconsin-Madison, [email protected];
(9) Julian McAuley, UC San Diego, [email protected];
(10) Jianfeng Gao, Microsoft Corporation, [email protected];
(11) Zicheng Liu, Microsoft Corporation, [email protected];
(12) Lijuan Wang, Microsoft Corporation, [email protected].
Editor’s note: This is part 3 of 13 of a paper evaluating the use of a generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.
When presented with a user instruction X_instr in natural language, the agent is asked to perform a series of actions on the smartphone to complete this instruction. The entire process of agent-environment interactions, from the initial to the final state, is called an episode. At each time step t of an episode, the agent is given a screenshot I_t and must decide the next action to take toward completing the task.
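To make this formulation concrete, below is a minimal sketch of one such episode loop, assuming hypothetical `env` and `agent` objects whose method names are illustrative only and not taken from the paper.

```python
def run_episode(env, agent, x_instr: str, max_steps: int = 20):
    """Roll out agent-environment interactions for a single user instruction X_instr."""
    trajectory = []
    for t in range(max_steps):
        screenshot = env.get_screenshot()            # I_t: the current screen image
        action = agent.decide(x_instr, screenshot)   # next action, conditioned on X_instr and I_t
        trajectory.append((screenshot, action))
        env.execute(action)                          # apply the chosen action on the device
        if env.is_done():                            # the instruction has been completed
            break
    return trajectory
```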
GPT-4V is a multimodal model that takes visual images and text as inputs and produces text outputs. One challenge is how to communicate with GPT-4V so that it can perform actions on the screen. A possible solution is to ask the model to reason about the coordinates to click on, given a screen. However, in our preliminary exploration, although GPT-4V has a good understanding of the screen and can describe approximately where to click to perform an instruction (e.g., by referring to the corresponding icon or text), it struggles to estimate accurate numerical coordinates.
Therefore, in this paper, we take a different approach and communicate with GPT-4V via Set-of-Mark prompting (Yang et al., 2023b) on the screen. Specifically, given a screen, we detect UI elements with an OCR tool and IconNet (Sunkara et al., 2022). Each element comes with a bounding box and either OCR-detected text or an icon class label (one of the 96 icon types detected by Sunkara et al. (2022)). At each time step t, we add numeric tags to these elements and present GPT-4V with both the original screen I_t and the tagged screen I_t^tags. The output text Y_action of GPT-4V is conditioned on the two images. If GPT-4V decides to click somewhere on the screen, it chooses from the available numeric tags. In practice, we find that this simple method works well, establishing a strong baseline for screen navigation with large multimodal models.
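As an illustration of the tagging step, the sketch below overlays numeric tags on detected UI elements and assembles the two images for the prompt. The `detect_elements` and `query_gpt4v` callables are hypothetical stand-ins for the OCR tool, IconNet, and the GPT-4V API; this is a sketch of the idea, not a definitive implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
from PIL import Image, ImageDraw

@dataclass
class UIElement:
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) bounding box
    label: str                       # OCR-detected text or one of the 96 icon class labels

def add_numeric_tags(screen: Image.Image, elements: List[UIElement]) -> Image.Image:
    """Return a copy of the screen with a numeric tag drawn on each detected element."""
    tagged = screen.copy()
    draw = ImageDraw.Draw(tagged)
    for idx, elem in enumerate(elements, start=1):
        left, top, _, _ = elem.box
        draw.rectangle(elem.box, outline="red", width=2)
        draw.text((left, max(top - 12, 0)), str(idx), fill="red")  # tag GPT-4V can refer to
    return tagged

def navigate_step(
    screen: Image.Image,
    instruction: str,
    detect_elements: Callable[[Image.Image], List[UIElement]],   # OCR tool + IconNet stand-in
    query_gpt4v: Callable[[List[Image.Image], str], str],        # multimodal API stand-in
) -> str:
    """One navigation step: tag the screen, prompt with both images, return the model's text."""
    elements = detect_elements(screen)
    tagged = add_numeric_tags(screen, elements)
    prompt = (
        f"Instruction: {instruction}\n"
        "If a click is needed, answer with the numeric tag of the element to click."
    )
    # The output Y_action is conditioned on both the original and the tagged screen.
    return query_gpt4v([screen, tagged], prompt)
```

Returning a numeric tag rather than raw coordinates sidesteps the coordinate-estimation weakness noted above: the model only has to pick an element it can see and name.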
This paper is available on arxiv under CC BY 4.0 DEED license.