Authors:
(1) An Yan, UC San Diego, ayan@ucsd.edu;
(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com with equal contributions;
(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;
(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;
(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;
(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;
(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;
(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;
(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;
(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;
(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;
(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.

Editor’s note: This is part 4 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.

Table of Links

Abstract and 1 Introduction
2 Related Work
3 MM-Navigator
3.1 Problem Formulation and 3.2 Screen Grounding and Navigation via Set of Mark
3.3 History Generation via Multimodal Self Summarization
4 iOS Screen Navigation Experiment
4.1 Experimental Setup
4.2 Intended Action Description
4.3 Localized Action Execution and 4.4 The Current State with GPT-4V
5 Android Screen Navigation Experiment
5.1 Experimental Setup
5.2 Performance Comparison
5.3 Ablation Studies
5.4 Error Analysis
6 Discussion
7 Conclusion and References

3.3 History Generation via Multimodal Self Summarization

Set-of-Mark prompting bridges the gap between text outputs from GPT-4V and executable localized actions. However, the agent’s ability to maintain a historical context is equally important in successfully completing tasks on smartphones. The key difficulty lies in devising a strategy that allows the agent to determine the next action at each stage of an episode, taking into account both its prior interactions with the environment and the present state of the screen.

Table 1: Zero-shot GPT-4V iOS screen navigation accuracy on the “intended action description” and “localized action execution” tasks, respectively.

The naive approach of feeding all historical screens or actions into the agent is computationally expensive and may degrade performance due to information overload. For example, screens can change rapidly from step to step, and most of the historical screen information is not useful for reasoning about future actions. Humans, on the other hand, can retain a short memory of the key information after performing a sequence of actions.

We aim to find a more concise representation than a sequence of screens or actions. Specifically, at each time step, we ask GPT-4V to perform multimodal self summarization, which converts the historical actions and current step information into a concise history in the form of natural language, which is formulated as follows:

This paper is available on arxiv under CC BY 4.0 DEED license.
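As an illustration of the self-summarization idea in Section 3.3, the following is a minimal sketch of how a rolling natural-language history might be carried across steps. It is not the authors' implementation: the `VisionModel` callable, the prompt wording, the two-line `Action:` / `Summary:` reply format, and the helper names `build_prompt`, `parse_reply`, and `run_episode` are all assumptions made for this example. The point it shows is that each step sends only the current screen plus one short text summary, rather than every past screen.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not a real client API): takes a text
# prompt and one screenshot as raw bytes, and returns the model's text reply.
VisionModel = Callable[[str, bytes], str]


def build_prompt(goal: str, history: str) -> str:
    """Compose the per-step prompt from the goal and the running summary only."""
    return (
        f"Goal: {goal}\n"
        f"History so far: {history or 'none'}\n"
        "Reply with two lines:\n"
        "Action: <next action on the current screen>\n"
        "Summary: <concise history including this step>"
    )


def parse_reply(reply: str, previous_history: str) -> Tuple[str, str]:
    """Extract the action and the updated summary from a model reply.

    If the reply has no usable Summary line, the previous summary is kept so
    the history is never silently lost.
    """
    action, history = "", previous_history
    for line in reply.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key == "action":
            action = value.strip()
        elif key == "summary" and value.strip():
            history = value.strip()
    return action, history


def run_episode(
    model: VisionModel, goal: str, screens: List[bytes]
) -> List[Tuple[str, str]]:
    """Walk through the screens, passing only the current screen and the summary."""
    history = ""
    trace: List[Tuple[str, str]] = []
    for screen in screens:
        reply = model(build_prompt(goal, history), screen)
        action, history = parse_reply(reply, history)
        trace.append((action, history))
    return trace
```

In this sketch the prompt size stays roughly constant as the episode grows, because the summary replaces the full sequence of earlier screens and actions. A real agent would also need to execute each action and capture the resulting screen, which is omitted here.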