Meet the AI That Can Actually Use Your Smartphone for You

by @fewshot


Too Long; Didn't Read

Researchers at Microsoft and the University of California, San Diego have developed an AI model capable of navigating your smartphone screen.

Authors:

(1) An Yan, UC San Diego, [email protected];

(2) Zhengyuan Yang, Microsoft Corporation, [email protected] (with equal contributions);

(3) Wanrong Zhu, UC Santa Barbara, [email protected];

(4) Kevin Lin, Microsoft Corporation, [email protected];

(5) Linjie Li, Microsoft Corporation, [email protected];

(6) Jianfeng Wang, Microsoft Corporation, [email protected];

(7) Jianwei Yang, Microsoft Corporation, [email protected];

(8) Yiwu Zhong, University of Wisconsin-Madison, [email protected];

(9) Julian McAuley, UC San Diego, [email protected];

(10) Jianfeng Gao, Microsoft Corporation, [email protected];

(11) Zicheng Liu, Microsoft Corporation, [email protected];

(12) Lijuan Wang, Microsoft Corporation, [email protected].

Editor’s note: This is part 2 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.


Autonomous GUI navigation. Autonomous GUI navigation involves a model following instructions to maneuver through different graphical user interfaces, such as websites or applications, to perform a user-queried task. Current benchmarks collect either synthetic or real-world user-generated instructions to evaluate models’ abilities to identify specific UI elements (Shi et al., 2017; Li et al., 2020; Bai et al., 2021) or to achieve overarching task objectives by interacting with a series of GUI views (Li et al., 2020; Burns et al., 2021; Venkatesh et al., 2022; Deng et al., 2023; Rawles et al., 2023). To understand the visual information in these GUI views, one line of work adopts model architectures that can process multimodal inputs (Sun et al., 2022; Redmon et al., 2016). Other methods focus on converting the UI scene text and icons into a text-only, HTML-like format so that single-module LLMs can process these text inputs for GUI navigation (Zhang et al., 2021; Rawles et al., 2023; Wen et al., 2023).
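
To make the UI-to-text conversion idea concrete, here is a minimal sketch (not taken from any of the cited works) of serializing detected screen elements into an HTML-like string that a text-only LLM can read. The UIElement schema, tag names, and bbox attribute are hypothetical illustrations rather than the format of any specific system.

```python
# Hypothetical sketch: serialize detected UI elements into pseudo-HTML text
# that a text-only LLM can consume. Schema and tag names are illustrative.

from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str                        # e.g., "button", "text", "icon"
    text: str                        # OCR text or a short icon label
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def elements_to_html(elements: list[UIElement]) -> str:
    """Serialize UI elements into pseudo-HTML, one element per line."""
    lines = []
    for i, el in enumerate(elements):
        left, top, right, bottom = el.bbox
        lines.append(
            f'<{el.kind} id="{i}" bbox="{left},{top},{right},{bottom}">'
            f'{el.text}</{el.kind}>'
        )
    return "\n".join(lines)

# Example: the resulting string can be placed into a prompt that asks the
# LLM which element id to act on for a given instruction.
screen = [
    UIElement("text", "Welcome back", (40, 200, 680, 260)),
    UIElement("button", "Sign in", (40, 900, 680, 980)),
]
print(elements_to_html(screen))
```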


Multimodal agents. Recent advancements in LLMs (Brown et al., 2020; OpenAI, 2023a; Chowdhery et al., 2022; Anil et al., 2023; Touvron et al., 2023; Hoffmann et al., 2022) have catalyzed the exploration of LLM-based agent systems (Madaan et al., 2023; Shinn et al., 2023; Pan et al., 2023; Yao et al., 2022; Schick et al., 2023; Paranjape et al., 2023; Pryzant et al., 2023; Guo et al., 2023; Zhao et al., 2023; Yang et al., 2023a), which integrate reasoning logic and external tools to tackle a variety of complex language tasks. Inspired by this success in the NLP domain, multimodal researchers have turned to multimodal agents. This line of research begins with LLM-based multimodal agents (Gupta and Kembhavi, 2023; Surís et al., 2023; Wu et al., 2023; Yang* et al., 2023; Shen et al., 2023; Lu et al., 2023; Yu et al., 2023; Li et al., 2023), such as MM-ReAct (Yang* et al., 2023) for advanced visual reasoning and Visual ChatGPT (Wu et al., 2023) for iterative visual generation and editing. Propelled by the rapid advancements of LMMs (Alayrac et al., 2022; Driess et al., 2023; OpenAI, 2023a,b,c; gpt, 2023; Yang et al., 2023c; Google, 2023), the latest studies have begun to investigate LMM-powered multimodal agents (Yang et al., 2023; Liu et al., 2023), which remove the need for basic visual description tools such as caption models (Wang et al., 2022a; Wu et al., 2022). Our proposed methodology is a specialized LMM-based agent for GUI navigation; we aim to provide a comprehensive analysis and a strong baseline for this task.
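
For intuition only, the sketch below shows the kind of screenshot-reason-act loop that LMM-powered GUI agents typically run. The helper functions (take_screenshot, query_lmm, execute) and the JSON action schema are assumptions for illustration, not the interface of any cited system or of the method proposed in this paper.

```python
# Hypothetical sketch of an LMM-powered GUI-navigation loop.
# take_screenshot, query_lmm, and execute stand in for a multimodal model
# endpoint and a device controller; the JSON action format is assumed.

import json
from typing import Callable

def navigate(instruction: str,
             take_screenshot: Callable[[], bytes],
             query_lmm: Callable[[bytes, str], str],
             execute: Callable[[dict], None],
             max_steps: int = 10) -> list[dict]:
    """Iterate screenshot -> reason -> act until the model reports 'done'."""
    history: list[dict] = []
    for _ in range(max_steps):
        image = take_screenshot()
        prompt = (
            f"Task: {instruction}\n"
            f"Previous actions: {json.dumps(history)}\n"
            'Reply with JSON: {"action": "tap" | "type" | "done", '
            '"target": "<element id>", "text": "<text to type>"}'
        )
        action = json.loads(query_lmm(image, prompt))
        if action.get("action") == "done":
            break
        execute(action)          # e.g., tap the chosen element or type text
        history.append(action)
    return history
```

Asking the model for a structured action at each step keeps the output easy to parse and execute against the device, which is why many agent designs favor it.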


This paper is available on arXiv under a CC BY 4.0 DEED license.