Authors:
(1) An Yan, UC San Diego, ayan@ucsd.edu;
(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com with equal contributions;
(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;
(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;
(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;
(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;
(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;
(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;
(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;
(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;
(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;
(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.

Editor’s note: This is part 12 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.

Table of Links
Abstract and 1 Introduction
2 Related Work
3 MM-Navigator
3.1 Problem Formulation and 3.2 Screen Grounding and Navigation via Set of Mark
3.3 History Generation via Multimodal Self Summarization
4 iOS Screen Navigation Experiment
4.1 Experimental Setup
4.2 Intended Action Description
4.3 Localized Action Execution and 4.4 The Current State with GPT-4V
5 Android Screen Navigation Experiment
5.1 Experimental Setup
5.2 Performance Comparison
5.3 Ablation Studies
5.4 Error Analysis
6 Discussion
7 Conclusion and References

6 Discussion

Future benchmarks for device control. Future benchmarks need more dynamic interaction environments. Even humans sometimes make mistakes, so an evaluation benchmark should allow the model to explore and to return to a previous state once it makes and recognizes a mistake. It is also worth exploring how to evaluate success rates for this task automatically, e.g., by using LMMs (Zhang et al., 2023). Another direction is to build GUI navigation datasets that cover different devices and diverse content, e.g., personal computers and iPads.

Error correction. A pretrained LMM may make mistakes due to data or algorithm bias. For example, if the agent fails to complete tasks in certain novel settings, how do we correct its errors so that it avoids the same mistakes in the future? It would also be interesting to study this in a continual learning setting, where the agent keeps interacting with new environments and continually receives new feedback.

Model distillation. Using a large-scale model such as GPT-4V for GUI navigation is costly. In the future, it would be interesting to explore model distillation (Polino et al., 2018) for this task to obtain a much smaller model with competitive navigation performance, which may achieve lower latency and higher efficiency.

This paper is available on arXiv under the CC BY 4.0 DEED license.


Researchers Highlight Need for New Benchmarks to Assess AI Models That Can Navigate Smartphones
