
Researchers Teach AI to Retain Memory by Summarizing Its Own Work

by The FewShot Prompting Publication, December 11th, 2024

Too Long; Didn't Read

Researchers at Microsoft and University of California San Diego have developed an AI model capable of navigating your smartphone screen.

Authors:

(1) An Yan, UC San Diego, [email protected];

(2) Zhengyuan Yang, Microsoft Corporation, [email protected] (equal contribution);

(3) Wanrong Zhu, UC Santa Barbara, [email protected];

(4) Kevin Lin, Microsoft Corporation, [email protected];

(5) Linjie Li, Microsoft Corporation, [email protected];

(6) Jianfeng Wang, Microsoft Corporation, [email protected];

(7) Jianwei Yang, Microsoft Corporation, [email protected];

(8) Yiwu Zhong, University of Wisconsin-Madison, [email protected];

(9) Julian McAuley, UC San Diego, [email protected];

(10) Jianfeng Gao, Microsoft Corporation, [email protected];

(11) Zicheng Liu, Microsoft Corporation, [email protected];

(12) Lijuan Wang, Microsoft Corporation, [email protected].

Editor’s note: This is part 4 of 13 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.


3.3 History Generation via Multimodal Self Summarization

[Table 1: Zero-shot GPT-4V iOS screen navigation accuracy on the “intended action description” and “localized action execution” tasks, respectively.]

Set-of-Mark prompting bridges the gap between text outputs from GPT-4V and executable localized actions. However, the agent’s ability to maintain a historical context is equally important to successfully completing tasks on smartphones. The key difficulty lies in devising a strategy that allows the agent to effectively determine the subsequent action at each stage of an episode, taking into account both its prior interactions with the environment and the present state of the screen. The naive approach of feeding all historical screens or actions to the agent is computationally expensive and may degrade performance through information overload: screens can change rapidly at each step, and most historical screen information is not useful for reasoning about future actions. Humans, by contrast, retain only a short memory of the key information after performing a sequence of actions. We therefore aim for a more concise representation than a sequence of screens or actions. Specifically, at each time step, we ask GPT-4V to perform multimodal self summarization, converting the historical actions and current step information into a concise history in natural language, which is formulated as follows:
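The formula itself is not reproduced in this excerpt. As a rough illustration of the idea only, here is a minimal sketch of the history-update loop, where `summarize` is a hypothetical stand-in for the GPT-4V self-summarization call and the step data is invented for the example:

```python
def summarize(history: str, action: str, screen_caption: str) -> str:
    """Hypothetical placeholder for the GPT-4V self-summarization call.
    In the paper, the model compresses the prior history plus the current
    step into a short natural-language memory; here we just append."""
    updated = f"{history} Then I performed: {action} (screen: {screen_caption})."
    return updated.strip()

def run_episode(steps):
    """Maintain a concise natural-language history across an episode,
    instead of feeding every past screen and action back to the agent."""
    history = ""
    for action, screen_caption in steps:
        # At each time step, the agent would condition on (history, current
        # screen) to pick an action, then rewrite its own history.
        history = summarize(history, action, screen_caption)
    return history

# Invented example steps for illustration.
steps = [
    ("tap the Settings icon", "home screen"),
    ("tap Wi-Fi", "settings list"),
]
final_history = run_episode(steps)
print(final_history)
```

The point of the design is that the memory passed between steps stays a short, fixed-size piece of text rather than a growing sequence of screenshots, which keeps the prompt cheap and avoids the information-overload problem described above.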


This paper is available on arXiv under the CC BY 4.0 DEED license.