
Hacking Reinforcement Learning with a Little Help from Humans (and LLMs)

by Language Models, December 3rd, 2024

Too Long; Didn't Read

ICPL enhances reinforcement learning by integrating Large Language Models with human preference feedback, refining reward functions interactively. Building on works like EUREKA, it sets new standards in RL reward design efficiency.
  1. Abstract and Introduction
  2. Related Work
  3. Problem Definition
  4. Method
  5. Experiments
  6. Conclusion and References


A. Appendix

A.1 Full Prompts and A.2 ICPL Details

A.3 Baseline Details

A.4 Environment Details

A.5 Proxy Human Preference

A.6 Human-in-the-Loop Preference

Reward Design. In reinforcement learning, reward design is a core challenge, as rewards must both represent a desired set of behaviors and provide enough signal for learning. The most common approach to reward design is handcrafting, which requires a large number of trials by experts (Sutton, 2018; Singh et al., 2009). Since hand-coded reward design requires extensive engineering effort, several prior works have studied modeling the reward function with precollected data. For example, Inverse Reinforcement Learning (IRL) aims to recover a reward function from expert demonstration data (Arora & Doshi, 2021; Ng et al., 2000). With advances in pretrained foundation models, some recent works have also studied using large language models or vision-language models to provide reward signals (Ma et al., 2022; Fan et al., 2022; Du et al., 2023; Karamcheti et al., 2023; Kwon et al., 2023; Wang et al., 2024; Ma et al., 2024).
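
To make the engineering burden concrete, here is a minimal sketch of the kind of hand-crafted reward an engineer might write for a simple locomotion task. It is illustrative only; every term and weight is a hypothetical choice that typically has to be tuned over many trials.

```python
import numpy as np

def handcrafted_reward(forward_velocity, action, torso_height):
    """Hypothetical hand-designed reward for a walking robot (illustrative only)."""
    # Reward forward progress.
    progress = 1.0 * forward_velocity
    # Penalize large torques to encourage smooth, efficient control.
    energy_penalty = 0.005 * float(np.sum(np.square(action)))
    # Bonus for staying upright; large penalty once the torso drops too low.
    upright_term = 0.1 if torso_height > 0.8 else -1.0
    return progress - energy_penalty + upright_term
```

Each coefficient above encodes a trade-off that is hard to get right on the first attempt, which is exactly the engineering effort the data-driven approaches cited above try to avoid.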


Among these approaches, EUREKA (Ma et al., 2023) is the closest to our work: it instructs the LLM to generate and select novel reward functions based on environment feedback within an evolutionary framework. However, rather than performing preference learning, EUREKA uses the LLM to design dense rewards that help with learning a hand-designed sparse reward, whereas ICPL focuses on finding a reward that correctly orders a set of preferences. EUREKA does include a small, preliminary investigation that combines human preferences with an LLM to generate human-preferred behaviors in a single scenario, but in that experiment the human provides feedback as free-form text, whereas we only use preference queries. This paper is a significantly scaled-up version of that investigation as well as a methodological study of how best to incorporate prior rounds of feedback.
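
To make "correctly ordering a set of preferences" concrete, the following minimal sketch (an illustrative, hypothetical helper, not code from the paper) scores a candidate reward function by the fraction of human-labeled trajectory pairs in which the preferred trajectory receives the higher return.

```python
def preference_agreement(reward_fn, preference_pairs):
    """preference_pairs: list of (preferred_traj, rejected_traj), where each
    trajectory is a list of (state, action) tuples. Illustrative only."""
    def traj_return(traj):
        # Sum the candidate reward over every step of the trajectory.
        return sum(reward_fn(state, action) for state, action in traj)

    # Count the pairs whose human-preferred trajectory scores higher.
    correct = sum(
        traj_return(preferred) > traj_return(rejected)
        for preferred, rejected in preference_pairs
    )
    return correct / len(preference_pairs)
```

A reward that scores 1.0 under such a check ranks every queried pair the way the human did, which is the property the reward search targets.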


Human-in-the-loop Reinforcement Learning. Feedback from humans has proven effective in training reinforcement learning agents that better match human preferences (Retzlaff et al., 2024; Mosqueira-Rey et al., 2023; Kwon et al., 2023). Previous works have investigated human feedback in various forms, such as trajectory comparisons, preferences, demonstrations, and corrections (Wirth et al., 2017; Ng et al., 2000; Jeon et al., 2020; Peng et al., 2024). Among these methods, preference-based RL has been successfully scaled to train large foundation models for hard tasks like dialogue, e.g., ChatGPT (Ouyang et al., 2022). In LLM-based applications, prompting is a simple way to provide human feedback and align LLMs with human preferences (Giray, 2023; White et al., 2023; Chen et al., 2023). Iteratively refining prompts with feedback from the environment or from human users has shown promise in improving LLM outputs (Wu et al., 2021; Nasiriany et al., 2024). This work builds on the ability to control LLM behavior via in-context prompts: we use interactive rounds of preference feedback between the LLM and humans to guide the LLM toward reward functions that elicit behaviors aligned with human preferences.
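
As a rough outline of such an interactive loop (all helper names here are hypothetical placeholders, not the paper's implementation): an LLM drafts candidate reward functions, a policy is trained with each, a human answers a preference query over the resulting behaviors, and that feedback is folded into the next prompt.

```python
def interactive_preference_loop(llm, train_policy, render_video, ask_human,
                                task_prompt, num_rounds=5, num_candidates=4):
    """Sketch of an LLM-in-the-loop reward search driven by preference queries.
    `llm`, `train_policy`, `render_video`, and `ask_human` are assumed stubs."""
    feedback = ""
    best_code = None
    for _ in range(num_rounds):
        prompt = task_prompt + feedback
        # The LLM writes several candidate reward functions as code strings.
        candidates = [llm(prompt) for _ in range(num_candidates)]
        # Train one policy per candidate reward and record a rollout video.
        videos = [render_video(train_policy(code)) for code in candidates]
        # A human answers a preference query, e.g. picks the best and worst rollout.
        best, worst = ask_human(videos)
        best_code = candidates[best]
        # Fold the preference feedback into the next round's prompt.
        feedback = (
            f"\nPreviously, this reward was preferred:\n{candidates[best]}\n"
            f"and this reward was least preferred:\n{candidates[worst]}\n"
        )
    return best_code
```

The key point illustrated here is that the human only answers comparison queries; all reward-code writing and revision is delegated to the LLM through the prompt.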


Authors:

(1) Chao Yu, Tsinghua University;

(2) Hong Lu, Tsinghua University;

(3) Jiaxuan Gao, Tsinghua University;

(4) Qixin Tan, Tsinghua University;

(5) Xinting Yang, Tsinghua University;

(6) Yu Wang, Tsinghua University (equal advising);

(7) Yi Wu, Tsinghua University and the Shanghai Qi Zhi Institute (equal advising);

(8) Eugene Vinitsky, New York University (equal advising) ([email protected]).


This paper is available on arXiv under a CC 4.0 license.