Personalized Soups: LLM Alignment Via Parameter Merging - Abstract & Introduction

Written by escholar | Published 2024/03/20
Tech Story Tags: large-language-models | reinforcement-learning | personalized-alignment | ai-human-feedback | parameter-merging | model-adaptation | human-feedback | proximal-policy-optimization

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Joel Jang, CarperAI, University of Washington & Allen Institute for AI;

(2) Seungone Kim, KAIST AI;

(3) Yizhong Wang, University of Washington;

(4) Jack Hessel, University of Washington;

(5) Luke Zettlemoyer, Aleph Alpha;

(6) Hannaneh Hajishirzi, University of Washington & Allen Institute for AI;

(7) Yejin Choi, UC San Diego.

ABSTRACT

While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong single-objective baselines, we show that we can achieve personalized alignment by decomposing preferences into multiple dimensions, defined by the personalizations that a user declares as desirable. We show that these dimensions can be trained independently and efficiently in a distributed manner and combined effectively post-hoc through parameter merging. [1]

1 INTRODUCTION

Reinforcement Learning from Human Feedback (RLHF) (Nakano et al., 2021a; Ouyang et al., 2022a; Bai et al., 2022a; Dubois et al., 2023; Bai et al., 2022b) typically optimizes a policy model that receives training signals from a single reward model that aims to capture the general preferences of a population. In this work, we instead propose Reinforcement Learning from Personalized Human Feedback (RLPHF), a new, multi-objective formulation of the human preference alignment problem, where Large Language Models (LLMs) are trained to be efficiently aligned with a range of different, potentially personalized combinations of human preferences.

We model RLPHF as a Multi-Objective Reinforcement Learning (MORL) problem, which allows the policy model to be trained on multiple, potentially conflicting objectives whose relative importance can be varied at inference time. In existing RLHF formulations, pairwise human feedback is collected by asking human annotators to choose which model response is generally better, and this feedback is used to train a general reward model. This makes implicit assumptions that may not hold for everyone. For example, recent work has shown that LLMs aligned with RLHF prefer verbose output generations (Zheng et al., 2023; Dubois et al., 2023; Wang et al., 2023; Singhal et al., 2023). We instead aim to support a wider range of multifaceted preferences that the user explicitly declares as desirable, giving the user control over the facets of output text they want to see as well as the personal data they wish to reveal to the model. We collect personalized human feedback corresponding to multiple such dimensions, noting that they may also conflict with one another.
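
To make the MORL formulation concrete, here is a minimal sketch (not the authors' released code) of how per-preference reward signals could be combined into a single training reward, with user-chosen weights standing in for the declared importance of each preference. The reward-model callables and preference names are hypothetical placeholders.

```python
import torch

def composite_reward(prompt: str, response: str,
                     reward_models: dict, weights: dict) -> torch.Tensor:
    """Weighted combination of per-preference reward signals.

    reward_models: maps a preference name (e.g. "concise", "friendly") to a
                   callable scoring (prompt, response) -> scalar tensor.
    weights:       user-declared importance of each preference; varying these
                   is how a MORL setup trades conflicting objectives off.
    """
    total = torch.tensor(0.0)
    for name, reward_model in reward_models.items():
        total = total + weights.get(name, 0.0) * reward_model(prompt, response)
    return total
```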

We first implement a strong MORL baseline called PROMPTED-MORL, in which multiple reward signals, one per objective (preference) specified via the prompt, are provided during RL training. Next, we propose PERSONALIZED SOUPS, a method that circumvents simultaneously optimizing multiple preferences: it first optimizes a separate policy model for each preference with Proximal Policy Optimization (PPO) and then merges, on the fly at inference time, the parameters of the policy models whose preferences the user wants to compose. This modular approach reduces the computational complexity from exponential to linear in the total number of unique preferences. Furthermore, since PERSONALIZED SOUPS does not have to be trained in a multitask fashion, it does not require re-training the underlying policy every time a novel preference (objective) is added.
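
As a rough illustration of the merging step, the sketch below averages the parameters of independently PPO-trained policy checkpoints selected by the user. The function name and the uniform weighting are assumptions for illustration, not the released implementation (see the linked repository).

```python
import torch

def merge_policies(state_dicts, weights):
    """Parameter-wise weighted average of policy checkpoints ("souping")."""
    assert state_dicts and len(state_dicts) == len(weights)
    total = float(sum(weights))
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for sd, w in zip(state_dicts, weights)) / total
    return merged

# A user who declares the (hypothetical) "concise" and "friendly" preferences
# composes the two corresponding experts on the fly, with no extra training:
# merged = merge_policies([concise_sd, friendly_sd], weights=[1.0, 1.0])
# policy.load_state_dict(merged)
```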

We empirically show that by transforming the problem of aligning LLMs to human preferences into a MORL problem, we achieve personalized alignment with a deeper level of adaptation to individual users than supervised fine-tuning, RLHF, or prompting can attain. We also highlight the modularity of PERSONALIZED SOUPS with experiments in a scenario where the user writes novel preferences that they want to integrate with existing ones. In this scenario, PERSONALIZED SOUPS still performs competitively with PROMPTED-MORL while being exponentially more efficient through parameter merging.
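
As a back-of-the-envelope illustration of the efficiency claim (illustrative numbers, not results from the paper): jointly covering every on/off combination of N declared preferences scales exponentially, whereas training one expert per preference and merging post hoc scales linearly, and a newly added preference costs one additional expert rather than a full re-training.

```python
def models_needed(n_preferences: int) -> tuple[int, int]:
    joint = 2 ** n_preferences  # every subset of preferences trained jointly
    soups = n_preferences       # one PPO-trained expert per preference
    return joint, soups

for n in (3, 6, 10):
    joint, soups = models_needed(n)
    print(f"N={n}: joint combinations={joint}, per-preference experts={soups}")
```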


[1] Code: https://github.com/joeljang/RLPHF

