This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Joel Jang, CarperAI, University of Washington & Allen Institute for AI;
(2) Seungone Kim, KAIST AI;
(3) Yizhong Wang, University of Washington;
(4) Jack Hessel, University of Washington;
(5) Luke Zettlemoyer, Aleph Alpha;
(6) Hannaneh Hajishirzi, University of Washington & Allen Institute for AI;
(7) Yejin Choi, UC San Diego.
2 RELATED WORK
Aligning Language Models To Human Preferences. Incorporating human preference feedback into a reward model, and subsequently using an RL algorithm to optimize a language model to output text that the reward model scores highly, has been shown to result in language models that generate outputs humans generally prefer (Ouyang et al., 2022b). This process has been applied to summarization (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021a), answering questions with long-form answers using text retrieved from the web (Nakano et al., 2021b; Menick et al., 2022), generating engaging responses in dialogue settings (Thoppilan et al., 2022; Cohen et al., 2022), and following human instructions (Kojima et al., 2021; Suhr & Artzi, 2022; Kim et al., 2023).
However, the standard RLHF setup commonly addressed in prior work assumes a reward model that accounts only for average annotator preference; that is, it ignores the fact that different users may desire different outputs, even for the same prompt (Casper et al., 2023). Individual preferences can vary not only on aesthetic axes, but also on semantics. For example, Santurkar et al. (2023) use public opinion polling to show that the opinions LLMs express by “default” align to different degrees with the average opinions of different demographic groups.[2] Kirk et al. (2023) define a taxonomy and policy framework for the alignment of LLMs with personalized feedback. While Wu et al. (2023) perform fine-grained RLHF, which is very similar in spirit and allows personalization, our work develops MORL algorithms for scenarios where preferences conflict, rather than only for orthogonal objectives.
Multi-objective Reinforcement Learning (MORL). In this work, we propose formulating LLM personalization as a MORL problem. MORL has typically been studied in decision-making tasks (Hayes et al., 2022) and aims to address the limitations of optimizing a single, scalar, additive reward function (Sutton & Barto, 2018), such as (1) suboptimal solutions due to lack of representation (Hayes et al., 2022), (2) lack of explainability of distinct objectives, and (3) difficulty ensuring fair outcomes for multiple participants (Vamplew et al., 2018; Siddique et al., 2020).
Previous work has aimed to alleviate these problems through novel MORL methods (Van Moffaert et al., 2013; Van Moffaert & Nowé, 2014; Yang et al., 2019; Xu et al., 2020). Other work aims to solve complex problems such as water management, military purchasing, and wind farm control (Hayes et al., 2022) by converting the single-objective RL problem into a MORL problem. In this work, we convert the problem of aligning LLMs to human preferences into a MORL problem to (1) provide a solution closer to optimal for each individual, (2) allow users to dynamically choose the distinct objectives they want to optimize, and (3) ensure fairness by allowing long-tail preferences to be integrated.
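To illustrate the contrast between a single averaged reward and user-chosen objectives, the following is a minimal sketch of linear scalarization, one common way to combine a vector of rewards in MORL. It is not presented as the method of this paper: the objective names (`concise`, `detailed`), the candidate responses, and all scores are hypothetical values chosen for illustration.

```python
from typing import Dict

# Hypothetical per-objective reward scores for two candidate responses.
# In practice these would come from separate reward models.
candidates: Dict[str, Dict[str, float]] = {
    "short answer": {"concise": 0.9, "detailed": 0.2},
    "long answer": {"concise": 0.1, "detailed": 0.8},
}


def scalarize(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse a vector of rewards into one scalar via a weighted sum."""
    return sum(weights.get(name, 0.0) * value for name, value in rewards.items())


def best_response(
    options: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> str:
    """Return the candidate with the highest scalarized reward."""
    return max(options, key=lambda name: scalarize(options[name], weights))


# A single "average annotator" weighting versus a user who wants detail.
average_user = {"concise": 0.5, "detailed": 0.5}
detail_user = {"concise": 0.0, "detailed": 1.0}

print(best_response(candidates, average_user))
print(best_response(candidates, detail_user))
```

Under these assumed scores, the averaged weighting selects the short answer for everyone, while the user who weights only `detailed` gets the long answer. Leaving the weights as an input is what lets each user choose which objectives to optimize.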
Personalization in Natural Language Processing. Personalization in Natural Language Processing (NLP) has mainly focused on creating personalized dialogue agents (Zhang et al., 2018; Mazaré et al., 2018; Zheng et al., 2019; Wu et al., 2021b; Xu et al., 2022), where the task is to create engaging chitchat agents with distinct personas based on user profiles (e.g., gender, age, residence) or past user history data (e.g., Reddit posts). Another line of work (Salemi et al., 2023) leverages personalized information to boost performance on specific tasks such as review generation (Li & Tuzhilin, 2019), recipe generation (Majumder et al., 2019), and headline generation (Ao et al., 2021). This line of work requires model providers to build better models using the user's personal information. In our work, we propose a framework that allows users to choose which preference the language model should follow, essentially giving control to the user.
Parameter Merging. Recent work has shown that weighted linear interpolation of model parameters composes the abilities of the individual models (Li et al., 2022; Wortsman et al., 2022b;a; Don-Yehiya et al., 2022; Huang et al., 2023). This line of work has led to many interesting applications of model merging, such as composing the abilities of expert models that perform different tasks (Ilharco et al., 2022; Jang et al., 2023) and introducing language-specific modules to grow the total capacity of multilingual LMs (Pfeiffer et al., 2022).
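As a concrete picture of weighted linear interpolation of model parameters, the sketch below merges two toy parameter sets. It assumes that all models share the same architecture and that the interpolation weights sum to 1. The `Params` representation (a dictionary of flat lists), the function `merge_parameters`, and the two "expert" models are hypothetical stand-ins for real checkpoints, not an API from any of the cited works.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_parameters(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the weighted linear interpolation of several parameter sets."""
    if not models:
        raise ValueError("need at least one model")
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights should sum to 1")

    merged: Params = {}
    for name, reference in models[0].items():
        # Every model must expose the same parameter with the same size.
        for model in models:
            if name not in model or len(model[name]) != len(reference):
                raise ValueError(f"parameter {name!r} is missing or mismatched")
        # Interpolate element-wise across all models.
        merged[name] = [
            sum(w * model[name][i] for w, model in zip(weights, models))
            for i in range(len(reference))
        ]
    return merged


# Two toy "expert" models with identical parameter names and sizes.
expert_a: Params = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
expert_b: Params = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

merged = merge_parameters([expert_a, expert_b], weights=[0.5, 0.5])
print(merged)
```

With equal weights, each merged parameter is the element-wise mean of the two experts. Changing the weights shifts the merged model toward one expert without any further training.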
Most recently, Ramé et al. (2023) proposed merging policy models that were trained with proxy reward models to perform specific tasks such as question answering and summarization. While they mostly deal with reward models trained on the same data, our proposed MORL methods extend this work to diverse reward models trained on multifaceted human feedback, showing compositional abilities through parameter merging rather than just ensembling.
[2] Feng et al. (2023) suggest that “default” LLM expressed opinions stem directly from the pretraining data.