This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Joel Jang, CarperAI, University of Washington & Allen Institute for AI;
(2) Seungone Kim, KAIST AI;
(3) Yizhong Wang, University of Washington;
(4) Jack Hessel, University of Washington;
(5) Luke Zettlemoyer, Aleph Alpha;
(6) Hannaneh Hajishirzi, University of Washington & Allen Institute for AI;
(7) Yejin Choi, UC San Diego.
3 REINFORCEMENT LEARNING FROM PERSONALIZED HUMAN FEEDBACK
Collecting Conflicting Pairwise Feedback We use Tulu-7B LM (Wang et al., 2023), a LLaMA-7B model (Touvron et al., 2023) that is instruction tuned on a mixture of open-source instruction-tuning datasets, as the base model for our experiments. We use 10k prompt instances from GPT4-Alpaca (Peng et al., 2023), one of the datasets used to train Tulu-7B, as our instruction dataset Dtrain for generating rollouts and collecting pairwise feedback data. We use the same Dtrain during Proximal Policy Optimization (PPO) training (Schulman et al., 2017) of Tulu-7B.
Following previous work, we simulate human annotators with GPT-4 to collect large-scale pairwise feedback data (Bai et al., 2022b; Dubois et al., 2023), but note that our evaluations are validated with (smaller-scale) human preference data collected from crowdworkers. While Dubois et al. (2023) mostly prompt GPT-4 and other LLMs to choose which of two candidate responses is generally better, we provide GPT-4 with a single preference (full list shown in Table 1) as the criterion for deciding which response is better. We also provide the same preference criteria via additional prompts when generating the two candidate responses. In our main experimental setup, the rollouts are generated with Tulu-30B while the policy model we actually train is Tulu-7B, which makes our setting an off-policy training setup.
Reward Model Training While GPT-4 annotation tells us which of the two model responses is more aligned with a single preference, training the reward model on only this pair of positive responses was empirically shown to make PPO training less robust. Instead, we train our reward model on multiple comparisons (Song et al., 2023; Kim et al., 2023) by including a neutral response and a negative response, as shown in Figure 2. Specifically, the reward model is provided with four comparisons for a single prompt during training: positive 1 > positive 2 (decided by GPT-4), positive > neutral, positive > negative, and neutral > negative. The positive response used in the comparisons with the neutral and the negative response is chosen randomly. This exposes the reward model to different granularities of the specific preference, so that it can score responses accordingly during PPO training. We explore (1) training a single reward model in a multitask fashion that leverages the preference prompts during inference to give distinct rewards according to each preference and (2) training multiple reward models, each tailored to a distinct preference.
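The four comparisons described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the independent random draw for each positive-versus-other comparison, and the use of a standard pairwise logistic (Bradley-Terry style) ranking loss are all assumptions made for illustration.

```python
import math
import random


def build_comparisons(pos_a, pos_b, neutral, negative, a_preferred, rng=random):
    """Return four (chosen, rejected) pairs for one prompt.

    `a_preferred` stands in for the GPT-4 judgment between the two positive
    responses. All names here are hypothetical, not taken from the paper.
    """
    pos_1, pos_2 = (pos_a, pos_b) if a_preferred else (pos_b, pos_a)
    return [
        (pos_1, pos_2),                           # positive 1 > positive 2
        (rng.choice([pos_1, pos_2]), neutral),    # positive > neutral (random positive)
        (rng.choice([pos_1, pos_2]), negative),   # positive > negative (random positive)
        (neutral, negative),                      # neutral > negative
    ]


def pairwise_loss(score_chosen, score_rejected):
    """Assumed loss: -log(sigmoid(score_chosen - score_rejected)).

    Computed as a numerically stable softplus of the negated margin.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    pairs = build_comparisons("p_a", "p_b", "neu", "neg", a_preferred=True,
                              rng=random.Random(0))
    for chosen, rejected in pairs:
        print(f"{chosen} > {rejected}")
```

In an actual training loop, the scores passed to `pairwise_loss` would come from the reward model's scalar outputs for the chosen and rejected responses, and the loss would be averaged over all four pairs for each prompt.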
3.2 MULTI-OBJECTIVE REINFORCEMENT LEARNING (MORL)
The MORL problem can be denoted as:
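One common way to write such an objective, given here as a sketch under assumed notation rather than as the authors' exact formulation, is a weighted sum of per-preference rewards with a KL-style penalty toward a reference policy. The symbols are assumptions: $\pi_{\theta}$ is the policy being trained, $\pi_{\mathrm{ref}}$ the initial (reference) policy, $r_i$ the reward for the $i$-th preference, $w_i$ its weight, $n$ the number of preferences being combined, and $\beta$ the penalty coefficient.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim D_{\mathrm{train}},\; y \sim \pi_{\theta}(\cdot \mid x)}
\left[
  \sum_{i=1}^{n} w_i \, r_i(x, y)
  \;-\; \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\right]
```

With $n = 1$ and $w_1 = 1$ this reduces to the usual single-objective RLHF objective optimized with PPO.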
3.3 MULTIFACETED EVALUATION
Evaluation We manually select 50 instances from the Koala evaluation (Geng et al., 2023) that require open-ended generations. We also modify some of the prompts so that the evaluation prompts do not contain any elements requiring individual preferences (e.g., removing a phrase asking for an elementary-level response from the original prompt, since we also want to test whether the LLM can generate an expert-level response). The full list of evaluation prompts used for our experiments is shown in Appendix C. In our evaluation setup, we simulate users who each have a unique combination of preferences, one from each of the three preference dimensions (Expertise, Informativeness, Style) in Table 1, which yields 8 unique preference combinations (examples shown in Figure 1). We report the average win rate across the 8 simulated preference combinations as our final evaluation. We use a variant of the AlpacaFarm evaluation framework for simulated (GPT-4) evaluation and hire 24 crowdworkers for human evaluation. Details of human evaluation are provided in Appendix A.
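The aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the AlpacaFarm-based pipeline itself: it assumes two options (A and B) per preference dimension, represents each pairwise judgment as a boolean (True when the evaluated model's response was preferred), and uses hypothetical function names.

```python
from itertools import product

DIMENSIONS = ("Expertise", "Informativeness", "Style")
OPTIONS = ("A", "B")  # assumed: two options per dimension


def preference_combinations():
    """Enumerate every combination of one option per dimension, e.g. 'ABA'."""
    return ["".join(combo) for combo in product(OPTIONS, repeat=len(DIMENSIONS))]


def average_win_rate(results):
    """Average the per-combination win rates.

    `results` maps a combination label such as 'ABA' to a list of booleans,
    one per evaluation prompt (True = evaluated model preferred).
    Returns (mean over combinations, per-combination win rates).
    """
    per_combo = {combo: sum(wins) / len(wins) for combo, wins in results.items()}
    return sum(per_combo.values()) / len(per_combo), per_combo


if __name__ == "__main__":
    print(preference_combinations())
    # Made-up judgments for two combinations, for illustration only.
    toy = {"AAA": [True, False], "ABA": [True, True]}
    print(average_win_rate(toy))
```

Averaging per-combination win rates, rather than pooling all judgments, weights each preference combination equally even if some combinations have fewer judged prompts.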
[3] P1, P2, P3 each represent the preference prompts from one preference dimension in Table 1. For example, one unique combination might be P1A + P2B + P3A (ABA), where the combined objective is for the response to be elementary level, informative, and friendly.
[4] Empirically, utilizing a single reward model instead of multiple reward models led to better performance. We hypothesize this is due to the problem of normalizing signals from different reward models (Hayes et al., 2022), which is known to be a nontrivial problem in MORL.