This paper is available on arXiv under CC 4.0 license.
Authors:
(1) Joel Jang, CarperAI, University of Washington & Allen Institute for AI;
(2) Seungone Kim, KAIST AI;
(3) Yizhong Wang, University of Washington;
(4) Jack Hessel, University of Washington;
(5) Luke Zettlemoyer, Aleph Alpha;
(6) Hannaneh Hajishirzi, University of Washington & Allen Institute for AI;
(7) Yejin Choi, UC San Diego.
4 EXPERIMENTS
4.1 BASELINE METHODS
In this subsection, we provide details of the single-objective baseline methods we implement. Table 2 summarizes how their key components differ from our proposed methods.
Vanilla Baseline (VB) As the simplest baseline, we use the base Tulu-7B model to generate responses without giving it any notion of preferences. During evaluation, we use the same response for all 8 preference combinations.
Reinforcement Learning from Human Feedback (RLHF) We perform RLHF in the traditional manner: GPT-4 labels which response is generally better, we train a reward model on this pairwise feedback data, and we use the reward model to adapt the policy model with PPO training. The same 10k instances from Dtrain are used for RLHF.
Preference Prompting (PP) Next, we observe how well the instruction-tuned base LM can integrate multiple preference combinations when it is simply prompted for the preferences, without any additional training.
Multi-task Training (MT) For a competitive baseline, we use the positive candidate selected by GPT-4 as the output for imitation learning. This is essentially rejection sampling (Nakano et al., 2021a), with GPT-4 acting as the reward model that selects golden responses from the distribution of responses. We append the individual preference prompt to instances from Dtrain and multitask train the Tulu-7B model across all six individual preferences. This method also allows distilling the outputs of Tulu-30B for training Tulu-7B.
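To make the reward-model step concrete, the following is a minimal sketch of a pairwise preference loss. The text above does not state which loss is used; the Bradley-Terry style objective here is an assumption based on common RLHF practice, and the function name `pairwise_reward_loss` is hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the response preferred
    by the labeler (GPT-4 in the paper) above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

A reward model trained this way is pushed to rank the labeled-better response higher; the scalar it outputs would then serve as the reward signal during PPO.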
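The data construction for this baseline can be sketched as follows. This is an illustrative outline only: the `judge` callable stands in for GPT-4, and the field names (`prompt`, `responses`) and the function `build_mt_examples` are hypothetical, not taken from the authors' repository.

```python
from typing import Callable, Dict, List


def build_mt_examples(
    instances: List[Dict],
    preference_prompts: Dict[str, str],
    judge: Callable[[str, List[str], str], str],
) -> List[Dict[str, str]]:
    """Build imitation-learning examples for multi-task training.

    For every instance and every individual preference, the judge picks the
    positive candidate (rejection sampling), and the preference prompt is
    appended to the original prompt to form the model input.
    """
    examples = []
    for inst in instances:
        for pref_id, pref_text in preference_prompts.items():
            chosen = judge(inst["prompt"], inst["responses"], pref_text)
            examples.append(
                {
                    "preference": pref_id,
                    "input": f"{inst['prompt']} {pref_text}",
                    "target": chosen,
                }
            )
    return examples
```

With six individual preferences, each instance yields six training examples, so a single model sees all preferences during fine-tuning.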
4.2 EXPERIMENTAL DETAILS
For both reward model and policy training, we limit ourselves to a single pass over Dtrain (1 epoch). In our initial exploration, the end performance of the policy model did not improve even when we trained the reward model for more epochs. For policy model training, we use our evaluation dataset (50 prompts) to compute the average reward, and for our final evaluation we choose the policy model checkpoint with the highest average reward on this evaluation set. We use LoRA (Hu et al., 2022) for both reward model and policy model training. The detailed hyperparameters for reward model and policy model training are provided in our GitHub repository [5].
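The checkpoint selection rule described above can be sketched as follows. The checkpoint names and the `select_best_checkpoint` helper are hypothetical; in practice the per-prompt rewards would come from scoring each checkpoint's responses to the 50 evaluation prompts with the reward model.

```python
from statistics import mean
from typing import Dict, List


def select_best_checkpoint(checkpoint_rewards: Dict[str, List[float]]) -> str:
    """Return the checkpoint whose average evaluation reward is highest."""
    if not checkpoint_rewards:
        raise ValueError("no checkpoints to choose from")
    averages = {name: mean(rewards) for name, rewards in checkpoint_rewards.items()}
    return max(averages, key=averages.get)
```

For example, `select_best_checkpoint({"step_100": [0.1, 0.3], "step_200": [0.4, 0.6]})` picks `"step_200"`, since its average reward (0.5) exceeds that of `"step_100"` (0.2).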
4.3 MAIN RESULTS
Tables 3 and 4 show the results of all possible pairwise comparisons across the methods, using GPT-4 and humans as judges, respectively. Note that the win rate of each battle is calculated using the aggregated win rate explained in Section 3.3. The results for each individual preference combination are shown in Appendix D. Appendix B additionally reports the average criteria-wise win rate, instead of the aggregated win rate, across all of the methods.
The first thing to note is that prompting (PP) is limited in how far it can integrate multiple preferences. This means that training specifically for integrating multiple preferences is necessary to compose them. Next, we can see that supervised fine-tuning (MT) underperforms MORL-based methods, which is consistent with prior work showing the advantage of RL-based approaches over their supervised fine-tuning counterparts when aligning LLMs with human feedback. Finally, while P-MORL and P-SOUPS both outperform the other methods on average, there is a discrepancy between the simulated and human evaluations: P-SOUPS has the highest average win rate in GPT-4 evaluation, while P-MORL has the highest in human evaluation. Nonetheless, P-SOUPS shows superior performance compared to the baseline methods and performance competitive with P-MORL.
In previous parameter merging literature, multi-task fine-tuning (MT) was considered the upper bound for compositional abilities achieved through parameter merging (Don-Yehiya et al., 2022). However, in our scenario, parameter merging (P-SOUPS) outperforms multi-task fine-tuning. This is a promising result for parameter merging, not only as a distributed multi-task fine-tuning method but also as a method that can yield better performance than multi-task training.
Trade-off with General Helpfulness One might still wonder about the general ‘helpfulness’ capabilities of models that are trained to be tailored to multiple preferences. In Figure 3, we first show the average pairwise win rate from Table 3 in green. Next, we instead perform pairwise comparisons ONLY on the unseen objective ‘helpfulness’ (asking GPT-4 which model response it generally prefers) and report the average pairwise win rate in red.
RLHF performs the best in this scenario, which shows that there is no free lunch: while the objective of RLHF was to provide model responses that are generally preferred (highly correlated with ‘helpfulness’), the other methods were prompted or trained to optimize for the personalized preference aspects, possibly deviating from general helpfulness. MT, P-MORL, and P-SOUPS retain helpfulness similar to that of the initial instruction-tuned model (VB). In contrast, prompting (PP) significantly underperforms the other methods, which again highlights the limitation of simply prompting base or instruction-tuned models for personalized preferences and shows the need for specialized training methods for personalization.
4.4 SCALING TO NEW PREFERENCES
While we explore 6 distinct preferences in this work, we are still limited to ‘declarative’ personalization; that is, the individual preferences have been pre-defined in order to measure the performance of different methodologies. However, in the real world, individuals may not be bound by pre-defined preferences. Furthermore, people’s preferences might change over time, which requires continual learning of new preferences. This means that we may need to train on an unbounded number of preferences to be truly personalized to individuals’ preferences. Considering this aspect, the scalability of methods becomes a critical factor in implementing RLPHF in real-world scenarios.
In order to compare the scalability of P-MORL and P-SOUPS, we add two new preferences (in addition to the ones in Table 1) to the STYLE dimension: (P3C) “Generate/Choose a response (that answers) in a sassy manner.” and (P3D) “Generate/Choose a response (that answers) in a sarcastic manner.” This results in a total of 16 (2 × 2 × 4) unique preference combinations. We re-train P-MORL on the 16 new preference combinations, and only train two new policy models to integrate into P-SOUPS. Figure 4 shows the simulated win rate between P-MORL and P-SOUPS on each of the original preference combinations (the 53.04% win rate of P-SOUPS over P-MORL in Table 3, decomposed by preference combination) and on the 16 new preference combinations.
As shown in the figure, P-SOUPS performs competitively with P-MORL while being much more efficient: it (1) did not have to observe all 16 possible preference combinations and (2) did not have to re-train on the previous preferences. It only required training two new policies, one for each new preference, in a modular manner and merging their parameters on-the-fly during inference. Considering that P-MORL is bounded by O(2^n) while P-SOUPS is bounded by O(n), where n is the total number of preferences (assuming there are two unique preferences for each dimension), we assert that P-SOUPS makes tackling RLPHF feasible.
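The count of 16 combinations can be checked by enumerating one preference per dimension. In this sketch, the labels P3C and P3D come from the text; the remaining labels (P1A through P3B) and the dimension keys are assumed to follow the same naming pattern as Table 1.

```python
from itertools import product

# Two original dimensions with two preferences each, plus the STYLE
# dimension (assumed key "P3") extended from two to four preferences.
dimensions = {
    "P1": ["P1A", "P1B"],
    "P2": ["P2A", "P2B"],
    "P3": ["P3A", "P3B", "P3C", "P3D"],
}

# A preference combination picks exactly one preference from each dimension.
combinations = list(product(*dimensions.values()))
num_combinations = len(combinations)  # 2 * 2 * 4
```

Each element of `combinations` is a tuple such as `("P1A", "P2B", "P3C")`, which is the unit that a method conditioned on full combinations has to cover.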
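The modular merging and the cost comparison above can be illustrated with a toy sketch. Parameters are represented as plain lists of floats rather than LoRA weight tensors, and the merge is a weighted average with weights summing to 1; both the representation and the helper names are assumptions for illustration, not the authors' implementation.

```python
from math import prod
from typing import Dict, List

Params = Dict[str, List[float]]


def merge_policies(policies: List[Params], weights: List[float]) -> Params:
    """Weighted average of per-preference policy parameters (assumed merge rule)."""
    if len(policies) != len(weights):
        raise ValueError("need exactly one weight per policy")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    merged: Params = {}
    for name, values in policies[0].items():
        merged[name] = [
            sum(w * p[name][i] for p, w in zip(policies, weights))
            for i in range(len(values))
        ]
    return merged


def combinations_to_cover(sizes: List[int]) -> int:
    """Preference combinations a jointly trained policy must observe."""
    return prod(sizes)


def policies_to_train(sizes: List[int]) -> int:
    """Single-preference policies needed when merging at inference time."""
    return sum(sizes)
```

For dimension sizes `[2, 2, 4]`, `combinations_to_cover` gives 16 while `policies_to_train` gives 8, of which only the two new STYLE policies need to be trained when the six original ones already exist. Adding another dimension multiplies the first number but only adds to the second.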
[5] https://github.com/joeljang/RLPHF