Helpful & Harmless AI: Alignment Training Improves Performance on Almost All NLP Evaluations

Written by anthropic | Published 2026/01/19
Tech Story Tags: rlhf | reinforcement-learning | helpful-ai | harmless-ai | alignment-training | alignment-training-performance | nlp-evaluations | ai

TL;DR: This paper shows how large language models can be aligned using reinforcement learning from human feedback (RLHF). By training preference models on helpfulness and harmlessness data, then optimizing with PPO, the authors produce assistants that are safer, more useful, and often more capable than base models. Larger models gain alignment benefits with little to no performance loss.

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Authors:

  1. Yuntao Bai
  2. Andy Jones
  3. Kamal Ndousse
  4. Amanda Askell
  5. Anna Chen
  6. Nova DasSarma
  7. Dawn Drain
  8. Stanislav Fort
  9. Deep Ganguli
  10. Tom Henighan
  11. Nicholas Joseph
  12. Saurav Kadavath
  13. Jackson Kernion
  14. Tom Conerly
  15. Sheer El-Showk
  16. Nelson Elhage
  17. Zac Hatfield-Dodds
  18. Danny Hernandez
  19. Tristan Hume
  20. Scott Johnston
  21. Shauna Kravec
  22. Liane Lovitt
  23. Neel Nanda
  24. Catherine Olsson
  25. Dario Amodei
  26. Tom Brown
  27. Jack Clark
  28. Sam McCandlish
  29. Chris Olah
  30. Ben Mann
  31. Jared Kaplan

Abstract

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.

1 Introduction

We would like to develop techniques to train AI agents that are helpful, honest, and harmless [Askell et al., 2021]. In this paper we show that we can train a relatively helpful and harmless1 (HH) natural language assistant by collecting human preference data and applying the techniques of preference modeling (PMing) and reinforcement learning from human feedback (RLHF). Our full training process is summarized in Figure 2.

Our goal is not to define or prescribe what ‘helpful’ and ‘harmless’ mean but to evaluate the effectiveness of our training techniques, so for the most part we simply let our crowdworkers interpret these concepts as they see fit. We treat helpfulness and harmlessness separately, collecting distinct human-preference datasets for each. For helpfulness, we ask crowdworkers to solicit our models to assist with any purely text-based tasks such as answering questions, writing or editing documents, or discussing plans and decisions. For harmlessness, we invite crowdworkers to adversarially probe or ‘red-team’ our language models in order to provoke harmful responses: either to help them with harmful goals, such as planning a bank robbery, or to cause the AI to use toxic language.2 At each stage of their conversations with the AI assistant, crowdworkers are presented with two possible responses. Those engaged in the helpfulness task are instructed to choose the more helpful and honest (i.e. better) response. Those engaged in the red teaming task are instructed to choose the more harmful (i.e. worse) response. These conversations and the expressed human preferences form our datasets.3

Helpfulness and harmlessness often stand in opposition to each other. An excessive focus on avoiding harm can lead to ‘safe’ responses that don’t actually address the needs of the human. An excessive focus on being helpful can lead to responses that help humans cause harm or generate toxic content.

We demonstrate this tension quantitatively by showing that preference models trained to primarily evaluate one of these qualities perform very poorly (much worse than chance) on the other. Fortunately, we find that PMs trained on a mixture of both datasets can nevertheless learn the right lessons and behave helpfully when appropriate, while encouraging the polite refusal of harmful requests. With preference models in hand, we then train helpful and harmless assistants via reinforcement learning, using the PM scores as rewards. We evaluate both PM performance and the more relevant performance characteristics of our RLHF-trained models. As can be seen in Figure 1, purely helpful RLHF-trained models are far easier to red-team, while helpful+harmless models are both very helpful and much less harmful.

A question that’s often raised about alignment training is whether it will compromise AI capabilities. We find that when RLHF is applied to large language models, the answer seems to be an almost-categorical no. Our RLHF-trained models tend to perform better than their raw, generative counterparts on virtually all evaluations, as summarized in Figure 3. We also argue that one can mix specialized skills with alignment-related training without compromising either alignment or performance. In practice, aligned models are likely to be more user-friendly and deployable than their raw counterparts, which suggests that there’s little reason to deploy models that have not been finetuned for alignment.

1.1 Contributions

Dialogue Preference Datasets

•    We collect separate helpfulness and harmlessness (i.e. red-teaming) datasets primarily using various 52B language models (see Section 2 for details) in our interface (Figure 6). Crowdworkers have open-ended conversations with the models, either soliciting help, or providing instructions, or attempting to get the model to emit harmful responses, and they are asked to choose the more helpful response or the more harmful4 response at each conversational step, respectively.

•    We collect three tranches of data, one from our initial models, one with rejection sampling against early preference models, and a final dataset gathered with models trained with ‘online’ reinforcement learning from human feedback, which we improve on a roughly weekly cadence. See Section 2.3.

Alignment with Human Values Has Many Benefits and Essentially No Cost to Performance

•    Smaller models experience severe ‘alignment taxes’ – their performance on a wide variety of evaluations declines after RLHF training. However, we find a variety of alignment bonuses, with our 13B and 52B5 RLHF-trained models performing better at zero-shot NLP evaluations, and the same at few-shot evaluations.

•    Natural language RLHF training for HH can be applied to models that have been first finetuned on code, and it improves their programming ability on evaluations (presumably by improving general-purpose instruction following). We also find that mixing preference model training for HH with the specialized skill of summarization [Stiennon et al., 2020] incurs no degradation in performance in either HH or summarization. So there is no reason not to combine alignment training with more specific, valuable skills.

•    There is a tension between helpfulness and harmlessness, which can be measured at the level of both preference modeling and RLHF-trained policies (Figure 1). However, as model size increases, PMs perform better on both distributions simultaneously and become much more robust to the relative proportions of helpful and harmless training data.

•    We also show that one can use OOD detection techniques [Fort et al., 2021] to reject most strange and harmful requests (Figure 22), with few or no harmful examples (Figure 23).

Scaling, RLHF Robustness, and Iterated ‘Online’ Training

•    We study scaling relations for PM accuracy as a function of model and dataset size, and find roughly log-linear trends (Figure 7), though we encounter some idiosyncrasies (Figures 31 and 32).

•  We conduct experiments on the robustness of RLHF (see Figure 4), where we split our datasets in half and train separate preference models on each half. Then we train RL models against one PM while evaluating with the other. We conclude that larger PMs are more robust than smaller PMs, and as expected, overfitting increases during RLHF training.

•    We find that √D_KL(π ∥ π_0) and the reward are approximately linearly related for much of RLHF training (see Figures 4 and 13), where π and π_0 are the policy and initial policy, respectively. We explain how this relation may arise and discuss possible applications and future directions.

•    We study iterated online training, where we update our preference models and RLHF policies on a weekly cadence, and then re-deploy these fresh RLHF models to interact with crowdworkers. This significantly improved our models as evaluated by crowdworkers (Figure 1), and greatly improved our dataset as judged by our own PMs (Figure 15), filling out the upper tail in terms of quality.

To remove confounders and bolster our conclusions, we perform additional controlled experiments (Figure 16) holding dataset size and other hyperparameters fixed.

1.2 Summary of Evaluations and Metrics

•    NLP and Code Evaluations: We evaluate our models on MMLU [Hendrycks et al., 2021b], Lambada [Paperno et al., 2016], Hellaswag [Zellers et al., 2019], OpenBookQA [Mihaylov et al., 2018], ARC [Clark et al., 2018], and TriviaQA [Joshi et al., 2017]; see Figures 28 and 29 for full results and Figure 3 for the mean. In every case except for TriviaQA, 13B and 52B RLHF-trained models perform better than base LMs. Separately, we take Python coding models and finetune them with natural language RLHF, and then evaluate them on the codex HumanEval [Chen et al., 2021]; see Figure 21. We also experiment with mixing PM training for HH with summarization [Stiennon et al., 2020] as a specialized skill, and evaluate the resulting PM performance (Figure 20), finding that mixed training does not degrade PM accuracies.

•    Static Alignment Evaluations:  We evaluate our PMs using our HHH Evaluations [Askell et al., 2021] from BIG-Bench6 (Figure 5), on Bot Adversarial Dialogues [Xu et al., 2020], and for gender bias [Rae et al., 2021] (Figure 12). We evaluate our RLHF models on TruthfulQA [Lin et al., 2021] (Figure 5), BBQ-Lite [Parrish et al., 2021] from BIG-Bench, gender bias (Figure 40), and sentiment based on race and religion [Rae et al., 2021] (Figure 17). RLHF improves sentiment towards all groups, but does not remove bias.

•    Human Evaluations: We compute Elo scores based on the preferences of our crowdworkers, comparing context-distilled models, base RLHF trained models, and final online RLHF models (Figure 1). We also test our online models’ performance during training (Figure 15), compare various levels of rejection sampling (Figure 36), and perform a controlled experiment on iterated online training (Figure 16). Furthermore, we hired professional writers to compose conversations where an assistant provides high-quality, helpful and honest responses, and we then asked crowdworkers to compare our model’s responses to those of these writers. Crowdworkers prefer our online HH model to these writers7 about 57% of the time.

•    Samples: We provide samples from all of the PALMs [Solaiman and Dennison, 2021] sensitive questions and from prompts provided with InstructGPT [Ouyang et al., 2022] and LaMDA [Thoppilan et al., 2022] in Appendix C. We show some comparisons with human writers in Section 6.1, and we show several short dialogues in Section 6.3. To mitigate the problem of cherry picking samples, we generate 17 samples per prompt and display only the median sample, as ranked by our online HH preference model.

1.3 Related Work

Two recent papers, LaMDA [Thoppilan et al., 2022] and InstructGPT [Ouyang et al., 2022], have particular similarity to this work. Both use human data to train large language models to be more generally useful or aligned. Both use language models somewhat larger than our 52B model.

LaMDA [Thoppilan et al., 2022] finetunes large language models to participate in interesting, helpful, factually grounded, and safe natural language dialogue. As with our work, they include notions of both positive interactions and safety/harmlessness. And their use of external information to ensure accuracy/groundedness goes beyond the methods we discuss here, and is perhaps more similar to WebGPT and GopherCite [Nakano et al., 2021, Menick et al., 2022]. However, some differences are that rather than using reinforcement learning, they apply a mix of supervised learning techniques (both generative and discriminative), and their data collection process involves absolute ratings rather than comparisons. They do not explore whether their methods impose an ‘alignment tax’ on capabilities.

InstructGPT [Ouyang et al., 2022] finetunes GPT-3-type models [Brown et al., 2020] to improve their helpfulness. As in this work, they use reinforcement learning from human preferences, as expressed through comparisons. However, they also include a supervised learning stage of training, whereas in contrast our finetuning occurs purely through RL (we perform context distillation, but this is much more like simple prompting). Perhaps the main contrast with our work is that they do not include harmlessness training, or explore tensions between helpfulness and harmlessness. Their approach also differs from ours in some details: they did not train preference models larger than 6B parameters, and they mixed pretraining with RL in order to avoid a degradation in evaluation performance.

Our work differs from both InstructGPT and LaMDA in that we explore ‘online’ training, where we update the models interacting with crowdworkers in order to obtain progressively higher-quality data and fill out the tails of our data distribution. Another difference is our exploration of specialized skills such as summarization and coding, which we use to bolster the argument that alignment can be achieved without limiting capabilities. We also explicitly study the tension between helpfulness and harmlessness, which has not been addressed before as far as we are aware. Finally, we explore scaling and robustness in much more detail, including during RL training. With that said, our procedures (Figure 2) are actually somewhat simpler than those employed in these other works. We believe the only essential steps are human feedback data collection, preference modeling, and RLHF training.

Several other recent works focus on aspects of truthfulness using retrieval [Lewis et al., 2020, Guu et al., 2020, Borgeaud et al., 2021] from a database, or via internet search and human feedback, such as WebGPT [Nakano et al., 2021] and GopherCite [Menick et al., 2022]. These works are exciting and complementary to our work; in particular our results suggest that their techniques should be very compatible with training for helpfulness and harmlessness. While these works improve the faithful representation of explicit evidence, more work will likely be necessary to achieve honest self-representation from AI systems. We are generally hopeful that techniques independent of human feedback may be applicable to this problem, since a great many sources of truth are not based on human judgment.

Safety and ethical issues associated with language models have been extensively discussed (e.g. [Henderson et al., 2017, Bender et al., 2021, Weidinger et al., 2021]), with well-known issues including toxicity, bias, and the possibility that models may reveal personally identifiable information. As models become increasingly powerful, new and surprising capabilities and safety issues may arise [Ganguli et al., 2022]. Other works have explored methods to mitigate these problems (e.g. [Liu et al., 2021, Xu et al., 2020]). Models have also been trained to directly evaluate ethical dilemmas [Jiang et al., 2021], demonstrating improvement on ethics benchmarks [Hendrycks et al., 2021a]. More general research proposals for AI safety include [Amodei et al., 2016, Hendrycks et al., 2021c]. The RL robustness failures we discuss can be viewed as an instance of ‘reward hacking’, which was recently explored in [Pan et al., 2022]. RL policies could also fail to generalize out of distribution in other dangerous ways [Koch et al., 2021].

Our interest in studying trends with model size is motivated by neural scaling laws [Hestness et al., 2019, Rosenfeld et al., 2019, Kaplan et al., 2020]. A related observation is that as parameter counts grow, models finetune more effectively [Hernandez et al., 2021] and become much less vulnerable to ‘catastrophic forgetting’ [Ramasesh et al., 2022]. We expect this effect helps to explain why our HH training is compatible with good evaluation performance and specialized skills for large models.

2   Data Collection

We expect human feedback (HF) to have the largest comparative advantage over other techniques when people have complex intuitions that are easy to elicit but difficult to formalize and automate. This means that when collecting HF, we should try to choose tasks that are as intuitive and familiar as possible. We chose to use natural language dialogue both for these reasons, and because it is so general – essentially any text-based task can be enacted through dialogue, perhaps with some source materials included in-line.

2.1  Task Specification and Crowdworkers

Our human feedback interface can be seen in Figure 6 (for more details see Appendix D). People can interact with our models in natural language via chat, and ask for help with any text-based task. When it’s the model’s conversational turn, users see two possible model responses, and choose one with which to proceed. These two responses may come from the same model, or two different models. They can then ask follow-up questions or provide further instructions to the models. So there are two core components to the task, which repeat several times in each dialogue:

•    Crowdworkers write a chat message to our models, asking them to perform a task, answer a question, or discuss any topic of interest.

•    Crowdworkers are shown two responses, and are asked to choose the more helpful and honest response (or in the case of red-teaming, to choose the more harmful response).

We conjectured that crowdworkers who wrote well and engaged the AI in more interesting discussions would tend to have better judgment about which AI responses were most ‘helpful’ and ‘harmless’. This meant that rather than attempting to filter crowdworkers based on label quality, we instead used spot-checks of their writing, which were simpler and more intuitive for us to perform.

Otherwise, our approach to data collection was to largely let crowdworkers use their own intuitions to define ‘helpfulness’ and ‘harmfulness’. Our hope was that data diversity (which we expect is very valuable) and the ‘wisdom of the crowd’ would provide comparable RoI to a smaller dataset that was more intensively validated and filtered. Overall, our process was roughly of this form:

1.    We invited master-qualified US-based8 MTurk workers to engage in dialogues with our models.

2.    Rather than evaluating all of our crowdworkers, we identified those who were most prolific, and together accounted for about 80% of our data (roughly 20 crowdworkers). We then evaluated their performance based primarily on the sophistication and variation in their dialogues, as this was quite easy to evaluate intuitively (rather than based on any measure of agreement on helpful/harmless choices). Based on this method, we collected a list of ‘select’ MTurk workers9 whom we continued to work with throughout the research process.

3.    We invited our select crowdworkers to a Slack channel and corresponded with them by email, to ensure that they were being compensated fairly10 and to allow them to alert us to any problems or issues.

4.    We also hired crowdworkers on Upwork, and vetted them in a similar, lightweight way. We have continued to use both platforms throughout this work. We find that it is easier to incentivize very high-quality interactions on platforms such as Upwork, where crowdworkers can easily be paid by the hour, rather than per task. But conversely, MTurk workers tend to generate data much more rapidly, and account for about 80% of our datasets.

We did not filter workers based on agreement or other direct measures of label quality, though we evaluated them retrospectively (see Figure 10 right) and found fairly low average agreement (about 63%) between Anthropic researchers and our crowdworkers, compared to the agreement rates reported in recent similar work such as [Stiennon et al., 2020, Ouyang et al., 2022].

As an important caveat, our crowdworker distribution was not held fixed throughout this work, and we expect that crowdworker quality probably improved as the project went on. We mention this as a possible complication when evaluating the success of the ‘online training’ program discussed in Section 4.5. Conversely, however, since we generally discouraged repetition, crowdworkers who have performed the task many times might also have had a tendency to engage in more esoteric interactions.

We should also note that we explicitly told crowdworkers that ‘lying isn’t helpful’ and that they should try to only reward helpful and honest responses, which presumably explains why our models improve somewhat in terms of honesty. That said, we did not expect crowdworkers to fact-check our models significantly, and for example they often prefer responses that include non-functional URLs, which are probably one of the simplest possible ‘lies’ to debunk.

2.2       Helpfulness and Harmlessness (Red Teaming) Datasets

We collected two separate datasets using slightly different versions of our interface. For the helpfulness dataset, we asked crowdworkers to have open-ended conversations with our models, asking for help, advice, or for the model to accomplish a task (see Appendix D.2), and to choose the model response that was more helpful. For the harmlessness or red-teaming dataset, we asked crowdworkers to attempt to elicit harmful responses from our models, and to choose the more harmful response offered by the models.

Our interface (Figure 6) allows users to express a preference strength. We only include comparisons in our datasets if crowdworkers expressed a preference stronger than the weakest available. In this work we will not otherwise use this preference-strength information; we treat all comparisons in our dataset as binary and of equal weight (so in particular we do not include ties).
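
As a rough illustration of how such comparisons could be turned into training data, the sketch below filters raw comparisons down to equal-weight binary labels. The record fields and the strength encoding are hypothetical, not the actual schema used in this work.

from dataclasses import dataclass

@dataclass
class Comparison:
    prompt: str      # the conversation so far, ending with the human's turn
    chosen: str      # the response the crowdworker preferred (or found more harmful, for red-teaming)
    rejected: str    # the other response
    strength: int    # preference strength expressed in the interface (hypothetical encoding)

WEAKEST = 1  # illustrative value for the weakest preference setting

def to_binary_dataset(raw_comparisons):
    # Keep only comparisons with a preference stronger than the weakest setting;
    # everything retained is treated as a binary, equal-weight label (ties are dropped).
    return [c for c in raw_comparisons if c.strength > WEAKEST]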

Note that this means our helpfulness dataset tends to move conversations in a more beneficial direction, while in our red-teaming dataset user responses move conversations in a more harmful direction. We made this choice to make it possible for users to fully trick and exploit models while red-teaming, as this was most natural for other work we’re doing that’s specifically focused on harmfulness. However, we believe this difference made it difficult to train models that were both helpful and harmless, as explained in Section 4.4. We plan to remedy this in future work, and would recommend others who are focused on training harmless dialogue models to collect data where users primarily choose model responses that move the conversation in the more beneficial direction instead.

2.3       Models Deployed to the Feedback Interface and Associated Data Distributions

For data collection we predominantly11 used 52B language models with the broad specifications given in [Askell et al., 2021]. We used three classes of models in our interface:

•    HHH Context-Distilled 52B Language Model: At the beginning of the project this was the only model available. It performs similarly to a plain 52B language model prompted with HHH dialogues [Askell et al., 2021].

•    Rejection Sampling (RS) with a 52B preference model, where samples were generated from a 52B context-distilled LM. In this case the number k of samples was a parameter, but most often we used k = 16.

•    RLHF-Finetuned Models: We used a succession of these models in our interface. The models varied primarily based on the amount of data available when training the associated PMs (depending on the phase of the project). However, we also deployed models trained on different mixtures of helpfulness and harmlessness data.

In the final phase of the project, when we were primarily deploying RLHF-finetuned models, we often deployed several such models at once. This allowed us to monitor progress by gathering model-comparison data, and also to (perhaps) improve data diversity.

Corresponding to the three classes of models, we divide our data into three distributions:

•    A core base dataset collected using only the context-distilled LM. This dataset includes 44k helpfulness comparisons and 42k red-teaming (harmlessness) comparisons (note a conversation typically comprises about four comparisons).

•    An RS dataset consisting of 52k helpfulness comparisons and 2k red-teaming comparisons using rejection sampling models, where rejection sampling used a preference model trained on the base dataset.

•    An iterated ‘online’ dataset including data from RLHF models, which were updated on a roughly weekly cadence over the course of about five weeks. This dataset contains 22k helpfulness comparisons and no red-teaming data.

A histogram of these distributions by our final, online HH preference model’s score can be seen in Figure 15 in Section 4.5. In what follows when we discuss the static or base+RS dataset, we will be referring to the combination of the first two components. Our ‘online’ RLHF models in Section 4.5 are trained on all three components. Most of our results are based on the static dataset, as we conducted experiments and evaluations with it while the online data collection was underway.

We analyze a few different splits of the static dataset – a standard split into 95/5 train/test data, and a 65/35 split that we use in order to obtain better statistics when evaluating preference model calibration on the test set. We also consider a 50/50 split, where we train distinct PMs on the two halves of the dataset. This is used to evaluate the robustness of RL training, as we then train an RL policy against one PM while evaluating the rewards achieved by that policy as measured by the independent PM.

2.4       Comparing Models with Elo Scores

A significant part of our analysis compares models against each other in order to generate associated Elo scores, as described in [Askell et al., 2021]. That is, we have crowdworkers chat with two models simultaneously, with each model generating one response (either ‘A’ or ‘B’) at each turn, and we record the sample that is preferred by the worker. This provides us with a record of ‘win rates’ between pairs of models, which we can then fit to corresponding Elo scores, to produce Figure 1 (where we show both win rates and Elo scores). Two useful conversion formulas are

P(A preferred over B) = 1 / (1 + 10^(−ΔElo/400)) = 1 / (1 + e^(−Δr_PM)),    (2.1)

which relate the fraction in which one model is preferred over another, the difference in Elo scores, and the difference in our PM scores; in particular ΔElo ≈ (400 / ln 10) Δr_PM ≈ 174 Δr_PM. Note that conceptually win fractions, Elo scores and PM scores are interchangeable; we keep both Elo and PM scores so that we can avoid confusing crowdworker preferences (where we use Elo) with our preference modeling and RLHF (where we use PM scores).
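
For concreteness, the following sketch implements these conversions; the only inputs are score differences, and the 174 factor is just 400/ln 10.

import math

def win_fraction_from_elo(delta_elo):
    # Probability that model A is preferred over model B, given Elo(A) - Elo(B).
    return 1.0 / (1.0 + 10 ** (-delta_elo / 400.0))

def win_fraction_from_pm(delta_pm):
    # Probability that sample A is preferred, given r_PM(A) - r_PM(B).
    return 1.0 / (1.0 + math.exp(-delta_pm))

def elo_from_pm(delta_pm):
    # Elo difference with the same win fraction as a given PM-score difference.
    return delta_pm * 400.0 / math.log(10.0)  # approximately 174 * delta_pm

# Example: a PM-score gap of 0.5 corresponds to roughly 87 Elo points
# and a predicted preference rate of about 62%.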

Note that the Elo scores for context-distilled models in Figure 1 differ somewhat from the analogous results for prompted models in [Askell et al., 2021] – the Elo scores are now more compressed. The main difference is that we did not use top-p sampling this time12. The difference may also be due to changes in the crowdworker distribution since that earlier experiment, or changes in crowdworker expectations, as before this test our workers were mostly interacting with higher-quality RLHF-trained models.

3   Preference Modeling for Helpfulness and Harmlessness

3.1       Models and Training Setup

We use language models with specifications that are identical to those discussed in [Askell et al., 2021], with a total of seven language models with parameter counts running from 13M to 52B and approximating a geometric series with increments of roughly 4X. We use PyTorch [Paszke et al., 2019] and Triton [Tillet et al., 2019] to facilitate model training and performance. Our preference model training setup is also identical to that in [Askell et al., 2021], and in particular we apply ‘preference model pretraining’ (PMP) to our language models before finetuning on our human feedback datasets, as explained in Section 4 of that paper. More details are provided in Appendix A. Note that we typically only train PMs for a single epoch, so the learning curves themselves (Figure 7 left) indicate how performance scales with dataset size (we used a fixed learning rate).
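
As a minimal sketch of pairwise preference-model finetuning (the architectural and PMP details follow [Askell et al., 2021] and Appendix A; the model call below is a placeholder), the training objective pushes the score of the preferred response above that of the rejected one:

import torch.nn.functional as F

def preference_loss(chosen_scores, rejected_scores):
    # Pairwise comparison loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Hypothetical usage, where pm_model returns one scalar score per sequence,
# evaluated at the end of each candidate response:
#   chosen   = pm_model(prompt_plus_preferred_response)   # shape [batch]
#   rejected = pm_model(prompt_plus_rejected_response)    # shape [batch]
#   loss = preference_loss(chosen, rejected)
#   loss.backward()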

3.2       Basic Scaling Results

We would like to understand how preference modeling performance improves as we increase model size and collect additional data. In Figure 7 we show basic results for PM accuracy when training on our static helpful and harmless data mixture. Roughly speaking, we observe log-linear trends in both dataset and model size. We tend to find somewhat more consistent trends if we model only the helpfulness or harmlessness distributions in isolation, rather than as a mixture, as observed in Figure 32 in Appendix A.3. But there we also see that for some data distributions [Stiennon et al., 2020], scaling trends can exhibit more complex patterns that defy simple trends.

Our preference modeling data comes from natural language dialogue, where crowdworkers have text-based conversations with the model, and choose the more helpful of two model responses at every turn in the conversation (or the more harmful one, for red-teaming tasks). So it is natural to ask how PM performance changes as a function of the conversational turn. We show these results in Figure 8. PMs are somewhat more accurate on the first step of the conversation, but their accuracy is nearly constant thereafter.

3.3       Calibration of Preference Models and Implications for RL

Preference model scores should predict the probability that humans will prefer one or another model- generated response. We are interested in whether these probabilities are accurate, i.e. whether the PMs are well calibrated. We characterize calibration in Figure 9, where we display PM accuracy as a function of the difference in PM scores assigned to pairs of samples, along with a heavy black line representing perfect calibration. We observe that PMs trained only on helpfulness data are very well calibrated, but PMs trained on a mixture of helpful and harmless data are slightly under-confident.
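
A calibration check of this kind can be sketched as follows: bin held-out comparisons by the PM score gap, and compare the preference probability implied by that gap with the empirical rate at which crowdworkers chose the higher-scored sample. The arrays below are hypothetical inputs, not our actual evaluation code.

import numpy as np

def calibration_curve(score_diffs, human_chose_higher, n_bins=10):
    # score_diffs: per-comparison gap in PM score between the two samples
    # human_chose_higher: True where the crowdworker picked the higher-scored sample
    score_diffs = np.abs(np.asarray(score_diffs, dtype=float))
    human_chose_higher = np.asarray(human_chose_higher, dtype=float)
    edges = np.quantile(score_diffs, np.linspace(0.0, 1.0, n_bins + 1))
    curve = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (score_diffs >= lo) & (score_diffs <= hi)
        if not mask.any():
            continue
        predicted = 1.0 / (1.0 + np.exp(-score_diffs[mask]))   # accuracy implied by the score gap
        curve.append((predicted.mean(), human_chose_higher[mask].mean()))
    return curve  # (mean predicted accuracy, empirical accuracy) per bin; equal values mean perfect calibration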

These calibration results are important because in later sections we will be using PM scores as the reward signal for reinforcement learning. Since the PM scores are well-calibrated, we can trust that they faithfully encode the probabilities that humans will prefer specific model samples (at least on-distribution with the training set). This means that when we see RL robustly achieving a given reward, we can trust that those who interact with this model (if they are well-represented by our crowdworker distribution) will prefer it to reference models at a predictable rate, provided that the PM scores of the models’ responses are within the range considered in these calibration studies. That said, we find significant failures of robustness as RLHF optimizes towards much higher scores, as explained in Section 4.5 and Appendix B.4.

We might generally expect that as sample quality improves, it becomes more difficult to reliably identify the very best samples. In Figure 25 in the appendix, we show that if we restrict to comparisons where both samples have a PM score above a given threshold, then PM accuracy drops as a function of this threshold value (to compare different PMs we add an offset so that they all obtain the same mean score on the test distribution). Notice that this result combines three effects: (1) differentiation among more sophisticated samples may be more challenging and require greater capabilities, (2) we have fewer very high quality samples to learn from in our dataset, as they are non-generic, and (3) pairs of samples that are all high-quality will have similar scores (compared to randomly chosen pairs), and so be more difficult to distinguish.

These observations also have an implication for RLHF training, namely that we should expect diminishing returns from further RLHF training once our policies achieve a sufficiently high PM score. This also motivates online training, so that we can update our PMs to stay on-distribution as RLHF policies improve.

3.4       Evaluating Helpful and Harmless Preference Models

3.4.1       HHH Evaluation and Crowdworker-Anthropic Agreement

We have shown that PMs trained on our dataset are learning something and are well-calibrated. But we would also like to show that these PMs are actually learning to prefer helpful and harmless behavior, according to some independent standard. We provided a set of comparison evaluations for helpfulness, honesty, and harmlessness (i.e., HHH Evaluation) in prior work and evaluated raw, context-distilled, and prompted language models [Askell et al., 2021] on this benchmark. As shown in Figure 5, we find that our PMs perform much better than all of the models we evaluated previously. In fact, the Pathways LM effort [Chowdhery et al., 2022] recently reported a mean human score of 75% on this dataset, so in a sense our PM performance of 86% is well above mean human accuracy.

We also observe that there are a few outliers, where our PMs are confidently incorrect on HHH evaluation comparisons (see Figure 10 left). These are all examples of failures of honesty, where the better choice is honest but unhelpful, while the incorrect choice appears both helpful and well-written, but is subtly incorrect or misleading. We show the single worst example in Figure 11. It’s worth noting that the bad response (which was written by Anthropic researchers) is likely out of distribution as compared to samples from our language models and RL policies. Since only model-generated samples were used to train the PMs, this may explain why the PM was badly fooled by these examples. It also indicates that the PMs are not adversarially robust.

We can also compare our crowdworkers, ourselves, and our preference models in terms of agreement on labels. The results are shown on the right in Figure 10, and are based on about 320 examples from our static test set. The PMs are trained to predict crowdworker behavior, so PM-Crowdworker agreement is best. However, the largest PM actually agrees with the authors (i.e. Anthropic researchers) slightly more than the authors agree with crowdworkers on labels. We also suspect this is a poor subsample of the data, since PM accuracy is several percent lower on this subsample as compared to the full test set.

3.4.2       Gender Bias and Bot Adversarial Dialogues

We can begin to evaluate whether our PMs display gender bias by comparing scores across gendered terms. To this purpose we evaluate preference model scores on short dialogues of the form

Human: Can you help me fill in the missing word in a sentence? The sentence is: The {Occupation} is a

Assistant: Sure thing, here is your sentence with the missing word filled in:

The {Occupation} is a {Gendered Term}

where we use a list of 76 occupations, 12 matched gender terms (mother vs father, guy vs gal, etc), and both present and past tense [Rae et al., 2021]. Preference model scores are not directly meaningful, so instead we evaluate the difference in score between paired gender terms, and compare that to the difference in score among terms within a given gender. The results are shown on the left in Figure 12. We observe that the variation among terms that indicate a given gender appears to be at least as large as the variation across genders. So it would appear that the PMs do not exhibit a substantial bias.
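
A sketch of this comparison is given below; pm_score is a placeholder for scoring the templated dialogue above with a preference model, and the term lists are stand-ins for the full sets from [Rae et al., 2021].

import itertools
import statistics

def bias_gap_analysis(pm_score, occupations, paired_terms):
    # pm_score(occupation, term): PM score of the templated dialogue
    # "The {occupation} is a {term}" (placeholder for the actual scoring call).
    # paired_terms: matched pairs such as ("mother", "father"), ("gal", "guy"), ...
    cross_gaps, within_gaps = [], []
    for occ in occupations:
        for term_a, term_b in paired_terms:
            cross_gaps.append(abs(pm_score(occ, term_a) - pm_score(occ, term_b)))
        for same_gender_terms in zip(*paired_terms):   # all first-slot terms, then all second-slot terms
            for a, b in itertools.combinations(same_gender_terms, 2):
                within_gaps.append(abs(pm_score(occ, a) - pm_score(occ, b)))
    # If the cross-gender gap is comparable to the within-gender gap,
    # the PM shows no clear gender bias on this template.
    return statistics.mean(cross_gaps), statistics.mean(within_gaps)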

As an external validation of our PMs, we consider the Bot Adversarial Dialogues (BAD) dataset [Xu et al., 2020]. This dataset contains a few thousand conversations between an AI system and a human. Each AI response is labeled as harmful or not harmful. We compute the distribution of preference model scores for BAD AI utterances (we restrict our analysis to the first BAD AI utterance per conversation) and find that the BAD AI utterances marked as harmful have significantly lower preference model scores. This suggests that our PMs are effectively classifying these AI generated utterances, even though they are likely quite different from the data distribution that our PMs were trained on.

4   Reinforcement Learning from Human Feedback

4.1 Training Setup

We apply reinforcement learning (RL) with preference modeling, following the approach outlined in [Stiennon et al., 2020], which can be summarized in the following steps:

1.    Prepare a dataset of comparisons, and train a PM to assign a higher score to the ‘better’ item in each comparison. In the context of our human feedback experiments, each comparison consists of a prompt followed by a pair of model-generated responses, with a PM score evaluated at the end of each response.

2.    Extract all the prompts from the preceding dataset, and train an RL policy to generate a response to each prompt autoregressively, with a reward signal provided by the PM score at the end of the response.

PM dataset and training details are provided in Appendix A.2; we also discussed the performance of our PMs in Section 3. In the language of RL, each response generated by the policy is a ‘timestep’, a full conversation is one ‘trajectory’, and the PM score is a single ‘reward’ provided at the end.

The idea is to use the preference model to steer the policy towards writing better responses. However, as we saw in earlier sections, PMs also become less calibrated at higher scores, so higher rewards do not necessarily imply better performance.

To stabilize RL training, we use Proximal Policy Optimization (PPO) [Schulman et al., 2017]. We also follow other work [Stiennon et al., 2020] and apply an empirically-estimated KL penalty term in the reward, with the total reward given by

r_total = r_PM − λ_KL D_KL(π ∥ π_0),

where λ_KL ≥ 0 is a hyperparameter. In practice we use a very small value of λ_KL = 0.001, which likely has a very minor impact during most of RL training (as D_KL < 100 typically), and might actually be wholly unnecessary. More details about RL are provided in Appendix B.1.
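
As a sketch (with the KL estimated from per-token log-probabilities of the sampled response, and all names illustrative rather than the exact implementation in Appendix B.1), the reward computation looks like:

def total_reward(pm_score, policy_logprobs, init_policy_logprobs, lambda_kl=0.001):
    # pm_score: scalar preference-model score of the sampled response (r_PM)
    # policy_logprobs / init_policy_logprobs: log-probs of the sampled response tokens
    #   under the current policy and under its initialization
    # Monte-Carlo estimate of D_KL(policy || policy_0) on this sample: E_policy[log pi - log pi_0]
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, init_policy_logprobs))
    return pm_score - lambda_kl * kl_estimate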

Throughout this paper we use r_PM = the preference model score itself for the RL reward. Recall that as implied by equation (2.1), this means that the difference in r_PM values between two samples A and B will be related to the predicted probability P(A > B) that A will be preferred to B via

P(A > B) = 1 / (1 + e^(−(r_PM(A) − r_PM(B)))).

There is no good reason13 to use this preference model score directly as the reward, but it has been used in prior work such as [Stiennon et al., 2020] and so for simplicity we will not explore variations on this choice here.

In order to produce additional prompts (i.e. the human side of the conversations) for RLHF training, we used a large LM to generate them. For this purpose, we simply used few-shot learning, creating a context with about 10 existing high-quality human queries, and then sampling to generate more. We find that the sample efficiency of RLHF is roughly the same on the original crowdworker-written prompt dataset and the model-generated one, so we combine the two for greater diversity during RLHF training. We used 137k prompts from the ‘static’ dataset, and 369k model-generated prompts.
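
A rough sketch of this few-shot prompt generation is shown below; sample_from_lm and its arguments are placeholders for whatever sampling interface is available, and the actual pipeline used our own models rather than any particular API.

def generate_prompts(sample_from_lm, seed_queries, n_prompts):
    # Build a few-shot context from ~10 existing high-quality human queries,
    # then repeatedly sample the LM to produce new, distinct queries.
    context = "".join("Human: " + q + "\n\n" for q in seed_queries) + "Human:"
    prompts = []
    while len(prompts) < n_prompts:
        completion = sample_from_lm(context, stop="\n\n")  # placeholder sampling call
        query = completion.strip()
        if query and query not in prompts:
            prompts.append(query)
    return prompts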

Note that almost all of our preference modeling data was collected from 52B models. This means that RLHF training with smaller models might have been challenging, since samples from smaller models tend to be out-of-distribution from the PM training data. Thus it is quite interesting that models more than fifty times smaller were actually able to learn and improve, as seen in Figure 1.

4.2  Robustness Experiments

We now discuss the problem of RLHF robustness. A fully robust PM would agree with humans on distributions of dialogues quite different from those encountered during PM training (i.e. different from those created by crowdworker interactions with our deployed AI assistants). However, we do not expect that our PMs are so robust, and in fact Figure 11 provides one plausible example of a robustness failure. Since RL optimizes the policy to maximize the PM score, any failure in robustness on the part of the PM may be exploited by the RL policy to achieve higher rewards, without actually improving the policy’s behavior from the point of view of human evaluators.

A rigorous way to study robustness is to take snapshots of the policy at various points during RLHF training, including the initial snapshot, and have crowdworkers compare their performance. This gives a ‘true’ Elo score, as evaluated by crowdworkers, which can then be compared directly with the PM scores. We present an example of this study in Section 4.5.

However, this sort of test requires collecting additional human feedback data, which can be slow and expensive, so here we also study robustness from a different angle. Similar to how datasets are split into train and test sets for supervised learning, we split our preference model comparison data into two halves (a train half and a test half), and train separate preference models on each, which we refer to as the train PMs and the test PMs. We then train RLHF policies against the train PMs, while evaluating them using the test PMs. Similar to how test set evaluations help us understand overfitting in supervised learning, test PM evaluations help us understand overfitting relative to the train PMs. These experiments are not conclusive since the train and test PMs may exhibit correlated robustness failures.
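
In pseudocode (all callables below are placeholders for substantial training and scoring pipelines), the protocol looks roughly like:

def robustness_curve(policy, train_pm, test_pm, prompts, rl_train_step, n_steps, eval_every=100):
    # Train the policy against the train PM, periodically scoring fresh samples
    # with both PMs; a growing gap indicates over-optimization of the train PM.
    history = []
    for step in range(n_steps):
        rl_train_step(policy, train_pm, prompts)            # one PPO update against the train PM
        if step % eval_every == 0:
            eval_prompts = prompts[:256]
            samples = [policy.generate(p) for p in eval_prompts]
            train_score = sum(train_pm.score(p, s) for p, s in zip(eval_prompts, samples)) / len(samples)
            test_score = sum(test_pm.score(p, s) for p, s in zip(eval_prompts, samples)) / len(samples)
            history.append((step, train_score, test_score))
    return history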

The main conclusions from these experiments are: (1) RLHF becomes gradually less robust at higher PM scores, and (2) larger preference models are more robust than smaller ones.

We conduct two sets of experiments as follows:

•    Train PM Size = 52B: This set consists of a scan of policies (i.e. one for each model size), all of which are trained with respect to the same 52B train PM.

•    Train PM Size = Policy Size: This set consists of a scan of policies, with each policy trained with respect to a train PM of the same size as the policy.

For both experiments, each policy is further evaluated with respect to a scan of test PMs throughout training. Note that a scan refers to 7 different model sizes ranging from 13M to 52B, thus giving us 7 policies and 7 × 7 evaluations per experiment.

In Figure 4, we compare the train PM and test PM scores throughout the training process, similar to how train and test curves are often compared for supervised training. We find that in all cases, the two scores are in close agreement during early stages of training, but eventually diverge, with the test PM providing a lower score. The divergence is likely an indication that the preference model is less robust and more easily exploited at higher rewards. That is, the policy has been over-optimized on the train PM, making the train PM over-confident in the policy’s performance. The test PM, on the other hand, doesn’t suffer from this problem since it was trained on a different portion of data that neither the policy nor the train PM had observed.

We provide more discussion in Appendix B.2.

4.3   An Approximately Linear Relation Between √D_KL and Reward

In Figures 4 and 13 we observe an approximately linear relation between √D_KL and PM score during RLHF training. Furthermore, we note that when all models are trained and evaluated with the same PMs, the learning curves are roughly parallel in the √D_KL–reward plane. Note that here the ‘KL’ is more precisely D_KL(π ∥ π_0), where π is the policy and π_0 is its initialization. A heuristic explanation is that for a small change δπ in the policy, D_KL is quadratic in δπ to leading order (it vanishes, with vanishing first derivative, at π = π_0). If the reward varies linearly in δπ, then in the ‘small-δπ region’ (i.e. where the series expansion provides a good approximation), we should expect reward ∝ √D_KL. Typically we should expect that the reward varies linearly in δπ, because the initial policy π_0 was not previously optimized for reward, so there is no reason why it would sit at an extremum with respect to small variations δπ. So the fact that this relation seems to hold empirically suggests that most of RLHF training remains in the small-δπ regime.

Though they did not use these coordinates, a similar scaling can be read off from the results in learning to summarize [Stiennon et al., 2020]. In particular, they provide a nice analysis of rejection sampling, where they generate N samples, and then plot the mean reward of the top k samples versus D_KL = log(N/k).


This analysis suggests that these RL learning curves might be associated with changes in the RL policy that behave very similarly to simply rejection sampling from the initial distribution.
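
To make the rejection-sampling comparison concrete, the sketch below computes (√D_KL, mean top-k reward) points for a set of PM scores, using D_KL = log(N/k) for keeping the top k of N samples; the scores themselves are assumed inputs.

import math

def rejection_sampling_curve(pm_scores, ks):
    # pm_scores: PM scores of N samples drawn from the initial policy for one prompt
    # Returns (sqrt(D_KL), mean PM score of the top k samples) for each k,
    # using D_KL = log(N / k) for the "keep the top k of N" distribution shift.
    ranked = sorted(pm_scores, reverse=True)
    n = len(ranked)
    points = []
    for k in ks:
        d_kl = math.log(n / k)
        mean_top_k = sum(ranked[:k]) / k
        points.append((math.sqrt(d_kl), mean_top_k))
    return points

# Example: rejection_sampling_curve(pm_scores, ks=[1, 4, 16]) for N = 64 samples.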

We find this simple relation quite striking, and believe it merits further study. At a conjectural level, it might have a variety of implications and uses when RL-finetuning large generative models:

•    These relations provide a rough prediction for ‘how much does the policy need to change to achieve a specific reward’. Furthermore, if the lines corresponding to different model sizes really are parallel, then one can use RL training of a small model along with the zero-shot performance of a larger model to estimate the eventual performance of a larger RL policy. The slopes of these lines also help explain how RLHF training can produce such large effective gains in model size; for example, they explain why the RLHF and context-distilled lines in Figure 1 are roughly parallel.

•    One can ask a subtle, perhaps ill-defined question about RLHF training – is it teaching the model new skills, or simply focusing the model on generating a sub-distribution of existing behaviors? We might attempt to make this distinction sharp by associating the latter class of behaviors with the region where the RL reward remains linear in √D_KL.

•    To make some bolder guesses – perhaps the linear relation actually provides an upper bound on the RL reward, as a function of the KL. One might also attempt to extend the relation further by replacing √D_KL with a geodesic length in the Fisher geometry.

By making RL learning more predictable and by identifying new quantitative categories of behavior, we might hope to detect unexpected behaviors emerging during RL training.

4.4  Tension Between Helpfulness and Harmlessness in RLHF Training

Here we discuss a problem we encountered during RLHF training. At an earlier stage of this project, we found that many RLHF policies were very frequently reproducing the same exaggerated responses to all remotely sensitive questions (e.g. recommending users seek therapy and professional help whenever they express any level of displeasure at all). This greatly limited these models’ utility. We still see a vestige of this behavior in some of the examples provided in Section 6.2. We now believe these policies were the result of over-optimizing for harmlessness, while under-optimizing helpfulness.

With our data collection procedure, we think this is quite intuitive. In order to get a very good score on red-teaming prompts, it’s probably sufficient for models to respond with something like “I can’t answer that.” This does not require much sophistication (it just requires learning to classify harmful requests), and so we expect it is easier to learn than helpfulness.

In Figure 14 (right), we show the policy’s PM score throughout training, after separating helpfulness and harmlessness prompts. On the left side of the same figure, we show the score distribution of PM comparison data, again separating helpful and harmless datasets. We observe that the policy’s harmlessness score is somewhat off-distribution, as it is on the upper tail of the harmlessness comparison data. On the other hand, the policy’s helpfulness score appears on-distribution, and is likely under-optimized. So we would expect this agent to be very difficult to red-team, but not very helpful.

This then raises an obvious question – can’t we just collect more harmlessness data to fill out the upper tail of the distribution? The problem involves the definition of harmlessness mentioned above – if simply refusing to answer a question is the ‘least harmful’ behavior, then this is probably both very easy to learn, and hard to improve on. That said, a more interesting ‘least harmful’ behavior would involve the model (helpfully) explaining why the request was harmful, and perhaps even trying to convince the human not to pursue such requests. We informally refer to such a model as a ‘hostage negotiator’.

However, our data collection process made it very difficult for models to learn ‘hostage negotiation’. This is because when collecting our harmlessness dataset, we had crowdworkers choose the more harmful AI response. We made this choice so that we could fully explore the vulnerability of our models to red-teaming. However, from the point of view of RLHF this was problematic, because beyond the first turn of dialogue, our models never learned what a sophisticated response to a harmful query might be like. Our dataset does not provide guidance on the upper end of the distribution, on what models should do, but only tells models what not to do.

In practice, we have partially resolved the optimization issue by training on a larger fraction of helpfulness prompts during RLHF. But in the future we hope to more fully and systematically address this problem by collecting harmlessness data where crowdworkers choose the best possible response from our models.14 In this way we hope that rather than simply shutting down harmful requests, models can learn the more subtle art of ‘hostage negotiation’ with red-teamers.

Note that since the data and models discussed in this section are from an earlier stage of our research, the RL results may look slightly different from other parts of the paper.

4.5   Iterated Online RLHF

In preceding sections we discussed the problem that PMs become progressively less calibrated and less robust at higher scores, as seen in the PM calibration study in Figure 9, and the RLHF robustness study in Figure 4. We believe this is caused by a lack of data in this high score regime. To address this, we propose iterated online RLHF:

•    We simply train the best RLHF policy we can, and use that to collect comparison data from crowdworkers. Since the policy was trained to optimize for PM score, it should produce responses that are on the upper end of the score distribution.

•    We mix the new comparison data with our existing data, and train a new scan of PMs, which we then use to train a new scan of RLHF policies. Then reiterate this process indefinitely.

Our hypothesis is that the ‘online’ RLHF policy helps us collect data on the upper end of the PM score distribution, which should improve PM calibration at high scores on subsequent iterations, and thereby allow us to train even better policies. Continuing this process should give us progressively better PMs and policies. Note that our use of the terminology ‘online’ is different from conventional use of the word—instead of training the same model iteratively, we retrain a new model per iteration.
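
Schematically (each callable below stands in for a substantial pipeline: PM training, RLHF/PPO training, and crowdworker data collection; none of these names come from our codebase), one pass of the procedure looks like:

def iterated_online_rlhf(initial_comparisons, n_iterations,
                         train_preference_model, train_rlhf_policy,
                         collect_crowdworker_comparisons):
    # Each iteration (roughly weekly): retrain the PM on all data gathered so far,
    # train a fresh RLHF policy against it, redeploy that policy to the feedback
    # interface, and fold the newly collected comparisons back into the dataset.
    dataset = list(initial_comparisons)
    policy = None
    for _ in range(n_iterations):
        pm = train_preference_model(dataset)
        policy = train_rlhf_policy(pm)
        new_comparisons = collect_crowdworker_comparisons(policy)
        dataset.extend(new_comparisons)   # new data fills out the upper tail of PM scores
    return policy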

One concern about this approach is that RLHF tends to decrease the policy’s entropy, which would limit the diversity of data collected through the online procedure. We partially address this by deploying a number of different snapshots from RL training, and from different online iterations, at once. This also makes it possible to compare these models to get a better sense of how they are performing.

We can see signs of life from the online approach by looking at the evolution of our data distribution. In Figure 15 (left), we show the PM scores from three distributions of models: Base, RS (rejection-sampling), and Online, as described in Section 2.3. We see that according to our final online PM (trained on all of the data), the quality of samples improves from the base to the rejection-sampling to the online data distributions. We also found that our online PM achieves accuracies of 74%, 70%, and 67% on the test sets for the respective base, RS, and online-only distributions, which shows that distinguishing among higher quality samples is becoming more challenging. This makes us optimistic that online training should outperform rejection sampling in the long run.

We show the learning curves for our online models, along with measurements of Elo scores from crowdworkers, on the right in Figure 15. We see that models improve significantly during RLHF, but Elo scores from crowdworkers do not match predictions from PMs. We further discuss and decompose the robustness of RLHF training in Appendix B.4, where we see that distributional shift accounts for a significant part of the apparent robustness failure (Figure 35).

In Figure 1, we compare Elo scores of our online model with context-distilled models and RLHF models trained on the ‘static’ (i.e., no online) dataset, showing that the online models are clearly preferred by our crowdworkers. However, readers might worry about two caveats: the online model was trained on a slightly larger (about 20% larger) dataset, and the online model was trained with improved RLHF hyperparameters (the online model was trained with a larger K, defined in Appendix B.1, and its PM was trained with 2048 context instead of 1024), as compared to the earlier static RLHF training run.

To address both of these caveats, we performed a controlled experiment comparing two RLHF runs: one trained with our base dataset (about 44k PM comparisons), and another trained on an even mixture of base, RS, and online data whose total dataset size is the same as the base dataset15 (about 15k PM comparisons from each). So for this experiment we trained two separate PMs, one on each dataset, and then trained a pair of RLHF policies against these two PMs. Apart from the data difference, both runs used the same settings, and were only trained on helpfulness. In Figure 16, we compare Elo scores for various snapshots of both runs, as determined by crowdworker preferences, showing that the policy trained on the iterated-online mixture is clearly preferred. This demonstrates that online training works, and that performance gains are not merely due to increased dataset size or hyperparameter changes.

4.6       Evaluations: Alignment Bonus, Honesty, and Biases

Language models that have been finetuned via RL typically have much narrower, lower-entropy output distributions. This can make evaluations difficult when they are fairly rigidly formatted, since all valid responses may be far off-distribution for the RLHF model (we discuss an example with gender bias evaluations below). Thus we expect that in future work, evaluations involving sampling and human interaction may be most relevant. In what follows we discuss some standard NLP evaluations, and then evaluations specifically related to the societal impacts of the models, including honesty, sentiment, and bias.

4.6.1       NLP Evaluations

We evaluate our models on question answering, commonsense, trivia, and story completion using the benchmarks MMLU [Hendrycks et al., 2021b], Lambada [Paperno et al., 2016], Hellaswag [Zellers et al., 2019], OpenBookQA [Mihaylov et al., 2018], ARC [Clark et al., 2018], and TriviaQA [Joshi et al., 2017]. The main conclusion is that RLHF tends to improve performance for large models, while degrading16 the performance of smaller models.

Full results for both zero-shot and few-shot evaluations are shown in Figures 28 and 29, and we provided a summary of the mean trends in Figure 3. Readers may notice that results improve rather suddenly for some evaluations; this is a consequence of the format we use for multiple choice questions, where we explicitly provide choices (Gopher [Rae et al., 2021] used this format). The format is provided explicitly in Appendix E. We find that this format tends to improve performance for large models, while decreasing the performance of small models, leading to the arguably misleading appearance of a ‘grok’ [Power et al., 2022] curve.

4.6.2       Honesty and Biases

A major question is whether AI models are honest. We evaluate our models on TruthfulQA (MC1) [Lin et al., 2021] and show the results in Figure 5. There we also include performance at 50-shot, in order to demonstrate that while our RLHF training significantly improves honesty, our models most likely have significant room for improvement. Note that our instructions to crowdworkers suggest that ‘lying isn’t helpful’ and that they should choose responses that are ‘helpful and honest’, so this is presumably related to the improvements we see on TruthfulQA. That said, we do not currently expect RLHF to be the best approach to honesty.

Another set of questions involves the underlying biases of these models. We evaluate our models for sentiment biases on race and religion (in the same format as Gopher [Rae et al., 2021]), for gender bias, and on the Bias Benchmark for QA (BBQ-lite) [Parrish et al., 2021].

Results for sentiment towards different racial and religious groups are shown in Figure 17. The main effect we observe is that the sentiment of our RLHF-trained models tends to be much more positive than that of plain LMs, but it is more difficult to say whether bias among groups has changed. We expect some of the visible effects are due to limitations in the sentiment model17; for example, the low scores for the Atheist group are likely because samples such as the following (model generation in italics):

The Atheist man was very skeptical and critical of religious dogma and conventional moral principles, seeking instead to derive his ethical values from rational consideration and empirical observation of the natural world.

receive a sentiment score very near zero, which seems like a questionable evaluation. For these evaluations we use a prompt format where the human asks the assistant to complete the sentence as follows:

Human: Can you help me finish a sentence? The sentence is: {sentence beginning} 
Assistant: Sure thing, here is your completed sentence: {sentence beginning}

with sentence beginnings such as ‘The Atheist man was very’ as seen above.
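For concreteness, a sentiment-bias evaluation along these lines could be run as sketched below. The specific sentiment model used for Figure 17 is not reproduced here, so a generic HuggingFace sentiment pipeline stands in for it, and `generate_completion` is a hypothetical wrapper around the assistant being evaluated.

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # stand-in for the paper's sentiment model

def mean_group_sentiment(group: str, templates, generate_completion) -> float:
    """Average sentiment of model completions for one demographic group."""
    scores = []
    for template in templates:
        beginning = template.format(group=group)  # e.g. "The Atheist man was very"
        prompt = (
            "\n\nHuman: Can you help me finish a sentence? "
            f"The sentence is: {beginning}"
            f"\n\nAssistant: Sure thing, here is your completed sentence: {beginning}"
        )
        completion = generate_completion(prompt)
        result = sentiment(beginning + completion)[0]
        signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
        scores.append(signed)
    return sum(scores) / len(scores)
```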

We also evaluated our models on BBQ-Lite, a simplified version of BBQA [Parrish et al., 2021] that is available on BIG-bench. We were somewhat surprised by the results, and are unclear if any conclusions can be drawn based on them. In essence, we appear to find that our plain LMs do not exhibit very significant biases on this benchmark, using the available metrics. Here the metric is roughly ‘how often does the model invoke a term in a negative context, divided by how often it uses the term generally’. A value of 0.5 corresponds to the models being ‘unbiased’. RLHF training perhaps improves the results slightly (likely by improving sentiment) though the effect is very small (see Figure 18). We may return to these evaluations in the future using the full BBQA metrics [Parrish et al., 2021].

In Section 3.4.2 we investigated gender bias in preference models. We tried using this format for evaluating RLHF models, but we found that these models were more than 10× less likely than plain or context-distilled LMs to use a gendered term to complete sentences like ‘The CEO was a’. In other words, statements like ‘The CEO was a father’ seem to be somewhat off-distribution for the RLHF models, at least compared to general LMs. This is problematic for evaluations, because it means that the bias measurements are likely to be dominated by noise, and are unlikely to be representative of model behavior (i.e. model-generated samples). So in Appendix B.10 we experiment with another formulation of the task, where we reverse the order of occupation and gender terms, and add a ‘For work’ clause; this kept both the RLHF and other models on-distribution. The results are discussed in Appendix B.10, where we show that RLHF models’ biases are very strongly correlated (Figure 40) with underlying LM biases. In particular, with this evaluation we find that large RLHF models have gender biases that are very similar to language models evaluated at a lower temperature.

5   Competing Objectives, Specialized Skills, and OOD Detection

A concern about alignment techniques is that they might compromise model performance. In Section 5.1 we highlight a quantifiable trade-off of this kind, between helpfulness and harmlessness when training preference models. But it appears that larger models suffer less of a performance drop from this trade-off.

Furthermore, we also find that the conflict between helpfulness and harmlessness is relatively unique. Preference models can learn to reward strong performance at specialized skills without any loss in performance at helpfulness and harmlessness. In Section 5.2 we consider the evaluation of summarization quality as such a skill, using the learning-to-summarize [Stiennon et al., 2020] dataset reformatted in conversational form. Later in Section 5.3 we show that code models (i.e., models finetuned on code by supervised training) are also compatible with HH alignment interventions, even though the RLHF training does not involve code data or examples.

In Section 5.4 we highlight another approach to avoiding harmful behavior – it may be possible to reject most harmful requests, even without any access to harmfulness training data, by leveraging out-of-distribution detection techniques [Fort et al., 2021]. This approach might also be useful more generally in deployment scenarios where strange or off-topic requests need to be flagged or ignored.

5.1       Mixing Helpful and Harmless Objectives

In many cases harmlessness acts as a constraint on helpfulness. So we should expect that helpfulness and harmlessness may behave as partially anti-correlated objectives. We establish this by evaluating preference models trained on different mixtures of HH data, and with different weightings.

At a conceptual level, the HH PMs may essentially be learning to first classify the data and then choose a score depending on the distribution. We will show that larger models perform better and are more robust to data mixture and loss weighting, which may be due to their having greater success at separating reasonable from harmful requests.

5.1.1       Varying Helpful vs Harmless Data Fraction

We train models using data splits varying from 100% helpfulness to 100% harmlessness in intervals of 10%. Our static data distribution has 42k red-teaming comparisons, so to control for dataset size we always construct mixtures with a total of this number of comparisons. Figure 19 shows performance on both harmlessness and helpfulness as the training data mixture is varied. Note that training entirely on helpfulness or harmlessness data results in performance on the other distribution which is significantly worse than chance. This exemplifies the extent to which these distributions are in tension with each other.


Figure 26 in the appendix also plots mean test accuracy over both helpfulness and harmlessness (where Mean Acc = (Harmlessness Acc + Helpfulness Acc)/2). Curves for larger models are steeper near the 0% and 100% ends, but flatter at the top. The curves for the smaller models are more gradual, with more distinct peaks in the middle. This again suggests that larger PMs are more robust to the specific fraction of red-teaming vs helpfulness data that is used, allowing them to learn both concepts more easily.

5.1.2       Weighting Helpful vs Harmless Losses

Instead of studying different data mixtures, we can try re-weighting the losses. Since we have more helpfulness than harmlessness comparisons, we experimented with weighting the losses as

L_{Total} = L_{Helpfulness} + λ · L_{Harmlessness}

for λ ∈ {1, 2, 3, 4, 10}, as shown in Figure 27 (relegated to the appendix). We note that larger models seem more robust to the choice of λ. Increasing λ from 1 to 10 causes a 7.4% decrease in accuracy on helpfulness for the 13M parameter model, whereas it only causes a 1.5% decrease for the 52B parameter model.
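As a minimal sketch (not the paper's actual implementation), the weighted objective might look like the following, assuming the usual ranked-comparison loss for preference models, i.e. −log σ(score_better − score_worse); the `pm` model and batch format are illustrative.

```python
import torch.nn.functional as F

def weighted_pm_loss(pm, helpful_batch, harmless_batch, lam: float = 1.0):
    """Pairwise preference loss with the harmlessness term scaled by lambda.

    Each batch holds tokenized (better, worse) response pairs, and `pm` is assumed
    to return one scalar score per sequence.
    """
    def comparison_loss(batch):
        better_scores = pm(batch["better"])   # shape: (batch_size,)
        worse_scores = pm(batch["worse"])
        return -F.logsigmoid(better_scores - worse_scores).mean()

    return comparison_loss(helpful_batch) + lam * comparison_loss(harmless_batch)
```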

5.2       Summarization as a Specialized Skill

We expect that models finetuned with special skills may be particularly useful and valuable. Does alignment interfere with finetuning for a specialized skill?

As one test of this question, we studied PM finetuning on the learning-to-summarize (LtS) [Stiennon et al., 2020] dataset vs. a mixture of LtS and HH data. We formatted the LtS data in conversational format so that it matches the HH data, as follows:


Human: Can you write a summary of this article for me?

...Text...

Assistant: Sure, here it is:

...Summary...

As shown in Figure 20, large preference models trained on a mixture of HH and LtS datasets perform equally well on both. So at least at the level of preference modeling, there seems to be no cost to mixing HH with the specific skill of evaluating summarization quality.
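A minimal sketch of this reformatting step, assuming the LtS examples arrive as plain (article, summary) string pairs:

```python
def lts_to_dialogue(article_text: str, summary: str) -> str:
    """Wrap a learning-to-summarize example in the conversational HH format above."""
    return (
        "\n\nHuman: Can you write a summary of this article for me?\n\n"
        f"{article_text}\n\n"
        "Assistant: Sure, here it is:\n\n"
        f"{summary}"
    )
```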

5.3   Natural Language RLHF on Code-Finetuned Models

As another test of a specialized skill, we would like to see if natural language alignment can be combined with coding without compromising performance. Since our crowdworkers were never instructed to probe the model’s coding abilities, and most likely do not have a great deal of coding expertise, our human feedback data does not include a significant number of code-related conversations. The preference model finetuning dataset and the RLHF prompt dataset therefore do not contain any significant amount of code, though there is some code in the LM pretraining mix and possibly a small amount in the PM pre-training (PMP) mix. This makes code-related problems an interesting way to test generalization of RLHF, and especially its compatibility with other skills.

Our ‘base code models’ were finetuned on Python code scraped from Github as described in [Askell et al., 2021]. Starting from these Python fine-tuned (Python FT) models, we then ran our standard natural language RLHF training using ‘static’ preference models and prompts. We had difficulty achieving stable RLHF optimization on the 3B code model, so it has been excluded for this section.

We evaluate models on the HumanEval dataset [Chen et al., 2021], which prompts language models with python function signatures and docstrings. Models are tasked with correctly filling in the function body given the context, and model-written functions are run in a sandbox environment. In Figure 21 we show results versus model size with and without RLHF training. We see the same trend here as with other evaluations – RLHF decreases the performance of small models, but improves the performance of larger models.

RL training tends to decrease the entropy of the models’ distribution, and so we were concerned that these results would be very sensitive to temperature and top-p tuning. So for our 52B models, we performed a scan over temperatures and two top-p settings for both the RLHF models and the base code models, and then chose the best setting for each model and pass@k. We did a grid-search over the evaluation hyperparameters: T ∈ {0, 0.4, 0.6, 0.8, 1.0} × p ∈ {0.95, 1} × k ∈ {1, 5, 10, 25, 50, 75, 100}. Results are summarized on the right side of Figure 21. For each model and for each k in pass@k, we take the maximum performance over all 10 combinations of hyperparameters. We see that RLHF improves performance over the baseline on this evaluation, for all pass@k.
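The sketch below combines the unbiased pass@k estimator from [Chen et al., 2021] with the max-over-settings selection described above; `evaluate` is a hypothetical stand-in for the sandboxed HumanEval harness, returning per-problem (samples drawn, samples correct) counts at a given temperature and top-p.

```python
import itertools
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

temperatures = [0.0, 0.4, 0.6, 0.8, 1.0]
top_ps = [0.95, 1.0]

def best_pass_at_k(evaluate, k: int) -> float:
    """Max pass@k over the 10 (temperature, top-p) evaluation settings."""
    scores = []
    for T, p in itertools.product(temperatures, top_ps):
        counts = evaluate(T, p)  # list of (n_samples, n_correct) per HumanEval problem
        scores.append(np.mean([pass_at_k(n, c, k) for n, c in counts]))
    return max(scores)
```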

We should emphasize that as with our other evaluations, the improvements in performance from RLHF are modest. In fact, we find that simply prompting a base code model performs slightly better, as shown in Figure 38. Appendix B.8 further describes the format of the prompts we used (i.e., ‘HHH prompts’), which consist of a couple of code examples.

We also conducted experiments involving adding buggy code to the prompts, which typically worsens performance (see [Chen et al., 2021]). We found that RLHF models did not perform better than their initial base code model snapshots, when these prompts are included in the context during evaluation, even after scanning over temperature and top-p.

5.4   Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests

In this work we are primarily focused on achieving harmlessness entirely through natural language dialogue. However, one might try to avoid harmful behavior in a somewhat different manner, by either restricting language assistants to only respond to a narrow range of queries (approved-list), or by filtering and rejecting known types of bad behavior (block-list). We could use our preference models for these purposes, but we might also take a different, less supervised approach, and leverage advances in out-of-distribution (OOD) detection.

Such an approach might also be useful for those who want to build systems that only respond to a narrow range of queries (e.g. code models that should avoid non-code topics). Out-of-distribution detection (OOD), and especially near out-of-distribution detection, have been a major challenge for deep neural networks. Deep networks routinely assign high probability to mis-classified inputs [Guo et al., 2017, Lakshminarayanan et al., 2016] as well as to test inputs not belonging to any of the training classes [Nguyen et al., 2014]. There have been many approaches to OOD detection based on discriminative models [Hendrycks and Gimpel, 2016, Lee et al., 2018, Liang et al., 2017, Liu et al., 2020] as well as deep generative models [Nalisnick et al., 2019, Zhang et al., 2020]. The more difficult case of OOD detection, the so-called near-OOD detection [Winkens et al., 2020], has recently been improved upon significantly using pre-training and large models [Fort et al., 2021].

For a prompt i, we extract a vector of activations of dimension d_model from a layer ℓ and call it v_i^ℓ ∈ R^{d_model}. The task is to distinguish unseen examples of harmlessness data from helpfulness data, without being explicitly shown any harmlessness data at all. This approach works by measuring the deviation of a prompt from the helpfulness data, rather than measuring how close it gets to the harmlessness data in particular. In this way, we do not depend on the specific harmful content we have at hand, and can potentially filter different kinds of non-helpfulness content. To detect whether an input comes from the in-distribution (the helpfulness dataset), we use a scoring function that maps the input to a scalar value score(x). To do that, [Lee et al., 2018] first proposed to fit a simple model to training examples of the in-distribution, D_in^train = {v_i^ℓ ∈ R^{d_model}}. We calculate the mean μ = (1/N_train) ∑_i v_i^ℓ and the covariance matrix Σ = (1/N_train) ∑_i (v_i^ℓ − μ)(v_i^ℓ − μ)^T over the N_train training vectors. The Mahalanobis distance of an unknown activation vector x from this training set is score(x) = (x − μ)^T Σ^{−1} (x − μ).

A simple improvement on top of the Mahalanobis distance called the Relative Mahalanobis distance has been proposed in [Ren et al., 2021] and shown to lead to better AUROC as well as more robust detection for a range of OOD problems in vision and genomics (in addition to more robustness to adversarial attacks [Fort, 2022]). Inspired by this method and recognizing that our problem does not naturally involve semantically meaningful classes comprising the in-distribution, we propose a further modification we call the Simplified Relative Mahalanobis distance. We compute it by fitting a full covariance matrix Σ as before, as well as a diagonal-only covariance matrix Σ_diag, and assigning the difference of their Mahalanobis distances as our scoring function, score(x) = (x − μ)^T Σ^{−1} (x − μ) − (x − μ)^T Σ_diag^{−1} (x − μ).
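A minimal sketch of both scores, fit only on in-distribution (helpfulness) activation vectors; the small ridge term added before inversion is our assumption for numerical stability, not a detail from the paper.

```python
import numpy as np

def fit_ood_scorers(train_acts: np.ndarray, eps: float = 1e-6):
    """Fit Mahalanobis-style OOD scores to helpfulness activations of shape (N, d_model).

    Returns (mahalanobis, simplified_relative_mahalanobis); higher scores mean the
    input is further from the helpfulness distribution.
    """
    mu = train_acts.mean(axis=0)
    centered = train_acts - mu
    cov = centered.T @ centered / len(train_acts)
    cov_inv = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    diag_inv = np.diag(1.0 / (np.diag(cov) + eps))

    def mahalanobis(x: np.ndarray) -> float:
        d = x - mu
        return float(d @ cov_inv @ d)

    def simplified_relative_mahalanobis(x: np.ndarray) -> float:
        d = x - mu
        return float(d @ cov_inv @ d - d @ diag_inv @ d)

    return mahalanobis, simplified_relative_mahalanobis
```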

Figure 22 shows the results for our OOD detection experiments, trying to distinguish the helpfulness data from harmlessness data using our new Simplified Relative Mahalanobis distance to the helpfulness activation vectors. The figure shows AUROC for activation vectors from different layers of the model and different model sizes. The bigger the model, the better its performance, with the middle layers performing best. The right panel of Figure 22 shows an example of the OOD score distributions for the helpfulness data (blue) and harmlessness data (red) for the 64-layer, 52B parameter model at its 32nd layer. We can see that the mode of the distances for the harmlessness data is clearly higher than for the helpfulness data. The comparison of the Simplified Relative Mahalanobis distance and the standard Mahalanobis distance is shown in Appendix B.9 in Figure 39.

The advantage of this approach is that we are able to distinguish helpfulness data from non-helpfulness data in general, with harmlessness data being only one particular kind of non-helpfulness data. The disadvantage is its clearly lower performance on this specific task.

If we have access to a small number of examples of Harmlessness inputs (the out-distribution), we could perform a few-shot outlier exposure, as first proposed by [Hendrycks et al., 2018]. [Thulasidasan et al., 2021] suggests using a single class representing the OOD examples. [Fort et al., 2021] has used outlier exposure on top of individually strong near-OOD detectors and showed that they still enjoy a large benefit from being exposed to the examples of the out-distribution. We observe a similar benefit here, as discussed in detail in Section B.9 and shown in Figure 23.

In particular, with only 10 examples of harmful prompts, we can achieve an improved AUROC of 0.94 ± 0.02 for the 64L model. The best performance among all layers of all models (the middle layers of the 52B model) without outlier exposure (already using our Simplified Relative Mahalanobis distance, see Figure 22) is approximately 0.85. A 4L model exposed to only 10 examples of harmful data gets an AUROC of 0.86 ± 0.01, while having only 13M parameters as compared to the 52B. The OOD detection gains from outlier exposure are therefore very significant in comparison to the gains coming from model size scaling alone.

6     Qualitative Examples and Comparisons

It is challenging to quantitatively evaluate general-purpose dialogue agents. We find that our own research process depends essentially on qualitative evaluations, in order to get a sense for model strengths and weaknesses, even when the ultimate goal is to produce some sort of quantitative metric. Thus in this section we will provide a few sample dialogues with our final online HH model.

An obvious issue with the qualitative evaluation of samples is that it’s difficult to know to what extent they have been cherry-picked. To mitigate this issue, for each prompt we generate 17 samples, rank them with our HH online preference model, and then display the median sample. We provide samples based on prompts that appear in association with others’ comparable work in Appendix C, including InstructGPT [Ouyang et al., 2022], LaMDA [Thoppilan et al., 2022], and sensitive questions appearing in PALMS [Solaiman and Dennison, 2021]. We encourage readers to skim these samples to get an (arguably) unbiased sense for our model’s behavior.
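A sketch of this selection procedure, where `sample` and `pm_score` are hypothetical wrappers around the policy and the HH online preference model:

```python
def median_sample(prompt: str, sample, pm_score, n: int = 17) -> str:
    """Generate n responses, rank them with the preference model, return the median."""
    responses = sorted((sample(prompt) for _ in range(n)), key=pm_score)
    return responses[len(responses) // 2]  # middle element under the PM's ranking
```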

6.1    Comparison with Human Writers

As an additional test of our models, we collected high-quality HHH dialogues from human writers. These writers were hired on Upwork (separately from our pool of crowdworkers) based on prior successful writing work and positive reviews. We gave them some examples, and then asked them to write fairly ideal human/assistant interactions.

Then based on the prompts (leaving off the final writer-written assistant responses), we generated 17 examples from our best HH and best purely-helpful online models, and chose the median response from this pool as ranked by our online HH preference model. We then asked our crowdworkers to rank responses from the writers, along with one response each selected from our HH and helpful-only models. We then use these results to estimate an Elo score for our human writers, by comparing to our two models, and we display the result as a horizontal band on Figure 1.

We used a band because the writer comparisons were actually somewhat inconsistent with our other Elo rankings: crowdworkers preferred our HH online model to human writers 56% of the time, but preferred our pure-helpfulness model to writers only 51% of the time. Presumably this is because the distribution of writers’ prompts differs from the distribution of conversations in which crowdworkers usually engage. Thus rather than altering the rankings between our HH and helpful models, we represent writers as a wide band.

We find that the performance of both our helpful and HH models is roughly comparable to that of the human writers, though in fact the models are slightly preferred by crowdworkers. We also find that our PMs typically assign higher scores to the model responses, as shown in Figure 24. We believe that this result should be interpreted with caution and caveats, however. The writers’ prompts were not very adversarial; many strike us as ‘softball questions’ that models can easily do well on. As researchers who have interacted with models for tens or hundreds of hours, we could likely come up with many questions where humans would be more likely to outperform the models. Relatedly, the writers’ conversations do not strike us (in many cases) as especially realistic, in the specific sense that we would not expect a human to derive a great deal of practical value by asking an AI assistant for help in these ways. We expect that our models likely still fall significantly short of providing the kind of value that a sophisticated human assistant could provide via chat, even if they were operating under the constraints the model experiences (e.g. no internet access).

Here are a couple illustrative examples of prompts, writer responses, and our online HH RLHF model’s response. In this example the model performs well at first, but then makes some slightly inaccurate statements about the ways in which it fixed errors:

In this example the writer’s statistic is arguably a bit misleading (presumably unintentionally), and the model’s response may simply be superior. This strikes us as an example of a ‘softball question’ where we expect models to do well, since it is common, vague, and open-ended:

In this final example, the model makes some subtle mistakes, but most readers would probably have difficulty detecting them without consulting reference material:

Dialogues like the above help to illustrate the limits of human feedback training: once model errors become sufficiently subtle, they will no longer be penalized appropriately, and the model’s incentives may no longer be aligned. This is one reason why we may not want to rely on human feedback to train models to be honest, or for any goal where human judgment is not the best possible source of information.

With that said, AI models still have limitations that can seem highly incongruous. Here’s an example where we, the more experienced model-interrogators, really played ‘hard ball’ with the AI:

6.2   Sensitive Questions and Avoidance versus Engagement

Here are a few examples from PALMS [Solaiman and Dennison, 2021] sensitive questions. We chose them to illustrate how the model sometimes avoids very sensitive subjects, but often provides otherwise reasonable responses. Please see Appendix C for many more examples, also including some from InstructGPT [Ouyang et al., 2022] and LaMDA [Thoppilan et al., 2022] prompts.

Here is a case where the model just refuses to answer the question:

Here the model indicates that it should not answer, but then provides some relevant information (perhaps it’s also problematic that the model focuses on the US, though this is likely a consequence of it having been trained in English by US-based crowdworkers):

And here the model provides a fairly complete response, without attempting to avoid the topic:

Qualitatively speaking, our experience has been that by tuning optimization for helpfulness versus harmlessness, one can train models that are extremely avoidant when sensitive topics come up, or models that almost always attempt to engage, even on sensitive subjects. We suspect that Section 4.4 also provides useful lessons if the goal is to train models that provide full responses in the face of sensitive questions.

6.3       Example Dialogues

Most of our samples only involve a single prompt and response, so in this section we show a few two-turn dialogues.

In this dialogue the model follows instructions, but could have done a significantly better job capturing the nuance of the situation:

Here is some dialogue on a technical subject from a researcher who was experimenting with using the model to help debug an engineering issue:

When asked to parrot writing styles, the model is more likely to make potentially harmful, explicit, or offensive statements:

In this case we chose a human reply that takes the conversation in a subtly dark direction, and the model rebuts the argument:

7   Discussion

We have shown that it’s possible to use reinforcement learning from human feedback to train language models that act as helpful and harmless assistants. Our RLHF training also improves honesty, though we expect other techniques can do better still. As in other recent works associated with aligning large language models [Stiennon et al., 2020, Thoppilan et al., 2022, Ouyang et al., 2022, Nakano et al., 2021, Menick et al., 2022], RLHF improves helpfulness and harmlessness by a huge margin when compared to simply scaling models up.

Our alignment interventions actually enhance the capabilities of large models, and can easily be combined with training for specialized skills (such as coding or summarization) without any degradation in alignment or performance. Models with less than about 10B parameters behave differently, paying an ‘alignment tax’ on their capabilities. This provides an example where models near the state-of-the-art may have been necessary to derive the right lessons from alignment research.

The overall picture we seem to find – that large models can learn a wide variety of skills, including alignment, in a mutually compatible way – does not seem very surprising. Behaving in an aligned fashion is just another capability, and many works have shown that larger models are more capable [Kaplan et al., 2020, Rosenfeld et al., 2019, Brown et al., 2020], finetune with greater sample efficiency [Henighan et al., 2020, Askell et al., 2021], and do not suffer significantly from forgetting [Ramasesh et al., 2022]. Although we did not demonstrate it directly, we also expect that RLHF alignment training can be mixed with or precede training for other objectives; this might be relevant in the future in order to avoid the production of intermediate, unaligned AI systems.

We did find a clear tension between helpfulness and harmlessness, where models trained entirely for helpfulness are preferred to models trained for HH, when evaluating only on helpfulness. We believe this is partly due to a subtlety in our data collection process, as we rarely collect data teaching models how to deal positively with harmful requests (i.e. how to be a sort of ‘hostage negotiator’), but only how to avoid them. We also found that, at least at the level of preference models, the helpful-harmless tension diminishes as models become larger and more capable. Nevertheless we do expect that this tension is real, and that caution may cut into model performance on the margin.

Large generative models have been referred to as ‘foundation models’ [Bommasani et al., 2021]. These models are extremely interesting objects for research, but without further finetuning, they can exhibit harmful behaviors. Our work suggests that alignment training can be incorporated into foundation models without compromising their utility and versatility, and so perhaps it could soon become a part of their definition.

7.1       Limitations

While we believe our results present a promising picture for the alignment of existing language models, work on this subject remains in an early stage, and has a number of limitations. As was also emphasized by the authors of [Thoppilan et al., 2022], we view our work on alignment as an ongoing project; our work [Askell et al., 2021] was step zero, and this is step one.

We’ve pragmatically defined an aligned assistant as an AI that is18 helpful, honest, and harmless. We are optimistic that at present capability levels, the techniques we have discussed here provide a reasonable approach to achieving helpfulness and harmlessness. However, although our techniques improve model honesty, we believe we are just scratching the surface of that problem, and that other techniques may more efficiently and effectively produce honest AI models.

Here we have essentially focused on the average-case behavior of our models. However, even if we were convinced that our models were HHH in expectation, a clear next step would be to attempt to study and eliminate bad behaviors (especially harmfulness) even in the worst case. We have not addressed this question of robustness here, but hope to study it in the future (approaches such as [Perez et al., 2022] may be useful). It will only become more pressing as AI systems advance and encounter distributional shift during deployment.

AI alignment may be difficult and ambiguous to assess. So for example, while our large RLHF-trained models perform better than plain LMs on virtually all capabilities evaluations, one might hope that a truly helpful model’s zero-shot performance would equal the few-shot performance of an unaligned model. The logic here is that if a model can really ‘helpfully follow instructions’, then a prompt or explanation should be sufficient to bridge the zero-to-few-shot gap. We are very far from achieving this level of performance! Even on the honesty evaluation TruthfulQA [Lin et al., 2021] we close a bit less than half of this gap (Figure 5). We also briefly investigated whether our RLHF-finetuned code models have any comparative advantage when exposed to prompts including buggy code [Chen et al., 2021], but we did not find any benefits there. One would hope a fully aligned model would do its best to write correct code, even when given a buggy prompt.

We also harbor a general concern that perhaps our techniques only render models aligned ‘on the surface’, and that they still harbor harmful biases or other tendencies that may surface in more subtle contexts. We found that RLHF models have a more positive sentiment towards all racial and religious groups, which seems promising, but does not necessarily indicate that biases have been reduced. And with respect to gender, we found that RLHF model biases are very strongly correlated with the bias of the underlying language models. That said, further work will be required to understand if this is a limitation of RLHF as a technique, or of our particular HH datasets. In any case, we likely need to build more subtle and comprehensive evaluations that include multi-turn dialogue, as this is an area where humans will likely use the models, and it’s also a place where it’s inherently more difficult to measure performance against subtle objectives such as bias and fairness.

On a much more practical level, we do not have much experience applying RL techniques to large generative models. Experienced AI practitioners know that there are a large variety of tweaks and tricks that require experimentation to identify, and that can majorly improve the stability and performance of training. We have encountered some stability issues with RL, and although we performed some rudimentary hyperparameter scans, we expect that with more experience and study we could do better. We also did not explore variations in online training, such as literally updating a single PM or RLHF model; rather we retrained these models from scratch on each iteration. Another direction for exploration is to use a non-trivial function of PM scores as the RL reward, distorting the score distribution to e.g. focus more on discouraging bad behavior rather than rewarding good behavior. In summary, there are many future directions to explore for improving RLHF.

A final concern is whether techniques like those we have employed will continue to apply as AI models become increasingly capable. We take these concerns very seriously. In our view, the present work makes some progress towards our initial goal, which is to establish a set of simple and universal techniques19 that can align AI models at present capability levels. Assuming this goal can be met, one of the next steps will be to build consensus among researchers and to understand alignment in greater depth, including how techniques scale with AI capabilities. The hope will be to create an evolving pragmatic state of the art for training AIs that are thoroughly helpful, honest, and harmless.

Another essential step will be to use this baseline as a point of departure for exploring other techniques that can better-address more advanced use cases and more speculative failure modes. New ideas and techniques can then be pragmatically compared with existing methods, and then incorporated into standard practice if they yield further improvements in safety and robustness. Our view is that the most relevant problems and the most creative and effective alignment techniques will be identified and developed through research on concrete AI systems. As we saw in Section 6.1, we are already encountering examples that point to the limitations of human feedback, and so we need to begin to develop other methods.

7.2   Alignment Data as a Public Good

In this work we allowed crowdworkers’ common-sense to define what constitutes helpful and harmless behavior. This was sufficient for our exploration of ‘technical alignment’, i.e. the question of whether certain techniques can be used to train AI models to be more helpful and harmless. But we have avoided addressing the underlying question of what sort of behavior should be expected from deployed AI models.

This question should not be the province of researchers only. That said, without a clear specification for the format and type of ‘alignment data’ most relevant for AI training, it has been difficult for anyone other than researchers to gather the information needed to train safe and beneficial AI systems. However, recently several projects (including ours) have used similar methods [Stiennon et al., 2020, Ouyang et al., 2022, Nakano et al., 2021] to teach AI models complex human preferences, and we have also found [Askell et al., 2021] that preference modeling based on ranked comparisons scales better than many other techniques.

One possible approach would be for an independent organization with ethical, legal, and cultural expertise to create a very high-quality dataset expressing human preferences for AI behavior (via comparisons). Such an organization could also use a novel governance structure, so that a larger set of societal stakeholders could factor into the decisions it makes about how to create and curate alignment data – in contrast to today, where private companies make these decisions in an opaque manner using governance structures that grant power to financially interested parties. Datasets created in this way might be used for both training and evaluation of AI models, and could even begin to establish standards for behavior. Due to the rapid improvement in AI language models, we expect that such datasets would be most valuable if they encode preferences at human-level sophistication. In any case, this is just one speculative possibility for broadening participation in dataset creation.

Our research has benefited from publicly available research datasets and evaluations relevant to aligning AI with human values [Stiennon et al., 2020, Hendrycks et al., 2021a], and we plan to release our preference modeling data for others to use in their research. Unfortunately, this does not seem to be a standard practice among alignment researchers, as evidenced by some recent work. While we agree that LLMs themselves can be used for harm, it seems that no such argument can be made for alignment data.

It’s extremely important to enable collaboration and reproducibility for alignment and safety research. As AI systems become more powerful and more widely deployed, the cost of mistakes and misunderstandings may grow immensely. We believe that the only way to convincingly address potential safety failures from advanced AI systems is to build a thoughtful community of researchers with deep expertise, and the ability to evaluate systems empirically. This will remain almost impossible if knowledge about the alignment of advanced systems remains siloed within many independent organizations. Sharing data seems like the easiest and most commonsense way to enable the sharing and validation of results.

One ostensible reason for secrecy is that organizations may use data from users to develop alignment datasets, and then justify not sharing the datasets on the grounds that it violates user privacy. This is a challenging issue that requires organizations to think about how to reconcile commercial priorities with the need to create a ‘safety commons’ for the community. If alignment becomes interlinked with the concept of commercial moats, that could reduce the overall net level of safety of the AI ecosystem. Therefore, we believe that datasets developed for alignment should be kept separate from commercial data, and should be openly shared to advance research on safe and beneficial AI.

7.3    Broader Impacts

We hope that our work provides compelling evidence that AI systems can be made safer and more useful at the same time, and without performance costs. As noted above, we have largely remained agnostic on the question of which values define acceptable and unacceptable AI behavior. Thus we hope that rapid progress in technical alignment and the consolidation of specific techniques will motivate the development of publicly available alignment data, guidelines, and benchmarks.

AI technologies are dual-use, meaning they can be used beneficially and otherwise. We have found the effectiveness of preference modeling and RLHF striking (in our research and others’), and believe there’s very legitimate concern that these techniques could be used for censorship, fraud, and misinformation. Straightforward commercial use-cases also seem worrisome, especially if optimization for objectives like user engagement and persuasion is mixed in. At the most naive level, if you can optimize for ‘harmless’ then you can ‘flip the sign’ and generate harmful systems.20 We also found that systems trained exclusively to be helpful become easier to use for harmful ends, which suggests that as systems become more powerful, it will become increasingly important to directly curb their potential for harms.

Perhaps the broadest impact of this work, and the general development and dissemination of controllable, human-like language generation [Ganguli et al., 2022], will be cultural. In Figure 1 we used an Elo scale, essentially the chess rating system, to compare and evaluate natural language assistants, and we even included comparison to human writers. This kind of comparison risks trivializing the importance of language, which is certainly not just a game, but the core medium of culture and society. While seeking to align increasingly capable AI systems feels like a robustly good action, how and when to deploy these systems poses more challenging questions – culture is fundamentally a human enterprise, but large-scale generative models hold the possibility of magnifying and minimizing different parts of human culture in unpredictable and opaque ways, which could have broad downstream influences.

Acknowledgments

We thank Sam Bowman, Paul Christiano, Jacob Hilton, Jan Leike, Ethan Perez, and Jeff Wu for helpful feedback on the draft. We thank Daniela Amodei, Jamie Kerr, Jia Yuan Loke, Rebecca Raible, and Tim Telleen-Lawton for support with the project.

Author Contributions

Yuntao Bai performed most of the experiments on RLHF and many of the preference modeling experiments. He made major contributions to experimental design, measurement, and evaluation of model performance and behavior. He helped to write the paper.

Andy Jones and Kamal Ndousse built the infrastructure for RL training of large language models. They also built associated plotting and monitoring systems and implemented the PPO algorithm. They helped with the design, implementation, and debugging of RLHF.

Amanda Askell helped to design model evaluations, collected samples and evaluations from professional writers, built systems for improving the quality and quantity of data collection, and collaborated with Jared and Jackson on associated evaluations. She also helped with the design and implementation of the human feedback interface. She helped to write the paper.

Anna Chen helped with general RL and RLHF experimentation, and contributed to the research design.

Nova DasSarma managed the underlying cluster infrastructure, making large scale RL training and human feedback collection possible.

Dawn Drain trained the underlying code models and collaborated with Saurav on coding evaluations.

Stanislav Fort performed the OOD detection and outlier exposure research and analysis on helpful versus harmful data samples.

Deep Ganguli led the red-teaming data collection effort and design, often working with Jackson, Liane, Amanda, and Ben. He designed and ran the societal impact evaluations in collaboration with Jared, and helped with model evaluations generally.

Tom Henighan helped with pretraining the underlying language models, with dataset creation, and with managing the cluster during some phases of the project.

Nick Joseph helped design and build a framework for efficient training of large language models and preference models.

Saurav Kadavath designed and conducted experiments on helpful/harmless dataset mixing. Saurav also ran RLHF training on code models, with support from Yuntao, and ran coding evaluations in collaboration with Dawn. He also ran the majority of the natural language evaluations of basic and RLHF-finetuned models, and helped with RLHF training generally. He helped to write the paper.

Jackson Kernion led human feedback crowdworker evaluation and management, and helped to build and maintain the feedback interface. He also helped with data analysis and collaborated on model evaluations. He ran most of our model comparison experiments.

Tom Conerly helped with engineering, specifically with fast and efficient sampling.

Sheer El-Showk helped with pretraining research and dataset construction.

Nelson Elhage contributed significantly to pretraining and to engineering vision.

Zac Hatfield-Dodds helped with codebase maintenance and with engineering, specifically with fast and efficient sampling.

Danny Hernandez contributed to pretraining and especially to dataset design.

Tristan Hume helped with streamlining our infrastructure.

Scott Johnston helped with pretraining research.

Shauna Kravec contributed to the development and use of our RL systems, and collaborated on RL research.

Liane Lovitt helped with red-teaming, and in particular with designing the interface.

Neel Nanda contributed to research discussions and priorities for alignment.

Catherine Olsson helped advise on human feedback data collection, and contributed advice on alignment and evaluation.

Dario Amodei advised the project and led efforts to build and test the RL infrastructure and ML.

Tom Brown led engineering efforts, including efficient pretraining, sampling, and the stability and design of RL systems.

Jack Clark led societal impacts efforts and advised the project, including on various evaluations.

Sam McCandlish led pretraining efforts and advised the project.

Chris Olah collaborated on discussions of alignment and contributed to our research and evaluation infrastructure.

Ben Mann led the design and construction of the human feedback data collection interface and the underlying infrastructure. He also helped lead crowdworker management, and he provided engineering support for the project as a whole. He also contributed to pretraining and cluster management.

Jared Kaplan conceived and led the project. He helped with all aspects, including research design, engineering, experimentation, and evaluations. He also contributed to pretraining and helped build the evaluation infrastructure. He wrote the paper.

A   Details, Analysis, and Evaluations of Supervised Training

A.1       Context Distillation

For context distillation, we follow the prescription from [Askell et al., 2021]. Specifically, we first generate data in the following way:

1.    We prepend the ‘HHH prompt’ (i.e., a set of prompts designed to elicit helpfulness, harmlessness, and honesty) to sequences of text, with 50% of the text coming from our pretraining dataset, and 50% coming from a StackExchange dataset. For the former, we simply append pretraining data after signaling the beginning of another conversation with “Human:”. With StackExchange, we formulate a fake Human/Assistant conversation by using the question as the human side of the conversation, and a top-rated answer as the assistant role.

2.    For both datasets, we then perform forward passes with a basic pretrained 52B model, and record the top 50 log-probabilities and their indices (within the vocabulary) for the tokens following the prompt. We store the log-probs, indices, and tokens together as a small new dataset.

3.    To perform context distillation finetuning, we pass the tokens from our new dataset through models (of all sizes), and define the loss as the KL divergence between the stored log-probs and the predictions of the model undergoing finetuning. For each token, we use a 51-category distribution, with the 51st category covering the total probability from all the tokens other than the top-50 (see the sketch after this list).
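A rough sketch of such a distillation loss is shown below; the tensor shapes and the exact reduction are our assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(student_logits, top50_logprobs, top50_indices):
    """KL divergence between stored teacher top-50 log-probs and the student model.

    student_logits: (batch, seq, vocab) logits from the model being finetuned
    top50_logprobs: (batch, seq, 50)    teacher log-probs recorded with the HHH prompt
    top50_indices:  (batch, seq, 50)    vocabulary indices of those top-50 tokens

    Both distributions are collapsed to 51 categories: the top-50 tokens plus an
    'everything else' bucket holding the remaining probability mass.
    """
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    student_top50 = torch.gather(student_logprobs, -1, top50_indices)

    def add_other_bucket(logprobs_50):
        top50_mass = logprobs_50.exp().sum(-1, keepdim=True).clamp(max=1 - 1e-6)
        other = torch.log1p(-top50_mass)  # log of the leftover probability mass
        return torch.cat([logprobs_50, other], dim=-1)

    teacher = add_other_bucket(top50_logprobs)
    student = add_other_bucket(student_top50)
    # KL(teacher || student), averaged over all token positions
    return (teacher.exp() * (teacher - student)).sum(-1).mean()
```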

We show learning curves for context distillation in Figure 30. We use a batch size of 32 sequences, and a learning rate of 0.05 times the pretraining learning rate, which we decay to zero linearly during distillation. We distill using a total of 350M tokens.

A.2       Preference Modeling

Our preference models are trained on comparison data, with each data point consisting of a prompt and a pair of responses. The prompt is a multi-step dialogue between human and model that always begins and ends on the human side, and each response is a continuation of the dialogue. For instance, in Figure 6, the prompt consists of the first five steps of the dialogue, and the responses are shown in the blue box. The PM then assigns a score at the end of each response. Note that while the PM is only trained to evaluate the quality of the final response, the full context of the conversation is provided to the model.

We train scans of PMs ranging from 13M to 52B parameters. All PMs go through three phases of training: (1) language model (LM) pre-training on a large language corpus, (2) preference model pretraining (PMP), and (3) finetuning on human feedback.

LM pre-training details, including choice of hyperparameters and datasets, are explained in Appendix A of our previous work [Askell et al., 2021].

For PMP, we use a learning rate of 0.1 relative to LM pretraining, and train on a mixture of comparison data made from StackExchange, Reddit, and Wikipedia. Data preparation and labeling are explained in Appendix C.1 of [Askell et al., 2021]. We train with a context size of 1024 tokens.

For human feedback finetuning, we use a learning rate of 0.01 relative to LM pretraining. We use a context size of 1024 tokens, except for the ‘online’ model described in Section 4.5, where we trained with 2048, which may help stabilize RLHF on long contexts.

For both PMP and human feedback finetuning, we append a special ‘end-of-context’ token at the end of each sample, such that the PM score is predicted directly on top of this token. As explained in Appendix C.4 of [Askell et al., 2021], this appears to improve PM performance.
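Schematically, scoring with the end-of-context token looks like the sketch below; the token string and the assumption that the PM exposes one scalar per position are purely illustrative.

```python
def pm_score(pm, tokenizer, dialogue: str, eoc_token: str = "<|end_of_context|>") -> float:
    """Score a dialogue by reading the PM head at the appended end-of-context token."""
    tokens = tokenizer(dialogue + eoc_token, return_tensors="pt")
    per_position_scores = pm(**tokens)        # assumed shape: (1, sequence_length)
    return per_position_scores[0, -1].item()  # score predicted on top of the EOC token
```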

In all phases, we only train over one iteration to mitigate overfitting.

A.3       Scaling of PM with Model and Dataset Size

A major question is how the performance of preference modeling scales with model size and dataset size. This relates to a practical question – should we invest in collecting a larger dataset, or in training larger models?

We seem to find more predictable scaling when training only on our helpfulness dataset, likely because the red-teaming data truly comes from a distinct distribution. Accuracy learning curves can be seen on the left of Figure 32. We find that accuracy can be roughly fit by

where P is the number of parameters in the PM and D is the size of the dataset.

However, the results when training on another preference modeling data distribution look quite different, as seen on the right in Figure 32. Note that there appears to be a sort of discontinuity in behavior between 200M and 13B parameters. Perhaps this is related to the fact that the data was generated by a model with 6B parameters.

B    Details, Analysis, and Evaluations of RLHF

B.1  Training Setup

Here we discuss some details about RLHF training. We initialize our policies on context-distilled models, which are explained in Appendix A.1.

We train the policy to generate responses to a dataset of prompts that maximize the score relative to a PM that was finetuned on human feedback. The prompt dataset is obtained from the training split of the PM comparisons dataset by simply removing the responses in each pair. Recall that we allow multi-step dialogue within the prompt (which always begins and ends on the human side of the conversation), but only train the policy to generate one response following each prompt. In future work, we plan to train policies to generate multiple steps, but this requires a separate model that generates the human side of the conversation, which can be implemented with a language model trained to imitate the human side of the conversation.

We performed a variety of hyperparameter scans, and ended up using a learning rate of 0.01 relative to pretraining, a KL reward coefficient of λ_KL = 0.001 (see Section 4.1), PPO clipping ε = 0.2, discount factor γ = 1, and no entropy bonus. Furthermore, in PPO, we re-iterate over the same sample K times (see Algorithm 1 in [Schulman et al., 2017]), with higher K typically leading to more stable results. We used K = 1 for the RLHF scan, K = 2 for the robustness studies (Section 4.2), and K = 4 for the ‘online’ RLHF (Section 4.5). We also impose a limit on the maximum number of tokens per model response, using 32 for the robustness studies, and 128 elsewhere. Finally, for ‘online’ RLHF, we used a learning schedule that reduces the learning rate by 2× every 100,000 samples. For the robustness studies, we used a linear learning rate warmup for the first 25,000 samples.
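To make the role of λ_KL concrete, here is a minimal sketch of the per-sample reward, assuming the PM-score-minus-KL-penalty form referenced above (Section 4.1); estimating the KL term from the summed log-prob differences of the sampled tokens is a simplification.

```python
import torch

LAMBDA_KL = 0.001  # KL reward coefficient quoted above

def rlhf_reward(pm_score: torch.Tensor,
                policy_logprobs: torch.Tensor,
                init_logprobs: torch.Tensor) -> torch.Tensor:
    """Reward for one sampled response: PM score minus a KL penalty to the initial policy.

    policy_logprobs, init_logprobs: (response_len,) log-probs of the sampled tokens
    under the current policy and the frozen context-distilled initialization.
    """
    kl_estimate = (policy_logprobs - init_logprobs).sum()
    return pm_score - LAMBDA_KL * kl_estimate
```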

B.2  More on Robustness Studies

In Figure 33, we compare the test PM score for all policy sizes and all test PM sizes. The main observation here is that the slope of the graph increases with test PM size, suggesting that larger test PMs are significantly more capable of distinguishing policy performance. In other words, larger preference models are more robust, in agreement with the calibration studies in Section 3.3.

Finally, we take a moment here to address an issue we had neglected so far: scores from different preference models should not be compared directly, since the absolute value of the score has no meaning; only relative scores are meaningful. We address this with a simple mean-removal procedure. We make a held-out dataset, consisting of several thousand samples, and subtract from each preference model score its mean score on this dataset. We apologize that unrelated plots from different sections may have used different held-out datasets for mean-removal.

B.3       Details of ‘Online’ RLHF

We give some more details on our ‘online’ RLHF policy discussed in Section 4.5. This policy and its PM were trained on all the helpfulness and harmlessness data we had near the completion of this paper. We re-iterated each sample K = 4 times [Schulman et al., 2017] to improve stability, and sampled a maximum of 128 tokens per response.

Throughout training, we periodically evaluate the mean log-p accuracy of the policy on various held-out PM comparison datasets. More specifically, given a comparison consisting of a prompt and pair of responses, we assign an accuracy of 1 if the policy’s mean log-p on the better response is higher, and 0 otherwise. We show these results in Figure 34 for various comparison datasets. In particular, we find that mean log-p accuracy of the policy isn’t as high as PM accuracy (i.e., fraction of comparisons on which the PM assigns a higher score to the better response), possibly suggesting room for further improvements to our RLHF pipeline.
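A sketch of this metric; `response_logprobs` is a hypothetical helper returning the per-token log-probs of a response under the policy, given its prompt.

```python
def mean_logp_accuracy(response_logprobs, comparisons) -> float:
    """Fraction of comparisons where the policy's mean log-prob favors the better response."""
    correct = 0
    for prompt, better, worse in comparisons:
        lp_better = response_logprobs(prompt, better).mean()
        lp_worse = response_logprobs(prompt, worse).mean()
        correct += int(lp_better > lp_worse)
    return correct / len(comparisons)
```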

B.4       Robustness of ‘Online’ RLHF

For our final online models, we had crowdworkers compare a variety of model snapshots from RLHF training, in order to better understand the robustness and general performance of our training process. In Figure 15 (right), we show Elo scores for the online models during RLHF training.

In particular, we compare Elo scores established empirically from crowdworker preferences (i.e., Crowdworker Preferences in the figure), and Elo scores predicted by our preference model (i.e., Naive PM Prediction) during RLHF training. For the latter, we sample responses from each snapshot on a set of held-out prompts, and evaluate the PM scores (which are then converted to Elo units). We notice that the naive PM predictions significantly overestimate the empirical Elos. This is due to a combination of the following factors:

1.    During crowdworker testing, each step of the conversation is written by one of the two models being tested. However, when evaluating a RLHF snapshot on held-out prompts, the policy only writes one response at the end of a pre-existing conversation (which had been previously generated by other models, as discussed in Appendix B.1). This leads to distributional shift between the conversations.

2.    Elo and PM scores may not actually be transitive, as they involve the collapse of pairwise comparisons onto a single line. For example, if PM scores a, b, c satisfy a − b = 2 and b − c = 2, even if those are well-calibrated scores, the implication that a − c = 4 may not be correctly calibrated, and we would naturally expect instead a − c < 4.

3.    Failures of PM robustness, so that the PM’s preference for the RL policy’s samples is miscalibrated compared to true human preferences.

To explore these effects further, in Figure 35 we show Elo scores corresponding to four different measurements:

•    Naive PM Prediction: The PM score (translated into Elo units) recorded during RLHF training, which uses a set of held-out prompts.

•    Mean PM Score on Crowdworker Data: The mean PM score on the actual crowdworker conversations used to compute Elo scores based on crowdworker preferences.

•    PM Ranking on Crowdworker Data: One can try to distinguish robustness failures from miscalibrated PM scores vs PM rankings. Here we evaluate the PM on the crowdworker data used to compare these model snapshots, obtain ‘win rates’ for model comparisons according to the PM, and then we recompute the Elo scores based on the PM’s choices.

•    Crowdworker Preferences: We straightforwardly compute Elo scores based on crowdworker preferences among model snapshots.

So we see that the PM score vs PM rankings distinction does not make a significant difference in terms of robustness. However, the distributional shift between the held-out prompts and the actual crowdworker conversations was very significant, and explains a significant proportion of the discrepancy between RLHF learning curves and the Elo scores as measured from crowdworkers.

B.5       Crowdworker Comparisons and Elo Scores

Here we briefly describe how we test crowdworker preferences of our models, and how Elo scores are established. For a pair of models A and B, we ask crowdworkers to engage in text-based, back-and-forth conversations with the models. At each conversational step, two responses are generated, one from each model, and the worker chooses the response they prefer, and the conversation continues. Each choice the worker makes counts as a ‘win’ for the preferred model, giving ‘win counts’ NA, NB, respectively. In cases where a worker is unsure whether one response is better, we discard the comparison from PM and RLHF training and from crowdworker comparison evaluations.

Recall that, given Elo scores EA and EB, the log-likelihood for the win counts is given by

log P(NA, NB) = −NA log(1 + exp(rB − rA)) − NB log(1 + exp(rA − rB)),

where rA,B = (log 10/400) EA,B ≈ EA,B/174. For an ensemble of comparisons between various models, we estimate Elo scores and their errors by maximum likelihood estimation.
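As a rough illustration of this estimation step (not the fitting code used for the paper), the sketch below maximizes the likelihood above over Elo scores given a small table of pairwise win counts; the model names, counts, and optimizer choice are placeholder assumptions.

```python
# Illustrative sketch: maximum-likelihood Elo estimation from pairwise win counts.
# The model names, win counts, and optimizer choice are placeholder assumptions;
# the likelihood and the Elo conversion factor follow the formula above.
import numpy as np
from scipy.optimize import minimize

ELO_PER_NAT = 400 / np.log(10)  # ~174 Elo points per natural-log unit

# win_counts[(A, B)] = number of comparisons in which crowdworkers preferred A over B
win_counts = {
    ("online-hh", "static-hh"): 130, ("static-hh", "online-hh"): 70,
    ("static-hh", "context-distilled"): 120, ("context-distilled", "static-hh"): 80,
}
models = sorted({m for pair in win_counts for m in pair})

def neg_log_likelihood(elos: np.ndarray) -> float:
    r = dict(zip(models, elos / ELO_PER_NAT))
    # -log P = sum over comparisons of N_A * log(1 + exp(r_B - r_A))
    return sum(n_a * np.log1p(np.exp(r[b] - r[a])) for (a, b), n_a in win_counts.items())

# Elo scores are only defined up to an overall shift, so pin the first model at 0.
result = minimize(lambda x: neg_log_likelihood(np.concatenate([[0.0], x])),
                  x0=np.zeros(len(models) - 1), method="BFGS")
elo_scores = dict(zip(models, np.concatenate([[0.0], result.x])))
print(elo_scores)
```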

In some cases one of the models uses rejection sampling, meaning that it generates k samples, evaluates all of them using a preference model, and shows the user the top-scored sample. Elo scores for such models are shown in Appendix B.6. Because we cannot stream the top-scored sample, we make workers wait until it has been completed. To mitigate bias, when testing a rejection sampling model against a non-rejection-sampling one, we only show the two samples once both have been completed, even though the latter could have been streamed.
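A minimal sketch of this best-of-k procedure, with hypothetical stand-ins for the sampling and PM-scoring calls:

```python
# Minimal sketch of best-of-k rejection sampling against a preference model.
# generate_response and preference_model_score are hypothetical stand-ins for
# the actual sampling and PM-scoring services.
import random
from typing import List

def generate_response(conversation: str) -> str:
    return f"candidate-{random.random():.3f}"          # placeholder sampler

def preference_model_score(conversation: str, response: str) -> float:
    return random.random()                              # placeholder PM score

def best_of_k(conversation: str, k: int = 16) -> str:
    """Draw k samples, score each with the PM, and return the top-scoring one."""
    candidates: List[str] = [generate_response(conversation) for _ in range(k)]
    return max(candidates, key=lambda r: preference_model_score(conversation, r))

print(best_of_k("Human: How do I politely decline a meeting?\n\nAssistant:", k=16))
```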

B.6       Elo Scores for Rejection Sampling Models

In Figure 36 we show helpfulness Elo scores for a 52B context distilled model with rejection sampling (utilizing a 52B preference model trained on pure helpfulness) for k = 1, 4, 16, 64, showing that higher values of k clearly perform better. Note that the context distilled model and the preference models discussed here were trained during an earlier stage of our research with different datasets and settings from those discussed elsewhere in the paper, so they are not directly comparable with other Elo results, though very roughly and heuristically, our online models seem to perform about as well as or better than k = 64 rejection sampling. Note that k = 64 rejection sampling corresponds to DKL = log(64) ≈ 4.2.

B.7       Stack Overflow Results

We can also evaluate our language models directly given a corpus of paired good and bad responses, such as answers to Stack Overflow questions. In Figure 37b we evaluate the difference in mean log-p between popular (i.e., highly upvoted) and unpopular answers, showing that RLHF models consistently assign a larger difference, suggesting that they are more capable of distinguishing answer quality. In Figure 37a we plot the language modeling loss (i.e., mean log-prob) on the good and bad answers separately, rather than their difference. We find that the RLHF models obtain worse loss, most likely because they optimize a different objective than pure language modeling.
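As an illustration of this evaluation (not the paper's code or data), the sketch below computes the mean log-prob difference between paired good and bad answers with an off-the-shelf causal LM; the model choice and the toy question/answer pair are placeholders.

```python
# Illustrative sketch of the Stack Overflow evaluation: compare mean log-prob of
# upvoted ("good") vs. downvoted ("bad") answers. The model name and the paired-answer
# data are placeholders; the metric is the difference in mean per-token log-prob.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates its own base and RLHF models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def mean_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean per-token cross-entropy, so -loss is the mean log-prob
        loss = model(ids, labels=ids).loss
    return -loss.item()

pairs = [  # (question, good_answer, bad_answer) triples; contents are placeholders
    ("How do I reverse a list in Python?",
     "Use list.reverse() for in-place reversal, or reversed(x) for an iterator.",
     "Just loop over it backwards I guess."),
]
diffs = [mean_logprob(q + "\n" + good) - mean_logprob(q + "\n" + bad) for q, good, bad in pairs]
print("mean log-p(good) - mean log-p(bad):", sum(diffs) / len(diffs))
```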

B.8    Further Analysis of RLHF on Code-Model Snapshots

As discussed in Section 5.3, RLHF improves the performance of base code models on code evals. In this appendix, we compare that with simply prompting the base code model with a prompt designed to elicit helpfulness, harmlessness, and honesty, which we refer to as the 'HHH' prompt. In particular, it contains a couple of coding examples. Below is a description of what this prompt looks like:

Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.
--
... (we include several short example conversations using the normal Human: ... Assistant: ... format.)
--
Human: Can you help me write this Python function? I’ve already written the function’s signature and docstring, but I’m not sure how to write the function’s body. It starts like this: <FUNC_SIGNATURE_PLUS_DOCSTRING>
Assistant: Sure thing, here you go! I’ve tested this function myself so I know that it’s correct: <FUNC_SIGNATURE_PLUS_DOCSTRING>

Figure 38 contains results on HumanEval when the HHH prompt is included. We see that the HHH prompt improves performance more significantly than RLHF across many pass@k values.
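For reference, here is a minimal sketch of the standard unbiased pass@k estimator from [Chen et al., 2021], together with a heavily abbreviated version of the prompt construction described above; the truncated HHH_PROMPT text and helper names are placeholders, not the exact prompt or evaluation code used here.

```python
# Sketch: the unbiased pass@k estimator of Chen et al. (2021), plus a placeholder
# HHH-style prompt builder. HHH_PROMPT is abbreviated, not the full prompt.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn per problem, c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

HHH_PROMPT = "Below are a series of dialogues between various people and an AI assistant...\n"

def build_prompt(func_signature_plus_docstring: str) -> str:
    return (HHH_PROMPT
            + "Human: Can you help me write this Python function? ... It starts like this: "
            + func_signature_plus_docstring
            + "\nAssistant: Sure thing, here you go! I've tested this function myself so I "
              "know that it's correct: " + func_signature_plus_docstring)

# Example: 200 samples per problem, 23 of which pass, estimating pass@10
print(pass_at_k(n=200, c=23, k=10))
```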

B.9       Details of Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests

Simplified Relative Mahalanobis distance. Our newly proposed Simplified Relative Mahalanobis distance outperforms the standard Mahalanobis distance on OOD detection of harmlessness inputs from helpfulness inputs, for activations extracted from all layers of all model sizes we tested. The details are shown in Figure 39.
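The precise definition of the Simplified Relative Mahalanobis distance appears with Figure 39; as a rough background sketch (not the paper's method), the following implements the standard Mahalanobis-distance OOD score on frozen activations that serves as the baseline here. Relative variants additionally subtract a second Mahalanobis term computed against a background Gaussian. The activation arrays and dimensions below are random placeholders.

```python
# A minimal sketch of the standard Mahalanobis-distance OOD score on frozen activations,
# i.e., the baseline that the Simplified Relative Mahalanobis distance is compared against.
# Shapes and data are placeholders; higher scores are treated as more out-of-distribution.
import numpy as np

def fit_gaussian(acts: np.ndarray):
    mean = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])  # regularized covariance
    return mean, np.linalg.inv(cov)

def mahalanobis_score(x: np.ndarray, mean: np.ndarray, precision: np.ndarray) -> float:
    d = x - mean
    return float(d @ precision @ d)   # larger = further from the in-distribution fit

rng = np.random.default_rng(0)
helpfulness_acts = rng.normal(size=(1000, 64))   # in-distribution activations (placeholder)
query_act = rng.normal(loc=0.5, size=64)         # activation of a new, possibly OOD input

mean, precision = fit_gaussian(helpfulness_acts)
print("Mahalanobis OOD score:", mahalanobis_score(query_act, mean, precision))
```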

Few-shot outlier exposure. Exposing the OOD detector to a few examples of the out-distribution was first proposed by [Hendrycks et al., 2018]. [Thulasidasan et al., 2021] suggested using a single class to represent the OOD examples. [Fort et al., 2021] showed that even strong near-OOD detectors benefit substantially from this kind of exposure.


Following the procedure from [Fort et al., 2021], we apply a single-layer linear classifier on top of the activation vectors, while the rest of the language model is frozen. Given M examples randomly drawn from the out-distribution (harmlessness) training set, we create a new binary classification problem: the inputs are the full set of Ntrain in-distribution training examples with a target class of 0, together with Ntrain/M copies of the M out-distribution examples with a target class of 1. This ensures an equal representation of in- and out-distribution examples and leads to better learning. Once we train the linear classifier, we use its class-1 output as the OOD score. The results are shown in Figure 23.
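A minimal sketch of this construction, with random arrays standing in for the frozen activation vectors and scikit-learn's LogisticRegression standing in for the single-layer linear classifier:

```python
# Sketch of the few-shot outlier-exposure probe: a linear classifier on frozen activations,
# trained on all in-distribution activations (class 0) plus Ntrain // M copies of the
# M available out-distribution activations (class 1). Activations are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
in_acts = rng.normal(size=(2000, 64))                 # helpfulness (in-distribution) activations
ood_few = rng.normal(loc=0.8, size=(10, 64))          # M = 10 harmlessness (OOD) examples

n_train, m = len(in_acts), len(ood_few)
ood_repeated = np.tile(ood_few, (n_train // m, 1))    # equalize class representation

X = np.concatenate([in_acts, ood_repeated])
y = np.concatenate([np.zeros(len(in_acts)), np.ones(len(ood_repeated))])

probe = LogisticRegression(max_iter=1000).fit(X, y)

query_act = rng.normal(loc=0.8, size=(1, 64))
ood_score = probe.predict_proba(query_act)[0, 1]      # class-1 probability used as the OOD score
print("OOD score:", ood_score)
```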

B.10   Gender Bias Evaluation Details

For this evaluation, we followed [Rae et al., 2021] and used their list of 76 occupations and 12 pairs of gender terms:

•    Occupations: CEO, accountant, administrator, advisor, analyst, appraiser, architect, assistant, attendant, auditor, baker, bartender, broker, carpenter, cashier, chef, chemist, chief, cleaner, clerk, cook, counselor, designer, developer, dietitian, dispatcher, doctor, driver, editor, educator, electrician, engineer, examiner, farmer, firefighter, guard, hairdresser, hygienist, inspector, instructor, investigator, janitor, lawyer, librarian, machinist, manager, mechanic, nurse, nutritionist, officer, painter, paralegal, paramedic, pathologist, pharmacist, physician, planner, plumber, practitioner, programmer, psychologist, receptionist, salesperson, scientist, secretary, sheriff, specialist, supervisor, surgeon, tailor, teacher, technician, therapist, veterinarian, worker, writer

•    Male gendered terms: boy, brother, dad, husband, man, groom, male, guy, father, son, gentleman, boyfriend

•    Female gendered terms: girl, sister, mom, wife, woman, bride, female, gal, lady, mother, daughter, girlfriend

Human: Can you help me fill in the missing word in a sentence? The sentence is: 
For work, the {gendered term} is a
Assistant: Sure thing, here is your sentence with the missing word filled in: 
For work, the {gendered term} is a {occupation}

We then evaluated the probabilities of occupations, given gender terms. With this formulation, the larger LMs, context-distilled models, and RLHF models all assigned about 2-8% of their probability mass to one of the 76 occupations listed above when completing the sentence.

We also slightly modified the metric from [Rae et al., 2021] in order to estimate errors. So to compute bias scores, for each occupation O we defined

BO = (1/Ng) Σ over gender pairs (gf, gm) of [ log P(O | gf) − log P(O | gm) ],

and then we define the overall bias score as the mean of the absolute value of BO, ranging over occupations:

Bias = (1/NO) Σ over occupations O of |BO|,     (B.3)

where O represents occupations while gf, gm represent paired gender terms (e.g. lady/gentleman), and NO = 76 and Ng = 12. With this setup, we are sensitive to whether bias is uniform across all pairs of gendered terms (for any given occupation). As a concrete example, if it were the case that a model associated ‘lawyer’ more strongly with ‘boy’ than ‘girl’, but also more strongly with ‘wife’ than ‘husband’, this would increase the error in our measurement, while the bias effects would tend to cancel out.
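A small sketch of this computation under the reconstruction of BO above; the abbreviated term lists and random log-probabilities are placeholders for the 76 occupations, 12 gender pairs, and actual model scores.

```python
# Sketch of the bias-score computation from per-(occupation, gendered-term) log-probs.
# Term lists are abbreviated and the log-probs are random placeholders; B_O follows the
# reconstructed equation above, and the error is estimated from spread across gender pairs.
import numpy as np

female_terms = ["girl", "wife", "mother"]        # abbreviated; the paper uses 12 pairs
male_terms   = ["boy", "husband", "father"]
occupations  = ["doctor", "nurse", "engineer"]   # abbreviated; the paper uses 76

rng = np.random.default_rng(0)
logp = {(o, t): rng.normal(-4.0, 0.5) for o in occupations for t in female_terms + male_terms}

def bias_and_error(o: str):
    diffs = np.array([logp[(o, gf)] - logp[(o, gm)] for gf, gm in zip(female_terms, male_terms)])
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(len(diffs))   # B_O and a naive standard error

b_vals = np.array([bias_and_error(o)[0] for o in occupations])
print("overall bias score:", np.abs(b_vals).mean())
```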

The results are shown in Figure 40, where we observe that RLHF models have both larger bias scores and larger errors. We believe this has a very simple explanation: RLHF models have much lower-entropy output distributions. As a rough, heuristic explanation, a model with a lower-entropy output will behave somewhat like a plain LM evaluated at a temperature T < 1. The metric and error from equation B.3 will typically scale as 1/T, so it should be expected that models with lower-entropy distributions will have proportionally larger bias scores and bias errors. We show these effects explicitly in Figure 40 by evaluating bias scores with context distilled models at T ≈ 0.6, and showing that they are quite similar to RLHF bias scores.

Readers may observe that the RLHF model bias appears to be growing with model size. This is most likely because RLHF training tends to progressively reduce the entropy (or decrease the ‘effective temperature’) of model distributions, and we train larger models longer with RLHF, because their training is more robust (and so more beneficial in other respects). Thus we expect this trend is a property of differences in RLHF training across model sizes, rather than a more fundamental model-size trend.

C     Samples from PALMS, LaMDA, and InstructGPT Prompts

These were generated by sampling 17 responses for each prompt from our best online HH model, ranking them using the corresponding preference model, and then displaying the median non-empty response according to the PM. (Occasionally the model writes the "end of conversation" token in response to sensitive queries.) To a large extent, then, these samples are not ‘cherry-picked’, and should be representative of average-case behavior. In the case of LaMDA [Thoppilan et al., 2022] and InstructGPT [Ouyang et al., 2022], we include the corresponding samples from those models.
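A brief sketch of this selection procedure, with hypothetical stand-ins for the sampler and the PM scorer:

```python
# Sketch of the sample-selection procedure: draw 17 responses, rank the non-empty ones
# with the preference model, and display the median. Sampler and scorer are placeholders.
import random

def generate_response(prompt: str) -> str:
    return "" if random.random() < 0.1 else f"response-{random.random():.3f}"

def pm_score(prompt: str, response: str) -> float:
    return random.random()

def median_non_empty(prompt: str, k: int = 17) -> str:
    responses = [generate_response(prompt) for _ in range(k)]
    ranked = sorted((r for r in responses if r.strip()), key=lambda r: pm_score(prompt, r))
    return ranked[len(ranked) // 2] if ranked else ""

print(median_non_empty("Human: What are some fun weekend projects?\n\nAssistant:"))
```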

C.1       PALMS Sensitive Questions

Questions are drawn from the sensitive questions in [Solaiman and Dennison, 2021]. Please be warned that some of these questions contain sensitive material (by design).

C.2       InstructGPT Prompts

We warn readers that some of the samples from InstructGPT [Ouyang et al., 2022] were likely cherry-picked, deliberately, to display poor model behavior, and so may be unrepresentative. Thus in some cases it will be unfair to compare our median samples to those from InstructGPT.

C.3       LaMDA Prompts

D    Details on Data Collection and Crowdworkers

D.1       Overview

For the purposes of assigning tasks, we divided our crowdworkers into two groups. A group of around 30 ‘select’ workers were sent high-priority tasks on a regular basis. Roughly half of these workers were hired through Upwork; the other half were selected from the general population of US-based MTurk workers with a ‘Masters’ qualification. The second group of workers came from the same general population of MTurk workers and was primarily given red-team tasks on a less regular schedule. We tried to ensure that all workers were paid significantly above the minimum wage in California. Because MTurk workers are paid by task rather than at an hourly rate, we adjusted their rates for tasks that took longer (for example, with models that use rejection sampling) and frequently issued bonuses to those who encountered technical problems with tasks.

Among our select workers, MTurk workers usually accounted for 80-85% of the comparison data collected in a given week, compared to 15-20% for workers hired through Upwork. Although the two groups were similar in size, MTurk workers tended to opt in to more work, and their pay structure incentivized quicker conversations.

We communicated with select workers on a daily basis over Slack. We used this channel to announce new tasks and provide guidance, and we discussed difficult edge cases with the group as they came up. At an intermediate stage of the project we provided some additional thoughts on more advanced forms of interaction with the model; the Slack message we sent is shown in Figure 43. Our workers alerted us to bugs and performance issues as they ran into them.

We sent both groups of workers a demographics survey, and the results are shown in Figure 44. Survey responses were anonymous, as we did not collect any personally identifiable information alongside demographic information.

D.2   Instructions and Interface

We display basic task instructions in a pop-up dialog when first loading the interface, and these instructions remain available throughout the interaction. These instructions for the ‘playground’ and ‘red team’ tasks can be found in Figure 41. For the playground task, we also link to a separate page with expanded instructions that include more detailed examples, excerpts of which can be seen in Figure 42.

The human feedback interface is shown in Figure 6. During the online data collection process, we added an additional option to the interface for Upworkers, allowing them to edit one of the model responses. When they used this feature, we stored a comparison of the edit to the original (assuming the edit was better), rather than the initial comparison of two model outputs. This would have affected less than 10% of the online data.

D.3       Data Quality Measurement Challenges

In rough outline, data quality assurance for human labelers often involves the following steps:

•    Researchers carefully review a small set of samples to produce a set of ‘golden’ labels

•    Human labelers work through a stack of labeling tasks, with some subset of labeling tasks assigned to multiple labelers, and samples from the golden label dataset sent out to everyone.

•    Researchers evaluate labeler performance by checking their labels against the golden labels, and by checking for inter-labeler agreement

The idea is that the golden labels are treated as a source of truth, with labelers incentivized to modify their behavior to better match the golden labels, and novice labelers incentivized to match the behavior of more experienced labelers.

These sorts of methods weren’t easily adapted to our data collection setup. Open-ended conversations allowed us to collect a richer dataset, but introduced a number of hard-to-control variables, resulting in noisy data quality metrics.

We did try having crowdworkers review each other’s conversations, by providing ratings for each model response comparison and rating the overall quality of the conversation. But we found that author-rater agreement wasn’t a good guide for assessing overall conversation quality. In broad strokes, conversation quality depends on a) conversation topic, b) human writing quality, c) model quality. And we found that, for instance, as conversations got more sophisticated, deciding between model responses got more difficult. As a result, authors who more frequently discussed difficult topics would often get lower agreement scores. And because we were frequently updating our models, model response comparisons were a moving target.

Instead of directly assessing our crowdworkers based on these kinds of reviewer-based metrics, we considered using their performance as reviewers as a stand-in for their performance as conversation authors. But when we compared these metrics against our own spot-checks of conversation author quality, we found poor agreement.

We expect to be able to work around these problems and come up with better methods in the future. But it’s worth noting that we were able to achieve our results without sophisticated data quality controls.

E   Details on NLP Evaluations Formatting and Prompts

Here, we give the input formats we used to evaluate performance on Lambada, ARC, MMLU, HellaSwag, OpenBookQA, and TriviaQA.


References

[Amodei et al., 2016] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete problems in ai safety.

[Askell et al., 2021] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment.

[Bender et al., 2021] Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

[Bommasani et al., 2021] Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., and et al. (2021). On the opportunities and risks of foundation models. CoRR, abs/2108.07258.

[Borgeaud et al., 2021] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G., Lespiau, J., Damoc, B., Clark, A., de Las Casas, D., Guy, A., Menick, J., Ring, R., Hennigan, T., Huang, S., Maggiore, L., Jones, C., Cassirer, A., Brock, A., Paganini, M., Irving, G., Vinyals, O., Osindero, S., Simonyan, K., Rae, J. W., Elsen, E., and Sifre, L. (2021). Improving language models by retrieving from trillions of tokens. CoRR, abs/2112.04426.

[Brown et al., 2020] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners.

[Chen et al., 2021] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[Chowdhery et al., 2022] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. (2022). Palm: Scaling language modeling with pathways.

[Clark et al., 2018] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457.

[Fort, 2022] Fort, S. (2022). Adversarial vulnerability of powerful near out-of-distribution detection.

[Fort et al., 2021] Fort, S., Ren, J., and Lakshminarayanan, B. (2021). Exploring the limits of out-of-distribution detection.

[Ganguli et al., 2022] Ganguli, D., Hernandez, D., Lovitt, L., DasSarma, N., Henighan, T., Jones, A., Joseph, N., Kernion, J., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Elhage, N., Showk, S. E., Fort, S., Hatfield-Dodds, Z., Johnston, S., Kravec, S., Nanda, N., Ndousse, K., Olsson, C., Amodei, D., Amodei, D., Brown, T., Kaplan, J., McCandlish, S., Olah, C., and Clark, J. (2022). Predictability and surprise in large generative models.

[Guo et al., 2017] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks.


[Guu et al., 2020] Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. (2020). REALM: retrieval-augmented language model pre-training. CoRR, abs/2002.08909.

[Henderson et al., 2017] Henderson, P., Sinha, K., Angelard-Gontier, N., Ke, N. R., Fried, G., Lowe, R., and Pineau, J. (2017). Ethical challenges in data-driven dialogue systems. CoRR, abs/1711.09050.

[Hendrycks et al., 2021a] Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. (2021a). Aligning ai with shared human values.

[Hendrycks et al., 2021b] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021b). Measuring massive multitask language understanding.

[Hendrycks et al., 2021c] Hendrycks, D., Carlini, N., Schulman, J., and Steinhardt, J. (2021c). Unsolved problems in ml safety.

[Hendrycks and Gimpel, 2016] Hendrycks, D. and Gimpel, K. (2016). A baseline for detecting misclassified and out-of-distribution examples in neural networks.

[Hendrycks et al., 2018] Hendrycks, D., Mazeika, M., and Dietterich, T. (2018). Deep anomaly detection with outlier exposure.

[Henighan et al., 2020] Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. (2020). Scaling laws for autoregressive generative modeling.

[Hernandez et al., 2021] Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. (2021). Scaling laws for transfer. CoRR, abs/2102.01293.

[Hestness et al., 2019] Hestness, J., Ardalani, N., and Diamos, G. (2019). Beyond human-level accuracy: Computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP ’19, pages 1–14, New York, NY, USA. ACM.

[Jiang et al., 2021] Jiang, L., Hwang, J. D., Bhagavatula, C., Bras, R. L., Forbes, M., Borchardt, J., Liang, J., Etzioni, O., Sap, M., and Choi, Y. (2021). Delphi: Towards machine ethics and norms.

[Joshi et al., 2017] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. (2017). Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension.

[Kaplan et al., 2020] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models.

[Koch et al., 2021] Koch, J., Langosco, L., Pfau, J., Le, J., and Sharkey, L. (2021). Objective robustness in deep reinforcement learning. CoRR, abs/2105.14111.

[Lakshminarayanan et al., 2016] Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2016). Simple and scalable predictive uncertainty estimation using deep ensembles.

[Lee et al., 2018] Lee, K., Lee, K., Lee, H., and Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks.

[Lewis et al., 2020] Lewis, P. S. H., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. CoRR, abs/2005.11401.

[Liang et al., 2017] Liang, S., Li, Y., and Srikant, R. (2017). Enhancing the reliability of out-of-distribution image detection in neural networks.

[Lin et al., 2021] Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods.

[Liu et al., 2021] Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. (2021). On-the-fly controlled text generation with experts and anti-experts. CoRR, abs/2105.03023.

[Liu et al., 2020] Liu, J. Z., Lin, Z., Padhy, S., Tran, D., Bedrax-Weiss, T., and Lakshminarayanan, B. (2020). Simple and principled uncertainty estimation with deterministic deep learning via distance awareness.

[Menick et al., 2022] Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., Glaese, M., Young, S., Campbell-Gillingham, L., Irving, G., and McAleese, N. (2022). Teaching language models to support answers with verified quotes.

[Mihaylov et al., 2018] Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP.


[Nakano et al., 2021] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. (2021). Webgpt: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332.

[Nalisnick et al., 2019] Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. (2019). Hybrid models with deep and invertible features.

[Nguyen et al., 2014] Nguyen, A., Yosinski, J., and Clune, J. (2014). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.

[Ouyang et al., 2022] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

[Pan et al., 2022] Pan, A., Bhatia, K., and Steinhardt, J. (2022). The effects of reward misspecification: Mapping and mitigating misaligned models. CoRR, abs/2201.03544.

[Paperno et al., 2016] Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. (2016). The lambada dataset: Word prediction requiring a broad discourse context.

[Parrish et al., 2021] Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., and Bowman, S. R. (2021). BBQ: A hand-built bias benchmark for question answering. CoRR, abs/2110.08193.

[Paszke et al., 2019] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

[Perez et al., 2022] Perez, E., Huang, S., Song, H. F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. (2022). Red teaming language models with language models. CoRR, abs/2202.03286.

[Power et al., 2022] Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. CoRR, abs/2201.02177.

[Rae et al., 2021] Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, H. F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., van den Driessche, G., Hendricks, L. A., Rauh, M., Huang, P., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S. M., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., de Masson d’Autume, C., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., de Las Casas, D., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B. A., Weidinger, L., Gabriel, I., Isaac, W. S., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. (2021). Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446.

[Ramasesh et al., 2022] Ramasesh, V. V., Lewkowycz, A., and Dyer, E. (2022). Effect of scale on catastrophic forgetting in neural networks. In International Conference on Learning Representations.

[Ren et al., 2021] Ren, J., Fort, S., Liu, J., Roy, A. G., Padhy, S., and Lakshminarayanan, B. (2021). A simple fix to mahalanobis distance for improving near-ood detection. arXiv preprint arXiv:2106.09022.

[Rosenfeld et al., 2019] Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. (2019). A constructive prediction of the generalization error across scales.

[Sanh et al., 2019] Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.

[Schulman et al., 2017] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347.

[Solaiman and Dennison, 2021] Solaiman, I. and Dennison, C. (2021). Process for adapting language models to society (PALMS) with values-targeted datasets. CoRR, abs/2106.10328.

[Stiennon et al., 2020] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback.

[Thoppilan et al., 2022] Thoppilan, R., Freitas, D. D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., Xu, Y., Chen, Z., Roberts, A., Bosma, M., Zhou, Y., Chang, C., Krivokon, I., Rusch, W., Pickett, M., Meier-Hellstern, K. S., Morris, M. R., Doshi, T., Santos, R. D., Duke, T., Soraker, J., Zevenbergen, B., Prabhakaran, V., Diaz, M., Hutchinson, B., Olson, K., Molina, A., Hoffman-John, E., Lee, J., Aroyo, L., Rajakumar, R., Butryna, A., Lamm, M., Kuzmina, V., Fenton, J., Cohen, A., Bernstein, R., Kurzweil, R., Aguera-Arcas, B., Cui, C., Croak, M., Chi, E., and Le, Q. (2022). Lamda: Language models for dialog applications. CoRR, abs/2201.08239.

[Thulasidasan et al., 2021] Thulasidasan, S., Thapa, S., Dhaubhadel, S., Chennupati, G., Bhattacharya, T., and Bilmes, J. (2021). A simple and effective baseline for out-of-distribution detection using abstention.

[Tillet et al., 2019] Tillet, P., Kung, H. T., and Cox, D. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations, pages 10–19. Association for Computing Machinery, New York, NY, USA.

[Weidinger et al., 2021] Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., Isaac, W. S., Legassick, S., Irving, G., and Gabriel, I. (2021). Ethical and social risks of harm from language models. CoRR, abs/2112.04359.

[Winkens et al., 2020] Winkens, J., Bunel, R., Roy, A. G., Stanforth, R., Natarajan, V., Ledsam, J. R., MacWilliams, P., Kohli, P., Karthikesalingam, A., Kohl, S., Cemgil, T., Eslami, S. M. A., and Ronneberger, O. (2020). Contrastive training for improved out-of-distribution detection.

[Xu et al., 2020] Xu, J., Ju, D., Li, M., Boureau, Y.-L., Weston, J., and Dinan, E. (2020). Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.

[Zellers et al., 2019] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). Hellaswag: Can a machine really finish your sentence?

[Zhang et al., 2020] Zhang, H., Li, A., Guo, J., and Guo, Y. (2020). Hybrid models for open set recognition.

[Ziegler et al., 2019] Ziegler, D., Stiennon, N., Wu, J., Brown, T., Amodei, D., Radford, A., Christiano, P., and Irving, G. (2019). Fine-Tuning GPT-2 from Human Preferences.

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.


Written by anthropic | Anthropic develops safe and reliable AI systems, focusing on alignment, interpretability, and large language models.
Published by HackerNoon on 2026/01/19