
Objective Mismatch in Reinforcement Learning from Human Feedback: Acknowledgments, and References


Too Long; Didn't Read

Discover the challenges of objective mismatch in RLHF for large language models, where reward model scores can diverge from downstream performance. This paper explores the origins, manifestations, and potential solutions to this issue, connecting insights from the NLP and RL literature. Gain insight into fostering better RLHF practices for more effective, user-aligned language models.

Authors:

(1) Nathan Lambert, Allen Institute for AI;

(2) Roberto Calandra, TU Dresden.

Abstract & Introduction

Related Work

Background

Understanding Objective Mismatch

Discussions

Conclusion

Acknowledgments, and References

Acknowledgments

This work was partly supported by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden, and by Bundesministerium für Bildung und Forschung (BMBF) and German Academic Exchange Service (DAAD) in project 57616814 (SECAI, School of Embedded and Composite AI).

References

Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., . . . others (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.


Baheti, A., Lu, X., Brahman, F., Bras, R. L., Sap, M., & Riedl, M. (2023). Improving language models with advantage-based offline policy gradients. arXiv preprint arXiv:2305.14718.


Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., . . . others (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.


Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., . . . others (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.


Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.


Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31.


Coste, T., Anwar, U., Kirk, R., & Krueger, D. (2023). Reward model ensembles help mitigate overoptimization.


Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., . . . Sun, M. (2023). UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.


Deng, H., & Raffel, C. (2023). Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv preprint arXiv:2310.09520.


Ethayarajh, K., Choi, Y., & Swayamdipta, S. (2022, 17–23 Jul). Understanding dataset difficulty with V-usable information. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th international conference on machine learning (Vol. 162, pp. 5988–6008). PMLR.


Feng, X., Wan, Z., Wen, M., Wen, Y., Zhang, W., & Wang, J. (2023). AlphaZero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.


Fernandes, P., Madaan, A., Liu, E., Farinhas, A., Martins, P. H., Bertsch, A., . . . others (2023). Bridging the gap: A survey on integrating (human) feedback for natural language generation. arXiv preprint arXiv:2305.00955.


Gao, L., Schulman, J., & Hilton, J. (2022). Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760.


Gilbert, T. K., Dean, S., Zick, T., & Lambert, N. (2022). Choices, risks, and reward reports: Charting public policy for reinforcement learning systems. arXiv preprint arXiv:2202.05716.


Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., . . . others (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.


Janner, M., Li, Q., & Levine, S. (2021). Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34, 1273–1286.


Kiela, D., Thrush, T., Ethayarajh, K., & Singh, A. (2023). Plotting progress in AI. Contextual AI Blog. (https://contextual.ai/blog/plotting-progress)


Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., & Raileanu, R. (2023). Understanding the effects of RLHF on LLM generalisation and diversity.


Knox, W. B., Hatgis-Kessell, S., Adalgeirsson, S. O., Booth, S., Dragan, A., Stone, P., & Niekum, S. (2023). Learning optimal advantage from preferences and mistaking it for reward.


Knox, W. B., & Stone, P. (2008). TAMER: Training an agent manually via evaluative reinforcement. In 2008 7th IEEE international conference on development and learning (pp. 292–297).


Lambert, N., Amos, B., Yadan, O., & Calandra, R. (2020). Objective mismatch in model-based reinforcement learning. In Learning for dynamics and control (pp. 761–770).


Lambert, N., Gilbert, T. K., & Zick, T. (2023). Entangled preferences: The history and risks of reinforcement learning and human feedback.


Lambert, N., Pister, K., & Calandra, R. (2022). Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637.


Lambert, N., Tunstall, L., Rajani, N., & Thrush, T. (2023). HuggingFace H4 Stack Exchange preference dataset. Retrieved from https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences


Lambert, N., Wilcox, A., Zhang, H., Pister, K. S., & Calandra, R. (2021). Learning accurate long-term dynamics for model-based reinforcement learning. In 2021 60th IEEE conference on decision and control (CDC) (pp. 2880–2887).


Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.


Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., . . . Hashimoto, T. B. (2023). AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval. GitHub.


Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., . . . others (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.


This paper is available on arXiv under a CC 4.0 license.