
Objective Mismatch in Reinforcement Learning from Human Feedback: Acknowledgments and References


Too Long; Didn't Read

Discover the challenges of objective mismatch in RLHF for large language models, where reward model scores fail to align with downstream performance. This paper explores the origins, manifestations, and potential solutions to this issue, connecting insights from the NLP and RL literature, and points toward better RLHF practices for more effective, user-aligned language models.

Authors:

(1) Nathan Lambert, Allen Institute for AI;

(2) Roberto Calandra, TU Dresden.

Table of Links

Abstract & Introduction

Related Work

Background

Understanding Objective Mismatch

Discussions

Conclusion

Acknowledgments and References

Acknowledgments

This work was partly supported by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden, and by Bundesministerium für Bildung und Forschung (BMBF) and German Academic Exchange Service (DAAD) in project 57616814 (SECAI, School of Embedded and Composite AI).

References

Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., . . . others (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.


Baheti, A., Lu, X., Brahman, F., Bras, R. L., Sap, M., & Riedl, M. (2023). Improving language models with advantage-based offline policy gradients. arXiv preprint arXiv:2305.14718.


Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., . . . others (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.


Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., . . . others (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.


Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.


Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31.


Coste, T., Anwar, U., Kirk, R., & Krueger, D. (2023). Reward model ensembles help mitigate overoptimization.


Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., . . . Sun, M. (2023). Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.


Deng, H., & Raffel, C. (2023). Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv preprint arXiv:2310.09520.


Ethayarajh, K., Choi, Y., & Swayamdipta, S. (2022, 17–23 Jul). Understanding dataset difficulty with V-usable information. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th international conference on machine learning (Vol. 162, pp. 5988–6008). PMLR.


Feng, X., Wan, Z., Wen, M., Wen, Y., Zhang, W., & Wang, J. (2023). Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.


Fernandes, P., Madaan, A., Liu, E., Farinhas, A., Martins, P. H., Bertsch, A., . . . others (2023). Bridging the gap: A survey on integrating (human) feedback for natural language generation. arXiv preprint arXiv:2305.00955.


Gao, L., Schulman, J., & Hilton, J. (2022). Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760.


Gilbert, T. K., Dean, S., Zick, T., & Lambert, N. (2022). Choices, risks, and reward reports: Charting public policy for reinforcement learning systems. arXiv preprint arXiv:2202.05716.


Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., . . . others (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.


Janner, M., Li, Q., & Levine, S. (2021). Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34, 1273–1286.


Kiela, D., Thrush, T., Ethayarajh, K., & Singh, A. (2023). Plotting progress in AI. Contextual AI Blog. (https://contextual.ai/blog/plotting-progress)


Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., & Raileanu, R. (2023). Understanding the effects of RLHF on LLM generalisation and diversity.


Knox, W. B., Hatgis-Kessell, S., Adalgeirsson, S. O., Booth, S., Dragan, A., Stone, P., & Niekum, S. (2023). Learning optimal advantage from preferences and mistaking it for reward.


Knox, W. B., & Stone, P. (2008). TAMER: Training an agent manually via evaluative reinforcement. In 2008 7th IEEE international conference on development and learning (pp. 292–297).


Lambert, N., Amos, B., Yadan, O., & Calandra, R. (2020). Objective mismatch in model-based reinforcement learning. In Learning for dynamics and control (pp. 761–770).


Lambert, N., Gilbert, T. K., & Zick, T. (2023). Entangled preferences: The history and risks of reinforcement learning and human feedback.


Lambert, N., Pister, K., & Calandra, R. (2022). Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637.


Lambert, N., Tunstall, L., Rajani, N., & Thrush, T. (2023). Huggingface h4 stack exchange preference dataset. Retrieved from https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences


Lambert, N., Wilcox, A., Zhang, H., Pister, K. S., & Calandra, R. (2021). Learning accurate long-term dynamics for model-based reinforcement learning. In 2021 60th IEEE conference on decision and control (CDC) (pp. 2880–2887).


Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.


Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., . . . Hashimoto, T. B. (2023). Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval. GitHub.


Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., . . . others (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.


This paper is available on arXiv under a CC 4.0 license.