Authors:
(1) Nathan Lambert, Allen Institute for AI;
(2) Roberto Calandra, TU Dresden.
This paper presents the multiple ways in which objective mismatch limits the accessibility and reliability of RLHF methods. The current disconnect between designing a reward model, optimizing it, and the downstream goals of the model creates a method that is challenging to implement and improve on. As future work mitigates mismatch and the proxy objectives present in RLHF, LLMs and other popular machine learning methods will become easier to align with human values and goals, solving many common challenges users encounter with state-of-the-art LLMs.
This paper is available on arXiv under a CC 4.0 license.