Understanding Objective Mismatch
Too Long; Didn't Read
Objective mismatch in RLHF is driven by three main causes. It emerges from the interplay between reward model training, policy model training, and evaluation tools, which makes it hard to align downstream evaluation results with reward model scores. Ongoing research aims to mitigate the mismatch, from assessing reward model consistency to developing new training methods and datasets.
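To make the core idea concrete, here is a minimal, purely illustrative sketch (not from the original text) of how a mismatch can appear: a hypothetical reward model assigns ever-higher scores along some direction of optimization, while a hypothetical downstream evaluation peaks and then degrades, so maximizing the proxy reward overshoots what we actually want. The functions `reward_model` and `true_quality` are invented toy stand-ins, not real RLHF components.

```python
def reward_model(x):
    # Hypothetical learned proxy reward: monotonically increasing,
    # so it keeps rewarding larger x even past the useful range.
    return x

def true_quality(x):
    # Hypothetical downstream evaluation: improves at first,
    # peaks at x = 1.0, then degrades (Goodhart-style divergence).
    return x - 0.5 * x ** 2

# "Policy optimization" as greedy hill-climbing on the proxy reward:
# the proxy score rises at every step, while true quality falls off.
for step in range(9):
    x = step * 0.5  # 0.0, 0.5, ..., 4.0
    print(f"x={x:.1f}  proxy_reward={reward_model(x):+.2f}  "
          f"downstream_eval={true_quality(x):+.2f}")
```

Running the loop shows the proxy reward climbing steadily while the downstream score turns negative after the peak, which is exactly the gap between reward model scores and downstream evaluation that the research above tries to close.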