Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog

TL;DR: In RLHF, there’s tension between the reward learning…