no code implementations • 14 May 2024 • Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, Will Dabney
However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF.