no code implementations • 16 Apr 2024 • Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO).
1 code implementation • 3 Feb 2023 • Chao Yu, Jiaxuan Gao, Weilin Liu, Botian Xu, Hao Tang, Jiaqi Yang, Yu Wang, Yi Wu
A crucial limitation of this framework is that every policy in the pool is optimized w.r.t.
no code implementations • 12 Dec 2021 • Weilin Liu, Ye Mu, Chao Yu, Xuefei Ning, Zhong Cao, Yi Wu, Shuang Liang, Huazhong Yang, Yu Wang
These scenarios indeed correspond to vulnerabilities of the driving policies under test and are therefore meaningful for improving those policies further.