Search Results for author: Weilin Liu

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO).

A crucial limitation of this framework is that every policy in the pool is optimized w. r. t.

These scenarios indeed correspond to vulnerabilities of the driving policies under test, and are thus meaningful for improving those policies.
