An update on DPO vs PPO for LLM alignment

Name: An update on DPO vs PPO for LLM alignment
Uploaded: Jul 22, 2024
Duration: 803 s

Nathan Lambert (7.67K subscribers)

4.1K views


A casual chat on our experiments trying to figure out which one is best.

Paper referenced: https://arxiv.org/abs/2406.09279

Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains.

Slides: https://docs.google.com/presentation/d/1Fuulkpb9hOsMDa3ZiiBZLgLSkhyLPm4VX_w8XxSfQaM/edit?usp=sharing

Synthetic data piece: https://www.interconnects.ai/p/frontiers-in-synthetic-data

Slides taken from recent Stanford Lecture: https://docs.google.com/presentation/d/1on5xTePaUYg47vui3dUr0Lp6GUXmOXmhceNJLXRbGsE/edit?usp=sharing
