Language model reward hacking during a training experiment | AI

Name: Language model reward hacking during a training experiment | AI
Uploaded: Jun 26, 2025
Duration: 1385 s

BuzzRobot (95K subscribers)

169 views


How do you know that a language model is actually training on the right data and not just gaming the system?

To catch these talks live and ask your own questions, join the BuzzRobot community: https://join.slack.com/t/buzzrobot/shared_invite/zt-37g5q0ao5-eMK_iDf0n4LAsh1d2qJYnQ

Daniil Tiapkin, a PhD student in reinforcement learning, talked to @BuzzRobot about an experiment with knowledge distillation, in which a language model is trained to imitate a larger teacher LM, and about how to prevent that language model from "teacher hacking". Teacher hacking is similar to reward hacking, in which the LM over-optimizes the reward model.

You can find the full study here: https://arxiv.org/abs/2502.02671

Timestamps:
0:00 Intro
0:23 Reward hacking
4:30 The experiment origins
6:13 What was the experiment
11:08 Teacher hacking
18:55 Mitigating teacher reward hacking
22:09 Conclusions

Join BuzzRobot:
Newsletter: https://buzzrobot.substack.com/
X: https://x.com/sopharicks
Slack: https://join.slack.com/t/buzzrobot/shared_invite/zt-37g5q0ao5-eMK_iDf0n4LAsh1d2qJYnQ

#llms #languagemodel #aitraining #aisafety #techtalk #aiagents
